The Soren platform currently provides enterprise-level content embedding capabilities for vector databases. This project aims to significantly enhance these features by introducing the following improvements:

- User-Supplied LLM API Tokens: Allow users to bring their own access keys for services like OpenAI or Gemini.
- Prompt-Based Preprocessing: Add a configurable step to enrich content before embedding using LLMs.
- Task-Specific Model Selection: Enable users to choose the most suitable embedding model depending on content type (e.g., code vs. general text).
- Multi-Vector Metadata Storage: Integrate deeply with Qdrant to store multiple types of vectors (raw content, summaries, and structured metadata) to support hybrid search strategies.

Together, these features will make the embedding pipeline smarter, more customizable, and better suited for complex enterprise use cases such as document intelligence, code indexing, and semantic search.
Task description
Design and implement a modular embedding pipeline with the following capabilities:

- LLM Token Injection: Build support for securely accepting user-supplied API tokens (e.g., OpenAI, Gemini) via UI or API.
- Prompt-Based Preprocessing: Introduce an optional step where users can define a prompt to:
  - Summarize long documents
  - Extract metadata or tags
  - Convert formats (e.g., tabular text → JSON)
- Model Selection by Task Type: Support selecting between models such as:
  - Gemini Code Model / OpenAI Code Embeddings for source code
  - OpenAI Text Embeddings / domain-specific models for text
- Multi-Vector Embedding & Storage: Generate separate embeddings for:
  - Primary content
  - Preprocessed summaries
  - Extracted metadata
  Store these in Qdrant under vector labels (e.g., primary, summary, tags) to support hybrid retrieval.
- Retrieval Workflow: Implement a matching retrieval process:
  - Accept a user query and a retrieval mode (semantic or hybrid)
  - Embed the query using the same model type that was used at insert time
  - Perform a multi-vector search and combine the results using relevance scoring
- Upsert Logic: Ensure the upsert flow checks for an existing document ID in Qdrant and updates or inserts vectors accordingly.
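
To make the storage, upsert, and retrieval requirements more concrete, here is a minimal sketch of how they could fit together. It is not the Soren implementation and does not use the Qdrant client: `InMemoryStore` is a hypothetical in-memory stand-in for a collection with named vectors, `toy_embed` is a placeholder for a real embedding model, and the weights in `RETRIEVAL_WEIGHTS` are assumed values chosen only for illustration.

```python
import math
from collections import Counter

# Assumed weights per retrieval mode; a real pipeline would tune or expose these.
RETRIEVAL_WEIGHTS = {
    "semantic": {"primary": 1.0},
    "hybrid": {"primary": 0.5, "summary": 0.3, "tags": 0.2},
}


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a sparse bag-of-words vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either one is empty."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryStore:
    """Stand-in for a vector collection with named vectors (not the Qdrant API)."""

    def __init__(self):
        self.points = {}  # document ID -> {vector label: vector}

    def upsert(self, doc_id, content, summary, tags):
        """Insert a new point, or replace the vectors of an existing document ID."""
        status = "updated" if doc_id in self.points else "inserted"
        self.points[doc_id] = {
            "primary": toy_embed(content),
            "summary": toy_embed(summary),
            "tags": toy_embed(" ".join(tags)),
        }
        return status

    def search(self, query, mode="semantic", limit=3):
        """Score every point on the vectors used by `mode` and return the best hits."""
        if mode not in RETRIEVAL_WEIGHTS:
            raise ValueError(f"unknown retrieval mode: {mode}")
        weights = RETRIEVAL_WEIGHTS[mode]
        # The query is embedded with the same function that was used at insert time.
        query_vec = toy_embed(query)
        scored = [
            (doc_id, sum(w * cosine(query_vec, vectors[name]) for name, w in weights.items()))
            for doc_id, vectors in self.points.items()
        ]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:limit]


store = InMemoryStore()
print(store.upsert("doc-1", "def load config from json file", "loads a json config", ["python", "config"]))
print(store.upsert("doc-2", "quarterly revenue report for the sales team", "revenue summary", ["finance", "report"]))
# Same ID again, so the existing point is replaced rather than duplicated.
print(store.upsert("doc-1", "def load settings from yaml file", "loads a yaml config", ["python", "config"]))
print(store.search("revenue report", mode="hybrid"))
```

In a real deployment the three labelled vectors would presumably map to named vectors on a single Qdrant point, so that one document ID carries all of them and an upsert replaces them together. The weighted sum shown here is only one possible way to combine per-vector scores; rank-based fusion would be another option.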