Advanced configuration
Last updated
Last updated
After uploading content to the knowledge base, it needs to undergo chunking and data cleaning. This stage can be understood as content preprocessing and structuring.
Two strategies are supported:
Automatic mode
Custom mode
Automatic
The Automated mode is designed for users unfamiliar with segmentation and preprocessing techniques. In this mode, Vord automatically segments and sanitizes content files, streamlining the document preparation process.
You need to choose the indexing method for the text to specify the data matching method. The indexing strategy is often related to the retrieval method, and you need to choose the appropriate retrieval settings according to the scenario.
High-Quality Mode
Economical Mode
Q&A Mode
In High-Quality mode, the system first leverages an configurable Embedding model (which can be switched) to convert chunk text into numerical vectors. This process facilitates efficient compression and persistent storage of large-scale textual data, while simultaneously enhancing the accuracy of LLM-user interactions.
The High-Quality indexing method offers three retrieval settings: vector retrieval, full-text retrieval, and hybrid retrieval. For more details on retrieval settings, please check "Retrieval Settings".
In high-quality indexing mode, Vord offers three retrieval settings:
Vector Search
Full-Text Search
Hybrid Search
Vector Search
Definition: The system vectorizes the user's input query to generate a query vector. It then computes the distance between this query vector and the text vectors in the knowledge base to identify the most semantically proximate text chunks.
Vector Search Settings:
Rerank Model: After configuring the API key for the Rerank model on the "Model Provider" page, you can enable the “Rerank Model” in the retrieval settings. The system will then perform semantic reordering of the retrieved document results after hybrid retrieval, optimizing the ranking results. Once the Rerank model is established, the TopK and Score Threshold settings will only take effect during the reranking step.
TopK: This parameter filters the text chucks that are most similar to the user's question. The system dynamically adjusts the number of snippets based on the context window size of the selected model. The default value is 3, meaning a higher value results in more text segments being retrieved.
Score Threshold: This parameter sets the similarity threshold for filtering text chucks. Only text chucks that exceed the specified score will be recalled. By default, this setting is off, meaning there will be no filtering of similarity values for recalled text chucks. When enabled, the default value is 0.5. A higher value is likely to yield fewer recalled texts.
The TopK and Score configurations are only effective during the Rerank phase. Therefore, to apply either of these settings, it is necessary to add and enable a Rerank model.
In the Economical indexing mode, Vord offers a single retrieval setting:
Inverted Index:
TopK:
This parameter filters the text chucks that are most similar to the user's question. The system dynamically adjusts the number of snippets based on the context window size of the selected model. The default value is 3, meaning a higher value results in more text segments being retrieved.
Optional ETL Configuration
Unstructured can efficiently extract and transform your data into clean data for subsequent steps.
ETL solution choices in different versions of Vord:
The SaaS version defaults to using Unstructured ETL and cannot be changed;
The community version defaults to using Vord ETL but can enable Unstructured ETL through environment variables;
Differences in supported file formats for parsing:
txt, markdown, md, pdf, html, htm, xlsx, xls, docx, csv
txt, markdown, md, pdf, html, htm, xlsx, xls, docx, csv, eml, msg, pptx, ppt, xml, epub
Embedding Model
Embedding transforms discrete variables (words, sentences, documents) into continuous vector representations, mapping high-dimensional data to lower-dimensional spaces. This technique preserves crucial semantic information while reducing dimensionality, enhancing content retrieval efficiency.
Embedding models, specialized large language models, excel at converting text into dense numerical vectors, effectively capturing semantic nuances for improved data processing and analysis.
An inverted index is an index structure designed for rapid keyword retrieval in documents. Its fundamental principle involves mapping keywords from documents to lists of documents containing those keywords, thereby enhancing search efficiency. For a detailed explanation of the underlying mechanism, please refer to the .
In production-level applications of RAG, to achieve better data recall, multi-source data needs to be preprocessed and cleaned, i.e., ETL (extract, transform, load). To enhance the preprocessing capabilities of unstructured/semi-structured data, Vord supports optional ETL solutions: Vord ETL and .
Different ETL solutions may have differences in file extraction effects. For more information on Unstructured ETL’s data processing methods, please refer to the .