In practical applications, RAG systems often need to retrieve efficiently from corpora containing millions or even tens of millions of documents. To support this, UltraRAG can encode a raw corpus into embedding vectors and build efficient indexes for large-scale semantic retrieval. This section introduces how to write the corresponding pipeline with UltraRAG and walk through embedding and indexing a corpus.
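The input corpus is a JSONL file with one document per line. As a rough sketch of what a line might look like (the exact field names depend on your corpus; `id` and `contents` here are illustrative assumptions, not a schema confirmed by UltraRAG):

```json
{"id": "0", "contents": "Paris is the capital and most populous city of France."}
```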
Step 2: Build the Pipeline and Configure Parameters
Run the following command to build the pipeline:
ultrarag build examples/corpus_index.yaml
A parameter configuration file will be generated, which can be modified according to specific needs. For example:
examples/parameter/corpus_index_parameter.yaml
retriever:
  corpus_path: data/sample_hotpotqa_corpus_5.jsonl  # Input corpus path (JSONL format)
  retriever_path: openbmb/MiniCPM-Embedding-Light   # Retrieval model name or path
  embedding_path: embedding/embedding.npy           # Path to save embedding vectors
  index_path: index/index.index                     # Path to save index file
  faiss_use_gpu: true                               # Whether to enable GPU acceleration
  index_chunk_size: 50000                           # Chunk size when building the index
  cuda_devices: 0,1                                 # GPU device IDs to use
  overwrite: false                                  # Whether to overwrite existing files
  infinity_kwargs:                                  # Embedding engine configuration
    batch_size: 1024
    bettertransformer: false
    device: cuda
    pooling_method: auto
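To build intuition for the `index_chunk_size` setting: rather than encoding and indexing the whole corpus at once, the corpus is processed in chunks so memory stays bounded. The following is a minimal, illustrative sketch of that pattern (not UltraRAG's actual implementation; `toy_embed` stands in for the real embedding model, and the plain list stands in for a FAISS index):

```python
def iter_chunks(items, chunk_size):
    """Yield successive chunks of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

def toy_embed(texts):
    """Placeholder encoder: a real pipeline would batch-call the embedding model."""
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]

def build_index(corpus, chunk_size):
    """Encode the corpus chunk by chunk, appending vectors to a flat index."""
    index = []  # stands in for e.g. a FAISS index
    for chunk in iter_chunks(corpus, chunk_size):
        index.extend(toy_embed(chunk))
    return index

corpus = [f"doc {i}" for i in range(7)]
index = build_index(corpus, chunk_size=3)  # processed as 3 chunks of sizes 3, 3, 1
```

With the configuration above, a corpus of millions of documents would be encoded 50,000 documents at a time, each chunk's vectors appended to the index before the next chunk is loaded.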
Step 3: Run the Pipeline for Encoding and Indexing
Execute the following command to run the corpus processing workflow:
ultrarag run examples/corpus_index.yaml
📌 Tip: Since the encoding and indexing process may involve large-scale corpora and take a long time, it is recommended to use screen or nohup to run the task in the background, for example:
nohup ultrarag run examples/corpus_index.yaml > log.txt 2>&1 &