In practical applications, RAG systems often need to retrieve efficiently from corpora containing millions or even tens of millions of documents. To support this, UltraRAG can encode a raw corpus into embedding vectors and build efficient indexes for large-scale semantic retrieval. This section introduces how to write the corresponding pipeline with UltraRAG and walk through embedding and indexing a corpus.
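The input corpus is a JSONL file with one document per line. As a rough sketch of what a line might look like (the exact field names depend on your corpus; `id` and `contents` here are illustrative assumptions, not a schema confirmed by UltraRAG):

```json
{"id": "0", "contents": "Paris is the capital and most populous city of France."}
```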
Step 2: Build the Pipeline and Configure Parameters
Run the following command to build the pipeline:
ultrarag build examples/corpus_index.yaml
A parameter configuration file will be generated, which can be modified according to specific needs. For example:
examples/parameter/corpus_index_parameter.yaml
retriever:
  corpus_path: data/sample_hotpotqa_corpus_5.jsonl  # Input corpus path (JSONL format)
  retriever_path: openbmb/MiniCPM-Embedding-Light   # Retrieval model name or path
  embedding_path: embedding/embedding.npy           # Path to save embedding vectors
  index_path: index/index.index                     # Path to save index file
  faiss_use_gpu: true                               # Whether to enable GPU acceleration
  index_chunk_size: 50000                           # Chunk size when building the index
  cuda_devices: 0,1                                 # GPU device IDs to use
  overwrite: false                                  # Whether to overwrite existing files
  infinity_kwargs:                                  # Embedding engine configuration
    batch_size: 1024
    bettertransformer: false
    device: cuda
    pooling_method: auto
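To build intuition for the `index_chunk_size` setting: rather than encoding and indexing the whole corpus at once, the corpus is processed in chunks so memory stays bounded. The following is a minimal, illustrative sketch of that pattern (not UltraRAG's actual implementation; `toy_embed` stands in for the real embedding model, and the plain list stands in for a FAISS index):

```python
def iter_chunks(items, chunk_size):
    """Yield successive chunks of at most chunk_size items."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]

def toy_embed(texts):
    """Placeholder encoder: a real pipeline would batch-call the embedding model."""
    return [[float(len(t)), float(sum(map(ord, t)) % 97)] for t in texts]

def build_index(corpus, chunk_size):
    """Encode the corpus chunk by chunk, appending vectors to a flat index."""
    index = []  # stands in for e.g. a FAISS index
    for chunk in iter_chunks(corpus, chunk_size):
        index.extend(toy_embed(chunk))
    return index

corpus = [f"doc {i}" for i in range(7)]
index = build_index(corpus, chunk_size=3)  # processed as 3 chunks of sizes 3, 3, 1
```

With the configuration above, a corpus of millions of documents would be encoded 50,000 documents at a time, each chunk's vectors appended to the index before the next chunk is loaded.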
Step 3: Run the Pipeline for Encoding and Indexing
Execute the following command to run the corpus processing workflow:
ultrarag run examples/corpus_index.yaml
📌 Tip: Since the encoding and indexing process may involve large-scale corpora and take a long time, it is recommended to use screen or nohup to run the task in the background, for example:
nohup ultrarag run examples/corpus_index.yaml > log.txt 2>&1 &