chroma

rag Mar 26, 2026 4 min read

chroma is an open-source embedding database. use the chroma integration if you want to self-host your search backend or already have data indexed in chroma.

when to use chroma

  • you want to self-host your search infrastructure
  • you already have data indexed in chroma
  • you want vector, lexical (BM25), or hybrid search with full control over the server

1. create your corpus

use ChromaChunkSource to upload and index your documents:

from trainer.corpus.chroma.source import ChromaChunkSource

source = ChromaChunkSource(
    collection_name="my-docs",
    host="chroma.example.com",
    port=8000,
)

# upload from a local folder
source.populate_from_folder("./docs/")

# or upload pre-chunked data
source.populate_from_chunks(chunks)

search modes

chroma auto-detects which search modes are available based on the server’s capabilities:

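| mode    | requires                    | description                                |
|---------|-----------------------------|--------------------------------------------|
| vector  | always available            | embedding-based similarity search          |
| lexical | BM25 via Chroma Search API  | keyword matching                           |
| hybrid  | both vector + BM25          | reciprocal rank fusion of lexical + vector |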

if BM25 isn’t available on the server, chroma gracefully falls back to vector-only search.
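the fallback rule and the fusion step can be sketched in plain python. this is a minimal illustration, not the library's implementation: the function names `resolve_mode` and `reciprocal_rank_fusion` are hypothetical, and the constant `k=60` is the value commonly used for reciprocal rank fusion, which the integration may or may not share.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def resolve_mode(requested: str, bm25_available: bool) -> str:
    """Pick the mode to run; fall back to vector-only when BM25 is missing."""
    if requested not in {"vector", "lexical", "hybrid"}:
        raise ValueError(f"unknown search mode: {requested!r}")
    if requested == "vector" or not bm25_available:
        return "vector"
    return requested


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse ranked id lists; each list adds 1 / (k + rank) to a document's score."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by id so the order is stable.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists (document ids, best match first).
vector_hits = ["doc-a", "doc-b", "doc-c"]
lexical_hits = ["doc-b", "doc-d", "doc-a"]
fused = reciprocal_rank_fusion([vector_hits, lexical_hits])
print(fused)  # ['doc-b', 'doc-a', 'doc-d', 'doc-c']
```

`doc-b` wins because it ranks near the top of both lists, while `doc-c` and `doc-d` each appear in only one.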

connection modes

chroma supports multiple connection modes. client-server mode is required for training, because the model needs network access to the server during remote training:

# client-server mode (required for training)
source = ChromaChunkSource(
    collection_name="my-docs",
    host="chroma.example.com",
    port=8000,
)

# local persistent mode (development only)
source = ChromaChunkSource(
    collection_name="my-docs",
    path="./chroma-data",
)
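before starting a training run, it can help to confirm that the server's host and port accept connections. the sketch below uses only the python standard library; `can_reach` is a hypothetical helper, not part of the trainer package, and it checks TCP reachability only, not that chroma itself is answering requests.

```python
import socket


def can_reach(host: str, port: int = 8000, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, DNS failures, and timeouts.
        return False
```

for example, `can_reach("chroma.example.com", 8000)` should return `True` when run from a machine on the same network as the training cluster. a result from your laptop says little about what the cluster can see.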

ChromaChunkSource parameters

| parameter       | default  | description                                           |
|-----------------|----------|-------------------------------------------------------|
| collection_name | required | Chroma collection name                                |
| host            | None     | Chroma server hostname (required for training)        |
| port            | 8000     | Chroma server port                                    |
| path            | None     | local directory for persistent storage (development only) |
| embed_fn        | None     | custom embedding function                             |
| content_attr    | None     | metadata fields to concatenate as document content    |
| distance_metric | "cosine" | distance metric for the collection                    |
| enable_bm25     | True     | attempt to use BM25 if available on server            |
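to make `content_attr` concrete, here is a rough sketch of how metadata fields might be joined into document content. the source only says the fields are concatenated, so everything else here is an assumption: the helper name `build_content`, the newline separator, and skipping fields that are missing.

```python
from typing import Any, Mapping, Optional, Sequence


def build_content(
    document: str,
    metadata: Mapping[str, Any],
    content_attr: Optional[Sequence[str]] = None,
    separator: str = "\n",  # assumed separator; the real one may differ
) -> str:
    """Return the stored document, or the named metadata fields joined in order."""
    if not content_attr:
        return document
    # Assumption: fields that are absent or None are skipped, not treated as errors.
    parts = [str(metadata[name]) for name in content_attr if metadata.get(name) is not None]
    return separator.join(parts)


meta = {"title": "Install guide", "body": "Run the installer.", "lang": "en"}
content = build_content("", meta, content_attr=["title", "body"])
print(content)
```

this is mainly useful when existing chroma data keeps its text in metadata fields rather than in the document field.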

2. generate QA data

pass your ChromaChunkSource to the QA generation pipeline as usual. it implements the standard ChunkSource interface:

from trainer.qa_generation.cgft_models import (
    CgftPipelineConfig,
    CorpusConfig,
    PlatformConfig,
    TargetsConfig,
)
from trainer.qa_generation.cgft_pipeline import CgftPipeline

cfg = CgftPipelineConfig(
    platform=PlatformConfig(api_key="sk_..."),
    corpus=CorpusConfig(corpus_name="my-docs", docs_path="./docs/"),
    targets=TargetsConfig(total_samples=200),
)
cfg.resolve_api_keys()

pipeline = CgftPipeline(cfg, corpus_source=source)
result = pipeline.run()

3. train with SearchEnv

create a ChromaSearch client and pass it to SearchEnv:

from trainer.corpus.chroma.search import ChromaSearch
from trainer.envs.search_env import SearchEnv
from trainer.trainer.pipeline import train

search = ChromaSearch(
    collection_name="my-docs",
    host="chroma.example.com",
    port=8000,
)

experiment_id = train(
    env_class=SearchEnv,
    env_args={"search": search},
    train_dataset=result["train_dataset"],
    eval_dataset=result["eval_dataset"],
    prefix="chroma-search",
    api_key="sk_...",
)

ChromaSearch parameters

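| parameter       | default  | description                                         |
|-----------------|----------|-----------------------------------------------------|
| collection_name | required | Chroma collection name                              |
| host            | required | Chroma server hostname                              |
| port            | 8000     | Chroma server port                                  |
| embed_fn        | None     | custom embedding function                           |
| enable_bm25     | True     | attempt to use BM25 search if available on server   |
| content_attr    | None     | metadata fields to concatenate as document content  |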

notes

  • client-server required for training: during remote training, the model needs network access to query the chroma server, so local and in-memory modes won’t work. make sure your chroma server is reachable from the training cluster
  • hybrid search uses client-side reciprocal rank fusion (RRF) to merge lexical and vector results
  • graceful degradation: if BM25 isn’t available on the server, chroma automatically falls back to vector-only search without errors
  • ChromaSearch is pickle-safe. it stores only connection parameters and reconstructs the client lazily after unpickling
  • ChromaChunkSource and ChromaSearch are separate: the source handles corpus creation and QA generation, the search client handles training
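the pickle-safety note describes a common pattern: keep only connection parameters on the object and build the live client on first access. the sketch below illustrates that pattern and is not the actual ChromaSearch code. `LazySearchHandle` and `_StubClient` are hypothetical names, and the stub holds a lock only to stand in for a live connection that cannot be pickled.

```python
import threading
from typing import Any, Optional


class _StubClient:
    """Stand-in for a live client; the lock makes it unpicklable, like a socket."""

    def __init__(self, host: str, port: int) -> None:
        self.host, self.port = host, port
        self._lock = threading.Lock()


class LazySearchHandle:
    """Stores connection parameters only and builds the client on first use."""

    def __init__(self, collection_name: str, host: str, port: int = 8000) -> None:
        self.collection_name = collection_name
        self.host = host
        self.port = port
        self._client: Optional[Any] = None  # created lazily, never serialized

    @property
    def client(self) -> Any:
        if self._client is None:
            # A real implementation would open the server connection here.
            self._client = _StubClient(self.host, self.port)
        return self._client

    def __getstate__(self) -> dict:
        state = self.__dict__.copy()
        state["_client"] = None  # drop the live connection before pickling
        return state
```

with this shape, `pickle.loads(pickle.dumps(handle))` yields a copy that carries the same parameters and no client, and the copy reconnects the first time `.client` is read. that is what lets a search object be shipped to remote training workers.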