chroma is an open-source embedding database. use this if you want to self-host your search backend or already have data in chroma.
## when to use chroma
- you want to self-host your search infrastructure
- you already have data indexed in chroma
- you want vector, lexical (BM25), or hybrid search with full control over the server
## 1. create your corpus
use ChromaChunkSource to upload and index your documents:

```python
from trainer.corpus.chroma.source import ChromaChunkSource

source = ChromaChunkSource(
    collection_name="my-docs",
    host="chroma.example.com",
    port=8000,
)

# upload from a local folder
source.populate_from_folder("./docs/")

# or upload pre-chunked data
source.populate_from_chunks(chunks)
```
### search modes
chroma auto-detects which search modes are available based on the server’s capabilities:
| mode | requires | description |
|---|---|---|
| vector | always available | embedding-based similarity search |
| lexical | BM25 via Chroma Search API | keyword matching |
| hybrid | both vector + BM25 | reciprocal rank fusion of lexical and vector results |
if BM25 isn’t available on the server, chroma gracefully falls back to vector-only search.
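the hybrid mode merges the lexical and vector result lists with reciprocal rank fusion on the client side. a minimal sketch of RRF, assuming the common `k=60` smoothing constant (the constant chroma actually uses may differ):

```python
def rrf_merge(lexical_ids, vector_ids, k=60):
    """Merge two ranked id lists by reciprocal rank fusion.

    Each document scores 1 / (k + rank) for every list it appears in;
    documents with the highest combined score rank first.
    """
    scores = {}
    for ranking in (lexical_ids, vector_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks near the top of both lists, so it wins over "a",
# which is first in one list but third in the other
merged = rrf_merge(["a", "b", "c"], ["b", "d", "a"])
```

because RRF only looks at ranks, not raw scores, it needs no score normalization between the BM25 and vector result lists.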
### connection modes
chroma supports multiple connection modes, but client-server mode is required for training, since the model needs network access to the server during remote training:

```python
# client-server mode (required for training)
source = ChromaChunkSource(
    collection_name="my-docs",
    host="chroma.example.com",
    port=8000,
)

# local persistent mode (development only)
source = ChromaChunkSource(
    collection_name="my-docs",
    path="./chroma-data",
)
```
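before launching a remote run, it can save a failed job to confirm the server is actually reachable from the training environment. a small standard-library sketch (the host and port below are placeholders, not real endpoints):

```python
import socket

def chroma_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failure, refusal, and timeout
        return False

# run this from the training environment, not just your laptop
reachable = chroma_reachable("chroma.example.com", 8000, timeout=1.0)
```

this only checks TCP connectivity, not authentication or collection existence, but it catches the most common misconfiguration (a server bound to localhost or blocked by a firewall).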
### ChromaChunkSource parameters
| parameter | default | description |
|---|---|---|
| collection_name | required | Chroma collection name |
| host | None | Chroma server hostname (required for training) |
| port | 8000 | Chroma server port |
| path | None | local directory for persistent storage (development only) |
| embed_fn | None | custom embedding function |
| content_attr | None | metadata fields to concatenate as document content |
| distance_metric | "cosine" | distance metric for the collection |
| enable_bm25 | True | attempt to use BM25 if available on server |
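when document text lives in metadata fields rather than a single content column, content_attr names the fields to join into one string. a hedged sketch of what that concatenation might look like (field order, separator, and the helper name are illustrative assumptions, not ChromaChunkSource internals):

```python
def build_content(metadata, content_attrs, sep="\n"):
    """Concatenate the named metadata fields into one document string,
    skipping fields that are missing or empty."""
    parts = [str(metadata[attr]) for attr in content_attrs
             if metadata.get(attr)]
    return sep.join(parts)

# example record: the searchable text is split across two fields
record = {"title": "Quickstart", "body": "Install the client...", "lang": "en"}
content = build_content(record, ["title", "body"])
```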
## 2. generate QA data
pass your ChromaChunkSource to the QA generation pipeline as usual. it implements the standard ChunkSource interface:
```python
from trainer.qa_generation.cgft_models import (
    CgftPipelineConfig,
    PlatformConfig,
    CorpusConfig,
    TargetsConfig,
)
from trainer.qa_generation.cgft_pipeline import CgftPipeline

cfg = CgftPipelineConfig(
    platform=PlatformConfig(api_key="sk_..."),
    corpus=CorpusConfig(corpus_name="my-docs", docs_path="./docs/"),
    targets=TargetsConfig(total_samples=200),
)
cfg.resolve_api_keys()

pipeline = CgftPipeline(cfg, corpus_source=source)
result = pipeline.run()
```
## 3. train with SearchEnv
create a ChromaSearch client and pass it to SearchEnv:
```python
from trainer.corpus.chroma.search import ChromaSearch
from trainer.envs.search_env import SearchEnv
from trainer.trainer.pipeline import train

search = ChromaSearch(
    collection_name="my-docs",
    host="chroma.example.com",
    port=8000,
)

experiment_id = train(
    env_class=SearchEnv,
    env_args={"search": search},
    train_dataset=result["train_dataset"],
    eval_dataset=result["eval_dataset"],
    prefix="chroma-search",
    api_key="sk_...",
)
```
### ChromaSearch parameters
| parameter | default | description |
|---|---|---|
| collection_name | required | Chroma collection name |
| host | required | Chroma server hostname |
| port | 8000 | Chroma server port |
| embed_fn | None | custom embedding function |
| enable_bm25 | True | attempt to use BM25 search if available on server |
| content_attr | None | metadata fields to concatenate as document content |
## notes
- client-server required for training: during remote training, the model needs network access to query the chroma server. local/in-memory mode won’t work. make sure your chroma server is reachable from the training cluster
- hybrid search uses client-side reciprocal rank fusion (RRF) to merge lexical and vector results
- graceful degradation: if BM25 isn’t available on the server, chroma automatically falls back to vector-only search without errors
- pickle-safe: ChromaSearch stores only connection parameters and reconstructs the client lazily after unpickling
- ChromaChunkSource and ChromaSearch are separate: the source handles corpus creation and QA generation, the search client handles training
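the pickle-safety pattern (persist only connection parameters, rebuild the unpicklable client on first use after unpickling) can be sketched as follows; the class below is an illustration of the pattern, not ChromaSearch's actual implementation:

```python
import pickle

class LazyClient:
    """Pickles only connection parameters; the live client object,
    which is generally unpicklable, is rebuilt lazily on first use."""

    def __init__(self, host, port):
        self.host, self.port = host, port
        self._client = None  # never travels through pickle

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_client"] = None  # drop the live client before pickling
        return state

    @property
    def client(self):
        if self._client is None:
            # stand-in for reconnecting, e.g. chromadb.HttpClient(...)
            self._client = f"connected to {self.host}:{self.port}"
        return self._client

# a round-trip through pickle keeps host/port but drops the connection
copy = pickle.loads(pickle.dumps(LazyClient("chroma.example.com", 8000)))
```

this is what lets the training pipeline ship the search client to remote workers: each worker unpickles the parameters and opens its own connection.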