chroma is an open-source embedding database. use this if you want to self-host your search backend or already have data in chroma.
## when to use chroma
- you want to self-host your search infrastructure
- you already have data indexed in chroma
- you want vector, lexical (BM25), or hybrid search with full control over the server
## 1. create your corpus
use ChromaChunkSource to upload and index your documents:

```python
from trainer.corpus.chroma.source import ChromaChunkSource

source = ChromaChunkSource(
    collection_name="my-docs",
    host="chroma.example.com",
    port=8000,
)

# upload from a local folder
source.populate_from_folder("./docs/")

# or upload pre-chunked data
source.populate_from_chunks(chunks)
```
### search modes
chroma auto-detects which search modes are available based on the server’s capabilities:
| mode | requires | description |
|---|---|---|
| vector | always available | embedding-based similarity search |
| lexical | BM25 via Chroma Search API | keyword matching |
| hybrid | both vector + BM25 | reciprocal rank fusion of lexical and vector results |
if BM25 isn’t available on the server, chroma gracefully falls back to vector-only search.
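the hybrid mode merges the lexical and vector result lists with reciprocal rank fusion on the client side. a minimal sketch of RRF, assuming the common `k=60` smoothing constant (the constant chroma actually uses may differ):

```python
def rrf_merge(lexical_ids, vector_ids, k=60):
    """Merge two ranked id lists by reciprocal rank fusion.

    Each document scores 1 / (k + rank) for every list it appears in;
    documents with the highest combined score rank first.
    """
    scores = {}
    for ranking in (lexical_ids, vector_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# "b" ranks near the top of both lists, so it wins over "a",
# which is first in one list but third in the other
merged = rrf_merge(["a", "b", "c"], ["b", "d", "a"])
```

because RRF only looks at ranks, not raw scores, it needs no score normalization between the BM25 and vector result lists.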
### connection modes
chroma supports multiple connection modes, but client-server mode is required for training, since the model needs network access to the server during remote training:

```python
# client-server mode (required for training)
source = ChromaChunkSource(
    collection_name="my-docs",
    host="chroma.example.com",
    port=8000,
)

# local persistent mode (development only)
source = ChromaChunkSource(
    collection_name="my-docs",
    path="./chroma-data",
)
```
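before launching a remote run, it can save a failed job to confirm the server is actually reachable from the training environment. a small standard-library sketch (the host and port below are placeholders, not real endpoints):

```python
import socket

def chroma_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers DNS failure, refusal, and timeout
        return False

# run this from the training environment, not just your laptop
reachable = chroma_reachable("chroma.example.com", 8000, timeout=1.0)
```

this only checks TCP connectivity, not authentication or collection existence, but it catches the most common misconfiguration (a server bound to localhost or blocked by a firewall).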
### ChromaChunkSource parameters
| parameter | default | description |
|---|---|---|
| collection_name | required | Chroma collection name |
| host | None | Chroma server hostname (required for training) |
| port | 8000 | Chroma server port |
| path | None | local directory for persistent storage (development only) |
| embed_fn | None | custom embedding function |
| content_attr | None | metadata fields to concatenate as document content |
| distance_metric | "cosine" | distance metric for the collection |
| enable_bm25 | True | attempt to use BM25 if available on server |
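when document text lives in metadata fields rather than a single content column, content_attr names the fields to join into one string. a hedged sketch of what that concatenation might look like (field order, separator, and the helper name are illustrative assumptions, not ChromaChunkSource internals):

```python
def build_content(metadata, content_attrs, sep="\n"):
    """Concatenate the named metadata fields into one document string,
    skipping fields that are missing or empty."""
    parts = [str(metadata[attr]) for attr in content_attrs
             if metadata.get(attr)]
    return sep.join(parts)

# example record: the searchable text is split across two fields
record = {"title": "Quickstart", "body": "Install the client...", "lang": "en"}
content = build_content(record, ["title", "body"])
```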
## 2. generate QA data
pass your ChromaChunkSource to the QA generation pipeline as usual. it implements the standard ChunkSource interface:
```python
from trainer.qa_generation.cgft_models import (
    CgftPipelineConfig,
    PlatformConfig,
    CorpusConfig,
    TargetsConfig,
)
from trainer.qa_generation.cgft_pipeline import CgftPipeline

cfg = CgftPipelineConfig(
    platform=PlatformConfig(api_key="sk_..."),
    corpus=CorpusConfig(corpus_name="my-docs", docs_path="./docs/"),
    targets=TargetsConfig(total_samples=200),
)
cfg.resolve_api_keys()

pipeline = CgftPipeline(cfg, corpus_source=source)
result = pipeline.run()
```
## 3. train with SearchEnv
create a ChromaSearch client and pass it to SearchEnv:
```python
from trainer.corpus.chroma.search import ChromaSearch
from trainer.envs.search_env import SearchEnv
from trainer.trainer.pipeline import train

search = ChromaSearch(
    collection_name="my-docs",
    host="chroma.example.com",
    port=8000,
)

experiment_id = train(
    env_class=SearchEnv,
    env_args={"search": search},
    train_dataset=result["train_dataset"],
    eval_dataset=result["eval_dataset"],
    prefix="chroma-search",
    api_key="sk_...",
)
```
### ChromaSearch parameters
| parameter | default | description |
|---|---|---|
| collection_name | required | Chroma collection name |
| host | required | Chroma server hostname |
| port | 8000 | Chroma server port |
| embed_fn | None | custom embedding function |
| enable_bm25 | True | attempt to use BM25 search if available on server |
| content_attr | None | metadata fields to concatenate as document content |
## notes
- client-server required for training: during remote training, the model needs network access to query the chroma server. local/in-memory mode won’t work. make sure your chroma server is reachable from the training cluster
- hybrid search uses client-side reciprocal rank fusion (RRF) to merge lexical and vector results
- graceful degradation: if BM25 isn’t available on the server, chroma automatically falls back to vector-only search without errors
- pickle-safe: ChromaSearch stores only connection parameters and reconstructs the client lazily after unpickling
- ChromaChunkSource and ChromaSearch are separate: the source handles corpus creation and QA generation, the search client handles training
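the pickle-safety pattern (persist only connection parameters, rebuild the unpicklable client on first use after unpickling) can be sketched as follows; the class below is an illustration of the pattern, not ChromaSearch's actual implementation:

```python
import pickle

class LazyClient:
    """Pickles only connection parameters; the live client object,
    which is generally unpicklable, is rebuilt lazily on first use."""

    def __init__(self, host, port):
        self.host, self.port = host, port
        self._client = None  # never travels through pickle

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_client"] = None  # drop the live client before pickling
        return state

    @property
    def client(self):
        if self._client is None:
            # stand-in for reconnecting, e.g. chromadb.HttpClient(...)
            self._client = f"connected to {self.host}:{self.port}"
        return self._client

# a round-trip through pickle keeps host/port but drops the connection
copy = pickle.loads(pickle.dumps(LazyClient("chroma.example.com", 8000)))
```

this is what lets the training pipeline ship the search client to remote workers: each worker unpickles the parameters and opens its own connection.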