turbopuffer is a third-party vector database that supports lexical, vector, and hybrid search. use this if you already have data in turbopuffer or want vector/hybrid retrieval beyond BM25. castform does not provide turbopuffer access; you’ll need your own api key.
when to use turbopuffer
- you already have data indexed in turbopuffer
- you need vector or hybrid search (e.g., for code, jargon-heavy content, or multilingual corpora where keyword matching alone falls short)
- you want lexical search without embeddings (turbopuffer supports BM25 natively)
1. create your corpus
use TpufChunkSource to upload and index your documents:
```python
from trainer.corpus.turbopuffer.source import TpufChunkSource

# lexical-only (no embeddings needed)
source = TpufChunkSource(
    api_key="tpuf_...",
    namespace="my-docs",
)
source.populate_from_folder("./docs/")
```
to enable vector and hybrid search, pass an embedding function:
```python
from trainer.corpus.turbopuffer.source import TpufChunkSource
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")
source = TpufChunkSource(
    api_key="tpuf_...",
    namespace="my-docs",
    embed_fn=model.encode,
)
source.populate_from_folder("./docs/")
```
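embed_fn can be any callable that maps text to a fixed-length vector; SentenceTransformer.encode has that shape, but so does a hand-rolled function. as a purely illustrative stand-in for wiring tests without downloading a model (this toy function is an assumption, not part of the library):

```python
import hashlib

def toy_embed(text, dim=8):
    """deterministic stand-in embedding: sha256 bytes scaled to [0, 1).

    useful only for testing the plumbing -- it carries no semantic signal,
    so search quality with it will be meaningless.
    """
    digest = hashlib.sha256(text.encode()).digest()
    return [b / 256 for b in digest[:dim]]
```

any real deployment should use a proper embedding model; the only contract assumed here is text in, fixed-length numeric vector out.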
you can also upload pre-chunked data:
```python
source.populate_from_chunks(chunks)
```
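the exact chunk schema is defined by the trainer library; as a purely illustrative sketch (the field names here are assumptions, not the real interface), pre-chunked data might look like:

```python
# hypothetical chunk shape -- check the trainer library for the real schema.
# the idea: you control chunk boundaries instead of letting
# populate_from_folder split documents for you.
chunks = [
    {"id": "doc1-0", "text": "turbopuffer supports BM25 natively."},
    {"id": "doc1-1", "text": "hybrid search fuses lexical and vector results."},
]
```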
search modes
| mode | requires embed_fn | description |
|---|---|---|
| lexical | no | BM25 keyword matching |
| vector | yes | approximate nearest neighbor search over embeddings |
| hybrid | yes | reciprocal rank fusion of lexical and vector results |
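to make the hybrid row concrete, here is a minimal sketch of reciprocal rank fusion, which scores each document by summing 1/(k + rank) across the ranked lists it appears in (the function name and the conventional k=60 constant are illustrative, not the library's actual implementation):

```python
def rrf_merge(lexical_ids, vector_ids, k=60):
    """merge two ranked id lists with reciprocal rank fusion (RRF).

    a document appearing near the top of either list scores well;
    appearing in both lists compounds the score.
    """
    scores = {}
    for ranked in (lexical_ids, vector_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

note how "b", ranked second lexically and first by vector, outranks "a", which only one list surfaced: `rrf_merge(["a", "b", "c"], ["b", "c", "d"])` puts "b" first.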
2. generate QA data
pass your TpufChunkSource to the QA generation pipeline as usual. it implements the standard ChunkSource interface:
```python
from trainer.qa_generation.cgft_models import CgftPipelineConfig, PlatformConfig, CorpusConfig, TargetsConfig
from trainer.qa_generation.cgft_pipeline import CgftPipeline

cfg = CgftPipelineConfig(
    platform=PlatformConfig(api_key="sk_..."),
    corpus=CorpusConfig(corpus_name="my-docs", docs_path="./docs/"),
    targets=TargetsConfig(total_samples=200),
)
cfg.resolve_api_keys()

pipeline = CgftPipeline(cfg, corpus_source=source)
result = pipeline.run()
```
3. train with SearchEnv
create a TpufSearch client and pass it to SearchEnv:
```python
from trainer.corpus.turbopuffer.search import TpufSearch
from trainer.envs.search_env import SearchEnv
from trainer.trainer.pipeline import train

search = TpufSearch(
    api_key="tpuf_...",
    namespace="my-docs",
    embed_fn=model.encode,  # same embed_fn used for the corpus
)

experiment_id = train(
    env_class=SearchEnv,
    env_args={"search": search},
    train_dataset=result["train_dataset"],
    eval_dataset=result["eval_dataset"],
    prefix="tpuf-search",
    api_key="sk_...",
)
```
TpufSearch parameters
| parameter | default | description |
|---|---|---|
| api_key | required | turbopuffer API key |
| namespace | required | turbopuffer namespace |
| region | "aws-us-east-1" | turbopuffer region |
| content_attr | None | metadata fields to concatenate as content |
| embed_fn | None | embedding function; required for vector and hybrid modes |
| vector_attr | "vector" | attribute name for vector storage |
| distance_metric | "cosine_distance" | distance metric for vector search |
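the default distance_metric, cosine distance, is 1 minus cosine similarity: 0 for vectors pointing the same direction, 1 for orthogonal vectors. a minimal reference implementation of the formula (not the library's code, which runs server-side in turbopuffer):

```python
import math

def cosine_distance(a, b):
    """cosine distance = 1 - (a . b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return 1.0 - dot / (norm_a * norm_b)
```

because cosine distance ignores vector magnitude, it pairs well with embedding models like sentence-transformers whose outputs are compared by direction rather than length.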
notes
- hybrid search uses client-side reciprocal rank fusion (RRF) to merge lexical and vector results
- TpufSearch is pickle-safe: it stores only connection parameters and reconstructs the SDK client lazily after unpickling
- TpufChunkSource and TpufSearch are separate: the source handles corpus creation and QA generation, the search client handles training