the castform rag pipeline takes your documents and produces a trained search model that understands your corpus.
pipeline stages
- chunking: split your documents into retrieval-sized pieces. built-in chunkers for markdown and email threads, or bring your own.
- corpus upload: index chunks for search. supported backends: castform’s corpus api (BM25), turbopuffer, pinecone, or chroma.
- QA generation: generate synthetic question-answer pairs from your corpus. handles generation, filtering, and transformation.
- search environment: define the tools and reward signals for training. the model gets a search tool over your corpus and is rewarded for retrieving the right chunks.
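since the chunking stage lets you bring your own chunker, here is a rough sketch of what a custom markdown chunker could look like. the function name `chunk_markdown` and the returned dict shape are illustrative assumptions, not castform's actual chunker interface:

```python
import re

# hypothetical custom chunker: split a markdown document before each
# header line so every chunk stays a coherent, retrieval-sized section.
# the name and return shape are illustrative, not castform's API.
def chunk_markdown(text: str, max_chars: int = 2000) -> list[dict]:
    sections = re.split(r"(?m)^(?=#+ )", text)  # split before "# ..." lines
    chunks = []
    for section in sections:
        section = section.strip()
        if not section:
            continue
        if len(section) <= max_chars:
            chunks.append({"text": section})
        else:
            # fall back to paragraph splits when a section is too large
            for para in section.split("\n\n"):
                if para.strip():
                    chunks.append({"text": para.strip()})
    return chunks
```

the header-boundary split keeps each chunk self-contained, which tends to matter more for retrieval quality than hitting an exact token budget.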
quickstart: documents to trained model
```python
from trainer.qa_generation.cgft_models import CgftPipelineConfig, PlatformConfig, CorpusConfig, TargetsConfig
from trainer.qa_generation.cgft_pipeline import CgftPipeline
from trainer.trainer.pipeline import train
from trainer.corpus.corpora.search import CorporaSearch
from trainer.envs.search_env import SearchEnv

# 1. generate training data
cfg = CgftPipelineConfig(
    platform=PlatformConfig(api_key="sk_..."),
    corpus=CorpusConfig(corpus_name="my-docs", docs_path="./my-docs"),
    targets=TargetsConfig(total_samples=200),
)
cfg.resolve_api_keys()
pipeline = CgftPipeline(cfg)
result = pipeline.run()

# 2. launch training
search = CorporaSearch(
    api_key="sk_...",
    corpus_name="my-docs",
    base_url="https://app.castform.com",
)
experiment_id = train(
    env_class=SearchEnv,
    env_args={"search": search},
    train_dataset=result["train_dataset"],
    eval_dataset=result["eval_dataset"],
    prefix="my-search-model",
    api_key="sk_...",
)
```
monitor progress in the console.
this quickstart uses the castform corpus api (BM25). for third-party backends, see the integration guides: turbopuffer, pinecone, chroma.
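the reward signal described above, retrieving the right chunks, can be sketched as a recall-style score. the function below is an illustrative assumption of that idea, not SearchEnv's actual reward code, and the chunk-id arguments are hypothetical:

```python
# illustrative recall-style reward: score a rollout by how many of the
# gold chunk ids (the chunks a QA pair was generated from) appear among
# the ids the model's search calls retrieved. not castform's real reward.
def chunk_recall_reward(retrieved_ids: list[str], gold_ids: list[str]) -> float:
    if not gold_ids:
        return 0.0
    hits = sum(1 for cid in gold_ids if cid in retrieved_ids)
    return hits / len(gold_ids)
```

a recall-shaped reward like this pays the model for covering every relevant chunk rather than for ranking one chunk first, which matches the multi-chunk QA pairs the pipeline generates.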
next steps
- QA generation: how CgftPipeline works, full config reference
- chunking: customize how documents are split
- search environment: how the RL training environment works