CgftPipeline takes a corpus and produces a synthetic QA dataset for training. it can chunk and index raw documents for you, or work from a corpus you've already indexed; training itself is handled separately.
how it works
the pipeline is configured through CgftPipelineConfig, which has a few top-level sections:
- platform / corpus: your api keys and which corpus to generate from. if you provide a docs_path, the pipeline chunks and indexes the documents for you. if you already have an indexed corpus, pass corpus_id instead.
- corpus_context: optional metadata about your corpus (a description, example queries). helps the LLM generate more relevant and diverse questions.
- targets: how many QA pairs to generate and the distribution across question types (single-hop lookup, multi-hop, reasoning chains, etc).
the pipeline then runs three stages:
- generation: samples seed chunks from your corpus, links them to related chunks, and uses an LLM to produce question-answer pairs grounded in those chunks.
- filtering: runs quality checks on the generated pairs. rejects ones with format issues, ungrounded answers, or trivially easy retrieval. pairs that fail can be regenerated with feedback through a refinement loop.
- transformation: rewrites questions into different styles (keyword, natural language, expert jargon) and optionally adds noise (typos, abbreviations) to make the training data more realistic.
the output is a train/eval split in JSONL format.
usage
```python
from trainer.qa_generation.cgft_models import (
    CgftPipelineConfig,
    PlatformConfig,
    CorpusConfig,
    TargetsConfig,
)
from trainer.qa_generation.cgft_pipeline import CgftPipeline

cfg = CgftPipelineConfig(
    platform=PlatformConfig(api_key="sk_..."),
    corpus=CorpusConfig(corpus_name="my-docs", docs_path="./my-docs"),
    targets=TargetsConfig(total_samples=200),
)
cfg.resolve_api_keys()

pipeline = CgftPipeline(cfg)
result = pipeline.run()

train_data = result["train_dataset"]
eval_data = result["eval_dataset"]
```
the pipeline is resumable. rerun with the same output directory and it picks up from the last checkpoint.
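if your corpus is already chunked and indexed, skip docs_path and reference the existing corpus instead. a minimal sketch, assuming CorpusConfig exposes the corpus_id field mentioned above (the id value here is made up):

```python
from trainer.qa_generation.cgft_models import CorpusConfig

# hypothetical id; pass corpus_id instead of docs_path to reuse an indexed corpus
corpus = CorpusConfig(corpus_id="corpus_abc123")
```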
full config example
most defaults are sensible. here’s everything:
```yaml
random_seed: 42
verbose: true
resume: true

platform:
  api_key: 'sk_...'
  base_url: 'https://app.castform.com'

corpus:
  docs_path: './my-docs'
  corpus_name: 'my-docs'
  min_chunk_chars: 400

corpus_context:
  enabled: true
  description: 'internal engineering documentation for acme corp'
  example_queries:
    - 'how do I configure the auth middleware?'
    - "what's the retry policy for failed jobs?"
  num_top_level_samples: 4
  num_random_samples: 4
  generate_entity_patterns: true

targets:
  total_samples: 200
  qa_type_distribution:
    lookup: 0.333
    co_located_multi_hop: 0.200
    cross_document_multi_hop: 0.333
    sequential_reasoning: 0.133
    synthesis: 0.0

linker:
  type: 'structural'
  structural:
    bm25_enrichment_queries: 3
    bm25_enrichment_top_k: 5
    max_related_refs: 3
    search_mode: 'auto'

generation:
  mode: 'llm_direct'
  llm_direct:
    model: 'gpt-5.4'
    max_completion_tokens: 4096
    max_concurrent: 8
    batch_enabled: true

filtering:
  deterministic_guards:
    enabled: true
    min_question_chars: 12
    min_answer_chars: 24
    min_reference_chunks: 1
  filters:
    - 'grounding_llm'
    - 'retrieval_too_easy_llm'
  grounding_llm:
    judge_model: 'gpt-5.4'
  retrieval_llm:
    judge_model: 'gpt-5.4'
    overlap_threshold: 0.5
    too_easy_confidence_threshold: 0.75
  refinement:
    enabled: true
    max_refinements_per_item: 2
    max_same_seed_attempts_before_reanchor: 3
    max_rounds: 4

transformation:
  noise_level: 'light'
  style_distribution:
    keyword: 0.33
    natural: 0.34
    expert: 0.33
  validation_enabled: true
  preserve_original_in_metadata: true

split:
  train_ratio: 0.8
  stratify_by: ['qa_type', 'style']
  seed: 42

output:
  dir: 'outputs/castform'
  train_jsonl: 'train.jsonl'
  eval_jsonl: 'eval.jsonl'
```
generation
the pipeline samples seed chunks, links them to related chunks, and uses an LLM to generate questions that require those chunks to answer. each QA pair is a <question, answer, reference_chunks> triplet.
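concretely, a record in the output JSONL might look like the sketch below. question, answer, and reference_chunks are the triplet above; qa_type and style are the columns the splitter stratifies on; everything else (chunk ids, exact values) is illustrative:

```json
{
  "question": "what's the retry policy for failed jobs?",
  "answer": "failed jobs are retried up to ...",
  "reference_chunks": ["chunk_0412", "chunk_0977"],
  "qa_type": "cross_document_multi_hop",
  "style": "natural"
}
```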
qa types
| type | description |
|---|---|
| lookup | single-chunk fact lookup |
| co_located_multi_hop | multi-hop within the same document |
| cross_document_multi_hop | multi-hop across different documents |
| sequential_reasoning | step-by-step reasoning chains |
| synthesis | summarization across multiple sources (disabled by default) |
control the mix with targets.qa_type_distribution. single-hop is cheaper (one LLM call). multi-hop requires chunk linking first.
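for example, to skew toward cheap single-hop lookups (a sketch, assuming TargetsConfig accepts qa_type_distribution as a dict mirroring the YAML above):

```python
from trainer.qa_generation.cgft_models import TargetsConfig

# weights mirror targets.qa_type_distribution in the full config example
targets = TargetsConfig(
    total_samples=200,
    qa_type_distribution={
        "lookup": 0.5,
        "co_located_multi_hop": 0.25,
        "cross_document_multi_hop": 0.25,
    },
)
```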
corpus context
before generating, the pipeline profiles your corpus by summarizing content, extracting entity patterns, and expanding your example queries. this helps the LLM understand your domain.
| field | default | description |
|---|---|---|
| description | "" | plain-text description of your corpus |
| example_queries | [] | example search queries users would ask |
| generate_entity_patterns | true | extract entity patterns (names, ids, jargon) |
providing description and example_queries significantly improves QA diversity.
chunk linkers
linkers find related chunks for multi-hop questions.
structural (default) uses file-structure neighbors and BM25 enrichment. no LLM calls. good for well-structured docs.
| field | default | description |
|---|---|---|
| bm25_enrichment_queries | 3 | BM25 queries per chunk |
| max_related_refs | 3 | max related chunks to link |
| search_mode | "auto" | auto / lexical / hybrid / vector |
llm_guided has the LLM generate search queries to find semantically related chunks. better for unstructured corpora, more expensive.
adaptive starts structural and falls back to LLM-guided when enrichment signals are weak.
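switching linkers is a one-key change on linker.type. the llm_guided and adaptive sub-options aren't shown in the full config example above, so this sketch only sets the type:

```yaml
linker:
  type: 'llm_guided'   # or 'adaptive'; default is 'structural'
```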
generators
llm_direct (default) makes a direct LLM call per QA pair. fast.
| field | default | description |
|---|---|---|
| model | "gpt-5.4" | generation model |
| max_concurrent | 8 | parallel generation requests |
| batch_enabled | true | enable batch processing |
llm_env generates QA through an RL environment rollout where the model uses tools to search interactively. more expensive but produces higher quality multi-hop questions.
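generators are selected the same way, via generation.mode:

```yaml
generation:
  mode: 'llm_env'   # default is 'llm_direct'
```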
tips
- chunk size: 1024-2048 chars works well. too small and questions lack context; too big and they get noisy.
- spread seeds: more chunks with fewer questions each beats fewer chunks with many questions.
- start with `llm_direct`. only switch to `llm_env` if you need higher quality multi-hop.
filtering & refinement
filters run in sequence. each marks items as passed, rejected, or needs_refinement. items that need refinement get regenerated with feedback.
filter chain
runs cheapest to most expensive:
1. deterministic guards catch format and length issues: empty answers, single-word questions, missing references.
| field | default | description |
|---|---|---|
| min_question_chars | 12 | minimum question length |
| min_answer_chars | 24 | minimum answer length |
| min_reference_chunks | 1 | minimum reference chunks |
2. grounding_llm uses an LLM judge to check whether the answer is actually supported by the reference chunks. the most important filter.
3. retrieval_too_easy_llm checks if naive BM25 can already find the answer. if so, the question won’t teach the model anything via RL. marks as needs_refinement rather than rejecting.
| field | default | description |
|---|---|---|
| overlap_threshold | 0.5 | chunk overlap threshold for flagging |
| too_easy_confidence_threshold | 0.75 | confidence above this = too easy |
4. env_rollout runs the QA pair through the actual RL environment. most expensive, most accurate. optional.
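the two LLM filters are enabled by name through filtering.filters in the config. assuming env_rollout is switched on the same way (the full example above only lists the LLM filters), the chain would look like:

```yaml
filtering:
  filters:
    - 'grounding_llm'
    - 'retrieval_too_easy_llm'
    - 'env_rollout'   # assumption: enabled by name like the other filters
```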
refinement loop
failed items get regenerated with the failure reason as feedback:
- filters run on all QA pairs
- `needs_refinement` items get regenerated with feedback
- regenerated items go through filters again
- repeat until all pass or budget runs out
| field | default | description |
|---|---|---|
| max_refinements_per_item | 2 | max fix attempts per pair |
| max_same_seed_attempts_before_reanchor | 3 | failures before switching to a different seed chunk |
| max_rounds | 4 | max filter-refine cycles |
| max_total_regenerations | total_samples * 2 | global budget cap |
if a seed chunk keeps producing bad questions, reanchoring to a different chunk is more productive than retrying.
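if a noisy corpus pushes many items into refinement, the budgets can be adjusted (keys as in the full config example above):

```yaml
filtering:
  refinement:
    max_refinements_per_item: 3
    max_same_seed_attempts_before_reanchor: 2   # reanchor sooner
    max_rounds: 6
```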
checkpointing
results are saved after each filter round. resume: true (the default) picks up from the last completed round on restart.
transformation
rewrites questions to mimic real-world search patterns. users don’t always write clean, well-formed queries.
style distribution
| style | default weight | example |
|---|---|---|
| keyword | 33% | k8s pod memory limits |
| natural | 34% | how do I set memory limits on kubernetes pods? |
| expert | 33% | configure resource requests and limits in pod spec |
adjust weights to match how your users actually search.
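for example, if your users mostly type terse search-box queries, something like:

```yaml
transformation:
  style_distribution:
    keyword: 0.6
    natural: 0.3
    expert: 0.1
```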
noise levels
| level | behavior |
|---|---|
| none | no modification |
| light | minor typos, abbreviations, casual phrasing |
| moderate | dropped words, shorthand, spelling errors |
start with light.
validation
an LLM validates the transformed question still maps to the same answer. if the restyling changed the meaning, the original is kept.
email normalization
email_normalization: true strips names, dates, and email headers that would leak context not available in a real search query.
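this option isn't shown in the full config example above; presumably it sits alongside the other transformation settings:

```yaml
transformation:
  email_normalization: true   # assumption: lives under transformation
```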
output
| field | default | description |
|---|---|---|
| split.train_ratio | 0.8 | fraction of data for training |
| split.stratify_by | ["qa_type", "style"] | balanced splits across these columns |
| output.dir | "outputs/castform" | output directory |
| output.train_jsonl | "train.jsonl" | training data filename |
| output.eval_jsonl | "eval.jsonl" | eval data filename |
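to sanity-check the result before training, read a few records back (paths follow the defaults above):

```python
import json
from itertools import islice
from pathlib import Path

# defaults from the output table above
train_path = Path("outputs/castform") / "train.jsonl"

with train_path.open() as f:
    for line in islice(f, 3):
        record = json.loads(line)
        print(record["question"])   # 'question' per the QA triplet
```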
after transformation and splitting, your dataset is ready for training. see launching a training run to start a training job.