pinecone

rag · Mar 26, 2026

pinecone is a managed vector database. use it if you already have data in pinecone or want managed vector search without running your own infrastructure. castform does not provide pinecone access; you’ll need your own API key.

when to use pinecone

  • you already have data indexed in pinecone
  • you want managed vector search with no servers to run
  • you want to use pinecone’s hosted inference for embeddings (no need to provide your own embedding function)

1. create your corpus

use PineconeChunkSource to upload and index your documents:

from trainer.corpus.pinecone.source import PineconeChunkSource

source = PineconeChunkSource(
    api_key="pc_...",
    index_name="my-docs",
)

# upload from a local folder
source.populate_from_folder("./docs/")

# or upload pre-chunked data
source.populate_from_chunks(chunks)

embeddings

by default, pinecone uses its hosted inference API (multilingual-e5-large) for embeddings. to use a custom embedding function:

from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

source = PineconeChunkSource(
    api_key="pc_...",
    index_name="my-docs",
    embed_fn=model.encode,
)

bring your own index (BYOI)

if you already have data in pinecone with custom metadata field names, use field_mapping to map them to castform’s expected fields:

source = PineconeChunkSource(
    api_key="pc_...",
    index_name="existing-index",
    field_mapping={
        "content": "body_text",     # your field name for chunk content
        "file_path": "source_file", # your field name for file path
    },
)
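conceptually, field_mapping just translates your custom metadata keys into the standard field names castform expects. the sketch below illustrates the idea in plain python; it is not the actual trainer implementation, and `remap_metadata` is a hypothetical helper name:

from copy import copy

def remap_metadata(metadata: dict, field_mapping: dict) -> dict:
    """Return metadata with castform's standard keys added, pulling
    values from the custom keys named in field_mapping."""
    remapped = copy(metadata)
    for standard_key, custom_key in field_mapping.items():
        if custom_key in metadata:
            remapped[standard_key] = metadata[custom_key]
    return remapped

record = {"body_text": "chunk text here", "source_file": "docs/a.md"}
mapping = {"content": "body_text", "file_path": "source_file"}

print(remap_metadata(record, mapping)["content"])  # chunk text here

your original fields are left in place; the standard names are simply filled in alongside them, so nothing in your existing index needs to change.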

PineconeChunkSource parameters

| parameter | default | description |
| --- | --- | --- |
| api_key | required | Pinecone API key |
| index_name | required | name of the Pinecone index |
| index_host | None | direct host URL (skips index lookup) |
| namespace | `""` | Pinecone namespace to use |
| embed_fn | None | custom embedding function; overrides hosted inference |
| embed_model | `"multilingual-e5-large"` | Pinecone hosted inference model (used when no embed_fn) |
| field_mapping | None | maps custom metadata field names for BYOI |
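putting a few of the optional parameters together, a configuration might look like the fragment below. the host URL and namespace values are placeholders, not real endpoints:

from trainer.corpus.pinecone.source import PineconeChunkSource

source = PineconeChunkSource(
    api_key="pc_...",
    index_name="my-docs",
    # placeholder host URL; passing it skips the index lookup call
    index_host="https://my-docs-example.svc.pinecone.io",
    # keep these vectors isolated from other data in the same index
    namespace="prod",
)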

2. generate QA data

pass your PineconeChunkSource to the QA generation pipeline as usual. it implements the standard ChunkSource interface:

from trainer.qa_generation.cgft_models import CgftPipelineConfig, PlatformConfig, CorpusConfig, TargetsConfig
from trainer.qa_generation.cgft_pipeline import CgftPipeline

cfg = CgftPipelineConfig(
    platform=PlatformConfig(api_key="sk_..."),
    corpus=CorpusConfig(corpus_name="my-docs", docs_path="./docs/"),
    targets=TargetsConfig(total_samples=200),
)
cfg.resolve_api_keys()

pipeline = CgftPipeline(cfg, corpus_source=source)
result = pipeline.run()

3. train with SearchEnv

create a PineconeSearch client and pass it to SearchEnv:

from trainer.corpus.pinecone.search import PineconeSearch
from trainer.envs.search_env import SearchEnv
from trainer.trainer.pipeline import train

search = PineconeSearch(
    api_key="pc_...",
    index_name="my-docs",
    # uses Pinecone hosted inference by default
    # or pass embed_fn for custom embeddings
)

experiment_id = train(
    env_class=SearchEnv,
    env_args={"search": search},
    train_dataset=result["train_dataset"],
    eval_dataset=result["eval_dataset"],
    prefix="pinecone-search",
    api_key="sk_...",
)

PineconeSearch parameters

| parameter | default | description |
| --- | --- | --- |
| api_key | required | Pinecone API key |
| index_name | required | name of the Pinecone index |
| index_host | None | direct host URL (skips index lookup) |
| namespace | `""` | Pinecone namespace |
| embed_fn | None | custom embedding function |
| embed_model | `"multilingual-e5-large"` | Pinecone hosted inference model (used when no embed_fn) |
| field_mapping | None | maps custom metadata field names for BYOI (bring your own index) |

notes

  • vector-only search: pinecone supports vector search only. if you need lexical or hybrid modes, consider turbopuffer or chroma
  • hosted inference: when no embed_fn is provided, pinecone uses its own inference API with embed_model (default: multilingual-e5-large). this means you don’t need to run or manage an embedding model
  • PineconeSearch is pickle-safe. it stores only connection parameters and reconstructs the SDK client lazily after unpickling
  • PineconeChunkSource and PineconeSearch are separate: the source handles corpus creation and QA generation, the search client handles training
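the pickle-safety note above describes a common pattern: keep only plain connection parameters on the object, drop the live SDK handle when pickling, and rebuild it lazily after unpickling. the sketch below shows the general shape of that pattern with a stand-in class; it is illustrative only, not the actual PineconeSearch source:

import pickle

class LazySearchClient:
    """Illustrative pickle-safe wrapper: stores only plain parameters
    and recreates the live client on first use after unpickling."""

    def __init__(self, api_key: str, index_name: str):
        self.api_key = api_key
        self.index_name = index_name
        self._client = None  # live SDK handle, created lazily

    def _get_client(self):
        if self._client is None:
            # the real class would construct the Pinecone SDK client here
            self._client = ("connected", self.index_name)
        return self._client

    def __getstate__(self):
        state = self.__dict__.copy()
        state["_client"] = None  # never pickle the live client
        return state

search = LazySearchClient(api_key="pc_...", index_name="my-docs")
search._get_client()  # client now exists on this instance
clone = pickle.loads(pickle.dumps(search))
print(clone._client)  # None: the clone rebuilds its client on next use

this is why the search client can be shipped to training workers safely: only strings cross the pickle boundary, and each worker opens its own connection.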