Agent Experimentation

Try multiple approaches and keep the best

Fork the data, let each agent try a different approach, compare outcomes, promote the winner.

You want to try three different embedding models, or two chunking strategies, or a new prompt template against the old one. Each variant needs to run against the same data without stepping on the others. Copying the dataset per experiment is slow and multiplies your storage bill.

Tigris bucket forks give each variant its own writable copy of the data with no upfront cost. A fork is a copy-on-write clone: instant to create, zero storage until something new gets written. Unlike sandboxes, which focus on giving agents isolated environments, the experimentation pattern adds a comparison step at the end: run the same task multiple ways, score the outputs, keep the winner.

See also: Snapshots and forks

[Figure: Experiment swimlanes. A baseline bucket holding the shared dataset is snapshotted (Snapshot v1), then zero-copy forked into Fork A, Fork B, and Fork C, one per strategy. Each fork writes to its own results/ path, and a compare-and-pick step promotes the winner. Each fork shares the baseline, so you only store new writes per experiment.]

Benefits

Non-destructive writes via fork isolation

Each experiment runs inside its own fork. If an agent corrupts the data or produces garbage, the original dataset is untouched. Delete the fork and start over.

Per-experiment S3 namespace

Multiple agents can work on the same data at the same time. Each fork is its own S3 namespace, so there's no locking and no path-prefix conventions to manage. Writes in one fork don't show up in any other.

Copy-on-write storage sharing

Forks share the baseline data through copy-on-write. You only pay for bytes each experiment actually writes. If your experiments add scores or labels on top of the original data, overhead is small. If they rewrite most of the data (like re-embedding an entire corpus), each fork uses more storage.
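To make the storage trade-off concrete, here is a rough back-of-the-envelope sketch. It assumes a deliberately simplified model: the baseline is stored once, and each fork adds only the bytes it writes. It ignores metadata overhead and any billing details, and the sizes are made-up placeholders, not measurements.

```python
def fork_storage_bytes(baseline_bytes: int, unique_writes: list[int]) -> int:
    """Baseline stored once, plus each fork's own new writes."""
    return baseline_bytes + sum(unique_writes)


def full_copy_storage_bytes(baseline_bytes: int, unique_writes: list[int]) -> int:
    """Baseline, plus one full copy per experiment, plus each experiment's writes."""
    return baseline_bytes * (1 + len(unique_writes)) + sum(unique_writes)


GIB = 1024 ** 3
baseline = 100 * GIB  # hypothetical dataset size

# Hypothetical: three experiments, each writing 2 GiB of scores and labels.
writes = [2 * GIB, 2 * GIB, 2 * GIB]

forked = fork_storage_bytes(baseline, writes)
copied = full_copy_storage_bytes(baseline, writes)
print(forked // GIB, copied // GIB)  # prints: 106 406
```

If each experiment instead rewrote the whole corpus, every entry in `writes` would approach the baseline size and the gap between the two totals would shrink accordingly.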

Snapshot-pinned baselines

Snapshot the dataset before you start. Every fork branches from the same snapshot, so results are directly comparable. Weeks later, re-run any experiment by forking from the same snapshot version.

tigris snapshots take my-dataset baseline-v1

Collision-free output paths

Because each fork is its own S3 namespace, every agent can write to results/scores.json without colliding. Collecting results across experiments is a loop over bucket names, not a query against a shared database.

Patterns

Prompt and model evaluation

Test the same task across different models or prompt templates. Fork the test set, let each agent run its variant, and compare output quality. Each agent reads inputs from its fork and writes scores to a known path.

# Pin the test set
tigris snapshots take eval-set "pre-eval"

# One fork per variant
tigris buckets create eval-gpt4o --fork-of eval-set
tigris buckets create eval-claude --fork-of eval-set
tigris buckets create eval-llama --fork-of eval-set

# Each agent reads s3://eval-{model}/inputs/ and writes to
# s3://eval-{model}/results/scores.json
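When the list of variants grows, the fork commands above can be generated rather than typed. The following is a minimal sketch that only builds the command lines and paths; it does not run anything. The function names and the `eval` prefix are illustrative assumptions; the CLI arguments are the ones shown above.

```python
import shlex


def plan_eval_forks(source_bucket: str, variants: list[str], prefix: str = "eval") -> list[str]:
    """Return one `tigris buckets create <fork> --fork-of <source>` line per variant."""
    commands = []
    for variant in variants:
        fork = f"{prefix}-{variant}"
        argv = ["tigris", "buckets", "create", fork, "--fork-of", source_bucket]
        commands.append(shlex.join(argv))  # quotes arguments only when needed
    return commands


def variant_paths(fork: str) -> dict[str, str]:
    """Input and output locations an agent would use inside its fork."""
    return {
        "inputs": f"s3://{fork}/inputs/",
        "scores": f"s3://{fork}/results/scores.json",
    }


for line in plan_eval_forks("eval-set", ["gpt4o", "claude", "llama"]):
    print(line)
```

Printing the commands first makes it easy to review them before piping them to a shell.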

Data preparation and enrichment

Agents that clean, label, or transform datasets can try different approaches in parallel and keep the output that scores highest. The original raw data is shared across forks; each agent writes its transformed output on top.

tigris buckets create cleaned-aggressive --fork-of raw-dataset
tigris buckets create cleaned-conservative --fork-of raw-dataset
tigris buckets create cleaned-llm-assisted --fork-of raw-dataset

# Each agent reads the original files (shared via copy-on-write)
# and writes transformed output to s3://{fork}/processed/

RAG pipeline tuning

Try different retrieval configurations against the same knowledge base. Each fork starts with the same source documents; each agent builds its own index inside the fork. Since re-indexing rewrites most of the data, the storage savings come from sharing the source documents, not the indices.

tigris buckets create rag-chunk-256 --fork-of knowledge-base
tigris buckets create rag-chunk-512 --fork-of knowledge-base
tigris buckets create rag-with-reranker --fork-of knowledge-base

# Each agent reads s3://{fork}/documents/ (shared, zero-copy)
# and writes its index to s3://{fork}/index/ (new per fork)

Rollout safety

Before deploying a new agent version, fork production data and let the new version process it. Compare its output against the current version's results without touching live data.

# Snapshot current production state
tigris snapshots take prod-data "pre-rollout-$(date +%s)"

# Fork for the new version to run against
tigris buckets create rollout-candidate --fork-of prod-data

# New agent version reads from and writes to the fork
# Diff s3://rollout-candidate/results/ against s3://prod-data/results/
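One way to do the diff is to download both results/ prefixes and compare them by content hash. The sketch below assumes the two prefixes have already been copied into local directories (the directory layout is hypothetical), and it reports which keys the candidate added, removed, or changed. It is an illustration of the comparison step, not a Tigris feature.

```python
import hashlib
from pathlib import Path


def digest_tree(root: Path) -> dict[str, str]:
    """Map each file's path relative to root to the SHA-256 of its contents."""
    return {
        path.relative_to(root).as_posix(): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def diff_results(current: dict[str, str], candidate: dict[str, str]) -> dict[str, list[str]]:
    """Classify keys as added, removed, or changed in the candidate."""
    return {
        "added": sorted(candidate.keys() - current.keys()),
        "removed": sorted(current.keys() - candidate.keys()),
        "changed": sorted(
            key
            for key in current.keys() & candidate.keys()
            if current[key] != candidate[key]
        ),
    }
```

With both trees downloaded, `diff_results(digest_tree(Path("current")), digest_tree(Path("candidate")))` gives a summary that can gate the rollout. An empty "removed" list and a reviewed "changed" list are reasonable things to check, though the right criteria depend on the workload.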

Compare, promote, and clean up

After all agents finish, collect results, keep the winner, and delete the rest.

1. Pull results from each fork. Since every fork is an S3 bucket and agents write to the same relative path, collecting outputs is a loop over bucket names.

export AWS_ENDPOINT_URL="https://t3.storage.dev"

for fork in eval-gpt4o eval-claude eval-llama; do
  aws s3 cp "s3://${fork}/results/scores.json" "./results/${fork}.json"
done

2. Snapshot the winner. This creates an immutable record of the experiment state that you can fork from later if you want to build on the result.

tigris snapshots take eval-claude "promoted-$(date +%s)"

3. Delete the losing forks. Deleting a fork reclaims only that fork's unique writes; the shared baseline is unaffected.

tigris rm -f eval-gpt4o
tigris rm -f eval-llama
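The steps above leave the comparison itself to you. As a minimal sketch, assume each downloaded scores.json is a JSON object with a single numeric field to rank on; the field name `accuracy` is an assumption, since the schema is whatever your agents write. The helper names are hypothetical too.

```python
import json
from pathlib import Path


def load_scores(results_dir: Path, metric: str = "accuracy") -> dict[str, float]:
    """Read every <fork>.json in results_dir and return {fork: metric value}."""
    scores = {}
    for path in sorted(results_dir.glob("*.json")):
        data = json.loads(path.read_text())
        scores[path.stem] = float(data[metric])
    return scores


def pick_winner(scores: dict[str, float]) -> tuple[str, list[str]]:
    """Return the highest-scoring fork and the rest; ties are broken by name."""
    if not scores:
        raise ValueError("no scores to compare")
    ranked = sorted(scores, key=lambda fork: (-scores[fork], fork))
    return ranked[0], ranked[1:]


# Placeholder values for illustration only, not real evaluation results.
example = {"eval-gpt4o": 0.81, "eval-claude": 0.86, "eval-llama": 0.74}
winner, losers = pick_winner(example)
print(winner, losers)  # prints: eval-claude ['eval-gpt4o', 'eval-llama']
```

In practice you would call `pick_winner(load_scores(Path("./results")))`, snapshot the returned winner, and delete the forks in the second list.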

Iterative refinement

An agent can fork its own fork to try a variation without losing intermediate state. If the variation doesn't work, delete it and try again from the same parent.

# Agent B has good results; try a refinement
tigris buckets create rag-chunk-512-v2 --fork-of rag-chunk-512

# If the refinement works, promote it
tigris snapshots take rag-chunk-512-v2 "promoted"

# If it doesn't, throw it away; the parent fork is unchanged
tigris rm -f rag-chunk-512-v2