Checkpoint, Restore, & Fork
Durable checkpoints with instant forking
Snapshot model state mid-training. Resume anywhere. Fork for experiments.
Long training jobs fail. Spot instances get preempted, nodes crash, and hyperparameters need revising. If you're not checkpointing to durable storage, you restart from scratch.
Writing checkpoints to a Tigris bucket with snapshots enabled gives you two things at once: resume from the latest checkpoint on any machine with no cross-region prefetch, and fork from a known-good checkpoint into parallel experiments instantly via copy-on-write.
Benefits
Resume on any machine
When a job dies or you migrate to cheaper hardware, the orchestrator restarts the job on a new machine and loads the latest checkpoint from the nearest replica, with no cross-region prefetch and no egress cost. The same checkpoint works whether you resume on Lambda, CoreWeave, or a hyperscaler.
Instant copy-on-write forks
When you want to branch from a known-good checkpoint to run parallel experiments, each fork gets its own copy-on-write view of the bucket instantly. Mutations in one fork never affect another, and the source checkpoint stays immutable. No data copying required.
Scoped credentials per job
Scope each training job's credentials with a fine-grained IAM policy: read-only to the dataset bucket, read-only to the base model, write-only to the output bucket, with optional time-window and IP restrictions.
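As a rough illustration of the scoping described above, the sketch below builds an AWS-style IAM policy document in Python. It is a sketch under assumptions, not a confirmed Tigris policy: the bucket names, the helper `build_job_policy`, the specific `s3:` actions, and the use of the `aws:CurrentTime` and `aws:SourceIp` condition keys are all illustrative. Check the Tigris IAM documentation for the actions and conditions it actually accepts.

```python
import json
from typing import Optional


def build_job_policy(
    dataset_bucket: str,
    base_model_bucket: str,
    output_bucket: str,
    not_after: Optional[str] = None,    # ISO 8601 timestamp, e.g. "2030-01-01T00:00:00Z"
    source_cidr: Optional[str] = None,  # e.g. "10.0.0.0/8"
) -> dict:
    """Build an AWS-style policy dict for one training job (hypothetical helper)."""

    def read_only(bucket: str) -> dict:
        # Read objects and list the bucket, nothing else.
        return {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:ListBucket"],
            "Resource": [f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"],
        }

    statements = [
        read_only(dataset_bucket),
        read_only(base_model_bucket),
        {
            # Write-only: the job can upload checkpoints but not read them back.
            "Effect": "Allow",
            "Action": ["s3:PutObject"],
            "Resource": [f"arn:aws:s3:::{output_bucket}/*"],
        },
    ]

    # Optional restrictions, assumed to follow AWS condition-key conventions.
    condition: dict = {}
    if not_after:
        condition["DateLessThan"] = {"aws:CurrentTime": not_after}
    if source_cidr:
        condition["IpAddress"] = {"aws:SourceIp": source_cidr}
    if condition:
        for statement in statements:
            statement["Condition"] = condition

    return {"Version": "2012-10-17", "Statement": statements}


if __name__ == "__main__":
    policy = build_job_policy(
        "my-dataset", "my-base-model", "my-checkpoints",
        not_after="2030-01-01T00:00:00Z",
        source_cidr="10.0.0.0/8",
    )
    print(json.dumps(policy, indent=2))
```

Generating the policy per job, rather than sharing one long-lived key, means a leaked credential exposes only that job's output path and only for that job's time window.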
How it works
Create the bucket with the X-Tigris-Enable-Snapshot: true header. Snapshots must be enabled at creation time and cannot be turned on afterward. Have your training loop write checkpoints on a fixed cadence, such as every N steps or at epoch boundaries.
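The fixed-cadence idea can be sketched as a small training loop. This is a minimal, storage-agnostic sketch: `train_step` and `save` are hypothetical callables you would supply (for example, `save` could upload to the bucket through any S3-compatible client), and the key layout `run_id/step-000000004.ckpt` is an assumed convention, not something Tigris requires. Zero-padding the step number makes lexicographic key order match step order, so the latest checkpoint within a run is simply the largest key.

```python
from typing import Callable, Iterable, List, Optional


def checkpoint_key(run_id: str, step: int) -> str:
    """Object key for a checkpoint; zero-padded so keys sort by step."""
    return f"{run_id}/step-{step:09d}.ckpt"


def should_checkpoint(step: int, every_n_steps: int) -> bool:
    """True on every N-th step (never on step 0)."""
    return step > 0 and step % every_n_steps == 0


def latest_checkpoint(keys: Iterable[str]) -> Optional[str]:
    """Pick the newest checkpoint key for a single run, or None if there is none."""
    ckpts = [k for k in keys if k.endswith(".ckpt")]
    return max(ckpts) if ckpts else None


def train(
    run_id: str,
    total_steps: int,
    every_n_steps: int,
    train_step: Callable[[int], None],
    save: Callable[[str, int], None],
    start_step: int = 0,
) -> List[str]:
    """Run steps start_step+1..total_steps, saving a checkpoint every N steps."""
    written: List[str] = []
    for step in range(start_step + 1, total_steps + 1):
        train_step(step)
        if should_checkpoint(step, every_n_steps):
            key = checkpoint_key(run_id, step)
            save(key, step)  # hypothetical: serialize state and upload under `key`
            written.append(key)
    return written


if __name__ == "__main__":
    store = {}
    keys = train("run-a", 10, 4, lambda s: None, lambda k, s: store.__setitem__(k, s))
    print(keys)
    print(latest_checkpoint(store))
```

To resume, a new job would list the run's keys, call `latest_checkpoint`, load that object, and pass its step as `start_step`.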
After each checkpoint write, take a snapshot and store its version ID alongside the run metadata in your experiment tracker. To resume, pass that version ID to the new job. To sweep hyperparameters, fork the snapshot once per configuration and let each fork write independently.
# Take a snapshot after checkpoint write
tigris snapshots take my-checkpoints
# Fork for parallel experiments
tigris forks create my-checkpoints --name experiment-lr-1e-4
tigris forks create my-checkpoints --name experiment-lr-3e-5
tigris forks create my-checkpoints --name experiment-lr-1e-3
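For larger sweeps, you may prefer to generate the fork commands from a list of configurations instead of typing them out. The sketch below builds the same `tigris forks create` invocations shown above and a small run record to hand to your experiment tracker. `RunRecord`, its fields, and the placeholder version ID are hypothetical. The commands mirror the ones above exactly, so the sketch does not show how to pin a fork to a specific snapshot version.

```python
import shlex
from dataclasses import asdict, dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class RunRecord:
    """Metadata to store in an experiment tracker (hypothetical schema)."""
    run_id: str
    bucket: str
    snapshot_version: str  # version ID reported when the snapshot was taken
    step: int


def fork_name(learning_rate: str) -> str:
    """Fork name for one configuration, matching the naming used above."""
    return f"experiment-lr-{learning_rate}"


def fork_commands(bucket: str, learning_rates: Sequence[str]) -> List[List[str]]:
    """One `tigris forks create` argv list per learning rate."""
    return [
        ["tigris", "forks", "create", bucket, "--name", fork_name(lr)]
        for lr in learning_rates
    ]


if __name__ == "__main__":
    # "<snapshot-version-id>" is a placeholder, not a real version ID.
    record = RunRecord("run-a", "my-checkpoints", "<snapshot-version-id>", 8000)
    print(asdict(record))
    for cmd in fork_commands(record.bucket, ["1e-4", "3e-5", "1e-3"]):
        # Print rather than execute; pass `cmd` to subprocess.run to run it.
        print(shlex.join(cmd))
```

Learning rates are kept as strings so the fork names match the values exactly as you wrote them, without float formatting surprises.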