
Checkpoint, Restore, & Fork

Durable checkpoints with instant forking

Snapshot model state mid-training. Resume anywhere. Fork for experiments.

Long training jobs fail. Spot instances get preempted, nodes crash, and hyperparameters need revising. If you're not checkpointing to durable storage, you restart from scratch.

Writing checkpoints to a Tigris bucket with snapshots enabled gives you two things at once: resume from the latest checkpoint on any machine with no cross-region prefetch, and fork from a known-good checkpoint into parallel experiments instantly via copy-on-write.

Snapshots and forks →

Diagram: a training job saves checkpoints to a snapshot-enabled bucket. An immutable snapshot at epoch N serves both a resumed job and copy-on-write forks (Fork A, lr=1e-4; Fork B, lr=3e-5): no data is copied, and each fork writes independently.

Benefits

Resume on any machine

When a job dies or you migrate to cheaper hardware, the orchestrator restarts on a new machine and loads the latest checkpoint from the nearest replica — no cross-region prefetch, no egress cost. The same checkpoint works whether you resume on Lambda, CoreWeave, or a hyperscaler.

Instant copy-on-write forks

When you want to branch from a known-good checkpoint to run parallel experiments, each fork gets its own copy-on-write view of the bucket instantly. Mutations in one fork never affect another, and the source checkpoint stays immutable. No data copying required.

Scoped credentials per job

Scope each training job's credentials with a fine-grained IAM policy: read-only to the dataset bucket, read-only to the base model, write-only to the output bucket, with optional time-window and IP restrictions.
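As a rough sketch, a policy like this could be expressed in AWS-style IAM policy JSON (Tigris is S3-compatible; the bucket names and exact action set here are illustrative, and time-window or IP restrictions would go in a `Condition` block):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::training-data",
        "arn:aws:s3:::training-data/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject"],
      "Resource": ["arn:aws:s3:::my-checkpoints/*"]
    }
  ]
}
```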

How it works

Create the bucket with X-Tigris-Enable-Snapshot: true. This must be set at creation time and cannot be changed afterward. Have your training loop write checkpoints on a fixed cadence: every N steps or at epoch boundaries.
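A minimal sketch of that cadence, assuming an S3-compatible client (for example, boto3 pointed at the Tigris endpoint). `CHECKPOINT_EVERY`, `checkpoint_key`, and the injected `upload` callable are illustrative names, not Tigris APIs:

```python
CHECKPOINT_EVERY = 500  # steps between checkpoints (illustrative cadence)

def checkpoint_key(run_id: str, step: int) -> str:
    # Zero-padded step numbers keep keys lexicographically ordered,
    # so the latest checkpoint is simply the last key when listing.
    return f"runs/{run_id}/ckpt-{step:09d}.pt"

def maybe_checkpoint(step, run_id, state_bytes, upload):
    """Upload serialized model state every CHECKPOINT_EVERY steps.

    `upload(key, body)` stands in for the actual client call,
    e.g. s3.put_object(Bucket="my-checkpoints", Key=key, Body=body).
    """
    if step == 0 or step % CHECKPOINT_EVERY != 0:
        return None
    key = checkpoint_key(run_id, step)
    upload(key, state_bytes)
    return key
```

Injecting the upload callable keeps the cadence logic testable without a live bucket; in the real loop you would pass a closure over your S3 client.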

Store the snapshot version ID alongside run metadata in your experiment tracker. To resume, pass the version to the new job. To sweep hyperparameters, fork the snapshot once per configuration and let each fork write independently.

# Take a snapshot after checkpoint write
tigris snapshots take my-checkpoints

# Fork for parallel experiments
tigris forks create my-checkpoints --name experiment-lr-1e-4
tigris forks create my-checkpoints --name experiment-lr-3e-5
tigris forks create my-checkpoints --name experiment-lr-1e-3
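A sweep driver along those lines might record the snapshot version next to each run's hyperparameters and derive a fork name per configuration. This is a sketch: the plain dict stands in for your experiment tracker, and `fork_name` merely mirrors the CLI fork names above:

```python
def fork_name(prefix: str, lr: float) -> str:
    # e.g. fork_name("experiment", 1e-4) -> "experiment-lr-1e-04"
    return f"{prefix}-lr-{lr:.0e}"

def record_run(tracker: dict, run_id: str, snapshot_version: str, config: dict) -> None:
    """Store the snapshot version alongside hyperparameters for this run."""
    tracker[run_id] = {"snapshot": snapshot_version, **config}

def resume_snapshot(tracker: dict, run_id: str) -> str:
    """Look up which snapshot version a resumed job should load from."""
    return tracker[run_id]["snapshot"]

tracker = {}
for lr in (1e-4, 3e-5, 1e-3):
    run_id = fork_name("experiment", lr)
    # In practice: create the fork via the CLI or API, then record it.
    record_run(tracker, run_id, "v-abc123", {"lr": lr})
```

The snapshot version ID (`"v-abc123"` here is a placeholder) is whatever your snapshot step returned; storing it in the tracker is what lets a resumed job or a new fork find the exact bucket state to start from.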