Store and Serve Models
Accelerated access: Tigris + TAG
Serve models faster. Without changing your code.
Serving ML models at scale means loading large weight files quickly, repeatedly, and from wherever your GPUs happen to be. The bottleneck is almost always the same: getting gigabytes of model data from storage to GPU memory as fast as possible.
TAG is a high-performance S3-compatible caching proxy purpose-built for ML workloads. It sits between your inference servers and Tigris, caches model weights on local NVMe/SSD, and serves subsequent reads at near-local-disk speed. Your framework doesn't know TAG exists — it just sees a faster S3 endpoint.
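Because TAG speaks the S3 API, adopting it is just an endpoint change. A minimal sketch, assuming placeholder addresses (substitute your own Tigris endpoint and wherever your TAG deployment listens) and a client that honors the standard `AWS_ENDPOINT_URL_S3` variable, as the AWS CLI v2 and recent botocore do:

```python
import os

# Both addresses are placeholders -- use your own Tigris endpoint and
# the address your TAG deployment listens on.
DIRECT_ENDPOINT = "https://example-tigris-endpoint.dev"
TAG_ENDPOINT = "http://tag.model-cache.svc.cluster.local:8080"

def s3_env(endpoint: str) -> dict:
    """Environment variables that standard S3 clients (AWS CLI v2,
    recent boto3/botocore) read to pick their endpoint."""
    return {"AWS_ENDPOINT_URL_S3": endpoint}

# Direct access today...
os.environ.update(s3_env(DIRECT_ENDPOINT))
# ...and the one-line swap to TAG later; no application code changes.
os.environ.update(s3_env(TAG_ENDPOINT))
```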
Benefits
Cold start elimination
When you deploy a new inference pod, it typically downloads the full model from object storage before it can serve requests — minutes of GPU idle time for large models. With TAG deployed as a sidecar or node-level cache, the model weights are already on local disk after the first pod fetches them. Subsequent pods on the same node get cache hits and start immediately.
Request coalescing for simultaneous pod scaling
When you scale from 1 to 10 inference pods at once, all 10 would normally send identical requests for the same model. TAG's request coalescing means only one upstream request goes to Tigris — the other 9 get the data streamed from the single in-flight request. This is especially valuable for large model checkpoints.
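The effect can be illustrated with a single-flight sketch: concurrent readers of the same key share one in-flight upstream fetch. This shows the idea only, not TAG's actual implementation; `fetch` stands in for the upstream request to Tigris.

```python
import threading
from concurrent.futures import Future

class Coalescer:
    """Single-flight sketch: concurrent requests for the same key
    share one upstream fetch. Illustrative only, not TAG's code."""

    def __init__(self, fetch):
        self._fetch = fetch          # upstream fetch: key -> bytes
        self._lock = threading.Lock()
        self._inflight = {}          # key -> Future shared by waiters

    def get(self, key):
        with self._lock:
            fut = self._inflight.get(key)
            leader = fut is None
            if leader:
                fut = Future()
                self._inflight[key] = fut
        if leader:
            # Only the first caller hits upstream; everyone else
            # blocks on the shared Future below.
            try:
                fut.set_result(self._fetch(key))
            except Exception as exc:
                fut.set_exception(exc)
            finally:
                with self._lock:
                    del self._inflight[key]
        return fut.result()
```

Scaling 1 → 10 pods behind such a layer turns ten identical downloads into one.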
Range request optimization
ML frameworks (PyTorch, HuggingFace safetensors, etc.) often load models using range requests — fetching specific tensor shards rather than the whole file. TAG detects these and triggers a background full-object fetch while serving the range, so subsequent range requests hit the local cache instead of roundtripping to Tigris each time.
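To see why these range requests arise, consider the safetensors layout: an 8-byte little-endian header length, a JSON index mapping tensor names to byte offsets, then raw tensor data. A loader can fetch the small index first, then issue one range request per tensor it needs. A minimal sketch of that layout and the resulting byte ranges (illustrative only, not TAG or safetensors library code):

```python
import json
import struct

def build_safetensors(tensors: dict) -> bytes:
    """Assemble a tiny safetensors-style blob: 8-byte LE header length,
    JSON index with per-tensor data_offsets, then the raw bytes."""
    index, data, offset = {}, b"", 0
    for name, raw in tensors.items():
        index[name] = {"dtype": "U8", "shape": [len(raw)],
                       "data_offsets": [offset, offset + len(raw)]}
        data += raw
        offset += len(raw)
    header = json.dumps(index).encode()
    return struct.pack("<Q", len(header)) + header + data

def tensor_byte_range(blob: bytes, name: str) -> tuple:
    """Inclusive byte range a 'Range: bytes=start-end' request would
    use to fetch just one tensor's data from the stored object."""
    (hlen,) = struct.unpack("<Q", blob[:8])
    index = json.loads(blob[8:8 + hlen])
    start, end = index[name]["data_offsets"]
    return 8 + hlen + start, 8 + hlen + end - 1
```

Each such narrow read is what TAG intercepts, answering the range immediately while pulling the whole object into its cache in the background.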
Multi-node inference clusters
TAG deploys as a Kubernetes StatefulSet with gossip-based cluster discovery. Nodes share cache metadata, so if node A already cached a model, node B knows about it and can forward requests via gRPC. This avoids redundant downloads across your fleet.
Read-only credential separation
TAG only needs read-only Tigris credentials for its own cache operations. Your inference servers pass their own credentials through transparently via SigV4 re-signing. This fits a typical pattern where model weights are stored in a shared read-only bucket.
Direct access: Tigris as your model store
Point your inference framework directly at Tigris. Any framework that loads models from S3 works out of the box — no code changes, no custom integrations.
Upload weights once to a global bucket, and inference nodes read from the nearest replica automatically. For frameworks that expect file paths, mount the bucket with TigrisFS.
Benefits
Zero egress fees
Loading the same 70 GB model across 10 replicas costs nothing in transfer fees. Tigris doesn't charge for egress, so scaling your inference fleet doesn't scale your storage bill.
Global low-latency reads
A global bucket automatically serves weights from the nearest replica. No per-region buckets to manage, no sync jobs to maintain.
Version-aware deploys
Write weights to a versioned key like models/{model}/{run_id}/weights.safetensors. Conditional GetObject calls with If-None-Match let nodes skip the download entirely when they already have the current version — useful for rolling deploys where most nodes are already warm.
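The deploy-time check can be sketched as follows. Here `get_object` is a hypothetical stand-in for an S3 GetObject call that supports If-None-Match; real clients surface the 304 differently (some raise an exception rather than returning a status), so treat the shape below as an assumption:

```python
def weights_key(model: str, run_id: str) -> str:
    """Versioned key layout described above."""
    return f"models/{model}/{run_id}/weights.safetensors"

def fetch_if_changed(get_object, key: str, cached_etag: str):
    """Conditional download: skip the transfer when the cached version
    is still current. `get_object` is a hypothetical callable with
    signature (key, if_none_match) -> (status, etag, body)."""
    status, etag, body = get_object(key, if_none_match=cached_etag)
    if status == 304:  # Not Modified: this node is already warm
        return None, cached_etag
    return body, etag
```

A warm node pays one cheap metadata round trip instead of re-downloading gigabytes of weights.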
Which approach?
Both patterns store your models durably in Tigris — TAG is purely a read acceleration layer. Start with direct access, then add TAG when load times become a bottleneck. No application code changes either way.
| | Direct access | With TAG |
|---|---|---|
| Setup | Set endpoint URL | Run TAG alongside your stack |
| Cold starts | Network speed to Tigris | First load at network speed, subsequent loads near-instant |
| Best for | Small fleets, infrequent restarts | Large fleets, frequent scaling, serverless |
| Model swaps | Full download each time | Instant if cached |
| Code changes | None | None |