Store and Serve Models

Accelerated access: Tigris + TAG

Serve models faster. Without changing your code.

Serving ML models at scale means loading large weight files quickly, repeatedly, and from wherever your GPUs happen to be. The bottleneck is almost always the same: getting gigabytes of model data from storage to GPU memory as fast as possible.

TAG is a high-performance S3-compatible caching proxy purpose-built for ML workloads. It sits between your inference servers and Tigris, caches model weights on local NVMe/SSD, and serves subsequent reads at near-local-disk speed. Your framework doesn't know TAG exists — it just sees a faster S3 endpoint.
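
Because TAG speaks the S3 wire protocol, adopting it can be as small as repointing the endpoint your SDK already reads. A minimal sketch, assuming TAG listens at tag.internal:8080 (a hypothetical address; AWS_ENDPOINT_URL_S3 is the standard endpoint-override variable that AWS SDKs, and frameworks built on them, honor):

```python
import os

# Hypothetical address of your TAG deployment -- use wherever the proxy listens.
TAG_ENDPOINT = "http://tag.internal:8080"

def s3_env(endpoint: str = TAG_ENDPOINT) -> dict:
    """Environment for an inference process that should read via TAG.

    Setting AWS_ENDPOINT_URL_S3 redirects every S3 call the process makes,
    so no application code changes are required.
    """
    return {**os.environ, "AWS_ENDPOINT_URL_S3": endpoint}
```

Launch your inference server with this environment and it reads through TAG; remove the variable and it talks to Tigris directly again.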

Get started with TAG →

[Diagram: Your cluster, any cloud. Inference pods load SDXL, LLaMA, and Qwen through TAG, which fetches each model from Tigris only once and caches it on local NVMe; Tigris remains the durable store for the full catalog (SDXL, LLaMA, Qwen, Mistral, Gemma, DeepSeek). The first fetch comes from Tigris; every subsequent read comes from local NVMe at disk speed.]

Benefits

Cold start elimination

When you deploy a new inference pod, it typically downloads the full model from object storage before it can serve requests — minutes of GPU idle time for large models. With TAG deployed as a sidecar or node-level cache, the model weights are already on local disk after the first pod fetches them. Subsequent pods on the same node get cache hits and start immediately.

Request coalescing for simultaneous pod scaling

When you scale from 1 to 10 inference pods at once, all 10 would normally send identical requests for the same model. TAG's request coalescing means only one upstream request goes to Tigris — the other 9 get the data streamed from the single in-flight request. This is especially valuable for large model checkpoints.
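
The coalescing pattern itself (sometimes called singleflight) is easy to sketch in isolation. The snippet below is an illustration of the idea, not TAG's actual implementation: the first caller for a key becomes the leader and performs the one upstream fetch, while concurrent callers wait on the in-flight request and share its result.

```python
import threading
from typing import Callable, Dict

class Singleflight:
    """Request coalescing sketch: concurrent callers for the same key
    share a single upstream fetch instead of each hitting storage."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._inflight: Dict[str, threading.Event] = {}
        self._results: Dict[str, bytes] = {}

    def get(self, key: str, fetch: Callable[[str], bytes]) -> bytes:
        with self._lock:
            if key in self._results:          # already cached locally
                return self._results[key]
            ev = self._inflight.get(key)
            leader = ev is None
            if leader:                        # first caller does the fetch
                ev = threading.Event()
                self._inflight[key] = ev
        if leader:
            data = fetch(key)                 # the single upstream request
            with self._lock:
                self._results[key] = data
                del self._inflight[key]
            ev.set()                          # wake the waiting followers
        else:
            ev.wait()                         # followers reuse the result
        return self._results[key]
```

Ten pods asking for the same checkpoint through this path produce exactly one upstream download, which is the behavior described above.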

Range request optimization

ML frameworks (PyTorch, HuggingFace safetensors, etc.) often load models using range requests — fetching specific tensor shards rather than the whole file. TAG detects these and triggers a background full-object fetch while serving the range, so subsequent range requests hit the local cache instead of roundtripping to Tigris each time.
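
To see why those range reads happen, consider the safetensors layout: the file starts with an 8-byte little-endian header length, then a JSON header mapping tensor names to byte offsets, so a framework can fetch just the leading bytes and then request exactly the tensors it needs. The parser and Range-header helper below are illustrative, not TAG internals:

```python
import json
import struct

def parse_safetensors_header(blob: bytes) -> dict:
    """Decode the JSON header at the front of a safetensors file.

    Bytes 0..8 hold the header length as a little-endian u64; the JSON
    header follows immediately and maps tensor names to dtypes, shapes,
    and data offsets.
    """
    (n,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8 : 8 + n].decode("utf-8"))

def tensor_range(header: dict, header_len: int, name: str) -> str:
    """HTTP Range header selecting one tensor's bytes within the file."""
    start, end = header[name]["data_offsets"]  # relative to the data section
    base = 8 + header_len                      # data section starts after header
    return f"bytes={base + start}-{base + end - 1}"
```

Each such ranged GetObject would normally round-trip to storage; TAG serves the range while pulling the full object into cache in the background.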

Multi-node inference clusters

TAG deploys as a Kubernetes StatefulSet with gossip-based cluster discovery. Nodes share cache metadata, so if node A already cached a model, node B knows about it and can forward requests via gRPC. This avoids redundant downloads across your fleet.
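
As a rough illustration of the routing decision (not TAG's actual code), a node consults the gossiped cache metadata before going upstream: serve locally if cached, forward to a peer that has the model, and only fall through to Tigris on a fleet-wide miss. All names here are hypothetical.

```python
from typing import Dict, Set

def route_request(model: str, local_node: str,
                  metadata: Dict[str, Set[str]]) -> str:
    """Decide how to satisfy a model request using shared cache metadata.

    metadata maps node name -> set of models that node has cached,
    as learned via gossip.
    """
    if model in metadata.get(local_node, set()):
        return f"serve {model} from local cache on {local_node}"
    for node, cached in sorted(metadata.items()):   # deterministic peer pick
        if node != local_node and model in cached:
            return f"forward to {node} via gRPC"
    return "cache miss: fetch from Tigris"
```
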

[Diagram: TAG cluster. Inference pods request the LLaMA model from their local TAG node; node A holds it in cache, and nodes B and C learn this via gossip, so they forward requests to node A over gRPC and serve from its cache. Only a fleet-wide cache miss falls through to Tigris. Nodes share metadata to avoid redundant downloads.]

Read-only credential separation

TAG only needs read-only Tigris credentials for its own cache operations. Your inference servers pass their own credentials through transparently via SigV4 re-signing. This fits a typical pattern where model weights are stored in a shared read-only bucket.

Direct access: Tigris as your model store

Point your inference framework directly at Tigris. Any framework that loads models from S3 works out of the box — no code changes, no custom integrations.

Upload weights once to a global bucket and inference nodes read from the nearest replica automatically. For frameworks that expect file paths, mount with TigrisFS.

Get started →

[Diagram: Direct access. Client requests hit the inference server, which issues GetObject calls straight to the Tigris bucket; each replica downloads independently, with zero egress fees and global reads.]

Benefits

Zero egress fees

Loading the same 70 GB model across 10 replicas costs nothing in transfer fees. Tigris doesn't charge for egress, so scaling your inference fleet doesn't scale your storage bill.

Global low-latency reads

A global bucket automatically serves weights from the nearest replica. No per-region buckets to manage, no sync jobs to maintain.

Version-aware deploys

Write weights to a versioned key like models/{model}/{run_id}/weights.safetensors. Conditional GetObject calls with If-None-Match let nodes skip the download entirely when they already have the current version — useful for rolling deploys where most nodes are already warm.

Which approach?

Both patterns store your models durably in Tigris — TAG is purely a read acceleration layer. Start with direct access, then add TAG when load times become a bottleneck. No application code changes either way.

|              | Direct access                     | With TAG                                   |
| ------------ | --------------------------------- | ------------------------------------------ |
| Setup        | Set endpoint URL                  | Run TAG alongside your stack               |
| Cold starts  | Network speed to Tigris           | First is the same, subsequent near-instant |
| Best for     | Small fleets, infrequent restarts | Large fleets, frequent scaling, serverless |
| Model swaps  | Full download each time           | Instant if cached                          |
| Code changes | None                              | None                                       |