TAG Deployment Guide

Ready to run TAG in production? This guide walks you through sizing, choosing between single-node and cluster topologies, and setting up monitoring and alerting. If you just want to try TAG out first, start with the Quick Start.

Configuration

TAG is configured through environment variables, a YAML config file, or both. Command-line flags take precedence over environment variables, which take precedence over config file values.
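The precedence order can be sketched as a small resolver (a minimal illustration; `resolve` and its arguments are hypothetical, not TAG's actual internals):

```python
import os

def resolve(flag_value, env_var, file_value, default=None):
    """Return the effective setting: flag > environment variable > config file."""
    if flag_value is not None:
        return flag_value
    env = os.environ.get(env_var)
    if env is not None:
        return env
    return file_value if file_value is not None else default

# Config file says "info", the environment says "warn", the flag says "debug".
os.environ["TAG_LOG_LEVEL"] = "warn"
print(resolve("debug", "TAG_LOG_LEVEL", "info"))  # -> debug (flag wins)
print(resolve(None, "TAG_LOG_LEVEL", "info"))     # -> warn (env beats file)
```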

TAG requires its own Tigris credentials (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY) with read-only access to all buckets. Client applications authenticate separately with their own credentials.

For the full list of environment variables, YAML config options, CLI flags, and example configurations, see the Configuration Reference.

Production Deployment: Single Node

Start here if your working set fits on one machine's storage and a single node's network bandwidth can handle your read throughput. Most workloads start with a single node and scale out only when needed.

Sizing Guidelines

TAG is typically NVMe-bound for large objects and CPU-bound for small objects. Benchmark reference points (single node, cache-warm):

Object Size   Ops/sec   Bandwidth
1 KiB         ~75,000   ~74 MiB/s
100 KiB       ~33,000   ~3.2 GiB/s
1 MiB         ~11,000   ~10.7 GiB/s
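The bandwidth column is simply ops/sec multiplied by object size, which makes a quick sanity check easy:

```python
KIB, MIB, GIB = 1024, 1024**2, 1024**3

# ops/sec x object size ~= bandwidth, matching the table above
print(75_000 * 1 * KIB / MIB)    # ~73 MiB/s for 1 KiB objects
print(33_000 * 100 * KIB / GIB)  # ~3.1 GiB/s for 100 KiB objects
print(11_000 * 1 * MIB / GIB)    # ~10.7 GiB/s for 1 MiB objects
```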

CPU utilization at peak throughput is around 12%, so a modest machine (4-8 cores) is sufficient for most workloads. Memory is used primarily by the RocksDB block cache and in-flight request buffers; 4-8 GiB is a reasonable starting point. NVMe storage is strongly recommended.

For full benchmark methodology, thread scaling, and environment details, see Benchmarks.

Deployment

For step-by-step deployment instructions, see the platform-specific deployment guides.

For production, run TAG under a process supervisor (systemd, supervisord, etc.) to handle restarts.
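For example, a minimal systemd unit might look like the following (the binary path, `--config` flag, and file locations are illustrative; adjust them to your install):

```ini
[Unit]
Description=TAG caching gateway
After=network-online.target
Wants=network-online.target

[Service]
# Credentials (AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY) live in the env file
EnvironmentFile=/etc/tag/tag.env
ExecStart=/usr/local/bin/tag --config /etc/tag/config.yaml
Restart=always
RestartSec=2

[Install]
WantedBy=multi-user.target
```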

Example Production Config (Single Node)

server:
  http_port: 8080
  bind_ip: "0.0.0.0"

upstream:
  endpoint: "https://t3.storage.dev"
  max_idle_conns_per_host: 100

cache:
  enabled: true
  ttl: 60m
  size_threshold: 1073741824 # 1 GiB
  disk_path: "/var/cache/tag"
  max_disk_usage_bytes: 107374182400 # 100 GiB
  node_id: "tag-prod-1"

log:
  level: "info"
  format: "json"

Production Deployment: Multi-Node Cluster

When you outgrow a single node — either you need more cache capacity or higher aggregate throughput — you can deploy a multi-node cluster. TAG nodes form the cluster automatically: each node owns a subset of cache keys via consistent hashing, and requests for remote keys are forwarded over gRPC without any manual routing on your part.

How Clustering Works

Discovery: Nodes find each other via the memberlist gossip protocol (port 7000). Configure one or more seed nodes; a new node joins by contacting any seed.

Key routing: Each cache key is hashed to determine its owner node. GET requests check local cache first; if the key belongs to a remote node, the request is forwarded through gRPC (port 9000).

Consistency: Write-through invalidation and tombstone markers ensure cache coherence. Tombstones prevent in-flight background writes from resurrecting deleted objects.
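The key-routing step above can be sketched with a toy consistent-hash ring (illustrative only; TAG's actual hash function and virtual-node count are internal details):

```python
import hashlib
from bisect import bisect_right

class HashRing:
    """Toy consistent-hash ring: each key maps to the first node clockwise."""

    def __init__(self, nodes, vnodes=64):
        # Each node gets many virtual points so keys spread evenly.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes)
        )

    @staticmethod
    def _hash(s):
        return int(hashlib.sha256(s.encode()).hexdigest(), 16)

    def owner(self, key):
        idx = bisect_right(self.ring, (self._hash(key), "")) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["tag-0", "tag-1", "tag-2"])
print(ring.owner("my-bucket/photos/cat.jpg"))
# If the owner is a remote node, the request is forwarded to it over gRPC.
```

Because only the departing node's points leave the ring, removing a node reassigns only the keys it owned; keys owned by surviving nodes stay put.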

Deployment

For step-by-step cluster deployment instructions:

  • Docker Cluster — 3-node cluster via Docker Compose
  • Kubernetes — StatefulSet with autoscaling (recommended for production clusters)

TLS Configuration

TAG supports HTTPS with TLS certificates for encrypted client connections. Both a certificate file and private key file must be provided together.

For setup instructions covering self-signed certificates, Docker, Kubernetes, and native binary deployments, see TLS/HTTPS.

Monitoring

TAG exposes Prometheus metrics at GET /metrics in Prometheus exposition format. For the complete metrics reference, PromQL examples, and scrape configuration, see Metrics Reference.

Key Metrics to Alert On

Error rate:

rate(tag_requests_total{status="error"}[5m])
/ rate(tag_requests_total[5m])

Alert if error rate exceeds 1% sustained over 5 minutes.
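As a starting point, this expression can be wired into a Prometheus alerting rule (the rule name, labels, and thresholds here are illustrative; the metric names come from the Metrics Reference):

```yaml
groups:
  - name: tag-alerts
    rules:
      - alert: TagHighErrorRate
        expr: |
          rate(tag_requests_total{status="error"}[5m])
            / rate(tag_requests_total[5m]) > 0.01
        for: 5m
        labels:
          severity: page
        annotations:
          summary: "TAG error rate above 1% for 5 minutes"
```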

Cache hit ratio:

rate(tag_cache_hits_total[5m])
/ (rate(tag_cache_hits_total[5m]) + rate(tag_cache_misses_total[5m]))

A healthy hit ratio depends on your workload. For read-heavy workloads with a bounded working set, expect 80%+ after warmup.

Upstream latency:

histogram_quantile(0.99, rate(tag_upstream_request_duration_seconds_bucket[5m]))

Alert if p99 upstream latency exceeds your SLO; sustained elevation may indicate Tigris connectivity issues.

Authentication failures:

rate(tag_auth_failures_total[5m])

Spikes indicate credential misconfiguration or unauthorized access attempts.

Health Check

GET /health returns 200 OK when TAG is ready to accept requests. Use this for load balancer health checks, container orchestrator probes, and uptime monitoring.
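In Kubernetes, the same endpoint works for both probes; a typical pod spec fragment might look like this (port and timings are illustrative):

```yaml
readinessProbe:
  httpGet:
    path: /health
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10
livenessProbe:
  httpGet:
    path: /health
    port: 8080
  periodSeconds: 30
```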

Troubleshooting

TAG Won't Start

"missing AWS credentials" — Set both AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY. These are TAG's own credentials, not your application's. In Kubernetes, verify the secret exists: kubectl get secret -n tag tag-credentials

"invalid upstream endpoint" — TAG only allows connections to localhost, *.tigris.dev, or *.storage.dev. Check TAG_UPSTREAM_ENDPOINT.

"TLS certificate or key file not found" — If either TAG_TLS_CERT_FILE or TAG_TLS_KEY_FILE is set, both must point to valid files.

Connection refused — Verify TAG is running: curl http://localhost:8080/health

Cache Not Working

All responses show X-Cache: MISS — Check that caching is enabled (TAG_CACHE_DISABLED is not true) and that the cache directory is writable. Set TAG_LOG_LEVEL=debug and look for cache write errors. In Kubernetes, check logs with kubectl logs -n tag tag-0 and verify the cache PVC is bound with kubectl get pvc -n tag.

Objects not being cached — Objects must return HTTP 200 and be within the size threshold (default 1 GiB). Objects with Cache-Control: no-store or private are not cached.

Authentication Errors

403 on first request — Verify your client credentials are valid for the requested bucket on Tigris and belong to the same Tigris organization as TAG's credentials. TAG forwards the first request to Tigris, which performs authentication.

403 after credential rotation — TAG caches derived signing keys for up to 48 hours and authorization decisions for 10 minutes. After rotating credentials, either wait for TTL expiry or restart TAG to clear caches.

Client Errors

405 on bucket creation — You're using virtual-hosted-style addressing. TAG requires path-style. Set addressing_style: 'path' in your S3 client config.

Timeout on large files — Increase client-side timeouts. For example, in boto3:

import boto3
from botocore.config import Config

config = Config(
    connect_timeout=30,               # seconds to establish a connection
    read_timeout=300,                 # seconds to wait between reads on large downloads
    s3={'addressing_style': 'path'},  # TAG requires path-style addressing
)
# endpoint_url is an example; point it at your TAG instance
s3 = boto3.client('s3', endpoint_url='http://localhost:8080', config=config)

Cluster Issues

Nodes not discovering each other — Verify seed nodes are reachable on port 7000 (gossip). In Kubernetes, ensure the headless service resolves correctly:

nslookup tag-headless.tag.svc.cluster.local

gRPC routing failures — Verify port 9000 is open between nodes. Check that TAG_CACHE_ADVERTISE_ADDR is set to an address reachable by other nodes (not localhost).

High Latency

High p99 latency — Check tag_upstream_request_duration_seconds to determine whether latency comes from Tigris or TAG. High request coalescing (tag_broadcast_shared_total) is normal and reduces upstream load. High tag_broadcast_slow_consumers_total indicates clients are reading too slowly. In Kubernetes, also check disk I/O performance on the storage class.

Debug Mode

Set TAG_LOG_LEVEL=debug for detailed request-level logging. This is verbose; use it only during active debugging.

In Kubernetes, update the StatefulSet:

env:
  - name: TAG_LOG_LEVEL
    value: "debug"