Bundle API

The Bundle API lets you fetch multiple objects from a bucket as a streaming tar archive in a single HTTP request. Instead of making one request per object, you send a list of keys and receive a tar stream — assembled on the fly with no server-side buffering.

This is designed for ML training workloads where dataloaders need to fetch thousands of images or samples per batch. The Bundle API eliminates per-object HTTP overhead and removes the need to pre-materialize shard files (tarballs, parquet files, etc.).

SDK examples

Install the Tigris boto3 extension:

pip install tigris-boto3-ext

Basic usage

import tarfile
from tigris_boto3_ext import bundle_objects

response = bundle_objects(s3_client, "my-bucket", [
    "dataset/train/img_001.jpg",
    "dataset/train/img_002.jpg",
])

with tarfile.open(fileobj=response, mode="r|") as tar:
    for member in tar:
        if member.name == "__bundle_errors.json":
            continue
        f = tar.extractfile(member)
        if f is not None:
            image_bytes = f.read()

bundle_objects returns a BundleResponse that works as a context manager for automatic connection cleanup:

with bundle_objects(s3_client, "my-bucket", keys) as response:
    with tarfile.open(fileobj=response, mode="r|") as tar:
        for member in tar:
            if member.name == "__bundle_errors.json":
                continue
            f = tar.extractfile(member)
            if f is not None:
                image_bytes = f.read()

Error handling

By default, missing objects are silently skipped and listed in a __bundle_errors.json entry at the end of the archive. Set on_error=BUNDLE_ON_ERROR_FAIL to raise an error when any key is missing:

from tigris_boto3_ext import bundle_objects, BundleError, BUNDLE_ON_ERROR_FAIL

try:
    response = bundle_objects(
        s3_client, "my-bucket", keys, on_error=BUNDLE_ON_ERROR_FAIL
    )
except BundleError as e:
    print(f"Bundle failed (HTTP {e.status_code}): {e.body}")

Response metadata

After consuming the tar stream, BundleResponse exposes metadata about the bundle:

response = bundle_objects(s3_client, "my-bucket", keys)

with tarfile.open(fileobj=response, mode="r|") as tar:
    for member in tar:
        pass  # consume the stream

print(response.object_count)   # number of objects in the bundle
print(response.bundle_bytes)   # total bytes streamed
print(response.skipped_count)  # number of skipped keys (skip mode)

PyTorch DataLoader integration

The Bundle API integrates naturally with PyTorch dataloaders. Instead of fetching one image per __getitem__ call, fetch a batch at a time:

import random
import tarfile
from io import BytesIO

import torch
from PIL import Image
from tigris_boto3_ext import bundle_objects


def build_batches(metadata_path, batch_size):
    """Load a list of object keys from a metadata file and split into batches.

    Returns a list of lists, where each inner list is a batch of dicts
    with at least a "key" field pointing to the object key in the bucket.
    """
    ...


class TigrisBundleDataset(torch.utils.data.IterableDataset):
    def __init__(self, s3_client, metadata_path, bucket, batch_size=32, prefetch=20):
        self.s3_client = s3_client
        self.bucket = bucket
        self.batch_size = batch_size
        self.prefetch = prefetch
        self.batches = build_batches(metadata_path, batch_size)

    def __iter__(self):
        worker_info = torch.utils.data.get_worker_info()
        if worker_info is None:
            my_batches = self.batches
        else:
            my_batches = self.batches[worker_info.id::worker_info.num_workers]
        random.shuffle(my_batches)

        for i in range(0, len(my_batches), self.prefetch):
            chunk = my_batches[i : i + self.prefetch]
            keys = [row["key"] for batch in chunk for row in batch]

            with bundle_objects(self.s3_client, self.bucket, keys) as response:
                with tarfile.open(fileobj=response, mode="r|") as tar:
                    for member in tar:
                        if member.name == "__bundle_errors.json":
                            continue
                        f = tar.extractfile(member)
                        if f is None:
                            continue
                        image = Image.open(BytesIO(f.read())).convert("RGB")
                        yield {"image": image}

How it works

The Bundle API is a Tigris extension to the S3 API. You send a POST request with a list of object keys and receive a streaming tar archive:

POST /{bucket}?bundle HTTP/1.1
x-tigris-bundle-format: tar
Content-Type: application/json

{"keys": ["train/img_001.jpg", "train/img_002.jpg", "train/img_003.jpg"]}

The server streams back a tar archive containing those objects, in the order you requested. Each tar entry's filename is the full object key.
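To make the wire format concrete, the request can be assembled with a small helper. This is an illustrative sketch, not part of any SDK (build_bundle_request is a hypothetical name), and a real request must additionally carry S3 SigV4 authentication headers:

```python
import json


def build_bundle_request(bucket, keys, on_error="skip"):
    """Assemble the path, headers, and JSON body for a Bundle API call.

    Illustrative only: a real request must also be signed with S3 SigV4.
    """
    path = f"/{bucket}?bundle"
    headers = {
        "x-tigris-bundle-format": "tar",
        "x-tigris-bundle-on-error": on_error,
        "Content-Type": "application/json",
    }
    body = json.dumps({"keys": list(keys)})
    return path, headers, body


path, headers, body = build_bundle_request(
    "my-bucket", ["train/img_001.jpg", "train/img_002.jpg"]
)
```

The returned path, headers, and body map one-to-one onto the raw HTTP request shown above.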

Request headers

| Header                     | Required | Values        | Default |
| -------------------------- | -------- | ------------- | ------- |
| x-tigris-bundle-format     | Yes      | tar           | (none)  |
| x-tigris-bundle-on-error   | No       | fail \| skip  | skip    |

Request body

Send a JSON object with a keys array:

{
  "keys": [
    "dataset/train/img_00001.jpg",
    "dataset/train/img_00002.jpg",
    "dataset/train/img_00003.jpg"
  ]
}

XML is also supported:

<?xml version="1.0" encoding="UTF-8"?>
<Bundle xmlns="http://s3.amazonaws.com/doc/2006-03-01/">
  <Object><Key>dataset/train/img_00001.jpg</Key></Object>
  <Object><Key>dataset/train/img_00002.jpg</Key></Object>
</Bundle>

Error handling

Skip mode (default)

Missing or inaccessible objects are silently omitted from the tar. A __bundle_errors.json entry is appended at the end of the archive:

{
  "skipped": [{ "key": "dataset/train/img_00002.jpg", "reason": "NoSuchKey" }]
}

This is the recommended mode for training pipelines. Dataloaders already handle missing samples gracefully.
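If a pipeline does need to inspect what was skipped, the trailing entry can be collected while iterating the stream. A minimal sketch, assuming the error-entry format shown above (split_bundle is a hypothetical helper; the tar here is built in memory to stand in for a real response):

```python
import io
import json
import tarfile


def split_bundle(fileobj):
    """Iterate a bundle tar, returning (objects, errors).

    objects maps each key to its bytes; errors is the parsed contents of
    the trailing __bundle_errors.json entry (empty "skipped" if absent).
    """
    objects, errors = {}, {"skipped": []}
    with tarfile.open(fileobj=fileobj, mode="r|") as tar:
        for member in tar:
            f = tar.extractfile(member)
            if f is None:
                continue
            data = f.read()
            if member.name == "__bundle_errors.json":
                errors = json.loads(data)
            else:
                objects[member.name] = data
    return objects, errors


# Build a stand-in bundle in memory to demonstrate the helper.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w") as tar:
    for name, payload in [
        ("train/img_001.jpg", b"fake-image-bytes"),
        ("__bundle_errors.json",
         json.dumps({"skipped": [{"key": "train/img_002.jpg",
                                  "reason": "NoSuchKey"}]}).encode()),
    ]:
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
buf.seek(0)

objects, errors = split_bundle(buf)
```

The same loop structure works on a real BundleResponse, since it is just a file-like tar stream.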

Fail mode

Set x-tigris-bundle-on-error: fail to pre-validate all keys before streaming. If any key is missing, the server returns a 404 error with the list of missing keys — no partial tar is sent.

<Error>
  <Code>BundleKeyNotFound</Code>
  <Message>One or more keys could not be resolved</Message>
  <MissingKeys>
    <Key>dataset/train/img_00002.jpg</Key>
  </MissingKeys>
</Error>

Use fail mode for inference or serving where every object must be present.

Response trailers

After the stream completes, the response includes HTTP trailers:

| Trailer                 | Description                        |
| ----------------------- | ---------------------------------- |
| x-tigris-bundle-count   | Number of objects in the tar       |
| x-tigris-bundle-bytes   | Total bytes streamed               |
| x-tigris-bundle-skipped | Number of skipped keys (skip mode) |

Limits

| Parameter            | Limit  |
| -------------------- | ------ |
| Max keys per request | 5,000  |
| Max assembled size   | 50 GB  |
| Max request body     | 5 MB   |
| Request timeout      | 15 min |
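Because a single request accepts at most 5,000 keys, larger key lists need to be split client-side into multiple bundle requests. A simple sketch (chunk_keys is a hypothetical helper, not part of the SDK):

```python
def chunk_keys(keys, max_keys=5000):
    """Split a key list into chunks that respect the per-request key limit."""
    return [keys[i : i + max_keys] for i in range(0, len(keys), max_keys)]


# 12,000 keys become three requests: 5,000 + 5,000 + 2,000.
chunks = chunk_keys([f"train/img_{i:05d}.jpg" for i in range(12000)])
```

Each chunk can then be passed to bundle_objects as its own request.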

Authentication

Standard S3 SigV4 authentication. The caller must have s3:GetObject permission on the bucket. No new IAM actions are required.