Using Hugging Face datasets with Tigris
One of the most popular ways to share datasets is via Hugging Face’s dataset platform. You can even stream larger-than-laptop datasets, but there are no guarantees on throughput or availability. When you’re developing a toy model, that might not matter. But as your model matures and you combine your custom datasets with public ones, it’s critical to save your own copy.
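For context, streaming from the Hub is a one-line change. Here’s a minimal sketch using the datasets library, with the same dataset we import later in this post:
import os
from datasets import load_dataset

# Stream rows lazily over the network instead of downloading the whole dataset
ds = load_dataset("mlabonne/FineTome-100k", split="train", streaming=True)
print(next(iter(ds)))  # fetches only the first example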
The ability to reproduce the state of your model at a given time has become critical, and even legally required, as models are integrated into healthcare, legal, and other compliance-heavy domains. Why did the AI agree to sell a car for $1? Or delete a production database?
As we develop models, they’re going to make mistakes. It’s challenging to debug across scattered datasets, especially public ones outside your control. Centralizing your datasets in a common store is a good first step on your way to full dataset version control. Just make sure you think about additional costs: Hugging Face dataset streaming is free, but private stores can quickly rack up egress fees.
Today we’re going to learn how to import Hugging Face datasets into Tigris so that you can use them for whatever you need.
For production workloads, we recommend using LanceDB’s multimodal lakehouse to store your training datasets, but if you’re just getting started, this approach is more than enough.
Prerequisites
Here’s what you need to get started:
- A local Python development environment (our blog has a guide on using development containers to set one up).
- A Tigris account from storage.new.
- A Tigris bucket and access keys with the Editor permission on that bucket.
Setting up your environment manually
For manual setup, you'll need:
- Python 3.10 or later
- uv or another Python dependency manager
- Your Tigris access credentials
Install the dependencies:
uv python install 3.10
uv venv
uv sync
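If you’re not starting from a project that already declares its dependencies, you’ll need the libraries the scripts below rely on. One way to add them (a sketch; adjust to your own project setup):
uv add datasets s3fs python-dotenv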
Next, copy .env.example to .env and configure your Tigris credentials:
# Tigris configuration
AWS_ACCESS_KEY_ID=tid_your_access_key_here
AWS_SECRET_ACCESS_KEY=tsec_your_secret_key_here
AWS_ENDPOINT_URL_S3=https://fly.storage.tigris.dev
AWS_ENDPOINT_URL_IAM=https://iam.tigris.dev
AWS_REGION=auto
# Dataset and bucket
BUCKET_NAME=your-bucket-name-here
DATASET_NAME=mlabonne/FineTome-100k
To verify your configuration is correct, run the validation script:
uv run scripts/ensure-dotenv.py
This script checks that all required environment variables are set:
import os
from dotenv import load_dotenv

load_dotenv()

for key in [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_ENDPOINT_URL_S3",
    "AWS_ENDPOINT_URL_IAM",
    "AWS_REGION",
    "BUCKET_NAME",
    "DATASET_NAME",
]:
    assert os.getenv(key) is not None, f"Environment variable {key} is not defined"

print("Your .env file is good to go!")
Importing a dataset
Now let's import the FineTome-100k dataset to Tigris. The process is surprisingly straightforward thanks to Hugging Face datasets' built-in support for S3-compatible storage.
First, let's look at the helper module that sets up our Tigris connection:
import os
from typing import Dict, Tuple

import s3fs
from dotenv import load_dotenv

def setup() -> Tuple[Dict[str, str], s3fs.S3FileSystem]:
    load_dotenv()

    storage_options = {
        "key": os.getenv("AWS_ACCESS_KEY_ID"),
        "secret": os.getenv("AWS_SECRET_ACCESS_KEY"),
        "endpoint_url": os.getenv("AWS_ENDPOINT_URL_S3"),
    }

    # Create the S3 filesystem
    fs = s3fs.S3FileSystem(**storage_options)

    # Test write access
    bucket_name = os.getenv("BUCKET_NAME")
    fs.write_text(f"/{bucket_name}/test.txt", "this is a test")
    fs.rm(f"/{bucket_name}/test.txt")

    return (storage_options, fs)
The import script uses Hugging Face datasets' save_to_disk method with our Tigris storage options:
import os

import tigris
from datasets import load_dataset

def main():
    storage_options, fs = tigris.setup()
    bucket_name = os.getenv("BUCKET_NAME")
    dataset_name = os.getenv("DATASET_NAME")

    # Load the dataset from Hugging Face
    dataset = load_dataset(dataset_name, split="train")

    # Save directly to Tigris
    dataset.save_to_disk(
        f"s3://{bucket_name}/datasets/{dataset_name}",
        storage_options=storage_options,
    )

    print(f"Dataset {dataset_name} is now in Tigris at {bucket_name}/datasets/{dataset_name}")

if __name__ == "__main__":
    main()
Run the import script:
uv run scripts/import-to-tigris.py
That's it! The dataset is now stored in Tigris and ready to use from anywhere.
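If you want to double-check what landed in the bucket, you can reuse the helper module to list the objects under the dataset prefix. This is a quick sketch; save_to_disk typically writes Arrow shards alongside metadata files like dataset_info.json and state.json:
import os

import tigris

storage_options, fs = tigris.setup()
bucket_name = os.getenv("BUCKET_NAME")
dataset_name = os.getenv("DATASET_NAME")

# List everything save_to_disk wrote under the dataset prefix
for path in fs.ls(f"{bucket_name}/datasets/{dataset_name}"):
    print(path)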
Reading and processing datasets from Tigris
Once your dataset is in Tigris, you can load it from anywhere using the same storage options. Here's an example that loads the dataset, applies a filter, and saves the filtered version back to Tigris:
import os

import tigris
from datasets import load_from_disk

def remove_blue(row):
    """
    Example transformation that removes conversations mentioning "blue".
    You can implement any filtering or transformation logic here.
    """
    for conv in row['conversations']:
        if "blue" in conv['value']:
            return False  # remove the row
    return True  # keep the row

def main():
    storage_options, fs = tigris.setup()
    bucket_name = os.getenv("BUCKET_NAME")
    dataset_name = os.getenv("DATASET_NAME")

    # Load dataset from Tigris
    dataset = load_from_disk(
        f"s3://{bucket_name}/datasets/{dataset_name}",
        storage_options=storage_options,
    )

    # Apply filtering
    filtered_ds = dataset.filter(remove_blue)

    # Save filtered dataset back to Tigris
    filtered_ds.save_to_disk(
        f"s3://{bucket_name}/no-blue/{dataset_name}",
        storage_options=storage_options,
    )

    print(f"Filtered dataset saved to {bucket_name}/no-blue/{dataset_name}")

if __name__ == "__main__":
    main()
Run the processing script:
uv run scripts/read-from-tigris.py
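To sanity-check the result, you can load both versions back from Tigris and compare row counts. A minimal sketch, reusing the same helper module and environment variables:
import os

import tigris
from datasets import load_from_disk

storage_options, fs = tigris.setup()
bucket_name = os.getenv("BUCKET_NAME")
dataset_name = os.getenv("DATASET_NAME")

original = load_from_disk(
    f"s3://{bucket_name}/datasets/{dataset_name}", storage_options=storage_options
)
filtered = load_from_disk(
    f"s3://{bucket_name}/no-blue/{dataset_name}", storage_options=storage_options
)

# The filtered copy should have the same columns but fewer (or equal) rows
print(len(original), len(filtered))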
Conclusion
You did it! Your copy of the dataset is safely stored in your own bucket. You’ve centralized your datasets and are on the path to versioning them.
We love Hugging Face for providing models and datasets to the world for free, and we want you to keep using them to develop your own models. However, as your models mature and regulations come into play, keeping your own copy ensures that no one tampers with the data, that your bandwidth won’t suddenly drop mid-training run, and that reads won’t lag across regions. Tigris dynamically places your datasets where you need them so you can scale fearlessly to any cloud with an internet connection.
Globally distributed storage for your datasets and more
Tigris gives you the object storage system you never knew you needed. Automatically distribute your datasets, images, videos, and backups close to where they're needed. Just add data.