Using Hugging Face datasets with Tigris
One of the most popular ways to share datasets is via Hugging Face’s dataset platform. You can even stream larger-than-laptop datasets, but there are no guarantees on throughput or availability. When you’re developing a toy model, that might not matter. But as your model matures and you combine your custom datasets with public ones, it’s critical to save your own copy.
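For context, streaming from the Hub is a one-line change. Here’s a minimal sketch using the datasets library, with the same dataset we import later in this post:
import os
from datasets import load_dataset

# Stream rows lazily over the network instead of downloading the whole dataset
ds = load_dataset("mlabonne/FineTome-100k", split="train", streaming=True)
print(next(iter(ds)))  # fetches only the first example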
The ability to reproduce the state of your model at a given time has become critical, and even legally required, as models are integrated into healthcare, legal, and other compliance-heavy domains. Why did the AI agree to sell a car for $1? Or delete a production database?
As we develop models, they’re going to make mistakes. It’s challenging to debug across scattered datasets, especially public ones outside your control. Centralizing your datasets in a common store is a good first step on your way to full dataset version control. Just make sure you think about additional costs: Hugging Face dataset streaming is free, but private stores can quickly rack up egress fees.
Today we’re going to learn how to import Hugging Face datasets into Tigris so that you can use them for whatever you need.
For production workloads, we recommend using LanceDB’s multimodal lakehouse to store your training datasets, but if you’re just getting started, this approach is more than enough.
Prerequisites
Here’s what you need to get started:
- A local Python development environment (our blog has a guide on using development containers to set one up).
- A Tigris account from storage.new.
- A Tigris bucket and access keys with the Editor permission on that bucket.
Setting up your environment manually
For manual setup, you'll need:
- Python 3.10 or later
- uv or another Python dependency manager
- Your Tigris access credentials
Install the dependencies:
uv python install 3.10
uv venv
uv sync
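If you’re not starting from a project that already declares its dependencies, you’ll need the libraries the scripts below rely on. One way to add them (a sketch; adjust to your own project setup):
uv add datasets s3fs python-dotenv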
Next, copy .env.example to .env and configure your Tigris credentials:
# Tigris configuration
AWS_ACCESS_KEY_ID=tid_your_access_key_here
AWS_SECRET_ACCESS_KEY=tsec_your_secret_key_here
AWS_ENDPOINT_URL_S3=https://fly.storage.tigris.dev
AWS_ENDPOINT_URL_IAM=https://iam.tigris.dev
AWS_REGION=auto
# Dataset and bucket
BUCKET_NAME=your-bucket-name-here
DATASET_NAME=mlabonne/FineTome-100k
To verify your configuration is correct, run the validation script:
uv run scripts/ensure-dotenv.py
This script checks that all required environment variables are set:
import os
from dotenv import load_dotenv

load_dotenv()

for key in [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_ENDPOINT_URL_S3",
    "AWS_ENDPOINT_URL_IAM",
    "AWS_REGION",
    "BUCKET_NAME",
    "DATASET_NAME",
]:
    assert os.getenv(key) is not None, f"Environment variable {key} is not defined"

print("Your .env file is good to go!")
Importing a dataset
Now let's import the FineTome-100k dataset to Tigris. The process is surprisingly straightforward thanks to Hugging Face datasets' built-in support for S3-compatible storage.
First, let's look at the helper module that sets up our Tigris connection:
import os
from typing import Dict, Tuple

import s3fs
from dotenv import load_dotenv

def setup() -> Tuple[Dict[str, str], s3fs.S3FileSystem]:
    load_dotenv()

    storage_options = {
        "key": os.getenv("AWS_ACCESS_KEY_ID"),
        "secret": os.getenv("AWS_SECRET_ACCESS_KEY"),
        "endpoint_url": os.getenv("AWS_ENDPOINT_URL_S3"),
    }

    # Create the S3 filesystem
    fs = s3fs.S3FileSystem(**storage_options)

    # Test write access
    bucket_name = os.getenv("BUCKET_NAME")
    fs.write_text(f"/{bucket_name}/test.txt", "this is a test")
    fs.rm(f"/{bucket_name}/test.txt")

    return (storage_options, fs)
The import script uses Hugging Face datasets' save_to_disk method with our Tigris storage options:
import os

import tigris
from datasets import load_dataset

def main():
    storage_options, fs = tigris.setup()
    bucket_name = os.getenv("BUCKET_NAME")
    dataset_name = os.getenv("DATASET_NAME")

    # Load the dataset from Hugging Face
    dataset = load_dataset(dataset_name, split="train")

    # Save directly to Tigris
    dataset.save_to_disk(
        f"s3://{bucket_name}/datasets/{dataset_name}",
        storage_options=storage_options,
    )

    print(f"Dataset {dataset_name} is now in Tigris at {bucket_name}/datasets/{dataset_name}")

if __name__ == "__main__":
    main()
Run the import script:
uv run scripts/import-to-tigris.py
That's it! The dataset is now stored in Tigris and ready to use from anywhere.
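If you want to double-check what landed in the bucket, you can reuse the helper module to list the objects under the dataset prefix. This is a quick sketch; save_to_disk typically writes Arrow shards alongside metadata files like dataset_info.json and state.json:
import os

import tigris

storage_options, fs = tigris.setup()
bucket_name = os.getenv("BUCKET_NAME")
dataset_name = os.getenv("DATASET_NAME")

# List everything save_to_disk wrote under the dataset prefix
for path in fs.ls(f"{bucket_name}/datasets/{dataset_name}"):
    print(path)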
Reading and processing datasets from Tigris
Once your dataset is in Tigris, you can load it from anywhere using the same storage options. Here's an example that loads the dataset, applies a filter, and saves the filtered version back to Tigris:
import os

import tigris
from datasets import load_from_disk

def remove_blue(row):
    """
    Example transformation that removes conversations mentioning "blue".
    You can implement any filtering or transformation logic here.
    """
    for conv in row['conversations']:
        if "blue" in conv['value']:
            return False  # remove the row
    return True  # keep the row

def main():
    storage_options, fs = tigris.setup()
    bucket_name = os.getenv("BUCKET_NAME")
    dataset_name = os.getenv("DATASET_NAME")

    # Load dataset from Tigris
    dataset = load_from_disk(
        f"s3://{bucket_name}/datasets/{dataset_name}",
        storage_options=storage_options,
    )

    # Apply filtering
    filtered_ds = dataset.filter(remove_blue)

    # Save filtered dataset back to Tigris
    filtered_ds.save_to_disk(
        f"s3://{bucket_name}/no-blue/{dataset_name}",
        storage_options=storage_options,
    )

    print(f"Filtered dataset saved to {bucket_name}/no-blue/{dataset_name}")

if __name__ == "__main__":
    main()
Run the processing script:
uv run scripts/read-from-tigris.py
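To sanity-check the result, you can load both versions back from Tigris and compare row counts. A minimal sketch, reusing the same helper module and environment variables:
import os

import tigris
from datasets import load_from_disk

storage_options, fs = tigris.setup()
bucket_name = os.getenv("BUCKET_NAME")
dataset_name = os.getenv("DATASET_NAME")

original = load_from_disk(
    f"s3://{bucket_name}/datasets/{dataset_name}", storage_options=storage_options
)
filtered = load_from_disk(
    f"s3://{bucket_name}/no-blue/{dataset_name}", storage_options=storage_options
)

# The filtered copy should have the same columns but fewer (or equal) rows
print(len(original), len(filtered))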
Conclusion
You did it! Your copy of the dataset is safely stored in your own bucket. You’ve centralized your datasets and are on the path to versioning them.
We love Hugging Face for providing models and datasets to the world for free, and we want you to keep using them to develop your own models. However, as your models mature and regulations come into play, keeping your own copy ensures that no one tampers with the data, that your bandwidth won’t suddenly drop mid-training run, and that reads won’t lag across regions. Tigris dynamically places your datasets where you need them so you can scale fearlessly to any cloud with an internet connection.
Globally distributed storage for your datasets and more
Tigris gives you the object storage system you never knew you needed. Automatically distribute your datasets, images, videos, and backups close to where they're needed. Just add data.