Choosing a Lance Dataset Type

LanceDB provides two dataset types for use with PyTorch, each suited for different scenarios:

  • SafeLanceDataset: A "map-style" dataset ideal for finite datasets. It provides high-performance random access to your data.
  • LanceDataset: An "iterable-style" dataset designed for streaming or constantly updated datasets that may be too large to index in memory.

This page explains the differences between them and how to pick the best one for the job.

| Criterion | SafeLanceDataset (map-style) | LanceDataset (iterable-style) |
| --- | --- | --- |
| Data access pattern | Random access via integer indices ✅ | Sequential streaming; no random access ❌ |
| Indexing & memory footprint | Builds an in-memory index of all records (requires enough RAM for the index) | No full index; reads shards on the fly |
| Multi-worker loading | Fully supported; each worker gets its own slice via PyTorch's samplers | Workers share the iterator; use ShardedFragmentSampler to split work |
| Prefetching & multiprocessing | Leverages PyTorch's prefetching, caching, and multiprocessing | Limited to what your iterable and sampler can buffer; may require custom logic |
| Typical use cases | Fixed, finite datasets (ImageNet, CIFAR, COCO, large text corpora) | Streaming/unbounded data (real-time feeds, logs, social media) |
| Sampling & sharding flexibility | Uses PyTorch's RandomSampler or DistributedSampler for uniform splits | Built-in ShardedFragmentSampler that understands Lance shards; no inter-worker coordination needed |
| Distributed training | Easy to split with DistributedSampler when num_workers > 0 | Use iterable-aware samplers to ensure each process reads distinct fragments |
| Throughput & performance | Highest throughput when the dataset index fits in memory and you can use multiple workers | Scales to arbitrarily large or infinite data; optimized for sequential reads on huge datasets |
| When to choose | Your dataset is moderate in size and indexable in RAM; you want max random-access throughput and a simple multi-worker setup | You're dealing with streaming/unbounded data; your data is too big to index in RAM; you need custom sharding |

When to Use Each Dataset Type

Both dataset types let you stream your training data from Tigris, but picking the right option depends on your use case.

Use SafeLanceDataset for Finite Datasets

SafeLanceDataset is a map-style dataset and should be your default choice for most standard, finite datasets. If your data can be indexed (most image, text, or tabular datasets fall in this category), SafeLanceDataset is the way to go.

It works by building an index of all records in your dataset, which needs to fit into your system's RAM. The dataset itself, however, can be much larger than your available memory and live in Tigris. This in-memory index allows very fast random access to any item, which is essential for efficient shuffling and for using multiple workers (num_workers > 0 in PyTorch's DataLoader) to parallelize and speed up data loading. This makes it ideal for common supervised learning datasets like ImageNet, COCO, or large text corpora.

If you are in doubt, start with SafeLanceDataset.

Use LanceDataset for Streaming or Unbounded Datasets

Use the iterable-style LanceDataset for more specialized scenarios, such as:

  • Streaming or unbounded data sources: If your dataset is a continual stream (like a real-time data feed, social media posts, or other infinite sources), you need an iterable dataset. LanceDataset lets you iterate over the data as it arrives without needing a predefined length.
  • Extremely large datasets: If your dataset is so large that even its index is too big to fit in memory, you'll need to stream the data. LanceDataset reads data sequentially in chunks (or "shards") on the fly without building a full index.
  • Custom sampling or sharding logic: If you need to process data in a way that standard samplers can't handle, LanceDataset can be configured with custom samplers. It includes an optimized ShardedFragmentSampler that understands the internal structure of Lance files, which can outperform standard samplers for certain large-scale distributed training scenarios. This ensures each process reads a distinct fragment of the dataset without needing extra coordination.

In practice, many teams will find that SafeLanceDataset meets all their needs for training on fixed datasets. Use LanceDataset when you have a compelling reason, like data streaming or when experimenting with LanceDB’s advanced sampling capabilities.