# LanceDB

[LanceDB Multimodal Lakehouse](https://www.lancedb.com) lets you store, process, and search across text, images, audio, video, embeddings, and structured metadata in one system. This makes it easier to go from raw data to training-ready features, and to build pipelines that handle a variety of inputs without stitching together multiple tools or managing each stage manually.

Teams can connect to all their existing LanceDB datasets to easily define feature logic as standard Python functions, automatically versioned and executed across distributed, scalable infrastructure.

By using Tigris with the Multimodal Lakehouse, developers can now build bottomless vector pipelines—ingesting multimodal context into LanceDB with Tigris as the backend for seamless, elastic storage that scales infinitely.

## Getting Started with LanceDB Multimodal Lakehouse[​](#getting-started-with-lancedb-multimodal-lakehouse "Direct link to Getting Started with LanceDB Multimodal Lakehouse")

The LanceDB Multimodal Lakehouse is available in LanceDB Enterprise with Tigris as a supported object storage provider. You can still use Tigris with [LanceDB Cloud](https://accounts.lancedb.com/sign-up), and with the [open-source LanceDB](https://github.com/lancedb/lancedb).

Using the multimodal lakehouse features of LanceDB Enterprise starts with installing the Python package `geneva`:

```
pip install geneva
```

And connecting to your LanceDB table:

```
import geneva as gv

table = gv.connect("table-name")
```

From there, you can write Python functions, decorated as UDFs, and apply them to your LanceDB datasets automatically. LanceDB Enterprise packages your environment, deploys your code, and handles all the data partitioning, checkpointing, and incremental updates. Reference the LanceDB Multimodal Lakehouse [documentation](https://docs.lancedb.com) for the latest guides on using `geneva`.

You'll need existing LanceDB datasets to use the Multimodal Lakehouse features.

## How to Create LanceDB Datasets with Tigris[​](#how-to-create-lancedb-datasets-with-tigris "Direct link to How to Create LanceDB Datasets with Tigris")

In order to use LanceDB as our data lakehouse, we need to configure it to use Tigris as a storage backend. Tigris [works with LanceDB](https://www.tigrisdata.com/docs/libraries/lancedb/) because Tigris exposes the S3 API, which LanceDB can use to read and write data. All you need to do is change out the credentials and endpoints.

### Authentication via Environment Variables[​](#authentication-via-environment-variables "Direct link to Authentication via Environment Variables")

The LanceDB client will pick up credentials from environment variables. Set the following environment variables with your Tigris credentials either in your shell or in a `.env` file:

* Shell
* .env file

```
export AWS_ACCESS_KEY_ID=tid_access_key_id
export AWS_SECRET_ACCESS_KEY=tsec_secret_access_key
export AWS_ENDPOINT_URL_S3=https://t3.storage.dev
export AWS_REGION=auto
```

```
AWS_ACCESS_KEY_ID=tid_access_key_id
AWS_SECRET_ACCESS_KEY=tsec_secret_access_key
AWS_ENDPOINT_URL_S3=https://t3.storage.dev
AWS_REGION=auto
```

Make sure `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` match the values you get from the Tigris console. The other variables tell LanceDB where to look for your data. Once these variables are exported into or loaded by your training loop, LanceDB will use them internally to authenticate with Tigris.
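As a sanity check, you can verify these variables are present before connecting. A minimal sketch (the helper name is ours, not part of LanceDB):

```python
import os

# The environment variables LanceDB's S3 integration reads.
REQUIRED_VARS = [
    "AWS_ACCESS_KEY_ID",
    "AWS_SECRET_ACCESS_KEY",
    "AWS_ENDPOINT_URL_S3",
    "AWS_REGION",
]

def missing_tigris_vars(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Failing fast with a clear list of missing variables beats debugging an opaque S3 authentication error later.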

### Authentication via Hardcoded Credentials[​](#authentication-via-hardcoded-credentials "Direct link to Authentication via Hardcoded Credentials")

**Warning**

Hardcoding access credentials is a bad idea from a security standpoint: these credentials will let an attacker delete or modify your training dataset bucket. Only hardcode them if you cannot set the credentials properly in your platform of choice.

You can also pass credentials directly to LanceDB in your code. LanceDB’s `connect` method accepts a `storage_options` dictionary to let you specify whatever credentials you want. This is useful if you are in an environment where you need to retrieve credentials from a secret store at runtime.
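As a sketch of that pattern, here `get_secret` stands in for whatever secret-store client you use, and the key names are placeholders. The `storage_options` keys match the ones used in the connection examples below:

```python
# Hypothetical helper: build storage_options at runtime from a secret store.
# get_secret is a stand-in for your secret-store client (Vault, AWS Secrets
# Manager, etc.); the secret names are placeholders.
def build_storage_options(get_secret):
    return {
        "endpoint": "https://t3.storage.dev",
        "region": "auto",
        "aws_access_key_id": get_secret("tigris/access_key_id"),
        "aws_secret_access_key": get_secret("tigris/secret_access_key"),
    }

# Usage (assumes a real secret store and network access):
# db = lancedb.connect("s3://my-bucket/ml-datasets/food101",
#                      storage_options=build_storage_options(get_secret))
```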

### Connecting to Tigris[​](#connecting-to-tigris "Direct link to Connecting to Tigris")

Now, let’s connect LanceDB to your Tigris bucket. Use the [LanceDB SDK](https://lancedb.github.io/lancedb/python/python/) to connect to your database in Tigris. For example, if your bucket is `my-bucket` and you want to store a dataset under the path `ml-datasets/food101`, you can connect like this:

* JavaScript
* Python
* TypeScript

```
import * as lancedb from "@lancedb/lancedb";

const bucketName = process.env.BUCKET_NAME || "my-bucket";

// Connect to LanceDB database in a Tigris bucket (s3-compatible URI for Tigris)
const db = await lancedb.connect(`s3://${bucketName}/ml-datasets/food101`, {
  storageOptions: {
    endpoint: "https://t3.storage.dev", // Tigris storage endpoint
    region: "auto", // auto region for global routing
    // If you are not using env vars, you can also specify credentials here:
    // accessKeyId: "tid_access_key_id",
    // secretAccessKey: "tsec_secret_access_key",
  },
});
```

```
import lancedb

# Connect to LanceDB database in a Tigris bucket (s3-compatible URI for Tigris)
db = lancedb.connect(
    "s3://my-bucket/ml-datasets/food101",
    storage_options={
        "endpoint": "https://t3.storage.dev", # Tigris storage endpoint
        "region": "auto",                     # auto region for global routing
        # If you are not using env vars, you can also specify credentials here:
        # "aws_access_key_id": "tid_access_key_id",
        # "aws_secret_access_key": "tsec_secret_access_key",
    }
)
```

```
import * as lancedb from "@lancedb/lancedb";

const bucketName: string = process.env.BUCKET_NAME || "my-bucket";

// Connect to LanceDB database in a Tigris bucket (s3-compatible URI for Tigris)
const db = await lancedb.connect(`s3://${bucketName}/ml-datasets/food101`, {
  storageOptions: {
    endpoint: "https://t3.storage.dev", // Tigris storage endpoint
    region: "auto", // auto region for global routing
    // If you are not using env vars, you can also specify credentials here:
    // accessKeyId: "tid_access_key_id",
    // secretAccessKey: "tsec_secret_access_key",
  },
});
```

In this example, we connected to the path `s3://my-bucket/ml-datasets/food101` with the Tigris endpoint and region we specified. The `db` object is a handle to a LanceDB database stored remotely on Tigris. We’ve configured it to talk to Tigris instead of S3 by setting the endpoint to Tigris’ [`https://t3.storage.dev`](https://t3.storage.dev), so even though the path starts with `s3://`, it’s actually talking to Tigris.
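Since the `s3://` scheme is only an addressing convention here, a tiny helper can keep bucket and dataset paths tidy. This is a hypothetical convenience for illustration, not part of the LanceDB API:

```python
def lancedb_uri(bucket: str, *path_parts: str) -> str:
    """Build the s3:// URI LanceDB expects. The endpoint in storage_options,
    not the URI scheme, determines that requests go to Tigris."""
    return "s3://" + "/".join((bucket,) + path_parts)
```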

## Creating a dataset[​](#creating-a-dataset "Direct link to Creating a dataset")

To get started, install LanceDB into your project's dependencies:

* npm
* pnpm
* yarn
* pip
* uv

```
npm install --save @lancedb/lancedb apache-arrow
```

```
pnpm add @lancedb/lancedb apache-arrow
```

```
yarn add @lancedb/lancedb apache-arrow
```

```
pip install lancedb
```

```
uv pip install lancedb
```

Then register the embedding model you plan to use, such as [OpenAI's embedding model](https://platform.openai.com/docs/guides/embeddings):

```
import "@lancedb/lancedb/embedding/openai";
import { LanceSchema, getRegistry } from "@lancedb/lancedb/embedding";
import { EmbeddingFunction } from "@lancedb/lancedb/embedding";

const func = getRegistry()
  .get("openai")
  ?.create({ model: "text-embedding-3-small" }) as EmbeddingFunction;
```

And create the schema for the data you want to ingest:

```
import * as arrow from "apache-arrow";

const contentSchema = LanceSchema({
  text: func.sourceField(new arrow.Utf8()),
  vector: func.vectorField(),
  url: new arrow.Utf8(),
  heading: new arrow.Utf8(),
});
```

This creates a schema that has a few fields:

* The source `text` that you are searching against
* The high-dimensional generated `vector` used to search for similar embeddings
* Additional metadata such as the `heading` and `url` of the document you're embedding so that you can link users back to a source

Strictly speaking, only the `text` and `vector` fields are required. The rest are optional but can help you make the user experience better. Users tend to trust responses that have citations a lot more than responses that don't.

Next, create a table that uses that schema:

```
const tbl = await db.createEmptyTable("content", contentSchema, {
  // if both of these are set, LanceDB uses the semantics of
  // `CREATE TABLE IF NOT EXISTS content` in your favorite relational
  // database.
  mode: "create",
  existOk: true,
});
```

### Ingesting files[​](#ingesting-files "Direct link to Ingesting files")

The exact details of how you ingest files will vary based on what you are ingesting, but at a high level a few cheap assumptions about the data go a long way. The biggest barrier to ingesting data into a model is a combination of two factors:

1. The context window of the model ([8191 tokens for OpenAI models](https://dev.to/simplr_sh/the-best-way-to-chunk-text-data-for-generating-embeddings-with-openai-models-56c9)).
2. Figuring out where to chunk files such that they fit into the context window of the model.

For the sake of argument, let's say that we're dealing with a folder full of [Markdown documents](https://en.wikipedia.org/wiki/Markdown). Markdown is a loosely structured but versatile format (this document is written in a variant of Markdown), and we can take advantage of how people organize their writing to make chunking easier. People generally break Markdown documents into sections, where each section is separated by a line beginning with one or more hashes:

```
# Title of the document

Ah yes, the venerable introduction paragraph—the sacred scroll...

## Insights

What began as an unrelated string of metaphors...
```

You can break this into two chunks:

```
[
  {
    "heading": "Title of the document",
    "content": "Ah yes, the venerable introduction paragraph—the sacred scroll..."
  },
  {
    "heading": "Insights",
    "content": "What began as an unrelated string of metaphors..."
  }
]
```
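The heading-based split itself can be sketched in a few lines of Python. This simplified version ignores token limits entirely; the full TypeScript implementation later in this section handles them:

```python
import re

def split_markdown(markdown: str) -> list[dict]:
    """Split a Markdown document into sections at heading lines (1-6 hashes)."""
    sections = []
    heading, buf = None, []
    for line in markdown.split("\n"):
        m = re.match(r"^#{1,6} (.+)", line)
        if m:
            # Flush the previous section before starting a new one.
            if heading is not None and "\n".join(buf).strip():
                sections.append({"heading": heading, "content": "\n".join(buf).strip()})
            heading, buf = m.group(1).strip(), []
        else:
            buf.append(line)
    # Flush the final section.
    if heading is not None and "\n".join(buf).strip():
        sections.append({"heading": heading, "content": "\n".join(buf).strip()})
    return sections
```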

Each of these should be indexed separately and the heading metadata should be attached to each record in the database. You can break it up into sections of up to 8191 tokens (or however big your model's context window is) with logic like this:

Long code block with example document chunking code

```
import { encoding_for_model } from "@dqbd/tiktoken";

export type MarkdownSection = {
  heading: string;
  content: string;
};

// Exercise for the reader: handle front matter with the gray-matter package.

export async function chunkify(
  markdown: string,
  maxTokens = 8191,
  model = "text-embedding-3-small",
): Promise<MarkdownSection[]> {
  const encoding = await encoding_for_model(model);
  const sections: MarkdownSection[] = [];

  const lines = markdown.split("\n");
  let currentHeading: string | null = null;
  let currentContent: string[] = [];

  const pushSection = (heading: string, content: string) => {
    const tokens = encoding.encode(content);
    if (tokens.length <= maxTokens) {
      sections.push({ heading, content });
    } else {
      // If section is too long, split by paragraphs
      const paragraphs = content.split(/\n{2,}/);
      let chunkTokens: number[] = [];
      let chunkText: string = "";

      for (const para of paragraphs) {
        const paraTokens = encoding.encode(para + "\n\n");
        if (chunkTokens.length + paraTokens.length > maxTokens) {
          sections.push({
            heading,
            content: chunkText.trim(),
          });
          chunkTokens = [...paraTokens];
          chunkText = para + "\n\n";
        } else {
          chunkTokens.push(...paraTokens);
          chunkText += para + "\n\n";
        }
      }

      if (chunkTokens.length > 0) {
        sections.push({
          heading,
          content: chunkText.trim(),
        });
      }
    }
  };

  for (const line of lines) {
    const headingMatch = line.match(/^#{1,6} (.+)/);
    if (headingMatch) {
      if (currentHeading !== null) {
        const sectionText = currentContent.join("\n").trim();
        if (sectionText) {
          pushSection(currentHeading, sectionText);
        }
      }
      currentHeading = headingMatch[1].trim();
      currentContent = [];
    } else {
      currentContent.push(line);
    }
  }

  // Push the final section
  if (currentHeading !== null) {
    const sectionText = currentContent.join("\n").trim();
    if (sectionText) {
      pushSection(currentHeading, sectionText);
    }
  }

  encoding.free();
  return sections;
}
```

Then when you're reading your files, use a loop like this to break all of the files into chunks:

```
import { glob } from "glob";
import { readFile } from "node:fs/promises";
import { chunkify } from "./markdownChunk";

const markdownFiles = await glob("./docs/**/*.md");

const files = [...markdownFiles].filter(
  (fname) => !fname.endsWith("README.md"),
);
files.sort();

const fnameToURL = (fname) => {
  // Implement me!
};

const utterances = [];

for (const fname of files) {
  const data = await readFile(fname, "utf-8");
  const chunks = await chunkify(data);

  chunks.forEach(({ heading, content }) => {
    utterances.push({
      fname,
      heading,
      content,
      url: fnameToURL(fname),
    });
  });
}
```

And finally ingest all the files into LanceDB:

```
let docs: Record<string, string>[] = []; // temporary buffer so we don't block all the time
const MAX_BUFFER_SIZE = 100;
for (const utterance of utterances) {
  const { heading, content, url } = utterance;
  docs.push({
    heading,
    text: content, // the schema's source field is named `text`
    url,
  });

  if (docs.length >= MAX_BUFFER_SIZE) {
    console.log(`adding ${docs.length} documents`);
    await tbl.add(docs);
    docs = [];
  }
}

if (docs.length !== 0) {
  console.log(`adding ${docs.length} documents`);
  await tbl.add(docs);
}
```

Finally, create an index on the `vector` field so the LanceDB client can search faster:

```
await tbl.createIndex("vector");
```

And then run an example search for the term "Tigris":

```
const query = "Tigris";
const actual = await tbl.search(query).limit(10).toArray();
console.log(
  actual.map(({ url, heading, text }) => {
    return { url, heading, text };
  }),
);
```

The entire example in one big file

```
import * as lancedb from "@lancedb/lancedb";
import * as arrow from "apache-arrow";
import "@lancedb/lancedb/embedding/openai";
import { LanceSchema, getRegistry } from "@lancedb/lancedb/embedding";
import { EmbeddingFunction } from "@lancedb/lancedb/embedding";
import { glob } from "glob";
import { readFile } from "node:fs/promises";
import { chunkify } from "./markdownChunk";

const bucketName = process.env.BUCKET_NAME || "tigris-example";

interface Utterance {
  fname: string;
  heading: string;
  content: string;
  url: string;
}

const func = getRegistry()
  .get("openai")
  ?.create({ model: "text-embedding-3-small" }) as EmbeddingFunction;

const contentSchema = LanceSchema({
  text: func.sourceField(new arrow.Utf8()),
  vector: func.vectorField(),
  url: new arrow.Utf8(),
  heading: new arrow.Utf8(),
});

const fnameToURL = (fname: string) => {
  let ref = /\.\.\/\.\.\/(.*)\.md/.exec(fname)![1];
  if (ref.endsWith("/index")) {
    ref = ref.slice(0, -"index".length);
  }
  return `https://tigrisdata.com/docs/${ref}`;
};

(async () => {
  const markdownFiles = glob.sync("../../**/*.md");
  const files = [...markdownFiles].filter(
    (fname) => !fname.endsWith("README.md"),
  );
  files.sort();

  const utterances: Utterance[] = [];

  for (const fname of files) {
    const data = await readFile(fname, "utf-8");
    const chunks = await chunkify(data);

    chunks.forEach(({ heading, content }) => {
      utterances.push({
        fname,
        heading,
        content,
        url: fnameToURL(fname),
      });
    });
  }

  const db = await lancedb.connect(`s3://${bucketName}/docs-test`, {
    storageOptions: {
      endpoint: "https://t3.storage.dev",
      region: "auto",
    },
  });

  const tbl = await db.createEmptyTable("content", contentSchema, {
    mode: "create",
    existOk: true,
  });

  let docs: Record<string, string>[] = []; // temporary buffer so we don't block all the time
  const MAX_BUFFER_SIZE = 100;
  for (const utterance of utterances) {
    const { heading, content, url } = utterance;
    docs.push({
      heading,
      text: content,
      url,
    });

    if (docs.length >= MAX_BUFFER_SIZE) {
      console.log(`adding ${docs.length} documents`);
      await tbl.add(docs);
      docs = [];
    }
  }

  if (docs.length !== 0) {
    console.log(`adding ${docs.length} documents`);
    await tbl.add(docs);
  }

  await tbl.createIndex("vector");

  const query = "Tigris";
  const actual = await tbl.search(query).limit(10).toArray();
  console.log(
    actual.map(({ url, heading, text }) => {
      return { url, heading, text };
    }),
  );
})();
```

And now you can search the Tigris docs!
