Store Vector Embeddings with LanceDB

LanceDB is an open-source vector database (like a search engine) that enables you to ingest, and query documents via any number of backends, including Tigris. LanceDB sets itself apart from other vector databases because you can embed it into the same process that your web service runs in. Imagine it as being zero-setup like SQLite, but for vector databases instead of relational databases.

When you ingest a document, image, video, or your secret world domination plans into LanceDB, you use an embedding model to transform the text of the document into high-dimensional embedding vectors. These vectors can be searched through cosine similarity, allowing you to search by concepts (eg: "hot" and "warm" would register as adjacent to eachother) instead of just the pure words of a search query.

Of note: LanceDB isn't limited to just storing text, embedding vectors, and other simple metadata. LanceDB schemas can point to the raw bytes of your data, meaning that you don't need to store references to where your data lives if you don't need to.

The main reason you'd want to bring a vector database into the equation is if you're doing a Retrieval-Augmented Generation (RAG) pipeline like this:

Vector databases are also general search engines and you can search anything that can be embedded, be it text, images, audio, or video. Most AI providers offer an embedding model and nearly all of them are good enough for just about any task you would want to do.

Getting Started

To get started, install LanceDB into your project's NPM dependencies:

npm
pnpm
yarn

npm install --save @lancedb/lancedb apache-arrow

pnpm add @lancedb/lancedb apache-arrow

yarn add @lancedb/lancedb apache-arrow

Then import LanceDB into your project:

import * as lancedb from "@lancedb/lancedb";
import * as arrow from "apache-arrow";

const bucketName = process.env.BUCKET_NAME || "tigris-example";

const db = await lancedb.connect(`s3://${bucketName}/docs`, {
  storageOptions: {
    endpoint: "https://t3.storage.dev",
    region: "auto",
  },
});

Then register the embedding model you plan to use, such as OpenAI's embedding model:

import "@lancedb/lancedb/embedding/openai";
import { LanceSchema, getRegistry, register } from "@lancedb/lancedb/embedding";
import { EmbeddingFunction } from "@lancedb/lancedb/embedding";

const func = getRegistry()
  .get("openai")
  ?.create({ model: "text-embedding-3-small" }) as EmbeddingFunction;

And create the schema for the data you want to ingest:

const contentSchema = LanceSchema({
  text: func.sourceField(new arrow.Utf8()),
  vector: func.vectorField(),
  //title: new arrow.Utf8(),
  url: new arrow.Utf8(),
  heading: new arrow.Utf8(),
});

This creates a schema that has a few fields:

The source text that you are searching against
The high-dimensional generated vector used to search for similar embeddings
Additional metadata such as the title, heading, and url of the document you're embedding so that the model can link users back to a source

Strictly speaking, only the text and vector fields are required. The rest are optional but can help you make the user experience better. Users tend to trust responses that have citations a lot more than responses that don't.

Next, create a table that uses that schema:

const tbl = await db.createEmptyTable("content", contentSchema, {
  // if both of these are set, LanceDB uses the semantics of
  // `CREATE TABLE IF NOT EXISTS content` in your favorite relational
  // database.
  mode: "create",
  existOk: true,
});

Ingesting files

The exact details of how you ingest files will vary based on what you are ingesting, but at a high level you can make a lot of cheap assumptions about the data that will help. The biggest barrier to ingesting data into a model is a combination of two factors:

The context window of the model (8191 tokens for OpenAI models).
Figuring out where to chunk files such that they fit into the context window of the model.

For the sake of argument, let's say that we're dealing with a folder full of Markdown documents. Markdown is a fairly variable format that is versatile enough (this document is written in a variant of Markdown), but we can also take advantage of human organizational psychology to make this easier. People generally break Markdown documents into sections where each section is separated by a line beginning with one or more hashes:

# Title of the document

Ah yes, the venerable introduction paragraph—the sacred scroll...

## Insights

What began as an unrelated string of metaphors...

You can break this into two chunks:

[
  {
    "heading": "Title of the document",
    "content": "Ah yes, the venerable introduction paragraph—the sacred scroll..."
  },
  {
    "heading": "Insights",
    "content": "What began as an unrelated string of metaphors..."
  }
]

Each of these should be indexed separately and the heading metadata should be attached to each record in the database. You can break it up into sections of up to 8191 tokens (or however big your model's context window is) with logic like this:

Long code block with example document chunking code

import { encoding_for_model } from "@dqbd/tiktoken";

export type MarkdownSection = {
  heading: string;
  content: string;
};

// Exercise for the reader: handle front matter with the gray-matter package.

export async function chunkify(
  markdown: string,
  maxTokens = 8191,
  model = "text-embedding-3-small",
): Promise<MarkdownSection[]> {
  const encoding = await encoding_for_model(model);
  const sections: MarkdownSection[] = [];

  const lines = markdown.split("\n");
  let currentHeading: string | null = null;
  let currentContent: string[] = [];

  const pushSection = (heading: string, content: string) => {
    const tokens = encoding.encode(content);
    if (tokens.length <= maxTokens) {
      sections.push({ heading, content });
    } else {
      // If section is too long, split by paragraphs
      const paragraphs = content.split(/\n{2,}/);
      let chunkTokens: number[] = [];
      let chunkText: string = "";

      for (const para of paragraphs) {
        const paraTokens = encoding.encode(para + "\n\n");
        if (chunkTokens.length + paraTokens.length > maxTokens) {
          sections.push({
            heading,
            content: chunkText.trim(),
          });
          chunkTokens = [...paraTokens];
          chunkText = para + "\n\n";
        } else {
          chunkTokens.push(...paraTokens);
          chunkText += para + "\n\n";
        }
      }

      if (chunkTokens.length > 0) {
        sections.push({
          heading,
          content: chunkText.trim(),
        });
      }
    }
  };

  for (const line of lines) {
    const headingMatch = line.match(/^#{1,6} (.+)/);
    if (headingMatch) {
      if (currentHeading !== null) {
        const sectionText = currentContent.join("\n").trim();
        if (sectionText) {
          pushSection(currentHeading, sectionText);
        }
      }
      currentHeading = headingMatch[1].trim();
      currentContent = [];
    } else {
      currentContent.push(line);
    }
  }

  // Push the final section
  if (currentHeading !== null) {
    const sectionText = currentContent.join("\n").trim();
    if (sectionText) {
      pushSection(currentHeading, sectionText);
    }
  }

  encoding.free();
  return sections;
}

Then when you're reading your files, use a loop like this to break all of the files into chunks:

import { glob } from "glob";
import { readFile } from "node:fs/promises";
import { chunkify } from "./markdownChunk";

const markdownFiles = await glob("./docs/**/*.md");

const files = [...markdownFiles].filter(
  (fname) => !fname.endsWith("README.md"),
);
files.sort();

const fnameToURL = (fname) => {
  // Implement me!
};

const utterances = [];

for (const fname of files) {
  const data = await readFile(fname, "utf-8");
  const chunks = await chunkify(data);

  chunks.forEach(({ heading, content }) => {
    utterances.push({
      fname,
      heading,
      content,
      url: fnameToURL(fname),
    });
  });
}

And finally ingest all the files into LanceDB:

let docs: unknown = []; // temporary buffer so we don't block all the time
const MAX_BUFFER_SIZE = 100;
for (const utterance of utterances) {
  const { heading, content, url } = utterance;
  docs.push({
    heading,
    content,
    url,
  });

  if (docs.length >= MAX_BUFFER_SIZE) {
    console.log(`adding ${docs.length} documents`);
    await tbl.add(docs);
    docs = [];
  }
}

if (docs.length !== 0) {
  console.log(`adding ${docs.length} documents`);
  await tbl.add(docs);
}

Finally, create an index on the vector field so the LanceDB client can search faster:

await tbl.createIndex("vector");

And then run an example search for the term "Tigris":

const query = "Tigris";
const actual = await tbl.search(query).limit(10).toArray();
console.log(
  actual.map(({ url, heading, text }) => {
    return { url, heading, text };
  }),
);

The entire example in one big file

import * as lancedb from "@lancedb/lancedb";
import * as arrow from "apache-arrow";
import "@lancedb/lancedb/embedding/openai";
import { LanceSchema, getRegistry } from "@lancedb/lancedb/embedding";
import { EmbeddingFunction } from "@lancedb/lancedb/embedding";
import { glob } from "glob";
import { readFile } from "node:fs/promises";
import { chunkify } from "./markdownChunk";

const bucketName = process.env.BUCKET_NAME || "tigris-example";

interface Utterance {
  fname: string;
  heading: string;
  content: string;
  url: string;
}

const func = getRegistry()
  .get("openai")
  ?.create({ model: "text-embedding-3-small" }) as EmbeddingFunction;

const contentSchema = LanceSchema({
  text: func.sourceField(new arrow.Utf8()),
  vector: func.vectorField(),
  url: new arrow.Utf8(),
  heading: new arrow.Utf8(),
});

const fnameToURL = (fname) => {
  let ref = /\.\.\/\.\.\/(.*)\.md/.exec(fname)![1];
  if (ref.endsWith("/index")) {
    ref = ref.slice(0, -"index".length);
  }
  return `https://tigrisdata.com/docs/${ref}`;
};

(async () => {
  const markdownFiles = glob.sync("../../**/*.md");
  const files = [...markdownFiles].filter(
    (fname) => !fname.endsWith("README.md"),
  );
  files.sort();

  const utterances: Utterance[] = [];

  for (const fname of files) {
    const data = await readFile(fname, "utf-8");
    const chunks = await chunkify(data);

    chunks.forEach(({ heading, content }) => {
      utterances.push({
        fname,
        heading,
        content,
        url: fnameToURL(fname),
      });
    });
  }

  const db = await lancedb.connect(`s3://${bucketName}/docs-test`, {
    storageOptions: {
      endpoint: "https://t3.storage.dev",
      region: "auto",
    },
  });

  const tbl = await db.createEmptyTable("content", contentSchema, {
    mode: "create",
    existOk: true,
  });

  let docs: Record<string, string>[] = []; // temporary buffer so we don't block all the time
  const MAX_BUFFER_SIZE = 100;
  for (const utterance of utterances) {
    const { heading, content, url } = utterance;
    docs.push({
      heading,
      text: content,
      url,
    });

    if (docs.length >= MAX_BUFFER_SIZE) {
      console.log(`adding ${docs.length} documents`);
      await tbl.add(docs);
      docs = [];
    }
  }

  if (docs.length !== 0) {
    console.log(`adding ${docs.length} documents`);
    await tbl.add(docs);
  }

  await tbl.createIndex("vector");

  const query = "Tigris";
  const actual = await tbl.search(query).limit(10).toArray();
  console.log(
    actual.map(({ url, heading, text }) => {
      return { url, heading, text };
    }),
  );
})();

Example: Docs search

We've created an example app that lets you search the Tigris docs using LanceDB and the OpenAI embedding model at tigrisdata-community/docs-search-example. To set this up you need the following:

An API key for OpenAI
A Tigris account and bucket
A Tigris access keypair
Node.js installed

Clone the repository to your local machine:

git clone https://github.com/tigrisdata-community/docs-search-example
cd docs-search-example

Install all of the NPM dependencies:

npm ci

Then clone the blog and documentation repositories:

cd var
git clone https://github.com/tigrisdata/tigris-blog
git clone https://github.com/tigrisdata/tigris-os-docs
cd ..

Set your OpenAI API key and Tigris credentials in your environment:

export OPENAI_API_KEY=sk-hunter2-hunter2hunter2hunter2
export AWS_ACCESS_KEY_ID=tid_AzureDiamond
export AWS_SECRET_ACCESS_KEY=tid_hunter2hunter2hunter2
export AWS_ENDPOINT_URL_S3=https://t3.storage.dev
export AWS_REGION=auto
export BUCKET_NAME=your-bucket-name-here

Make sure to replace the secrets with your keys!

Ingest the docs with ingest.ts:

npx tsx ingest.ts

Then you can run the server with node:

node index.js

Then open http://localhost:3000 and search for whatever you want to know!

Next steps

The following is left as an exercise for the reader:

The markdown chunkify function doesn't properly handle Markdown front matter. Try adding support for it using the gray-matter package.
Try integrating this with an AI model by passing the user query through LanceDB to get a list of candidate documents, insert the document details, and then see how it changes the results of your model.

What else can you do with this database? The cloud's the limit!

Getting Started​

Ingesting files​

Example: Docs search​

Next steps​

Getting Started

Ingesting files

Example: Docs search

Next steps