LanceDB

The rig-lancedb crate backs Rig’s vector store with LanceDB, a serverless vector database built on Apache Arrow. It stores embeddings in a columnar format and runs either embedded on local disk or against cloud object storage (S3, GCS, Azure).

[dependencies]
rig = "0.39.0"
rig-lancedb = "0.39.0"
lancedb = "0.30"
tokio = { version = "1", features = ["full"] }

lancedb::connect opens (or creates) a database at a URI. Use a path for local storage or an s3:// / gs:// / az:// URI for cloud storage.

// Local, on-disk store.
let db = lancedb::connect("data/lancedb-store").execute().await?;
// Cloud storage on S3 (see the LanceDB storage guide for IAM requirements).
let db = lancedb::connect("s3://my-lancedb-bucket").execute().await?;
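
Because the storage backend is selected purely by the URI, it can be handy to classify a configured URI before connecting, for example to log where data will live. The following is a minimal sketch using only the standard library; `StorageKind` and `storage_kind` are hypothetical names for illustration and are not part of lancedb or rig-lancedb. It covers only the schemes mentioned above and treats everything else as a local path.

```rust
/// Where a LanceDB URI points. Hypothetical helper type, not a lancedb API.
#[derive(Debug, PartialEq, Eq)]
enum StorageKind {
    Local,
    S3,
    Gcs,
    Azure,
}

/// Classify a URI by its scheme prefix. Anything without one of the
/// cloud schemes listed above is assumed to be a local path.
fn storage_kind(uri: &str) -> StorageKind {
    if uri.starts_with("s3://") {
        StorageKind::S3
    } else if uri.starts_with("gs://") {
        StorageKind::Gcs
    } else if uri.starts_with("az://") {
        StorageKind::Azure
    } else {
        StorageKind::Local
    }
}
```

With this sketch, `storage_kind("data/lancedb-store")` yields `StorageKind::Local` and `storage_kind("s3://my-lancedb-bucket")` yields `StorageKind::S3`.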

LanceDB tables are typed. Each table needs an id column, your document columns, and a fixed-size list column for the embedding whose length matches your model’s dimensions.

use std::sync::Arc;
use lancedb::arrow::arrow_schema::{DataType, Field, Fields, Schema};

fn schema(dims: usize) -> Schema {
    Schema::new(Fields::from(vec![
        Field::new("id", DataType::Utf8, false),
        Field::new("definition", DataType::Utf8, false),
        Field::new(
            "embedding",
            DataType::FixedSizeList(
                Arc::new(Field::new("item", DataType::Float64, true)),
                dims as i32,
            ),
            false,
        ),
    ]))
}

Populate the table by converting your embedded documents into Arrow RecordBatches, then passing them to LanceDB’s create_table (for a new table) or add (to append to an existing one). See the full local example in the repo for the batch-building code.
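
A fixed-size list column stores its values as one contiguous buffer of rows × dims numbers, so every embedding must have exactly the schema's length before a batch is built. The sketch below shows only that preparation step with the standard library; `flatten_embeddings` is a hypothetical helper, and the actual Arrow array construction is left to the repo example.

```rust
/// Flatten per-document embeddings into one contiguous buffer, row after row,
/// which is the layout a fixed-size list column expects.
/// Returns an error naming the first row whose length differs from `dims`.
fn flatten_embeddings(rows: &[Vec<f64>], dims: usize) -> Result<Vec<f64>, String> {
    let mut flat = Vec::with_capacity(rows.len() * dims);
    for (i, row) in rows.iter().enumerate() {
        if row.len() != dims {
            return Err(format!(
                "row {i} has {} values, expected {dims}",
                row.len()
            ));
        }
        flat.extend_from_slice(row);
    }
    Ok(flat)
}
```

Checking lengths up front turns a dimension mismatch (for instance, after switching embedding models) into a readable error before any table write is attempted.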

LanceDbVectorIndex::new wraps a LanceDB table together with an embedding model, the id column name, and search parameters.

use rig_lancedb::{LanceDbVectorIndex, SearchParams};

let table = db.open_table("documents").execute().await?;
let index = LanceDbVectorIndex::new(
    table,
    model, // an EmbeddingModel, used to embed queries
    "id",  // id column
    SearchParams::default(),
)
.await?;

SearchParams configures how queries run — the distance metric (Cosine, L2) and the number of candidates to consider. SearchParams::default() is a sensible starting point; see the docs.rs API for the full builder.
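
To make the difference between the two metrics concrete, here is a standalone sketch of their textbook definitions. It does not use or reproduce LanceDB's implementation, which may scale or normalize values differently; the function names are illustrative only.

```rust
/// Euclidean (L2) distance: sensitive to both direction and magnitude.
fn l2_distance(a: &[f64], b: &[f64]) -> f64 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y).powi(2))
        .sum::<f64>()
        .sqrt()
}

/// Cosine distance (1 - cosine similarity): depends on direction only.
/// Returns 1.0 when either vector has zero norm, an arbitrary convention
/// chosen for this sketch.
fn cosine_distance(a: &[f64], b: &[f64]) -> f64 {
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return 1.0;
    }
    1.0 - dot / (norm_a * norm_b)
}
```

For `[1.0, 0.0]` and `[2.0, 0.0]`, the L2 distance is 1 while the cosine distance is 0: the vectors point the same way and differ only in length. This is why the metric should match how the embedding model was trained and whether its outputs are normalized.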

LanceDB supports two nearest-neighbor strategies:

  • IVF-PQ (Inverted File with Product Quantization) — approximate search (ANN). Faster on large tables but approximate; creating an IVF-PQ index requires at least 256 rows.
  • Exact Nearest Neighbors (ENN) — exact results, slower. Good for small tables where an ANN index isn’t warranted.
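
Conceptually, exact search scores every row against the query and keeps the closest ones, which is why its cost grows linearly with table size. The following standard-library sketch illustrates that idea only; it is not how LanceDB executes queries, and `exact_top_n` is a hypothetical name.

```rust
/// Squared Euclidean distance. It is monotonic in the L2 distance,
/// so ranking by it gives the same order without the square root.
fn squared_l2(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

/// Exact nearest neighbors by brute force: score every row, sort, keep `n`.
/// Returns (distance, row index) pairs, closest first.
fn exact_top_n(query: &[f64], rows: &[Vec<f64>], n: usize) -> Vec<(f64, usize)> {
    let mut scored: Vec<(f64, usize)> = rows
        .iter()
        .enumerate()
        .map(|(i, row)| (squared_l2(query, row), i))
        .collect();
    // total_cmp is a total order on f64, so a NaN cannot panic the sort.
    scored.sort_by(|a, b| a.0.total_cmp(&b.0));
    scored.truncate(n);
    scored
}
```

An ANN index such as IVF-PQ avoids this full scan by searching only a subset of partitions over compressed vectors, which is where both the speedup and the approximation come from.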

Queries use the shared VectorSearchRequest and top_n:

use rig::vector_store::{VectorStoreIndex, VectorSearchRequest};

let req = VectorSearchRequest::builder()
    .query("search query")
    .samples(5)
    .build();

// `Document` is your own deserializable type matching the table's columns.
let results = index.top_n::<Document>(req).await?;
for (score, id, _doc) in results {
    println!("{score:.3} {id}");
}

LanceDB also supports metadata filtering; pass a Filter on the request to narrow results (see Filters).

  • Distance metric — pick Cosine or L2 to match how your embeddings were produced.
  • Index size — keep local storage for smaller datasets; move to cloud storage (S3/GCS/Azure) for large-scale deployments.
  • ANN vs ENN — IVF-PQ trades a little accuracy for speed on large tables; ENN gives exact results on small ones.
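
The ANN-versus-ENN guidance can be written down as a small decision helper. This is a sketch under stated assumptions: the 256-row minimum comes from the text above, while `ann_threshold` (the table size at which you consider an index worthwhile) is a hypothetical tuning knob you would pick from your own measurements. None of these names exist in rig-lancedb.

```rust
/// Search strategy to use for a table. Hypothetical type for illustration.
#[derive(Debug, PartialEq, Eq)]
enum SearchStrategy {
    Exact,
    IvfPq,
}

/// Minimum row count for creating an IVF-PQ index, as stated above.
const MIN_IVF_PQ_ROWS: usize = 256;

/// Pick IVF-PQ only when the table meets both the hard minimum and the
/// caller's own threshold; otherwise fall back to exact search.
fn choose_strategy(row_count: usize, ann_threshold: usize) -> SearchStrategy {
    if row_count >= ann_threshold.max(MIN_IVF_PQ_ROWS) {
        SearchStrategy::IvfPq
    } else {
        SearchStrategy::Exact
    }
}
```

With a threshold of 100,000 rows, a 5,000-row table stays on exact search and a 1,000,000-row table switches to IVF-PQ; a table under 256 rows always uses exact search regardless of the threshold.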