LanceDB

The rig-lancedb crate implements Rig’s vector store interface on top of LanceDB, a serverless vector database built on Apache Arrow. It stores embeddings in a columnar format and runs either embedded on local disk or against cloud object storage (S3, GCS, Azure).

Setup

[dependencies]
rig = "0.39.0"
rig-lancedb = "0.39.0"
lancedb = "0.30"
tokio = { version = "1", features = ["full"] }

Connecting

lancedb::connect opens (or creates) a database at a URI. Use a filesystem path for local storage, or an s3://, gs://, or az:// URI for cloud object storage.

// Local, on-disk store.
let db = lancedb::connect("data/lancedb-store").execute().await?;

// Cloud storage on S3 (see the LanceDB storage guide for IAM requirements).
let db = lancedb::connect("s3://my-lancedb-bucket").execute().await?;

Table schema

LanceDB tables are typed. Each table needs an id column, your document columns, and a fixed-size list column for the embedding whose length matches your model’s dimensions.

use std::sync::Arc;
use lancedb::arrow::arrow_schema::{DataType, Field, Fields, Schema};

fn schema(dims: usize) -> Schema {
    Schema::new(Fields::from(vec![
        Field::new("id", DataType::Utf8, false),
        Field::new("definition", DataType::Utf8, false),
        Field::new(
            "embedding",
            DataType::FixedSizeList(
                Arc::new(Field::new("item", DataType::Float64, true)),
                dims as i32,
            ),
            false,
        ),
    ]))
}

You populate the table by converting your embedded documents into Arrow RecordBatches and calling LanceDB’s create_table / add. See the full local example in the repo for the batch-building code.
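Arrow stores a fixed-size list column as one contiguous child array, so every embedding must have exactly the declared length before it goes into a batch. The following is a minimal std-only sketch of that check and flattening step. The function name flatten_embeddings is hypothetical and is not part of lancedb or rig-lancedb; the real batch-building code in the repo example uses Arrow's array types.

```rust
/// Flattens per-document embeddings into one contiguous buffer, the layout
/// behind a fixed-size list column: row `i` occupies
/// `values[i * dims..(i + 1) * dims]`.
///
/// Returns an error naming the first row whose length differs from `dims`.
/// (Hypothetical helper for illustration only.)
fn flatten_embeddings(rows: &[Vec<f64>], dims: usize) -> Result<Vec<f64>, String> {
    let mut values = Vec::with_capacity(rows.len() * dims);
    for (i, row) in rows.iter().enumerate() {
        if row.len() != dims {
            return Err(format!(
                "row {i} has {} values, expected {dims}",
                row.len()
            ));
        }
        values.extend_from_slice(row);
    }
    Ok(values)
}
```

Running a check like this before building the batch turns a dimension mismatch (for example, after switching embedding models) into a clear error that names the offending row.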

Creating the index

LanceDbVectorIndex::new wraps a LanceDB table together with an embedding model, the id column name, and search parameters.

use rig_lancedb::{LanceDbVectorIndex, SearchParams};

let table = db.open_table("documents").execute().await?;

let index = LanceDbVectorIndex::new(
    table,
    model,          // an EmbeddingModel, used to embed queries
    "id",           // id column
    SearchParams::default(),
)
.await?;

Search parameters

SearchParams configures how queries run — the distance metric (Cosine, L2) and the number of candidates to consider. SearchParams::default() is a sensible starting point; see the docs.rs API for the full builder.
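To make the difference between the two metrics concrete, here is a small std-only sketch of the textbook definitions. The function names are hypothetical, and the code does not reproduce how LanceDB computes or scales distances internally.

```rust
/// Cosine distance: 1 minus the cosine of the angle between `a` and `b`.
/// Returns `None` for mismatched lengths or a zero-norm vector.
fn cosine_distance(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|y| y * y).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(1.0 - dot / (norm_a * norm_b))
}

/// Euclidean (L2) distance between `a` and `b`.
/// Returns `None` for mismatched lengths.
fn l2_distance(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let sum_sq: f64 = a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum();
    Some(sum_sq.sqrt())
}
```

Two vectors that point in the same direction but differ in length have a cosine distance of zero and a non-zero L2 distance. That is why the metric should match the embedding model: cosine ignores magnitude, while L2 does not.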

Index types

LanceDB supports two nearest-neighbor strategies:

IVF-PQ (Inverted File with Product Quantization) — approximate search (ANN). Faster on large tables but approximate; creating an IVF-PQ index requires at least 256 rows.
Exact Nearest Neighbors (ENN) — exact results, slower. Good for small tables where an ANN index isn’t warranted.
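Conceptually, exact search is a linear scan: score every stored vector against the query and keep the closest ones. The std-only sketch below illustrates that idea with Euclidean distance. The function exact_top_n and its tuple layout are hypothetical; this is not how LanceDB implements its scan.

```rust
/// Exact nearest-neighbor search by linear scan.
///
/// Scores every stored vector against `query` with Euclidean distance and
/// returns the `n` closest as `(distance, id)` pairs, nearest first.
/// Rows whose length differs from the query are skipped.
/// (Hypothetical helper for illustration only.)
fn exact_top_n(query: &[f64], rows: &[(String, Vec<f64>)], n: usize) -> Vec<(f64, String)> {
    let mut scored: Vec<(f64, String)> = rows
        .iter()
        .filter(|(_, v)| v.len() == query.len())
        .map(|(id, v)| {
            let dist = v
                .iter()
                .zip(query)
                .map(|(x, y)| (x - y).powi(2))
                .sum::<f64>()
                .sqrt();
            (dist, id.clone())
        })
        .collect();
    // total_cmp gives a total order over f64, so the sort cannot panic on NaN.
    scored.sort_by(|a, b| a.0.total_cmp(&b.0));
    scored.truncate(n);
    scored
}
```

The cost grows linearly with the number of rows, which is why the scan suits small tables. An ANN index such as IVF-PQ avoids scoring every row, at the price of possibly missing some true neighbors.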

Querying

Build a VectorSearchRequest, the request type shared across Rig’s vector stores, and pass it to top_n:

use rig::vector_store::{VectorStoreIndex, VectorSearchRequest};

let req = VectorSearchRequest::builder()
    .query("search query")
    .samples(5)
    .build();

// `Document` is your own Deserialize type matching the table's document columns.
let results = index.top_n::<Document>(req).await?;
for (score, id, doc) in results {
    println!("{score:.3} {id}");
}

LanceDB also supports metadata filtering; pass a Filter on the request to narrow results (see Filters).

Tips

Distance metric — pick Cosine or L2 to match how your embeddings were produced.
Index size — keep local storage for smaller datasets; move to cloud storage (S3/GCS/Azure) for large-scale deployments.
ANN vs ENN — IVF-PQ trades a little accuracy for speed on large tables; ENN gives exact results on small ones.
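The ANN vs ENN tip can be written as a small decision helper. This is a sketch under stated assumptions: the 256-row minimum comes from the Index types section above, while the Strategy enum, the choose_strategy function, and the ann_threshold parameter are hypothetical names introduced here and not part of rig-lancedb.

```rust
/// Minimum row count for building an IVF-PQ index, as stated in the
/// "Index types" section.
const IVF_PQ_MIN_ROWS: usize = 256;

/// Hypothetical label for the two search strategies.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Strategy {
    /// Exact nearest neighbors (ENN).
    Exact,
    /// Approximate search backed by an IVF-PQ index.
    IvfPq,
}

/// Picks a strategy from the table's row count.
///
/// `ann_threshold` is an assumed tuning knob: the row count at which you
/// consider exact search too slow for your workload. IVF-PQ is never chosen
/// below `IVF_PQ_MIN_ROWS`, whatever the threshold says.
fn choose_strategy(rows: usize, ann_threshold: usize) -> Strategy {
    if rows >= ann_threshold.max(IVF_PQ_MIN_ROWS) {
        Strategy::IvfPq
    } else {
        Strategy::Exact
    }
}
```

The right threshold depends on embedding dimensions, hardware, and latency targets, so it is best measured on your own data and not hard-coded.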