Connect to your LanceDB Enterprise deployment, define a UDF, and run a distributed backfill — all from a notebook or a script. No cluster setup required.

Installation

Geneva is published on PyPI. Install the latest stable release with uv (recommended) or pip. Newer pre-release builds with the latest features are also available on LanceDB’s Fury indexes — see Pre-release builds below.

Prerequisites

  • Python 3.10+
  • uv (recommended) or pip

Install the latest stable release

uv pip install --upgrade geneva

Verify

python -c "import geneva; print(geneva.__version__)"
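The one-liner above fails with an ImportError when the package is missing. As a hedged alternative, the sketch below checks both prerequisites without importing Geneva. It uses only the standard library and assumes the installed distribution is named geneva, as in the install command. The helper name check_environment is hypothetical and not part of Geneva.

```python
import sys
from importlib import metadata

MIN_PYTHON = (3, 10)  # taken from the Prerequisites list above


def check_environment(package: str = "geneva") -> dict:
    """Report whether the interpreter and the package meet the prerequisites."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        installed = None  # the distribution is not installed in this environment
    return {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "python_version": ".".join(map(str, sys.version_info[:3])),
        "package_version": installed,
    }


if __name__ == "__main__":
    print(check_environment())
```

A package_version of None means the install step did not land in the interpreter that ran the check, which usually points at a different virtual environment.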

Pre-release builds

To pick up the newest features ahead of a stable release, install a pre-release from LanceDB’s Fury indexes. Geneva and its dependencies are published across two indexes:

Package            Index
geneva, lancedb    https://pypi.fury.io/lancedb/
pylance            https://pypi.fury.io/lance-format/

uv pip install --pre --upgrade \
  --extra-index-url https://pypi.fury.io/lancedb/ \
  --extra-index-url https://pypi.fury.io/lance-format/ \
  --index-strategy unsafe-best-match \
  geneva

The --index-strategy unsafe-best-match flag is required with uv. By default, uv only considers package versions from the first index that lists a given package (PyPI). Because geneva and pylance also appear on PyPI, this flag tells uv to pick the best match across all indexes.

Quickstart

import os
import geneva
import pyarrow as pa

# Connect to LanceDB Enterprise
db = geneva.connect(
    uri="db://my-db",
    host_override=os.getenv("LANCEDB_URI", "http://localhost:10024"),
    api_key=os.getenv("LANCEDB_API_KEY"),
)

tbl = db.open_table("my_table")

# Define a user-defined function (UDF) that counts the words in the text column
@geneva.udf(data_type=pa.int32())
def word_count(text: str) -> int:
    return len(text.split())

# Register the UDF as a new virtual column
tbl.add_columns({"word_count": word_count})

# Backfill the new column using distributed execution with incremental checkpointing
tbl.backfill("word_count")
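Before running a distributed backfill, it can help to exercise the UDF body as ordinary Python. The sketch below repeats the word-count logic without the decorator and adds a guard for missing text. The guard is an assumption: the source does not say whether the text column can contain nulls or how Geneva passes them to a UDF, and returning 0 is only one possible choice. The name word_count_plain is hypothetical.

```python
from typing import Optional


def word_count_plain(text: Optional[str]) -> int:
    """Same logic as the UDF body, plus a guard for missing text (an assumption)."""
    if text is None:
        return 0
    # str.split() with no arguments splits on runs of whitespace
    # and drops leading and trailing whitespace.
    return len(text.split())


samples = ["hello world", "  spaced   out  ", "", None]
counts = [word_count_plain(s) for s in samples]
print(counts)
```

Checking edge cases such as empty strings locally is cheaper than discovering them partway through a backfill.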

Auto-backfill

With auto_backfill=True, LanceDB Enterprise recomputes the column whenever the data or the UDF version changes, so no explicit backfill() call is needed (see Backfilling).

# Change the column to use a new UDF version with auto-backfill enabled
@geneva.udf(data_type=pa.int32(), auto_backfill=True)
def word_count(text: str) -> int:
    return len(text.split())

tbl.alter_columns({"path": "word_count", "udf": word_count})

# Add new rows. word_count is computed automatically in the background.
tbl.add([{"text": "hello world"}])

Materialized views and chunkers

A materialized view applies UDFs over a query and refreshes incrementally. A chunker view expands each source row into many rows (1:N), which is useful for splitting documents, videos, or images.

# Materialized view: a query with UDF-computed columns, refreshed incrementally
query = tbl.search(None).select({"text": "text", "word_count": word_count})
view = db.create_materialized_view("my_view", query)
view.refresh()

# Chunker view: 1:N row expansion — split each row's text into one row per word
from typing import Iterator, NamedTuple

class Chunk(NamedTuple):
    chunk_index: int
    chunk_text: str

@geneva.chunker
def split_text(text: str) -> Iterator[Chunk]:
    for i, word in enumerate(text.split()):
        yield Chunk(chunk_index=i, chunk_text=word)

chunks = db.create_udtf_view(
    "my_chunks",
    source=tbl.search(None).select(["text"]),
    udtf=split_text,
)
chunks.refresh()
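To make the 1:N expansion concrete, the sketch below runs the same generator body as plain Python, without the @geneva.chunker decorator or a database connection. It only illustrates which rows one source value would yield. How Geneva names and stores the resulting columns in the view is not shown in the source and is not assumed here. The name split_text_plain is hypothetical.

```python
from typing import Iterator, NamedTuple


class Chunk(NamedTuple):
    chunk_index: int
    chunk_text: str


def split_text_plain(text: str) -> Iterator[Chunk]:
    """Yield one Chunk per whitespace-separated word, numbered from 0."""
    for i, word in enumerate(text.split()):
        yield Chunk(chunk_index=i, chunk_text=word)


# One source value expands into three output rows.
rows = list(split_text_plain("hello big world"))
for row in rows:
    print(row.chunk_index, row.chunk_text)
```

An empty string yields no chunks at all, so a source row with empty text would contribute nothing under this logic.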

Connecting to object storage or a local filesystem

Geneva can also run directly against cloud object storage or a local path. In this mode, jobs run on a distributed execution context you provide.

# Cloud object storage (S3, GCS, Azure, or any S3-compatible object store)
db = geneva.connect("s3://my-bucket/my-database")

# Local filesystem
db = geneva.connect("/path/to/my-database")
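A script that should work against all three targets can build its connection arguments from a single string. The sketch below is one possible approach, not part of Geneva: connect_kwargs is a hypothetical helper, and it relies only on the uri, host_override, and api_key arguments and the environment variables shown in the Quickstart. It assumes every non-db:// target needs nothing beyond the URI, which may not hold for object stores that require extra credentials or options.

```python
import os


def connect_kwargs(target: str) -> dict:
    """Build keyword arguments for geneva.connect from a target string."""
    if target.startswith("db://"):
        # LanceDB Enterprise: same arguments as in the Quickstart.
        return {
            "uri": target,
            "host_override": os.getenv("LANCEDB_URI", "http://localhost:10024"),
            "api_key": os.getenv("LANCEDB_API_KEY"),
        }
    # Object storage or a local path: only the URI is passed (an assumption).
    return {"uri": target}


# Usage, with geneva installed:
#   import geneva
#   db = geneva.connect(**connect_kwargs("s3://my-bucket/my-database"))
print(connect_kwargs("/path/to/my-database"))
```

Keeping the branching in one helper means the rest of the script can call geneva.connect the same way regardless of where the data lives.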