Connect to your LanceDB Enterprise deployment, define a UDF, and run a distributed
backfill — all from a notebook or a script. No cluster setup required.
Installation
Geneva is published on PyPI. Install the latest stable
release with uv (recommended) or pip. Newer pre-release
builds with the latest features are also available on LanceDB’s Fury indexes — see
Pre-release builds below.
Prerequisites
- Python 3.10+
- uv (recommended) or
pip
Install the latest stable release
uv pip install --upgrade geneva
Verify
python -c "import geneva; print(geneva.__version__)"
Pre-release builds
To pick up the newest features ahead of a stable release, install a pre-release from LanceDB’s
Fury indexes. Geneva and its dependencies are published across two indexes:
| Package | Index |
|---|
geneva, lancedb | https://pypi.fury.io/lancedb/ |
pylance | https://pypi.fury.io/lance-format/ |
uv pip install --pre --upgrade \
--extra-index-url https://pypi.fury.io/lancedb/ \
--extra-index-url https://pypi.fury.io/lance-format \
--index-strategy unsafe-best-match \
geneva
The --index-strategy unsafe-best-match flag is required with uv. By default, uv only
considers package versions from the first index that lists a given package (PyPI). Since
geneva and pylance also appear on PyPI, this flag tells uv to pick the best match across
all indexes.
Quickstart
import os
import geneva
import pyarrow as pa
# Connect to LanceDB Enterprise
db = geneva.connect(
uri="db://my-db",
host_override=os.getenv("LANCEDB_URI", "http://localhost:10024"),
api_key=os.getenv("LANCEDB_API_KEY"),
)
tbl = db.open_table("my_table")
# Define a User Defined Function (UDF) that counts the words in the text column
@geneva.udf(data_type=pa.int32())
def word_count(text: str) -> int:
return len(text.split())
# Register the UDF as a new virtual column
tbl.add_columns({"word_count": word_count})
# Backfill the new column using distributed execution with incremental checkpointing
tbl.backfill("word_count")
Auto-backfill
With auto_backfill=True, LanceDB Enterprise recomputes the column for you whenever the
data or the UDF version changes — no explicit backfill() call needed (see
Backfilling).
# Change the column to use a new UDF version with auto-backfill enabled
@geneva.udf(data_type=pa.int32(), auto_backfill=True)
def word_count(text: str) -> int:
return len(text.split())
tbl.alter_columns({"path": "word_count", "udf": word_count})
# Add new rows. word_count is computed automatically in the background.
tbl.add([{"text": "hello world"}])
Materialized views and chunkers
A materialized view applies UDFs over a query and
refreshes incrementally. A chunker view expands each source
row into many rows (1:N) — useful for splitting documents, videos, or images.
# Materialized view: a query with UDF-computed columns, refreshed incrementally
query = tbl.search(None).select({"text": "text", "word_count": word_count})
view = db.create_materialized_view("my_view", query)
view.refresh()
# Chunker view: 1:N row expansion — split each row's text into one row per word
from typing import Iterator, NamedTuple
class Chunk(NamedTuple):
chunk_index: int
chunk_text: str
@geneva.chunker
def split_text(text: str) -> Iterator[Chunk]:
for i, word in enumerate(text.split()):
yield Chunk(chunk_index=i, chunk_text=word)
chunks = db.create_udtf_view(
"my_chunks",
source=tbl.search(None).select(["text"]),
udtf=split_text,
)
chunks.refresh()
Connecting to object storage or a local filesystem
Geneva can also run directly against cloud object storage or a local path. In this mode, jobs run on a
distributed execution context you provide.
# Cloud object storage (S3, GCS, Azure, or any S3-compatible object store)
db = geneva.connect("s3://my-bucket/my-database")
# Local filesystem
db = geneva.connect("/path/to/my-database")