ips/docs/ai/2025-11-12-redb-search-index-simplified.md at 7a3373a17de0aa0445aafad9123562b8e1625a5b

mirror of https://codeberg.org/Toasterson/ips.git synced 2026-04-10 13:20:42 +00:00

Till Wegmueller 6be608164d

Add detailed documentation for Redb-based IPS Search Index schema and encoding plans

- Introduced a series of planning documents detailing the Redb-based IPS index design, schema specification, and encoding strategies.
- Added a high-level overview of the core search index schema and the use of Redb MVCC transactions for consistency and performance improvements.
- Documented simplified schema definitions avoiding optional elements, focusing on compact encodings.
- Defined transitions to postcard-encoded binary formats, aligning with Rust’s serde for standardized serialization.
- Outlined migration strategies, invariants, error handling, and testing plans for index adoption.
- Enhanced documentation with structured explanations for developers to implement, extend, and migrate seamlessly to the new index model.

2025-12-08 20:10:04 +01:00


Title: Redb-based IPS Search Index — Simplified Plan (MVCC-first, Rust-friendly)

Author: Junie (JetBrains AI Coding Agent)
Date: 2025-11-12
Status: Planning Document (supersedes complex/optional parts; implementation to follow)

What changed in this revision (why simpler)

Remove epoch partitioning and any indirection tables. Rely on redb MVCC and atomic write transactions for consistency. No more active_epoch flips.
Remove optional/aux tables and features for v1: no token_offsets, no fmri_offsets indirection, no string-intern tables, no chunking. Keep a single canonical encoding optimized for Rust parsing.
Keep feature parity with pkg5 using a mandatory mini delta index for fast updates. The fast path is always supported without schema toggles.
Keep encodings compact and deterministic: length-prefixed UTF-8 and LEB128 varints; no URL-quoting on disk.

Goals (unchanged intent, simplified mechanism)

Functional equivalence with pkg5 search results and fast update behavior.
Atomic multi-table updates via a single redb write transaction; readers get consistent snapshots automatically.
Minimal schema that is straightforward to implement, test, and migrate.

Minimal Database Schema (fixed, no epochs)

All tables live in one redb database file (e.g., index.redb). Schema versioning is done via a single meta record.

meta
- Key: "schema_version"
- Value: u32 (starts at 1)
fmri_catalog
- Purpose: dense id ↔ string mapping for fmris used in postings
- Keys/Values:
  - id_to_str: key = (0x00, u32 id) → value = utf8 fmri
  - str_to_id: key = (0x01, utf8 fmri) → value = u32 id
- Policy: ids are compact (0..N-1). Rebuilds may reassign ids; fast delta uses strings and is joined at query time.
postings
- Purpose: main inverted index token → grouped postings with per-manifest offsets
- Key: utf8 token (exact, unescaped)
- Value: binary blob encoded as:
  - at_group_count: varint
  - repeat at_group_count times:
    - action_type: len(varint) + utf8 bytes
    - sub_group_count: varint
    - repeat sub_group_count times:
      - subtype: len(varint) + utf8 bytes
      - fv_group_count: varint
      - repeat fv_group_count times:
        
        full_value: len(varint) + utf8 bytes
        
        pair_count: varint
        
        repeat pair_count times:
        
        fmri_id: u32 (LE)
        
        offsets_count: varint
        
        offsets_delta[varint; length=offsets_count] (d0=a0, di=a[i]-a[i-1])
- Notes: strictly increasing offsets; no compression or chunking in v1.
mini_delta
- Purpose: mandatory mini token index for fast updates (additions only). Mirrors postings schema but values may reference fmris by string to avoid id assignment.
- Key: utf8 token
- Value: binary blob encoded as:
  - at/sub/fv hierarchy identical to postings
  - pair_count: varint
  - repeat pair_count times:
    - fmri_str: len(varint) + utf8 bytes
    - offsets_count: varint
    - offsets_delta[varint]
- Rationale: keeps fast path independent of fmri_catalog churn; simplifies writer.
fast_add
- Purpose: set of fmri strings added since last rebuild
- Key: utf8 fmri
- Value: unit (empty) or timestamp string (optional)
fast_remove
- Purpose: set of fmri strings removed since last rebuild
- Key: utf8 fmri
- Value: unit (empty) or timestamp string (optional)
fmri_catalog_hash
- Purpose: preserve pkg5 integrity feature
- Key: 0x00
- Value: utf8 lowercase hex SHA-1 of sorted fmri strings currently in fmri_catalog (id_to_str)
locks (optional but simple)
- Key: "index_writer"
- Value: utf8 holder + timestamp (opaque)
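As an illustration of the innermost level of a postings value (pair_count, then fmri_id, offsets_count and offsets_delta per pair), here is a minimal sketch. The `Pair` type, the function names, and the use of u64 for offsets are assumptions made for the example; the plan does not fix an offset width.

```rust
/// One (fmri_id, offsets) pair as stored under a full_value group.
/// Hypothetical type; offsets are assumed to fit in u64.
pub struct Pair {
    pub fmri_id: u32,
    pub offsets: Vec<u64>,
}

/// Append `value` as an unsigned LEB128 varint.
fn write_varint(out: &mut Vec<u8>, mut value: u64) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80); // continuation bit: more bytes follow
    }
}

/// Encode pair_count followed by each pair.
/// Returns None if any offsets list is not strictly increasing.
pub fn encode_pairs(pairs: &[Pair]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    write_varint(&mut out, pairs.len() as u64);
    for pair in pairs {
        // fmri_id is a fixed-width little-endian u32, not a varint.
        out.extend_from_slice(&pair.fmri_id.to_le_bytes());
        write_varint(&mut out, pair.offsets.len() as u64);
        let mut prev: Option<u64> = None;
        for &off in &pair.offsets {
            let delta = match prev {
                None => off,                    // d0 = a0
                Some(p) if off > p => off - p,  // di = a[i] - a[i-1]
                Some(_) => return None,         // violates the ordering rule
            };
            write_varint(&mut out, delta);
            prev = Some(off);
        }
    }
    Some(out)
}
```

For example, a single pair with fmri_id 7 and offsets 10, 12, 300 encodes to ten bytes: one for pair_count, four for the id, one for offsets_count, and four for the deltas 10, 2 and 288.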

Operations (simplified)

3.1 Consistent reads

Open a read transaction. Query tables directly; snapshot isolation ensures consistency.
For token queries: merge postings[token] with mini_delta[token], then drop every hit whose fmri (resolved through fmri_catalog for postings, or taken directly from mini_delta) is present in fast_remove. Hits for fmris in fast_add need no extra step, because the mini_delta merge already includes them.
For listing fmris: list = ((all fmri_catalog ids → strings) ∪ fast_add) \ fast_remove; the integrity hash is read from fmri_catalog_hash.
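The token-query merge can be sketched on an in-memory model of one read snapshot. This is not redb code: `Snapshot` and `query` are hypothetical names, the tables are modeled with std collections holding already-decoded values, and the action_type/subtype/full_value hierarchy is flattened away to keep the example short.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Decoded stand-in for the tables visible in one read transaction.
#[derive(Default)]
pub struct Snapshot {
    /// fmri_catalog id_to_str: the vector index is the dense id.
    pub id_to_str: Vec<String>,
    /// postings: token -> (fmri_id, absolute offsets).
    pub postings: BTreeMap<String, Vec<(u32, Vec<u64>)>>,
    /// mini_delta: token -> (fmri string, absolute offsets).
    pub mini_delta: BTreeMap<String, Vec<(String, Vec<u64>)>>,
    /// fast_remove: fmris removed since the last rebuild.
    pub fast_remove: BTreeSet<String>,
}

/// Return fmri -> sorted offsets for `token`.
pub fn query(snap: &Snapshot, token: &str) -> BTreeMap<String, Vec<u64>> {
    let mut hits: BTreeMap<String, Vec<u64>> = BTreeMap::new();
    // Main index: resolve ids to strings through the catalog.
    if let Some(pairs) = snap.postings.get(token) {
        for (id, offsets) in pairs {
            if let Some(fmri) = snap.id_to_str.get(*id as usize) {
                hits.entry(fmri.clone()).or_default().extend(offsets);
            }
        }
    }
    // Delta index: entries already carry fmri strings.
    if let Some(pairs) = snap.mini_delta.get(token) {
        for (fmri, offsets) in pairs {
            hits.entry(fmri.clone()).or_default().extend(offsets);
        }
    }
    // Subtract everything recorded in fast_remove.
    hits.retain(|fmri, _| !snap.fast_remove.contains(fmri));
    for offsets in hits.values_mut() {
        offsets.sort_unstable();
        offsets.dedup();
    }
    hits
}
```

In this sketch a postings hit whose id is missing from the catalog is silently skipped; a real reader would more likely report it as a MissingFmriId error.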

3.2 Fast update (client/server small change sets)

Start write txn.
For each added fmri: insert fmri string into fast_add, remove from fast_remove if present. Append/merge token postings for that fmri into mini_delta at the token granularity.
For each removed fmri: insert fmri string into fast_remove, remove from fast_add if present. Optionally prune any mini_delta entries for that fmri.
Commit. Readers immediately see the delta via MVCC.
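The bookkeeping inside one fast-update transaction might look like the following in-memory sketch. `FastPath` and its methods are hypothetical, the tables are modeled with std collections, and pruning on removal (optional in the plan) is always performed here.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// In-memory stand-in for fast_add, fast_remove and mini_delta.
#[derive(Default)]
pub struct FastPath {
    pub fast_add: BTreeSet<String>,
    pub fast_remove: BTreeSet<String>,
    /// token -> fmri -> absolute offsets
    pub mini_delta: BTreeMap<String, BTreeMap<String, Vec<u64>>>,
}

impl FastPath {
    /// Record an added fmri together with its (token, offsets) postings.
    pub fn add_fmri(&mut self, fmri: &str, tokens: &[(&str, Vec<u64>)]) {
        // Keep fast_add and fast_remove disjoint.
        self.fast_remove.remove(fmri);
        self.fast_add.insert(fmri.to_string());
        for (token, offsets) in tokens {
            self.mini_delta
                .entry((*token).to_string())
                .or_default()
                .insert(fmri.to_string(), offsets.clone());
        }
    }

    /// Record a removed fmri and prune its mini_delta entries.
    pub fn remove_fmri(&mut self, fmri: &str) {
        self.fast_add.remove(fmri);
        self.fast_remove.insert(fmri.to_string());
        for per_fmri in self.mini_delta.values_mut() {
            per_fmri.remove(fmri);
        }
        // Drop tokens that no longer reference any fmri.
        self.mini_delta.retain(|_, per_fmri| !per_fmri.is_empty());
    }
}
```

In the real writer both methods would run inside a single write transaction, so readers see either all of a change set or none of it.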

3.3 Full rebuild

Start write txn.
Recompute fmri_catalog from manifests (dense ids), fmri_catalog_hash from catalog.
Rebuild postings from scratch (tokenize, group, encode).
Clear mini_delta, fast_add, fast_remove.
Commit. Done. No epoch flips necessary.
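The catalog step of a rebuild can be sketched as below. Assigning ids in sorted fmri order is an assumption made for the example (the plan only requires ids to be dense, 0..N-1); `Catalog` and `rebuild_catalog` are hypothetical names, and the hash computation is left out.

```rust
use std::collections::BTreeMap;

/// Both directions of the fmri_catalog mapping.
pub struct Catalog {
    /// id_to_str: the vector index is the dense id.
    pub id_to_str: Vec<String>,
    /// str_to_id: reverse lookup.
    pub str_to_id: BTreeMap<String, u32>,
}

/// Assign dense ids 0..N-1 to the given fmris.
/// Duplicates are collapsed; ids follow sorted string order (an assumption).
pub fn rebuild_catalog<I: IntoIterator<Item = String>>(fmris: I) -> Catalog {
    let mut id_to_str: Vec<String> = fmris.into_iter().collect();
    id_to_str.sort();
    id_to_str.dedup();
    let str_to_id = id_to_str
        .iter()
        .enumerate()
        .map(|(id, fmri)| (fmri.clone(), id as u32))
        .collect();
    Catalog { id_to_str, str_to_id }
}
```

Because a rebuild may reassign ids, anything that outlives the rebuild (such as mini_delta) has to reference fmris by string, which is what the schema above prescribes.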

Rust-friendly encoding rules

Strings: UTF-8 with varint length prefix; decode with a zero-copy slice + from_utf8.
Integers: LEB128 varints for counts and deltas; fmri_id stored as fixed u32 LE for fast decoding.
Hierarchy order is fixed and deterministic; all counts come before their sequences to enable single-pass decoding.
Keys are raw UTF-8 tokens/fmris; no URL-quoting required.
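A minimal sketch of the two primitives these rules rely on, unsigned LEB128 varints and length-prefixed UTF-8 strings, using only the standard library. The function names are illustrative and not part of the plan.

```rust
/// Append `value` as an unsigned LEB128 varint.
pub fn write_varint(out: &mut Vec<u8>, mut value: u64) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80); // continuation bit: more bytes follow
    }
}

/// Read a varint from the front of `input` and advance past it.
/// Returns None on truncated input or a value wider than 64 bits.
pub fn read_varint(input: &mut &[u8]) -> Option<u64> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        let data = *input;
        let (&byte, rest) = data.split_first()?;
        *input = rest;
        let low = u64::from(byte & 0x7f);
        // The tenth byte may only contribute the single remaining bit.
        if shift > 63 || (shift == 63 && low > 1) {
            return None;
        }
        value |= low << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
    }
}

/// Append a string as varint length + UTF-8 bytes.
pub fn write_str(out: &mut Vec<u8>, s: &str) {
    write_varint(out, s.len() as u64);
    out.extend_from_slice(s.as_bytes());
}

/// Read a length-prefixed string as a zero-copy slice of `input`.
pub fn read_str<'a>(input: &mut &'a [u8]) -> Option<&'a str> {
    let len = usize::try_from(read_varint(input)?).ok()?;
    let data: &'a [u8] = *input;
    if len > data.len() {
        return None;
    }
    let (bytes, rest) = data.split_at(len);
    *input = rest;
    std::str::from_utf8(bytes).ok()
}
```

Taking `&mut &[u8]` as the cursor lets each reader consume its bytes and leave the slice positioned at the next field, which matches the single-pass decoding the fixed hierarchy order is meant to enable.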

Invariants

postings references only valid fmri_id values present in fmri_catalog at the time of commit.
offsets decode to strictly increasing absolute positions.
fast_add ∩ fast_remove = ∅ (writers maintain disjointness).
mini_delta entries reference fmri strings; when a full rebuild commits, mini_delta MUST be empty.
fmri_catalog ids are 0..N-1 contiguous.
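Several of these invariants are cheap to check in code. The helpers below are a sketch with hypothetical names; they operate on already-decoded values (deltas as u64, ids as u32) rather than on raw table bytes.

```rust
use std::collections::BTreeSet;

/// Turn delta-encoded offsets back into absolute positions.
/// Returns None if the result would not be strictly increasing
/// (a zero delta after the first element) or would overflow u64.
pub fn absolute_offsets(deltas: &[u64]) -> Option<Vec<u64>> {
    let mut out = Vec::with_capacity(deltas.len());
    let mut prev: Option<u64> = None;
    for &delta in deltas {
        let next = match prev {
            None => delta, // d0 is the first absolute offset
            Some(_) if delta == 0 => return None,
            Some(p) => p.checked_add(delta)?,
        };
        out.push(next);
        prev = Some(next);
    }
    Some(out)
}

/// fast_add and fast_remove must never share an fmri.
pub fn sets_disjoint(fast_add: &BTreeSet<String>, fast_remove: &BTreeSet<String>) -> bool {
    fast_add.is_disjoint(fast_remove)
}

/// Catalog ids, iterated in ascending order, must be exactly 0..N-1.
pub fn ids_contiguous(ids: impl IntoIterator<Item = u32>) -> bool {
    ids.into_iter().enumerate().all(|(i, id)| id as usize == i)
}
```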

Migration (from pkg5 text index)

Step-by-step importer (single write txn):

Build fmri_catalog from manf_list.v1; compute fmri_catalog_hash from full_fmri_list.
Convert main_dict.ascii.v2 lines into postings values (URL-unquote tokens and full values; map pfmri_index → fmri_id; parse offsets and store delta-encoded).
Initialize mini_delta empty.
Copy fast_add/fast_remove lines into respective sets.
Commit. Result is immediately queryable and functionally equivalent.
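The URL-unquote step of the importer could be sketched as follows, assuming the legacy files use standard %XX percent-escapes. `url_unquote` is a hypothetical helper; unlike lenient decoders that pass malformed escapes through unchanged, this version rejects them so that a corrupt legacy index is detected during import.

```rust
/// Decode %XX escapes in `input`.
/// Returns None on a truncated or non-hex escape, or if the decoded
/// bytes are not valid UTF-8.
pub fn url_unquote(input: &str) -> Option<String> {
    let bytes = input.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' {
            // Exactly two hex digits must follow the percent sign.
            let hex = bytes.get(i + 1..i + 3)?;
            let hi = (hex[0] as char).to_digit(16)?;
            let lo = (hex[1] as char).to_digit(16)?;
            out.push((hi * 16 + lo) as u8);
            i += 3;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    String::from_utf8(out).ok()
}
```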

Error handling (per project guidelines)

Library code (libips): define specific error enums with thiserror and miette::Diagnostic derives; do not enable miette's fancy feature in the library. Errors include: SchemaMismatch, DecodeError, InvalidOffsets, MissingFmriId, TxnFailure.
Application crates: use miette::Result and attach helpful diagnostics.
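To show the shape of such an enum without pulling in external crates, here is a std-only stand-in. The real library code would derive thiserror::Error and miette::Diagnostic instead of hand-writing Display; the variant fields and the `check_schema_version` helper are hypothetical and chosen only for illustration.

```rust
use std::fmt;

/// Std-only stand-in for the libips index error enum.
/// Variant names come from the plan; the fields are assumptions.
#[derive(Debug, PartialEq)]
pub enum IndexError {
    SchemaMismatch { expected: u32, found: u32 },
    DecodeError { table: &'static str },
    InvalidOffsets { token: String },
    MissingFmriId { fmri_id: u32 },
    TxnFailure { reason: String },
}

impl fmt::Display for IndexError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::SchemaMismatch { expected, found } => {
                write!(f, "schema version {found} found, expected {expected}")
            }
            Self::DecodeError { table } => write!(f, "cannot decode value in table {table}"),
            Self::InvalidOffsets { token } => {
                write!(f, "offsets for token {token:?} are not strictly increasing")
            }
            Self::MissingFmriId { fmri_id } => {
                write!(f, "fmri id {fmri_id} is not in fmri_catalog")
            }
            Self::TxnFailure { reason } => write!(f, "transaction failed: {reason}"),
        }
    }
}

impl std::error::Error for IndexError {}

/// Compare the meta "schema_version" value with the version this
/// code understands (1, per the schema section).
pub fn check_schema_version(found: u32) -> Result<(), IndexError> {
    const EXPECTED: u32 = 1;
    if found == EXPECTED {
        Ok(())
    } else {
        Err(IndexError::SchemaMismatch { expected: EXPECTED, found })
    }
}
```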

Testing plan (focused for simplified schema)

Unit tests: encoder/decoder for postings and mini_delta, varint/delta round-trips, fmri_catalog id assignment.
Property tests: offsets strictly increasing after decode; random postings encode/decode equality.
Concurrency: readers during fast update and during rebuild; verify stable snapshots.
Migration: import sample pkg5 index; compare query results token-by-token with legacy.

Implementation roadmap (tight scope)

Phase A: Define codecs and table handles in libips::search::index; implement full rebuild writer + read path.
Phase B: Implement fast path (mini_delta + fast_add/remove) with merge logic; add pruning on removals.
Phase C: Importer from pkg5 files; verification utilities; basic GC (mini_delta cleanup is already enforced; no epochs to GC).

Appendix: Removed/Deferred items and rationale

Epochs: removed — redb’s atomic transactions and MVCC provide consistent multi-table updates without indirection.
token_offsets and fmri_offsets tables: removed — premature optimization; postings are single-key fetches. Can be reconsidered if profiling shows need.
Chunking/postings compression/interning: deferred — start with a straightforward encoding; add compression or interning only if performance data demands it.

This simplified plan keeps the same user-visible behavior as pkg5 with fewer moving parts, favoring redb’s built-in guarantees and a Rust-friendly, deterministic binary encoding.
