mirror of
https://codeberg.org/Toasterson/ips.git
synced 2026-04-10 13:20:42 +00:00
Add detailed documentation for Redb-based IPS Search Index schema and encoding plans
- Introduced a series of planning documents detailing the Redb-based IPS index design, schema specification, and encoding strategies.
- Added a high-level overview of the core search index schema and the use of Redb MVCC transactions for consistency and performance improvements.
- Documented simplified schema definitions avoiding optional elements, focusing on compact encodings.
- Defined transitions to postcard-encoded binary formats, aligning with Rust's serde for standardized serialization.
- Outlined migration strategies, invariants, error handling, and testing plans for index adoption.
- Enhanced documentation with structured explanations for developers to implement, extend, and migrate seamlessly to the new index model.
This commit is contained in:
parent
33dd228df7
commit
6be608164d
3 changed files with 668 additions and 0 deletions
294
doc/ai/2025-11-12-redb-search-index.md
Normal file
Title: Redb-based IPS Search Index — Design Plan and Format Specification
Author: Junie (JetBrains AI Coding Agent)
Date: 2025-11-12
Status: Planning Document (implementation to follow)

1. Motivation and Goals

- Provide a new on-disk search index format using the redb embedded database while remaining functionally equivalent to the pkg5 search index.
- Preserve the behavior and semantics of the existing client/server search pipelines (fast incremental updates vs. full rebuilds, a consistent reader view, and compatible results).
- Improve robustness and consistency guarantees by leveraging redb's ACID transactions and MVCC snapshots instead of ad-hoc multi-file versioning and migration.
- Keep storage efficient (delta/varint compression for offsets, dedup of common structures) and queries fast.
|
||||
|
||||
Functional equivalence targets (from pkg5 search index):
|
||||
- Inverted index from token → postings grouped by action type, subtype, full_value, with per-manifest offsets.
|
||||
- Ability to serve consistent reads while writer updates occur.
|
||||
- Fast incremental updates for small package change sets using append-only “fast add/remove” semantics without rebuilding main dictionaries.
|
||||
- Full rebuild path for large changes or inconsistencies.
|
||||
- A complete set of FMRIs represented by the index and an integrity hash equivalent to full_fmri_list + full_fmri_list.hash.
|
||||
- Ability to map between manifest ids and fmri strings (manf_list.v1 semantics).
|
||||
- Deduped representation for fmri offset sets (fmri_offsets.v1 semantics) and delta-compressed offsets.

2. Redb Background and Mapping Strategy

Redb is an embedded, crash-safe, ACID key-value store with MVCC snapshots:
- Readers operate on a consistent snapshot without blocking writers.
- Writers use transactions with atomic commit.
- Data is organized into named tables (with declared key/value types), and zero-copy reads are supported where possible.

We model each prior index "file" as one or more redb tables. We retain a global logical VERSION (called epoch) to preserve pkg5's contract and for interop with external tools if needed. However, consistency is provided by redb transactions; the epoch is a semantic marker rather than the mechanism.

3. Global Concepts and Conventions

- Epoch: A monotonically increasing u64 stored in a metadata table, indicating the current logical version of the index state visible to new readers. A rebuild creates a shadow epoch, fully populates it, and then atomically flips the active epoch pointer.
- Names and quoting: For tokens and full values that previously required URL-quoting, we store the unescaped UTF-8 token/value with a binary-safe key encoding. If any legacy wire format requires exact URL-quoted strings, we compute them at query time.
- Numeric encodings: Use compact varints (LEB128) for integer lists and delta-encoded offsets; redb values remain binary blobs with the structured framing described below.
- Compression: Optional lightweight compression for large value blobs (e.g., lz4-frame). The initial implementation can make this configurable per table.

4. Database Layout (Tables and Types)

4.1 meta (singleton)
- Purpose: Global metadata and pointers.
- Key: fixed strings (e.g., "active_epoch", "next_epoch").
- Value formats:
  - active_epoch: u64 (the epoch new readers should use).
  - last_full_rebuild_epoch: u64.
  - created_at / updated_at: RFC3339 timestamp strings.
  - schema_version: u32 for this redb schema.

4.2 epochs (catalog of epochs)
- Purpose: Track epoch lifecycle and allow safe garbage collection of old data.
- Key: u64 epoch.
- Value: struct:
  - state: enum { Building, Active, Retired }
  - built_from: Option<u64> (previous active epoch)
  - created_at, activated_at, retired_at: timestamps
  - stats: optional counts (tokens, postings bytes, fmris)

4.3 token_postings (partitioned by epoch)
- Purpose: Main inverted index; replaces main_dict.ascii.v2.
- Table name pattern: token_postings:e{epoch}
- Key: token_key
  - token_key encoding: a prefix byte 0x00 followed by the UTF-8 token bytes. A secondary composite layout is allowed, but [action_type, subtype, full_value] are not part of the key; see the value layout below.
- Value: postings blob encoded as:

```rust
// One token's postings grouped by action type, subtype, and full_value.
// Sizes use varints; strings are varint length-prefixed UTF-8.
struct PostingsValue {
    at_groups: Vec<AtGroup>,
}

struct AtGroup {
    action_type: String,
    sub_groups: Vec<SubGroup>,
}

struct SubGroup {
    subtype: String,
    fv_groups: Vec<FvGroup>,
}

struct FvGroup {
    full_value: String,
    // fmri_id -> offsets (delta-encoded, varint list).
    // Stored as a list of (fmri_id, offsets_delta) pairs for sparsity.
    pairs: Vec<(u32, Vec<u32>)>,
}
```

The entire blob may be lz4-compressed if its size exceeds a threshold.

Notes:
- This preserves the hierarchical grouping and the per-manifest offsets from pkg5.
- For very high-cardinality tokens, we may shard across multiple rows by introducing a chunk ordinal in the key (token_key + chunk_id). The chunking strategy can be added later without breaking the base schema; record chunk_count in a side table if used.

4.4 token_offsets (optional accelerator; replaces token_byte_offset.v1)
- Purpose: Random-access accelerator if token_postings values become very large. In redb we can usually fetch a single value by key, so this table is optional. If chunking is implemented, it stores per-token chunk byte ranges for fast partial reads if we adopt file-like storage.
- Table name: token_offsets:e{epoch}
- Key: token_key
- Value: struct { total_len: u32, chunks: Vec<(chunk_id: u16, offset: u32, len: u32)> }

4.5 fast_add and fast_remove (incremental logs)
- Purpose: Apply small client-side updates quickly without a full rebuild; replaces fast_add.v1 / fast_remove.v1.
- Table names: fast_add, fast_remove (not epoch-scoped; they represent deltas relative to the active epoch).
- Key: fmri string (UTF-8).
- Value: struct { when: timestamp, epoch_base: u64 } or unit (). Presence indicates membership.

Notes:
- On activation of a new epoch after a full rebuild, these logs are drained/cleared.
- When computing current state, queries union the active epoch's results with the fast_add set and subtract the fast_remove set.

4.6 fmri_catalog (full membership; replaces full_fmri_list)
- Purpose: Enumerate all FMRIs represented by the active index epoch.
- Table name: fmri_catalog:e{epoch}
- Key: u32 fmri_id (dense ids for compact postings); ids are stable within an epoch.
- Value: fmri string (UTF-8).

Auxiliary: fmri_lookup:e{epoch} maps fmri string → u32 id for convenience during rebuild.

4.7 fmri_catalog_hash (integrity; replaces full_fmri_list.hash)
- Table name: fmri_catalog_hash:e{epoch}
- Key: const 0x00
- Value: lowercase hex SHA-1 of the sorted fmri strings from fmri_catalog:e{epoch}.

4.8 fmri_offsets (dedup store; replaces fmri_offsets.v1 semantics)
- Purpose: Deduplicate common offset lists shared across FMRIs or tokens.
- Table name: fmri_offsets:e{epoch}
- Key: content hash (e.g., blake3 of the absolute-offsets encoding).
- Value: delta-encoded varint list of offsets; may be lz4-compressed. Optionally also store a small header with the original length/first few entries to detect corruption early.

Usage: token_postings FvGroup pairs may reference either inline offsets or an indirect reference by hash if the list exceeds a size threshold. To keep lookups simple initially, we will inline offsets; indirect storage is a future optimization.

4.9 locks (writer serialization)
- Table: locks
- Key: "index_writer"
- Value: struct { holder: string, since: timestamp }

Notes: While redb provides transactions, we still serialize rebuilds/updates to mirror the single-writer expectation of pkg5. On the server, publishing already serializes indexing; on the client, operations are serialized by image plan execution. We can also keep the historical file-based lock for external tools, but within redb this table suffices.

5. Operations

5.1 Consistent Reads
- Readers begin a read transaction and fetch meta.active_epoch to locate the epoch-scoped tables.
- Because redb is MVCC, the entire read uses a consistent snapshot of those tables; concurrent writers do not affect the transaction's view.
- Readers also fetch the fast_add and fast_remove sets to adjust responses. To avoid race conditions, readers read the fast_* tables in the same snapshot and can safely compute: effective_installed = (epoch state ∪ fast_add) \ fast_remove.
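
The effective-membership rule above can be sketched with plain set operations. A minimal illustration using std `HashSet` (table access is elided; the function name is ours, not part of the schema):

```rust
use std::collections::HashSet;

/// Compute the set of FMRIs visible to a reader:
/// (epoch state ∪ fast_add) \ fast_remove.
fn effective_installed(
    epoch_fmris: &HashSet<String>,
    fast_add: &HashSet<String>,
    fast_remove: &HashSet<String>,
) -> HashSet<String> {
    epoch_fmris
        .union(fast_add)                        // epoch state ∪ fast_add
        .filter(|f| !fast_remove.contains(*f))  // \ fast_remove
        .cloned()
        .collect()
}
```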

5.2 Fast Incremental Update

Trigger: A small number of fmris added/removed on the client, or a small publish window on the server.

Steps:
1) Start a write transaction.
2) Insert the fmri strings into the fast_add or fast_remove table accordingly (idempotent puts; remove from the opposite table if present).
3) Optionally update a small counter or watermark (e.g., meta.fast_delta_count).
4) Commit. The fast path makes no changes to token_postings or fmri_catalog.
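
Step 2's "remove from the opposite table if present" is what keeps fast_add and fast_remove disjoint. A minimal in-memory sketch of that rule, with the two tables modeled as sets (the struct and method names are illustrative):

```rust
use std::collections::HashSet;

struct FastDeltas {
    fast_add: HashSet<String>,
    fast_remove: HashSet<String>,
}

impl FastDeltas {
    /// Record an added FMRI; clears any pending removal so the sets stay disjoint.
    fn record_add(&mut self, fmri: &str) {
        self.fast_remove.remove(fmri);
        self.fast_add.insert(fmri.to_string()); // idempotent put
    }

    /// Record a removed FMRI; clears any pending addition.
    fn record_remove(&mut self, fmri: &str) {
        self.fast_add.remove(fmri);
        self.fast_remove.insert(fmri.to_string());
    }
}
```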

Query impact:
- During query evaluation, postings derived from the active epoch are merged with delta postings for the fast_* fmris. The initial implementation could simply filter the result set by membership: when returning matched manifests, subtract those in fast_remove, and include those in fast_add only if we maintain their postings in a small side cache. Two options:
  - A. Ultra-minimal: fast_* only modifies membership for fmri listing queries; token-based matches for newly added packages require a full rebuild or a small background mini-index.
  - B. Practical equivalence: maintain an auxiliary mini-index table mini_token_postings built incrementally for the fast_* fmris. This table mirrors the token_postings structure but covers only the delta set. During a query, we merge results from token_postings (epoch) and mini_token_postings, then subtract any fmris in fast_remove.

Chosen approach for equivalence: Option B (mini-token index), to match pkg5's behavior of answering queries without an immediate full rebuild.

5.3 Full Rebuild

Trigger: A large change set, first-time indexing, a detected inconsistency, or server bulk operations.

Steps:
1) Start a write transaction; allocate a new epoch E = meta.active_epoch + 1. Create epochs[E] = Building.
2) Compute fmri_catalog:eE and fmri_lookup:eE from the manifests; compute fmri_catalog_hash:eE.
3) Build token_postings:eE by scanning manifests, tokenizing, and aggregating postings with offsets. Apply delta-encoding and optional lz4 compression for large values.
4) Optionally create auxiliary indices such as token_offsets:eE and fmri_offsets:eE if enabled.
5) Set epochs[E] = Active and meta.active_epoch = E in the same commit; set epochs[prev] = Retired.
6) Clear mini_token_postings and fast_add/fast_remove within the same or a subsequent small transaction.

Crash safety:
- If the process crashes before active_epoch is flipped, readers continue to use the previous epoch.
- A subsequent startup task scans the epochs table; any epoch in Building state not referenced by meta.active_epoch is a GC candidate.

5.4 Garbage Collection and Compaction
- Retain the N most recent epochs for rollback/debugging (configurable). Periodically remove Retired epochs older than the retention window.
- Use redb's compaction/cleanup mechanisms as recommended upstream to reclaim space.

6. Data Encodings

6.1 Strings
- UTF-8 with a varint length prefix. Deduplicate common strings (action_type, subtype) via small intern tables if hot.

6.2 Integer Lists (Offsets)
- Store offsets as increasing u32 values; encode as delta varints: d0 = abs0, di = abs[i] - abs[i-1].
- For small lists (<= 4 items), inline in FvGroup; for larger lists, consider optional lz4 compression.
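
The delta-varint rule can be sketched as follows. This is a minimal, hypothetical codec using unsigned LEB128 over the deltas; the real implementation may differ:

```rust
/// Encode strictly increasing offsets as delta-encoded LEB128 varints.
fn encode_offsets(offsets: &[u32]) -> Vec<u8> {
    let mut out = Vec::new();
    let mut prev = 0u32;
    for (i, &abs) in offsets.iter().enumerate() {
        // d0 = abs0, di = abs[i] - abs[i-1]
        let mut d = if i == 0 { abs } else { abs - prev };
        prev = abs;
        loop {
            let byte = (d & 0x7f) as u8;
            d >>= 7;
            if d == 0 {
                out.push(byte);
                break;
            }
            out.push(byte | 0x80); // continuation bit
        }
    }
    out
}

/// Decode the varint deltas back to absolute offsets.
fn decode_offsets(bytes: &[u8]) -> Vec<u32> {
    let mut out = Vec::new();
    let (mut acc, mut shift, mut prev) = (0u32, 0u32, 0u32);
    for &b in bytes {
        acc |= ((b & 0x7f) as u32) << shift;
        if b & 0x80 == 0 {
            prev += acc; // first delta adds to 0, yielding abs0
            out.push(prev);
            acc = 0;
            shift = 0;
        } else {
            shift += 7;
        }
    }
    out
}
```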

6.3 Keys
- Use the plain UTF-8 token as the key. For optional chunking: key = token + 0x1F separator + u16 chunk_id (big-endian) to keep lexicographic locality.
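
A sketch of the chunked key layout (illustrative only; chunking is optional in this design):

```rust
/// Build a chunked token key: token bytes + 0x1F separator + big-endian u16 chunk id.
/// Big-endian keeps all chunks of one token lexicographically adjacent and ordered.
fn chunked_key(token: &str, chunk_id: u16) -> Vec<u8> {
    let mut key = Vec::with_capacity(token.len() + 3);
    key.extend_from_slice(token.as_bytes());
    key.push(0x1f); // unit-separator byte, below printable ASCII
    key.extend_from_slice(&chunk_id.to_be_bytes());
    key
}
```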

6.4 Checksums and Hashes
- fmri_catalog_hash:eE stores the SHA-1 hex of the sorted fmri strings to match pkg5 semantics.
- For internal dedup/validation, use blake3 for speed where the hash is not externally exposed.

7. Query Semantics and Equivalence

Token search path:
1) Read meta.active_epoch = E.
2) Look up token_postings:eE[token]. If chunked, merge all chunks.
3) Look up mini_token_postings[token] and merge into the same structure.
4) Apply the fast_remove filter by excluding any fmri ids whose fmri string appears in fast_remove.
5) If the query path needs to output fmri strings, map fmri_id → string via fmri_catalog:eE; for mini-index entries that reference fmri strings directly, assign transient ids or join by string.
6) Return groupings by action type, subtype, and full_value with manifest offsets, just like pkg5.
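
Steps 2-5 reduce to merging two per-token maps and applying a removal filter. A simplified sketch, with postings flattened to fmri string → offsets (the real grouped structure merges the same way at each level; the function name is illustrative):

```rust
use std::collections::{BTreeMap, HashSet};

/// Merge epoch postings (keyed by fmri_id) with mini-index postings
/// (keyed by fmri string), then drop anything listed in fast_remove.
fn merge_token_matches(
    epoch: &BTreeMap<u32, Vec<u32>>,   // fmri_id -> offsets
    catalog: &BTreeMap<u32, String>,   // fmri_id -> fmri string
    mini: &BTreeMap<String, Vec<u32>>, // fmri string -> offsets
    fast_remove: &HashSet<String>,
) -> BTreeMap<String, Vec<u32>> {
    let mut out = BTreeMap::new();
    for (id, offsets) in epoch {
        if let Some(fmri) = catalog.get(id) {
            if !fast_remove.contains(fmri) {
                out.insert(fmri.clone(), offsets.clone());
            }
        }
    }
    for (fmri, offsets) in mini {
        if !fast_remove.contains(fmri) {
            out.entry(fmri.clone()).or_default().extend(offsets);
        }
    }
    out
}
```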

Full fmri list queries:
- Return fmri_catalog:eE ∪ fast_add \ fast_remove, ensure sorted order if requested, and provide fmri_catalog_hash:eE (unchanged) for backward-compatibility features.

Consistency guarantees:
- Readers never block writers (MVCC) and always see a consistent epoch.
- Fast updates are visible immediately after their transaction commits.
- Epoch flips are atomic from the readers' perspective.

8. Migration Plan (from the existing pkg5-style on-disk index)

Inputs: The existing $ROOT/index text files as specified in pkg5's search.txt.

Steps:
1) Parse the existing files using a one-time importer tool in Rust.
2) Create the redb database at $ROOT/index.redb/ (a new directory), or reuse $ROOT/index with a different extension.
3) Initialize meta (schema_version=1) and create epoch E=1.
4) Fill fmri_catalog:e1 and fmri_lookup:e1 from manf_list.v1.
5) Convert main_dict.ascii.v2 and token_byte_offset.v1 into token_postings:e1 values. Honor the URL-unquoting/quoting rules.
6) Populate fmri_offsets:e1 if we choose indirect references (optional initially).
7) Compute fmri_catalog_hash:e1 from full_fmri_list or the reconstructed catalog.
8) Populate fast_add/fast_remove by copying the logs if needed, or prefer to start with an empty delta on first activation.
9) Activate epoch E and mark the old files as deprecated. Keep the old files read-only until confidence is achieved.

9. Error Handling and Recovery

- All writer operations use miette diagnostics in application crates and specific error types in libips (per repository guidelines). Errors include redb transaction failures, data corruption checks, invalid manifest encodings, and invariant violations.
- Validation steps:
  - Ensure fmri ids are contiguous in fmri_catalog:eE.
  - Verify that the SHA-1 hash matches the computed value.
  - Verify that postings reference valid fmri ids and that offset lists are strictly increasing after decoding.
  - For mini_token_postings, ensure the referenced fmri strings exist in fast_add or are about to be installed.
- Recovery:
  - If a fast update fails, roll back the transaction; retry or fall back to a full rebuild.
  - If a rebuild fails midway, the epoch stays in Building and is GC'd later; readers continue using the prior epoch.

10. Concurrency and Locking Details

- Single logical writer: Maintain a process-level lock (either an OS file lock at $ROOT/index.lock for compatibility, or the locks table in redb) to serialize rebuilds and bulk updates.
- Readers: No locks required; rely on MVCC.
- Writer never blocks readers: Guaranteed by redb's snapshot isolation.

11. Performance Considerations

- Postings value size thresholds: If a token's postings exceed N bytes, enable lz4 compression for the value.
- Hot string interning: Maintain two tiny epoch-scoped tables, action_types:eE and subtypes:eE, mapping string → u16 id to shrink postings.
- Streaming build: Build token_postings:eE in sorted token order to improve locality; batch commits.
- Mini-index size control: Evict older entries if the fast delta grows beyond a threshold; trigger a background full rebuild when the threshold is exceeded, mirroring the MAX_FAST_INDEXED_PKGS behavior.

12. Configuration Knobs

- retention_epochs: number of retired epochs to keep.
- compress_threshold_bytes: per-value threshold for lz4.
- enable_indirect_offsets: bool to use fmri_offsets:eE.
- fast_update_threshold: mirrors pkg5's MAX_FAST_INDEXED_PKGS; exceeding it triggers a full rebuild.

13. Invariants (must hold)

- meta.active_epoch points to an epochs entry in the Active state.
- All epoch-scoped tables for the active epoch exist and are self-consistent.
- fmri ids in token_postings:eE reference existing entries in fmri_catalog:eE.
- Offsets decode to strictly increasing absolute positions.
- fmri_catalog_hash:eE matches the sorted fmri list.
- fast_add ∩ fast_remove = ∅; writes maintain disjointness.
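
Two of these invariants are cheap to check mechanically after decoding. A hypothetical validation helper (the function names are illustrative, not part of the schema):

```rust
use std::collections::HashSet;

/// Offsets must decode to strictly increasing absolute positions.
fn offsets_strictly_increasing(offsets: &[u32]) -> bool {
    offsets.windows(2).all(|w| w[0] < w[1])
}

/// fast_add ∩ fast_remove must be empty.
fn fast_sets_disjoint(fast_add: &HashSet<String>, fast_remove: &HashSet<String>) -> bool {
    fast_add.is_disjoint(fast_remove)
}
```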

14. Testing Plan

- Unit tests for encoding/decoding of postings and delta offsets.
- Property tests: round-trip manifests → postings → decode invariants.
- Concurrency tests: readers during rebuilds and fast updates; ensure stable results.
- Migration tests: import from a sample pkg5 index and validate equivalence of query results.
- End-to-end tests using cargo xtask setup-test-env to seed sample images/repos, index them, and run search queries.

15. Implementation Roadmap (high level)

Phase 1
- Define the redb schemas and value codecs in libips::search::index.
- Implement the full rebuild writer and the read-only query path.
- Implement the fast_add/fast_remove tables without the mini-token index; gate features.

Phase 2
- Add mini_token_postings and the merge logic for full equivalence with pkg5 fast updates.
- Add the compression and string-interning optimizations.

Phase 3
- Migration tool from the pkg5 text index; verification utilities.
- Epoch GC and compaction tooling.

Appendix A: Mapping of pkg5 files to redb tables

- main_dict.ascii.v2 → token_postings:e{epoch}
- token_byte_offset.v1 → token_offsets:e{epoch} (optional)
- fast_add.v1 → fast_add
- fast_remove.v1 → fast_remove
- full_fmri_list → fmri_catalog:e{epoch}
- full_fmri_list.hash → fmri_catalog_hash:e{epoch}
- manf_list.v1 → fmri_catalog:e{epoch} + fmri_lookup:e{epoch}
- fmri_offsets.v1 → fmri_offsets:e{epoch} (optional; can be inlined initially)
- lock → external file lock or the locks table; writer serialization via single-writer discipline

This plan keeps feature parity with pkg5 while modernizing the storage and concurrency model using redb. It preserves all key capabilities and exposes a clear path for incremental adoption and migration.

220
docs/ai/2025-11-12-redb-search-index-postcard.md
Normal file
Title: Redb-based IPS Search Index — Postcard-Standardized Binary Encodings
Author: Junie (JetBrains AI Coding Agent)
Date: 2025-11-12
Status: Planning Document (encoding revision to standardize all blobs on postcard)

0. Motivation (what and why)

- Requirement: for any encoded binary blob, use the postcard format to standardize serialization and simplify parsing in Rust.
- Scope: Update the simplified redb index plan so that all structured values stored as redb value blobs are serialized with postcard via serde, replacing the custom ad-hoc varint framing proposed earlier.
- Benefits:
  - Consistent, well-tested serialization with minimal overhead and no_std compatibility.
  - Straightforward Rust implementation using #[derive(Serialize, Deserialize)].
  - Reduced hand-rolled codec surface area and fewer edge cases.

1. Affected schema elements (values become postcard-encoded)

- Keys remain raw UTF-8 strings as previously specified, to keep lookups simple. The change applies to values that carry structured data:
  - postings: token → postings groups with offsets (was: custom LEB128 + lengths; now: postcard of the Rust structs defined below).
  - mini_delta: token → delta postings entries (was: custom; now: postcard).
  - fast_add and fast_remove values: retain empty (unit) or timestamp string; optionally postcard-encode a small struct for uniformity (see below).
  - meta values: previously a plain u32 for schema_version. We will keep primitive numbers for trivial singletons but allow postcard-encoded structs if/when meta grows.
  - fmri_catalog_hash value: keep as a UTF-8 hex string (for interoperability with external tools); alternatively, add a postcard mirror key if needed later.

Schema version bump: Increment meta.schema_version from 1 → 2 to denote postcard adoption for the postings and mini_delta tables.

2. Rust data model (serde-friendly, postcard-serializable)

Use serde + postcard for all structured blobs. The types below precisely mirror the hierarchical structure used by pkg5 and the simplified plan, optimized for Rust parsing.

2.1 Common types

```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct OffsetList {
    // Absolute manifest offsets in strictly increasing order.
    pub offsets: Vec<u32>,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct PostingPairId {
    pub fmri_id: u32,
    pub positions: OffsetList,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct PostingPairStr {
    pub fmri_str: String,
    pub positions: OffsetList,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct FullValueGroupId {
    pub full_value: String,
    pub pairs: Vec<PostingPairId>,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct FullValueGroupStr {
    pub full_value: String,
    pub pairs: Vec<PostingPairStr>,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct SubtypeGroupId {
    pub subtype: String,
    pub full_values: Vec<FullValueGroupId>,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct SubtypeGroupStr {
    pub subtype: String,
    pub full_values: Vec<FullValueGroupStr>,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct ActionTypeGroupId {
    pub action_type: String,
    pub subtypes: Vec<SubtypeGroupId>,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct ActionTypeGroupStr {
    pub action_type: String,
    pub subtypes: Vec<SubtypeGroupStr>,
}

/// Postings value stored in the `postings` table (token → this), using fmri_id for compactness.
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct PostingsValueId {
    pub groups: Vec<ActionTypeGroupId>,
}

/// Postings value stored in the `mini_delta` table (token → this), using fmri_str for easy fast-path writes.
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct PostingsValueStr {
    pub groups: Vec<ActionTypeGroupStr>,
}
```

Notes:
- Offsets are absolute here for simplicity; postcard's compact varints plus delta-at-build-time remain viable as a pre-serialization optimization if desired. If we want to retain delta semantics, we can change `OffsetList` to store deltas and normalize at read time.
- We intentionally keep field names stable to benefit from serde's default representation with postcard.
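
If `OffsetList` is switched to delta storage, the conversion in both directions is straightforward. A sketch over plain `Vec<u32>` (the serde wrapper type is unchanged; function names are ours):

```rust
/// Convert strictly increasing absolute offsets to deltas (d0 = abs0).
fn to_deltas(absolute: &[u32]) -> Vec<u32> {
    absolute
        .iter()
        .scan(0u32, |prev, &abs| {
            let d = abs - *prev;
            *prev = abs;
            Some(d)
        })
        .collect()
}

/// Restore absolute offsets from deltas by prefix-summing.
fn to_absolute(deltas: &[u32]) -> Vec<u32> {
    deltas
        .iter()
        .scan(0u32, |acc, &d| {
            *acc += d;
            Some(*acc)
        })
        .collect()
}
```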

2.2 Optional uniformity for fast_add/fast_remove values

To fully standardize "any binary blob," we can encode fast_add/fast_remove values as postcard too, while keeping the keys as the fmri string:

```rust
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash, Default)]
pub struct FastMarker {
    // Optional metadata; empty means just a presence marker.
    pub timestamp_iso8601: Option<String>,
}
```

3. Table specifications with postcard encodings

- postings
  - Key: UTF-8 token
  - Value: postcard(PostingsValueId)
- mini_delta
  - Key: UTF-8 token
  - Value: postcard(PostingsValueStr)
- fast_add
  - Key: UTF-8 fmri
  - Value: unit (empty) OR postcard(FastMarker) if we decide to populate timestamps; readers must accept either for schema_version 2.
- fast_remove
  - Key: UTF-8 fmri
  - Value: unit (empty) OR postcard(FastMarker)
- fmri_catalog (unchanged)
  - id_to_str: key = (0x00, u32) → value = UTF-8 fmri (the string is not a "binary blob" and benefits from direct storage)
  - str_to_id: key = (0x01, UTF-8 fmri) → value = u32 id
- fmri_catalog_hash (unchanged)
  - Key: 0x00
  - Value: UTF-8 lowercase hex SHA-1
- meta
  - Key: "schema_version" → Value: u32 (set to 2)
  - Future composite meta records MAY be postcard-encoded structs.

4. Dependency guidance (per project error-handling/dependency policy)

Library crates (e.g., libips):

```toml
[dependencies]
serde = { version = "1", features = ["derive"] }
postcard = { version = "1", features = ["use-std"] }
thiserror = "1.0.50"
miette = "7.6.0"
tracing = "0.1.37"
```

Application crates keep their existing miette configuration (fancy in apps only) and add postcard/serde if they interact with the index directly.

5. Read/write rules (implementation notes)

- Writers
  - Build PostingsValueId/PostingsValueStr structures in memory, serialize with postcard::to_allocvec or postcard::to_stdvec, and store the result as the redb value bytes.
  - Enforce invariants: strictly increasing offsets, valid fmri_id references, disjoint fast_add/fast_remove.
- Readers
  - Fetch the value bytes from redb and decode with postcard::from_bytes into the matching struct.
  - Merge logic: postings[token] (Id) ∪ mini_delta[token] (Str), with joins to fmri_catalog and fast_add/fast_remove exactly as in the simplified plan.

Performance notes:
- Postcard uses compact varint-like encodings and typically yields sizes close to hand-rolled varints without the maintenance burden. If needed, we can pre-delta-encode offsets before serialization (store deltas in OffsetList) and restore absolute values on read.

6. Migration and backward compatibility

- schema_version bump: Set meta.schema_version = 2 when the index is built with postcard encodings for postings and mini_delta.
- Readers should support both versions during the transition:
  - If schema_version == 1: decode the custom ad-hoc blobs (legacy path).
  - If schema_version == 2: decode the postcard structs defined above.
- One-time converter: implement a small tool that opens the redb database, reads v1 blobs, converts them to structs, writes v2 values in a single atomic write transaction, updates schema_version, and clears any v1-only artifacts. This can live under xtask or a dedicated migration command.

7. Error handling (aligns with project guidelines)

- Define errors such as EncodeError, DecodeError, SchemaMismatch, InvalidOffsets, MissingFmriId, and TxnFailure in libips with thiserror + miette::Diagnostic derives.
- Wrap postcard errors with a transparent source in DecodeError/EncodeError.
- Application crates use miette::Result for convenience.

8. Testing plan updates

- Unit tests: round-trip postcard serialization for all structs; invariants on offsets; merge correctness (Id ∪ Str).
- Property tests: generate random postings structures and assert from_bytes(to_vec(x)) == x; ensure offsets remain strictly increasing after any delta/absolute conversions.
- Back-compat tests: v1 blobs decode → v2 structs encode → re-decode, with equality of semantic content.
- Concurrency: unchanged (rely on redb MVCC); ensure no partial writes by using single-transaction updates.
|
||||
|
||||
9. Implementation roadmap deltas
|
||||
|
||||
- Phase A (Postcard types):
|
||||
- Add the serde types above under libips::search::index::types.
|
||||
- Implement postcard encode/decode helpers in libips::search::index::codec.
|
||||
- Gate by schema_version; keep legacy decoder behind the same module for migration.
|
||||
|
||||
- Phase B (Writers/Readers):
|
||||
- Update full rebuild writer to produce PostingsValueId and write as postcard.
|
||||
- Update fast path to produce PostingsValueStr.
|
||||
- Update read path to decode postcard when schema_version == 2.
|
||||
|
||||
- Phase C (Migration Tooling):
|
||||
- Implement an xtask subcommand to migrate v1 → v2 in-place atomically.
|
||||
|
||||
Appendix: Rationale for keeping some values as plain strings or integers
|
||||
|
||||
- fmri_catalog values and fmri_catalog_hash are single UTF-8 strings and not multi-field blobs; storing them as raw strings avoids unnecessary serde overhead and maintains interop with tools that may read them directly.
- meta.schema_version is a simple u32; postcard would not add value here. If meta grows into a multi-field record, we will switch that specific value to postcard.
Summary
This revision standardizes all structured binary blobs on postcard while preserving the simplified schema, operations, and invariants. It reduces custom codec complexity, aligns with Rust best practices (serde), and provides a clear migration path via a schema_version bump and dual-decoder support during transition.
---

File: docs/ai/2025-11-12-redb-search-index-simplified.md (new file)
Title: Redb-based IPS Search Index — Simplified Plan (MVCC-first, Rust-friendly)
Author: Junie (JetBrains AI Coding Agent)
Date: 2025-11-12
Status: Planning Document (supersedes complex/optional parts; implementation to follow)
0. What changed in this revision (why simpler)
- Remove epoch partitioning and any indirection tables. Rely on redb MVCC and atomic write transactions for consistency. No more active_epoch flips.
- Remove optional/aux tables and features for v1: no token_offsets, no fmri_offsets indirection, no string-intern tables, no chunking. Keep a single canonical encoding optimized for Rust parsing.
- Keep feature parity with pkg5 using a mandatory mini delta index for fast updates. The fast path is always supported without schema toggles.
- Keep encodings compact and deterministic: length-prefixed UTF-8 and LEB128 varints; no URL-quoting on disk.
1. Goals (unchanged intent, simplified mechanism)
- Functional equivalence with pkg5 search results and fast update behavior.
- Atomic multi-table updates via a single redb write transaction; readers get consistent snapshots automatically.
- Minimal schema that is straightforward to implement, test, and migrate.
2. Minimal Database Schema (fixed, no epochs)
All tables live in one redb database file (e.g., index.redb). Schema versioning is done via a single meta record.
- meta
- Key: "schema_version"
- Value: u32 (starts at 1)
- fmri_catalog
- Purpose: dense id ↔ string mapping for fmris used in postings
- Keys/Values:
- id_to_str: key = (0x00, u32 id) → value = utf8 fmri
- str_to_id: key = (0x01, utf8 fmri) → value = u32 id
- Policy: ids are compact (0..N-1). Rebuilds may reassign ids; fast delta uses strings and is joined at query time.
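The dense-id policy can be sketched as an in-memory interner that a rebuild writer would fill before emitting the id_to_str and str_to_id key spaces in one pass (type and method names here are illustrative, not the libips API):

```rust
use std::collections::HashMap;

// Dense id assignment: ids are handed out contiguously (0..N-1) in
// insertion order, and both directions are kept in sync.
#[derive(Default)]
struct FmriCatalog {
    id_to_str: Vec<String>,          // index == id
    str_to_id: HashMap<String, u32>,
}

impl FmriCatalog {
    // Return the existing id for an fmri, or assign the next dense id.
    fn intern(&mut self, fmri: &str) -> u32 {
        if let Some(&id) = self.str_to_id.get(fmri) {
            return id;
        }
        let id = self.id_to_str.len() as u32;
        self.id_to_str.push(fmri.to_string());
        self.str_to_id.insert(fmri.to_string(), id);
        id
    }
}
```

Because a rebuild recreates this structure from scratch, ids may be reassigned between rebuilds, which is why the fast path avoids ids entirely.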
- postings
- Purpose: main inverted index token → grouped postings with per-manifest offsets
- Key: utf8 token (exact, unescaped)
- Value: binary blob encoded as:
- at_group_count: varint
- repeat at_group_count times:
- action_type: len(varint) + utf8 bytes
- sub_group_count: varint
- repeat sub_group_count times:
- subtype: len(varint) + utf8 bytes
- fv_group_count: varint
- repeat fv_group_count times:
- full_value: len(varint) + utf8 bytes
- pair_count: varint
- repeat pair_count times:
- fmri_id: u32 (LE)
- offsets_count: varint
- offsets_delta[varint; length=offsets_count] (d0=a0, di=a[i]-a[i-1])
- Notes: strictly increasing offsets; no compression or chunking in v1.
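The layout above can be exercised with a small std-only encoder. This sketch covers the single-group case (one action type, one subtype, one full value) to keep it short; function names are illustrative, not the libips API.

```rust
// LEB128 varint writer.
fn write_varint(buf: &mut Vec<u8>, mut v: u64) {
    loop {
        let b = (v & 0x7f) as u8;
        v >>= 7;
        if v == 0 {
            buf.push(b);
            break;
        }
        buf.push(b | 0x80); // continuation bit
    }
}

// Varint-length-prefixed UTF-8 string writer.
fn write_str(buf: &mut Vec<u8>, s: &str) {
    write_varint(buf, s.len() as u64);
    buf.extend_from_slice(s.as_bytes());
}

// LEB128 varint reader (advances `pos`).
fn read_varint(buf: &[u8], pos: &mut usize) -> u64 {
    let (mut v, mut shift) = (0u64, 0);
    loop {
        let b = buf[*pos];
        *pos += 1;
        v |= ((b & 0x7f) as u64) << shift;
        if b & 0x80 == 0 {
            return v;
        }
        shift += 7;
    }
}

// Length-prefixed string reader (advances `pos`).
fn read_str(buf: &[u8], pos: &mut usize) -> String {
    let len = read_varint(buf, pos) as usize;
    let s = std::str::from_utf8(&buf[*pos..*pos + len]).unwrap().to_string();
    *pos += len;
    s
}

// Encode one token's postings value for a single at/sub/fv chain with
// (fmri_id, absolute offsets) pairs; offsets are delta-encoded per the spec.
fn encode_postings(
    action_type: &str,
    subtype: &str,
    full_value: &str,
    pairs: &[(u32, Vec<u64>)],
) -> Vec<u8> {
    let mut buf = Vec::new();
    write_varint(&mut buf, 1); // at_group_count
    write_str(&mut buf, action_type);
    write_varint(&mut buf, 1); // sub_group_count
    write_str(&mut buf, subtype);
    write_varint(&mut buf, 1); // fv_group_count
    write_str(&mut buf, full_value);
    write_varint(&mut buf, pairs.len() as u64); // pair_count
    for (fmri_id, offsets) in pairs {
        buf.extend_from_slice(&fmri_id.to_le_bytes()); // fixed u32 LE
        write_varint(&mut buf, offsets.len() as u64);
        let mut prev = 0u64;
        for (i, &a) in offsets.iter().enumerate() {
            write_varint(&mut buf, if i == 0 { a } else { a - prev }); // d0=a0, di=ai-a(i-1)
            prev = a;
        }
    }
    buf
}
```

Every count precedes its sequence, so a decoder can consume the blob in a single forward pass, as section 4 requires.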
- mini_delta
- Purpose: mandatory mini token index for fast updates (additions only). Mirrors postings schema but values may reference fmris by string to avoid id assignment.
- Key: utf8 token
- Value: binary blob encoded as:
- at/sub/fv hierarchy identical to postings
- pair_count: varint
- repeat pair_count times:
- fmri_str: len(varint) + utf8 bytes
- offsets_count: varint
- offsets_delta[varint]
- Rationale: keeps fast path independent of fmri_catalog churn; simplifies writer.
- fast_add
- Purpose: set of fmri strings added since last rebuild
- Key: utf8 fmri
- Value: unit (empty) or timestamp string (optional)
- fast_remove
- Purpose: set of fmri strings removed since last rebuild
- Key: utf8 fmri
- Value: unit (empty) or timestamp string (optional)
- fmri_catalog_hash
- Purpose: preserve pkg5 integrity feature
- Key: 0x00
- Value: utf8 lowercase hex SHA-1 of sorted fmri strings currently in fmri_catalog (id_to_str)
- locks (optional but simple)
- Key: "index_writer"
- Value: utf8 holder + timestamp (opaque)
3. Operations (simplified)
3.1 Consistent reads
- Open a read transaction. Query tables directly; snapshot isolation ensures consistency.
- For token queries: merge postings[token] with mini_delta[token], mapping postings fmri_ids to strings via fmri_catalog (mini_delta already stores strings); then drop any hit whose fmri is in fast_remove. Hits for fmris in fast_add are already present in mini_delta, so the merge covers them and no separate union step is needed.
- For listing fmris: list = (all fmri_catalog ids → strings) ∪ fast_add \ fast_remove; integrity hash = fmri_catalog_hash.
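The read-side merge for one token can be sketched as below. Types and names are illustrative stand-ins for decoded table values, not the libips API; postings hits carry fmri ids, mini_delta hits carry fmri strings, and fast_remove filters the union.

```rust
use std::collections::{HashMap, HashSet};

// Merge main postings with mini_delta for one token, then apply fast_remove.
fn merge_token_hits(
    postings_hits: &[(u32, Vec<u64>)],  // (fmri_id, offsets) from `postings`
    catalog: &HashMap<u32, String>,     // fmri_catalog id_to_str
    delta_hits: &[(String, Vec<u64>)],  // (fmri_str, offsets) from `mini_delta`
    fast_remove: &HashSet<String>,
) -> HashMap<String, Vec<u64>> {
    let mut out: HashMap<String, Vec<u64>> = HashMap::new();
    for (id, offsets) in postings_hits {
        let fmri = &catalog[id]; // invariant: every fmri_id is in the catalog
        if !fast_remove.contains(fmri) {
            out.entry(fmri.clone()).or_default().extend(offsets);
        }
    }
    for (fmri, offsets) in delta_hits {
        if !fast_remove.contains(fmri) {
            out.entry(fmri.clone()).or_default().extend(offsets);
        }
    }
    out
}
```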
3.2 Fast update (client/server small change sets)
- Start write txn.
- For each added fmri: insert fmri string into fast_add, remove from fast_remove if present. Append/merge token postings for that fmri into mini_delta at the token granularity.
- For each removed fmri: insert fmri string into fast_remove, remove from fast_add if present. Optionally prune any mini_delta entries for that fmri.
- Commit. Readers immediately see the delta via MVCC.
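The writer-side bookkeeping for the two sets is symmetric: recording an fmri in one always clears it from the other, which is what maintains the fast_add ∩ fast_remove = ∅ invariant from section 5. A minimal in-memory sketch (the real code mutates redb tables inside the write transaction):

```rust
use std::collections::HashSet;

// Record an fmri as added since the last rebuild.
fn record_add(fast_add: &mut HashSet<String>, fast_remove: &mut HashSet<String>, fmri: &str) {
    fast_remove.remove(fmri); // keep the sets disjoint
    fast_add.insert(fmri.to_string());
}

// Record an fmri as removed since the last rebuild.
fn record_remove(fast_add: &mut HashSet<String>, fast_remove: &mut HashSet<String>, fmri: &str) {
    fast_add.remove(fmri); // keep the sets disjoint
    fast_remove.insert(fmri.to_string());
}
```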
3.3 Full rebuild
- Start write txn.
- Recompute fmri_catalog from manifests (dense ids), fmri_catalog_hash from catalog.
- Rebuild postings from scratch (tokenize, group, encode).
- Clear mini_delta, fast_add, fast_remove.
- Commit. Done. No epoch flips necessary.
4. Rust-friendly encoding rules
- Strings: UTF-8 with varint length prefix; decode with a zero-copy slice + from_utf8.
- Integers: LEB128 varints for counts and deltas; fmri_id stored as fixed u32 LE for fast decoding.
- Hierarchy order is fixed and deterministic; all counts come before their sequences to enable single-pass decoding.
- Keys are raw UTF-8 tokens/fmris; no URL-quoting required.
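The zero-copy string rule above can be sketched as a reader whose returned `&str` borrows directly from the value buffer, so no allocation happens on the hot decode path (the function name is illustrative):

```rust
// Decode a varint-length-prefixed UTF-8 string without copying: the
// returned &str borrows from `buf` for lifetime 'a.
fn read_len_prefixed<'a>(buf: &'a [u8], pos: &mut usize) -> Result<&'a str, std::str::Utf8Error> {
    // LEB128 varint length prefix.
    let (mut len, mut shift) = (0usize, 0);
    loop {
        let b = buf[*pos];
        *pos += 1;
        len |= ((b & 0x7f) as usize) << shift;
        if b & 0x80 == 0 {
            break;
        }
        shift += 7;
    }
    let s = std::str::from_utf8(&buf[*pos..*pos + len])?;
    *pos += len;
    Ok(s)
}
```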
5. Invariants
- postings references only valid fmri_id values present in fmri_catalog at the time of commit.
- offsets decode to strictly increasing absolute positions.
- fast_add ∩ fast_remove = ∅ (writers maintain disjointness).
- mini_delta entries reference fmri strings; when a full rebuild commits, mini_delta MUST be empty.
- fmri_catalog ids are 0..N-1 contiguous.
6. Migration (from pkg5 text index)
Step-by-step importer (single write txn):
- Build fmri_catalog from manf_list.v1; compute fmri_catalog_hash from full_fmri_list.
- Convert main_dict.ascii.v2 lines into postings values (URL-unquote tokens and full values; map pfmri_index → fmri_id; parse offsets and store delta-encoded).
- Initialize mini_delta empty.
- Copy fast_add/fast_remove lines into respective sets.
- Commit. Result is immediately queryable and functionally equivalent.
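The URL-unquoting step can be sketched as a minimal std-only percent-decoder; pkg5 dictionaries are ASCII with `%XX` escapes, which keeps the byte-indexed slicing safe. The real importer would likely lean on an existing crate such as percent-encoding.

```rust
// Minimal percent-decoder for ASCII input with %XX escapes.
// Malformed escapes (e.g. a trailing '%') are passed through verbatim.
fn url_unquote(s: &str) -> String {
    let bytes = s.as_bytes();
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == b'%' && i + 2 < bytes.len() {
            if let Ok(b) = u8::from_str_radix(&s[i + 1..i + 3], 16) {
                out.push(b);
                i += 3;
                continue;
            }
        }
        out.push(bytes[i]);
        i += 1;
    }
    String::from_utf8_lossy(&out).into_owned()
}
```

Once decoded, tokens and full values are stored raw: the redb keys are plain UTF-8, so this decoding happens exactly once, at import time.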
7. Error handling (per project guidelines)
- Library code (libips): define specific error enums with thiserror and miette::Diagnostic derives (no fancy feature in lib). Errors include: SchemaMismatch, DecodeError, InvalidOffsets, MissingFmriId, TxnFailure.
- Application crates: use miette::Result and attach helpful diagnostics.
8. Testing plan (focused for simplified schema)
- Unit tests: encoder/decoder for postings and mini_delta, varint/delta round-trips, fmri_catalog id assignment.
- Property tests: offsets strictly increasing after decode; random postings encode/decode equality.
- Concurrency: readers during fast update and during rebuild; verify stable snapshots.
- Migration: import sample pkg5 index; compare query results token-by-token with legacy.
9. Implementation roadmap (tight scope)
- Phase A: Define codecs and table handles in libips::search::index; implement full rebuild writer + read path.
- Phase B: Implement fast path (mini_delta + fast_add/remove) with merge logic; add pruning on removals.
- Phase C: Importer from pkg5 files; verification utilities; basic GC (mini_delta cleanup is already enforced; no epochs to GC).
Appendix: Removed/Deferred items and rationale
- Epochs: removed — redb’s atomic transactions and MVCC provide consistent multi-table updates without indirection.
- token_offsets and fmri_offsets tables: removed — premature optimization; postings are single-key fetches. Can be reconsidered if profiling shows need.
- Chunking/postings compression/interning: deferred — start with a straightforward encoding; add compression or interning only if performance data demands it.
This simplified plan keeps the same user-visible behavior as pkg5 with fewer moving parts, favoring redb’s built-in guarantees and a Rust-friendly, deterministic binary encoding.