Add detailed documentation for Redb-based IPS Search Index schema and encoding plans

- Introduced a series of planning documents detailing the Redb-based IPS index design, schema specification, and encoding strategies.
- Added a high-level overview of the core search index schema and the use of Redb MVCC transactions for consistency and performance improvements.
- Documented simplified schema definitions avoiding optional elements, focusing on compact encodings.
- Defined transitions to postcard-encoded binary formats, aligning with Rust’s serde for standardized serialization.
- Outlined migration strategies, invariants, error handling, and testing plans for index adoption.
- Enhanced documentation with structured explanations for developers to implement, extend, and migrate seamlessly to the new index model.
Till Wegmueller 2025-12-08 20:10:04 +01:00
parent 33dd228df7
commit 6be608164d
3 changed files with 668 additions and 0 deletions

@ -0,0 +1,294 @@
Title: Redb-based IPS Search Index — Design Plan and Format Specification
Author: Junie (JetBrains AI Coding Agent)
Date: 2025-11-12
Status: Planning Document (implementation to follow)
1. Motivation and Goals
- Provide a new on-disk search index format using the redb embedded database while remaining functionally equivalent to the pkg5 search index.
- Preserve behavior and semantics of the existing client/server search pipelines (fast incremental updates vs. full rebuilds, consistent reader view, and compatibility of results).
- Improve robustness and consistency guarantees by leveraging redb's ACID transactions and MVCC snapshots instead of ad-hoc multi-file versioning and migration.
- Keep storage efficient (delta/varint compression for offsets, dedup of common structures) and query-time fast.
Functional equivalence targets (from pkg5 search index):
- Inverted index from token → postings grouped by action type, subtype, full_value, with per-manifest offsets.
- Ability to serve consistent reads while writer updates occur.
- Fast incremental updates for small package change sets using append-only “fast add/remove” semantics without rebuilding main dictionaries.
- Full rebuild path for large changes or inconsistencies.
- A complete set of FMRIs represented by the index and an integrity hash equivalent to full_fmri_list + full_fmri_list.hash.
- Ability to map between manifest ids and fmri strings (manf_list.v1 semantics).
- Deduped representation for fmri offset sets (fmri_offsets.v1 semantics) and delta-compressed offsets.
2. Redb Background and Mapping Strategy
Redb is an embedded, crash-safe, ACID key-value store with MVCC snapshots:
- Readers operate on a consistent snapshot without blocking writers.
- Writers use transactions with atomic commit.
- Data is organized into named tables (key/value types specified) and supports zero-copy reads where possible.
We model each prior index “file” as one or more redb tables. We retain a global logical VERSION (called epoch) to preserve pkg5's contract and for interop with external tools if needed. However, consistency is provided by redb transactions; the epoch is a semantic marker rather than the mechanism.
3. Global Concepts and Conventions
- Epoch: A monotonically increasing u64 stored in a metadata table indicating the current logical version of the index state visible to new readers. A rebuild creates a shadow epoch, fully populates it, and then atomically flips the active epoch pointer.
- Names and quoting: For tokens and full values that previously required URL-quoting, we store the unescaped UTF-8 token/value with a binary-safe key encoding. If any legacy wire format requires exact URL-quoted strings, we compute them at query time.
- Numeric encodings: Use compact varint (LEB128) for integer lists and delta-encoded offsets; redb values remain binary blobs with structured framing described below.
- Compression: Optional lightweight compression for large value blobs (e.g., lz4-frame). The initial implementation can make this configurable per-table.
4. Database Layout (Tables and Types)
4.1 meta (singleton)
- Purpose: Global metadata and pointers.
- Key: fixed strings (e.g., "active_epoch", "next_epoch").
- Value formats:
  - active_epoch: u64 (the epoch new readers should use).
  - last_full_rebuild_epoch: u64.
  - created_at / updated_at: RFC3339 timestamp strings.
  - schema_version: u32 for this redb schema.
4.2 epochs (catalog of epochs)
- Purpose: Track epoch lifecycle and allow safe garbage collection of old data.
- Key: u64 epoch.
- Value: struct:
  - state: enum { Building, Active, Retired }
  - built_from: Option<u64> (previous active epoch)
  - created_at, activated_at, retired_at: timestamps
  - stats: optional counts (tokens, postings bytes, fmris)
4.3 token_postings (partitioned by epoch)
- Purpose: Main inverted index; replaces main_dict.ascii.v2.
- Table name pattern: token_postings:e{epoch}
- Key: token_key
- token_key encoding: a prefix byte 0x00 followed by UTF-8 token bytes. The grouping fields [action_type, subtype, full_value] are not part of the key; they are carried in the value layout below.
- Value: postings blob encoded as:
struct PostingsValue {
    // One token's postings, grouped by action type, subtype, and full_value.
    // Sizes are varints; strings are UTF-8 with a varint length prefix.
    at_groups: Vec<AtGroup>,
}
struct AtGroup {
    action_type: String,
    sub_groups: Vec<SubGroup>,
}
struct SubGroup {
    subtype: String,
    fv_groups: Vec<FvGroup>,
}
struct FvGroup {
    full_value: String,
    // Each pair is (fmri_id, offsets_delta): the offsets for one fmri,
    // delta-encoded as a varint list.
    // Stored as a list of pairs for sparse storage.
    pairs: Vec<(u32, Vec<u32>)>,
}
// Entire blob may be compressed using lz4 if size exceeds threshold.
Notes:
- This preserves the hierarchical grouping and the per-manifest offsets from pkg5.
- For very high-cardinality tokens, we may shard across multiple rows by introducing a chunk ordinal in the key (token_key + chunk_id). The chunking strategy can be added later without breaking the base schema; record chunk_count in a side table if used.
4.4 token_offsets (optional accelerator; replaces token_byte_offset.v1)
- Purpose: Random access accelerator if token_postings values become very large. In redb we can often fetch the single value by key; thus this table is optional. If chunking is implemented, this stores per-token chunk byte ranges for fast partial reads if we adopt a file-like storage.
- Table name: token_offsets:e{epoch}
- Key: token_key
- Value: struct { total_len: u32, chunks: Vec<(chunk_id: u16, offset: u32, len: u32)> }
4.5 fast_add and fast_remove (incremental logs)
- Purpose: Apply small updates (typically client-side) quickly without a full rebuild; replaces fast_add.v1 / fast_remove.v1.
- Table names: fast_add, fast_remove (not epoch-scoped; they represent deltas relative to active epoch).
- Key: fmri string (UTF-8).
- Value: struct { when: timestamp, epoch_base: u64 } or unit (). Presence indicates membership.
Notes:
- On activation of a new epoch after full rebuild, these logs are drained/cleared.
- Queries union active epoch results with fast_add set and subtract fast_remove set when calculating current state.
4.6 fmri_catalog (full membership; replaces full_fmri_list)
- Purpose: Enumerate all FMRIs represented by the active index epoch.
- Table name: fmri_catalog:e{epoch}
- Key: u32 fmri_id (dense ids for compact postings); ids are stable within an epoch.
- Value: fmri string (UTF-8).
Auxiliary: fmri_lookup:e{epoch} mapping fmri string → u32 id for convenience during rebuild.
4.7 fmri_catalog_hash (integrity; replaces full_fmri_list.hash)
- Table name: fmri_catalog_hash:e{epoch}
- Key: const 0x00
- Value: hex lowercase SHA-1 of sorted fmri strings from fmri_catalog:e{epoch}.
4.8 fmri_offsets (dedup store; replaces fmri_offsets.v1 semantics)
- Purpose: Deduplicate common offset lists shared across FMRIs or tokens.
- Table name: fmri_offsets:e{epoch}
- Key: content hash (e.g., blake3 of absolute offsets encoding).
- Value: delta-encoded varint list of offsets; may be lz4-compressed. Optionally also store a small header with original length/first few entries to detect corruption early.
Usage: token_postings FvGroup pairs may reference either inline offsets or an indirect reference by hash if the list exceeds size threshold. To keep lookups simple initially, we will inline offsets; indirect storage is a future optimization.
4.9 locks (writer serialization)
- Table: locks
- Key: "index_writer"
- Value: struct { holder: string, since: timestamp }
Notes: While redb provides transactions, we still serialize rebuilds/updates to mirror the single-writer expectation of pkg5. On server, publishing already serializes indexing; on client, operations are serialized by image plan execution. We can also keep the historical file-based lock for external tools, but within redb, this table suffices.
5. Operations
5.1 Consistent Reads
- Readers begin a read transaction and fetch meta.active_epoch to locate epoch-scoped tables.
- Because redb is MVCC, the entire read uses a consistent snapshot of those tables; concurrent writers do not affect the transaction's view.
- Readers also fetch the fast_add and fast_remove sets to adjust responses. To avoid race conditions, readers read fast_* within the same snapshot and can then safely compute: effective_installed = (epoch state ∪ fast_add) \ fast_remove.
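The set arithmetic in the last bullet can be sketched with plain in-memory sets. This is a minimal illustration rather than the planned redb code: `effective_fmris` is a hypothetical helper, and its three slices stand in for rows read from the epoch catalog, fast_add, and fast_remove inside one read transaction.

```rust
use std::collections::BTreeSet;

/// Computes (base ∪ fast_add) \ fast_remove over fmri strings.
/// All three inputs are assumed to come from the same read snapshot.
/// The result is returned in sorted order.
pub fn effective_fmris(base: &[&str], fast_add: &[&str], fast_remove: &[&str]) -> Vec<String> {
    let removed: BTreeSet<&str> = fast_remove.iter().copied().collect();
    let merged: BTreeSet<&str> = base.iter().chain(fast_add.iter()).copied().collect();
    merged
        .into_iter()
        .filter(|f| !removed.contains(f))
        .map(str::to_string)
        .collect()
}
```

Because the subtraction runs after the union, an fmri that appears in both the epoch catalog and fast_remove is dropped, which matches the formula as written.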
5.2 Fast Incremental Update
Trigger: Small number of fmris added/removed on client or small publish window on server.
Steps:
1) Start write transaction.
2) Insert fmri strings into fast_add or fast_remove tables accordingly (idempotent puts; remove from opposite table if present).
3) Optionally update a small counter or watermark (e.g., meta.fast_delta_count).
4) Commit. No changes to token_postings or fmri_catalog for fast path.
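Step 2 is the part that is easiest to get subtly wrong, so here is a small in-memory model of it. The `FastDelta` type and its method names are hypothetical; in the real index the two sets would be redb tables updated inside one write transaction.

```rust
use std::collections::BTreeSet;

/// In-memory stand-in for the fast_add and fast_remove tables.
#[derive(Debug, Default)]
pub struct FastDelta {
    pub fast_add: BTreeSet<String>,
    pub fast_remove: BTreeSet<String>,
}

impl FastDelta {
    /// Idempotent put into fast_add; clears the opposite table first.
    pub fn record_added(&mut self, fmri: &str) {
        self.fast_remove.remove(fmri);
        self.fast_add.insert(fmri.to_string());
    }

    /// Idempotent put into fast_remove; clears the opposite table first.
    pub fn record_removed(&mut self, fmri: &str) {
        self.fast_add.remove(fmri);
        self.fast_remove.insert(fmri.to_string());
    }

    /// The two sets should never share an fmri.
    pub fn is_disjoint(&self) -> bool {
        self.fast_add.is_disjoint(&self.fast_remove)
    }
}
```

Removing from the opposite set before inserting means the last operation on an fmri wins, and repeating the same operation leaves the state unchanged.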
Query impact:
- During query evaluation, postings derived from the active epoch are merged with delta postings computed on the fly for the fast_* fmris if we choose to precompute quick shards. Initial implementation can simply filter the result set by membership: when returning matched manifests, subtract those in fast_remove and optionally include those in fast_add only if we maintain their postings in a small side cache. Two options:
A. Ultra-minimal: fast_* only modifies membership for fmri listing queries; token-based matches for newly added packages require full rebuild or a small background mini-index.
B. Practical equivalence: maintain an auxiliary mini-index table mini_token_postings built incrementally for fast_* fmris. This table mirrors token_postings structure but only for the delta set. During query, we merge results from token_postings (epoch) and mini_token_postings, then subtract any fmris in fast_remove.
Chosen approach for equivalence: Option B (mini-token index) to match pkg5 behavior that answers queries without immediate full rebuild.
5.3 Full Rebuild
Trigger: Large change set, first-time indexing, detected inconsistency, or server bulk operations.
Steps:
1) Start write transaction; allocate new epoch E = meta.active_epoch + 1. Create epochs[E] = Building.
2) Compute fmri_catalog:eE and fmri_lookup:eE from manifests; compute fmri_catalog_hash:eE.
3) Build token_postings:eE by scanning manifests, tokenizing, and aggregating postings with offsets. Apply delta-encoding and optional lz4 compression for large values.
4) Optionally create auxiliary indices like token_offsets:eE and fmri_offsets:eE if enabled.
5) Update epochs[E] = Active and set meta.active_epoch = E in the same commit; set epochs[prev] = Retired.
6) Clear mini_token_postings and fast_add/fast_remove within the same or subsequent small transaction.
Crash safety:
- If the process crashes before flipping active_epoch, readers continue to use prev epoch.
- A subsequent startup task scans epochs table; any epochs in Building with no pointer from meta.active_epoch are GC candidates.
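The epoch lifecycle described in steps 1 and 5 and in the crash-safety notes can be modelled as a small state machine. The sketch below is an assumption-laden illustration: `EpochCatalog` is a hypothetical in-memory stand-in for the meta and epochs tables, and each method represents what one committed transaction would do.

```rust
use std::collections::BTreeMap;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EpochState {
    Building,
    Active,
    Retired,
}

/// In-memory stand-in for meta.active_epoch plus the epochs table.
#[derive(Debug, Default)]
pub struct EpochCatalog {
    pub active_epoch: Option<u64>,
    pub epochs: BTreeMap<u64, EpochState>,
}

impl EpochCatalog {
    /// Step 1: allocate E = active_epoch + 1 (or 1 on first build) as Building.
    pub fn begin_rebuild(&mut self) -> u64 {
        let e = self.active_epoch.map_or(1, |a| a + 1);
        self.epochs.insert(e, EpochState::Building);
        e
    }

    /// Step 5: mark E Active, retire the previous epoch, flip the pointer.
    pub fn activate(&mut self, e: u64) {
        if let Some(prev) = self.active_epoch {
            self.epochs.insert(prev, EpochState::Retired);
        }
        self.epochs.insert(e, EpochState::Active);
        self.active_epoch = Some(e);
    }

    /// Startup scan: Building epochs that the active pointer does not reference.
    pub fn gc_candidates(&self) -> Vec<u64> {
        self.epochs
            .iter()
            .filter(|(e, s)| **s == EpochState::Building && Some(**e) != self.active_epoch)
            .map(|(e, _)| *e)
            .collect()
    }
}
```

If a process stops between `begin_rebuild` and `activate`, `active_epoch` still names the previous epoch and the abandoned one shows up in `gc_candidates`.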
5.4 Garbage Collection and Compaction
- Retain N recent epochs for rollback/debug (configurable). Periodically remove Retired epochs older than retention.
- Use redb's compaction/cleanup mechanisms as recommended by upstream to reclaim space.
6. Data Encodings
6.1 Strings
- UTF-8, length-prefix varint. Deduplicate common strings (action_type, subtype) via small intern tables if hot.
6.2 Integer Lists (Offsets)
- Store offsets as increasing u32 values; encode as delta varints: d0 = abs0, di = abs[i] - abs[i-1].
- For small lists (<= 4 items), inline in FvGroup; for larger, consider optional lz4 compression.
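As a concrete reading of the delta rule above, the following sketch encodes an increasing offset list as LEB128 varints and decodes it again. The function names are hypothetical, and the count is passed separately because the surrounding framing (where the list length lives) is defined elsewhere in this plan.

```rust
/// Appends `v` as an unsigned LEB128 varint.
pub fn write_varint(mut v: u32, out: &mut Vec<u8>) {
    while v >= 0x80 {
        out.push((v & 0x7f) as u8 | 0x80);
        v >>= 7;
    }
    out.push(v as u8);
}

/// Reads one LEB128 varint that must fit in a u32.
pub fn read_varint(input: &[u8], pos: &mut usize) -> Option<u32> {
    let mut value = 0u32;
    for shift in (0u32..35).step_by(7) {
        let byte = *input.get(*pos)?;
        *pos += 1;
        if shift == 28 && byte > 0x0f {
            return None; // more than 32 bits of payload
        }
        value |= u32::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
    }
    None
}

/// d0 = abs0, di = abs[i] - abs[i-1]; rejects lists that are not strictly increasing.
pub fn encode_offsets(abs: &[u32]) -> Option<Vec<u8>> {
    let mut out = Vec::new();
    let mut prev: Option<u32> = None;
    for &a in abs {
        let delta = match prev {
            None => a,
            Some(p) if a > p => a - p,
            Some(_) => return None,
        };
        write_varint(delta, &mut out);
        prev = Some(a);
    }
    Some(out)
}

/// Inverse of `encode_offsets` for a known element count.
pub fn decode_offsets(bytes: &[u8], count: usize) -> Option<Vec<u32>> {
    let mut pos = 0;
    let mut abs = Vec::new();
    let mut acc = 0u32;
    for i in 0..count {
        let d = read_varint(bytes, &mut pos)?;
        acc = if i == 0 { d } else { acc.checked_add(d)? };
        abs.push(acc);
    }
    Some(abs)
}
```

For the offsets 1, 3, 300 the deltas are 1, 2, 297, which take one, one, and two bytes respectively.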
6.3 Keys
- Use simple UTF-8 token as key. For optional chunking: key = token + 0x1F separator + u16 chunk_id (big-endian) to keep lexicographic locality.
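The optional chunked key layout can be sketched as follows. `chunk_key` and `split_chunk_key` are hypothetical helper names, and the sketch assumes tokens never contain the 0x1F byte themselves; the plan does not state that guarantee.

```rust
/// Separator byte between the token and the chunk id.
const CHUNK_SEP: u8 = 0x1F;

/// key = token bytes + 0x1F + chunk_id as big-endian u16.
pub fn chunk_key(token: &str, chunk_id: u16) -> Vec<u8> {
    let mut key = Vec::with_capacity(token.len() + 3);
    key.extend_from_slice(token.as_bytes());
    key.push(CHUNK_SEP);
    key.extend_from_slice(&chunk_id.to_be_bytes());
    key
}

/// Splits a chunked key back into (token, chunk_id).
pub fn split_chunk_key(key: &[u8]) -> Option<(&str, u16)> {
    if key.len() < 3 {
        return None;
    }
    let (head, id) = key.split_at(key.len() - 2);
    let (token, sep) = head.split_at(head.len() - 1);
    if sep[0] != CHUNK_SEP {
        return None;
    }
    let token = std::str::from_utf8(token).ok()?;
    Some((token, u16::from_be_bytes([id[0], id[1]])))
}
```

Big-endian chunk ids make byte-wise key order agree with numeric chunk order, so a range scan over one token's prefix returns its chunks in sequence.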
6.4 Checksums and Hashes
- fmri_catalog_hash:eE stores SHA-1 hex of sorted fmri strings to match pkg5 semantics.
- For internal dedup/validation, use blake3 for speed where not exposed.
7. Query Semantics and Equivalence
Token search path:
1) Read meta.active_epoch = E.
2) Lookup token_postings:eE[token]. If chunked, merge all chunks.
3) Lookup mini_token_postings[token] and merge into the same structure.
4) Apply fast_remove filter by excluding any fmri ids whose fmri string appears in fast_remove.
5) If the query path needs to output fmri strings, map fmri_id → string via fmri_catalog:eE; for mini index entries that reference fmri strings directly, assign transient ids or join by string.
6) Return groupings by action type, subtype, full_value with manifest offsets just like pkg5.
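Steps 2 through 5 amount to a join and a filter. The sketch below shows that shape for a single token under simplifying assumptions: the action type, subtype, and full_value grouping is flattened away, the catalog is a slice indexed by dense fmri id, and `merge_hits` is a hypothetical name.

```rust
use std::collections::BTreeMap;

/// Merges epoch postings (keyed by fmri id) with mini-index postings
/// (keyed by fmri string), then drops anything listed in fast_remove.
/// Returns (fmri, sorted offsets) pairs ordered by fmri.
pub fn merge_hits(
    epoch_hits: &[(u32, Vec<u32>)],
    mini_hits: &[(&str, Vec<u32>)],
    catalog: &[&str],
    fast_remove: &[&str],
) -> Vec<(String, Vec<u32>)> {
    let mut merged: BTreeMap<String, Vec<u32>> = BTreeMap::new();
    for (id, offsets) in epoch_hits {
        // Ids missing from the catalog are skipped here; a real reader
        // would more likely report an invariant violation.
        if let Some(fmri) = catalog.get(*id as usize) {
            merged.entry(fmri.to_string()).or_default().extend(offsets);
        }
    }
    for (fmri, offsets) in mini_hits {
        merged.entry(fmri.to_string()).or_default().extend(offsets);
    }
    for fmri in fast_remove {
        merged.remove(*fmri);
    }
    for offsets in merged.values_mut() {
        offsets.sort_unstable();
        offsets.dedup();
    }
    merged.into_iter().collect()
}
```

Joining by string rather than assigning transient ids is one of the two options step 5 allows; it keeps the mini index independent of the epoch's id assignment.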
Full fmri list queries:
- Return (fmri_catalog:eE ∪ fast_add) \ fast_remove, ensure sorted order if requested, and provide fmri_catalog_hash:eE (unchanged) for backward-compat features.
Consistency guarantees:
- Readers never block writers (MVCC) and see a consistent epoch.
- Fast updates are visible immediately after their transaction commits.
- Epoch flips are atomic from the readers' perspective.
8. Migration Plan (from existing pkg5-style on-disk index)
Inputs: Existing $ROOT/index text files as specified in pkg5 search.txt.
Steps:
1) Parse existing files using a one-time importer tool in Rust.
2) Create redb database at $ROOT/index.redb/ (new directory) or reuse $ROOT/index with a different extension.
3) Initialize meta (schema_version=1) and create epoch E=1.
4) Fill fmri_catalog:e1 and fmri_lookup:e1 from manf_list.v1.
5) Convert main_dict.ascii.v2 and token_byte_offset.v1 into token_postings:e1 values. Honor URL-unquoting/quoting rules.
6) Populate fmri_offsets:e1 if we choose indirect references (optional initially).
7) Compute fmri_catalog_hash:e1 from full_fmri_list or reconstructed catalog.
8) Populate fast_add/fast_remove by copying logs if needed; or prefer to start with empty delta on first activation.
9) Activate epoch E and mark old files as deprecated. Keep old files read-only until confidence is achieved.
9. Error Handling and Recovery
- All writer operations use miette diagnostics in application crates and specific error types in libips (per repository guidelines). Errors include: redb transaction failures, data corruption checks, invalid manifest encodings, and invariant violations.
- Validation steps:
- Ensure fmri ids are contiguous in fmri_catalog:eE.
- Verify SHA-1 hash matches computed value.
- Verify postings reference valid fmri ids and that offsets lists are strictly increasing after decoding.
- For mini_token_postings, ensure referenced fmri strings exist in fast_add or are to be installed.
- Recovery:
- If fast update fails, roll back the transaction; retry or fall back to full rebuild.
- If a rebuild fails mid-way, epoch stays in Building and is GCd later; readers continue using prior epoch.
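Two of the validation steps above are simple enough to show directly. The error type and function names below are hypothetical, and the sketch assumes dense ids start at 0, which the list above does not state explicitly.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum InvariantError {
    NonContiguousIds { expected: u32, found: u32 },
    OffsetsNotIncreasing { fmri_id: u32 },
    UnknownFmriId { fmri_id: u32 },
}

/// `ids` are the catalog keys in key order; they must be exactly 0, 1, 2, ...
pub fn check_contiguous(ids: &[u32]) -> Result<(), InvariantError> {
    for (expected, &found) in (0u32..).zip(ids) {
        if expected != found {
            return Err(InvariantError::NonContiguousIds { expected, found });
        }
    }
    Ok(())
}

/// Checks one decoded posting pair against a catalog of `catalog_len` entries.
pub fn check_posting(fmri_id: u32, offsets: &[u32], catalog_len: u32) -> Result<(), InvariantError> {
    if fmri_id >= catalog_len {
        return Err(InvariantError::UnknownFmriId { fmri_id });
    }
    if offsets.windows(2).any(|w| w[0] >= w[1]) {
        return Err(InvariantError::OffsetsNotIncreasing { fmri_id });
    }
    Ok(())
}
```

In libips these variants would presumably be expressed with the project's own error types and diagnostics; the plain enum here only keeps the sketch free of dependencies.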
10. Concurrency and Locking Details
- Single logical writer: Maintain a process-level lock (either OS file lock at $ROOT/index.lock for compatibility, or locks table in redb) to serialize rebuilds and bulk updates.
- Readers: No locks required; rely on MVCC.
- Writer never blocks readers: Guaranteed by redb's snapshot isolation.
11. Performance Considerations
- Posting value size thresholds: If a token's postings exceed N bytes, enable lz4 compression for the value.
- Hot string interning: Maintain two tiny epoch-scoped tables action_types:eE and subtypes:eE mapping string → u16 id to shrink postings.
- Streaming build: Build token_postings:eE in sorted token order to improve locality; batch commits.
- Mini-index size control: Evict older entries if fast delta grows beyond threshold; trigger background full rebuild when threshold is exceeded, mirroring MAX_FAST_INDEXED_PKGS behavior.
12. Configuration Knobs
- retention_epochs: number of retired epochs to keep.
- compress_threshold_bytes: per-value threshold for lz4.
- enable_indirect_offsets: bool to use fmri_offsets:eE.
- fast_update_threshold: mirrors pkg5 MAX_FAST_INDEXED_PKGS; exceeding triggers full rebuild.
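One possible shape for these knobs is a plain struct plus the two decisions they drive. Everything here is a sketch: the type and function names are hypothetical, no default values are implied, and the comparison against `fast_update_threshold` assumes the pending delta and the incoming change set are counted together.

```rust
/// Field names mirror the knobs listed above.
#[derive(Debug, Clone)]
pub struct IndexConfig {
    pub retention_epochs: usize,
    pub compress_threshold_bytes: usize,
    pub enable_indirect_offsets: bool,
    pub fast_update_threshold: usize,
}

#[derive(Debug, PartialEq, Eq)]
pub enum UpdatePlan {
    FastPath,
    FullRebuild,
}

/// Exceeding the threshold triggers a full rebuild.
pub fn plan_update(cfg: &IndexConfig, pending: usize, incoming: usize) -> UpdatePlan {
    if pending.saturating_add(incoming) > cfg.fast_update_threshold {
        UpdatePlan::FullRebuild
    } else {
        UpdatePlan::FastPath
    }
}

/// Values strictly larger than the threshold are candidates for lz4.
pub fn should_compress(cfg: &IndexConfig, value_len: usize) -> bool {
    value_len > cfg.compress_threshold_bytes
}
```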
13. Invariants (must hold)
- meta.active_epoch points to an epochs entry in Active state.
- All epoch-scoped tables for active epoch exist and are self-consistent.
- fmri ids in token_postings:eE reference existing entries in fmri_catalog:eE.
- Offsets decode to strictly increasing absolute positions.
- fmri_catalog_hash:eE matches the sorted fmri list.
- fast_add ∩ fast_remove = ∅; writes maintain disjointness.
14. Testing Plan
- Unit tests for encoding/decoding of postings and delta offsets.
- Property tests: round-trip manifests → postings → decode invariants.
- Concurrency tests: readers during rebuild and fast updates; ensure stable results.
- Migration tests: import from sample pkg5 index and validate equivalence of query results.
- End-to-end tests using cargo xtask setup-test-env to seed sample images/repos, index them, and run search queries.
15. Implementation Roadmap (high level)
Phase 1
- Define redb schemas and value codecs in libips::search::index.
- Implement full rebuild writer and read-only query path.
- Implement fast_add/remove tables without mini-token index; gate features.
Phase 2
- Add mini_token_postings and merge logic for full equivalence with pkg5 fast updates.
- Add compression and string interning optimizations.
Phase 3
- Migration tool from pkg5 text index; verification utilities.
- Epoch GC and compaction tooling.
Appendix A: Mapping of pkg5 files to redb tables
- main_dict.ascii.v2 → token_postings:e{epoch}
- token_byte_offset.v1 → token_offsets:e{epoch} (optional)
- fast_add.v1 → fast_add
- fast_remove.v1 → fast_remove
- full_fmri_list → fmri_catalog:e{epoch}
- full_fmri_list.hash → fmri_catalog_hash:e{epoch}
- manf_list.v1 → fmri_catalog:e{epoch} + fmri_lookup:e{epoch}
- fmri_offsets.v1 → fmri_offsets:e{epoch} (optional, can be inlined initially)
- lock → external file lock or locks table; writer serialization via single-writer discipline
This plan keeps feature parity with pkg5 while modernizing the storage and concurrency model using redb. It preserves all key capabilities and exposes a clear path for incremental adoption and migration.

@ -0,0 +1,220 @@
Title: Redb-based IPS Search Index — Postcard-Standardized Binary Encodings
Author: Junie (JetBrains AI Coding Agent)
Date: 2025-11-12
Status: Planning Document (encoding revision to standardize all blobs on postcard)
0. Motivation (what and why)
- Requirement: for any encoded binary blob, use the postcard format to standardize serialization and simplify parsing in Rust.
- Scope: Update the simplified Redb index plan so that all structured values stored as redb value blobs are serialized with postcard via serde, replacing the custom ad-hoc varint framing proposed earlier.
- Benefits:
- Consistent, well-tested serialization with minimal overhead and no_std compatibility.
- Straightforward Rust implementation using #[derive(Serialize, Deserialize)].
- Reduced hand-rolled codec surface area and fewer edge cases.
1. Affected schema elements (values become postcard-encoded)
- Keys remain raw UTF-8 strings as previously specified, to keep lookups simple. The change applies to values that carry structured data:
- postings: token → postings groups with offsets (was: custom LEB128 + lengths). Now: postcard of Rust structs defined below.
- mini_delta: token → delta postings entries (was: custom). Now: postcard.
- fast_add and fast_remove values: retain empty (unit) or timestamp string; optionally postcard-encode a small struct for uniformity (see below).
- meta values: previously plain u32 for schema_version. We will keep primitive numbers for trivial singletons, but allow postcard-encoded structs if/when meta grows.
- fmri_catalog_hash value: keep as UTF-8 hex string (interoperability with external tools); alternatively add a postcard mirror key if needed later.
Schema version bump: Increment meta.schema_version from 1 → 2 to denote postcard adoption for postings and mini_delta tables.
2. Rust data model (serde-friendly, postcard-serializable)
Use serde + postcard for all structured blobs. The types below precisely mirror the hierarchical structure used by pkg5 and the simplified plan, optimized for Rust parsing.
2.1 Common types
```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct OffsetList {
    // Absolute manifest offsets in strictly increasing order.
    pub offsets: Vec<u32>,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct PostingPairId {
    pub fmri_id: u32,
    pub positions: OffsetList,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct PostingPairStr {
    pub fmri_str: String,
    pub positions: OffsetList,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct FullValueGroupId {
    pub full_value: String,
    pub pairs: Vec<PostingPairId>,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct FullValueGroupStr {
    pub full_value: String,
    pub pairs: Vec<PostingPairStr>,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct SubtypeGroupId {
    pub subtype: String,
    pub full_values: Vec<FullValueGroupId>,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct SubtypeGroupStr {
    pub subtype: String,
    pub full_values: Vec<FullValueGroupStr>,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct ActionTypeGroupId {
    pub action_type: String,
    pub subtypes: Vec<SubtypeGroupId>,
}

#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct ActionTypeGroupStr {
    pub action_type: String,
    pub subtypes: Vec<SubtypeGroupStr>,
}

/// Postings value stored in `postings` table (token → this), using fmri_id for compactness.
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct PostingsValueId {
    pub groups: Vec<ActionTypeGroupId>,
}

/// Postings value stored in `mini_delta` table (token → this), using fmri_str for easy fast-path writes.
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash)]
pub struct PostingsValueStr {
    pub groups: Vec<ActionTypeGroupStr>,
}
```
Notes:
- Offsets are absolute here for simplicity; postcard's compact varints plus delta-at-build-time remain viable as a pre-serialization optimization if desired. If we want to retain delta semantics, we can change `OffsetList` to store deltas and normalize at read.
- We intentionally keep field names stable to benefit from serde's default representation with postcard.
2.2 Optional uniformity for fast_add/fast_remove values
To fully standardize “any binary blob,” we can encode fast_add/remove values as postcard too, while keeping keys as the fmri string:
```rust
#[derive(Debug, Clone, Serialize, Deserialize, PartialEq, Eq, Hash, Default)]
pub struct FastMarker {
    // Optional metadata; empty means just a presence marker.
    pub timestamp_iso8601: Option<String>,
}
```
3. Table specifications with postcard encodings
- postings
  - Key: UTF-8 token
  - Value: postcard(PostingsValueId)
- mini_delta
  - Key: UTF-8 token
  - Value: postcard(PostingsValueStr)
- fast_add
  - Key: UTF-8 fmri
  - Value: unit (empty) OR postcard(FastMarker) if we decide to populate timestamps; readers must accept either for schema_version 2.
- fast_remove
  - Key: UTF-8 fmri
  - Value: unit (empty) OR postcard(FastMarker)
- fmri_catalog (unchanged)
  - id_to_str: key = (0x00, u32) → value = UTF-8 fmri (string is not a “binary blob” and benefits from direct storage)
  - str_to_id: key = (0x01, UTF-8 fmri) → value = u32 id
- fmri_catalog_hash (unchanged)
  - Key: 0x00
  - Value: UTF-8 lowercase hex SHA-1
- meta
  - Key: "schema_version" → Value: u32 (set to 2)
  - Future composite meta records MAY be postcard-encoded structs.
4. Dependency guidance (per project error-handling/dependency policy)
Library crates (e.g., libips):
```toml
[dependencies]
serde = { version = "1", features = ["derive"] }
postcard = { version = "1", features = ["use-std"] }
thiserror = "1.0.50"
miette = "7.6.0"
tracing = "0.1.37"
```
Application crates keep their existing miette configuration (fancy in apps only) and add postcard/serde if they interact with the index directly.
5. Read/write rules (implementation notes)
- Writers
  - Build PostingsValueId/PostingsValueStr structures in memory, serialize with postcard::to_allocvec or postcard::to_stdvec, and store as the redb value bytes.
  - Enforce invariants: strictly increasing offsets, valid fmri_id references, disjoint fast_add/fast_remove.
- Readers
  - Fetch value bytes from redb and decode with postcard::from_bytes into the matching struct.
  - Merge logic: postings[token] (Id) ∪ mini_delta[token] (Str), with joins to fmri_catalog and fast_add/fast_remove exactly as in the simplified plan.
Performance notes:
- Postcard uses compact varint-like encodings and typically yields sizes close to hand-rolled varints without the maintenance burden. If needed, we can pre-delta-encode offsets before serialization (store deltas in OffsetList) and restore absolute on read.
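The pre-delta option mentioned above can be sketched without any serialization crate. `to_deltas`, `from_deltas`, and `varint_len` are hypothetical helpers, and `varint_len` assumes LEB128-style sizing (7 payload bits per byte) as a rough stand-in for the encoded size of a `u32`; it is an estimate, not a measurement of postcard output.

```rust
/// Converts absolute offsets to deltas (d0 = a0, di = a[i] - a[i-1]).
/// Returns None if the input ever decreases.
pub fn to_deltas(abs: &[u32]) -> Option<Vec<u32>> {
    let mut prev = 0u32;
    abs.iter()
        .map(|&a| {
            let d = a.checked_sub(prev)?;
            prev = a;
            Some(d)
        })
        .collect()
}

/// Restores absolute offsets from deltas; returns None on u32 overflow.
pub fn from_deltas(deltas: &[u32]) -> Option<Vec<u32>> {
    let mut acc = 0u32;
    deltas
        .iter()
        .map(|&d| {
            acc = acc.checked_add(d)?;
            Some(acc)
        })
        .collect()
}

/// Estimated bytes for one value under LEB128-style varint encoding.
pub fn varint_len(v: u32) -> usize {
    match v {
        0..=0x7f => 1,
        0x80..=0x3fff => 2,
        0x4000..=0x1f_ffff => 3,
        0x20_0000..=0x0fff_ffff => 4,
        _ => 5,
    }
}
```

For the offsets 1000, 1003, 1010 this estimate gives 6 bytes for the absolute values and 4 bytes for the deltas 1000, 3, 7, which is the kind of saving the note has in mind for clustered offsets.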
6. Migration and backward compatibility
- schema_version bump: Set meta.schema_version = 2 when the index is built with postcard encodings for postings and mini_delta.
- Readers should support both versions during the transition:
  - If schema_version == 1: decode custom ad-hoc blobs (legacy path).
  - If schema_version == 2: decode postcard structs defined above.
- One-time converter: implement a small tool that opens the redb database, reads v1 blobs, converts to structs, writes v2 values in a single atomic write transaction, updates schema_version, and clears any v1-only artifacts. This can live under xtask or a dedicated migration command.
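The dual-decoder rule can be expressed as a small dispatch on schema_version. This is a sketch only: `Codec`, `SchemaError`, `codec_for`, and `decode_with` are hypothetical names, and the two decoders are passed in as closures so the example stays independent of the actual legacy and postcard codecs.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Codec {
    LegacyV1,
    PostcardV2,
}

#[derive(Debug, PartialEq, Eq)]
pub enum SchemaError {
    UnsupportedVersion(u32),
    Decode,
}

/// Maps meta.schema_version to the decoder that must be used.
pub fn codec_for(schema_version: u32) -> Result<Codec, SchemaError> {
    match schema_version {
        1 => Ok(Codec::LegacyV1),
        2 => Ok(Codec::PostcardV2),
        other => Err(SchemaError::UnsupportedVersion(other)),
    }
}

/// Decodes one value blob with whichever decoder the version selects.
pub fn decode_with<T>(
    schema_version: u32,
    bytes: &[u8],
    legacy: impl Fn(&[u8]) -> Option<T>,
    postcard_v2: impl Fn(&[u8]) -> Option<T>,
) -> Result<T, SchemaError> {
    let decoded = match codec_for(schema_version)? {
        Codec::LegacyV1 => legacy(bytes),
        Codec::PostcardV2 => postcard_v2(bytes),
    };
    decoded.ok_or(SchemaError::Decode)
}
```

Rejecting unknown versions explicitly keeps an older reader from misinterpreting blobs written by a future schema.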
7. Error handling (aligns with project guidelines)
- Define errors like EncodeError, DecodeError, SchemaMismatch, InvalidOffsets, MissingFmriId, TxnFailure in libips with thiserror + miette::Diagnostic derives.
- Wrap postcard errors with a transparent source in DecodeError/EncodeError.
- Application crates use miette::Result for convenience.
8. Testing plan updates
- Unit tests: round-trip postcard serialization for all structs; invariants on offsets; merge correctness (Id ∪ Str).
- Property tests: generate random postings structures and assert from_bytes(&to_allocvec(&x)) == x (after unwrapping both results); ensure offsets remain strictly increasing after any delta/absolute conversions.
- Back-compat tests: v1 blobs decode → v2 structs encode → re-decode equality for semantic content.
- Concurrency: unchanged (rely on redb MVCC); ensure no partial writes by using single-transaction updates.
9. Implementation roadmap deltas
- Phase A (Postcard types):
  - Add the serde types above under libips::search::index::types.
  - Implement postcard encode/decode helpers in libips::search::index::codec.
  - Gate by schema_version; keep legacy decoder behind the same module for migration.
- Phase B (Writers/Readers):
  - Update full rebuild writer to produce PostingsValueId and write as postcard.
  - Update fast path to produce PostingsValueStr.
  - Update read path to decode postcard when schema_version == 2.
- Phase C (Migration Tooling):
  - Implement an xtask subcommand to migrate v1 → v2 in-place atomically.
Appendix: Rationale for keeping some values as plain strings or integers
- fmri_catalog values and fmri_catalog_hash are single UTF-8 strings and not multi-field blobs; storing them as raw strings avoids unnecessary serde overhead and maintains interop with tools that may read them directly.
- meta.schema_version is a simple u32; postcard would not add value here. If meta grows into a multi-field record, we will switch that specific value to postcard.
Summary
This revision standardizes all structured binary blobs on postcard while preserving the simplified schema, operations, and invariants. It reduces custom codec complexity, aligns with Rust best practices (serde), and provides a clear migration path via a schema_version bump and dual-decoder support during transition.

@ -0,0 +1,154 @@
Title: Redb-based IPS Search Index — Simplified Plan (MVCC-first, Rust-friendly)
Author: Junie (JetBrains AI Coding Agent)
Date: 2025-11-12
Status: Planning Document (supersedes complex/optional parts; implementation to follow)
0. What changed in this revision (why simpler)
- Remove epoch partitioning and any indirection tables. Rely on redb MVCC and atomic write transactions for consistency. No more active_epoch flips.
- Remove optional/aux tables and features for v1: no token_offsets, no fmri_offsets indirection, no string-intern tables, no chunking. Keep a single canonical encoding optimized for Rust parsing.
- Keep feature parity with pkg5 using a mandatory mini delta index for fast updates. The fast path is always supported without schema toggles.
- Keep encodings compact and deterministic: length-prefixed UTF-8 and LEB128 varints; no URL-quoting on disk.
1. Goals (unchanged intent, simplified mechanism)
- Functional equivalence with pkg5 search results and fast update behavior.
- Atomic multi-table updates via a single redb write transaction; readers get consistent snapshots automatically.
- Minimal schema that is straightforward to implement, test, and migrate.
2. Minimal Database Schema (fixed, no epochs)
All tables live in one redb database file (e.g., index.redb). Schema versioning is done via a single meta record.
- meta
- Key: "schema_version"
- Value: u32 (starts at 1)
- fmri_catalog
- Purpose: dense id ↔ string mapping for fmris used in postings
- Keys/Values:
- id_to_str: key = (0x00, u32 id) → value = utf8 fmri
- str_to_id: key = (0x01, utf8 fmri) → value = u32 id
- Policy: ids are compact (0..N-1). Rebuilds may reassign ids; fast delta uses strings and is joined at query time.
- postings
- Purpose: main inverted index token → grouped postings with per-manifest offsets
- Key: utf8 token (exact, unescaped)
- Value: binary blob encoded as:
- at_group_count: varint
- repeat at_group_count times:
- action_type: len(varint) + utf8 bytes
- sub_group_count: varint
- repeat sub_group_count times:
- subtype: len(varint) + utf8 bytes
- fv_group_count: varint
- repeat fv_group_count times:
- full_value: len(varint) + utf8 bytes
- pair_count: varint
- repeat pair_count times:
- fmri_id: u32 (LE)
- offsets_count: varint
- offsets_delta[varint; length=offsets_count] (d0=a0, di=a[i]-a[i-1])
- Notes: strictly increasing offsets; no compression or chunking in v1.
- mini_delta
- Purpose: mandatory mini token index for fast updates (additions only). Mirrors postings schema but values may reference fmris by string to avoid id assignment.
- Key: utf8 token
- Value: binary blob encoded as:
- at/sub/fv hierarchy identical to postings
- pair_count: varint
- repeat pair_count times:
- fmri_str: len(varint) + utf8 bytes
- offsets_count: varint
- offsets_delta[varint]
- Rationale: keeps fast path independent of fmri_catalog churn; simplifies writer.
- fast_add
- Purpose: set of fmri strings added since last rebuild
- Key: utf8 fmri
- Value: unit (empty) or timestamp string (optional)
- fast_remove
- Purpose: set of fmri strings removed since last rebuild
- Key: utf8 fmri
- Value: unit (empty) or timestamp string (optional)
- fmri_catalog_hash
- Purpose: preserve pkg5 integrity feature
- Key: 0x00
- Value: utf8 lowercase hex SHA-1 of sorted fmri strings currently in fmri_catalog (id_to_str)
- locks (optional but simple)
- Key: "index_writer"
- Value: utf8 holder + timestamp (opaque)
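The offsets_delta field used by postings and mini_delta can be sketched as a pair of pure functions over absolute offsets. This is a minimal illustration, not the planned libips code: the function names delta_encode and delta_decode are hypothetical, and the choice to reject invalid input with None is an assumption.

```rust
/// Turn strictly increasing absolute offsets into deltas
/// (d0 = a0, di = a[i] - a[i-1]).
/// Returns None if the input is not strictly increasing.
fn delta_encode(offsets: &[u64]) -> Option<Vec<u64>> {
    let mut deltas = Vec::with_capacity(offsets.len());
    let mut prev: Option<u64> = None;
    for &cur in offsets {
        match prev {
            None => deltas.push(cur),
            Some(p) if cur > p => deltas.push(cur - p),
            Some(_) => return None, // duplicate or decreasing offset
        }
        prev = Some(cur);
    }
    Some(deltas)
}

/// Rebuild absolute offsets from deltas.
/// Rejects a zero delta after the first element (it would break the
/// "strictly increasing" rule) and rejects u64 overflow.
fn delta_decode(deltas: &[u64]) -> Option<Vec<u64>> {
    let mut offsets = Vec::with_capacity(deltas.len());
    let mut acc = 0u64;
    for (i, &d) in deltas.iter().enumerate() {
        if i > 0 && d == 0 {
            return None;
        }
        acc = acc.checked_add(d)?;
        offsets.push(acc);
    }
    Some(offsets)
}
```

Each delta would then be written as a varint, so closely spaced offsets cost one byte each.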
3. Operations (simplified)
3.1 Consistent reads
- Open a read transaction. Query tables directly; snapshot isolation ensures consistency.
- For token queries: result = postings[token] merged with mini_delta[token]; then drop any hit whose fmri (mapped via fmri_catalog for postings, or taken directly from mini_delta) is present in fast_remove. Hits for fmris in fast_add need no extra step because the mini_delta merge already includes them.
- For listing fmris: list = ((all fmri_catalog ids → strings) ∪ fast_add) \ fast_remove; integrity hash = fmri_catalog_hash.
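The read path above can be sketched with in-memory stand-ins for the tables. This is an illustration under assumptions: the values are taken as already decoded, the action_type/subtype/full_value grouping is flattened away for brevity, and the names query_token and list_fmris are hypothetical. In the real reader all inputs would come from a single redb read transaction.

```rust
use std::collections::{BTreeMap, BTreeSet, HashSet};

/// Resolve one token to fmri -> absolute offsets.
/// Returns None if postings reference an id missing from the catalog.
fn query_token(
    postings: &[(u32, Vec<u64>)],      // decoded postings[token] pairs
    mini_delta: &[(String, Vec<u64>)], // decoded mini_delta[token] pairs
    id_to_str: &[String],              // fmri_catalog id_to_str, dense ids
    fast_remove: &HashSet<String>,
) -> Option<BTreeMap<String, BTreeSet<u64>>> {
    let mut hits: BTreeMap<String, BTreeSet<u64>> = BTreeMap::new();
    // Main index: map each fmri_id back to its string.
    for (id, offsets) in postings {
        let fmri = id_to_str.get(*id as usize)?;
        hits.entry(fmri.clone())
            .or_default()
            .extend(offsets.iter().copied());
    }
    // Fast-path delta: already keyed by fmri string.
    for (fmri, offsets) in mini_delta {
        hits.entry(fmri.clone())
            .or_default()
            .extend(offsets.iter().copied());
    }
    // Subtract everything removed since the last rebuild.
    hits.retain(|fmri, _| !fast_remove.contains(fmri));
    Some(hits)
}

/// (catalog strings ∪ fast_add) \ fast_remove
fn list_fmris(
    id_to_str: &[String],
    fast_add: &HashSet<String>,
    fast_remove: &HashSet<String>,
) -> BTreeSet<String> {
    id_to_str
        .iter()
        .chain(fast_add.iter())
        .filter(|fmri| !fast_remove.contains(*fmri))
        .cloned()
        .collect()
}
```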
3.2 Fast update (client/server small change sets)
- Start write txn.
- For each added fmri: insert fmri string into fast_add, remove from fast_remove if present. Append/merge token postings for that fmri into mini_delta at the token granularity.
- For each removed fmri: insert fmri string into fast_remove, remove from fast_add if present. Optionally prune any mini_delta entries for that fmri.
- Commit. Readers immediately see the delta via MVCC.
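The bookkeeping in the fast update steps can be modelled in memory as follows. This is a sketch, assuming decoded values and no action_type/subtype/full_value grouping; the FastPath type and its methods are hypothetical names. In the real writer both methods would run inside one redb write transaction.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// In-memory stand-in for the three fast-path tables.
#[derive(Default)]
struct FastPath {
    fast_add: BTreeSet<String>,
    fast_remove: BTreeSet<String>,
    /// token -> fmri -> absolute offsets
    mini_delta: BTreeMap<String, BTreeMap<String, BTreeSet<u64>>>,
}

impl FastPath {
    /// Record an added fmri and merge its (token, offset) hits into mini_delta.
    fn add_fmri(&mut self, fmri: &str, tokens: &[(&str, u64)]) {
        // Keep fast_add and fast_remove disjoint.
        self.fast_remove.remove(fmri);
        self.fast_add.insert(fmri.to_owned());
        for &(token, offset) in tokens {
            self.mini_delta
                .entry(token.to_owned())
                .or_default()
                .entry(fmri.to_owned())
                .or_default()
                .insert(offset);
        }
    }

    /// Record a removed fmri and prune its mini_delta entries
    /// (the plan treats pruning as optional).
    fn remove_fmri(&mut self, fmri: &str) {
        self.fast_add.remove(fmri);
        self.fast_remove.insert(fmri.to_owned());
        for by_fmri in self.mini_delta.values_mut() {
            by_fmri.remove(fmri);
        }
        // Drop tokens that no longer have any hits.
        self.mini_delta.retain(|_, by_fmri| !by_fmri.is_empty());
    }
}
```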
3.3 Full rebuild
- Start write txn.
- Recompute fmri_catalog from manifests (dense ids), then recompute fmri_catalog_hash from the new catalog.
- Rebuild postings from scratch (tokenize, group, encode).
- Clear mini_delta, fast_add, fast_remove.
- Commit. Done. No epoch flips necessary.
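Dense id assignment during a rebuild might look like the sketch below. The plan only requires ids to be contiguous from 0 to N-1, so assigning them in sorted order is an assumption made here for determinism, and build_catalog is a hypothetical name.

```rust
use std::collections::BTreeMap;

/// Build both directions of fmri_catalog from the fmris found in manifests.
/// Ids are the positions in the sorted, de-duplicated list, so they are
/// contiguous (0..N-1) and reproducible for the same input set.
fn build_catalog<I>(fmris: I) -> (Vec<String>, BTreeMap<String, u32>)
where
    I: IntoIterator<Item = String>,
{
    let mut id_to_str: Vec<String> = fmris.into_iter().collect();
    id_to_str.sort();
    id_to_str.dedup();
    // Note: the cast assumes fewer than 2^32 fmris, matching the u32 id type.
    let str_to_id = id_to_str
        .iter()
        .enumerate()
        .map(|(id, fmri)| (fmri.clone(), id as u32))
        .collect();
    (id_to_str, str_to_id)
}
```

Because a rebuild may reassign ids, nothing outside the postings table should persist an fmri_id across rebuilds.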
4. Rust-friendly encoding rules
- Strings: UTF-8 with varint length prefix; decode with a zero-copy slice + from_utf8.
- Integers: LEB128 varints for counts and deltas; fmri_id stored as fixed u32 LE for fast decoding.
- Hierarchy order is fixed and deterministic; all counts come before their sequences to enable single-pass decoding.
- Keys are raw UTF-8 tokens/fmris; no URL-quoting required.
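The string and integer rules above can be sketched with four small helpers built on the standard library only. The names put_varint, get_varint, put_str, and get_str are hypothetical, and returning None on malformed input is an assumption; the real codec would presumably map these cases to the DecodeError variant.

```rust
/// Append `value` as an unsigned LEB128 varint (7 bits per byte, low bits first).
fn put_varint(out: &mut Vec<u8>, mut value: u64) {
    loop {
        let byte = (value & 0x7f) as u8;
        value >>= 7;
        if value == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80); // continuation bit: more bytes follow
    }
}

/// Read one varint from the front of `input`, advancing the slice.
/// Returns None on truncated input or on a value that overflows u64.
fn get_varint(input: &mut &[u8]) -> Option<u64> {
    let mut value = 0u64;
    let mut shift = 0u32;
    loop {
        let data = *input;
        let (&byte, rest) = data.split_first()?;
        *input = rest;
        // The tenth byte may only carry the single remaining bit.
        if shift == 63 && byte > 1 {
            return None;
        }
        value |= u64::from(byte & 0x7f) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
    }
}

/// Append a string as varint length + UTF-8 bytes.
fn put_str(out: &mut Vec<u8>, s: &str) {
    put_varint(out, s.len() as u64);
    out.extend_from_slice(s.as_bytes());
}

/// Read a length-prefixed string without copying: the returned &str
/// borrows directly from the input buffer.
fn get_str<'a>(input: &mut &'a [u8]) -> Option<&'a str> {
    let len = get_varint(input)?;
    let data: &'a [u8] = *input;
    if len > data.len() as u64 {
        return None;
    }
    let (head, rest) = data.split_at(len as usize);
    *input = rest;
    std::str::from_utf8(head).ok()
}
```

Because every count precedes its sequence, a decoder built from these helpers can walk a value front to back with a single cursor and no lookahead.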
5. Invariants
- postings references only valid fmri_id values present in fmri_catalog at the time of commit.
- offsets decode to strictly increasing absolute positions.
- fast_add ∩ fast_remove = ∅ (writers maintain disjointness).
- mini_delta entries reference fmri strings; when a full rebuild commits, mini_delta MUST be empty.
- fmri_catalog ids are 0..N-1 contiguous.
6. Migration (from pkg5 text index)
Step-by-step importer (single write txn):
- Build fmri_catalog from manf_list.v1; compute fmri_catalog_hash from full_fmri_list.
- Convert main_dict.ascii.v2 lines into postings values (URL-unquote tokens and full values; map pfmri_index → fmri_id; parse offsets and store delta-encoded).
- Initialize mini_delta empty.
- Copy fast_add/fast_remove lines into respective sets.
- Commit. Result is immediately queryable and functionally equivalent.
7. Error handling (per project guidelines)
- Library code (libips): define specific error enums with thiserror and miette::Diagnostic derives (do not enable miette's fancy feature in the library). Errors include: SchemaMismatch, DecodeError, InvalidOffsets, MissingFmriId, TxnFailure.
- Application crates: use miette::Result and attach helpful diagnostics.
8. Testing plan (focused for simplified schema)
- Unit tests: encoder/decoder for postings and mini_delta, varint/delta round-trips, fmri_catalog id assignment.
- Property tests: offsets strictly increasing after decode; random postings encode/decode equality.
- Concurrency: readers during fast update and during rebuild; verify stable snapshots.
- Migration: import sample pkg5 index; compare query results token-by-token with legacy.
9. Implementation roadmap (tight scope)
- Phase A: Define codecs and table handles in libips::search::index; implement full rebuild writer + read path.
- Phase B: Implement fast path (mini_delta + fast_add/remove) with merge logic; add pruning on removals.
- Phase C: Importer from pkg5 files; verification utilities; basic GC (mini_delta cleanup is already enforced; no epochs to GC).
Appendix: Removed/Deferred items and rationale
- Epochs: removed — redb's atomic transactions and MVCC provide consistent multi-table updates without indirection.
- token_offsets and fmri_offsets tables: removed — premature optimization; postings are single-key fetches. Can be reconsidered if profiling shows need.
- Chunking/postings compression/interning: deferred — start with a straightforward encoding; add compression or interning only if performance data demands it.
This simplified plan keeps the same user-visible behavior as pkg5 with fewer moving parts, favoring redb's built-in guarantees and a Rust-friendly, deterministic binary encoding.