ips/doc/pkg5_docs/search.txt


pkg
SEARCH

1. Goals

   i.   Provide relevant information
   ii.  Provide a consistently fast response
   iii. Make responses consistent between local and remote search
   iv.  Provide the user with a good interface to the information
   v.   Allow seamless recovery when search fails
   vi.  Ensure the index is (almost) always in a consistent state

2. Approach

   From a high level, there are two components to search: the 
   indexer, which maintains the information needed for search, and the 
   query engine, which actually performs a search of the information 
   provided. The indexer is responsible for creating and updating the 
   indexes and ensuring they're always in a consistent state. It does this 
   by maintaining a set of inverted indexes as text files (details of which
   can be found in the comments at the top of indexer.py). On the server 
   side, it's hooked into the publishing code so that the index is updated 
   each time a package is published. If indexing is already happening when 
   packages are published, they're queued and another update to the indexes 
   happens once the current run is finished. On the client side, it's 
   hooked into the install, image-update, and uninstall code so that each 
   of those actions is reflected in the index.

   The query engine is responsible for processing the text from the user, 
   searching for that token in its information, and giving the client code 
   the information needed for a reasonable response to the user. It must 
   ensure that the information it uses is in a consistent state. On the 
   server, an engine is created during the server initialization. It reads 
   in the files it needs and stores the data internally. When the server gets
   a search request from a client, it hands the search token to the query
   engine. The query engine ensures that it has the most recent information
   (locking and rereading the files from disk if necessary) and then searches
   for the token in its dictionaries. On the client, the process is the same
   except that the indexes are read from disk for every search rather than
   kept in memory, because a new instance of pkg is started for each search.

3. Details

   Search reserves the $ROOT/index directory for its use on both the client
   and the server. It also creates a TMP directory inside index, in which it
   stores indexes until it's ready to migrate them to the proper directory.

   indexer.py contains detailed information about the files used to store the
   index and their formats. 

   3.1 Locking

       The indexes use a version locking protocol. The requirements for the
       protocol are: 
         - the writer never blocks on readers
         - any number of readers are allowed
         - readers must always have consistent data regardless of the
           writer's actions
       To implement these features, several conventions must be observed. The
       writer is responsible for updating these files in another location,
       then moving them on top of existing files so that from a reader's
       perspective, file updates are always atomic. Each file in the index has
       a version in the first line. The writer is responsible for ensuring that
       each time it updates the index, the files all have the same version
       number and that version number has not been previously used. The writer
       is not responsible for moving multiple files atomically, but it should
       make an effort to have files in $ROOT/index be out of sync for as short
       a time as is possible.

       The readers are responsible for ensuring that the files they're
       reading the indexes from are a consistent set (have identical version
       numbers). consistent_open in search_storage takes care of this
       functionality.
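
       The reader-side check described above can be sketched as follows.
       This is a simplified illustration under stated assumptions, not the
       code of consistent_open: the function name open_consistent_set, the
       retry count, and the delay are hypothetical. It reads each header
       from an already-open handle, so a rename performed by the writer
       afterwards cannot change what the reader sees.

```python
import os
import time

VERSION_PREFIX = "VERSION: "


def open_consistent_set(index_dir, names, retries=10, delay=0.1):
    """Open every file in names; succeed only if all share one version.

    Hypothetical helper: returns (version, handles) with each handle
    positioned just after its VERSION line.
    """
    for _ in range(retries):
        handles = []
        versions = set()
        try:
            for name in names:
                f = open(os.path.join(index_dir, name), "r")
                handles.append(f)
                header = f.readline()
                if not header.startswith(VERSION_PREFIX):
                    raise ValueError("no VERSION header in " + name)
                versions.add(int(header[len(VERSION_PREFIX):]))
        except (OSError, ValueError):
            # A missing or malformed file counts as an inconsistent set.
            versions = set()
        if len(versions) == 1:
            return versions.pop(), handles
        # The writer is probably mid-migration: close and try again.
        for f in handles:
            f.close()
        time.sleep(delay)
    raise RuntimeError("index files never reached a consistent version")
```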

   3.2 Implementation overview (how the index is built and updated)

       The implementation of the search index lives primarily in
       src/modules/indexer.py and src/modules/search_storage.py.

       At a high level, updates are handled by Indexer._generic_update_index(),
       which performs these steps:

         1) Obtain an exclusive lock on $ROOT/index/lock to serialize writers.
         2) Read the existing index files using a version-checked, consistent
            open (see search_storage.consistent_open()) so readers always see a
            matched set of files with the same VERSION number.
         3) Build new index artifacts in a temporary directory
            $ROOT/index/TMP/ without disturbing the live files.
         4) Write out all helper dictionaries; then migrate (rename/move)
            files from TMP into $ROOT/index, updating the version number.
            This migration is not atomic across multiple files, so readers use
            version checks and retry to ensure consistency.
         5) Release the lock and remove the temporary directory.

       Two update paths are used depending on what changed:

         - Fast update (incremental): For client-side installs/removals below a
           small threshold (see MAX_FAST_INDEXED_PKGS in indexer.py), the
           update only appends to the fast_add and fast_remove logs and updates
           the set of full fmris. The large main dictionary files are not
           rewritten or moved during this path.

         - Full rebuild: When indexing on a repository (server side), or when
           the number of changes exceeds the fast-update threshold, or when
           an inconsistency/error is detected, the index is rebuilt by parsing
           manifests and regenerating all dictionaries. The new files are then
           migrated into place.

       Temporary files and migration:

         - All new/updated files are written under $ROOT/index/TMP/ with the
           target VERSION header already set. After successful creation, the
           files are moved into $ROOT/index. Legacy auxiliary directories are
           cleaned up as needed during migration.
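
       A minimal sketch of the write-then-move pattern described above,
       assuming TMP and the index root are on the same filesystem so that
       each rename is atomic. The function name migrate and its arguments
       are hypothetical; the real indexer also handles locking and legacy
       cleanup, which are omitted here.

```python
import os
import shutil


def migrate(index_dir, new_contents, version):
    """Write files under index_dir/TMP, then rename them into index_dir.

    new_contents maps a file name to the list of lines that follow the
    VERSION header (hypothetical structure, for illustration only).
    """
    tmp_dir = os.path.join(index_dir, "TMP")
    os.makedirs(tmp_dir, exist_ok=True)
    try:
        for name, lines in new_contents.items():
            with open(os.path.join(tmp_dir, name), "w") as f:
                f.write("VERSION: %d\n" % version)
                for line in lines:
                    f.write(line + "\n")
        # Move only after every file was written successfully. Each
        # rename is atomic, but the group of renames is not, which is
        # why readers compare VERSION headers.
        for name in new_contents:
            os.replace(os.path.join(tmp_dir, name),
                       os.path.join(index_dir, name))
    finally:
        shutil.rmtree(tmp_dir, ignore_errors=True)
```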

   3.3 When indexing is triggered (client and server)

       Server side (repository):

         - The depot hooks indexing into the publish operation. Each time a
           package is published to the repository, the indexer is invoked with
           Indexer.server_update_index(). If another indexing run is already
           in progress, new fmris are queued and a subsequent run processes
           them. The helper function Indexer.check_for_updates(index_root, cat)
           can be used to discover catalog entries that have not yet been
           indexed (e.g., across restarts).
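
       The queueing behaviour on the server can be illustrated with the
       sketch below. It is single-threaded and purely illustrative: the
       class IndexQueue and its methods are hypothetical, and a real depot
       would need proper locking around the shared state.

```python
class IndexQueue:
    """Queue fmris published while an indexing run is in progress."""

    def __init__(self, index_func):
        self._index = index_func   # callable that indexes a list of fmris
        self._pending = []
        self._running = False

    def publish(self, fmri):
        self._pending.append(fmri)
        if self._running:
            # An indexing run is active; it picks this fmri up afterwards.
            return
        self._running = True
        try:
            # Keep running until nothing was queued during the last run.
            while self._pending:
                batch, self._pending = self._pending, []
                self._index(batch)
        finally:
            self._running = False
```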

       Client side (images):

         - The client integrates indexing with image modification operations:
           install, update/image-update, uninstall, and any execution of an
           image plan that changes packages. After successful execution of an
           image plan (see src/modules/client/imageplan.py), the code calls
           Indexer.client_update_index() to record the changes. On a brand-new
           image (empty index directory), Indexer.setup() seeds empty stubs.

         - If a fast incremental update is possible (few package changes), the
           operation only updates the fast logs. If not (too many changes), the
           client releases the lock and triggers a full rebuild via
           Indexer.rebuild_index_from_scratch(image.gen_installed_pkgs()).

         - If an inconsistency or unexpected error occurs during incremental
           update, the client falls back to a full rebuild to restore a clean
           and consistent index.

   3.4 Fast vs. full rebuild criteria

       - Fast updates are used when the number of packages added/removed since
         the last rebuild is small. The current threshold is defined by
         MAX_FAST_INDEXED_PKGS (20 at the time of writing) in indexer.py.
         During a fast update, only these files are changed/migrated:
           * fast_add.v1
           * fast_remove.v1
           * full_fmri_list (and its hash)
         The large main dictionary and token offset files are left untouched.

       - A full rebuild is performed when:
           * The change count exceeds the threshold, or
           * The index on disk is missing or inconsistent, or
           * An error occurs during fast update, or
           * On server-side bulk operations (e.g., first-time indexing).
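
       The criteria above can be summarised in a short sketch. The function
       names and arguments are hypothetical, and treating a change count
       exactly equal to the threshold as "fast" is an assumption; only the
       constant name and its value of 20 come from the text.

```python
MAX_FAST_INDEXED_PKGS = 20  # value quoted in the text above


def choose_update_path(num_changes, index_consistent, on_server=False):
    """Return "fast" or "full" according to the criteria listed above."""
    if on_server or not index_consistent:
        return "full"
    if num_changes > MAX_FAST_INDEXED_PKGS:
        return "full"
    return "fast"


def update_index(changes, index_consistent, fast_update, full_rebuild,
                 on_server=False):
    """Try the fast path when allowed; fall back to a full rebuild."""
    path = choose_update_path(len(changes), index_consistent, on_server)
    if path == "fast":
        try:
            fast_update(changes)
            return "fast"
        except Exception:
            # Any error during a fast update triggers a full rebuild.
            pass
    full_rebuild()
    return "full"
```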

   3.5 Files and on-disk layout

       All files reside under $ROOT/index. Important files include
       (see src/modules/search_storage.py for authoritative names):

         - main_dict.ascii.v2
             The main inverted index mapping search tokens to postings. It is
             written in sorted token blocks and may be split/merged during
             rebuild. The file is very large, so fast updates avoid
             rewriting it.

         - token_byte_offset.v1
             A map of token to byte offsets into main_dict for efficient
             random access by the query engine.

         - fast_add.v1 / fast_remove.v1
             Incremental logs holding fmris added/removed since the last full
             rebuild. Used to answer queries without rebuilding immediately.

         - full_fmri_list and full_fmri_list.hash
             A list (and content hash) of all fmris currently represented in
             the index. Used to detect divergence and for consistency checks.

         - fmri_offsets.v1
             An auxiliary mapping between fmri identifiers and positions used
             when assembling postings; replaces legacy per-pkg files.

         - manf_list.v1
             A mapping table of internal manifest IDs to their fmri strings.

         - lock
             The writer lock file used to serialize index modifications.

       During indexing, new versions of the above are created in
       $ROOT/index/TMP with a new VERSION header and then moved into
       $ROOT/index. Readers always verify that all open files have identical
       VERSION headers and will reopen/retry if a migration is in progress.
       
   3.6 On-disk file formats (authoritative specification)

       This section describes the exact line formats, encodings, and
       invariants for the files under $ROOT/index as implemented by
       src/modules/search_storage.py and used by src/modules/indexer.py.

       Unless otherwise noted, every index file begins with a first line of
       the form:

           VERSION: <integer>\n

       All subsequent lines are specific to each file’s purpose, and may be
       in arbitrary order unless ordering is explicitly stated. Readers always
       validate that all opened files share the same VERSION number.

       Conventions used below:
         - “token” means a search term after tokenization.
         - “fmri” means a package FMRI string; for storage it is rendered
           with the scheme omitted (include_scheme=False) and with
           anarchy=True to avoid normalization changes.
         - “byte offset” means an integer offset used by the query engine to
           seek quickly within a larger file or posting list.

       3.6.1 main_dict.ascii.v2 — main inverted index

         Purpose
           Maps each token to its postings grouped by action type, key subtype,
           and full value. Very large; written during full rebuilds only.

         Encoding
           One line per token. Each line has five kinds of separators in the
           following precedence: space, '!', '@', '#', ','. The token itself is
           URL-quoted (urllib.parse.quote) to ensure line safety.

           Grammar (informal):

             <line> := <token_q> <space> <AT_LIST> '\n'
             <AT_LIST> := <AT_ENTRY> | <AT_ENTRY> <space> <AT_LIST>
             <AT_ENTRY> := <action_type> '!' <ST_LIST>
             <ST_LIST> := <ST_ENTRY> | <ST_ENTRY> '!' <ST_LIST>
             <ST_ENTRY> := <subtype> '@' <FV_LIST>
             <FV_LIST> := <FV_ENTRY> | <FV_ENTRY> '@' <FV_LIST>
             <FV_ENTRY> := <full_value_q> '#' <PF_LIST>
             <PF_LIST> := <PF_PAIR> | <PF_PAIR> '#' <PF_LIST>
             <PF_PAIR> := <pfmri_index> ',' <offset> [',' <offset>]*

           Where <token_q> and <full_value_q> are URL-quoted; numeric fields are
           base-10 integers. The first number after '#' in a PF pair is the
           integer id of the FMRI (line number in manf_list.v1), followed by
           one or more manifest byte offsets where the token matched.

         Example

           %25gconf.xml file!basename@basename#579,13249,13692,77391,77628

           Meaning: the token "%gconf.xml" (quoted to %25gconf.xml) appears in
           action type "file", key subtype "basename", with full_value
           "basename" in manifest id 579 at offsets 13249, 13692, 77391, 77628.

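           The example line can be decoded with a sketch like the one below,
           which peels off one separator per nesting level in the stated
           precedence (space, '!', '@', '#', ','). The function name is
           hypothetical and the sketch is not the parser in
           search_storage.py; it assumes the line matches the example's
           layout and does no error handling.

```python
from urllib.parse import unquote


def parse_main_dict_line_sketch(line):
    """Return (token, [(action_type, [(subtype, [(full_value,
    [(pfmri_index, [offsets])])])])]) for one main dictionary line."""
    fields = line.rstrip("\n").split(" ")
    token = unquote(fields[0])
    result = []
    for at_entry in fields[1:]:
        at_parts = at_entry.split("!")
        subtypes = []
        for st_entry in at_parts[1:]:
            st_parts = st_entry.split("@")
            values = []
            for fv_entry in st_parts[1:]:
                fv_parts = fv_entry.split("#")
                pairs = []
                for pf in fv_parts[1:]:
                    nums = [int(n) for n in pf.split(",")]
                    # First number is the manifest id, the rest offsets.
                    pairs.append((nums[0], nums[1:]))
                values.append((unquote(fv_parts[0]), pairs))
            subtypes.append((st_parts[0], values))
        result.append((at_parts[0], subtypes))
    return token, result
```
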
       3.6.2 token_byte_offset.v1 — token → byte offset map

         Purpose
           Provides random access into large posting structures for a token.
           Written during full rebuilds; unchanged during fast updates.

         Encoding
           After the VERSION line, each subsequent line maps a token to a
           byte offset:

             <tok_marker><tok> ' ' <offset> '\n'

           Where <tok_marker> is one of:
             - '0' + tok        when tok contains no spaces
             - '1' + quote(tok) when tok contains spaces (URL-quoted)

           On read, '1' indicates the token must be URL-unquoted. Offsets are
           base-10 integers. Example:

             0libc 123456
             1hello%20world 98765
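
           A small sketch of the marker scheme, assuming the quoting is done
           with urllib.parse.quote and unquote. The function names are
           hypothetical.

```python
from urllib.parse import quote, unquote


def encode_entry(tok, offset):
    """Build one line: marker + token, a space, then the offset."""
    if " " in tok:
        return "1" + quote(tok) + " " + str(offset) + "\n"
    return "0" + tok + " " + str(offset) + "\n"


def decode_entry(line):
    """Inverse of encode_entry: return (token, offset)."""
    body, offset = line.rstrip("\n").rsplit(" ", 1)
    marker, tok = body[0], body[1:]
    if marker == "1":
        tok = unquote(tok)
    return tok, int(offset)
```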

       3.6.3 fast_add.v1 and fast_remove.v1 — incremental update logs

         Purpose
           Record installed/uninstalled package FMRIs since the last full
           rebuild. Used to answer queries incrementally without rewriting the
           main dictionaries. Updated by fast client updates only.

         Encoding
            After the VERSION line, each non-empty line contains a single FMRI
            string (anarchy=True, include_scheme=False). The same FMRI may
            appear in both files (for example, installed and later removed),
            but the indexer avoids duplicates within each file.
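
           Reading such a log is straightforward. The sketch below assumes
           only the layout just described (a VERSION line followed by one
           FMRI per non-empty line); the function name is hypothetical.

```python
def read_fmri_log(lines):
    """Return (version, fmris) from an iterable of lines or a file object."""
    it = iter(lines)
    header = next(it, "")
    prefix = "VERSION: "
    if not header.startswith(prefix):
        raise ValueError("missing VERSION header")
    version = int(header[len(prefix):])
    # Skip blank lines; keep the order in which FMRIs were logged.
    fmris = [ln.strip() for ln in it if ln.strip()]
    return version, fmris
```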

       3.6.4 full_fmri_list and full_fmri_list.hash — membership and checksum

         Purpose
           `full_fmri_list` holds the complete set of FMRIs represented by the
           index at a given VERSION. `full_fmri_list.hash` stores a SHA-1 of
           the sorted `full_fmri_list` contents for quick integrity checking
           and to support old clients.

         Encoding
           - full_fmri_list: one FMRI per line after VERSION.
           - full_fmri_list.hash: after VERSION, exactly one line containing a
             lowercase hexadecimal SHA-1 digest of the sorted FMRI list.
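
           One way such a digest could be computed is sketched below. The
           text does not say how entries are joined or encoded, so feeding
           the sorted FMRIs to the hash one after another, with no separator
           and as UTF-8, is an assumption; the function name is hypothetical.

```python
import hashlib


def fmri_list_digest(fmris):
    """SHA-1 over the sorted FMRI strings (assumed: UTF-8, no separator)."""
    h = hashlib.sha1()
    for fmri in sorted(fmris):
        h.update(fmri.encode("utf-8"))
    return h.hexdigest()  # lowercase hexadecimal
```

           Because the input is sorted first, the digest does not depend on
           the order in which FMRIs appear in full_fmri_list.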

       3.6.5 manf_list.v1 — manifest id ↔ fmri mapping

         Purpose
           Provides a compact bidirectional mapping for manifest ids used in
           `main_dict.ascii.v2` postings (pfmri_index) and other structures.

         Encoding
           After VERSION, each line corresponds to a numeric id equal to its
           zero-based line number (excluding the VERSION line). A blank line
           represents an empty/removed slot that can be reused later.

           Examples (line numbers shown at left for clarity; not stored):

             0: library/libc@1.0-0.0.0.0.0
             1: driver/storage@2.3-1
             2: \n                ← id 2 available for reuse
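
           A sketch of loading the table and finding reusable ids, assuming
           the VERSION line has already been consumed. The function name and
           return shape are hypothetical.

```python
def load_manf_list(lines):
    """Map zero-based line numbers to fmris; collect blank (free) slots."""
    id_to_fmri = {}
    free_ids = []
    for idx, line in enumerate(lines):
        fmri = line.rstrip("\n")
        if fmri:
            id_to_fmri[idx] = fmri
        else:
            free_ids.append(idx)  # blank line: id can be reused
    return id_to_fmri, free_ids
```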

       3.6.6 fmri_offsets.v1 — fmri groups → delta-compressed offsets

         Purpose
           Associates groups of FMRIs with sets of manifest offsets using
           delta compression and deduplication. Used during rebuild to avoid
           storing duplicate offset lists per FMRI.

         Encoding
           After VERSION, each line encodes a set of space-separated FMRIs,
           then '!', then a space-separated list of delta-encoded offsets:

             <fmri_1> ' ' <fmri_2> ... '!' <d0> ' ' <d1> ... '\n'

           The offsets after '!' are not absolute; they are deltas. To recover
           absolute offsets, start from 0 and cumulatively add each value:

             abs[0] = d0
             abs[i] = abs[i-1] + di

           Example

             lib/libc@1.0-0 lib/libm@1.0-0!10 5 7

           This expands to absolute offsets [10, 15, 22].
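
           The delta scheme can be sketched as follows, assuming the line
           layout shown above. The function names are hypothetical; sorting
           the offsets before encoding is an assumption that keeps every
           delta non-negative.

```python
from itertools import accumulate


def parse_fmri_offsets_line(line):
    """Return (fmris, absolute_offsets) for one fmri_offsets.v1 line."""
    fmri_part, delta_part = line.rstrip("\n").split("!", 1)
    fmris = fmri_part.split(" ")
    deltas = [int(d) for d in delta_part.split(" ")]
    # Running sum turns deltas back into absolute offsets.
    return fmris, list(accumulate(deltas))


def encode_offsets(offsets):
    """Turn absolute offsets into deltas from the previous offset."""
    deltas = []
    prev = 0
    for off in sorted(offsets):
        deltas.append(off - prev)
        prev = off
    return deltas
```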

       3.6.7 Auxiliary __at_* and __st_* files (legacy per-type caches)

         Purpose
           During full rebuilds the indexer may generate per-action-type and
           per-subtype auxiliary files named "__at_<action>" and "__st_<sub>".
           These are migration artifacts moved from TMP into the index root by
           the _migrate() step for compatibility with older query paths. Their
           exact internal format is not intended as a public contract and may
           change; they are managed entirely by the indexer.

       3.6.8 lock — writer lock

         Purpose
           A lock file used to serialize writers. The file’s content is
           controlled by the generic lock mechanism in pkg.lockfile. There is
           no VERSION header; readers ignore this file.

       3.6.9 Invariants and validation

         - All non-auxiliary files (except lock) MUST begin with identical
           VERSION lines within a consistent snapshot.
         - URL-quoting is used for tokens and any full values that may contain
           spaces or reserved separators; conversely, readers must unquote
           where indicated by the format (see token_byte_offset.v1 and
           main_dict.ascii.v2).
         - `full_fmri_list.hash` is computed as the SHA-1 of the sorted
           `full_fmri_list` contents encoded as bytes; it is used for quick
           integrity checks and interop with older clients.
         - `manf_list.v1` may contain blank lines indicating reusable ids; ids
           are the zero-based line numbers (excluding the VERSION line).
         - `fmri_offsets.v1` stores delta-encoded offsets; readers must
           convert to absolute offsets before use.
         - Fast updates only modify: fast_add.v1, fast_remove.v1, and
           full_fmri_list (+ hash). Full rebuilds rewrite all structures.