2025-07-22 11:57:44 +02:00
|
|
|
|
|
|
|
|
|
|
pkg
|
|
|
|
|
|
SEARCH
|
|
|
|
|
|
|
|
|
|
|
|
1. Goals
|
|
|
|
|
|
|
|
|
|
|
|
i. Provide relevant information
|
|
|
|
|
|
ii. Provide a consistently fast response
|
|
|
|
|
|
iii. Make responses consistent between local and remote search
|
|
|
|
|
|
iv. Provide the user with a good interface to the information
|
|
|
|
|
|
v. Allow seamless recovery when search fails
|
|
|
|
|
|
vi. Ensure the index is (almost) always in a consistent state
|
|
|
|
|
|
|
|
|
|
|
|
2. Approach
|
|
|
|
|
|
|
|
|
|
|
|
From a high level, there are two components to search: the
|
|
|
|
|
|
indexer, which maintains the information needed for search; the
|
|
|
|
|
|
query engine, which actually performs a search of the information
|
|
|
|
|
|
provided. The indexer is responsible for creating and updating the
|
|
|
|
|
|
indexes and ensuring they're always in a consistent state. It does this
|
|
|
|
|
|
by maintaining a set of inverted indexes as text files (details of which
|
|
|
|
|
|
can be found in the comments at the top of indexer.py). On the server
|
|
|
|
|
|
side, it's hooked into the publishing code so that the index is updated
|
|
|
|
|
|
each time a package is published. If indexing is already happening when
|
|
|
|
|
|
packages are published, they're queued and another update to the indexes
|
|
|
|
|
|
happens once the current run is finished. On the client side, it's
|
|
|
|
|
|
hooked into the install, image-update, and uninstall code so that each
|
|
|
|
|
|
of those actions are reflected in the index.
|
|
|
|
|
|
|
|
|
|
|
|
The query engine is responsible for processing the text from the user,
|
|
|
|
|
|
searching for that token in its information, and giving the client code
|
|
|
|
|
|
the information needed for a reasonable response to the user. It must
|
|
|
|
|
|
ensure that the information it uses is in a consistent state. On the
|
|
|
|
|
|
server, an engine is created during the server initialization. It reads
|
|
|
|
|
|
in the files it needs and stores the data internally. When the server gets
|
|
|
|
|
|
a search request from a client, it hands the search token to the query
|
|
|
|
|
|
engine. The query engine ensures that it has the most recent information
|
|
|
|
|
|
(locking and rereading the files from disk if necessary) and then searches
|
|
|
|
|
|
for the token in its dictionaries. On the client, the process is the same
|
|
|
|
|
|
except that the indexes are read from disk each time instead of being stored
|
|
|
|
|
|
because a new instance of pkg is started for each search.
|
|
|
|
|
|
|
|
|
|
|
|
3. Details
|
|
|
|
|
|
|
|
|
|
|
|
Search reserves the $ROOT/index directory for its use on both the client
|
|
|
|
|
|
and the server. It also creates a TMP directory inside index which it stores
|
|
|
|
|
|
indexes in until it's ready to migrate them to the the proper directory.
|
|
|
|
|
|
|
|
|
|
|
|
indexer.py contains detailed information about the files used to store the
|
|
|
|
|
|
index and their formats.
|
|
|
|
|
|
|
|
|
|
|
|
3.1 Locking
|
|
|
|
|
|
|
|
|
|
|
|
The indexes use a version locking protocol. The requirements for the
|
|
|
|
|
|
protocol are:
|
|
|
|
|
|
the writer never blocks on readers
|
|
|
|
|
|
any number of readers are allowed
|
|
|
|
|
|
readers must always have consistent data regardless the
|
|
|
|
|
|
writer's actions
|
|
|
|
|
|
To implement these features, several conventions must be observed. The
|
|
|
|
|
|
writer is responsible for updating these files in another location,
|
|
|
|
|
|
then moving them on top of existing files so that from a reader's
|
|
|
|
|
|
perspective, file updates are always atomic. Each file in the index has
|
|
|
|
|
|
a version in the first line. The writer is responsible for ensuring that
|
|
|
|
|
|
each time it updates the index, the files all have the same version
|
|
|
|
|
|
number and that version number has not been previously used. The writer
|
|
|
|
|
|
is not responsible for moving multiple files atomically, but it should
|
|
|
|
|
|
make an effort to have files in $ROOT/index be out of sync for as short
|
|
|
|
|
|
a time as is possible.
|
|
|
|
|
|
|
|
|
|
|
|
The readers are responsible for ensuring that the files their reading
|
|
|
|
|
|
the indexes from are a consistent set (have identical version
|
|
|
|
|
|
numbers). consistent_open in search_storage takes care of this
|
|
|
|
|
|
functionality.
|
2025-12-08 20:10:30 +01:00
|
|
|
|
|
|
|
|
|
|
3.2 Implementation overview (how the index is built and updated)
|
|
|
|
|
|
|
|
|
|
|
|
The implementation of the search index lives primarily in
|
|
|
|
|
|
src/modules/indexer.py and src/modules/search_storage.py.
|
|
|
|
|
|
|
|
|
|
|
|
At a high level, updates are handled by Indexer._generic_update_index(),
|
|
|
|
|
|
which performs these steps:
|
|
|
|
|
|
|
|
|
|
|
|
1) Obtain an exclusive lock on $ROOT/index/lock to serialize writers.
|
|
|
|
|
|
2) Read the existing index files using a version-checked, consistent
|
|
|
|
|
|
open (see search_storage.consistent_open()) so readers always see a
|
|
|
|
|
|
matched set of files with the same VERSION number.
|
|
|
|
|
|
3) Build new index artifacts in a temporary directory
|
|
|
|
|
|
$ROOT/index/TMP/ without disturbing the live files.
|
|
|
|
|
|
4) Write out all helper dictionaries; then migrate (rename/move)
|
|
|
|
|
|
files from TMP into $ROOT/index, updating the version number.
|
|
|
|
|
|
This migration is not atomic across multiple files, so readers use
|
|
|
|
|
|
version checks and retry to ensure consistency.
|
|
|
|
|
|
5) Release the lock and remove the temporary directory.
|
|
|
|
|
|
|
|
|
|
|
|
Two update paths are used depending on what changed:
|
|
|
|
|
|
|
|
|
|
|
|
- Fast update (incremental): For client-side installs/removals below a
|
|
|
|
|
|
small threshold (see MAX_FAST_INDEXED_PKGS in indexer.py), the
|
|
|
|
|
|
update only appends to the fast_add and fast_remove logs and updates
|
|
|
|
|
|
the set of full fmris. The large main dictionary files are not
|
|
|
|
|
|
rewritten or moved during this path.
|
|
|
|
|
|
|
|
|
|
|
|
- Full rebuild: When indexing on a repository (server side), or when
|
|
|
|
|
|
the number of changes exceeds the fast-update threshold, or when
|
|
|
|
|
|
an inconsistency/error is detected, the index is rebuilt by parsing
|
|
|
|
|
|
manifests and regenerating all dictionaries. The new files are then
|
|
|
|
|
|
migrated into place.
|
|
|
|
|
|
|
|
|
|
|
|
Temporary files and migration:
|
|
|
|
|
|
|
|
|
|
|
|
- All new/updated files are written under $ROOT/index/TMP/ with the
|
|
|
|
|
|
target VERSION header already set. After successful creation, the
|
|
|
|
|
|
files are moved into $ROOT/index. Legacy auxiliary directories are
|
|
|
|
|
|
cleaned up as needed during migration.
|
|
|
|
|
|
|
|
|
|
|
|
3.3 When indexing is triggered (client and server)
|
|
|
|
|
|
|
|
|
|
|
|
Server side (repository):
|
|
|
|
|
|
|
|
|
|
|
|
- The depot hooks indexing into the publish operation. Each time a
|
|
|
|
|
|
package is published to the repository, the indexer is invoked with
|
|
|
|
|
|
Indexer.server_update_index(). If another indexing run is already
|
|
|
|
|
|
in progress, new fmris are queued and a subsequent run processes
|
|
|
|
|
|
them. The helper function Indexer.check_for_updates(index_root, cat)
|
|
|
|
|
|
can be used to discover catalog entries that have not yet been
|
|
|
|
|
|
indexed (e.g., across restarts).
|
|
|
|
|
|
|
|
|
|
|
|
Client side (images):
|
|
|
|
|
|
|
|
|
|
|
|
- The client integrates indexing with image modification operations:
|
|
|
|
|
|
install, update/image-update, uninstall, and any execution of an
|
|
|
|
|
|
image plan that changes packages. After successful execution of an
|
|
|
|
|
|
image plan (see src/modules/client/imageplan.py), the code calls
|
|
|
|
|
|
Indexer.client_update_index() to record the changes. On a brand-new
|
|
|
|
|
|
image (empty index directory), Indexer.setup() seeds empty stubs.
|
|
|
|
|
|
|
|
|
|
|
|
- If a fast incremental update is possible (few package changes), the
|
|
|
|
|
|
operation only updates the fast logs. If not (too many changes), the
|
|
|
|
|
|
client releases the lock and triggers a full rebuild via
|
|
|
|
|
|
Indexer.rebuild_index_from_scratch(image.gen_installed_pkgs()).
|
|
|
|
|
|
|
|
|
|
|
|
- If an inconsistency or unexpected error occurs during incremental
|
|
|
|
|
|
update, the client falls back to a full rebuild to restore a clean
|
|
|
|
|
|
and consistent index.
|
|
|
|
|
|
|
|
|
|
|
|
3.4 Fast vs. full rebuild criteria
|
|
|
|
|
|
|
|
|
|
|
|
- Fast updates are used when the number of packages added/removed since
|
|
|
|
|
|
the last rebuild is small. The current threshold is defined by
|
|
|
|
|
|
MAX_FAST_INDEXED_PKGS (20 at the time of writing) in indexer.py.
|
|
|
|
|
|
During a fast update, only these files are changed/migrated:
|
|
|
|
|
|
* fast_add.v1
|
|
|
|
|
|
* fast_remove.v1
|
|
|
|
|
|
* full_fmri_list (and its hash)
|
|
|
|
|
|
The large main dictionary and token offset files are left untouched.
|
|
|
|
|
|
|
|
|
|
|
|
- A full rebuild is performed when:
|
|
|
|
|
|
* The change count exceeds the threshold, or
|
|
|
|
|
|
* The index on disk is missing or inconsistent, or
|
|
|
|
|
|
* An error occurs during fast update, or
|
|
|
|
|
|
* On server-side bulk operations (e.g., first-time indexing).
|
|
|
|
|
|
|
|
|
|
|
|
3.5 Files and on-disk layout
|
|
|
|
|
|
|
|
|
|
|
|
All files reside under $ROOT/index. Important files include
|
|
|
|
|
|
(see src/modules/search_storage.py for authoritative names):
|
|
|
|
|
|
|
|
|
|
|
|
- main_dict.ascii.v2
|
|
|
|
|
|
The main inverted index mapping search tokens to postings. It is
|
|
|
|
|
|
written in sorted token blocks and may be split/merged during
|
|
|
|
|
|
rebuild; very large and thus avoided in fast updates.
|
|
|
|
|
|
|
|
|
|
|
|
- token_byte_offset.v1
|
|
|
|
|
|
A map of token to byte offsets into main_dict for efficient
|
|
|
|
|
|
random access by the query engine.
|
|
|
|
|
|
|
|
|
|
|
|
- fast_add.v1 / fast_remove.v1
|
|
|
|
|
|
Incremental logs holding fmris added/removed since the last full
|
|
|
|
|
|
rebuild. Used to answer queries without rebuilding immediately.
|
|
|
|
|
|
|
|
|
|
|
|
- full_fmri_list and full_fmri_list.hash
|
|
|
|
|
|
A list (and content hash) of all fmris currently represented in
|
|
|
|
|
|
the index. Used to detect divergence and for consistency checks.
|
|
|
|
|
|
|
|
|
|
|
|
- fmri_offsets.v1
|
|
|
|
|
|
An auxiliary mapping between fmri identifiers and positions used
|
|
|
|
|
|
when assembling postings; replaces legacy per-pkg files.
|
|
|
|
|
|
|
|
|
|
|
|
- manf_list.v1
|
|
|
|
|
|
A mapping table of internal manifest IDs to their fmri strings.
|
|
|
|
|
|
|
|
|
|
|
|
- lock
|
|
|
|
|
|
The writer lock file used to serialize index modifications.
|
|
|
|
|
|
|
|
|
|
|
|
During indexing, new versions of the above are created in
|
|
|
|
|
|
$ROOT/index/TMP with a new VERSION header and then moved into
|
|
|
|
|
|
$ROOT/index. Readers always verify that all open files have identical
|
|
|
|
|
|
VERSION headers and will reopen/retry if a migration is in progress.
|
|
|
|
|
|
|
|
|
|
|
|
3.6 On-disk file formats (authoritative specification)
|
|
|
|
|
|
|
|
|
|
|
|
This section describes the exact line formats, encodings, and
|
|
|
|
|
|
invariants for the files under $ROOT/index as implemented by
|
|
|
|
|
|
src/modules/search_storage.py and used by src/modules/indexer.py.
|
|
|
|
|
|
|
|
|
|
|
|
Unless otherwise noted, every index file begins with a first line of
|
|
|
|
|
|
the form:
|
|
|
|
|
|
|
|
|
|
|
|
VERSION: <integer>\n
|
|
|
|
|
|
All subsequent lines are specific to each file’s purpose, and may be
|
|
|
|
|
|
in arbitrary order unless ordering is explicitly stated. Readers always
|
|
|
|
|
|
validate that all opened files share the same VERSION number.
|
|
|
|
|
|
|
|
|
|
|
|
Conventions used below:
|
|
|
|
|
|
- “token” means a search term after tokenization.
|
|
|
|
|
|
- “fmri” means a package FMRI string; for storage the scheme is
|
|
|
|
|
|
omitted (include_scheme=False) and anarchy=True to avoid
|
|
|
|
|
|
normalization changes.
|
|
|
|
|
|
- “byte offset” means an integer offset used by the query engine to
|
|
|
|
|
|
seek quickly within a larger file or posting list.
|
|
|
|
|
|
|
|
|
|
|
|
3.6.1 main_dict.ascii.v2 — main inverted index
|
|
|
|
|
|
|
|
|
|
|
|
Purpose
|
|
|
|
|
|
Maps each token to its postings grouped by action type, key subtype,
|
|
|
|
|
|
and full value. Very large; written during full rebuilds only.
|
|
|
|
|
|
|
|
|
|
|
|
Encoding
|
|
|
|
|
|
One line per token. Each line has five kinds of separators in the
|
|
|
|
|
|
following precedence: space, '!', '@', '#', ','. The token itself is
|
|
|
|
|
|
URL-quoted (urllib.parse.quote) to ensure line safety.
|
|
|
|
|
|
|
|
|
|
|
|
Grammar (informal):
|
|
|
|
|
|
|
|
|
|
|
|
<line> := <token_q> <space> <AT_LIST> '\n'
|
|
|
|
|
|
<AT_LIST> := <AT_ENTRY> | <AT_ENTRY> <space> <AT_LIST>
|
|
|
|
|
|
<AT_ENTRY> := <action_type> '!' <ST_LIST>
|
|
|
|
|
|
<ST_LIST> := <ST_ENTRY> | <ST_ENTRY> '@' <ST_LIST>
|
|
|
|
|
|
<ST_ENTRY> := <subtype> '#' <FV_LIST>
|
|
|
|
|
|
<FV_LIST> := <FV_ENTRY> | <FV_ENTRY> '#' <FV_LIST>
|
|
|
|
|
|
<FV_ENTRY> := <full_value_q> <PF_LIST>
|
|
|
|
|
|
<PF_LIST> := <PF_PAIR> | <PF_PAIR> '#' <PF_LIST>
|
|
|
|
|
|
<PF_PAIR> := <pfmri_index> ',' <offset> [',' <offset>]*
|
|
|
|
|
|
|
|
|
|
|
|
Where <token_q> and <full_value_q> are URL-quoted; numeric fields are
|
|
|
|
|
|
base-10 integers. The first number after '#' in a PF pair is the
|
|
|
|
|
|
integer id of the FMRI (line number in manf_list.v1), followed by
|
|
|
|
|
|
one or more manifest byte offsets where the token matched.
|
|
|
|
|
|
|
|
|
|
|
|
Example
|
|
|
|
|
|
|
|
|
|
|
|
%25gconf.xml file!basename@basename#579,13249,13692,77391,77628
|
|
|
|
|
|
|
|
|
|
|
|
Meaning: the token "%gconf.xml" (quoted to %25gconf.xml) appears in
|
|
|
|
|
|
action type "file", key subtype "basename", with full_value
|
|
|
|
|
|
"basename" in manifest id 579 at offsets 13249, 13692, 77391, 77628.
|
|
|
|
|
|
|
|
|
|
|
|
3.6.2 token_byte_offset.v1 — token → byte offset map
|
|
|
|
|
|
|
|
|
|
|
|
Purpose
|
|
|
|
|
|
Provides random access into large posting structures for a token.
|
|
|
|
|
|
Written during full rebuilds; unchanged during fast updates.
|
|
|
|
|
|
|
|
|
|
|
|
Encoding
|
|
|
|
|
|
After the VERSION line, each subsequent line maps a token to a
|
|
|
|
|
|
byte offset:
|
|
|
|
|
|
|
|
|
|
|
|
<tok_marker><tok> ' ' <offset> '\n'
|
|
|
|
|
|
|
|
|
|
|
|
Where <tok_marker> is one of:
|
|
|
|
|
|
- '0' + tok when tok contains no spaces
|
|
|
|
|
|
- '1' + quote(tok) when tok contains spaces (URL-quoted)
|
|
|
|
|
|
|
|
|
|
|
|
On read, '1' indicates the token must be URL-unquoted. Offsets are
|
|
|
|
|
|
base-10 integers. Example:
|
|
|
|
|
|
|
|
|
|
|
|
0libc 123456
|
|
|
|
|
|
1hello%20world 98765
|
|
|
|
|
|
|
|
|
|
|
|
3.6.3 fast_add.v1 and fast_remove.v1 — incremental update logs
|
|
|
|
|
|
|
|
|
|
|
|
Purpose
|
|
|
|
|
|
Record installed/uninstalled package FMRIs since the last full
|
|
|
|
|
|
rebuild. Used to answer queries incrementally without rewriting the
|
|
|
|
|
|
main dictionaries. Updated by fast client updates only.
|
|
|
|
|
|
|
|
|
|
|
|
Encoding
|
|
|
|
|
|
After the VERSION line, each non-empty line contains a single FMRI
|
|
|
|
|
|
string (anarchy=True, include_scheme=False). Lines may repeat across
|
|
|
|
|
|
files (e.g., install vs remove) but individual sets are managed to
|
|
|
|
|
|
avoid duplicates by the indexer logic.
|
|
|
|
|
|
|
|
|
|
|
|
3.6.4 full_fmri_list and full_fmri_list.hash — membership and checksum
|
|
|
|
|
|
|
|
|
|
|
|
Purpose
|
|
|
|
|
|
`full_fmri_list` holds the complete set of FMRIs represented by the
|
|
|
|
|
|
index at a given VERSION. `full_fmri_list.hash` stores a SHA-1 of
|
|
|
|
|
|
the sorted `full_fmri_list` contents for quick integrity checking
|
|
|
|
|
|
and to support old clients.
|
|
|
|
|
|
|
|
|
|
|
|
Encoding
|
|
|
|
|
|
- full_fmri_list: one FMRI per line after VERSION.
|
|
|
|
|
|
- full_fmri_list.hash: after VERSION, exactly one line containing a
|
|
|
|
|
|
lowercase hexadecimal SHA-1 digest of the sorted FMRI list.
|
|
|
|
|
|
|
|
|
|
|
|
3.6.5 manf_list.v1 — manifest id ↔ fmri mapping
|
|
|
|
|
|
|
|
|
|
|
|
Purpose
|
|
|
|
|
|
Provides a compact bidirectional mapping for manifest ids used in
|
|
|
|
|
|
`main_dict.ascii.v2` postings (pfmri_index) and other structures.
|
|
|
|
|
|
|
|
|
|
|
|
Encoding
|
|
|
|
|
|
After VERSION, each line corresponds to a numeric id equal to its
|
|
|
|
|
|
zero-based line number (excluding the VERSION line). A blank line
|
|
|
|
|
|
represents an empty/removed slot that can be reused later.
|
|
|
|
|
|
|
|
|
|
|
|
Examples (line numbers shown at left for clarity; not stored):
|
|
|
|
|
|
|
|
|
|
|
|
0: library/libc@1.0-0.0.0.0.0
|
|
|
|
|
|
1: driver/storage@2.3-1
|
|
|
|
|
|
2: \n ← id 2 available for reuse
|
|
|
|
|
|
|
|
|
|
|
|
3.6.6 fmri_offsets.v1 — fmri groups → delta-compressed offsets
|
|
|
|
|
|
|
|
|
|
|
|
Purpose
|
|
|
|
|
|
Associates groups of FMRIs with sets of manifest offsets using
|
|
|
|
|
|
delta compression and deduplication. Used during rebuild to avoid
|
|
|
|
|
|
storing duplicate offset lists per FMRI.
|
|
|
|
|
|
|
|
|
|
|
|
Encoding
|
|
|
|
|
|
After VERSION, each line encodes a set of space-separated FMRIs,
|
|
|
|
|
|
then '!', then a space-separated list of delta-encoded offsets:
|
|
|
|
|
|
|
|
|
|
|
|
<fmri_1> ' ' <fmri_2> ... '!' <d0> ' ' <d1> ... '\n'
|
|
|
|
|
|
|
|
|
|
|
|
The offsets after '!' are not absolute; they are deltas. To recover
|
|
|
|
|
|
absolute offsets, start from 0 and cumulatively add each value:
|
|
|
|
|
|
|
|
|
|
|
|
abs[0] = d0
|
|
|
|
|
|
abs[i] = abs[i-1] + di
|
|
|
|
|
|
|
|
|
|
|
|
Example
|
|
|
|
|
|
|
|
|
|
|
|
lib/libc@1.0-0 lib/libm@1.0-0!10 5 7
|
|
|
|
|
|
|
|
|
|
|
|
This expands to absolute offsets [10, 15, 22].
|
|
|
|
|
|
|
|
|
|
|
|
3.6.7 Auxiliary __at_* and __st_* files (legacy per-type caches)
|
|
|
|
|
|
|
|
|
|
|
|
Purpose
|
|
|
|
|
|
During full rebuilds the indexer may generate per-action-type and
|
|
|
|
|
|
per-subtype auxiliary files named "__at_<action>" and "__st_<sub>".
|
|
|
|
|
|
These are migration artifacts moved from TMP into the index root by
|
|
|
|
|
|
the _migrate() step for compatibility with older query paths. Their
|
|
|
|
|
|
exact internal format is not intended as a public contract and may
|
|
|
|
|
|
change; they are managed entirely by the indexer.
|
|
|
|
|
|
|
|
|
|
|
|
3.6.8 lock — writer lock
|
|
|
|
|
|
|
|
|
|
|
|
Purpose
|
|
|
|
|
|
A lock file used to serialize writers. The file’s content is
|
|
|
|
|
|
controlled by the generic lock mechanism in pkg.lockfile. There is
|
|
|
|
|
|
no VERSION header; readers ignore this file.
|
|
|
|
|
|
|
|
|
|
|
|
3.6.9 Invariants and validation
|
|
|
|
|
|
|
|
|
|
|
|
- All non-auxiliary files (except lock) MUST begin with identical
|
|
|
|
|
|
VERSION lines within a consistent snapshot.
|
|
|
|
|
|
- URL-quoting is used for tokens and any full values that may contain
|
|
|
|
|
|
spaces or reserved separators; conversely, readers must unquote
|
|
|
|
|
|
where indicated by the format (see token_byte_offset.v1 and
|
|
|
|
|
|
main_dict.ascii.v2).
|
|
|
|
|
|
- `full_fmri_list.hash` is computed as the SHA-1 of the sorted
|
|
|
|
|
|
`full_fmri_list` contents encoded as bytes; it is used for quick
|
|
|
|
|
|
integrity checks and interop with older clients.
|
|
|
|
|
|
- `manf_list.v1` may contain blank lines indicating reusable ids; ids
|
|
|
|
|
|
are the zero-based line numbers (excluding the VERSION line).
|
|
|
|
|
|
- `fmri_offsets.v1` stores delta-encoded offsets; readers must
|
|
|
|
|
|
convert to absolute offsets before use.
|
|
|
|
|
|
- Fast updates only modify: fast_add.v1, fast_remove.v1, and
|
|
|
|
|
|
full_fmri_list (+ hash). Full rebuilds rewrite all structures.
|