solstice-ci/docs/ai/2025-10-25-solstice-ci-architecture.md
Till Wegmueller a71f9cc7d1
Initial Commit
Signed-off-by: Till Wegmueller <toasterson@gmail.com>
2025-10-25 20:01:08 +02:00



### Solstice CI — Architecture Overview (KDL Jobs + Multi-Host Orchestrator)
This document updates the earlier blueprint to reflect the current direction of Solstice CI:
- The project name is Solstice CI (not Helios CI).
- Workflows are defined in KDL instead of YAML.
- The Orchestrator is designed to run on multiple hosts behind a shared queue for horizontal scale.
- A small set of crates provides clean separation of concerns: `orchestrator`, `forge-integration`, `github-integration`, `workflow-runner`, `common`, `ciadm`, and `cidev`.
#### Core Components
- Forge Integration Layer (`crates/forge-integration` and `crates/github-integration`)
  - Receives webhooks from Forgejo or GitHub.
  - Normalizes events and publishes job requests to the Orchestrator (direct API or message queue; see the multi-host section).
  - Reports status back to the forge (Checks API for GitHub; Commit Status API for Forgejo).
- Orchestrator (`crates/orchestrator`)
  - Provisions ephemeral VMs via bhyve branded zones on illumos hosts and manages their lifecycle using ZFS clones.
  - Streams logs and results between the VM-resident runner and the Integration Layer.
  - Multi-host aware: multiple Orchestrator instances can run on different illumos hosts and share work (see below).
- Workflow Runner (`crates/workflow-runner`)
  - Minimal agent binary preinstalled in the base VM image.
  - Fetches the job definition from the Orchestrator, executes steps, streams logs, and returns the final status.
- Common (`crates/common`)
  - DRY utilities used by all binaries: tracing/log initialization, KDL job parsing, and future shared abstractions.
- Admin CLI (`crates/ciadm`)
  - Operator utility to trigger jobs, check status, etc., against the Orchestrator.
- Dev CLI (`crates/cidev`)
  - Developer utility to validate KDL files locally, inspect jobs and steps, and debug CI issues without needing the full system.
#### Multi-Host Orchestration
To support multiple hosts, Solstice CI uses a shared queue (e.g., RabbitMQ) between the Integration Layer and Orchestrators:
- The Integration Layer publishes job requests into a durable queue.
- Any healthy Orchestrator node can consume a job, subject to capacity constraints.
- Nodes coordinate through the queue and an internal state store (e.g., Postgres) for job status.
- Each node manages ZFS clones and bhyve zones locally; failure isolation is per-node.
- This model scales linearly by adding illumos hosts with Orchestrator instances.
#### KDL Workflow Definition
Solstice CI adopts a simple, explicit KDL schema for workflows. Example:
```
workflow name="Solstice CI" {
    job id="build" runs_on="illumos-stable" {
        step name="Format" run="cargo fmt --check"
        step name="Clippy" run="cargo clippy -- -D warnings"
        step name="Test" run="cargo test --workspace"
    }
    job id="lint" runs_on="ubuntu-22.04" {
        step name="Lint" run="ruff check ."
    }
}
```
Key points:
- `workflow` is the root node; `name` is optional.
- One or more `job` nodes define independent VMs. Each job can have a `runs_on` hint to select a base image.
- Each `job` contains one or more `step` nodes with a `run` command and optional `name`.
The current parser lives in `crates/common/src/job.rs` and performs strict, typed parsing using the `kdl` crate.
#### Execution Flow (High-Level)
1. A forge sends a webhook to the Integration Layer.
2. Integration validates/authenticates and publishes a job request to the queue (or calls the Orchestrator API in single-node setups).
3. An Orchestrator node accepts the job, creates a ZFS clone of a golden VM image, builds a bhyve zone config, and boots the VM.
4. The Runner starts in the VM, obtains the job definition (including parsed KDL steps), then executes each step, streaming logs back.
5. On completion or failure, the Orchestrator halts the zone and destroys the ZFS clone, then finalizes status via the Integration Layer.
#### Security & Observability Notes
- Secrets should be injected via a secrets backend (e.g., Vault) and masked in logs.
- Tracing/logs are initialized consistently via `crates/common` and can be wired to OTLP later.
- Network isolation defaults to an isolated VNIC and restricted egress.
#### Current Repository Skeleton
- Tracing/log initialization is provided by `common::init_tracing` (console only for now).
- KDL job parsing types: `Workflow`, `Job`, `Step` and helpers in `crates/common/src/job.rs`.
- Binaries provide Clap-based CLIs with environment variable support.
- `cidev` validates and inspects KDL locally; `ciadm` is oriented to operator interactions with the Orchestrator.
#### Next Steps
- Wire the Integration Layer to a real message queue and define the internal job request schema.
- Implement Orchestrator capacity management and host selection.
- Add gRPC service definitions for Orchestrator <-> Runner streaming logs and control.
- Add GitHub App authentication (octocrab) and Forgejo (Gitea) client for status updates.
- Implement secure secrets injection and masking.