### Solstice CI — Architecture Overview (KDL Jobs + Multi‑Host Orchestrator)

This document updates the earlier blueprint to reflect the current direction of Solstice CI:

- The project name is Solstice CI (not Helios CI).
- Workflows are defined in KDL instead of YAML.
- The Orchestrator is designed to run on multiple hosts behind a shared queue for horizontal scale.
- A small set of crates provides clean separation of concerns: `orchestrator`, `forge-integration`, `github-integration`, `workflow-runner`, `common`, `ciadm`, and `cidev`.

#### Core Components

- Forge Integration Layer (`crates/forge-integration` and `crates/github-integration`)
  - Receives webhooks from Forgejo or GitHub.
  - Normalizes events and publishes job requests to the Orchestrator (direct API or message queue; see the multi‑host section).
  - Reports status back to the forge (Checks API for GitHub; Commit Status API for Forgejo).
- Orchestrator (`crates/orchestrator`)
  - Provisions ephemeral VMs via bhyve branded zones on illumos hosts and manages their lifecycle using ZFS clones.
  - Streams logs and results between the VM‑resident runner and the Integration Layer.
  - Multi‑host aware: multiple Orchestrator instances can run on different illumos hosts and share work (see below).
- Workflow Runner (`crates/workflow-runner`)
  - Minimal agent binary pre‑installed in the base VM image.
  - Fetches the job definition from the Orchestrator, executes steps, streams logs, and returns final status.
- Common (`crates/common`)
  - DRY utilities used by all binaries: tracing/log initialization, KDL job parsing, and future shared abstractions.
- Admin CLI (`crates/ciadm`)
  - Operator utility to trigger jobs, check status, etc., against the Orchestrator.
- Dev CLI (`crates/cidev`)
  - Developer utility to validate KDL files locally, inspect jobs and steps, and debug CI issues without needing the full system.
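The job model that `crates/common` shares across these binaries can be sketched roughly as follows. Field and type names here mirror the `Workflow`/`Job`/`Step` types mentioned later, but the exact definitions are an assumption; the real ones live in `crates/common/src/job.rs`:

```rust
/// Illustrative sketch of the shared job model; the actual types are
/// defined in `crates/common/src/job.rs`.
#[derive(Debug, Clone, PartialEq)]
pub struct Workflow {
    /// Optional workflow name (`workflow name="..."` in KDL).
    pub name: Option<String>,
    /// Each job runs in its own independent VM.
    pub jobs: Vec<Job>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Job {
    pub id: String,
    /// Base-image hint, e.g. "illumos-stable".
    pub runs_on: Option<String>,
    pub steps: Vec<Step>,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Step {
    pub name: Option<String>,
    /// Shell command executed by the runner inside the VM.
    pub run: String,
}

fn main() {
    // Hand-built equivalent of a one-step "build" job.
    let wf = Workflow {
        name: Some("Solstice CI".into()),
        jobs: vec![Job {
            id: "build".into(),
            runs_on: Some("illumos-stable".into()),
            steps: vec![Step {
                name: Some("Test".into()),
                run: "cargo test --workspace".into(),
            }],
        }],
    };
    assert_eq!(wf.jobs[0].steps[0].run, "cargo test --workspace");
    println!("{} job(s)", wf.jobs.len());
}
```

Keeping this model in `common` is what lets `cidev` validate KDL locally and the Runner execute it with the same semantics.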
#### Multi‑Host Orchestration

To support multiple hosts, Solstice CI uses a shared queue (e.g., RabbitMQ) between the Integration Layer and Orchestrators:

- The Integration Layer publishes job requests into a durable queue.
- Any healthy Orchestrator node can consume a job, subject to capacity constraints.
- Nodes coordinate through the queue and an internal state store (e.g., Postgres) for job status.
- Each node manages ZFS clones and bhyve zones locally; failure isolation is per‑node.
- This model scales horizontally by adding illumos hosts with Orchestrator instances.

#### KDL Workflow Definition

Solstice CI adopts a simple, explicit KDL schema for workflows. Example:

```
workflow name="Solstice CI" {
    job id="build" runs_on="illumos-stable" {
        step name="Format" run="cargo fmt --check"
        step name="Clippy" run="cargo clippy -- -D warnings"
        step name="Test" run="cargo test --workspace"
    }
    job id="lint" runs_on="ubuntu-22.04" {
        step name="Lint" run="ruff check ."
    }
}
```

Key points:

- `workflow` is the root node; `name` is optional.
- One or more `job` nodes define independent VMs. Each job can carry a `runs_on` hint to select a base image.
- Each `job` contains one or more `step` nodes with a `run` command and an optional `name`.

The current parser lives in `crates/common/src/job.rs` and performs strict, typed parsing using the `kdl` crate.

#### Execution Flow (High‑Level)

1. A forge sends a webhook to the Integration Layer.
2. The Integration Layer validates/authenticates the event and publishes a job request to the queue (or calls the Orchestrator API in single‑node setups).
3. An Orchestrator node accepts the job, creates a ZFS clone of a golden VM image, builds a bhyve zone config, and boots the VM.
4. The Runner starts in the VM, obtains the job definition (including parsed KDL steps), then executes each step, streaming logs back.
5. On completion or failure, the Orchestrator halts the zone and destroys the ZFS clone, then finalizes status via the Integration Layer.
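From an Orchestrator node's point of view, steps 3–5 above form a small lifecycle state machine. The sketch below is illustrative only (the state names are not the actual `orchestrator` API); its main point is that every non-terminal failure still passes through cleanup, so the zone is halted and the ZFS clone destroyed on both the success and failure paths:

```rust
/// Illustrative job lifecycle as seen by one Orchestrator node.
/// State names are a sketch, not the real `crates/orchestrator` types.
#[derive(Debug, Clone, Copy, PartialEq)]
enum JobState {
    Queued,                      // published by the Integration Layer
    Provisioning,                // ZFS clone + bhyve zone boot
    Running,                     // Runner executing steps, streaming logs
    CleaningUp { failed: bool }, // halt zone, destroy ZFS clone
    Succeeded,
    Failed,
}

/// Advance the job one stage; `ok` reports whether that stage succeeded.
fn advance(state: JobState, ok: bool) -> JobState {
    use JobState::*;
    match state {
        Queued if ok => Provisioning,
        Provisioning if ok => Running,
        Running if ok => CleaningUp { failed: false },
        // A failure at any active stage still goes through cleanup.
        Queued | Provisioning | Running => CleaningUp { failed: true },
        CleaningUp { failed: false } => Succeeded,
        CleaningUp { failed: true } => Failed,
        terminal => terminal, // Succeeded / Failed are sticky
    }
}

fn main() {
    // Happy path: queue -> provision -> run -> clean up -> success.
    let mut s = JobState::Queued;
    for _ in 0..4 {
        s = advance(s, true);
    }
    assert_eq!(s, JobState::Succeeded);

    // A failed step is cleaned up before reaching the terminal state.
    let mut s = advance(JobState::Running, false);
    assert_eq!(s, JobState::CleaningUp { failed: true });
    s = advance(s, true);
    assert_eq!(s, JobState::Failed);
}
```

Persisting these transitions in the shared state store is what lets any node answer status queries, regardless of which node owns the VM.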
#### Security & Observability Notes

- Secrets should be injected via a secrets backend (e.g., Vault) and masked in logs.
- Tracing/logs are initialized consistently via `crates/common` and can be wired to OTLP later.
- Network isolation defaults to an isolated VNIC and restricted egress.

#### Current Repository Skeleton

- Tracing/log initialization is provided by `common::init_tracing` (console only for now).
- KDL job parsing types: `Workflow`, `Job`, and `Step`, plus helpers, in `crates/common/src/job.rs`.
- Binaries provide Clap‑based CLIs with environment‑variable support.
- `cidev` validates and inspects KDL locally; `ciadm` is oriented toward operator interactions with the Orchestrator.

#### Next Steps

- Wire the Integration Layer to a real message queue and define the internal job request schema.
- Implement Orchestrator capacity management and host selection.
- Add gRPC service definitions for Orchestrator <-> Runner log streaming and control.
- Add GitHub App authentication (octocrab) and a Forgejo (Gitea) client for status updates.
- Implement secure secrets injection and masking.
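The log-masking idea above can be as simple as substituting every known secret value before a line is emitted. This is a minimal std-only sketch (the function name is hypothetical; a real implementation would hook into the tracing pipeline initialized by `crates/common` and handle partial matches across chunk boundaries):

```rust
/// Replace every occurrence of each known secret value in a log line
/// with a fixed placeholder before the line is emitted. Minimal sketch;
/// the real masking would live in the shared tracing layer.
fn mask_secrets(line: &str, secrets: &[&str]) -> String {
    let mut out = line.to_string();
    for secret in secrets {
        if !secret.is_empty() {
            out = out.replace(secret, "***");
        }
    }
    out
}

fn main() {
    let secrets = ["s3cr3t-token"];
    let raw = "curl -H 'Authorization: Bearer s3cr3t-token' https://forge";
    let masked = mask_secrets(raw, &secrets);
    assert!(!masked.contains("s3cr3t-token"));
    println!("{masked}"); // prints the line with the token replaced by ***
}
```

Because the Runner streams step output line by line, masking at this boundary keeps secrets out of both the Orchestrator's logs and the status reported back to the forge.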