Mirror of https://codeberg.org/Toasterson/solstice-ci.git (synced 2026-04-10 13:20:41 +00:00)
Add architecture plans and decision records
Plans:

- 001: vm-manager migration (completed)
- 002: runner-only architecture (active)

Decision records (ADRs):

- 001: Runner-only architecture — retire webhooks + logs service
- 002: Direct QEMU over libvirt
- 003: Ephemeral SSH keys with opt-in debug access
- 004: User-mode (SLIRP) networking for VMs
Parent: f3d3d1465d · Commit: d6c2c3662c
6 changed files with 316 additions and 0 deletions
`docs/ai/decisions/001-runner-only-architecture.md` (new file, +34)
# ADR-001: Runner-Only Architecture

**Date:** 2026-04-09

**Status:** Accepted

**Deciders:** Till Wegmueller

## Context

Solstice CI started as a full CI/CD system with webhook receivers, a custom log storage service, and platform-specific integration layers. This resulted in 7+ services to maintain, a custom log viewer that was worse than GitHub/Forgejo's native UI, and security/multi-tenancy challenges around log access control, webhook secrets, and artifact storage.

The system's unique value is VM orchestration for non-Linux operating systems (illumos, OmniOS, OpenIndiana) — something no other CI runner handles well.

## Decision

Act exclusively as a **native runner** for GitHub and Forgejo. Retire all webhook ingestion, log storage, and custom status reporting. Let the platforms handle everything except job execution.

## Consequences

### Positive

- **7 services reduced to 3**: orchestrator, forgejo-runner, github-runner
- **Security solved by delegation**: log access, webhook secrets, artifacts, user auth all handled by the platform
- **Better UX**: logs appear in GitHub/Forgejo native UI instead of a custom dashboard
- **Standard workflow format**: users write GitHub Actions YAML, not custom KDL
- **Lower maintenance burden**: no custom dashboard, no log retention policy, no artifact storage

### Negative

- **No custom KDL workflows for external users**: KDL remains as an internal superset, but external users must use Actions YAML
- **Feature limitations**: can only execute `run` steps, not `uses` actions (no container support, no marketplace actions)
- **Platform dependency**: tied to GitHub/Forgejo runner protocols
- **GitHub runner protocol complexity**: significantly more complex than Forgejo's connect-rpc (RSA JWT, OAuth tokens, heartbeats)

### Neutral

- Internal projects can still use `.solstice/workflow.kdl` for setup scripts and multi-OS abstractions
- RabbitMQ remains as the internal job buffer between runners and orchestrator
`docs/ai/decisions/002-qemu-over-libvirt.md` (new file, +36)
# ADR-002: Direct QEMU Over Libvirt

**Date:** 2026-04-07

**Status:** Accepted

**Deciders:** Till Wegmueller

## Context

The orchestrator used libvirt (via the Rust `virt` crate) for VM lifecycle management. Libvirt provided domain XML generation, network management (virbr0 + dnsmasq), IP discovery (domifaddr), and graceful shutdown. However, it required the libvirt daemon on the host, socket mounts into containers, and complex host configuration.

The `vm-manager` library manages QEMU processes directly via QMP (QEMU Machine Protocol), eliminating the libvirt middleman.

## Decision

Replace libvirt with direct QEMU process management via vm-manager. Use user-mode (SLIRP) networking with SSH port forwarding instead of libvirt's bridged networking.

## Consequences

### Positive

- **Containerization simplified**: only the `/dev/kvm` device is needed, no daemon sockets
- **712 lines of libvirt code removed** from the orchestrator
- **No libvirt daemon dependency** on the host
- **Simpler networking**: user-mode SLIRP needs no bridge, no NET_ADMIN, no TAP devices
- **Pure-Rust cloud-init ISO**: no genisoimage/mkisofs required (optional `pure-iso` feature)

### Negative

- **No libvirt network management**: must use an existing bridge or user-mode networking
- **VM IP discovery changes**: `ip neigh` parsing instead of `virsh domifaddr`
- **QEMU process management**: must handle PID tracking and graceful shutdown via QMP
- **Cross-workspace dependency**: vm-manager uses workspace-inherited deps, requiring a git dep + local patch override
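The `ip neigh`-based discovery mentioned above can be sketched as a small parser. This is a minimal illustration assuming standard iproute2 output (`<ip> dev <if> lladdr <mac> <state>`); vm-manager's actual implementation is not shown in this document.

```rust
/// Find the IPv4 address for a VM's MAC in `ip neigh` output
/// (bridge-mode IP discovery). Returns None if the MAC is absent.
fn ip_for_mac(neigh_output: &str, mac: &str) -> Option<String> {
    neigh_output.lines().find_map(|line| {
        let fields: Vec<&str> = line.split_whitespace().collect();
        // `lladdr` is immediately followed by the MAC address.
        let pos = fields.iter().position(|f| *f == "lladdr")?;
        if fields.get(pos + 1)?.eq_ignore_ascii_case(mac) {
            fields.first().map(|ip| ip.to_string())
        } else {
            None
        }
    })
}
```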
### Lessons learned

- An IDE CDROM must be used for the cloud-init seed ISO — virtio-blk for both disk and seed confuses Ubuntu's root device detection
- VmHandle must preserve vcpus/memory from the prepare step — vm-manager's `start()` reads them from the handle
- SFTP upload needs an explicit chmod 0755 for executable files
- The console tailer must stop before SSH execution begins, to prevent log duplication
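Taken together, the decision and lessons above imply a QEMU invocation roughly like the following. This is a sketch only: the flag values, memory/CPU defaults, and QMP socket path are illustrative assumptions, not vm-manager's actual argument builder.

```rust
/// Illustrative QEMU argument list for a CI VM: KVM acceleration,
/// virtio root disk, seed ISO on IDE CDROM (per the lesson above),
/// SLIRP networking with SSH forwarded, and a QMP control socket.
fn qemu_args(vm_name: &str, disk: &str, seed_iso: &str, ssh_port: u16) -> Vec<String> {
    vec![
        "-machine".into(), "q35,accel=kvm".into(),
        "-m".into(), "2048".into(),
        "-smp".into(), "2".into(),
        // Root disk must be the first virtio device.
        "-drive".into(), format!("file={disk},if=virtio,format=qcow2"),
        // Seed ISO attaches as an IDE CDROM, not virtio-blk.
        "-cdrom".into(), seed_iso.to_string(),
        // User-mode networking; SSH reachable via the host port.
        "-netdev".into(), format!("user,id=net0,hostfwd=tcp::{ssh_port}-:22"),
        "-device".into(), "virtio-net-pci,netdev=net0".into(),
        // QMP socket for graceful shutdown without libvirt.
        "-qmp".into(), format!("unix:/run/{vm_name}.qmp,server,nowait"),
        "-nographic".into(),
    ]
}
```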
`docs/ai/decisions/003-ephemeral-ssh-keys.md` (new file, +40)
# ADR-003: Ephemeral SSH Keys with Opt-In Debug Access

**Date:** 2026-04-09

**Status:** Accepted

**Deciders:** Till Wegmueller

## Context

The orchestrator generates an Ed25519 SSH keypair per job for authenticating to the provisioned VM. Currently, both public and private keys are persisted to PostgreSQL in plaintext (`job_ssh_keys` table). This creates a security risk — a database breach exposes all SSH keys.

The keys are only needed during the VM's lifetime: from provisioning (cloud-init injects the public key) through SSH execution (the orchestrator authenticates with the private key) to VM destruction.

## Decision

Make SSH keys fully ephemeral:

1. Generate the keypair in memory
2. Inject the public key via cloud-init
3. Use the private key for the SSH connection
4. Forget both keys when the VM is destroyed
5. Never persist them to the database
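For illustration, step 2's cloud-init injection corresponds to a user-data fragment along these lines. The user name and key material are placeholders, not Solstice's actual configuration:

```yaml
#cloud-config
users:
  - name: ci
    ssh_authorized_keys:
      # Per-job ephemeral public key, generated in memory by the orchestrator.
      - ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIPLACEHOLDER job-ephemeral-key
```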
Exception: **opt-in debug SSH for failed builds**. When a job fails and the user has opted in (e.g., via a workflow annotation or label), keep the VM alive for a TTL (30 minutes) and expose the SSH connection info in the build log so the user can debug inside the target OS.

## Consequences

### Positive

- **Zero persistent key storage**: no database-breach risk for SSH keys
- **Simpler persistence layer**: the `job_ssh_keys` table can be removed
- **Debug SSH feature**: valuable for OS-specific debugging (illumos quirks, package issues) — similar to CircleCI's "rerun with SSH" and Buildkite's debug feature

### Negative

- **Cannot retroactively access a destroyed VM**: if the key is gone, there's no way back in
- **Debug SSH adds complexity**: TTL management, rate limiting, a VM lifecycle exception path
- **Debug SSH is a security surface**: must ensure the TTL is enforced and connection info only goes to the authenticated user via the platform's log channel

### Design for debug SSH

- Rate limit: max 1 debug session per project concurrently
- TTL: 30 minutes, non-renewable, force-destroy on expiry
- Connection info: printed as a build log step (the platform controls access)
- Opt-in: explicit flag required (never the default)
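The TTL and rate-limit rules above could be enforced by a small in-memory policy object. A sketch with hypothetical names; the orchestrator's real state machine is not shown in this document:

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Tracks active debug-SSH sessions: at most one concurrent session
/// per project, with a non-renewable TTL.
struct DebugSessions {
    ttl: Duration,
    active: HashMap<String, Instant>, // project -> session start
}

impl DebugSessions {
    fn new(ttl: Duration) -> Self {
        Self { ttl, active: HashMap::new() }
    }

    /// Returns false if the project already has a live session.
    fn try_open(&mut self, project: &str, now: Instant) -> bool {
        self.expire(now);
        if self.active.contains_key(project) {
            return false;
        }
        self.active.insert(project.to_string(), now);
        true
    }

    /// Drop expired sessions; the caller would force-destroy those VMs.
    fn expire(&mut self, now: Instant) {
        let ttl = self.ttl;
        self.active.retain(|_, started| now.duration_since(*started) < ttl);
    }
}
```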
`docs/ai/decisions/004-user-mode-networking.md` (new file, +35)
# ADR-004: User-Mode (SLIRP) Networking for VMs

**Date:** 2026-04-07

**Status:** Accepted

**Deciders:** Till Wegmueller

## Context

The orchestrator needs network access to VMs for SSH (uploading the runner binary, executing commands). Two options:

1. **TAP with bridge** — the VM gets a real IP on a bridge network (e.g., virbr0). Requires the NET_ADMIN capability, host bridge access, and TAP device creation. IP discovery via ARP/DHCP lease parsing.

2. **User-mode (SLIRP)** — QEMU provides NAT via user-space networking. The VM gets a private IP (10.0.2.x). SSH access via host port forwarding (`hostfwd=tcp::{port}-:22`). No special capabilities needed.

## Decision

Use user-mode (SLIRP) networking with deterministic SSH port forwarding.

Port assignment: `10022 + (hash(vm_name) % 100)` — range 10022-10121.

The guest IP is always `127.0.0.1` from the orchestrator's perspective.
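The port scheme can be sketched directly from the formula above. The ADR fixes the formula but not the hash function; `DefaultHasher` here is an assumption:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Deterministic SSH host port for a VM: 10022 + (hash(vm_name) % 100).
/// Same name always maps to the same port within a hash-function version.
fn ssh_port(vm_name: &str) -> u16 {
    let mut h = DefaultHasher::new();
    vm_name.hash(&mut h);
    10022 + (h.finish() % 100) as u16
}
```

With only 100 slots, two concurrent VMs can collide; that is exactly the "port collision risk" listed under Negative, mitigated by UUID-based VM names spreading well across the range.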
## Consequences

### Positive

- **Container-friendly**: no NET_ADMIN, no bridge access, no host configuration
- **Trivial IP discovery**: always `127.0.0.1` with a known port
- **No host bridge dependency**: works on any host with just `/dev/kvm`
- **Network isolation**: VMs cannot reach each other or the host network directly

### Negative

- **Port collision risk**: with 100 ports and concurrent VMs, hash collisions are possible (mitigated by UUID-based VM names having good hash distribution)
- **No inbound connections**: external services cannot reach the VM directly (not needed for CI)
- **SLIRP performance**: slightly slower than TAP for network-heavy workloads (acceptable for CI)
- **No VM-to-VM communication**: VMs are fully isolated (acceptable for CI)
`docs/ai/plans/001-vm-manager-migration.md` (new file, +48)
# Plan: Migrate Orchestrator to vm-manager + Containerize

**Status:** Completed (2026-04-07)

**Planner ID:** `5fc6f5f5-33c1-4e3d-9201-c4c9c4fc43df`

## Summary

Replace the orchestrator's built-in libvirt hypervisor code with the `vm-manager` library crate, then containerize the orchestrator. This eliminates the libvirt dependency and makes deployment straightforward (only `/dev/kvm` needed).

## Motivation

The orchestrator used libvirt (via the `virt` crate), requiring:

- the libvirt daemon on the host
- libvirt sockets mounted into containers
- KVM device access
- host-level libvirt configuration and networking

This made containerization painful — the orchestrator ran as a systemd service on the host.

## Approach

1. Extended vm-manager with console log tailing (`console` module)
2. Chose user-mode (SLIRP) networking over TAP for container simplicity
3. Created `vm_adapter.rs`, bridging the orchestrator's Hypervisor trait to vm-manager
4. Replaced the scheduler's SSH/IP-discovery/console code with vm-manager APIs
5. Replaced image download with vm-manager's `ImageManager`
6. Removed 712 lines of libvirt-specific code
7. Updated the Containerfile: libvirt packages replaced with QEMU + qemu-utils

## Tasks completed

| # | Task | Summary |
|---|------|---------|
| 1 | Add serial console tailing to vm-manager | `ConsoleTailer` for async Unix socket streaming |
| 2 | Verify networking | User-mode SLIRP chosen — no bridge needed |
| 3 | Add vm-manager adapter layer | `vm_adapter.rs` with VmSpec/VmHandle conversion |
| 4 | Update scheduler SSH + console | vm-manager SSH/connect_with_retry/upload/exec |
| 5 | Update image config | vm-manager `ImageManager::download()` |
| 6 | Remove libvirt dependencies | -712 lines; removed virt/ssh2/zstd crates |
| 7 | Update Containerfile | Ubuntu 24.04 runtime, QEMU direct, no libvirt |
| 8 | Integration test | End-to-end job via containerized orchestrator |

## Key decisions

- **QEMU direct over libvirt**: vm-manager spawns QEMU processes directly and manages them via the QMP socket. Simpler, no daemon dependency.
- **User-mode networking**: SSH via port forwarding (`hostfwd=tcp::{port}-:22`). No bridge, no NET_ADMIN, no TAP device creation.
- **IDE CDROM for seed ISO**: Ubuntu cloud images expect the root disk as the first virtio device. The seed ISO uses an IDE CDROM to avoid device-ordering conflicts.
- **Pre-built binary Containerfile**: vm-manager uses workspace-inherited deps, making cross-workspace path deps difficult. A git dep is used for CI, a local patch for dev.
`docs/ai/plans/002-runner-only-architecture.md` (new file, +123)
# Plan: Runner-Only Architecture

**Status:** Active

**Created:** 2026-04-09

**Planner ID:** `5ea54391-2b17-4790-9f6a-27afcc410fa6`

## Summary

Simplify Solstice CI from 7+ services to 3 by acting exclusively as a native runner for GitHub and Forgejo. All logs, artifacts, and status flow through the platform's native UI. Our unique value is VM orchestration for non-Linux OSes (illumos, OmniOS, OpenIndiana).

## Motivation

The current architecture has Solstice CI reimplementing functionality that GitHub and Forgejo already provide:

- **Webhook ingestion** — both platforms have runner protocols that push jobs to runners
- **Log storage and viewing** — both platforms display logs in their own UI
- **Artifact storage** — both platforms have artifact APIs
- **Status reporting** — both platforms show build status natively

We are building and maintaining 4 extra services (forge-integration, github-integration, logs-service, custom dashboards) that provide a worse user experience than the native platform UI.

## Architecture Change

### Before (7+ services)

```
Forgejo webhooks --> forge-integration  --> RabbitMQ --> orchestrator --> VMs
GitHub webhooks  --> github-integration --> RabbitMQ /
                     logs-service <-- orchestrator
                     runner-integration --> Forgejo
```

### After (3 services)

```
Forgejo <--> forgejo-runner <--> RabbitMQ <--> orchestrator <--> VMs
GitHub  <--> github-runner  <--> RabbitMQ /
```

### Services retained

| Service | Role |
|---------|------|
| **forgejo-runner** (runner-integration) | Sole Forgejo interface via connect-rpc |
| **github-runner** (NEW) | Sole GitHub interface via the Actions runner protocol |
| **orchestrator** | VM provisioning via vm-manager/QEMU |

### Services retired

| Service | Replacement |
|---------|-------------|
| forge-integration | forgejo-runner (runner protocol replaces webhooks) |
| github-integration | github-runner (runner protocol replaces GitHub App) |
| logs-service | Platform UI (logs sent via runner protocol) |

### Infrastructure retained

- **RabbitMQ** — job buffer between runners and orchestrator
- **PostgreSQL** — job state persistence in the orchestrator
- **vm-manager** — QEMU VM lifecycle management

## Tasks

| # | Task | Priority | Effort | Depends on | Status |
|---|------|----------|--------|------------|--------|
| 1 | Evolve workflow-runner to execute Actions YAML run steps | 100 | M | — | pending |
| 2 | Orchestrator: accept step commands via JobRequest | 95 | M | — | pending |
| 3 | Clean up Forgejo runner as sole interface | 90 | L | 1 | pending |
| 4 | Implement GitHub Actions runner integration | 80 | XL | 1 | pending |
| 5 | Security: ephemeral SSH keys + opt-in debug SSH | 60 | M | 7 | pending |
| 6 | Documentation: image catalog + illumos guides | 50 | L | — | pending |
| 7 | Retire forge-integration, github-integration, logs-service | 40 | M | 3, 4 | pending |

### Task details

#### 1. Evolve workflow-runner to execute Actions YAML run steps

The workflow-runner currently parses `.solstice/workflow.kdl`. It also needs to execute standard GitHub Actions YAML `run` steps passed via `job.yaml`. The runner integrations translate Actions YAML into step commands before publishing to MQ. KDL support is kept as a superset for users who want setup scripts and multi-OS abstractions.

#### 2. Orchestrator: accept step commands via JobRequest

Add `steps: Option<Vec<StepCommand>>` to `JobRequest` (common/src/messages.rs). Each `StepCommand` has `name`, `run`, and an optional `env`. The orchestrator writes these to `job.yaml` so the workflow-runner can execute them directly. If `steps` is `None`, the workflow-runner falls back to `.solstice/workflow.kdl`.
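A sketch of the types this describes. The field names come from the plan; the derives, the `env` representation, and the other `JobRequest` fields are assumptions (the real definitions live in common/src/messages.rs):

```rust
use std::collections::HashMap;

/// One Actions-style `run` step, pre-translated by the runner integration.
#[derive(Debug, Clone, PartialEq)]
pub struct StepCommand {
    pub name: String,
    pub run: String,
    pub env: Option<HashMap<String, String>>,
}

/// JobRequest extension: `None` means "fall back to .solstice/workflow.kdl".
#[derive(Debug, Clone)]
pub struct JobRequest {
    // ... existing fields elided ...
    pub steps: Option<Vec<StepCommand>>,
}
```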
#### 3. Clean up Forgejo runner as sole interface

Remove the tier-1 KDL workflow fetch from `translator.rs`. Actions YAML `run` steps become the primary translation path. Handle matrix builds by expanding them into separate `JobRequest`s. Report unsupported `uses:` steps with clear errors. Remove the dependency on `FORGEJO_BASE_URL`/`FORGEJO_TOKEN` for fetching workflow files.

#### 4. Implement GitHub Actions runner integration

New crate implementing the GitHub Actions self-hosted runner protocol (REST + JSON). Significantly more complex than Forgejo's connect-rpc: RSA JWT authentication, OAuth bearer tokens, 50-second long-poll, per-job heartbeat every 60s, encrypted job delivery. Same internal pattern as runner-integration (poller + reporter + state).

#### 5. Security: ephemeral SSH keys + opt-in debug SSH

Stop persisting SSH keys to the database. Generate them in memory, inject via cloud-init, and forget them after the VM is destroyed. For failed builds with the opt-in debug flag: keep the VM alive for 30 minutes, expose SSH connection info in the build log, and rate-limit to 1 debug session per project.

#### 6. Documentation

User-facing docs for FLOSS projects: getting-started guide, image catalog (`runs-on` labels), illumos/OmniOS-specific guide (pkg, tar, CA certs), FAQ (supported features, limitations).

#### 7. Retire old services

Remove forge-integration, github-integration, and logs-service from compose.yml. Clean up environment variables, Traefik routes, and database tables. Keep the source code for reference but mark it deprecated.

## Workflow format after migration

Users write **standard GitHub Actions YAML**. No custom format needed:

```yaml
name: CI
on: [push, pull_request]
jobs:
  build:
    runs-on: omnios-bloody  # Solstice CI label
    steps:
      - run: pkg install developer/gcc13
      - run: cargo build --release
      - run: cargo test
```

Our documentation only needs to cover:

- Available `runs-on` labels (our OS images)
- What's pre-installed in each image
- OS-specific tips (illumos package managers, tar variants, etc.)

## Security model after migration

Most security concerns are **solved by delegation to the platform**:

| Concern | Solution |
|---------|----------|
| Log access control | Platform handles it (GitHub/Forgejo UI) |
| Webhook secrets | Platform handles per-repo secrets |
| Artifact storage | Platform handles it |
| User authentication | Platform handles it |
| SSH key storage | Ephemeral — destroyed with VM |
| Compute abuse | Per-runner concurrency limits + platform rate limiting |