Plans:
- 001: vm-manager migration (completed)
- 002: runner-only architecture (active)

Decision records (ADRs):
- 001: Runner-only architecture (retire webhooks + logs service)
- 002: Direct QEMU over libvirt
- 003: Ephemeral SSH keys with opt-in debug access
- 004: User-mode (SLIRP) networking for VMs
Plan: Runner-Only Architecture
Status: Active
Created: 2026-04-09
Planner ID: 5ea54391-2b17-4790-9f6a-27afcc410fa6
Summary
Simplify Solstice CI from 7+ services to 3 by acting exclusively as a native runner for GitHub and Forgejo. All logs, artifacts, and status flow through the platform's native UI. Our unique value is VM orchestration for non-Linux OSes (illumos, OmniOS, OpenIndiana).
Motivation
The current architecture has Solstice CI reimplementing functionality that GitHub and Forgejo already provide:
- Webhook ingestion — both platforms have runner protocols that push jobs to runners
- Log storage and viewing — both platforms display logs in their own UI
- Artifact storage — both platforms have artifact APIs
- Status reporting — both platforms show build status natively
We are building and maintaining 4 extra services (forge-integration, github-integration, logs-service, custom dashboards) that provide a worse user experience than the native platform UI.
Architecture Change
Before (7+ services)
Forgejo webhooks --> forge-integration --> RabbitMQ --> orchestrator --> VMs
GitHub webhooks --> github-integration --> RabbitMQ /
logs-service <-- orchestrator
runner-integration --> Forgejo
After (3 services)
Forgejo <--> forgejo-runner <--> RabbitMQ <--> orchestrator <--> VMs
GitHub <--> github-runner <--> RabbitMQ /
Services retained
| Service | Role |
|---|---|
| forgejo-runner (runner-integration) | Sole Forgejo interface via connect-rpc |
| github-runner (NEW) | Sole GitHub interface via Actions runner protocol |
| orchestrator | VM provisioning via vm-manager/QEMU |
Services retired
| Service | Replacement |
|---|---|
| forge-integration | forgejo-runner (runner protocol replaces webhooks) |
| github-integration | github-runner (runner protocol replaces GitHub App) |
| logs-service | Platform UI (logs sent via runner protocol) |
Infrastructure retained
- RabbitMQ — job buffer between runners and orchestrator
- PostgreSQL — job state persistence in orchestrator
- vm-manager — QEMU VM lifecycle management
Tasks
| # | Task | Priority | Effort | Depends on | Status |
|---|---|---|---|---|---|
| 1 | Evolve workflow-runner to execute Actions YAML run steps | 100 | M | — | pending |
| 2 | Orchestrator: accept step commands via JobRequest | 95 | M | — | pending |
| 3 | Clean up Forgejo runner as sole interface | 90 | L | 1 | pending |
| 4 | Implement GitHub Actions runner integration | 80 | XL | 1 | pending |
| 5 | Security: ephemeral SSH keys + opt-in debug SSH | 60 | M | 7 | pending |
| 6 | Documentation: image catalog + illumos guides | 50 | L | — | pending |
| 7 | Retire forge-integration, github-integration, logs-service | 40 | M | 3, 4 | pending |
Task details
1. Evolve workflow-runner to execute Actions YAML run steps
The workflow-runner currently parses .solstice/workflow.kdl. It also needs to execute standard GitHub Actions YAML run steps passed via job.yaml. The runner integrations translate Actions YAML into step commands before publishing to MQ. KDL support is kept as a superset for users who want setup scripts and multi-OS abstractions.
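The translation step described above could look roughly like the following. This is a minimal sketch, not the actual translator: `ParsedStep` and `translate_steps` are hypothetical names, the YAML parsing itself is assumed to have happened already, and the default step name (`step N`) is an illustrative choice. It also rejects `uses:` steps with an explicit error, as task 3 asks.

```rust
use std::collections::BTreeMap;

/// A step as it might look after parsing Actions YAML (hypothetical shape).
#[derive(Debug, Clone, PartialEq)]
enum ParsedStep {
    Run {
        name: Option<String>,
        run: String,
        env: BTreeMap<String, String>,
    },
    Uses {
        action: String,
    },
}

/// The step command published to MQ: name, run, and optional env.
#[derive(Debug, Clone, PartialEq)]
struct StepCommand {
    name: String,
    run: String,
    env: Option<BTreeMap<String, String>>,
}

/// Translate parsed steps into step commands.
/// Stops at the first `uses:` step and returns an error describing it.
fn translate_steps(steps: &[ParsedStep]) -> Result<Vec<StepCommand>, String> {
    steps
        .iter()
        .enumerate()
        .map(|(i, step)| match step {
            ParsedStep::Run { name, run, env } => Ok(StepCommand {
                // Unnamed steps get a positional name (assumption).
                name: name.clone().unwrap_or_else(|| format!("step {}", i + 1)),
                run: run.clone(),
                env: if env.is_empty() { None } else { Some(env.clone()) },
            }),
            ParsedStep::Uses { action } => Err(format!(
                "step {}: `uses: {}` is not supported; only `run` steps are executed",
                i + 1,
                action
            )),
        })
        .collect()
}
```

Collecting into a `Result` means a single unsupported step fails the whole translation, so a job is never published with silently dropped steps.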
2. Orchestrator: accept step commands via JobRequest
Add steps: Option<Vec<StepCommand>> to JobRequest (common/src/messages.rs). Each StepCommand has name, run, and optional env. The orchestrator writes these to job.yaml so the workflow-runner can execute them directly. If steps is None, workflow-runner falls back to .solstice/workflow.kdl.
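A minimal sketch of the message shape and the fallback rule, using only the standard library. The field names `name`, `run`, `env`, and `steps` come from the description above; everything else (`StepSource`, `step_source`, the omission of the other `JobRequest` fields and of any serialization derives) is an assumption made for illustration.

```rust
use std::collections::BTreeMap;

/// One shell step to execute inside the VM.
#[derive(Debug, Clone, PartialEq)]
struct StepCommand {
    name: String,
    run: String,
    env: Option<BTreeMap<String, String>>,
}

/// Only the new field is shown; the existing JobRequest fields are omitted here.
#[derive(Debug, Clone, PartialEq)]
struct JobRequest {
    steps: Option<Vec<StepCommand>>,
}

/// Where the workflow-runner should take its steps from (hypothetical helper type).
#[derive(Debug, PartialEq)]
enum StepSource<'a> {
    /// Steps were supplied in the request and written to job.yaml.
    Inline(&'a [StepCommand]),
    /// No steps supplied: fall back to the KDL workflow at this path.
    KdlWorkflow(&'static str),
}

fn step_source(req: &JobRequest) -> StepSource<'_> {
    match req.steps.as_deref() {
        Some(steps) => StepSource::Inline(steps),
        None => StepSource::KdlWorkflow(".solstice/workflow.kdl"),
    }
}
```

Note that in this sketch `Some(vec![])` counts as an explicit (empty) step list and does not trigger the KDL fallback; only `None` does, matching the wording above.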
3. Clean up Forgejo runner as sole interface
Remove the tier-1 KDL workflow fetch from translator.rs. Actions YAML run steps become the primary translation path. Handle matrix builds by expanding each matrix into separate JobRequests. Reject unsupported `uses:` steps with clear error messages. Remove the dependency on FORGEJO_BASE_URL/FORGEJO_TOKEN for fetching workflow files.
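Matrix expansion is a Cartesian product over the matrix axes, with one JobRequest per resulting combination. The sketch below shows only that product; `expand_matrix` and `MatrixCombo` are hypothetical names, and `include`/`exclude` handling from the Actions matrix syntax is deliberately left out.

```rust
use std::collections::BTreeMap;

/// One concrete combination of matrix values, keyed by axis name.
type MatrixCombo = BTreeMap<String, String>;

/// Expand a matrix (axis name -> candidate values) into every combination.
/// An empty matrix yields exactly one empty combination, i.e. a single job.
/// An axis with no values yields no combinations at all.
fn expand_matrix(matrix: &BTreeMap<String, Vec<String>>) -> Vec<MatrixCombo> {
    // Start with one empty combination, then multiply by each axis in turn.
    let mut combos: Vec<MatrixCombo> = vec![MatrixCombo::new()];
    for (axis, values) in matrix {
        let mut next = Vec::with_capacity(combos.len() * values.len());
        for combo in &combos {
            for value in values {
                let mut extended = combo.clone();
                extended.insert(axis.clone(), value.clone());
                next.push(extended);
            }
        }
        combos = next;
    }
    combos
}
```

Using `BTreeMap` keeps the axis order deterministic, so the same workflow always expands into jobs in the same order, which makes job names and logs easier to compare between runs.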
4. Implement GitHub Actions runner integration
New crate implementing the GitHub Actions self-hosted runner protocol (REST + JSON). Significantly more complex than Forgejo's connect-rpc: RSA JWT authentication, OAuth bearer tokens, 50-second long-poll, per-job heartbeat every 60s, encrypted job delivery. Same internal pattern as runner-integration (poller + reporter + state).
5. Security: ephemeral SSH keys + opt-in debug SSH
Stop persisting SSH keys to the database. Generate in-memory, inject via cloud-init, forget after VM destroy. For failed builds with opt-in debug flag: keep VM alive for 30 minutes, expose SSH connection info in build log, rate-limit to 1 debug session per project.
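The debug-session policy (30-minute keep-alive, one session per project) can be sketched as a small in-memory tracker. This assumes "1 debug session per project" means one concurrent session; `DebugSessions`, `try_start`, and `reap` are hypothetical names, and actual VM teardown and SSH key handling are out of scope here.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// How long a failed build's VM is kept alive for debugging.
const DEBUG_TTL: Duration = Duration::from_secs(30 * 60);

/// Tracks live debug sessions in memory only (nothing is persisted).
#[derive(Default)]
struct DebugSessions {
    /// project -> instant at which its debug VM must be destroyed
    active: HashMap<String, Instant>,
}

impl DebugSessions {
    /// Try to start a debug session for `project`.
    /// Returns the expiry time, or None if the project already has a live session.
    fn try_start(&mut self, project: &str, now: Instant) -> Option<Instant> {
        match self.active.get(project) {
            Some(expiry) if *expiry > now => None,
            _ => {
                let expiry = now + DEBUG_TTL;
                self.active.insert(project.to_string(), expiry);
                Some(expiry)
            }
        }
    }

    /// Remove expired sessions and return the projects whose VMs should be destroyed.
    fn reap(&mut self, now: Instant) -> Vec<String> {
        let expired: Vec<String> = self
            .active
            .iter()
            .filter(|(_, expiry)| **expiry <= now)
            .map(|(project, _)| project.clone())
            .collect();
        for project in &expired {
            self.active.remove(project);
        }
        expired
    }
}
```

Passing `now` in explicitly keeps the policy testable without sleeping; the orchestrator would call `reap` periodically and destroy the VM (and with it the ephemeral key) for each returned project.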
6. Documentation
User-facing docs for FLOSS projects: getting started guide, image catalog (runs-on labels), illumos/OmniOS-specific guide (pkg, tar, CA certs), FAQ (supported features, limitations).
7. Retire old services
Remove forge-integration, github-integration, and logs-service from compose.yml. Clean up environment variables, Traefik routes, and database tables. Keep source code for reference but mark deprecated.
Workflow format after migration
Users write standard GitHub Actions YAML. No custom format needed:
    name: CI
    on: [push, pull_request]
    jobs:
      build:
        runs-on: omnios-bloody  # Solstice CI label
        steps:
          - run: pkg install developer/gcc13
          - run: cargo build --release
          - run: cargo test
Our documentation only needs to cover:
- Available `runs-on` labels (our OS images)
- What's pre-installed in each image
- OS-specific tips (illumos package managers, tar variants, etc.)
Security model after migration
Most security concerns are solved by delegation to the platform:
| Concern | Solution |
|---|---|
| Log access control | Platform handles it (GitHub/Forgejo UI) |
| Webhook secrets | Platform handles per-repo secrets |
| Artifact storage | Platform handles it |
| User authentication | Platform handles it |
| SSH key storage | Ephemeral — destroyed with VM |
| Compute abuse | Per-runner concurrency limits + platform rate limiting |
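As an illustration of the per-runner concurrency limit in the last row, a runner could gate job acceptance on a simple slot counter. This is a sketch under assumptions: `RunnerSlots` is a hypothetical name, and how the limit is configured or shared across threads is not specified in this plan.

```rust
/// Tracks how many jobs a single runner is currently executing.
#[derive(Debug)]
struct RunnerSlots {
    max_concurrent: usize,
    running: usize,
}

impl RunnerSlots {
    fn new(max_concurrent: usize) -> Self {
        RunnerSlots { max_concurrent, running: 0 }
    }

    /// Claim a slot for a new job; returns false when the runner is at its limit.
    fn try_acquire(&mut self) -> bool {
        if self.running < self.max_concurrent {
            self.running += 1;
            true
        } else {
            false
        }
    }

    /// Free a slot when a job finishes; never underflows below zero.
    fn release(&mut self) {
        self.running = self.running.saturating_sub(1);
    }
}
```

A runner that fails `try_acquire` would simply not poll for more work until a job finishes, leaving queued jobs with the platform, which applies its own rate limiting.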