solstice-ci/docs/ai/plans/002-runner-only-architecture.md
Till Wegmueller d6c2c3662c Add architecture plans and decision records
Plans:
- 001: vm-manager migration (completed)
- 002: runner-only architecture (active)

Decision records (ADRs):
- 001: Runner-only architecture — retire webhooks + logs service
- 002: Direct QEMU over libvirt
- 003: Ephemeral SSH keys with opt-in debug access
- 004: User-mode (SLIRP) networking for VMs
2026-04-09 22:03:12 +02:00


Plan: Runner-Only Architecture

Status: Active
Created: 2026-04-09
Planner ID: 5ea54391-2b17-4790-9f6a-27afcc410fa6

Summary

Simplify Solstice CI from 7+ services to 3 by acting exclusively as a native runner for GitHub and Forgejo. All logs, artifacts, and status flow through the platform's native UI. Our unique value is VM orchestration for non-Linux OSes (illumos, OmniOS, OpenIndiana).

Motivation

The current architecture has Solstice CI reimplementing functionality that GitHub and Forgejo already provide:

  • Webhook ingestion — both platforms have runner protocols that push jobs to runners
  • Log storage and viewing — both platforms display logs in their own UI
  • Artifact storage — both platforms have artifact APIs
  • Status reporting — both platforms show build status natively

We are building and maintaining 4 extra services (forge-integration, github-integration, logs-service, custom dashboards) that provide a worse user experience than the native platform UI.

Architecture Change

Before (7+ services)

Forgejo webhooks --> forge-integration --> RabbitMQ --> orchestrator --> VMs
GitHub  webhooks --> github-integration --> RabbitMQ /
                                           logs-service <-- orchestrator
                                           runner-integration --> Forgejo

After (3 services)

Forgejo <--> forgejo-runner  <--> RabbitMQ <--> orchestrator <--> VMs
GitHub  <--> github-runner   <--> RabbitMQ /

Services retained

| Service | Role |
|---------|------|
| forgejo-runner (runner-integration) | Sole Forgejo interface via connect-rpc |
| github-runner (NEW) | Sole GitHub interface via Actions runner protocol |
| orchestrator | VM provisioning via vm-manager/QEMU |

Services retired

| Service | Replacement |
|---------|-------------|
| forge-integration | forgejo-runner (runner protocol replaces webhooks) |
| github-integration | github-runner (runner protocol replaces GitHub App) |
| logs-service | Platform UI (logs sent via runner protocol) |

Infrastructure retained

  • RabbitMQ — job buffer between runners and orchestrator
  • PostgreSQL — job state persistence in orchestrator
  • vm-manager — QEMU VM lifecycle management

Tasks

| # | Task | Priority | Effort | Depends on | Status |
|---|------|----------|--------|------------|--------|
| 1 | Evolve workflow-runner to execute Actions YAML run steps | 100 | M | | pending |
| 2 | Orchestrator: accept step commands via JobRequest | 95 | M | | pending |
| 3 | Clean up Forgejo runner as sole interface | 90 | L | 1 | pending |
| 4 | Implement GitHub Actions runner integration | 80 | XL | 1 | pending |
| 5 | Security: ephemeral SSH keys + opt-in debug SSH | 60 | M | 7 | pending |
| 6 | Documentation: image catalog + illumos guides | 50 | L | | pending |
| 7 | Retire forge-integration, github-integration, logs-service | 40 | M | 3, 4 | pending |

Task details

1. Evolve workflow-runner to execute Actions YAML run steps

The workflow-runner currently parses .solstice/workflow.kdl. It also needs to execute standard GitHub Actions YAML run steps passed via job.yaml. The runner integrations translate Actions YAML into step commands before publishing to MQ. KDL support is kept as a superset for users who want setup scripts and multi-OS abstractions.
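As a rough illustration of the translation step, the sketch below turns already-parsed Actions steps into step commands and rejects anything that is not a `run` step. The names `YamlStep`, `UnsupportedStep`, `translate_steps`, and the default step name are hypothetical, and YAML parsing itself is left out; only the `name`, `run`, and `env` fields of `StepCommand` come from this plan.

```rust
use std::collections::BTreeMap;
use std::fmt;

/// A step as it might look after parsing Actions YAML (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub enum YamlStep {
    Run { name: Option<String>, run: String, env: Option<BTreeMap<String, String>> },
    Uses { action: String },
}

/// A shell command for the workflow-runner, with the fields named in this plan.
#[derive(Debug, Clone, PartialEq)]
pub struct StepCommand {
    pub name: String,
    pub run: String,
    pub env: Option<BTreeMap<String, String>>,
}

/// Error for a step that cannot be translated (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct UnsupportedStep {
    pub index: usize,
    pub action: String,
}

impl fmt::Display for UnsupportedStep {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "step {}: `uses: {}` is not supported; only `run:` steps are executed",
            self.index + 1,
            self.action
        )
    }
}

/// First line of a script, used to build a default step name (an assumption).
fn first_line(script: &str) -> &str {
    script.lines().next().unwrap_or("").trim()
}

/// Translate steps in order, stopping at the first unsupported one.
pub fn translate_steps(steps: &[YamlStep]) -> Result<Vec<StepCommand>, UnsupportedStep> {
    steps
        .iter()
        .enumerate()
        .map(|(index, step)| match step {
            YamlStep::Run { name, run, env } => Ok(StepCommand {
                name: name
                    .clone()
                    .unwrap_or_else(|| format!("Run {}", first_line(run))),
                run: run.clone(),
                env: env.clone(),
            }),
            YamlStep::Uses { action } => Err(UnsupportedStep {
                index,
                action: action.clone(),
            }),
        })
        .collect()
}
```

Failing on the first unsupported step keeps the error message specific; collecting all errors before reporting would be an equally reasonable choice.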

2. Orchestrator: accept step commands via JobRequest

Add steps: Option<Vec<StepCommand>> to JobRequest (common/src/messages.rs). Each StepCommand has name, run, and optional env. The orchestrator writes these to job.yaml so the workflow-runner can execute them directly. If steps is None, workflow-runner falls back to .solstice/workflow.kdl.
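A minimal sketch of the message shape and the fallback rule might look like the following. Only the `steps` field and the `StepCommand` fields are taken from this plan; `repo_url`, `commit_sha`, `StepSource`, and `step_source` are placeholders invented for the example, and serialization is omitted to keep the block free of external crates.

```rust
use std::collections::BTreeMap;

/// One command to run inside the VM, with the fields named in this plan.
#[derive(Debug, Clone, PartialEq)]
pub struct StepCommand {
    pub name: String,
    pub run: String,
    pub env: Option<BTreeMap<String, String>>,
}

/// Simplified stand-in for the real JobRequest in common/src/messages.rs.
#[derive(Debug, Clone, PartialEq)]
pub struct JobRequest {
    pub repo_url: String,   // hypothetical field
    pub commit_sha: String, // hypothetical field
    /// New field: inline steps translated from Actions YAML.
    pub steps: Option<Vec<StepCommand>>,
}

/// Path the workflow-runner falls back to when no steps are supplied.
pub const KDL_WORKFLOW_PATH: &str = ".solstice/workflow.kdl";

/// Where the workflow-runner should take its steps from (hypothetical type).
#[derive(Debug, PartialEq)]
pub enum StepSource<'a> {
    /// Steps delivered with the job; these would be written to job.yaml.
    Inline(&'a [StepCommand]),
    /// No steps in the request: read the KDL workflow from the repository.
    KdlWorkflow(&'static str),
}

/// Apply the fallback rule: `None` means "use the KDL workflow".
/// An empty `Some(vec![])` is treated as inline here, which is an assumption.
pub fn step_source(req: &JobRequest) -> StepSource<'_> {
    match &req.steps {
        Some(steps) => StepSource::Inline(steps.as_slice()),
        None => StepSource::KdlWorkflow(KDL_WORKFLOW_PATH),
    }
}
```

Keeping `steps` optional means existing publishers that never set it keep their current KDL behavior.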

3. Clean up Forgejo runner as sole interface

Remove the tier-1 KDL workflow fetch from translator.rs. Actions YAML run steps become the primary translation path. Handle matrix builds by expanding into separate JobRequests. Report unsupported uses: steps with clear errors. Remove dependency on FORGEJO_BASE_URL/FORGEJO_TOKEN for fetching workflow files.
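Matrix expansion can be sketched as a Cartesian product over the matrix axes, with each resulting combination becoming one JobRequest. The function and type names below are hypothetical, and the sketch assumes a plain key-to-list matrix; it does not handle `include` or `exclude` entries.

```rust
use std::collections::BTreeMap;

/// One concrete combination of matrix values, e.g. {"os": "...", "toolchain": "..."}.
pub type MatrixCell = BTreeMap<String, String>;

/// Expand a matrix into every combination of its axis values.
///
/// An empty matrix yields a single empty cell (one job). An axis with no
/// values yields no cells at all. BTreeMap keeps the axis order deterministic.
pub fn expand_matrix(matrix: &BTreeMap<String, Vec<String>>) -> Vec<MatrixCell> {
    let mut cells: Vec<MatrixCell> = vec![MatrixCell::new()];
    for (key, values) in matrix {
        let mut next = Vec::with_capacity(cells.len() * values.len());
        for cell in &cells {
            for value in values {
                // Extend each partial combination with one value of this axis.
                let mut extended = cell.clone();
                extended.insert(key.clone(), value.clone());
                next.push(extended);
            }
        }
        cells = next;
    }
    cells
}
```

Each returned cell would then be substituted into the job definition before publishing, so a two-by-two matrix produces four independent JobRequests.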

4. Implement GitHub Actions runner integration

New crate implementing the GitHub Actions self-hosted runner protocol (REST + JSON). Significantly more complex than Forgejo's connect-rpc: RSA JWT authentication, OAuth bearer tokens, 50-second long-poll, per-job heartbeat every 60s, encrypted job delivery. Same internal pattern as runner-integration (poller + reporter + state).
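The state part of the poller + reporter + state pattern could track when each running job last sent a heartbeat, roughly as sketched below. The timing constants come from the figures above; `RunnerState` and its methods are hypothetical, and the sketch deliberately contains no network calls or GitHub API details.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// Long-poll timeout for the poller, per the figure in this plan.
pub const LONG_POLL_TIMEOUT: Duration = Duration::from_secs(50);
/// Per-job heartbeat interval for the reporter, per the figure in this plan.
pub const HEARTBEAT_INTERVAL: Duration = Duration::from_secs(60);

/// Shared state between poller and reporter (hypothetical type).
#[derive(Debug, Default)]
pub struct RunnerState {
    /// job id -> time of the last heartbeat (or job start)
    last_heartbeat: HashMap<String, Instant>,
}

impl RunnerState {
    /// Record a job the poller has just accepted.
    pub fn job_started(&mut self, job_id: &str, now: Instant) {
        self.last_heartbeat.insert(job_id.to_string(), now);
    }

    /// Forget a job once the reporter has sent its final status.
    pub fn job_finished(&mut self, job_id: &str) {
        self.last_heartbeat.remove(job_id);
    }

    /// Return the jobs whose heartbeat is due and mark them as renewed at `now`.
    /// The result is sorted so callers see a stable order.
    pub fn due_heartbeats(&mut self, now: Instant) -> Vec<String> {
        let mut due = Vec::new();
        for (job_id, last) in self.last_heartbeat.iter_mut() {
            if now.saturating_duration_since(*last) >= HEARTBEAT_INTERVAL {
                due.push(job_id.clone());
                *last = now;
            }
        }
        due.sort();
        due
    }
}
```

Passing `now` in as a parameter, rather than calling `Instant::now()` internally, keeps the timing logic testable without sleeping.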

5. Security: ephemeral SSH keys + opt-in debug SSH

Stop persisting SSH keys to the database. Generate in-memory, inject via cloud-init, forget after VM destroy. For failed builds with opt-in debug flag: keep VM alive for 30 minutes, expose SSH connection info in build log, rate-limit to 1 debug session per project.
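The debug-session policy can be sketched as a small in-memory tracker. This assumes "1 debug session per project" means one concurrent session per project, which is one possible reading; `DebugSessions` and `try_start` are hypothetical names, and the SSH key handling itself is not shown.

```rust
use std::collections::HashMap;
use std::time::{Duration, Instant};

/// How long a failed build's VM is kept alive for debugging (30 minutes).
pub const DEBUG_TTL: Duration = Duration::from_secs(30 * 60);

/// Tracks active debug sessions in memory only (hypothetical type).
#[derive(Debug, Default)]
pub struct DebugSessions {
    /// project -> instant at which its debug VM must be destroyed
    active: HashMap<String, Instant>,
}

impl DebugSessions {
    /// Try to start a debug session for `project`.
    ///
    /// Returns the expiry time if the session is allowed, or `None` if the
    /// project already has a session that has not expired yet.
    pub fn try_start(&mut self, project: &str, now: Instant) -> Option<Instant> {
        // Drop sessions whose VMs should already have been destroyed.
        self.active.retain(|_, expires| *expires > now);
        if self.active.contains_key(project) {
            return None;
        }
        let expires = now + DEBUG_TTL;
        self.active.insert(project.to_string(), expires);
        Some(expires)
    }
}
```

Because nothing is persisted, a restart of the orchestrator forgets all sessions; whether the corresponding VMs are then destroyed immediately is a separate decision this sketch does not make.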

6. Documentation

User-facing docs for FLOSS projects: getting started guide, image catalog (runs-on labels), illumos/OmniOS-specific guide (pkg, tar, CA certs), and FAQ (supported features, limitations).

7. Retire old services

Remove forge-integration, github-integration, and logs-service from compose.yml. Clean up environment variables, Traefik routes, and database tables. Keep source code for reference but mark deprecated.

Workflow format after migration

Users write standard GitHub Actions YAML. No custom format needed:

name: CI
on: [push, pull_request]
jobs:
  build:
    runs-on: omnios-bloody  # Solstice CI label
    steps:
      - run: pkg install developer/gcc13
      - run: cargo build --release
      - run: cargo test

Our documentation only needs to cover:

  • Available runs-on labels (our OS images)
  • What's pre-installed in each image
  • OS-specific tips (illumos package managers, tar variants, etc.)

Security model after migration

Most security concerns are solved by delegation to the platform:

| Concern | Solution |
|---------|----------|
| Log access control | Platform handles it (GitHub/Forgejo UI) |
| Webhook secrets | Platform handles per-repo secrets |
| Artifact storage | Platform handles it |
| User authentication | Platform handles it |
| SSH key storage | Ephemeral; destroyed with the VM |
| Compute abuse | Per-runner concurrency limits + platform rate limiting |