Plans:
- 001: vm-manager migration (completed)
- 002: runner-only architecture (active)
Decision records (ADRs):
- 001: Runner-only architecture (retire webhooks + logs service)
- 002: Direct QEMU over libvirt
- 003: Ephemeral SSH keys with opt-in debug access
- 004: User-mode (SLIRP) networking for VMs
ADR-003: Ephemeral SSH Keys with Opt-In Debug Access
Date: 2026-04-09
Status: Accepted
Deciders: Till Wegmueller
Context
The orchestrator generates an Ed25519 SSH keypair per job for authenticating to the provisioned VM. Currently, both public and private keys are persisted to PostgreSQL in plaintext (job_ssh_keys table). This is a security risk: a database breach exposes every stored private key.
The keys are only needed during the VM's lifetime: from provisioning (cloud-init injects the public key) through SSH execution (orchestrator authenticates with the private key) to VM destruction.
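As a rough illustration of the provisioning step, the sketch below renders a cloud-init user-data document that authorizes a single public key. The function name `render_user_data` and the account name `ci` are assumptions made for this example, not names taken from the orchestrator; `users` and `ssh_authorized_keys` are standard cloud-config keys.

```python
def render_user_data(public_key: str, user: str = "ci") -> str:
    """Return a #cloud-config document that authorizes one SSH public key.

    `user` is a hypothetical account name chosen for this sketch.
    """
    # Reject multi-line input so a caller cannot smuggle extra YAML into the document.
    if "\n" in public_key or not public_key.startswith("ssh-ed25519 "):
        raise ValueError("expected a single-line ssh-ed25519 public key")
    return (
        "#cloud-config\n"
        "users:\n"
        f"  - name: {user}\n"
        "    ssh_authorized_keys:\n"
        f"      - {public_key}\n"
    )
```

Only the public half ever appears in the user-data, so the document can be handed to the VM without exposing the private key.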
Decision
Make SSH keys fully ephemeral:
- Generate keypair in-memory
- Inject public key via cloud-init
- Use private key for SSH connection
- Forget both keys when VM is destroyed
- Never persist to database
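The lifecycle above could be sketched roughly as follows. This is a minimal illustration, not the orchestrator's implementation: `EphemeralKeyStore`, `EphemeralKey`, and `_placeholder_keypair` are hypothetical names, and the placeholder generator returns random bytes rather than a real Ed25519 keypair (a real one would come from an SSH or crypto library). The point is only that key material lives in process memory and is dropped when the VM is destroyed.

```python
import secrets
from dataclasses import dataclass, field


def _placeholder_keypair():
    # ASSUMPTION: stand-in for real Ed25519 generation. The public string is a
    # dummy and the 32 random bytes are NOT a usable SSH private key.
    return "ssh-ed25519 PLACEHOLDER", secrets.token_bytes(32)


@dataclass
class EphemeralKey:
    public_key: str
    # bytearray so the secret can be overwritten in place; hidden from repr()
    # so it does not leak into logs.
    private_key: bytearray = field(repr=False)


class EphemeralKeyStore:
    """Keeps per-job keys in process memory only; nothing is written to a database."""

    def __init__(self, generate=_placeholder_keypair):
        self._generate = generate
        self._keys = {}

    def create(self, job_id: str) -> EphemeralKey:
        if job_id in self._keys:
            raise ValueError(f"job {job_id!r} already has a key")
        public, private = self._generate()
        key = EphemeralKey(public, bytearray(private))
        self._keys[job_id] = key
        return key

    def get(self, job_id: str) -> EphemeralKey:
        return self._keys[job_id]

    def forget(self, job_id: str) -> None:
        """Call when the VM is destroyed. Safe to call more than once."""
        key = self._keys.pop(job_id, None)
        if key is not None:
            # Best-effort wipe; Python cannot guarantee no other copies exist.
            key.private_key[:] = bytes(len(key.private_key))
            key.public_key = ""
```

A caller would pair `create` with VM provisioning and `forget` with VM destruction, for example in a `try`/`finally` around the job, so the key cannot outlive the VM.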
Exception: opt-in debug SSH for failed builds. When a job fails and the user has opted in (e.g., via a workflow annotation or label), keep the VM alive for a TTL (30 minutes) and expose SSH connection info in the build log so the user can debug inside the target OS.
Consequences
Positive
- Zero persistent key storage: no database breach risk for SSH keys
- Simpler persistence layer: job_ssh_keys table can be removed
- Debug SSH feature: valuable for OS-specific debugging (illumos quirks, package issues), similar to CircleCI's "rerun with SSH" and Buildkite's debug feature
Negative
- Cannot retroactively access a destroyed VM: if the key is gone, there's no way back in
- Debug SSH adds complexity: TTL management, rate limiting, VM lifecycle exception path
- Debug SSH is a security surface: must ensure TTL is enforced and connection info only goes to the authenticated user via the platform's log channel
Design for debug SSH
- Rate limit: max 1 debug session per project concurrently
- TTL: 30 minutes, non-renewable, force-destroy on expiry
- Connection info: printed as a build log step (platform controls access)
- Opt-in: explicit flag required (never default)
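One possible shape for these rules is sketched below. The names `DebugSessions`, `DebugSession`, `request`, and `expire` are assumptions made for illustration, and the clock is passed in as `now` (seconds) so the policy can be exercised without waiting. The sketch only tracks which VMs may stay alive; actually destroying a VM and printing connection info to the build log are left to the caller.

```python
from dataclasses import dataclass

DEBUG_TTL_SECONDS = 30 * 60  # 30 minutes, per the design above


@dataclass(frozen=True)
class DebugSession:
    project: str
    vm_id: str
    expires_at: float


class DebugSessions:
    """Tracks at most one live debug session per project."""

    def __init__(self, ttl: float = DEBUG_TTL_SECONDS):
        self._ttl = ttl
        self._by_project = {}

    def request(self, project, vm_id, *, job_failed, opted_in, now):
        """Return a session if the VM may be kept alive, else None (destroy now)."""
        # Opt-in is explicit and only applies to failed jobs.
        if not (job_failed and opted_in):
            return None
        # Rate limit: one concurrent session per project.
        if project in self._by_project:
            return None
        session = DebugSession(project, vm_id, now + self._ttl)
        self._by_project[project] = session
        return session

    def expire(self, now):
        """Remove and return sessions past their TTL; the caller force-destroys the VMs."""
        expired = [s for s in self._by_project.values() if now >= s.expires_at]
        for s in expired:
            del self._by_project[s.project]
        return expired
```

There is deliberately no renew method, which keeps the TTL non-renewable by construction. A session that has passed its deadline still blocks new requests for its project until `expire` is called, so the caller would need to run `expire` on a regular tick.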