solstice-ci/docs/ai/decisions/003-ephemeral-ssh-keys.md

# ADR-003: Ephemeral SSH Keys with Opt-In Debug Access
**Date:** 2026-04-09
**Status:** Accepted
**Deciders:** Till Wegmueller
## Context
The orchestrator generates an Ed25519 SSH keypair per job for authenticating to the provisioned VM. Currently, both the public and private keys are persisted to PostgreSQL in plaintext (`job_ssh_keys` table). This is a security risk: a database breach exposes every stored SSH key, private keys included.
The keys are only needed during the VM's lifetime: from provisioning (cloud-init injects the public key) through SSH execution (orchestrator authenticates with the private key) to VM destruction.
## Decision
Make SSH keys fully ephemeral:
1. Generate the keypair in memory
2. Inject the public key via cloud-init
3. Use the private key for the SSH connection
4. Discard both keys when the VM is destroyed
5. Never persist either key to the database
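
The lifecycle above can be sketched as a scoped in-memory keypair. This is a minimal illustration, not the orchestrator's actual code: `EphemeralKeypair`, `job_keypair`, and `render_cloud_config` are hypothetical names, and key generation is passed in as a callable so that no particular crypto library is assumed. The only real interface relied on is cloud-init's `ssh_authorized_keys` cloud-config key.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Callable, Iterator, Tuple


@dataclass
class EphemeralKeypair:
    """Hypothetical holder for a per-job keypair that lives only in memory."""
    public_openssh: str     # e.g. "ssh-ed25519 AAAA... job-42"
    private_key: bytearray  # mutable, so it can be overwritten on teardown

    def wipe(self) -> None:
        # Best-effort zeroing; Python may still hold other copies in memory.
        for i in range(len(self.private_key)):
            self.private_key[i] = 0
        self.public_openssh = ""


@contextmanager
def job_keypair(generate: Callable[[], Tuple[str, bytes]]) -> Iterator[EphemeralKeypair]:
    """Steps 1, 4 and 5: generate in memory, discard on exit, never persist."""
    public, private = generate()  # assumed to return (OpenSSH public key, private key bytes)
    pair = EphemeralKeypair(public, bytearray(private))
    try:
        yield pair  # steps 2 and 3 happen inside the `with` body
    finally:
        pair.wipe()


def render_cloud_config(public_openssh: str) -> str:
    """Step 2: cloud-init user-data that authorizes the job's public key."""
    return "#cloud-config\nssh_authorized_keys:\n  - " + public_openssh + "\n"
```

The `with job_keypair(...)` block would wrap provisioning, SSH execution, and VM destruction, so the keys go out of scope at the same point the VM does.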
Exception: **opt-in debug SSH for failed builds**. When a job fails and the user has opted in (e.g., via a workflow annotation or label), keep the VM alive for a fixed 30-minute TTL and expose the SSH connection info in the build log so the user can debug inside the target OS.
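
As a rough sketch of that exception path, the decision to hold a VM might look like the following. The label name `debug-ssh`, the `DebugHold` type, and the log line wording are all assumptions made for illustration; the ADR does not fix the opt-in mechanism or specify how credentials reach the user.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

DEBUG_LABEL = "debug-ssh"    # hypothetical opt-in label, not defined by the ADR
DEBUG_TTL_SECONDS = 30 * 60  # the 30-minute TTL from the decision


@dataclass(frozen=True)
class DebugHold:
    ttl_seconds: int
    log_line: str  # emitted as a build log step, so the platform controls who sees it


def plan_debug_hold(job_failed: bool, labels: FrozenSet[str],
                    host: str, user: str) -> Optional[DebugHold]:
    """Return a hold plan only for failed jobs that explicitly opted in."""
    if not job_failed or DEBUG_LABEL not in labels:
        return None  # default path: destroy the VM and discard the keys
    minutes = DEBUG_TTL_SECONDS // 60
    line = f"Debug SSH enabled for {minutes} minutes: ssh {user}@{host}"
    return DebugHold(DEBUG_TTL_SECONDS, line)
```

Returning `None` for every case except "failed and opted in" keeps the ephemeral path as the default, so a missing or misspelled label can never leave a VM running.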
## Consequences
### Positive
- **Zero persistent key storage**: no database breach risk for SSH keys
- **Simpler persistence layer**: `job_ssh_keys` table can be removed
- **Debug SSH feature**: valuable for OS-specific debugging (illumos quirks, package issues); similar to CircleCI's "rerun with SSH" and Buildkite's debug feature
### Negative
- **Cannot retroactively access a destroyed VM**: if the key is gone, there's no way back in
- **Debug SSH adds complexity**: TTL management, rate limiting, VM lifecycle exception path
- **Debug SSH is a security surface**: must ensure TTL is enforced and connection info only goes to the authenticated user via the platform's log channel
### Design for debug SSH
- Rate limit: at most one concurrent debug session per project
- TTL: 30 minutes, non-renewable, force-destroy on expiry
- Connection info: printed as a build log step (platform controls access)
- Opt-in: explicit flag required (never default)
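
One way to read these four constraints together is as a small session registry. The sketch below is illustrative only: `DebugSessionManager`, `try_start`, and `reap` are hypothetical names, the `destroy_vm` callback stands in for whatever the orchestrator uses to tear down a VM, and time is passed in explicitly to keep the logic easy to test.

```python
from typing import Callable, Dict, List, Tuple


class DebugSessionManager:
    """Tracks debug sessions: one per project, fixed TTL, no renewal."""

    def __init__(self, destroy_vm: Callable[[str], None],
                 ttl_seconds: float = 30 * 60) -> None:
        self._destroy_vm = destroy_vm  # assumed orchestrator teardown hook
        self._ttl = ttl_seconds
        self._sessions: Dict[str, Tuple[str, float]] = {}  # project -> (vm_id, deadline)

    def try_start(self, project: str, vm_id: str, opted_in: bool, now: float) -> bool:
        """Start a session only with an explicit opt-in and a free project slot."""
        self.reap(now)
        if not opted_in or project in self._sessions:
            return False  # caller destroys the VM on the normal path
        self._sessions[project] = (vm_id, now + self._ttl)
        return True

    def reap(self, now: float) -> List[str]:
        """Force-destroy every VM whose deadline has passed."""
        expired = [p for p, (_, deadline) in self._sessions.items() if now >= deadline]
        destroyed = []
        for project in expired:
            vm_id, _ = self._sessions.pop(project)
            self._destroy_vm(vm_id)
            destroyed.append(vm_id)
        return destroyed
```

There is deliberately no method to extend a deadline, which is how this sketch expresses "non-renewable". In a real orchestrator `reap` would presumably run on a timer rather than only when a new session is requested.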