solstice-ci/deploy/podman/README.md
Till Wegmueller 4c5a8567a4
Add webhook crate for extensible signature validation and integration
- Introduce a new `webhook` crate to centralize signature validation for GitHub, Hookdeck, and Forgejo webhooks.
- Enable `github-integration` to perform unified webhook signature verification using the `webhook` crate.
- Refactor `github-integration`: replace legacy HMAC verification with the reusable `webhook` structure.
- Extend Podman configuration for Hookdeck webhook signature handling and improve documentation.
- Clean up unused dependencies by migrating to the new implementation.

Signed-off-by: Till Wegmueller <toasterson@gmail.com>
2026-01-25 22:16:11 +01:00

# Solstice CI — Production deployment with Podman Compose + Traefik

This stack deploys Solstice CI services behind Traefik with automatic TLS certificates from Let's Encrypt. It uses upstream official images for system services and multi-stage Rust builds on official Rust/Debian images that rely on container layer caching (no sccache) for fast, reproducible builds.
## Prerequisites
- Podman 4.9+ with podman-compose compatibility (podman compose)
- Public DNS records for subdomains pointing to the host running this stack
- Ports 80 and 443 open to the Internet (for ACME HTTP-01), see Rootless note below
- Email address for ACME registration
## Rootless Podman note (ports 80/443)
- Rootless Podman cannot bind privileged ports (<1024). If you run this stack rootless, set high host ports in .env:
  - TRAEFIK_HTTP_PORT=8080
  - TRAEFIK_HTTPS_PORT=4443
- With high ports, public HTTPS will be served on 4443, and the ACME HTTP-01 challenge will not work unless you forward external port 80 to host port 8080 (e.g. via a firewall/NAT rule) or place another reverse proxy in front.
- To use real public certificates with HTTP-01 directly on this host, either:
  - run Podman as root (rootful) for Traefik only, or
  - allow unprivileged port binding (requires root): `sysctl -w net.ipv4.ip_unprivileged_port_start=80`, and add net.ipv4.ip_unprivileged_port_start=80 to /etc/sysctl.conf to persist it.
- Alternatively, switch Traefik to a DNS-01 challenge (not configured here) if you control DNS.
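If external port 80 must reach the rootless high ports on this same host, a NAT redirect is one option. A minimal nftables sketch, assuming nftables manages the host firewall and the default high ports above (the table name is hypothetical; merge into your existing ruleset):

```
# /etc/nftables.conf fragment (illustrative) — redirect privileged
# ports to the rootless Traefik high ports from .env
table ip solstice_nat {
  chain prerouting {
    type nat hook prerouting priority dstnat; policy accept;
    tcp dport 80 redirect to :8080
    tcp dport 443 redirect to :4443
  }
}
```

Note that prerouting redirects only apply to traffic arriving from other hosts; local connections to port 80 are unaffected.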
## DNS
Create A/AAAA records for the following hostnames under your base domain (no environment in hostname; env separation is logical via DB/vhost/buckets):
- traefik.svc.DOMAIN
- api.svc.DOMAIN
- grpc.svc.DOMAIN
- runner.svc.DOMAIN
- forge.svc.DOMAIN (Forge/Forgejo webhooks)
- github.svc.DOMAIN (GitHub App/webhooks)
- minio.svc.DOMAIN (console UI)
- s3.svc.DOMAIN (S3 API, TLS via TCP SNI)
- mq.svc.DOMAIN (RabbitMQ mgmt UI; AMQP remains internal)
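In BIND-style zone syntax, the records might look like this (addresses are illustrative; add matching AAAA records for IPv6):

```
; zone fragment for DOMAIN (illustrative addresses)
traefik.svc  IN A  203.0.113.10
api.svc      IN A  203.0.113.10
grpc.svc     IN A  203.0.113.10
runner.svc   IN A  203.0.113.10
forge.svc    IN A  203.0.113.10
github.svc   IN A  203.0.113.10
minio.svc    IN A  203.0.113.10
s3.svc       IN A  203.0.113.10
mq.svc       IN A  203.0.113.10
```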
## Quick start
1. Copy the env template and edit secrets and settings: `cp .env.sample .env`, then edit .env (ENV=staging|prod, DOMAIN, passwords, ACME email).
2. (Optional) Use the Let's Encrypt staging CA to test issuance without rate limits by setting TRAEFIK_ACME_CASERVER=https://acme-staging-v02.api.letsencrypt.org/directory in .env.
3. Bring up the stack: `podman compose -f compose.yml up -d --build`
4. Monitor logs: `podman compose logs -f traefik`
## Services and routing
- Traefik dashboard: https://traefik.svc.${DOMAIN} (protect with TRAEFIK_DASHBOARD_AUTH in .env)
- Orchestrator HTTP: https://api.${ENV}.${DOMAIN}
- Orchestrator gRPC (h2/TLS via SNI): grpc.${ENV}.${DOMAIN}
- Forge webhooks: https://forge.${ENV}.${DOMAIN}
- GitHub webhooks: https://github.${ENV}.${DOMAIN}
- Runner static server: https://runner.${ENV}.${DOMAIN}
- MinIO console: https://minio.svc.${DOMAIN}
- S3 API: s3.svc.${DOMAIN}
- RabbitMQ management: https://mq.svc.${DOMAIN}
## Environment scoping (single infra, logical separation)
- RabbitMQ: single broker; per-environment vhosts named solstice-${ENV} (staging/prod). Services connect to amqp://.../solstice-${ENV}.
- Postgres: single cluster; databases solstice_staging and solstice_prod are created by the postgres-setup job. Services use postgres://.../solstice_${ENV}.
- MinIO: single server; buckets solstice-logs-staging and solstice-logs-prod are created by the minio-setup job. Set S3 bucket per service to the env-appropriate bucket.
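Concretely, per-environment endpoints follow the naming above; variable names here are illustrative (each service's actual configuration keys may differ):

```
# ENV=staging
AMQP_URL=amqp://solstice:password@rabbitmq:5672/solstice-staging
DATABASE_URL=postgres://solstice:password@postgres:5432/solstice_staging
S3_BUCKET=solstice-logs-staging

# ENV=prod
AMQP_URL=amqp://solstice:password@rabbitmq:5672/solstice-prod
DATABASE_URL=postgres://solstice:password@postgres:5432/solstice_prod
S3_BUCKET=solstice-logs-prod
```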
## Security notes
- Secrets are provided via podman compose secrets referencing your environment variables. Do not commit real secrets.
- Management UIs are exposed publicly via Traefik; of the data planes, only the S3 API is routed through Traefik (TLS via TCP SNI), while Postgres and AMQP remain internal to the compose network. Adjust the exposure policy as needed.
## Images and builds
- System services use Chainguard images (postgres, rabbitmq). MinIO uses upstream images.
- Rust services are built with multi-stage Containerfiles using cgr.dev/chainguard/rust and run on cgr.dev/chainguard/glibc-dynamic.
- Build caches for the cargo registry/git caches and the cargo target directory are mounted during the build (target-dir=/cargo/target is set via ~/.cargo/config).
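The cargo configuration referenced above amounts to this fragment inside the builder stage:

```toml
# ~/.cargo/config.toml in the builder image — pins the target
# directory to a cacheable mount, as described above
[build]
target-dir = "/cargo/target"
```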
## Maintenance
- Upgrade images by editing tags in compose.yml and rebuilding: podman compose build --pull
- Renewals are automatic via Traefik ACME. Certificates are stored in the traefik-acme volume.
- Backups: persist volumes (postgres-data, rabbitmq-data, minio-data, traefik-acme).
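A scheduled backup job could export those volumes with `podman volume export`; a cron sketch, with schedule, user, and backup path as placeholders:

```
# /etc/cron.d/solstice-backup (hypothetical schedule, user, and paths)
30 2 * * * deploy for v in postgres-data rabbitmq-data minio-data traefik-acme; do podman volume export "solstice-ci_$v" -o "/backup/$v.tar"; done
```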
## Tear down
- Stop: podman compose down
- Remove volumes (DANGEROUS: destroys data): podman volume rm solstice-ci_traefik-acme solstice-ci_postgres-data solstice-ci_rabbitmq-data solstice-ci_minio-data
## Troubleshooting
- Certificate issues: check Traefik logs; verify DNS and ports 80/443. For testing, use ACME staging server.
- No routes: verify labels on services and that traefik sees the podman socket.
- Healthchecks failing: inspect service logs with podman logs <container>.
- Arch Linux/Podman DNS timeouts (ACME): if Traefik logs show errors like "dial tcp: lookup acme-v02.api.letsencrypt.org on 10.89.0.1:53: i/o timeout", this is typically a Podman network DNS (netavark/aardvark-dns) issue. Fixes:
  - The compose.yml sets explicit public DNS resolvers for the Traefik container (1.1.1.1, 8.8.8.8, 9.9.9.9). Redeploy with `podman compose up -d traefik`.
  - Ensure Podman's network backend and DNS are installed and active (Arch): `pacman -S netavark aardvark-dns; systemctl enable --now aardvark-dns.socket`; verify that `podman info | grep -i network` shows networkBackend: netavark.
  - Alternatively, mount the host resolv.conf into Traefik by adding `- /etc/resolv.conf:/etc/resolv.conf:ro` to the traefik service volumes.
  - Check the firewall (nftables): allow UDP/TCP 53 from the Podman bridge (e.g. 10.89.0.0/24) to the host at 10.89.0.1, and allow FORWARD for ESTABLISHED,RELATED traffic.
  - Inspect the network with `podman network inspect podman`; consider creating a custom network with explicit DNS servers (`podman network create --dns 1.1.1.1 --dns 8.8.8.8 solstice-net`) and setting networks.core.name to that network in compose.yml.
  - As a last resort, run Traefik with host networking (network_mode: host; then remove the ports mapping and ensure only Traefik is exposed), or switch ACME to DNS-01.
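The firewall fix above might look like the following nftables fragment (the bridge subnet and gateway are assumptions; match them to the output of `podman network inspect`):

```
# nftables fragment (illustrative): let containers reach aardvark-dns
table inet solstice_filter {
  chain input {
    type filter hook input priority filter; policy accept;
    ip saddr 10.89.0.0/24 ip daddr 10.89.0.1 udp dport 53 accept
    ip saddr 10.89.0.0/24 ip daddr 10.89.0.1 tcp dport 53 accept
  }
  chain forward {
    type filter hook forward priority filter; policy accept;
    ct state established,related accept
  }
}
```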
## Ubuntu host setup for libvirt/KVM and image directories
These steps prepare an Ubuntu host so the orchestrator (running in a container) can control KVM/libvirt and manage VM images stored on the host.
1) Install libvirt/KVM and tools
- sudo apt update
- sudo apt install -y qemu-kvm libvirt-daemon-system libvirt-clients virtinst bridge-utils genisoimage
- Ensure the libvirt service is running:
- systemctl status libvirtd
- If inactive: sudo systemctl enable --now libvirtd
2) User permissions (KVM and libvirt sockets)
- Add your deployment user (the one running podman compose) to the required groups:
- sudo usermod -aG libvirt $USER
- sudo usermod -aG kvm $USER
- Log out and back in (or new shell) for group membership to take effect.
3) Default libvirt network
- Make sure the default network exists and is active (compose defaults LIBVIRT_NETWORK=default):
- virsh net-list --all
- If missing, define it from the stock XML or create a new NAT network.
- If present but inactive:
- virsh net-start default
- virsh net-autostart default
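If the default network is missing entirely, it can be recreated from the stock libvirt NAT network definition below (`virsh net-define default-net.xml`, then start and autostart it as above):

```xml
<network>
  <name>default</name>
  <forward mode='nat'/>
  <bridge name='virbr0' stp='on' delay='0'/>
  <ip address='192.168.122.1' netmask='255.255.255.0'>
    <dhcp>
      <range start='192.168.122.2' end='192.168.122.254'/>
    </dhcp>
  </ip>
</network>
```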
4) Prepare host directories for images and work data
- Base images directory (bind-mounted read/write into the orchestrator container):
- sudo mkdir -p /var/lib/solstice/images
- sudo chown "$USER":"$USER" /var/lib/solstice/images
- Orchestrator work directory for overlays and console logs:
- sudo mkdir -p /var/lib/solstice-ci
- sudo chown "$USER":"$USER" /var/lib/solstice-ci
- In deploy/podman/.env(.sample), set:
- ORCH_IMAGES_DIR=/var/lib/solstice/images
- ORCH_WORK_DIR=/var/lib/solstice-ci
5) Map the image list (image map YAML)
- Point ORCH_IMAGE_MAP_PATH at your production image map on the host (kept in git or ops repo):
- ORCH_IMAGE_MAP_PATH=/etc/solstice/orchestrator-image-map.yaml
- The orchestrator looks for /examples/orchestrator-image-map.yaml in the container; compose binds your host file there read-only.
- Ensure each images[*].local_path in the YAML points inside /var/lib/solstice/images (the in-container path is the same via the bind mount). The provided example already uses that prefix.
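As an illustration only — the example file shipped in the repo is authoritative for the schema — an image map entry might look like:

```yaml
# Hypothetical entry; only local_path is named in this README, the other
# field names are assumptions — compare with the repo's
# examples/orchestrator-image-map.yaml before use
images:
  - name: ubuntu-24.04
    local_path: /var/lib/solstice/images/ubuntu-24.04.qcow2
    url: https://cloud-images.ubuntu.com/noble/current/noble-server-cloudimg-amd64.img
```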
6) Bring up the stack
- podman compose -f compose.yml up -d --build
- The orchestrator will, on first start, download missing base images as per the YAML into ORCH_IMAGES_DIR. Subsequent starts reuse the same files.
### Notes
- Hardware acceleration: compose maps /dev/kvm into the container; verify kvm is available on the host: lsmod | grep kvm and that your CPU virtualization features are enabled in BIOS/UEFI.
- Sockets and configs: compose binds libvirt control sockets and common libvirt directories read-only so the orchestrator can read network definitions and create domains.
- If you change LIBVIRT_URI or LIBVIRT_NETWORK, update deploy/podman/.env and redeploy.
## Runner binaries (served by the orchestrator)
- Purpose: Builder VMs download workflow runner binaries from the orchestrator over HTTP.
- Host directory: Set RUNNER_DIR_HOST in deploy/podman/.env. This path is bind-mounted read-only into the orchestrator at /runners.
- Example (prod default in .env): RUNNER_DIR_HOST=/var/lib/solstice/runners
- Example (dev default in .env.sample): RUNNER_DIR_HOST=../../target/runners
- URLs: Files are served at http(s)://runner.${ENV}.${DOMAIN}/runners/<filename>
- Example: https://runner.prod.${DOMAIN}/runners/solstice-runner-linux
- Orchestrator injection: The orchestrator auto-computes default runner URLs from its HTTP_ADDR and contact address and injects them into cloud-init.
- You can override via env: SOLSTICE_RUNNER_URL (single) and SOLSTICE_RUNNER_URLS (space-separated list) to point VMs at specific filenames.
- To build/place binaries:
- Build the workflow-runner crate for your target(s) and place the resulting artifacts in RUNNER_DIR_HOST with stable filenames (e.g., solstice-runner-linux, solstice-runner-illumos).
- Ensure file permissions allow read by the orchestrator user (world-readable is fine for static serving).
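The staging step can be sketched as a small helper; the crate name, binary path, and destination are assumptions — match the destination to RUNNER_DIR_HOST from deploy/podman/.env:

```shell
# Stage a built runner binary under a stable, world-readable filename.
stage_runner() {
  src=$1; dest_dir=$2; name=$3
  mkdir -p "$dest_dir"
  # 0644 keeps the file world-readable for the orchestrator's static server
  install -m 0644 "$src" "$dest_dir/$name"
}

# Typical use after building (crate name assumed):
#   cargo build --release -p workflow-runner
#   stage_runner target/release/workflow-runner \
#     /var/lib/solstice/runners solstice-runner-linux
```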
- Traefik routing: runner.${ENV}.${DOMAIN} routes to the orchestrator's HTTP port (8081 by default).
## Forge integration configuration
- If WEBHOOK_SECRET is not set, the forge-integration service logs a warning and accepts webhooks without signature validation (dev mode). Set WEBHOOK_SECRET in deploy/podman/.env to enable HMAC validation.
- To enable posting commit statuses back to Forgejo/Gitea, set FORGEJO_TOKEN and FORGEJO_BASE_URL in deploy/podman/.env. If they are not set, the service logs a warning (FORGEJO_* not set) and disables the job result consumer that reports statuses.
- The compose file passes these variables to the container. After editing .env, run: podman compose up -d forge-integration
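The relevant .env fragment (all values are placeholders):

```
# deploy/podman/.env — forge-integration (placeholder values)
WEBHOOK_SECRET=change-me-long-random-string
FORGEJO_BASE_URL=https://forge.example.com
FORGEJO_TOKEN=your-forgejo-api-token
```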
## GitHub integration configuration
- Set GITHUB_WEBHOOK_SECRET in deploy/podman/.env to validate webhook signatures (X-Hub-Signature-256). If unset, webhooks are accepted without validation (dev mode).
- If you proxy webhooks through Hookdeck, set HOOKDECK_SIGNING_SECRET to validate Hookdeck's signature header (X-Hookdeck-Signature). Either GitHub or Hookdeck signatures can satisfy verification.
- To enable check runs and workflow fetches, configure a GitHub App and set GITHUB_APP_ID plus either GITHUB_APP_KEY (PEM contents) or GITHUB_APP_KEY_PATH (path inside the container).
- Optional overrides: GITHUB_API_BASE for GitHub Enterprise and GITHUB_CHECK_NAME to customize the check run title.
- The compose file passes these variables to the container. After editing .env, run: podman compose up -d github-integration
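For hand-checking a delivery, GitHub's X-Hub-Signature-256 header is the HMAC-SHA256 of the raw request body keyed by the webhook secret, hex-encoded and prefixed with sha256=. A sketch with openssl, using sample values:

```shell
# Recompute X-Hub-Signature-256 for a captured payload (sample values)
secret='my-webhook-secret'
body='{"action":"opened"}'
sig="sha256=$(printf '%s' "$body" | openssl dgst -sha256 -hmac "$secret" | awk '{print $NF}')"
echo "$sig"   # compare against the X-Hub-Signature-256 header
```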
## Traefik ACME CA server note
- If you see a warning about TRAEFIK_ACME_CASERVER being unset, it is harmless. The compose file defaults this value to empty, so Traefik uses the production Let's Encrypt endpoint. To test with staging, set TRAEFIK_ACME_CASERVER=https://acme-staging-v02.api.letsencrypt.org/directory in .env and redeploy Traefik.