solstice-ci/docs/ai/2025-10-25-orchestrator-scheduling-and-libvirt.md

### Solstice CI — Orchestrator Scheduling, Image Map Config, and Libvirt Lifecycle (MVP)
This document reflects the current implementation of the Orchestrator: scheduling/capacity, a YAML-based image map, cloud image preparation, and a hypervisor abstraction with a working Linux/KVM (libvirt) backend and an illumos zones scaffold.
#### What's included (current status)
- Scheduler and capacity
  - Global max concurrency (`MAX_CONCURRENCY`, default 2) with backpressure by aligning AMQP `prefetch` to concurrency.
  - Optional per-label capacity via `CAPACITY_MAP` (e.g., `illumos-latest=2,ubuntu-22.04=4`).
  - Ack-on-accept: the AMQP message is acked after basic validation and enqueueing to the scheduler; errors during provisioning are handled internally.
- YAML image map configuration (backend-agnostic images)
  - Loaded at startup from `--config` / `ORCH_CONFIG`; defaults to `examples/orchestrator-image-map.yaml`.
  - Keys: `default_label`, `aliases`, optional `sizes` presets, and `images` map with `source` URL, `local_path`, `decompress` (`zstd` or none), `nocloud` (bool), and per-image default resources.
  - Default mapping provided:
    - `default_label: illumos-latest`
    - Alias: `illumos-latest → openindiana-hipster`
    - `openindiana-hipster` image points to current OI cloud image: `https://dlc.openindiana.org/isos/hipster/20250402/OI-hipster-cloudimage.img.zstd` and is marked `nocloud: true`.
  - Size presets (operator convenience): `small` (1 CPU, 1 GiB), `medium` (2 CPU, 2 GiB), `large` (4 CPU, 4 GiB).
- Image preparation (downloader)
  - On startup, the orchestrator ensures each configured image exists at `local_path`.
  - If missing, downloads from `source` and optionally decompresses with Zstd into the target path.
- Hypervisor abstraction
  - `Hypervisor` trait and `RouterHypervisor` dispatcher.
  - Backends:
    - `libvirt` (Linux/KVM): implemented. It creates qcow2 overlays (`qemu-img`), generates domain XML with virtio devices, builds and attaches a NoCloud seed ISO (`mkisofs`/`genisoimage`), defines and starts the domain, and shuts it down via ACPI with a timeout, forcing a destroy if needed. It also ensures the libvirt network (`default` by default) is active and set to autostart.
    - `zones` (illumos/bhyve): scaffold only (not yet functional); will integrate with the `zone` crate and ZFS clones in a follow-up.
    - `NoopHypervisor` for development on hosts without privileges.
- Orchestrator MQ wiring
  - Consumes `JobRequest` messages and builds `VmSpec` from resolved label and image defaults.
  - Injects minimal cloud-init user-data content (NoCloud) into the spec for seeding.
- Graceful shutdown
  - On SIGINT/SIGTERM the consumer is stopped, the scheduler is allowed to drain, and active VMs are asked to shut down gracefully before being destroyed.
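The capacity rules in the scheduler item above (a global limit plus optional per-label limits) can be sketched as follows. This is a minimal, std-only illustration and not the orchestrator's actual scheduler: the `Capacity` type and its `try_acquire`/`release` methods are hypothetical names, and the real implementation may well use async primitives such as semaphores instead.

```rust
use std::collections::HashMap;

/// Hypothetical capacity tracker: a global limit plus optional per-label limits.
#[derive(Debug)]
pub struct Capacity {
    max_global: usize,
    per_label: HashMap<String, usize>,
    in_use_global: usize,
    in_use_label: HashMap<String, usize>,
}

impl Capacity {
    pub fn new(max_global: usize, per_label: HashMap<String, usize>) -> Self {
        Self {
            max_global,
            per_label,
            in_use_global: 0,
            in_use_label: HashMap::new(),
        }
    }

    /// Try to reserve a slot for `label`. Returns false when the global
    /// limit, or the label's own limit (if one is configured), is reached.
    pub fn try_acquire(&mut self, label: &str) -> bool {
        if self.in_use_global >= self.max_global {
            return false;
        }
        let used = self.in_use_label.get(label).copied().unwrap_or(0);
        if let Some(&limit) = self.per_label.get(label) {
            if used >= limit {
                return false;
            }
        }
        self.in_use_global += 1;
        self.in_use_label.insert(label.to_string(), used + 1);
        true
    }

    /// Release a slot previously reserved for `label`; a no-op if none is held.
    pub fn release(&mut self, label: &str) {
        if let Some(used) = self.in_use_label.get_mut(label) {
            if *used > 0 {
                *used -= 1;
                self.in_use_global -= 1;
            }
        }
    }
}
```

With a global limit of 2 and `illumos-latest=1`, a second `illumos-latest` job would be held back even though a global slot is free, while a job for an unlisted label is bounded only by the global limit.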
#### Configuration (CLI/env)
- `--config`, `ORCH_CONFIG` — path to YAML image map (default `examples/orchestrator-image-map.yaml`).
- `--max-concurrency`, `MAX_CONCURRENCY` — global VM concurrency (default 2).
- `--capacity-map`, `CAPACITY_MAP` — per-label capacity (e.g., `illumos-latest=2,ubuntu-22.04=4`).
- AMQP: `AMQP_URL`, `AMQP_EXCHANGE`, `AMQP_QUEUE`, `AMQP_ROUTING_KEY`, `AMQP_PREFETCH` (defaulted to `MAX_CONCURRENCY`).
- Libvirt (Linux): `LIBVIRT_URI` (default `qemu:///system`), `LIBVIRT_NETWORK` (default `default`).
- Requirements for libvirt lifecycle on Linux: `libvirtd` running, `qemu-img`, and `mkisofs` (or `genisoimage`) available on PATH.
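To make the `CAPACITY_MAP` format and the `AMQP_PREFETCH` default concrete, here is a small sketch of how such values could be parsed. The function names `parse_capacity_map` and `effective_prefetch` are hypothetical, and the handling of whitespace and empty entries is an assumption rather than documented orchestrator behavior.

```rust
use std::collections::HashMap;

/// Parse a string such as "illumos-latest=2,ubuntu-22.04=4" into label -> capacity.
/// Assumption: surrounding whitespace is ignored and empty entries are skipped.
pub fn parse_capacity_map(raw: &str) -> Result<HashMap<String, usize>, String> {
    let mut map = HashMap::new();
    for entry in raw.split(',').map(str::trim).filter(|e| !e.is_empty()) {
        let (label, count) = entry
            .split_once('=')
            .ok_or_else(|| format!("missing '=' in entry `{entry}`"))?;
        let label = label.trim();
        if label.is_empty() {
            return Err(format!("empty label in entry `{entry}`"));
        }
        let count: usize = count
            .trim()
            .parse()
            .map_err(|e| format!("invalid capacity in entry `{entry}`: {e}"))?;
        map.insert(label.to_string(), count);
    }
    Ok(map)
}

/// Effective AMQP prefetch: an explicit value wins, otherwise it follows
/// the global max concurrency, mirroring the default described above.
pub fn effective_prefetch(amqp_prefetch: Option<u16>, max_concurrency: u16) -> u16 {
    amqp_prefetch.unwrap_or(max_concurrency)
}
```

Returning an error for malformed entries, rather than silently dropping them, is a design choice of this sketch: a typo in a capacity setting is easier to catch at startup than as unexpected scheduling behavior later.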
#### Local usage (dev)
1. Ensure RabbitMQ is running (docker-compose service `rabbitmq`).
2. Start the Orchestrator:
   ```bash
   cargo run -p orchestrator -- \
     --config examples/orchestrator-image-map.yaml \
     --max-concurrency 2
   ```
   On first run, the OI cloud image will be downloaded and decompressed to the configured `local_path`.
3. In another terminal, enqueue a job (via the Forge Integration webhook or the CLI `enqueue` command). On Linux with libvirt enabled, the orchestrator resolves `runs_on` (or the default label), prepares an overlay and seed ISO, then defines and starts `job-<uuid>`.
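The label resolution in step 3 could look roughly like the sketch below. It assumes that a missing `runs_on` falls back to `default_label` and that aliases are followed for a single hop; the `ImageMap` struct here is a hypothetical subset of the YAML config, and `resolve` and `domain_name` are illustrative names, not the orchestrator's API.

```rust
use std::collections::HashMap;

/// Hypothetical subset of the image map: only the fields needed to resolve a label.
pub struct ImageMap {
    pub default_label: String,
    pub aliases: HashMap<String, String>,
}

impl ImageMap {
    /// Resolve a job's optional `runs_on` value to an image key:
    /// fall back to `default_label`, then follow at most one alias hop.
    pub fn resolve<'a>(&'a self, runs_on: Option<&'a str>) -> &'a str {
        let label = runs_on.unwrap_or(self.default_label.as_str());
        self.aliases.get(label).map(String::as_str).unwrap_or(label)
    }
}

/// Build the `job-<uuid>` name from a job id (passed as a string here).
pub fn domain_name(job_id: &str) -> String {
    format!("job-{job_id}")
}
```

With the default mapping described earlier, a job without `runs_on` would resolve to `openindiana-hipster`, and a label with no alias is returned unchanged so that it can be looked up directly in `images`.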
#### What's next (planned)
- Illumos zones backend:
  - Integrate `oxidecomputer/zone` and ZFS clone workflow; set bhyve attributes (`vcpus`, `ram`, `bootdisk`), networking, and SMF.
- Lifecycle and runner coordination:
  - gRPC Orchestrator↔Runner for logs/status, job completion handling, and cleanup.
- Persistence and recovery:
  - Store job/VM state in Postgres; reconcile on restart.
- Tests and docs:
  - Unit tests for config parsing, scheduler capacity accounting, and cloud-init seed creation; an opt-in libvirt smoke test; expand docs.