solstice-ci/docs/ai/2025-10-25-orchestrator-scheduling-and-libvirt.md
Till Wegmueller a71f9cc7d1
Initial Commit
Signed-off-by: Till Wegmueller <toasterson@gmail.com>
2025-10-25 20:01:08 +02:00

### Solstice CI — Orchestrator Scheduling, Image Map Config, and Libvirt/Zones Backends (MVP)
This document summarizes the initial implementation of Orchestrator scheduling, a YAML-based image map configuration, cloud image preparation, and a hypervisor abstraction with Linux/KVM (libvirt) and illumos zones scaffolding.
#### What's included (MVP)
- Scheduler and capacity
  - Global max concurrency (`MAX_CONCURRENCY`, default 2) with backpressure by aligning AMQP `prefetch` to concurrency.
  - Optional per-label capacity via `CAPACITY_MAP` (e.g., `illumos-latest=2,ubuntu-22.04=4`).
  - Ack-on-accept: AMQP message is acked after basic validation and enqueue to scheduler; errors during provisioning are handled internally.
- YAML image map configuration
  - Loaded at startup from `--config` / `ORCH_CONFIG`; defaults to `examples/orchestrator-image-map.yaml`.
  - Keys: `default_label`, `aliases`, optional `sizes` presets, and `images` map with backend (`zones` or `libvirt`), `source` URL, `local_path`, `decompress` (`zstd` or none), `nocloud` (bool), and per-image default resources.
  - Default mapping provided:
    - `default_label: illumos-latest`
    - Alias: `illumos-latest → openindiana-hipster`
    - `openindiana-hipster` image points to current OI cloud image: `https://dlc.openindiana.org/isos/hipster/20250402/OI-hipster-cloudimage.img.zstd`, marked `nocloud: true` and `backend: zones`.
    - Size presets (not yet consumed directly by jobs): `small` (1 CPU, 1 GiB), `medium` (2 CPU, 2 GiB), `large` (4 CPU, 4 GiB).
- Image preparation (downloader)
  - On startup, the orchestrator ensures each configured image exists at `local_path`.
  - If missing, downloads from `source` and optionally decompresses with Zstd into the target path.
- Hypervisor abstraction
  - `Hypervisor` trait and `RouterHypervisor` dispatcher.
  - Backends:
    - `libvirt` (Linux/KVM): skeleton that connects to libvirt in `prepare`; domain XML/overlay/NoCloud seed wiring to follow.
    - `zones` (illumos/bhyve): stub scaffold (not yet functional); will integrate with `zone` crate + ZFS clones in a follow-up.
  - `NoopHypervisor` for development on hosts without privileges.
- Orchestrator MQ wiring
  - Consumes `JobRequest` messages and builds `VmSpec` from resolved label and image defaults.
  - Injects minimal cloud-init user-data content (NoCloud) into the spec for future seeding.
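
The hypervisor abstraction described above can be pictured roughly as follows. This is a minimal, synchronous sketch and not the orchestrator's actual code: the names `Hypervisor`, `RouterHypervisor`, `NoopHypervisor`, `VmSpec`, and `prepare` come from this document, but the `Backend` enum, the `register` method, the `VmSpec` fields, and every signature are assumptions made for illustration (the real trait is probably async and carries more state).

```rust
use std::collections::HashMap;

/// Backend selector (hypothetical enum; the config uses the strings `libvirt` and `zones`).
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum Backend {
    Libvirt,
    Zones,
    Noop,
}

/// Heavily reduced stand-in for the real `VmSpec` (fields are assumptions).
#[derive(Debug, Clone)]
struct VmSpec {
    label: String,
    backend: Backend,
}

/// Assumed shape of the trait: prepare a VM and return an opaque handle.
trait Hypervisor {
    fn prepare(&self, spec: &VmSpec) -> Result<String, String>;
}

/// Development backend that does no real work.
struct NoopHypervisor;

impl Hypervisor for NoopHypervisor {
    fn prepare(&self, spec: &VmSpec) -> Result<String, String> {
        Ok(format!("noop:{}", spec.label))
    }
}

/// Dispatcher that forwards each call to the backend named in the spec.
struct RouterHypervisor {
    backends: HashMap<Backend, Box<dyn Hypervisor>>,
}

impl RouterHypervisor {
    fn new() -> Self {
        RouterHypervisor { backends: HashMap::new() }
    }

    /// Hypothetical registration helper.
    fn register(&mut self, backend: Backend, hv: Box<dyn Hypervisor>) {
        self.backends.insert(backend, hv);
    }
}

impl Hypervisor for RouterHypervisor {
    fn prepare(&self, spec: &VmSpec) -> Result<String, String> {
        match self.backends.get(&spec.backend) {
            Some(hv) => hv.prepare(spec),
            None => Err(format!("no hypervisor registered for {:?}", spec.backend)),
        }
    }
}
```

Because the router implements the same trait as the backends, the scheduler only needs to hold a single `Hypervisor` value, and a host without privileges can register `NoopHypervisor` alone.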
#### Configuration (CLI/env)
- `--config`, `ORCH_CONFIG` — path to YAML image map (default `examples/orchestrator-image-map.yaml`).
- `--max-concurrency`, `MAX_CONCURRENCY` — global VM concurrency (default 2).
- `--capacity-map`, `CAPACITY_MAP` — per-label capacity (e.g., `illumos-latest=2,ubuntu-22.04=4`).
- AMQP: `AMQP_URL`, `AMQP_EXCHANGE`, `AMQP_QUEUE`, `AMQP_ROUTING_KEY`, `AMQP_PREFETCH` (defaulted to `MAX_CONCURRENCY`).
- Libvirt (Linux): `LIBVIRT_URI` (default `qemu:///system`), `LIBVIRT_NETWORK` (default `default`).
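
To make the `CAPACITY_MAP` format concrete, here is one way a `label=count` list could be parsed. It is a sketch under stated assumptions: `parse_capacity_map` is a hypothetical helper, and its handling of whitespace, empty input, and malformed entries is a guess rather than the orchestrator's documented behavior.

```rust
use std::collections::HashMap;

/// Parse a list such as `illumos-latest=2,ubuntu-22.04=4` into per-label limits.
/// Hypothetical helper; the error type and leniency are assumptions.
fn parse_capacity_map(raw: &str) -> Result<HashMap<String, usize>, String> {
    let mut map = HashMap::new();
    // Split on commas, tolerate surrounding whitespace, skip empty entries.
    for entry in raw.split(',').map(str::trim).filter(|e| !e.is_empty()) {
        let (label, count) = entry
            .split_once('=')
            .ok_or_else(|| format!("expected label=count, got `{}`", entry))?;
        let label = label.trim();
        if label.is_empty() {
            return Err(format!("missing label in `{}`", entry));
        }
        let count: usize = count
            .trim()
            .parse()
            .map_err(|_| format!("invalid capacity in `{}`", entry))?;
        map.insert(label.to_string(), count);
    }
    Ok(map)
}
```

With the example value from above, `parse_capacity_map("illumos-latest=2,ubuntu-22.04=4")` yields a map with two entries; a scheduler could then consult it alongside the global `MAX_CONCURRENCY` limit.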
#### Local usage (dev)
1. Ensure RabbitMQ is running (docker-compose service `rabbitmq`).
2. Start the Orchestrator:
   ```bash
   cargo run -p orchestrator -- \
     --config examples/orchestrator-image-map.yaml \
     --max-concurrency 2
   ```
   On first run, the OI cloud image will be downloaded and decompressed to the configured `local_path`.
3. In another terminal, enqueue a job (Forge Integration webhook or CLI `enqueue`). The orchestrator will resolve `runs_on` (or default label) and schedule a VM using the configured backend.
Note: The current libvirt/zones backends are partial; actual VM boot is a follow-up. The scheduler and config wiring are complete and ready for backend integration.
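
The label resolution mentioned in step 3 might look like the sketch below. It assumes a reduced `ImageMap` struct that keeps only `default_label`, `aliases`, and the names of configured `images`; the struct, its field types, and the `resolve` method are hypothetical and only mirror the YAML keys listed earlier.

```rust
use std::collections::HashMap;

/// Reduced, hypothetical view of the image map config.
struct ImageMap {
    default_label: String,
    /// alias -> target label or image name
    aliases: HashMap<String, String>,
    /// image name -> backend name (`zones` or `libvirt`)
    images: HashMap<String, String>,
}

impl ImageMap {
    /// Resolve `runs_on` (or the default label) to a configured image name.
    /// Returns `None` when the label is unknown.
    fn resolve<'a>(&'a self, runs_on: Option<&'a str>) -> Option<&'a str> {
        let mut label = runs_on.unwrap_or(self.default_label.as_str());
        // Follow aliases, bounding the number of hops so a cycle cannot loop forever.
        for _ in 0..=self.aliases.len() {
            if self.images.contains_key(label) {
                return Some(label);
            }
            label = self.aliases.get(label)?.as_str();
        }
        None
    }
}
```

With the default mapping from this document, a job without `runs_on` would start at `illumos-latest`, follow the alias, and land on `openindiana-hipster`, whose backend is `zones`.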
#### What's next (planned)
- Libvirt backend:
  - Create qcow2 overlays, generate domain XML (virtio devices), attach NoCloud ISO seed, define, start, shutdown, and destroy.
  - Ensure libvirt default network is active at startup if necessary.
- Illumos zones backend:
  - Integrate `oxidecomputer/zone` and ZFS clone workflow; set bhyve attributes (`vcpus`, `ram`, `bootdisk`), networking, and SMF.
- Lifecycle and runner coordination:
  - gRPC Orchestrator↔Runner for logs/status, job completion handling, and cleanup.
- Persistence and recovery:
  - Store job/VM state in Postgres; graceful recovery on restart.
- Tests and docs:
  - Unit tests for config parsing and scheduler; feature-gated libvirt smoke test; expand docs.