solstice-ci/docs/ai/2025-10-25-orchestrator-scheduling-and-libvirt.md
Till Wegmueller a71f9cc7d1
Initial Commit
Signed-off-by: Till Wegmueller <toasterson@gmail.com>
2025-10-25 20:01:08 +02:00


Solstice CI — Orchestrator Scheduling, Image Map Config, and Libvirt/Zones Backends (MVP)

This document summarizes the initial implementation of Orchestrator scheduling, a YAML-based image map configuration, cloud image preparation, and a hypervisor abstraction with Linux/KVM (libvirt) and illumos zones scaffolding.

What's included (MVP)

  • Scheduler and capacity
    • Global max concurrency (MAX_CONCURRENCY, default 2); backpressure is applied by aligning the AMQP prefetch count with the concurrency limit.
    • Optional per-label capacity via CAPACITY_MAP (e.g., illumos-latest=2,ubuntu-22.04=4).
    • Ack-on-accept: the AMQP message is acked once it passes basic validation and has been enqueued in the scheduler; errors during provisioning are handled internally.
  • YAML image map configuration
    • Loaded at startup from --config / ORCH_CONFIG; defaults to examples/orchestrator-image-map.yaml.
    • Keys: default_label, aliases, optional sizes presets, and an images map. Each image entry specifies backend (zones or libvirt), source URL, local_path, decompress (zstd or none), nocloud (bool), and per-image default resources.
    • Default mapping provided:
      • default_label: illumos-latest
      • Alias: illumos-latest → openindiana-hipster
      • The openindiana-hipster image points to the current OI cloud image, https://dlc.openindiana.org/isos/hipster/20250402/OI-hipster-cloudimage.img.zstd, and is marked nocloud: true and backend: zones.
    • Size presets (not yet consumed directly by jobs): small (1 CPU, 1 GiB), medium (2 CPU, 2 GiB), large (4 CPU, 4 GiB).
  • Image preparation (downloader)
    • On startup, the orchestrator ensures each configured image exists at local_path.
    • If missing, downloads from source and optionally decompresses with Zstd into the target path.
  • Hypervisor abstraction
    • Hypervisor trait and RouterHypervisor dispatcher.
    • Backends:
      • libvirt (Linux/KVM): skeleton that connects to libvirt in prepare; domain XML/overlay/NoCloud seed wiring to follow.
      • zones (illumos/bhyve): stub scaffold (not yet functional); will integrate with zone crate + ZFS clones in a follow-up.
    • NoopHypervisor for development on hosts without privileges.
  • Orchestrator MQ wiring
    • Consumes JobRequest messages and builds VmSpec from resolved label and image defaults.
    • Injects minimal cloud-init user-data content (NoCloud) into the spec for future seeding.
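
The capacity rules in the list above (a global limit plus optional per-label limits) can be illustrated with a small admission check. This is only a sketch under stated assumptions: the Capacity type and its try_acquire/release methods are hypothetical names, and the actual scheduler may be built on async primitives such as semaphores rather than plain counters.

```rust
use std::collections::HashMap;

/// Hypothetical capacity tracker (not the orchestrator's real type):
/// one global limit plus optional per-label limits.
pub struct Capacity {
    max_concurrency: usize,
    per_label: HashMap<String, usize>,
    running_total: usize,
    running: HashMap<String, usize>,
}

impl Capacity {
    pub fn new(max_concurrency: usize, per_label: HashMap<String, usize>) -> Self {
        Self {
            max_concurrency,
            per_label,
            running_total: 0,
            running: HashMap::new(),
        }
    }

    /// Try to reserve a slot for `label`; returns false when a limit is reached.
    pub fn try_acquire(&mut self, label: &str) -> bool {
        // Global limit (MAX_CONCURRENCY) applies to every label.
        if self.running_total >= self.max_concurrency {
            return false;
        }
        let in_use = self.running.get(label).copied().unwrap_or(0);
        // Per-label limit (CAPACITY_MAP) applies only when one is configured.
        if let Some(&limit) = self.per_label.get(label) {
            if in_use >= limit {
                return false;
            }
        }
        self.running.insert(label.to_string(), in_use + 1);
        self.running_total += 1;
        true
    }

    /// Release a slot previously reserved with `try_acquire`.
    pub fn release(&mut self, label: &str) {
        if let Some(count) = self.running.get_mut(label) {
            if *count > 0 {
                *count -= 1;
                self.running_total -= 1;
            }
        }
    }
}
```

A job whose label fails try_acquire would stay queued until another job calls release; how the real scheduler queues and wakes such jobs is not covered by this sketch.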

Configuration (CLI/env)

  • --config, ORCH_CONFIG — path to YAML image map (default examples/orchestrator-image-map.yaml).
  • --max-concurrency, MAX_CONCURRENCY — global VM concurrency (default 2).
  • --capacity-map, CAPACITY_MAP — per-label capacity (e.g., illumos-latest=2,ubuntu-22.04=4).
  • AMQP: AMQP_URL, AMQP_EXCHANGE, AMQP_QUEUE, AMQP_ROUTING_KEY, AMQP_PREFETCH (defaults to MAX_CONCURRENCY).
  • Libvirt (Linux): LIBVIRT_URI (default qemu:///system), LIBVIRT_NETWORK (defaults to the libvirt network named "default").
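
To make the CAPACITY_MAP format concrete, here is a minimal parsing sketch. The function name parse_capacity_map and its error strings are assumptions made for illustration; the orchestrator's own parser may differ in naming and in how strictly it validates input.

```rust
use std::collections::HashMap;

/// Parse a string such as "illumos-latest=2,ubuntu-22.04=4" into per-label limits.
/// Hypothetical helper, not taken from the orchestrator source.
pub fn parse_capacity_map(raw: &str) -> Result<HashMap<String, usize>, String> {
    let mut map = HashMap::new();
    // Entries are comma-separated; empty entries (e.g. a trailing comma) are skipped.
    for entry in raw.split(',').map(str::trim).filter(|e| !e.is_empty()) {
        let (label, count) = entry
            .split_once('=')
            .ok_or_else(|| format!("missing '=' in entry: {entry}"))?;
        let label = label.trim();
        if label.is_empty() {
            return Err(format!("empty label in entry: {entry}"));
        }
        let count: usize = count
            .trim()
            .parse()
            .map_err(|e| format!("invalid capacity in entry {entry}: {e}"))?;
        map.insert(label.to_string(), count);
    }
    Ok(map)
}
```

An empty string yields an empty map in this sketch, which matches the description of per-label capacity as optional.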

Local usage (dev)

  1. Ensure RabbitMQ is running (the rabbitmq docker-compose service).
  2. Start the Orchestrator:
    cargo run -p orchestrator -- \
      --config examples/orchestrator-image-map.yaml \
      --max-concurrency 2
    
    On first run, the OI cloud image will be downloaded and decompressed to the configured local_path.
  3. In another terminal, enqueue a job (Forge Integration webhook or CLI enqueue). The orchestrator will resolve runs_on (or default label) and schedule a VM using the configured backend.
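
The label resolution in step 3 (use runs_on if present, otherwise default_label, then follow aliases to a configured image) might look roughly like the sketch below. The ImageMap struct here is a trimmed-down, hypothetical stand-in: its field names follow the YAML keys, but the real types, the alias-depth limit of 8, and the resolve method are assumptions.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified view of the image map; the real structs may differ.
pub struct ImageMap {
    pub default_label: String,
    pub aliases: HashMap<String, String>,
    /// Image name -> backend name ("zones" or "libvirt").
    pub images: HashMap<String, String>,
}

impl ImageMap {
    /// Resolve a requested label (or the default label) to a configured image name.
    /// Returns None when the label does not lead to a known image.
    pub fn resolve<'a>(&'a self, runs_on: Option<&'a str>) -> Option<&'a str> {
        let mut label: &'a str = runs_on.unwrap_or(self.default_label.as_str());
        // Follow aliases a bounded number of times so an alias cycle cannot loop forever.
        for _ in 0..8 {
            match self.aliases.get(label) {
                Some(target) => label = target.as_str(),
                None => break,
            }
        }
        if self.images.contains_key(label) {
            Some(label)
        } else {
            None
        }
    }
}
```

With the default mapping described earlier, a job without runs_on would resolve through illumos-latest to openindiana-hipster, while an unknown label would resolve to nothing and could be rejected during validation.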

Note: The current libvirt/zones backends are partial; actual VM boot is a follow-up. The scheduler and config wiring are complete and ready for backend integration.
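
As a rough picture of the backend integration point, the sketch below shows a trait with a dispatcher that routes by backend name and a no-op implementation. The names Hypervisor, RouterHypervisor, NoopHypervisor, and VmSpec come from this document, but everything else is assumed: the method signature, the VmSpec fields, the register helper, and the synchronous style are simplifications, and the real trait is likely async with a richer lifecycle.

```rust
use std::collections::HashMap;

/// Hypothetical subset of the VM spec; the real VmSpec has more fields.
#[derive(Debug, Clone, PartialEq)]
pub struct VmSpec {
    pub label: String,
    pub backend: String,
    pub cpus: u32,
    pub ram_mb: u64,
}

/// Simplified, synchronous stand-in for the Hypervisor trait.
pub trait Hypervisor {
    /// Prepare a VM for the given spec and return an opaque handle.
    fn prepare(&self, spec: &VmSpec) -> Result<String, String>;
}

/// Does nothing; useful on development hosts without privileges.
pub struct NoopHypervisor;

impl Hypervisor for NoopHypervisor {
    fn prepare(&self, spec: &VmSpec) -> Result<String, String> {
        Ok(format!("noop:{}", spec.label))
    }
}

/// Dispatches to a registered backend based on `spec.backend`.
#[derive(Default)]
pub struct RouterHypervisor {
    backends: HashMap<String, Box<dyn Hypervisor>>,
}

impl RouterHypervisor {
    /// Register a backend under a name such as "zones" or "libvirt".
    pub fn register(&mut self, name: &str, backend: Box<dyn Hypervisor>) {
        self.backends.insert(name.to_string(), backend);
    }
}

impl Hypervisor for RouterHypervisor {
    fn prepare(&self, spec: &VmSpec) -> Result<String, String> {
        match self.backends.get(&spec.backend) {
            Some(backend) => backend.prepare(spec),
            None => Err(format!("no backend registered for '{}'", spec.backend)),
        }
    }
}
```

In this arrangement the scheduler only talks to the router, so a real libvirt or zones implementation can replace the no-op backend without changes to scheduling code.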

What's next (planned)

  • Libvirt backend:
    • Create qcow2 overlays, generate domain XML (virtio devices), attach NoCloud ISO seed, define, start, shutdown, and destroy.
    • Ensure libvirt default network is active at startup if necessary.
  • Illumos zones backend:
    • Integrate oxidecomputer/zone and ZFS clone workflow; set bhyve attributes (vcpus, ram, bootdisk), networking, and SMF.
  • Lifecycle and runner coordination:
    • gRPC Orchestrator↔Runner for logs/status, job completion handling, and cleanup.
  • Persistence and recovery:
    • Store job/VM state in Postgres; graceful recovery on restart.
  • Tests and docs:
    • Unit tests for config parsing and scheduler; feature-gated libvirt smoke test; expand docs.