# Reddwarf Production Readiness Audit
**Last updated:** 2026-02-14
**Baseline commit:** `58171c7` (Add periodic reconciliation, node health checker, and graceful pod termination)

---
## 1. Zone Runtime (`reddwarf-runtime`)
| Requirement | Status | Notes |
|---|---|---|
| Pod spec to zonecfg | DONE | `zone/config.rs`, `controller.rs:pod_to_zone_config()` |
| Zone lifecycle (zoneadm) | DONE | `illumos.rs` — create, install, boot, halt, uninstall, delete |
| Container to Zone mapping | DONE | Naming, sanitization, 64-char truncation |
| CPU limits to capped-cpu | DONE | Aggregates across containers, limits preferred over requests |
| Memory limits to capped-memory | DONE | Aggregates across containers, illumos G/M/K suffixes |
| Network to Crossbow VNIC | DONE | `dladm create-etherstub`, `create-vnic`, per-pod VNIC+IP |
| Volumes to ZFS datasets | DONE | Create, destroy, clone, quota, snapshot support |
| Image pull / clone | PARTIAL | ZFS clone works; LX tarball `-s` works. Missing: image pull/registry, `.zar` archive support, golden image bootstrap |
| Health probes (zlogin) | DONE | exec-in-zone via `zlogin`, liveness/readiness/startup probes with exec/HTTP/TCP actions, probe tracker state machine integrated into reconcile loop. v1 limitation: probes run at reconcile cadence, not per-probe `periodSeconds` |
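
The CPU and memory rows above describe summing resources across a pod's containers, preferring limits over requests, and rendering memory with illumos G/M/K suffixes. A minimal sketch of that rule follows. The type `ContainerResources` and all function names are hypothetical and not taken from `reddwarf-runtime`. The two-decimal `ncpus` formatting and the round-up-to-K fallback are assumptions made for illustration.

```rust
/// Per-container resources in base units (hypothetical type for this sketch).
#[derive(Clone, Copy, Default)]
pub struct ContainerResources {
    pub cpu_limit_millis: Option<u64>,
    pub cpu_request_millis: Option<u64>,
    pub mem_limit_bytes: Option<u64>,
    pub mem_request_bytes: Option<u64>,
}

/// Sum (limit, request) pairs, taking the limit when present and the
/// request otherwise. Returns None when no pair carries a value.
fn sum_preferring_limit(
    pairs: impl Iterator<Item = (Option<u64>, Option<u64>)>,
) -> Option<u64> {
    let mut total: Option<u64> = None;
    for (limit, request) in pairs {
        if let Some(value) = limit.or(request) {
            total = Some(total.unwrap_or(0) + value);
        }
    }
    total
}

/// Aggregate CPU (in millicores) across all containers of a pod.
pub fn aggregate_cpu_millis(containers: &[ContainerResources]) -> Option<u64> {
    sum_preferring_limit(
        containers
            .iter()
            .map(|c| (c.cpu_limit_millis, c.cpu_request_millis)),
    )
}

/// Aggregate memory (in bytes) across all containers of a pod.
pub fn aggregate_mem_bytes(containers: &[ContainerResources]) -> Option<u64> {
    sum_preferring_limit(
        containers
            .iter()
            .map(|c| (c.mem_limit_bytes, c.mem_request_bytes)),
    )
}

/// Render millicores as a decimal CPU count, e.g. 750 -> "0.75".
/// Assumption: two decimal places, truncating any finer precision.
pub fn format_ncpus(millis: u64) -> String {
    format!("{}.{:02}", millis / 1000, (millis % 1000) / 10)
}

/// Render bytes with the largest suffix (G, M, K) that divides exactly.
/// Assumption: values that are not a whole number of K are rounded up to K.
pub fn format_illumos_memory(bytes: u64) -> String {
    const K: u64 = 1024;
    const M: u64 = K * 1024;
    const G: u64 = M * 1024;
    if bytes != 0 && bytes % G == 0 {
        format!("{}G", bytes / G)
    } else if bytes != 0 && bytes % M == 0 {
        format!("{}M", bytes / M)
    } else {
        // Ceiling division so the cap is never lower than requested.
        format!("{}K", (bytes + K - 1) / K)
    }
}
```

With one container at a 500m CPU limit and another with only a 250m request, the aggregate is 750m, which this sketch renders as `0.75`.
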
## 2. Reconciliation / Controller Loop
| Requirement | Status | Notes |
|---|---|---|
| Event bus / watch | DONE | tokio broadcast channel, SSE watch API, multi-subscriber |
| Pod controller | DONE | Event-driven + full reconcile on lag, provision/deprovision |
| Node controller (NotReady) | DONE | `node_health.rs` — checks every 15s, marks stale (>40s) nodes NotReady with reason NodeStatusUnknown |
| Continuous reconciliation | DONE | `controller.rs` — periodic `reconcile_all()` every 30s via `tokio::time::interval` in select! loop |
| Graceful termination | DONE | DELETE sets `deletion_timestamp` + phase=Terminating; controller drives shutdown state machine; POST `.../finalize` for actual removal |
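
The node controller row above states that nodes whose last heartbeat is older than 40s are marked NotReady with reason NodeStatusUnknown. The staleness decision can be sketched as a pure function. The names `NodeReadiness`, `evaluate_node`, and `STALE_AFTER` are hypothetical and are not claimed to match `node_health.rs`.

```rust
use std::time::Duration;

/// Staleness threshold from the audit table (>40s means stale).
pub const STALE_AFTER: Duration = Duration::from_secs(40);

/// Outcome of a single health check for one node (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum NodeReadiness {
    Ready,
    NotReady { reason: &'static str },
}

/// Decide readiness from the time elapsed since the node's last heartbeat.
/// A node exactly at the threshold is still treated as Ready.
pub fn evaluate_node(since_last_heartbeat: Duration) -> NodeReadiness {
    if since_last_heartbeat > STALE_AFTER {
        NodeReadiness::NotReady {
            reason: "NodeStatusUnknown",
        }
    } else {
        NodeReadiness::Ready
    }
}
```

Keeping the decision free of I/O makes it easy to unit test. In this sketch the caller would invoke it for every node on each 15s tick.
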
## 3. Pod Status Tracking
| Requirement | Status | Notes |
|---|---|---|
| Zone state to pod phase | DONE | 8 zone states mapped to pod phases |
| Status subresource (`/status`) | DONE | PUT endpoint, spec/status separation, fires MODIFIED events |
| ShuttingDown mapping | DONE | Fixed in `58171c7` — maps to "Terminating" |
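
To illustrate the zone-state-to-phase mapping, here is a hedged sketch. Only the ShuttingDown to "Terminating" arm is confirmed by this audit. The `ZoneState` enum, the variant names (modelled on the states `zoneadm` reports), and every other arm are assumptions and may differ from the project's actual mapping.

```rust
/// Zone states modelled on what `zoneadm list` reports (hypothetical enum).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ZoneState {
    Configured,
    Incomplete,
    Installed,
    Ready,
    Mounted,
    Running,
    ShuttingDown,
    Down,
}

/// Map a zone state to a pod phase string.
/// Only ShuttingDown -> "Terminating" comes from the audit; the remaining
/// arms are illustrative guesses.
pub fn pod_phase(state: ZoneState) -> &'static str {
    match state {
        // Assumption: anything before the zone is booted counts as Pending.
        ZoneState::Configured
        | ZoneState::Incomplete
        | ZoneState::Installed
        | ZoneState::Ready
        | ZoneState::Mounted => "Pending",
        ZoneState::Running => "Running",
        // Confirmed by the audit table (fixed in 58171c7).
        ZoneState::ShuttingDown => "Terminating",
        // Assumption: a halted but not yet torn down zone is still terminating.
        ZoneState::Down => "Terminating",
    }
}
```

An exhaustive `match` without a wildcard arm means the compiler flags any newly added zone state that lacks a mapping.
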
## 4. Node Agent / Heartbeat
| Requirement | Status | Notes |
|---|---|---|
| Self-registration | DONE | Creates Node resource with allocatable CPU/memory |
| Periodic heartbeat | DONE | 10-second interval, Ready condition |
| Report zone states | NOT DONE | Heartbeat doesn't query actual zone states |
| Dynamic resource reporting | DONE | `sysinfo.rs` — detects CPU/memory via `sys-info`, capacity vs allocatable split with configurable reservations (`--system-reserved-cpu`, `--system-reserved-memory`, `--max-pods`). Done in `d3eb0b2` |
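
The capacity vs allocatable split mentioned above reduces to subtracting the configured reservations from the detected capacity. A minimal sketch follows, assuming hypothetical names (`NodeResources`, `allocatable`) and assuming that a reservation larger than capacity clamps to zero rather than erroring.

```rust
/// CPU and memory totals for a node (hypothetical type for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct NodeResources {
    pub cpu_millis: u64,
    pub memory_bytes: u64,
}

/// Allocatable = capacity minus system reservation, clamped at zero.
/// `reserved` stands in for the values passed via `--system-reserved-cpu`
/// and `--system-reserved-memory`.
pub fn allocatable(capacity: NodeResources, reserved: NodeResources) -> NodeResources {
    NodeResources {
        cpu_millis: capacity.cpu_millis.saturating_sub(reserved.cpu_millis),
        memory_bytes: capacity.memory_bytes.saturating_sub(reserved.memory_bytes),
    }
}
```

For example, a node with 8000m CPU and 16 GiB of memory that reserves 500m and 1 GiB would advertise 7500m and 15 GiB as allocatable.
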
## 5. Main Binary
| Requirement | Status | Notes |
|---|---|---|
| API + scheduler + runtime wired | DONE | All 4 components spawned as tokio tasks |
| CLI via clap | DONE | `serve` and `agent` subcommands |
| Graceful shutdown | DONE | SIGINT + CancellationToken + 5s timeout |
| TLS (rustls) | DONE | Auto-generated self-signed CA + server cert, or user-provided PEM. Added in `cb6ca8c` |
| SMF service manifest | DONE | SMF manifest + method script in `smf/`. Added in `cb6ca8c` |
## 6. Networking
| Requirement | Status | Notes |
|---|---|---|
| Etherstub creation | DONE | `dladm create-etherstub` |
| VNIC per zone | DONE | `dladm create-vnic -l etherstub` |
| ipadm IP assignment | PARTIAL | IP set in zonecfg `allowed-address` but no explicit `ipadm create-addr` call |
| IPAM | DONE | Sequential alloc, idempotent, persistent, pool exhaustion handling |
| Service ClusterIP / NAT | NOT DONE | Services stored at API level but no backend controller, no ipnat rules, no proxy, no DNS |
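
The IPAM row lists sequential allocation, idempotency, and pool exhaustion handling. Those three properties can be sketched in memory as below. The `Ipam` type, its methods, and the pod-key format are hypothetical. Persistence and address release are left out, so this is not a description of the project's allocator.

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

/// Errors the sketch allocator can return.
#[derive(Debug, PartialEq, Eq)]
pub enum IpamError {
    PoolExhausted,
}

/// In-memory sequential allocator over an inclusive address range.
pub struct Ipam {
    /// Next candidate address; u64 so incrementing past 255.255.255.255
    /// cannot overflow.
    next: u64,
    /// Last address in the pool (inclusive).
    last: u64,
    /// Existing leases keyed by pod identity, e.g. "namespace/name".
    leases: HashMap<String, Ipv4Addr>,
}

impl Ipam {
    pub fn new(first: Ipv4Addr, last: Ipv4Addr) -> Self {
        Ipam {
            next: u64::from(u32::from(first)),
            last: u64::from(u32::from(last)),
            leases: HashMap::new(),
        }
    }

    /// Return the pod's existing lease if there is one (idempotent),
    /// otherwise hand out the next sequential address.
    pub fn allocate(&mut self, pod_key: &str) -> Result<Ipv4Addr, IpamError> {
        if let Some(ip) = self.leases.get(pod_key) {
            return Ok(*ip);
        }
        if self.next > self.last {
            return Err(IpamError::PoolExhausted);
        }
        let ip = Ipv4Addr::from(self.next as u32);
        self.next += 1;
        self.leases.insert(pod_key.to_string(), ip);
        Ok(ip)
    }
}
```

Checking the lease map before touching the counter is what makes a repeated reconcile of the same pod safe: the second call returns the same address instead of consuming a new one.
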
## 7. Scheduler
| Requirement | Status | Notes |
|---|---|---|
| Versioned bind_pod() | DONE | Fixed in `c50ecb2` — creates versioned commits |
| Zone brand constraints | DONE | `ZoneBrandMatch` filter checks `reddwarf.io/zone-brand` annotation vs `reddwarf.io/zone-brands` node label. Done in `4c7f50a` |
| Actual resource usage | NOT DONE | Only compares requests vs static allocatable — no runtime metrics |
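
The `ZoneBrandMatch` row compares a pod annotation against a node label. A sketch of such a filter is shown below. The two key strings come from the table, but everything else is an assumption: the function name, the comma-separated format of the node label, and the behaviour when either key is absent.

```rust
use std::collections::HashMap;

/// Annotation on the pod naming the brand it needs (from the audit table).
pub const POD_BRAND_ANNOTATION: &str = "reddwarf.io/zone-brand";
/// Label on the node listing the brands it supports (from the audit table).
pub const NODE_BRANDS_LABEL: &str = "reddwarf.io/zone-brands";

/// Return true when the node can host the pod's requested zone brand.
///
/// Assumptions in this sketch:
/// - the node label holds a comma-separated list such as "lx,bhyve";
/// - a pod without the annotation is accepted by every node;
/// - a node without the label rejects any pod that requests a brand.
pub fn zone_brand_matches(
    pod_annotations: &HashMap<String, String>,
    node_labels: &HashMap<String, String>,
) -> bool {
    let wanted = match pod_annotations.get(POD_BRAND_ANNOTATION) {
        Some(brand) => brand.trim(),
        None => return true,
    };
    node_labels
        .get(NODE_BRANDS_LABEL)
        .map(|brands| brands.split(',').any(|b| b.trim() == wanted))
        .unwrap_or(false)
}
```

A scheduler would typically run a predicate like this over all candidate nodes before scoring, dropping nodes for which it returns false.
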
---
## Priority Order
### Critical (blocks production)
- [x] TLS — done in `cb6ca8c`
- [x] SMF manifest — done in `cb6ca8c`
### High (limits reliability)
- [x] Node health checker — done in `58171c7`
- [x] Periodic reconciliation — done in `58171c7`
- [x] Graceful pod termination — done in `58171c7`
### Medium (limits functionality)
- [ ] Service networking — no ClusterIP, no NAT/proxy, no DNS
- [x] Health probes — exec/HTTP/TCP liveness/readiness/startup probes via zlogin
- [ ] Image management — no pull/registry, no `.zar` support, no golden image bootstrap
- [x] Dynamic node resources — done in `d3eb0b2`
### Low (nice to have)
- [x] Zone brand scheduling filter — done in `4c7f50a`
- [x] ShuttingDown to Terminating mapping fix — done in `58171c7`
- [ ] bhyve brand — type exists but no implementation