mirror of
https://github.com/CloudNebulaProject/reddwarf.git
synced 2026-04-10 13:20:40 +00:00
Service networking:
- ClusterIP IPAM allocation on service create/delete via reusable Ipam with_prefix()
- ServiceController watches Pod/Service events + periodic reconcile to track endpoints
- NatManager generates ipnat rdr rules for ClusterIP -> pod IP forwarding
- Embedded DNS server resolves {svc}.{ns}.svc.cluster.local to ClusterIP
- New CLI flags: --service-cidr (default 10.96.0.0/12), --cluster-dns (default 0.0.0.0:10053)
Quick wins:
- ipadm IP assignment: configure_zone_ip() runs ipadm/route inside zone via zlogin after boot
- Node heartbeat zone state reporting: reddwarf.io/zone-count and zone-summary annotations
- bhyve brand support: ZoneBrand::Bhyve, install args, zonecfg device generation, controller integration
189 tests passing, clippy clean.
https://claude.ai/code/session_016QLFjAyYGzMPbBjEGMe75j
Reddwarf Production Readiness Audit
Last updated: 2026-03-19
Baseline commit: 58171c7 (Add periodic reconciliation, node health checker, and graceful pod termination)
1. Zone Runtime (reddwarf-runtime)
| Requirement | Status | Notes |
|---|---|---|
| Pod spec to zonecfg | DONE | zone/config.rs, controller.rs:pod_to_zone_config() |
| Zone lifecycle (zoneadm) | DONE | illumos.rs — create, install, boot, halt, uninstall, delete |
| Container to Zone mapping | DONE | Naming, sanitization, 64-char truncation |
| CPU limits to capped-cpu | DONE | Aggregates across containers, limits preferred over requests |
| Memory limits to capped-memory | DONE | Aggregates across containers, illumos G/M/K suffixes |
| Network to Crossbow VNIC | DONE | dladm create-etherstub, create-vnic, per-pod VNIC+IP |
| Volumes to ZFS datasets | DONE | Create, destroy, clone, quota, snapshot support |
| Image pull / clone | PARTIAL | ZFS clone works; LX tarball -s works. Missing: image pull/registry, .zar archive support, golden image bootstrap |
| Health probes (zlogin) | DONE | exec-in-zone via zlogin, liveness/readiness/startup probes with exec/HTTP/TCP actions, probe tracker state machine integrated into reconcile loop. v1 limitation: probes run at reconcile cadence, not per-probe periodSeconds |
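
The memory row above describes aggregating across containers, preferring limits over requests, and emitting illumos G/M/K suffixes. A minimal sketch of that idea follows. The names `ContainerMemory`, `pod_memory_bytes`, and `to_illumos_size` are hypothetical, and the "largest suffix that divides evenly" rule is an assumption rather than the project's confirmed behavior.

```rust
/// Hypothetical per-container memory settings, in bytes (not the project's type).
#[derive(Clone, Copy)]
struct ContainerMemory {
    limit: Option<u64>,
    request: Option<u64>,
}

/// Sum memory across containers, preferring each container's limit over its request.
/// Returns None when no container specifies either value.
fn pod_memory_bytes(containers: &[ContainerMemory]) -> Option<u64> {
    let mut total = 0u64;
    let mut any = false;
    for c in containers {
        if let Some(v) = c.limit.or(c.request) {
            total = total.saturating_add(v);
            any = true;
        }
    }
    if any { Some(total) } else { None }
}

/// Render a byte count with the largest G/M/K suffix that divides it evenly.
/// Assumption: values that are not a multiple of 1024 are emitted as plain bytes.
fn to_illumos_size(bytes: u64) -> String {
    const K: u64 = 1024;
    const M: u64 = K * 1024;
    const G: u64 = M * 1024;
    if bytes != 0 && bytes % G == 0 {
        format!("{}G", bytes / G)
    } else if bytes != 0 && bytes % M == 0 {
        format!("{}M", bytes / M)
    } else if bytes != 0 && bytes % K == 0 {
        format!("{}K", bytes / K)
    } else {
        bytes.to_string()
    }
}
```

Choosing a suffix only when it divides evenly avoids rounding a cap up or down, at the cost of occasionally emitting a long plain-byte value.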
2. Reconciliation / Controller Loop
| Requirement | Status | Notes |
|---|---|---|
| Event bus / watch | DONE | tokio broadcast channel, SSE watch API, multi-subscriber |
| Pod controller | DONE | Event-driven + full reconcile on lag, provision/deprovision |
| Node controller (NotReady) | DONE | node_health.rs — checks every 15s, marks stale (>40s) nodes NotReady with reason NodeStatusUnknown |
| Continuous reconciliation | DONE | controller.rs — periodic reconcile_all() every 30s via tokio::time::interval in select! loop |
| Graceful termination | DONE | DELETE sets deletion_timestamp + phase=Terminating; controller drives shutdown state machine; POST .../finalize for actual removal |
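
The node controller row above amounts to a staleness check: a node whose last heartbeat is older than 40 seconds is marked NotReady with reason NodeStatusUnknown. The sketch below shows that decision using only the standard library. `NodeReadiness` and `evaluate_node` are hypothetical names, and the real node_health.rs may structure this differently.

```rust
use std::time::{Duration, Instant};

/// Staleness threshold taken from the table above (>40s means stale).
const STALE_AFTER: Duration = Duration::from_secs(40);

/// Hypothetical result type for this sketch.
#[derive(Debug, PartialEq)]
enum NodeReadiness {
    Ready,
    NotReady { reason: &'static str },
}

/// Decide readiness from the time of the last heartbeat.
/// A heartbeat exactly 40s old still counts as fresh; only strictly older is stale.
fn evaluate_node(last_heartbeat: Instant, now: Instant) -> NodeReadiness {
    if now.saturating_duration_since(last_heartbeat) > STALE_AFTER {
        NodeReadiness::NotReady { reason: "NodeStatusUnknown" }
    } else {
        NodeReadiness::Ready
    }
}
```

With a 10-second heartbeat and a 15-second check interval, a node would have to miss several consecutive heartbeats before this check trips.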
3. Pod Status Tracking
| Requirement | Status | Notes |
|---|---|---|
| Zone state to pod phase | DONE | 8 zone states mapped to pod phases |
| Status subresource (/status) | DONE | PUT endpoint, spec/status separation, fires MODIFIED events |
| ShuttingDown mapping | DONE | Fixed in 58171c7 — maps to "Terminating" |
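
As an illustration of the state-to-phase mapping above, here is a hedged sketch. Only the ShuttingDown to "Terminating" mapping is stated in this audit. The enum variants follow the usual illumos zone states, and every other arm is an assumption made for the example, not the project's confirmed mapping.

```rust
/// Hypothetical enum covering eight illumos zone states.
#[derive(Debug, Clone, Copy, PartialEq)]
enum ZoneState {
    Configured,
    Incomplete,
    Installed,
    Ready,
    Mounted,
    Running,
    ShuttingDown,
    Down,
}

/// Map a zone state to a pod phase string.
fn pod_phase(state: ZoneState) -> &'static str {
    match state {
        // Assumption: anything before boot is reported as Pending.
        ZoneState::Configured
        | ZoneState::Incomplete
        | ZoneState::Installed
        | ZoneState::Ready
        | ZoneState::Mounted => "Pending",
        // Assumption: a booted zone is a running pod.
        ZoneState::Running => "Running",
        // Stated in the table above (fixed in 58171c7).
        ZoneState::ShuttingDown => "Terminating",
        // Assumption: a halted zone is treated as failed in this sketch.
        ZoneState::Down => "Failed",
    }
}
```

An exhaustive `match` without a wildcard arm means the compiler flags any zone state added later that lacks a phase.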
4. Node Agent / Heartbeat
| Requirement | Status | Notes |
|---|---|---|
| Self-registration | DONE | Creates Node resource with allocatable CPU/memory |
| Periodic heartbeat | DONE | 10-second interval, Ready condition |
| Report zone states | DONE | Heartbeat queries list_zones(), reports reddwarf.io/zone-count and reddwarf.io/zone-summary annotations |
| Dynamic resource reporting | DONE | sysinfo.rs — detects CPU/memory via sys-info, capacity vs allocatable split with configurable reservations (--system-reserved-cpu, --system-reserved-memory, --max-pods). Done in d3eb0b2 |
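
The capacity versus allocatable split above can be read as simple subtraction: allocatable is what remains of detected capacity after the configured reservations. The sketch below assumes that reading. `NodeResources` and `allocatable` are hypothetical names, and the millicore and byte units are an assumption.

```rust
/// Hypothetical resource pair: CPU in millicores, memory in bytes.
#[derive(Debug, Clone, Copy, PartialEq)]
struct NodeResources {
    cpu_millis: u64,
    memory_bytes: u64,
}

/// Allocatable = capacity - system reservation, clamped at zero so an
/// oversized reservation cannot underflow.
fn allocatable(capacity: NodeResources, reserved: NodeResources) -> NodeResources {
    NodeResources {
        cpu_millis: capacity.cpu_millis.saturating_sub(reserved.cpu_millis),
        memory_bytes: capacity.memory_bytes.saturating_sub(reserved.memory_bytes),
    }
}
```

For example, an 8-core node with 500 millicores reserved would advertise 7500 millicores as allocatable under this reading.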
5. Main Binary
| Requirement | Status | Notes |
|---|---|---|
| API + scheduler + runtime wired | DONE | All 4 components spawned as tokio tasks |
| CLI via clap | DONE | serve and agent subcommands |
| Graceful shutdown | DONE | SIGINT + CancellationToken + 5s timeout |
| TLS (rustls) | DONE | Auto-generated self-signed CA + server cert, or user-provided PEM. Added in cb6ca8c |
| SMF service manifest | DONE | SMF manifest + method script in smf/. Added in cb6ca8c |
6. Networking
| Requirement | Status | Notes |
|---|---|---|
| Etherstub creation | DONE | dladm create-etherstub |
| VNIC per zone | DONE | dladm create-vnic -l etherstub |
| ipadm IP assignment | DONE | configure_zone_ip() runs ipadm create-if, ipadm create-addr, route add default inside zone via zlogin after boot |
| IPAM | DONE | Sequential alloc, idempotent, persistent, pool exhaustion handling |
| Service ClusterIP / NAT | DONE | ClusterIP IPAM allocation on service create/delete, ServiceController watches events + periodic reconcile, NatManager generates ipnat rdr rules, embedded DnsServer resolves {svc}.{ns}.svc.cluster.local → ClusterIP |
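
The IPAM row above lists sequential allocation, idempotency, and pool exhaustion handling. A minimal in-memory sketch of those three properties follows. `SketchIpam` is a hypothetical type, not the project's `Ipam`. Persistence is omitted, and skipping the network and broadcast addresses is an assumption.

```rust
use std::collections::BTreeMap;
use std::net::Ipv4Addr;

/// Hypothetical sequential allocator; not the project's actual Ipam type.
struct SketchIpam {
    next: u32, // next address to hand out
    last: u32, // last usable address, inclusive
    leases: BTreeMap<String, Ipv4Addr>,
}

impl SketchIpam {
    /// Build a pool from a network address and prefix length (1..=30),
    /// skipping the network and broadcast addresses (an assumption).
    fn new(network: Ipv4Addr, prefix_len: u8) -> Self {
        assert!((1..=30).contains(&prefix_len), "prefix length out of range");
        let size = 1u32 << (32 - u32::from(prefix_len));
        let base = u32::from(network) & !(size - 1);
        SketchIpam {
            next: base + 1,
            last: base + (size - 2),
            leases: BTreeMap::new(),
        }
    }

    /// Idempotent: the same key always gets the address it already holds.
    /// Returns None once the pool is exhausted.
    fn allocate(&mut self, key: &str) -> Option<Ipv4Addr> {
        if let Some(ip) = self.leases.get(key) {
            return Some(*ip);
        }
        if self.next > self.last {
            return None;
        }
        let ip = Ipv4Addr::from(self.next);
        self.next += 1;
        self.leases.insert(key.to_string(), ip);
        Some(ip)
    }

    /// Drop a lease. Freed addresses are not recycled in this sketch.
    fn release(&mut self, key: &str) -> bool {
        self.leases.remove(key).is_some()
    }
}
```

Keying leases by a string such as `namespace/name` is what makes repeated create events safe: a second `allocate` for the same service returns the address it already holds.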
7. Scheduler
| Requirement | Status | Notes |
|---|---|---|
| Versioned bind_pod() | DONE | Fixed in c50ecb2 — creates versioned commits |
| Zone brand constraints | DONE | ZoneBrandMatch filter checks reddwarf.io/zone-brand annotation vs reddwarf.io/zone-brands node label. Done in 4c7f50a |
| Actual resource usage | NOT DONE | Only compares requests vs static allocatable — no runtime metrics |
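
The zone brand constraint above compares a pod annotation with a node label. The sketch below shows one way such a filter could work. The annotation and label keys come from the table. The function name, the comma-separated label value, and the rule that a pod without the annotation fits any node are all assumptions.

```rust
use std::collections::HashMap;

const POD_BRAND_ANNOTATION: &str = "reddwarf.io/zone-brand";
const NODE_BRANDS_LABEL: &str = "reddwarf.io/zone-brands";

/// Return true when the node can host the pod's requested zone brand.
/// Assumptions: the node label holds a comma-separated list of brands, and a
/// pod with no brand annotation is schedulable on any node.
fn zone_brand_matches(
    pod_annotations: &HashMap<String, String>,
    node_labels: &HashMap<String, String>,
) -> bool {
    let wanted = match pod_annotations.get(POD_BRAND_ANNOTATION) {
        Some(brand) => brand.trim(),
        None => return true,
    };
    match node_labels.get(NODE_BRANDS_LABEL) {
        Some(brands) => brands.split(',').any(|b| b.trim() == wanted),
        // A node that advertises no brands cannot satisfy an explicit request.
        None => false,
    }
}
```

A filter like this only narrows the candidate set. It says nothing about actual resource usage, which the last row above lists as not done.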
Priority Order
Critical (blocks production)
- TLS: done in cb6ca8c
- SMF manifest: done in cb6ca8c
High (limits reliability)
- Node health checker: done in 58171c7
- Periodic reconciliation: done in 58171c7
- Graceful pod termination: done in 58171c7
Medium (limits functionality)
- Service networking: ClusterIP IPAM, ServiceController endpoint tracking, ipnat NAT rules, embedded DNS server
- Health probes: exec/HTTP/TCP liveness/readiness/startup probes via zlogin
- Image management: no pull/registry, no .zar support, no golden image bootstrap
- Dynamic node resources: done in d3eb0b2
Low (nice to have)
- Zone brand scheduling filter: done in 4c7f50a
- ShuttingDown to Terminating mapping fix: done in 58171c7
- bhyve brand: ZoneBrand::Bhyve, install args, zonecfg device generation, controller brand selection