Add three high-priority reliability features that close gaps identified
in AUDIT.md:

1. Periodic reconciliation: PodController now runs reconcile_all() every
   30s via a tokio::time::interval branch in its select! loop, so zone
   crashes that happen between events are detected.
2. Node health checker: the new NodeHealthChecker polls node heartbeats
   every 15s and marks nodes whose heartbeat is stale (>40s) as NotReady
   with reason NodeStatusUnknown, preserving last_transition_time
   correctly.
3. Graceful pod termination: DELETE now sets deletion_timestamp and
   phase=Terminating instead of removing the pod immediately. The
   controller drives a state machine (shutdown → halt on grace expiry →
   deprovision → finalize), which the periodic reconcile advances. A new
   POST .../finalize endpoint performs the actual storage removal.
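
The termination state machine in item 3 can be sketched as a pure
function that each reconcile pass calls to pick the next step. This is a
minimal illustration, not the controller's actual code: ZoneState, Action
and next_action are hypothetical names, and the mapping from observed
zone state to step is an assumption based on the description above.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical view of the zone backing a terminating pod.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum ZoneState {
    Running,
    ShuttingDown,
    Stopped,
    Absent,
}

/// Hypothetical step the controller takes on one reconcile pass.
#[derive(Debug, PartialEq, Eq)]
enum Action {
    Shutdown,    // ask the zone to shut down cleanly
    Wait,        // shutdown in progress, grace period not yet over
    Halt,        // grace period expired: force the zone down
    Deprovision, // zone is stopped: tear down its resources
    Finalize,    // nothing left: remove the pod from storage
}

/// Pick the next step for a pod whose deletion_timestamp is set.
fn next_action(
    zone: ZoneState,
    deletion_timestamp: SystemTime,
    grace: Duration,
    now: SystemTime,
) -> Action {
    // A clock that moved backwards is treated as "not expired".
    let expired = now
        .duration_since(deletion_timestamp)
        .map(|elapsed| elapsed >= grace)
        .unwrap_or(false);

    match zone {
        ZoneState::Absent => Action::Finalize,
        ZoneState::Stopped => Action::Deprovision,
        // Zone is still up in some form past this point.
        _ if expired => Action::Halt,
        ZoneState::Running => Action::Shutdown,
        ZoneState::ShuttingDown => Action::Wait,
    }
}
```

Because the function only looks at observed state and the clock, calling
it again on every periodic reconcile is enough to move a pod forward even
if an earlier step was interrupted.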

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Enable the API server to optionally serve HTTPS (disabled by default).
When --tls is passed without explicit cert/key paths, a self-signed CA
and server certificate are auto-generated via rcgen and persisted to
disk for reuse across restarts. The internal ApiClient learns to trust
the self-signed CA so controller/agent components work seamlessly over
TLS.
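
The flag handling described above can be sketched as a small decision
function. This is an illustration under stated assumptions rather than
the server's real code: TlsMode and resolve_tls are hypothetical names,
the file names server.crt and server.key are made up, and rejecting a
cert path without a key path (or the reverse) is an assumed policy that
the text above does not specify. Certificate generation itself (done via
rcgen in the real change) is out of scope here.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical outcome of evaluating the TLS-related flags.
#[derive(Debug, PartialEq, Eq)]
enum TlsMode {
    /// --tls not passed: serve plain HTTP (the default).
    Disabled,
    /// Operator supplied both paths explicitly.
    Provided { cert: PathBuf, key: PathBuf },
    /// Auto-generated material; `generate` is false when files from a
    /// previous run can be reused.
    SelfSigned { cert: PathBuf, key: PathBuf, generate: bool },
}

/// Decide how to serve. `exists` is injected so the logic can be
/// exercised without touching the filesystem.
fn resolve_tls(
    tls: bool,
    cert: Option<PathBuf>,
    key: Option<PathBuf>,
    data_dir: &Path,
    exists: impl Fn(&Path) -> bool,
) -> Result<TlsMode, String> {
    if !tls {
        return Ok(TlsMode::Disabled);
    }
    match (cert, key) {
        (Some(cert), Some(key)) => Ok(TlsMode::Provided { cert, key }),
        (None, None) => {
            // Hypothetical file names inside the data directory.
            let cert = data_dir.join("server.crt");
            let key = data_dir.join("server.key");
            // Reuse persisted files across restarts when both exist.
            let generate = !(exists(cert.as_path()) && exists(key.as_path()));
            Ok(TlsMode::SelfSigned { cert, key, generate })
        }
        // Assumed policy: a lone cert or key path is a usage error.
        _ => Err("cert and key paths must be given together".to_string()),
    }
}
```

In a real binary the last argument would simply be `|p| p.exists()`.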

Also add SIGTERM handling (alongside SIGINT) and graceful shutdown via
CancellationToken for both the `serve` and `agent` modes, plus an SMF
manifest and method script so that reddwarf can run as
svc:/system/reddwarf:default on illumos.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Implement an in-process broadcast event bus for resource mutations
(ADDED/MODIFIED/DELETED) with SSE watch endpoints on all list handlers,
following the Kubernetes watch protocol. Add the reddwarf-runtime crate
with a trait-based zone runtime abstraction targeting illumos zones,
including LX and custom reddwarf brand support, etherstub/direct VNIC
networking, ZFS dataset management, and a MockRuntime for testing on
non-illumos platforms.
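
The event bus and watch framing described above can be sketched with the
standard library alone. This is a simplified stand-in, not the code in
this change: EventBus, ResourceEvent and sse_frame are hypothetical
names, std::sync::mpsc channels replace whatever broadcast primitive the
real bus uses, and the object is passed as pre-serialized JSON to avoid
assuming a serializer. The payload shape ({"type": ..., "object": ...})
follows the Kubernetes watch event format, and the "data: ...\n\n"
framing is the standard SSE message form.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum EventType {
    Added,
    Modified,
    Deleted,
}

impl EventType {
    /// Wire name used in the watch event's "type" field.
    fn as_str(self) -> &'static str {
        match self {
            EventType::Added => "ADDED",
            EventType::Modified => "MODIFIED",
            EventType::Deleted => "DELETED",
        }
    }
}

/// A resource mutation; `object_json` is the already-serialized object.
#[derive(Debug, Clone, PartialEq, Eq)]
struct ResourceEvent {
    kind: EventType,
    object_json: String,
}

/// Minimal in-process fan-out: every subscriber gets its own channel.
#[derive(Default)]
struct EventBus {
    subscribers: Vec<Sender<ResourceEvent>>,
}

impl EventBus {
    /// Register a new watcher and hand back its receiving end.
    fn subscribe(&mut self) -> Receiver<ResourceEvent> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    /// Deliver a copy to every live subscriber and drop the ones whose
    /// receiver has gone away (for example, a closed watch connection).
    fn publish(&mut self, event: &ResourceEvent) {
        self.subscribers.retain(|tx| tx.send(event.clone()).is_ok());
    }
}

/// Render one event as a single SSE message.
fn sse_frame(event: &ResourceEvent) -> String {
    format!(
        "data: {{\"type\":\"{}\",\"object\":{}}}\n\n",
        event.kind.as_str(),
        event.object_json
    )
}
```

A watch handler would call subscribe() when the request arrives and
write sse_frame() for each received event until the client disconnects.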

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>