Implement Kubernetes-style health probes that run during the reconcile loop
to detect unhealthy applications inside running zones. Previously the pod
controller only checked zone liveness via get_zone_state(), missing cases
where the zone is running but the application inside has crashed.
- Add exec_in_zone() to ZoneRuntime trait, implemented via zlogin on illumos
and with configurable mock results for testing
- Add a probe type system (ProbeKind, ProbeAction, ContainerProbeConfig),
  decoupled from k8s_openapi, that extracts probes from pod container specs
  with proper k8s defaults (period=10s, timeout=1s, failureThreshold=3,
  successThreshold=1)
- Add ProbeExecutor for exec/HTTP/TCP checks with tokio timeout support
  (HTTPS probes fall back to a TCP-only check, with a warning)
- Add ProbeTracker state machine that tracks per-pod/container/probe-kind
state, respects initial delays and periods, gates liveness on startup
probes, and aggregates results into PodProbeStatus
- Integrate into PodController reconcile loop: on liveness failure set
phase=Failed with reason LivenessProbeFailure; on readiness failure set
Ready=False; on all-pass restore Ready=True
- Add ProbeFailed error variant with miette diagnostic
Known v1 limitation: probes execute at reconcile cadence (~30s), not at
their configured periodSeconds.
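A minimal std-only sketch of the threshold counting the tracker applies. The names (Thresholds, ProbeState, observe) are hypothetical, and the real ProbeTracker additionally handles initial delays, periods, and startup gating:

```rust
// Hypothetical sketch of k8s-style probe thresholds: a probe flips unhealthy
// only after `failure` consecutive failures, and back to healthy after
// `success` consecutive successes; a pass resets the failure streak.
#[derive(Clone, Copy)]
pub struct Thresholds {
    pub failure: u32, // k8s default: 3
    pub success: u32, // k8s default: 1
}

pub struct ProbeState {
    fails: u32,
    passes: u32,
    healthy: bool,
}

impl ProbeState {
    pub fn new() -> Self {
        // Probes start out healthy until the failure threshold is reached.
        ProbeState { fails: 0, passes: 0, healthy: true }
    }

    /// Record one probe result and return the current health verdict.
    pub fn observe(&mut self, ok: bool, t: Thresholds) -> bool {
        if ok {
            self.passes += 1;
            self.fails = 0;
            if self.passes >= t.success {
                self.healthy = true;
            }
        } else {
            self.fails += 1;
            self.passes = 0;
            if self.fails >= t.failure {
                self.healthy = false;
            }
        }
        self.healthy
    }
}

fn main() {
    let t = Thresholds { failure: 3, success: 1 };
    let mut s = ProbeState::new();
    assert!(s.observe(false, t));  // 1 failure: still healthy
    assert!(s.observe(false, t));  // 2 failures: still healthy
    assert!(!s.observe(false, t)); // 3rd consecutive failure: unhealthy
    assert!(s.observe(true, t));   // 1 success restores health
    println!("threshold logic ok");
}
```

The consecutive-counting detail is what makes a single flaky check harmless: one pass resets the failure streak, so only a sustained outage crosses the threshold.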
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the node agent's hardcoded memory (8Gi) and pod-count (110) limits with
actual system detection via the sys-info crate. CPU and memory are detected
once at NodeAgent construction and reused on every heartbeat. Capacity
reports raw hardware values while allocatable subtracts configurable
reservations (--system-reserved-cpu, --system-reserved-memory, --max-pods),
giving the scheduler accurate data for filtering and scoring.
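The capacity/allocatable split can be sketched as below. The Resources shape and millicpu/byte units are assumptions for illustration, not the crate's actual types:

```rust
// Hypothetical sketch: capacity reports the raw detected hardware;
// allocatable subtracts operator-configured reservations, clamped at zero
// so over-reservation can never underflow.
#[derive(Debug, PartialEq)]
pub struct Resources {
    pub cpu_millis: u64,
    pub memory_bytes: u64,
    pub pods: u32,
}

pub fn allocatable(
    capacity: &Resources,
    reserved_cpu_millis: u64,   // --system-reserved-cpu
    reserved_memory_bytes: u64, // --system-reserved-memory
    max_pods: u32,              // --max-pods
) -> Resources {
    Resources {
        cpu_millis: capacity.cpu_millis.saturating_sub(reserved_cpu_millis),
        memory_bytes: capacity.memory_bytes.saturating_sub(reserved_memory_bytes),
        pods: max_pods,
    }
}

fn main() {
    // e.g. 8 cores / 16 GiB detected, with 500m CPU and 1 GiB reserved
    let cap = Resources { cpu_millis: 8000, memory_bytes: 16 << 30, pods: 110 };
    let alloc = allocatable(&cap, 500, 1 << 30, 110);
    assert_eq!(alloc.cpu_millis, 7500);
    assert_eq!(alloc.memory_bytes, 15 << 30);
    println!("{:?}", alloc);
}
```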
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three high-priority reliability features that close gaps identified in AUDIT.md:
1. Periodic reconciliation: PodController now runs reconcile_all() every 30s
via a tokio::time::interval branch in the select! loop, detecting zone
crashes between events.
2. Node health checker: New NodeHealthChecker polls node heartbeats every 15s
and marks nodes with stale heartbeats (>40s) as NotReady with reason
NodeStatusUnknown, preserving last_transition_time correctly.
3. Graceful pod termination: DELETE sets deletion_timestamp and phase=Terminating
instead of immediate removal. Controller drives a state machine (shutdown →
halt on grace expiry → deprovision → finalize) with periodic reconcile
advancing it. New POST .../finalize endpoint performs actual storage removal.
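The termination state machine in item 3 can be sketched as a pure transition function, with `grace_expired` standing in for the comparison of now against deletion_timestamp plus the grace period. Names and the exact step set are illustrative:

```rust
// Hypothetical sketch of the graceful-termination steps the controller
// advances on each periodic reconcile pass.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TerminationStep {
    Shutdown,    // polite in-zone shutdown requested
    Halt,        // forced halt once the grace period expires
    Deprovision, // zone configuration torn down
    Finalized,   // storage removed via the finalize endpoint
}

pub fn advance(step: TerminationStep, grace_expired: bool) -> TerminationStep {
    use TerminationStep::*;
    match step {
        Shutdown if grace_expired => Halt,
        Shutdown => Shutdown, // keep waiting inside the grace period
        Halt => Deprovision,
        Deprovision => Finalized,
        Finalized => Finalized, // terminal: safe to call repeatedly
    }
}

fn main() {
    use TerminationStep::*;
    let mut s = Shutdown;
    s = advance(s, false);
    assert_eq!(s, Shutdown); // still within the grace period
    s = advance(s, true);    // grace expired: force halt
    assert_eq!(s, Halt);
    s = advance(s, true);
    assert_eq!(s, Deprovision);
    s = advance(s, true);
    assert_eq!(s, Finalized);
    println!("termination sequence ok");
}
```

Making each transition idempotent at the terminal state means repeated reconcile passes after completion are harmless no-ops.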
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Decouple storage from the ZoneRuntime trait into a dedicated StorageEngine
trait with ZfsStorageEngine (illumos) and MockStorageEngine (testing)
implementations. Replace the per-zone ZfsConfig with a global
StoragePoolConfig that derives dataset hierarchy from a single --storage-pool
flag, with optional per-dataset overrides. This enables persistent volumes,
auto-created base datasets on startup, and a clean extension point for
future storage backends.
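A sketch of the shape this decoupling takes. The trait methods and the derived dataset names (e.g. `<pool>/reddwarf/zones`) are assumptions for illustration, not the crate's actual layout:

```rust
use std::collections::HashSet;

// Hypothetical global pool config: the dataset hierarchy is derived from a
// single --storage-pool value, with optional per-dataset overrides.
pub struct StoragePoolConfig {
    pub pool: String,
    pub zones_override: Option<String>,
    pub volumes_override: Option<String>,
}

impl StoragePoolConfig {
    pub fn zones(&self) -> String {
        self.zones_override
            .clone()
            .unwrap_or_else(|| format!("{}/reddwarf/zones", self.pool))
    }
    pub fn volumes(&self) -> String {
        self.volumes_override
            .clone()
            .unwrap_or_else(|| format!("{}/reddwarf/volumes", self.pool))
    }
}

// Hypothetical trait surface: the illumos impl would shell out to zfs(8),
// while the mock just records dataset names for tests.
pub trait StorageEngine {
    fn create_dataset(&mut self, name: &str) -> Result<(), String>;
    fn dataset_exists(&self, name: &str) -> bool;
}

#[derive(Default)]
pub struct MockStorageEngine {
    datasets: HashSet<String>,
}

impl StorageEngine for MockStorageEngine {
    fn create_dataset(&mut self, name: &str) -> Result<(), String> {
        self.datasets.insert(name.to_string());
        Ok(())
    }
    fn dataset_exists(&self, name: &str) -> bool {
        self.datasets.contains(name)
    }
}

fn main() {
    let cfg = StoragePoolConfig {
        pool: "rpool".into(),
        zones_override: None,
        volumes_override: None,
    };
    let mut eng = MockStorageEngine::default();
    eng.create_dataset(&cfg.zones()).unwrap();
    assert!(eng.dataset_exists("rpool/reddwarf/zones"));
    assert_eq!(cfg.volumes(), "rpool/reddwarf/volumes");
    println!("storage sketch ok");
}
```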
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Each pod now gets a unique VNIC name and IP address from a configurable
CIDR pool, with IPs released on pod deletion. This replaces the
hardcoded single VNIC/IP that prevented multiple pods from running.
- Add a redb-backed IPAM module with idempotent allocate/release semantics
- Add prefix_len to EtherstubConfig and DirectNicConfig
- Generate allowed-address and defrouter in zonecfg net blocks
- Wire vnic_name_for_pod() into controller for unique VNIC names
- Add --pod-cidr and --etherstub-name CLI flags to agent subcommand
- Add StorageError and IpamPoolExhausted error variants with diagnostics
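The allocation semantics can be sketched with an in-memory map standing in for the redb table. Names are hypothetical; the point is the idempotency contract: allocating twice for the same pod returns the same address, and release is a no-op for unknown pods:

```rust
use std::collections::HashMap;
use std::net::Ipv4Addr;

// Hypothetical in-memory sketch of the IPAM semantics (the real module
// persists allocations in redb and draws from a --pod-cidr range).
pub struct IpPool {
    base: u32, // first usable address, as a host-order integer
    size: u32, // number of usable addresses in the pool
    by_pod: HashMap<String, Ipv4Addr>,
}

impl IpPool {
    pub fn new(base: Ipv4Addr, size: u32) -> Self {
        IpPool { base: u32::from(base), size, by_pod: HashMap::new() }
    }

    pub fn allocate(&mut self, pod: &str) -> Option<Ipv4Addr> {
        if let Some(ip) = self.by_pod.get(pod) {
            return Some(*ip); // idempotent: same pod, same address
        }
        let in_use: Vec<u32> = self.by_pod.values().map(|ip| u32::from(*ip)).collect();
        for off in 0..self.size {
            let candidate = self.base + off;
            if !in_use.contains(&candidate) {
                let ip = Ipv4Addr::from(candidate);
                self.by_pod.insert(pod.to_string(), ip);
                return Some(ip);
            }
        }
        None // would surface as IpamPoolExhausted in the real module
    }

    pub fn release(&mut self, pod: &str) {
        self.by_pod.remove(pod); // releasing an unknown pod is a no-op
    }
}

fn main() {
    let mut pool = IpPool::new(Ipv4Addr::new(10, 88, 0, 2), 2);
    let a = pool.allocate("pod-a").unwrap();
    assert_eq!(pool.allocate("pod-a").unwrap(), a); // idempotent re-allocation
    let b = pool.allocate("pod-b").unwrap();
    assert_ne!(a, b);
    assert!(pool.allocate("pod-c").is_none()); // pool exhausted
    pool.release("pod-a");
    assert_eq!(pool.allocate("pod-c").unwrap(), a); // freed address reused
}
```

Idempotency is what makes the reconcile loop safe to re-run: a crashed controller can replay provisioning without leaking or double-assigning addresses.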
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement the core reconciliation loop that connects Pod events to zone
lifecycle. Status subresource endpoints allow updating pod/node status
without triggering spec-level changes. The main binary now provides
`serve` (API server only) and `agent` (full node: API + scheduler +
controller + heartbeat) subcommands via clap.
- Status subresource: generic update_status in common.rs, PUT endpoints
for /pods/{name}/status and /nodes/{name}/status
- Pod controller: polls pods assigned to this node, provisions zones via
ZoneRuntime, updates status to Running/Failed, monitors zone health
- Node agent: registers host as a Node, sends periodic heartbeats with
Ready condition
- API client: lightweight reqwest-based HTTP client for controller and
node agent to talk to the API server
- Main binary: clap CLI with serve/agent commands, wires all components
together with graceful shutdown via ctrl-c
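The status-subresource rule above can be illustrated with a toy record; the flat field names here are hypothetical stand-ins for the real nested spec/status types:

```rust
// Hypothetical sketch of the status-subresource contract: a PUT to
// /pods/{name}/status replaces only the status portion, so spec fields
// (and anything keyed off spec edits) are left untouched.
#[derive(Clone, Debug, PartialEq)]
pub struct Pod {
    pub spec_generation: u64, // bumped only by spec-level changes
    pub spec_image: String,
    pub status_phase: String,
}

pub fn update_status(stored: &mut Pod, new_phase: &str) {
    // Only status changes; the spec and its generation stay as they are,
    // so a status write never re-triggers spec-driven reconciliation.
    stored.status_phase = new_phase.to_string();
}

fn main() {
    let mut pod = Pod {
        spec_generation: 4,
        spec_image: "nginx:1.27".into(),
        status_phase: "Pending".into(),
    };
    update_status(&mut pod, "Running");
    assert_eq!(pod.status_phase, "Running");
    assert_eq!(pod.spec_generation, 4); // spec untouched
    println!("status subresource sketch ok");
}
```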
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Implement an in-process broadcast event bus for resource mutations
(ADDED/MODIFIED/DELETED) with SSE watch endpoints on all list handlers,
following the Kubernetes watch protocol. Add the reddwarf-runtime crate
with a trait-based zone runtime abstraction targeting illumos zones,
including LX and custom reddwarf brand support, etherstub/direct VNIC
networking, ZFS dataset management, and a MockRuntime for testing on
non-illumos platforms.
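The fan-out semantics can be sketched with std channels (the real bus is an in-process broadcast channel whose events the SSE handlers stream to watchers). Names here are illustrative:

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

// Kubernetes-style watch events, as in the ADDED/MODIFIED/DELETED protocol.
#[derive(Clone, Debug, PartialEq)]
pub enum WatchEvent {
    Added(String),
    Modified(String),
    Deleted(String),
}

// Hypothetical std-only sketch of the broadcast bus: every subscriber gets
// its own copy of each published event.
#[derive(Default)]
pub struct EventBus {
    subscribers: Vec<Sender<WatchEvent>>,
}

impl EventBus {
    pub fn subscribe(&mut self) -> Receiver<WatchEvent> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    /// Fan the event out to every live subscriber; receivers that have been
    /// dropped (e.g. a disconnected SSE client) are pruned on send failure.
    pub fn publish(&mut self, ev: WatchEvent) {
        self.subscribers.retain(|tx| tx.send(ev.clone()).is_ok());
    }
}

fn main() {
    let mut bus = EventBus::default();
    let a = bus.subscribe();
    let b = bus.subscribe();
    bus.publish(WatchEvent::Added("pod/web-0".into()));
    assert_eq!(a.recv().unwrap(), WatchEvent::Added("pod/web-0".into()));
    assert_eq!(b.recv().unwrap(), WatchEvent::Added("pod/web-0".into()));
    println!("event bus sketch ok");
}
```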
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>