solstice-ci/docs/ai/Rust CI for Illumos VMs.md
Till Wegmueller a71f9cc7d1
Initial Commit
Signed-off-by: Till Wegmueller <toasterson@gmail.com>
2025-10-25 20:01:08 +02:00

# **Helios CI: An Architectural Blueprint for a Native Illumos Continuous Integration System**
## **I. Executive Summary & Architectural Blueprint**
### **A. Vision Statement**
This document outlines the architecture for "Helios CI," a next-generation, self-hosted Continuous Integration (CI) platform engineered for security, performance, and operational simplicity. Helios CI leverages the robust foundation of the illumos operating system, the efficiency of the Rust programming language, and the native virtualization capabilities of bHyve to provide ephemeral, fully-isolated build environments. Its primary design goal is to offer a developer experience on par with market-leading cloud-native solutions through deep, native-like integration with both GitHub and Forgejo, running entirely on-premises. The system is designed for organizations that require absolute control over their build infrastructure, uncompromising security isolation between jobs, and the performance benefits of a purpose-built, vertically-integrated solution.
### **B. Core Architectural Pillars**
The Helios CI architecture is composed of three distinct, decoupled services that communicate over internal APIs. This separation of concerns ensures modularity, scalability, and maintainability.
* **Forge Integration Layer:** A stateless, public-facing web service responsible for all communication with external forges (GitHub, Forgejo). It authenticates and processes incoming webhook events, translates them into a standardized internal format, and uses the forge's API to report back detailed status and results. This layer acts as the system's ambassador to the outside world.
* **Orchestration Engine:** The stateful heart of the system. It receives validated job requests from the Integration Layer, manages the complete lifecycle of bHyve virtual machines (provisioning, booting, monitoring, teardown), and acts as a conduit for streaming logs and results between the Job Agent and the Integration Layer. This engine is the master of the illumos host's virtualization capabilities.
* **Job Execution Agent:** A lightweight, ephemeral agent residing within the bHyve VM guest operating system. It is responsible for receiving a job definition from the Orchestration Engine, executing the user-defined workflow steps, capturing all output, and communicating status back to the Orchestration Engine.
### **C. System Flow Diagram**
The end-to-end process of a CI run is a choreographed sequence of events spanning all three architectural pillars, triggered by a single developer action. The flow is as follows:
1. A developer pushes a commit to a tracked branch in a GitHub or Forgejo repository.
2. The forge detects this event and sends a webhook (e.g., check_suite from GitHub, push from Forgejo) to the publicly exposed endpoint of the Forge Integration Layer.
3. The Integration Layer validates the webhook's signature to ensure authenticity. It then performs the necessary authentication with the forge's API. For GitHub, this involves a multi-step JWT and installation token exchange; for Forgejo, it uses a pre-configured API token.
4. Immediately upon successful authentication, the Integration Layer makes an API call back to the forge to create an initial "pending" or "queued" status on the commit. This provides immediate feedback to the developer in the pull request or commit history UI.
5. The Integration Layer translates the webhook payload into a standardized job request and sends it to the Orchestration Engine via an internal, private API.
6. The Orchestration Engine receives the request and begins provisioning a new, isolated build environment. It uses the illumos ZFS filesystem to create a near-instantaneous, copy-on-write clone of a pre-configured base VM image.
7. Using the zonecfg utility or a corresponding Rust library, the Orchestrator defines a new bhyve branded zone, attaching the cloned ZFS volume as the VM's primary disk and configuring its virtual network interface. It then boots the zone, which starts the bHyve VM.
8. Inside the newly booted VM, the Job Execution Agent starts automatically. It contacts the Orchestration Engine to fetch the full job definition, which includes the repository URL, commit SHA, and the parsed steps from the workflow YAML file.
9. The Agent sets up the workspace by cloning the specified commit from the repository. It then begins executing each step defined in the workflow file sequentially.
10. As each step runs, the Agent captures its stdout and stderr in real-time and streams this log data back to the Orchestration Engine. The Orchestrator, in turn, forwards these log chunks to the Forge Integration Layer.
11. The Integration Layer continuously updates the status check on the forge's platform, appending the new log data. This allows developers to watch the build log live from their browser, directly within the GitHub or Forgejo UI.
12. Upon completion of all steps, the Agent reports the final status (e.g., success, failure) and any structured results (like test failures or code annotations) to the Orchestrator and then terminates.
13. The Orchestration Engine receives the final status and immediately begins teardown. It halts the bhyve zone and issues a zfs destroy command on the cloned ZFS volume, completely and irrevocably wiping the entire build environment and all its artifacts.
14. The Orchestrator forwards the final job result to the Forge Integration Layer, which makes a final API call to the forge, updating the check with its terminal conclusion (success or failure) and any detailed summary or annotations. The developer now sees the final green checkmark or red 'X' next to their commit.
## **II. The Forge Integration Layer: The System's Public Face**
This layer serves as the critical bridge between the external developer platforms and the internal CI logic. Its design must prioritize security, robustness, and the creation of an abstract interface that can gracefully handle the significant differences in capabilities between target forges like GitHub and Forgejo.
### **A. Integrating with GitHub: The Gold Standard via GitHub Apps**
To achieve a truly "native-like experience" on GitHub, the use of a GitHub App is not merely a preference but a strict architectural requirement. The GitHub Checks API, which enables the rich, multi-stage, and annotated feedback that defines modern CI systems, is exclusively available to GitHub Apps.1 This stands in stark contrast to older methods like Personal Access Tokens (PATs) or standard OAuth Apps, which are restricted to the much simpler Commit Status API.
Furthermore, from a security standpoint, GitHub Apps represent a significant leap forward. They employ a model of fine-grained permissions, allowing the application to request only the specific access it needs (e.g., checks:write, contents:read) rather than broad, all-encompassing scopes like repo. Access can be granted on a per-repository basis by the installing user or organization, and the operational tokens are deliberately short-lived. This principle of least privilege is a cornerstone of modern security design and is strongly recommended by GitHub's own best practices.2
#### **Authentication Flow Deep Dive**
The authentication process for a GitHub App is a sophisticated, multi-step dance designed to maximize security by minimizing the exposure of long-lived credentials.
1. **Secure Credential Storage:** Upon registration, the GitHub App is assigned an App ID, and a private key can be generated for it in .pem format. These two pieces of information are the root credentials for the application. The private key is paramount and must be treated with the utmost care, stored in a secure, managed secrets store such as HashiCorp Vault or a cloud provider's equivalent (e.g., Azure Key Vault).7 It must never be stored in plaintext in configuration files or environment variables.
2. **Generating a JSON Web Token (JWT):** To initiate any API communication, the service must first authenticate *as the app itself*. It does this by creating and signing a JWT using its private key. This JWT is a short-lived credential (maximum 10-minute validity) that proves the service has access to the private key without ever transmitting the key over the network. The JWT payload must contain specific claims as mandated by GitHub: iat (issued at time), exp (expiration time), and iss (issuer, which is the App ID).8
3. **Requesting an Installation Access Token:** The generated JWT is then used to acquire a token that can act on behalf of a specific installation (i.e., a specific user or organization that has installed the app). The service makes a POST request to the GitHub API endpoint /app/installations/{installation_id}/access_tokens. The installation_id is conveniently provided in the payload of every webhook event the app receives. The request must include the JWT in the Authorization header, formatted as Bearer \<JWT\>.10
4. **Using the Installation Access Token:** The response from the API contains a temporary *installation access token*. This token is typically valid for one hour. All subsequent API calls made to perform actions on the repository (such as creating or updating a check run) will use this token in the Authorization header, again formatted as Bearer \<TOKEN\>.8 This two-step process ensures that the powerful private key is used only briefly and indirectly, while the operational token that interacts with repository data has a limited lifetime, drastically reducing the potential impact if it were ever compromised.
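The claims portion of step 2 can be sketched in Rust. The timestamps below (a 60-second backdate and a validity comfortably under the 10-minute cap) follow GitHub's guidance; the actual RS256 signing with the stored private key would be delegated to a crate such as jsonwebtoken and is not shown:

```rust
use std::time::{SystemTime, UNIX_EPOCH};

/// Claims GitHub requires in an app-authentication JWT. Signing
/// (RS256 with the app's private key) would be delegated to a crate
/// such as `jsonwebtoken` and is not shown here.
pub struct AppJwtClaims {
    pub iat: u64,    // issued-at, backdated to absorb clock drift
    pub exp: u64,    // expiry; GitHub caps validity at 10 minutes
    pub iss: String, // issuer: the App ID
}

pub fn app_jwt_claims(app_id: u64) -> AppJwtClaims {
    let now = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .expect("system clock before Unix epoch")
        .as_secs();
    AppJwtClaims {
        iat: now - 60,     // 60 s in the past, per GitHub's guidance
        exp: now + 9 * 60, // stays safely under the 10-minute maximum
        iss: app_id.to_string(),
    }
}
```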
#### **Implementing a "Native" Experience with the Checks API**
The Checks API is the key to unlocking a rich user experience within the GitHub UI.
* **Check Suites and Check Runs:** When a developer pushes code, GitHub automatically creates a check_suite for the commit. It then sends a check_suite webhook to all installed GitHub Apps with the checks:write permission. Upon receiving this webhook, the Integration Layer should immediately use the API to create a corresponding check_run. This single action provides instant feedback in the pull request UI, showing that the CI process has been acknowledged and is underway.1 The initial check_run should be created with a status of queued or in_progress.
* **Rich, Granular Feedback:** A check_run is a mutable object that can be updated throughout the job's lifecycle. The service can post a title, a detailed summary (which supports Markdown for formatting), and, most powerfully, annotations. Annotations are messages that can be tied to specific files and line numbers within the commit, complete with a severity level (notice, warning, or failure). This allows the CI system to report linting errors, test failures, or security vulnerabilities directly in the "Files changed" view of a pull request, providing context-rich feedback exactly where the developer is looking.1
* **Interactive UI Elements:** The Checks API also supports the definition of actions. These are rendered as buttons in the GitHub Checks UI and can be configured to send a new webhook event back to the app when clicked. This opens up possibilities for interactive features like "Re-run failed tests" or "Apply suggested fix" without requiring the developer to leave the GitHub interface.1
#### **Recommended Rust Crate: octocrab**
For implementing this integration, the octocrab crate is the standout choice. It is a modern, well-maintained, and extensible GitHub API client for Rust. Crucially, it has explicit, high-level support for the GitHub App authentication flow and provides a dedicated checks module as part of its strongly-typed semantic API.12 Its comprehensive set of data models for webhook payloads will also greatly simplify the process of deserializing and handling incoming events, reducing boilerplate and the risk of parsing errors.12 Using octocrab significantly de-risks the implementation by abstracting away the complexities of raw HTTP requests and manual JSON parsing.
### **B. Integrating with Forgejo & Gitea: A Pragmatic Approach**
Forgejo is a soft fork of Gitea, a popular self-hosted Git service.14 A critical piece of information for this project is that Forgejo maintains a high degree of API compatibility with Gitea, even providing a Gitea-compatible /api/v1 endpoint.14 This allows the project to leverage the more extensive documentation, community support, and broader ecosystem of SDKs available for Gitea, treating a Forgejo instance as a Gitea target for all practical purposes.16
#### **Authentication**
Authentication for the Forgejo/Gitea API is more straightforward than for GitHub Apps. It relies on a standard API access token, which can be generated by a user through the web interface. This token is then included in API requests within the Authorization header, using the format token \<TOKEN\>.19 These tokens are typically long-lived and must be securely stored and managed by the administrator of the Helios CI system.
#### **The Commit Status API**
A pivotal distinction between GitHub and Forgejo/Gitea is the absence of a direct equivalent to the rich Checks API in the latter. The mechanism available for reporting build status is the Commit Status API.16 This API allows an external service to attach a status to a specific commit SHA. The primary endpoint for this is POST /api/v1/repos/{owner}/{repo}/statuses/{sha}.25
#### **Capabilities and Limitations**
The Commit Status API is functional but limited. It accepts a payload containing:
* A state: pending, success, failure, error, or warning.
* A target_url: A URL that the status will link to, typically the CI job log view.
* A description: A short, one-line string summarizing the status.
* A context: A string used to differentiate this status from others (e.g., ci/helios/build, ci/helios/test).
This API provides a monolithic, per-job status. It lacks the granularity of the GitHub Checks API; there is no built-in support for reporting per-step feedback, streaming logs directly into the UI, adding line-level code annotations, or creating interactive UI elements. This represents a fundamental capability gap that will result in a less integrated and less rich user experience on Forgejo compared to GitHub.
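The endpoint and payload described above are simple enough to sketch directly. The hand-rolled JSON and the example values below are illustrative; a real client would serialize a struct with serde and attach the `token <TOKEN>` Authorization header to the request:

```rust
/// Commit states accepted by the Forgejo/Gitea Commit Status API.
pub enum CommitState {
    Pending,
    Success,
    Failure,
    Error,
    Warning,
}

impl CommitState {
    pub fn as_str(&self) -> &'static str {
        match self {
            CommitState::Pending => "pending",
            CommitState::Success => "success",
            CommitState::Failure => "failure",
            CommitState::Error => "error",
            CommitState::Warning => "warning",
        }
    }
}

/// Endpoint path for posting a status to a commit.
pub fn status_endpoint(owner: &str, repo: &str, sha: &str) -> String {
    format!("/api/v1/repos/{owner}/{repo}/statuses/{sha}")
}

/// JSON body, assembled by hand for this sketch; a real client would
/// serialize a struct with serde instead.
pub fn status_body(state: &CommitState, target_url: &str, description: &str, context: &str) -> String {
    format!(
        r#"{{"state":"{}","target_url":"{}","description":"{}","context":"{}"}}"#,
        state.as_str(),
        target_url,
        description,
        context
    )
}
```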
#### **Recommended Rust Crates**
Given the API compatibility, a Gitea-focused Rust crate is the most pragmatic choice. The gitea-sdk crate appears to be a modern and well-structured option, offering a fluent builder pattern for API requests that is conceptually similar to octocrab.26 While other crates like gritea 28 and gitea 29 exist, they appear less actively maintained or documented. The native forgejo-api crate is still nascent and has sparse documentation, making the mature Gitea SDKs a lower-risk choice.30
### **C. Designing a Unified Forge Abstraction in Rust**
To prevent the core logic of the Orchestration Engine from being polluted with if github {... } else if forgejo {... } conditional blocks, a strong abstraction layer is essential. This will be achieved by defining a Forge trait in Rust, which will present a unified, idealized interface for all interactions required by the CI system.
The core challenge in designing this trait is the feature disparity between the two platforms. A "native-like experience" means different things on GitHub versus Forgejo. The GitHub experience is defined by the rich, interactive feedback of the Checks API, while the Forgejo experience is limited to the simpler, monolithic updates of the Commit Status API. The abstraction must be designed to accommodate the richer feature set of GitHub, while allowing for a "best-effort" or graceful degradation on platforms that lack those features.
```rust
use async_trait::async_trait;

/// Severity of a line-level annotation.
pub enum AnnotationLevel {
    Notice,
    Warning,
    Failure,
}

/// Represents a line-level annotation.
pub struct Annotation {
    pub path: String,
    pub line: u32,
    pub message: String,
    pub level: AnnotationLevel,
}

/// Represents the state of a running job.
pub trait CheckRun {
    // ... methods to manage internal state ...
}

/// The core abstraction for interacting with a forge.
/// (`JobContext`, `JobResult`, and the `Result` alias are defined
/// elsewhere in the crate.)
#[async_trait]
pub trait Forge {
    /// Posts an initial "pending" status to the forge, returning a handle
    /// to the check run that can be used for subsequent updates.
    async fn report_pending(&self, job_context: &JobContext) -> Result<Box<dyn CheckRun>>;

    /// Updates the status of an in-progress check run, typically by
    /// appending new log output.
    async fn update_progress(&self, check_run: &mut Box<dyn CheckRun>, new_log_chunk: &str);

    /// Adds a specific code annotation to the check run.
    /// This will be a no-op for forges that do not support annotations.
    async fn add_annotation(&self, check_run: &mut Box<dyn CheckRun>, annotation: Annotation);

    /// Posts the final result of the job to the forge.
    async fn report_final_status(&self, check_run: Box<dyn CheckRun>, result: JobResult);
}
```
This design allows the rest of the system to operate against the idealized Forge trait. The system will attempt to add_annotation regardless of the target forge. The GitHubForge implementation will translate this into a Checks API call, while the ForgejoForge implementation will simply do nothing. This cleanly isolates the platform-specific logic and makes the system extensible to other forges in the future.
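The graceful-degradation pattern can be illustrated with a dependency-free, synchronous sketch (the real trait is async, and the type names GitHubForge and ForgejoForge are illustrative):

```rust
// A synchronous, dependency-free sketch of the graceful-degradation
// pattern: annotation support is a defaulted trait method, so forges
// that lack the feature inherit a silent no-op.
pub struct Annotation {
    pub path: String,
    pub line: u32,
    pub message: String,
}

pub trait Forge {
    /// Default is a silent no-op, so forges without annotation
    /// support (Forgejo/Gitea) simply drop the request.
    fn add_annotation(&mut self, _annotation: Annotation) {}
    /// How many annotations this forge has accepted.
    fn annotation_count(&self) -> usize {
        0
    }
}

pub struct GitHubForge {
    pub annotations: Vec<Annotation>,
}

impl Forge for GitHubForge {
    fn add_annotation(&mut self, annotation: Annotation) {
        // The real implementation would attach this to a Checks API
        // check_run update; here we just record it.
        self.annotations.push(annotation);
    }
    fn annotation_count(&self) -> usize {
        self.annotations.len()
    }
}

pub struct ForgejoForge;
impl Forge for ForgejoForge {} // inherits both no-op defaults
```

Callers invoke add_annotation unconditionally; on Forgejo the call simply does nothing, keeping the orchestration code free of per-forge branching.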
The following table provides a clear, at-a-glance comparison for stakeholders and developers, highlighting the feature gap that informs the design of this unified abstraction.
| Feature | GitHub Checks API | Forgejo Commit Status API | Implication for Helios CI |
| :---- | :---- | :---- | :---- |
| **Overall Status** | Supported (queued, in_progress, completed with conclusion) | Supported (pending, success, failure, etc.) | Core functionality is available on both platforms. |
| **Per-Step Status Updates** | Supported via check_run updates. Can show "Step 2/5: Running tests..." | Not supported. A single description for the entire job. | The UI experience on Forgejo will be less granular, showing only the overall job status. |
| **Code Annotations** | Supported. Line-specific feedback with severity levels. | Not supported. | A major feature gap. Linting/test failures cannot be shown inline on Forgejo PRs. |
| **Detailed Log Streaming** | Supported. The output.text field can be updated in real-time. | Not supported. Status links to an external target_url for logs. | Live log viewing must happen on the Helios CI web UI for Forgejo, not within the Forgejo UI itself. |
| **Custom UI Actions** | Supported. Can add buttons to the Checks UI to trigger new webhooks. | Not supported. | Interactive features like "re-run" must be initiated from outside the Forgejo UI. |
| **Authentication Model** | GitHub App (short-lived, scoped, per-installation tokens) | User API Token (long-lived, user-scoped) | The security model for GitHub is inherently stronger and more flexible. |
## **III. The Orchestration Engine: Managing Execution Environments on illumos**
The Orchestration Engine is the core of the Helios CI system, where the unique and powerful features of the illumos operating system are leveraged to create a highly efficient and secure job execution environment. The design of this component is critical to delivering on the promises of performance and isolation.
### **A. The bhyve Zone Brand: A Superior Model for VM Management**
Illumos zones are a mature and robust form of OS-level virtualization, providing strong process, filesystem, and network isolation for applications running in a shared kernel environment.31 A key feature of the zones framework is the concept of "brands," which allows a zone to run an environment other than the native illumos one. While brands like ipkg run a full, independent copy of illumos and lx runs Linux binaries, the bhyve brand is particularly relevant for our use case. It allows a full hardware-virtualized bHyve virtual machine to be managed as a standard illumos zone.31
Managing a bHyve VM through the zone framework is vastly superior to scripting raw bhyve command-line invocations. This approach integrates the VM's entire lifecycle into the operating system's core management facilities:
* **Service Management:** The VM becomes a standard Service Management Facility (SMF) service, manageable with svcadm and observable with svcs.
* **Resource Controls:** Standard zone resource controls (e.g., for CPU shares, memory caps) can be applied to the VM.
* **Unified Tooling:** The same set of commands (zonecfg for configuration, zoneadm for administration, zlogin for console access) are used for the bHyve VM as for a simple container-like zone, providing a consistent and powerful management paradigm.32
This elevates the VM from a mere process to a first-class citizen of the operating system, simplifying automation and enhancing reliability.
### **B. Programmatic VM Provisioning with zonecfg and ZFS**
The Orchestration Engine will not rely on statically pre-configured zones. To achieve true on-demand, ephemeral environments, it will programmatically generate a unique zone configuration for every CI job it processes.
#### **Dynamic Configuration with zonecfg**
The zonecfg utility is the standard tool for defining a zone's configuration.33 The Orchestrator will either generate a command file to be passed to zonecfg -f or, preferably, use a native Rust library that performs the equivalent operations. The configuration for a bhyve branded zone will include several key properties:
* create -b: This command initiates the creation of a new zone configuration.
* set brand=bhyve: This specifies that the zone will host a bHyve VM.32
* set zonepath=/path/to/zone/root: This defines the directory where the zone's configuration and runtime state will be stored. This path will point to a temporary, job-specific ZFS dataset.
* set ip-type=exclusive: This grants the zone its own dedicated virtual network interface and IP stack, ensuring complete network isolation from the host and other zones.32
* add net: This resource block configures the virtual network interface, specifying its physical link (typically a virtual switch or vnic) and allowed IP addresses.32
* add device: This resource is used to pass a block device from the global zone into the zone. This is how the VM's virtual disk will be provided.32
* add attr: The bhyve brand is configured almost entirely through these generic attribute resources. Key attributes include ram (e.g., 4G), vcpus (e.g., 2), and bootdisk (which points to the device added previously).32
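Taken together, these properties can be rendered into a zonecfg command file programmatically. A sketch follows; the vnic name, zonepath layout, and dataset names are illustrative:

```rust
/// Renders a zonecfg(8) command file for a bhyve-branded zone. The
/// Orchestrator would feed the result to `zonecfg -z <name> -f <file>`.
/// The vnic name and zonepath layout here are illustrative.
pub fn bhyve_zonecfg(zone: &str, zvol: &str, ram: &str, vcpus: u32) -> String {
    let mut lines = vec![
        "create -b".to_string(),
        "set brand=bhyve".to_string(),
        format!("set zonepath=/zones/{zone}"),
        "set ip-type=exclusive".to_string(),
    ];
    // Exclusive-stack virtual NIC for the VM.
    lines.extend(["add net".to_string(), "set physical=vnic0".to_string(), "end".to_string()]);
    // Pass the job's cloned zvol into the zone as a block device.
    lines.extend([
        "add device".to_string(),
        format!("set match=/dev/zvol/rdsk/{zvol}"),
        "end".to_string(),
    ]);
    // bhyve-brand attributes: RAM, vCPUs, and the boot disk.
    for (name, value) in [
        ("ram", ram.to_string()),
        ("vcpus", vcpus.to_string()),
        ("bootdisk", zvol.to_string()),
    ] {
        lines.extend([
            "add attr".to_string(),
            format!("set name={name}"),
            "set type=string".to_string(),
            format!("set value={value}"),
            "end".to_string(),
        ]);
    }
    lines.join("\n")
}
```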
#### **The Central Role of ZFS**
ZFS is not merely a filesystem in this architecture; it is the foundational technology that enables the system's efficiency and security.
* **Base VM Image:** A "golden" VM image, containing the guest OS and the pre-installed Job Execution Agent, will be maintained on a ZFS volume (a zvol), for example, at rpool/bhyve-images/ubuntu-agent-v1. This image is read-only.
* **Instantaneous Clones:** When a new CI job arrives, the Orchestrator's first action is to execute a zfs clone command. For example: zfs clone rpool/bhyve-images/ubuntu-agent-v1@latest rpool/ci-vms/job-123-disk. This operation creates a new, writable ZFS volume that initially shares all its data blocks with the parent image. It is a copy-on-write clone, meaning it is created almost instantaneously and consumes negligible disk space initially. This completely sidesteps the slow process of copying a multi-gigabyte disk image, which is a major bottleneck in traditional VM-based CI systems.
* **Guaranteed Isolation and Atomic Cleanup:** The cloned ZFS volume is dedicated to a single job. The zonecfg configuration will pass the path to this clone (e.g., /dev/zvol/rdsk/rpool/ci-vms/job-123-disk) into the zone as its boot disk. When the job is complete, the Orchestrator halts the zone and executes a single zfs destroy command on the clone. This atomically and irrevocably removes all traces of the build environment, including any modifications, downloaded dependencies, or generated artifacts. This provides a forensically-sound guarantee of a clean slate for every job.
This combination of bhyve branded zones and ZFS clones offers a unique and powerful architectural advantage, providing the strong security isolation of full hardware virtualization with the speed and efficiency approaching that of container-based systems.
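The clone-and-destroy lifecycle reduces to two zfs invocations per job. The sketch below builds (but does not execute) those commands, with pool and image names taken from the examples above:

```rust
use std::process::Command;

/// Builds (but does not run) the `zfs clone` invocation that creates a
/// job's writable copy-on-write disk. Pool and image names follow the
/// examples in the text.
pub fn zfs_clone_cmd(job_id: u64) -> Command {
    let mut cmd = Command::new("zfs");
    cmd.arg("clone")
        .arg("rpool/bhyve-images/ubuntu-agent-v1@latest")
        .arg(format!("rpool/ci-vms/job-{job_id}-disk"));
    cmd
}

/// `zfs destroy` for the atomic teardown of the same clone.
pub fn zfs_destroy_cmd(job_id: u64) -> Command {
    let mut cmd = Command::new("zfs");
    cmd.arg("destroy").arg(format!("rpool/ci-vms/job-{job_id}-disk"));
    cmd
}
```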
### **C. Rust-based Orchestration using the oxidecomputer/zone Crate**
While it is possible to orchestrate this process by shelling out to zonecfg, zoneadm, and zfs commands, this approach is brittle, difficult to maintain, and presents potential security risks (e.g., command injection). A native Rust library that provides a safe, typed API for these operations is the professionally sound choice.
The oxidecomputer/zone crate is purpose-built for creating and managing illumos zones from within a Rust application.35 Although its public documentation and usage examples are sparse, its origin within the Oxide Computer Company—a company built on illumos—suggests it is a production-quality library designed for exactly this type of systems management task.36
The reliance on this sparsely documented crate represents the most significant technical risk in the project's implementation. Public information is limited, and the crate is not widely discussed in community forums.38 Oxide's own development philosophy indicates that their open-source contributions are primarily for their own use and customer support, not necessarily for building a broad user community.38
Therefore, a critical first step in the implementation phase must be a dedicated "discovery and de-risking" task. An engineer must be allocated time to thoroughly analyze the zone crate's source code, understand its API surface, and build small proof-of-concept applications to validate its capabilities for creating, configuring, booting, and destroying bhyve branded zones. This upfront investment is essential to mitigate the risk and ensure the project's success.
Based on the crate's stated purpose, the anticipated usage pattern within the Orchestrator will be:
1. Instantiate the zone crate's data structures to build a complete zone configuration in memory, programmatically setting the brand, zonepath, and adding network, device, and attribute resources.
2. Invoke a function within the crate that serializes this configuration and applies it to the system, equivalent to running zonecfg.
3. Utilize other functions in the crate that serve as safe wrappers around zoneadm commands to boot, halt, and ultimately destroy the zone after the job is complete.
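Until the zone crate's API has been validated, the zoneadm steps above can be covered by thin wrappers that shell out to the documented CLI. A sketch (the zone name is illustrative, and each function only builds the command):

```rust
use std::process::Command;

/// Thin wrappers around zoneadm(8), usable as a fallback until the
/// `zone` crate's API is validated. Each function builds (but does
/// not run) a command.
fn zoneadm(zone: &str, subcommand: &[&str]) -> Command {
    let mut cmd = Command::new("zoneadm");
    cmd.arg("-z").arg(zone).args(subcommand);
    cmd
}

pub fn boot(zone: &str) -> Command {
    zoneadm(zone, &["boot"])
}

pub fn halt(zone: &str) -> Command {
    zoneadm(zone, &["halt"])
}

/// `-F` skips the confirmation prompt during automated teardown.
pub fn uninstall(zone: &str) -> Command {
    zoneadm(zone, &["uninstall", "-F"])
}
```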
The following table serves as a quick-reference guide for the key zonecfg properties that will need to be set programmatically for each bhyve zone.
| Resource Type | Property/Attribute Name | Type | Example Value | Description |
| :---- | :---- | :---- | :---- | :---- |
| (global) | brand | string | bhyve | Sets the zone brand to bHyve for hardware virtualization.32 |
| (global) | zonepath | string | /zones/job-123 | The root directory for the zone's configuration on a ZFS dataset.33 |
| (global) | ip-type | enum | exclusive | Gives the zone its own dedicated IP stack for network isolation.33 |
| net | physical | string | vnic0 | The name of the virtual NIC in the global zone to connect to.32 |
| device | match | string | /dev/zvol/rdsk/rpool/ci-vms/job-123-disk | Passes the job-specific ZFS volume into the zone as a block device.32 |
| attr | name=ram | string | 4G | Sets the amount of RAM allocated to the virtual machine.32 |
| attr | name=vcpus | string | 2 | Sets the number of virtual CPUs for the VM.32 |
| attr | name=bootdisk | string | rpool/ci-vms/job-123-disk | Specifies which device (matched above) is the primary boot disk.32 |
| attr | name=vnc | string | on | Enables VNC access for debugging or graphical installers.32 |
## **IV. CI Job Definition and Execution**
This section addresses the user-facing aspect of the CI system: how developers define their build and test pipelines, and how those definitions are translated into actions executed within the ephemeral virtual machines.
### **A. Defining Workflows: Adopting a Familiar YAML Syntax**
Rather than inventing a proprietary workflow syntax, which would create a significant barrier to adoption, Helios CI will adopt a schema that is largely compatible with the common features of GitHub Actions and Forgejo Actions.39 Both platforms utilize a similar YAML structure based on concepts like jobs, steps, and triggers. This approach allows developers to leverage their existing knowledge and, in many cases, use basic workflow files with minimal modification.
#### **Core Schema Elements**
The workflow file, typically located at .github/workflows/ci.yml or a similar path, will be structured around the following core keys:
* name: An optional string that provides a human-readable name for the workflow.
* on: A required key that specifies the events that trigger the workflow, such as push or pull_request, potentially filtered by branch or path.
* jobs: A map where each key is a unique job ID. Each job runs in its own, separate VM environment.
* Within each job:
* runs-on: A string that specifies the type of build environment required. The Orchestrator will map this string (e.g., ubuntu-22.04, illumos-stable) to a specific "golden" VM image to be cloned.
* steps: A list of sequential steps to be executed. Each step is an individual task.
* Within each step:
* name: An optional descriptive name for the step, which will be displayed in the UI.
* uses: Specifies a reusable action to be run.
* run: A string or multi-line string containing a shell command to be executed.
Adopting this syntax provides immediate familiarity but also requires careful management of user expectations. While the *syntax* is compatible, the *execution environment* is unique to Helios CI. This means that complex, marketplace-style actions from GitHub or Forgejo (which are often JavaScript or Docker-based and rely on a specific runner environment) will not be compatible out of the box.46 The initial implementation of uses should be clearly documented to support only "local actions" (uses:./path/to/action), where the action's code is checked into the user's own repository.47 This focuses the system on its core strength—executing arbitrary run commands in a secure environment—while avoiding the immense complexity of replicating the full GitHub Actions runner environment.
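Under these constraints, a minimal workflow file might look like the following (the runs-on label and the local action path are illustrative):

```yaml
name: build
on:
  push:
    branches: [main]
jobs:
  build:
    runs-on: illumos-stable      # mapped by the Orchestrator to a golden VM image
    steps:
      - name: Build
        run: cargo build --release
      - name: Test
        run: cargo test
      - name: Package
        uses: ./ci/actions/package   # local action, checked into the repository
```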
#### **Parsing with serde_yaml**
The Rust ecosystem provides first-class tools for parsing and handling structured data. The serde framework, in combination with a YAML parsing crate, will be used to deserialize the workflow file into strongly-typed Rust structs. While serde_yaml has been a popular choice, it is now largely unmaintained.48 A modern, maintained fork such as serde_yaml_ng is the recommended choice to ensure ongoing support and security.50 This approach provides compile-time safety, automatic validation of the workflow file's structure, and a clean, safe way to pass job definitions between the system's components.
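The deserialization target might look like the structs below. The serde derives (including the attribute needed to map the hyphenated runs-on key) are omitted so the sketch stays dependency-free:

```rust
use std::collections::BTreeMap;

// Dependency-free sketch of the workflow data model. In the real
// service these structs would carry serde `Deserialize` derives
// (with `#[serde(rename = "runs-on")]` for that key) and would be
// produced by the YAML parser.
#[derive(Debug, Clone)]
pub struct Workflow {
    pub name: Option<String>,
    pub on: Vec<String>, // simplified: event names only, no filters
    pub jobs: BTreeMap<String, Job>,
}

#[derive(Debug, Clone)]
pub struct Job {
    pub runs_on: String, // maps to a golden VM image, e.g. "illumos-stable"
    pub steps: Vec<Step>,
}

#[derive(Debug, Clone)]
pub struct Step {
    pub name: Option<String>,
    pub uses: Option<String>, // local actions only, e.g. "./ci/actions/package"
    pub run: Option<String>,
}
```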
### **B. The Ephemeral Job Agent**
The Job Agent is a small, self-contained, and statically-linked Rust binary that is pre-installed on every base VM image. Its design prioritizes simplicity and robustness, as it runs in an untrusted environment executing user-provided code.
#### **Agent Lifecycle**
1. **Startup:** When the bHyve VM is booted by the Orchestrator, a startup service (e.g., an SMF service on an illumos guest, or a systemd service on a Linux guest) immediately launches the Job Agent binary.
2. **Configuration and Job Fetching:** The agent needs to receive its specific job context (repository URL, commit SHA, parsed workflow steps, etc.). This information can be passed from the Orchestrator to the guest environment through several mechanisms, such as a small, mounted configuration drive, environment variables injected by the bhyve brand, or an initial API call from the agent back to a private metadata endpoint on the Orchestration Engine.
3. **Workspace Setup:** The agent's first task is to prepare the build environment. It will use an embedded Git library or shell out to git to clone the specified repository and check out the exact commit SHA into a local working directory.
4. **Step Execution:** The agent iterates through the list of steps provided in its job context. For each step containing a run command, it will spawn a shell process (/bin/sh or /bin/bash), execute the command, and meticulously capture every byte of its stdout and stderr streams in real-time.
5. **Live Log Streaming:** As the output is captured, it is immediately streamed back to the Orchestration Engine over a simple, persistent gRPC or TCP connection established at startup. This is the mechanism that enables live log viewing for the developer.
6. **Status Reporting:** After each step completes, the agent inspects its exit code. If the code is non-zero, the step is marked as failed. By default, a failed step will halt the execution of the entire job. The agent will immediately report the step's success or failure back to the Orchestrator.
7. **Shutdown:** Once all steps have been executed, or if a step fails and the job is aborted, the agent sends a final job status report (including the overall success or failure conclusion) to the Orchestration Engine. It then cleanly terminates its own process. This termination signals to the Orchestrator that the job is complete and the VM is ready for destruction.
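Steps 4 through 6 above can be sketched as a minimal step-execution loop; this is a simplified illustration, where the sink callback stands in for the agent's gRPC stream back to the Orchestrator:

```rust
use std::io::{BufRead, BufReader};
use std::process::{Command, ExitStatus, Stdio};
use std::thread;

// Runs one `run` step in a shell, forwarding output line by line as it
// appears (the real agent would push each line over its gRPC stream).
// The returned exit status decides whether the step, and by default the
// whole job, is marked as failed.
fn run_step(command: &str, mut sink: impl FnMut(&str)) -> std::io::Result<ExitStatus> {
    let mut child = Command::new("/bin/sh")
        .arg("-c")
        .arg(command)
        .stdout(Stdio::piped())
        .stderr(Stdio::piped())
        .spawn()?;

    // Drain stderr on a separate thread so neither pipe can fill and stall the child.
    let stderr = child.stderr.take().expect("stderr was piped");
    let stderr_reader = thread::spawn(move || {
        BufReader::new(stderr)
            .lines()
            .filter_map(Result::ok)
            .collect::<Vec<_>>()
    });

    let stdout = child.stdout.take().expect("stdout was piped");
    for line in BufReader::new(stdout).lines() {
        sink(&line?); // stream immediately; do not buffer the whole run
    }
    for line in stderr_reader.join().expect("stderr thread panicked") {
        sink(&line);
    }
    child.wait()
}

fn main() -> std::io::Result<()> {
    let mut log = Vec::new();
    let status = run_step("echo compiling && echo done", |l| log.push(l.to_string()))?;
    println!("step ok: {}", status.success());
    println!("log: {:?}", log);
    Ok(())
}
```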
## **V. Tying It All Together: Anatomy of a CI Run**
To synthesize the interactions between these components, this section provides a narrative, step-by-step walkthrough of a complete CI process. It follows a single git push from the developer's machine to the final, detailed result appearing in the GitHub pull request interface.
1. **The Trigger:** A developer on their local machine finalizes a feature and executes git push origin feature-branch. The commit is pushed to the corresponding repository on GitHub.
2. **The Webhook:** GitHub receives the push, identifies that it corresponds to an open pull request, and creates a check\_suite for the new commit SHA. It then dispatches a check\_suite webhook event to the pre-configured public endpoint of the Helios CI Forge Integration Layer. The JSON payload of this webhook contains the repository details, the commit SHA, and the unique installation\_id for the GitHub App.
3. **Authentication and Initial Feedback (Forge Integration Layer):**
* The web service receives the HTTP POST request. It first validates the X-Hub-Signature-256 header to confirm the request is authentic and originated from GitHub.
* Using its securely stored App ID and private key, the service generates a short-lived JWT. It immediately uses this JWT to request a one-hour installation access token from GitHub's API, specific to the installation\_id from the webhook.10
* With the newly acquired installation access token, it makes its first API call back to GitHub: a POST request to create a new check\_run. It sets the status to queued and the name to something descriptive, like 'Helios CI / build'. Within seconds of the push, a new pending check appears in the developer's pull request UI.
* The service then sanitizes the relevant information from the webhook payload (repo URL, SHA, etc.) and dispatches a validated job request to the Orchestration Engine's internal gRPC API.
4. **Provisioning the Environment (Orchestration Engine):**
* The Orchestration Engine receives the job request from its internal queue.
* It executes zfs clone rpool/bhyve-images/ubuntu-agent-v1@latest rpool/ci-vms/job-451-disk, creating an instantaneous, writable disk for the new VM.
* Leveraging the oxidecomputer/zone Rust crate, it programmatically constructs a new zone configuration in memory for a zone named job-451. It sets the brand to bhyve, points the zone's boot disk device to the newly created ZFS volume, and configures its networking to connect to an isolated virtual switch.32
* It commits this configuration to the system and then calls the equivalent of zoneadm -z job-451 boot to start the VM.
5. **Job Execution (Job Agent):**
* The bHyve VM boots the Ubuntu guest OS. A systemd service automatically starts the pre-installed Rust-based Job Agent.
* The agent establishes a gRPC connection back to the Orchestrator and receives the parsed YAML steps for its assigned job.
* It clones the repository (https://github.com/org/repo.git) and checks out the specific commit SHA.
* It begins executing the defined steps: run: cargo fmt --check, then run: cargo clippy -- -D warnings, and finally run: cargo test.
6. **Real-time Reporting:**
* As cargo test executes, it prints test status lines to stdout. The Job Agent captures this output line-by-line and immediately sends each line over its gRPC stream to the Orchestrator.
* The Orchestrator forwards these log chunks to the Forge Integration Layer.
* The Integration Layer makes a series of PATCH requests to the GitHub Checks API, updating the output.text field of the check\_run. The developer, watching the pull request in their browser, can see the test output appearing in real-time within the GitHub UI.
7. **A Test Failure:** One of the integration tests fails, causing cargo test to exit with a non-zero status code.
* The Job Agent detects the failure. It can optionally be configured to parse the cargo test output to identify the exact file and line number of the failing test assertion.
* It constructs a final report for the Orchestrator, indicating an overall job failure and including a structured Annotation object with the file path, line number, and error message of the failed test.
8. **Final Status and Teardown:**
* The Orchestrator receives the failure report. It immediately commands the bhyve zone to shut down via zoneadm -z job-451 halt. Once the zone is halted, it executes zfs destroy -r rpool/ci-vms/job-451-disk. The entire build environment, including the failed test's artifacts and logs, is instantly and completely destroyed.
* The Orchestrator forwards the final result, including the structured annotation, to the Forge Integration Layer.
* The Integration Layer makes one last PATCH request to the check\_run on GitHub. It sets the conclusion to failure and includes the annotation in the output object. Instantly, the pending check in the PR turns into a red 'X'. When the developer expands the details, they see the full log, and an annotation is placed directly on the line of code containing the failed assertion in the "Files changed" tab.
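The provisioning and teardown commands from steps 4 and 8 can be sketched as a pure helper on the Orchestrator side. Dataset and zone names follow the hypothetical job-451 naming from the walkthrough; a real implementation would drive the oxidecomputer/zone crate (and include the zonecfg step) rather than shelling out:

```rust
// Builds the argv sequences for one job's VM lifecycle, mirroring the
// walkthrough: ZFS clone -> zone boot -> zone halt -> ZFS destroy.
// Zone configuration, execution, and error handling are elided.
fn lifecycle_commands(job_id: u64, base_image: &str) -> (Vec<Vec<String>>, Vec<Vec<String>>) {
    let disk = format!("rpool/ci-vms/job-{job_id}-disk");
    let zone = format!("job-{job_id}");
    let provision = vec![
        vec!["zfs".into(), "clone".into(), base_image.into(), disk.clone()],
        vec!["zoneadm".into(), "-z".into(), zone.clone(), "boot".into()],
    ];
    let teardown = vec![
        vec!["zoneadm".into(), "-z".into(), zone, "halt".into()],
        vec!["zfs".into(), "destroy".into(), "-r".into(), disk],
    ];
    (provision, teardown)
}

fn main() {
    let (up, down) = lifecycle_commands(451, "rpool/bhyve-images/ubuntu-agent-v1@latest");
    for cmd in up.iter().chain(down.iter()) {
        println!("{}", cmd.join(" "));
    }
}
```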
## **VI. Advanced Topics and Strategic Recommendations**
A robust CI system requires more than just job execution. This section outlines critical considerations for security, scalability, and performance that must be addressed to create a production-ready platform.
### **A. Security Hardening**
Security must be a foundational principle of the CI system's design, especially as it will be executing untrusted code from pull requests.
* **Private Key Management:** The GitHub App's private key is the ultimate credential for the system's identity on GitHub. If compromised, an attacker could impersonate the CI system across all installed repositories. This key must never be stored in a configuration file or environment variable in plain text. It should be stored in a dedicated secrets management system like HashiCorp Vault or a Hardware Security Module (HSM). The Forge Integration Layer should be configured to fetch this key at startup, and access to the secrets manager should be tightly controlled.
* **Network Isolation:** By default, the bHyve VMs should be provisioned on a completely isolated virtual network. On illumos, this can be achieved using an etherstub and a dedicated vnic for each zone. This network should have no route to the public internet or to sensitive internal production networks. For jobs that legitimately need to download dependencies from the internet (e.g., from crates.io), egress should be explicitly enabled and routed through a dedicated, filtering proxy that can enforce policies and log all outbound traffic.
* **Secrets Management for Jobs:** CI jobs often require access to secrets like deployment credentials or API keys. The Helios CI system must provide a secure mechanism for injecting these into the build environment. A robust solution would involve integrating the Orchestration Engine with a secrets backend (like Vault). The workflow YAML could specify which secrets are needed, and the Orchestrator would fetch them from Vault and inject them into the VM at boot time, making them available to the Job Agent as environment variables or temporary files in a tmpfs. These secrets must be masked from the build logs to prevent accidental exposure.
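Log masking in particular is cheap to implement and easy to omit; a minimal sketch of the idea, which the agent would apply to every log line before it leaves the VM:

```rust
// Replaces every occurrence of a known secret value in a log line before
// the line is streamed or stored. Very short values are skipped to avoid
// masking common substrings.
fn mask_secrets(line: &str, secrets: &[&str]) -> String {
    let mut out = line.to_string();
    for secret in secrets {
        if secret.len() >= 4 {
            out = out.replace(secret, "***");
        }
    }
    out
}

fn main() {
    let secrets = ["s3cr3t-deploy-key"];
    let line = "pushing with token s3cr3t-deploy-key to registry";
    // prints "pushing with token *** to registry"
    println!("{}", mask_secrets(line, &secrets));
}
```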
### **B. Scalability and Concurrency**
While the initial design can operate on a single powerful illumos host, a true production system must be able to scale horizontally.
* **Decoupling with a Job Queue:** The direct API call from the Integration Layer to the Orchestrator can become a bottleneck. A more scalable architecture would introduce a message queue (e.g., RabbitMQ, NATS) between the two services. The Integration Layer would simply publish a job request message to the queue.
* **Pool of Orchestrator Nodes:** The system can be scaled by creating a cluster of multiple physical illumos servers, each running an instance of the Orchestration Engine. These engines would act as consumers, pulling job requests from the central message queue. Each node would manage its own pool of local resources (CPU, RAM, ZFS storage) and run a certain number of concurrent bHyve VMs. This distributed model allows the system's total capacity to be scaled horizontally simply by adding more illumos hosts to the cluster.
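The decoupled flow can be sketched with std::sync::mpsc standing in for the real broker; in production the channel would be a durable queue such as NATS or RabbitMQ, and each consumer an Orchestration Engine on its own illumos host:

```rust
use std::sync::mpsc;
use std::thread;

// A job request as published by the Forge Integration Layer.
#[derive(Debug)]
struct JobRequest {
    repo: String,
    sha: String,
}

fn main() {
    let (tx, rx) = mpsc::channel::<JobRequest>();

    // Stand-in for an Orchestrator node consuming from the queue. With a
    // real broker, several such consumers on different hosts would share
    // one queue group, giving horizontal scale-out by adding hosts.
    let orchestrator = thread::spawn(move || {
        for job in rx {
            // provision VM, run job, tear down (elided)
            println!("dispatching {} @ {}", job.repo, job.sha);
        }
    });

    // The Integration Layer only publishes and returns immediately; it
    // never blocks on job execution.
    tx.send(JobRequest { repo: "org/repo".into(), sha: "abc123".into() }).unwrap();
    drop(tx); // close the queue so the consumer loop ends
    orchestrator.join().unwrap();
}
```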
### **C. Artifact and Cache Management with ZFS**
The capabilities of ZFS extend beyond just provisioning, offering powerful solutions for managing build artifacts and caching.
* **Artifacts:** For successful builds that produce artifacts (e.g., compiled binaries, documentation, container images), the Job Agent can package them into an archive. This archive can be streamed back to the Orchestrator before the VM is destroyed. The Orchestrator can then store this artifact on a dedicated ZFS dataset. The Integration Layer can either expose a secure download link for this artifact or use the forge's API to upload it to a "release" or as a job artifact.
* **Intelligent Caching:** The long pole in many CI jobs is downloading dependencies and recompiling code that hasn't changed. ZFS provides a uniquely elegant solution to this problem.
1. After a successful build for a job on feature-branch, before destroying its ZFS clone (rpool/ci-vms/job-451-disk), the Orchestrator can take a snapshot of it: zfs snapshot rpool/ci-vms/job-451-disk@cache.
2. It can then identify directories that are good candidates for caching, such as /root/.cargo/registry or the project's target directory.
3. When the next job for feature-branch (job 452) comes in, the Orchestrator creates a new clone as usual (rpool/ci-vms/job-452-disk).
4. Instead of starting with a clean slate, it can use zfs send/recv to efficiently stream the cached data from the previous snapshot into the new environment. Because send/recv operates at snapshot granularity, the cacheable directories are best kept on their own small dataset. This process is extremely fast as it operates at the block level.
5. When the VM for job 452 boots, its Cargo registry and target directory are already populated, potentially saving many minutes of download and compilation time. This provides a highly efficient, storage-level caching mechanism with minimal overhead.
#### **Works cited**
1. Using the REST API to interact with checks \- GitHub Docs, accessed on October 25, 2025, [https://docs.github.com/en/rest/guides/using-the-rest-api-to-interact-with-checks](https://docs.github.com/en/rest/guides/using-the-rest-api-to-interact-with-checks)
2. PAT vs oAuth vs GitHub App · community · Discussion \#109668, accessed on October 25, 2025, [https://github.com/orgs/community/discussions/109668](https://github.com/orgs/community/discussions/109668)
3. Differences between GitHub Apps and OAuth apps \- GitHub Docs, accessed on October 25, 2025, [https://docs.github.com/en/apps/oauth-apps/building-oauth-apps/differences-between-github-apps-and-oauth-apps](https://docs.github.com/en/apps/oauth-apps/building-oauth-apps/differences-between-github-apps-and-oauth-apps)
4. Deciding when to build a GitHub App, accessed on October 25, 2025, [https://docs.github.com/en/apps/creating-github-apps/about-creating-github-apps/deciding-when-to-build-a-github-app](https://docs.github.com/en/apps/creating-github-apps/about-creating-github-apps/deciding-when-to-build-a-github-app)
5. Best practices for creating a GitHub App, accessed on October 25, 2025, [https://docs.github.com/en/apps/creating-github-apps/about-creating-github-apps/best-practices-for-creating-a-github-app](https://docs.github.com/en/apps/creating-github-apps/about-creating-github-apps/best-practices-for-creating-a-github-app)
6. Replacing a GitHub Personal Access Token with a GitHub Application \- Aembit, accessed on October 25, 2025, [https://aembit.io/blog/replacing-a-github-personal-access-token-with-a-github-application/](https://aembit.io/blog/replacing-a-github-personal-access-token-with-a-github-application/)
7. Making GitHub API Requests with a JWT \- Thomas Stringer, accessed on October 25, 2025, [https://trstringer.com/github-api-requests-with-jwt/](https://trstringer.com/github-api-requests-with-jwt/)
8. GitHub App Token Authorization: A Complete Guide | by Abhishek Tiwari | Medium, accessed on October 25, 2025, [https://medium.com/@tiwari09abhi/github-app-token-authorization-a-complete-guide-169461f2953f](https://medium.com/@tiwari09abhi/github-app-token-authorization-a-complete-guide-169461f2953f)
9. Generating a JSON Web Token (JWT) for a GitHub App, accessed on October 25, 2025, [https://docs.github.com/en/apps/creating-github-apps/authenticating-with-a-github-app/generating-a-json-web-token-jwt-for-a-github-app](https://docs.github.com/en/apps/creating-github-apps/authenticating-with-a-github-app/generating-a-json-web-token-jwt-for-a-github-app)
10. Authenticating as a GitHub App installation, accessed on October 25, 2025, [https://docs.github.com/en/apps/creating-github-apps/authenticating-with-a-github-app/authenticating-as-a-github-app-installation](https://docs.github.com/en/apps/creating-github-apps/authenticating-with-a-github-app/authenticating-as-a-github-app-installation)
11. About authentication with a GitHub App, accessed on October 25, 2025, [https://docs.github.com/en/apps/creating-github-apps/authenticating-with-a-github-app/about-authentication-with-a-github-app](https://docs.github.com/en/apps/creating-github-apps/authenticating-with-a-github-app/about-authentication-with-a-github-app)
12. XAMPPRocky/octocrab: A modern, extensible GitHub API ... \- GitHub, accessed on October 25, 2025, [https://github.com/XAMPPRocky/octocrab](https://github.com/XAMPPRocky/octocrab)
13. octocrab \- crates.io: Rust Package Registry, accessed on October 25, 2025, [https://crates.io/crates/octocrab](https://crates.io/crates/octocrab)
14. Forgejo numbering scheme | Forgejo Beyond coding. We forge., accessed on October 25, 2025, [https://forgejo.org/docs/latest/user/versions/](https://forgejo.org/docs/latest/user/versions/)
15. Gitea Documentation, accessed on October 25, 2025, [https://docs.gitea.cn/en-us/1.19/](https://docs.gitea.cn/en-us/1.19/)
16. Gitea API | Gitea Documentation, accessed on October 25, 2025, [https://docs.gitea.com/api/1.24/](https://docs.gitea.com/api/1.24/)
17. API Reference — gitea v1.1.11 \- HexDocs, accessed on October 25, 2025, [https://hexdocs.pm/gitea/](https://hexdocs.pm/gitea/)
18. Gitea API. | Documentation | Postman API Network, accessed on October 25, 2025, [https://www.postman.com/api-evangelist/gitea/documentation/1jqejxn/gitea-api](https://www.postman.com/api-evangelist/gitea/documentation/1jqejxn/gitea-api)
19. API Usage \- Gitea Documentation, accessed on October 25, 2025, [https://docs.gitea.com/development/api-usage](https://docs.gitea.com/development/api-usage)
20. Gitea Official Website, accessed on October 25, 2025, [https://about.gitea.com/](https://about.gitea.com/)
21. API Usage | Forgejo Beyond coding. We forge., accessed on October 25, 2025, [https://forgejo.org/docs/latest/user/api-usage/](https://forgejo.org/docs/latest/user/api-usage/)
22. API Usage | Forgejo Beyond coding. We forge., accessed on October 25, 2025, [https://forgejo.org/docs/v1.20/user/api-usage/](https://forgejo.org/docs/v1.20/user/api-usage/)
23. set gitea status \- Tekton task \- Artifact Hub, accessed on October 25, 2025, [https://artifacthub.io/packages/tekton-task/tekton-tasks/gitea-set-status](https://artifacthub.io/packages/tekton-task/tekton-tasks/gitea-set-status)
24. Gitea Checks | Jenkins plugin, accessed on October 25, 2025, [https://plugins.jenkins.io/gitea-checks/](https://plugins.jenkins.io/gitea-checks/)
25. Create a commit status | Gitea API. \- Postman, accessed on October 25, 2025, [https://www.postman.com/api-evangelist/gitea/request/t0hjvmx/create-a-commit-status](https://www.postman.com/api-evangelist/gitea/request/t0hjvmx/create-a-commit-status)
26. gitea-sdk \- crates.io: Rust Package Registry, accessed on October 25, 2025, [https://crates.io/crates/gitea-sdk](https://crates.io/crates/gitea-sdk)
27. gitea\_sdk \- Rust \- Docs.rs, accessed on October 25, 2025, [https://docs.rs/gitea-sdk](https://docs.rs/gitea-sdk)
28. Gritea — async Rust library // Lib.rs, accessed on October 25, 2025, [https://lib.rs/crates/gritea](https://lib.rs/crates/gritea)
29. gitea \- Rust \- Docs.rs, accessed on October 25, 2025, [https://docs.rs/gitea](https://docs.rs/gitea)
30. forgejo\_api \- Rust \- Docs.rs, accessed on October 25, 2025, [https://docs.rs/forgejo-api](https://docs.rs/forgejo-api)
31. OmniOS zones, accessed on October 25, 2025, [https://omnios.org/setup/zones](https://omnios.org/setup/zones)
32. bhyve and KVM branded zones \- OmniOS, accessed on October 25, 2025, [https://omnios.org/info/bhyve\_kvm\_brand](https://omnios.org/info/bhyve_kvm_brand)
33. illumos: manual page: zonecfg.8 \- SmartOS, accessed on October 25, 2025, [https://smartos.org/man/8/zonecfg](https://smartos.org/man/8/zonecfg)
34. Using the zonecfg Command to Modify a Zone Configuration \- Oracle Solaris 11.1 Administration, accessed on October 25, 2025, [https://docs.oracle.com/cd/E26502\_01/html/E29024/z.conf.start-115.html](https://docs.oracle.com/cd/E26502_01/html/E29024/z.conf.start-115.html)
35. zone \- crates.io: Rust Package Registry, accessed on October 25, 2025, [https://crates.io/crates/zone](https://crates.io/crates/zone)
36. oxidecomputer/zone \- GitHub, accessed on October 25, 2025, [https://github.com/oxidecomputer/zone](https://github.com/oxidecomputer/zone)
37. Oxide Computer Company \- GitHub, accessed on October 25, 2025, [https://github.com/oxidecomputer](https://github.com/oxidecomputer)
38. GitHub \- oxidecomputer/dropshot: expose REST APIs from a Rust program \- Reddit, accessed on October 25, 2025, [https://www.reddit.com/r/rust/comments/1ixqzlx/github\_oxidecomputerdropshot\_expose\_rest\_apis/](https://www.reddit.com/r/rust/comments/1ixqzlx/github_oxidecomputerdropshot_expose_rest_apis/)
39. Getting Started with GitHub Actions \- Waylon Walker, accessed on October 25, 2025, [https://waylonwalker.com/github-actions-syntax/](https://waylonwalker.com/github-actions-syntax/)
40. Understanding GitHub Actions, accessed on October 25, 2025, [https://docs.github.com/articles/getting-started-with-github-actions](https://docs.github.com/articles/getting-started-with-github-actions)
41. GitHub Actions documentation, accessed on October 25, 2025, [https://docs.github.com/actions](https://docs.github.com/actions)
42. Forgejo Actions user guide, accessed on October 25, 2025, [https://forgejo.org/docs/v1.21/user/actions/](https://forgejo.org/docs/v1.21/user/actions/)
43. Forgejo Actions user guide, accessed on October 25, 2025, [https://forgejo.org/docs/v1.20/user/actions/](https://forgejo.org/docs/v1.20/user/actions/)
44. Forgejo Actions | Basic concepts, accessed on October 25, 2025, [https://forgejo.org/docs/latest/user/actions/basic-concepts/](https://forgejo.org/docs/latest/user/actions/basic-concepts/)
45. Forgejo Actions | Reference | Forgejo Beyond coding. We forge., accessed on October 25, 2025, [https://forgejo.org/docs/latest/user/actions/](https://forgejo.org/docs/latest/user/actions/)
46. About custom actions \- GitHub Docs, accessed on October 25, 2025, [https://docs.github.com/actions/creating-actions/about-custom-actions](https://docs.github.com/actions/creating-actions/about-custom-actions)
47. Using Actions | Forgejo Beyond coding. We forge., accessed on October 25, 2025, [https://forgejo.org/docs/latest/user/actions/actions/](https://forgejo.org/docs/latest/user/actions/actions/)
48. serde\_yaml \- Rust \- Docs.rs, accessed on October 25, 2025, [https://docs.rs/serde-yaml](https://docs.rs/serde-yaml)
49. serde\_yaml \- crates.io: Rust Package Registry, accessed on October 25, 2025, [https://crates.io/crates/serde\_yaml](https://crates.io/crates/serde_yaml)
50. Serde and YAML-support status? \- community \- The Rust Programming Language Forum, accessed on October 25, 2025, [https://users.rust-lang.org/t/serde-and-yaml-support-status/125684](https://users.rust-lang.org/t/serde-and-yaml-support-status/125684)