# Solstice CI — Engineering Guidelines

This document records project-specific build, test, and development conventions for the Solstice CI workspace. It is written for experienced Rust developers and focuses only on details that are unique to this repository.

- Workspace root: `solstice-ci/`
- Crates: `crates/agent`, `crates/orchestrator`, `crates/forge-integration`, `crates/github-integration`, `crates/common`
- Rust edition: `2024`
- Docs policy: Any AI-generated markdown summaries not explicitly requested in a prompt must live under `docs/ai/` with a timestamp prefix in the filename (e.g., `2025-10-25-some-topic.md`).

## 1. Build and Configuration

- Use the stable toolchain unless explicitly noted. The workspace uses the 2024 edition; keep `rustup` and `cargo` updated.
- Top-level build:
  - Build everything: `cargo build --workspace`
  - Run individual binaries during development with `cargo run -p <crate>`.
- mise file tasks:
  - Tasks live under `.mise/tasks/` and are discovered automatically by mise.
  - List all available tasks: `mise tasks`
  - Run tasks with namespace-style names (directory -> `:`). Examples:
    - Build all (debug): `mise run build:all`
    - Build all (release): `mise run build:release`
    - Test all: `mise run test:all`
    - Start local deps (RabbitMQ): `mise run dev:up`
    - Stop local deps: `mise run dev:down`
    - Run orchestrator with local defaults: `mise run run:orchestrator`
    - Enqueue a sample job for the current repo/commit: `mise run run:forge-enqueue`
    - Serve the workflow runner for VMs to download (local dev): `mise run run:runner-serve`
    - End-to-end local flow (bring up deps, start orchestrator, enqueue one job, tail logs): `mise run ci:local`
- Notes for local VM downloads:
  - The orchestrator injects `SOLSTICE_RUNNER_URL` into cloud-init; `ci:local` sets this automatically by serving the runner from your host.
  - You can set `ORCH_CONTACT_ADDR` to the `host:port` where the runner should stream logs back (defaults to `GRPC_ADDR`).
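The `ORCH_CONTACT_ADDR` fallback described in the note above can be sketched in plain shell; the default `GRPC_ADDR` value here is illustrative only, not taken from the repository:

```shell
# Illustrative default only; the real value comes from your environment
# or from the mise tasks.
GRPC_ADDR="${GRPC_ADDR:-127.0.0.1:50051}"

# ORCH_CONTACT_ADDR falls back to GRPC_ADDR when unset, per the note above.
ORCH_CONTACT_ADDR="${ORCH_CONTACT_ADDR:-$GRPC_ADDR}"

echo "runner will stream logs to: $ORCH_CONTACT_ADDR"
```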
- Lints and formatting follow the default Rust style unless a crate specifies otherwise. Prefer `cargo fmt` and `cargo clippy --workspace --all-targets --all-features` before committing.
- Secrets and credentials are never committed. For local runs, use environment variables or a `.env` provider (do not add `.env` to VCS). In CI/deployments, use a secret store (e.g., Vault, KMS) — see the Integration layer notes.

Common configuration environment variables (pattern; per-service variables may diverge):

- Logging and tracing:
  - `RUST_LOG` (or use `tracing-subscriber` filters)
  - `OTEL_EXPORTER_OTLP_ENDPOINT` (e.g., `http://localhost:4317`)
  - `OTEL_SERVICE_NAME` (e.g., `solstice-orchestrator`)
- Database (Postgres via SeaORM):
  - `DATABASE_URL=postgres://user:pass@host:5432/solstice`
- Object storage (S3-compatible) and filesystem:
  - `S3_ENDPOINT`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`, `S3_BUCKET`
  - `BLOB_FS_ROOT` (filesystem-backed blob store root)

## 2. Error Handling, Tracing, and Telemetry

We standardize on the `miette` + `thiserror` error pattern and `tracing` with OpenTelemetry.

- `thiserror` is used for domain error enums/structs; avoid stringly-typed errors.
- Wrap top-level errors with `miette::Report` for rich diagnostic output in CLIs and service logs. Prefer `eyre`-style ergonomics via `miette::Result` where appropriate.
- Use `tracing` over `log`. Never mix ad-hoc `println!` into services; `println!` remains acceptable in intentionally minimal binaries or for very early bootstrap before subscribers are set.
- Emit spans/resources with OpenTelemetry. The shared initialization lives in the `common` crate so all binaries get consistent wiring.

Recommended initialization (placed in `crates/common` as a single public function and called from each service `main`):

```rust
// common/src/telemetry.rs (example; keep in `common`)

pub struct TelemetryGuard {
    // Drop to flush and shut down the OTEL pipeline.
    _guard: Option<tracing_appender::non_blocking::WorkerGuard>,
}

pub fn init_tracing(service_name: &str) -> miette::Result<TelemetryGuard> {
    use miette::IntoDiagnostic;
    use opentelemetry::sdk::trace as sdktrace;
    use opentelemetry::sdk::{propagation::TraceContextPropagator, Resource};
    use opentelemetry::KeyValue;
    use opentelemetry_otlp::WithExportConfig;
    use tracing_subscriber::{fmt, layer::SubscriberExt, util::SubscriberInitExt, EnvFilter};

    // Resource describing this service.
    let resource = Resource::new(vec![KeyValue::new("service.name", service_name.to_string())]);

    // OTLP exporter (gRPC) if an endpoint is present; otherwise, only console output.
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_trace_config(sdktrace::Config::default().with_resource(resource))
        .with_exporter(opentelemetry_otlp::new_exporter().tonic())
        .install_batch(opentelemetry::runtime::Tokio)
        .into_diagnostic()?;

    // Optional non-blocking writer (example) — keep simple unless needed.
    let (nb_writer, guard) = tracing_appender::non_blocking(std::io::stderr());
    let fmt_layer = fmt::layer()
        .with_target(false)
        .with_writer(nb_writer)
        .with_ansi(atty::is(atty::Stream::Stderr));

    let otel_layer = tracing_opentelemetry::layer().with_tracer(tracer);

    let filter = EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info"));

    opentelemetry::global::set_text_map_propagator(TraceContextPropagator::new());

    tracing_subscriber::registry()
        .with(filter)
        .with(fmt_layer)
        .with(otel_layer)
        .init();

    Ok(TelemetryGuard { _guard: Some(guard) })
}
```

In each service binary:

```rust
fn main() -> miette::Result<()> {
    let _telemetry = common::telemetry::init_tracing("solstice-orchestrator")?;
    // ...
    Ok(())
}
```

Guidance:

- Favor spans around orchestration lifecycles (VM provisioning, job execution, webhook processing). Include high-cardinality attributes judiciously; prefer stable keys (e.g., `job_id`) and sample IDs for verbose content.
- When returning errors from service entrypoints, prefer `miette::Report` with context to produce actionable diagnostics.

## 3. Database Access — SeaORM + Postgres

- Use SeaORM for all relational access; keep raw SQL to migrations or perf-critical paths only.
- Model organization:
  - Entities live in a dedicated `entities` module/crate, generated by SeaORM tooling or handwritten as needed.
  - Queries should be implemented in small, testable functions; avoid mixing business logic with query composition.
- Migrations:
  - Create a migrations crate (e.g., `crates/migration`) using the SeaORM CLI and point it at Postgres via `DATABASE_URL`.
  - Typical workflow:
    1. `sea-orm-cli migrate init`
    2. `sea-orm-cli migrate generate <name>`
    3. Edit up/down migrations.
    4. `sea-orm-cli migrate up` (or `refresh` in dev)
  - In services, run migrations on boot behind a feature flag or a dedicated admin command.
- Connection management:
  - Create a connection pool per service at startup (Tokio + async). Inject the pool into subsystems rather than using globals.
  - Use timeouts and statement logging (via `tracing`) for observability.

### 3.1 Connection Pooling — deadpool (Postgres)

We standardize on `deadpool` for async connection pooling to Postgres, using `deadpool-postgres` (+ `tokio-postgres`) alongside SeaORM. Services construct a single pool at startup and pass it into subsystems.

Guidelines:

- Configuration via env:
  - `DATABASE_URL` (Postgres connection string)
  - `DB_POOL_MAX_SIZE` (e.g., 16–64 depending on workload)
  - `DB_POOL_TIMEOUT_MS` (acquire timeout; default 5_000 ms)
- Enable TLS when required by your Postgres deployment (use `rustls` variants where possible). Avoid embedding credentials in code; prefer env/secret stores.
- Instrument queries with `tracing`. Prefer statement logging at `debug` with sampling in high-traffic paths.
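The pool-sizing knobs listed above can be read with plain `std` before any pool is constructed. A minimal sketch, assuming the variable names and defaults from these guidelines (`PoolEnv` and `pool_env` are illustrative names, not existing workspace APIs):

```rust
use std::time::Duration;

/// Pool knobs read from the environment, with the defaults suggested above.
/// The struct and function names here are hypothetical.
pub struct PoolEnv {
    pub max_size: usize,
    pub acquire_timeout: Duration,
}

pub fn pool_env() -> PoolEnv {
    let max_size = std::env::var("DB_POOL_MAX_SIZE")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(16); // low end of the suggested 16-64 range
    let timeout_ms: u64 = std::env::var("DB_POOL_TIMEOUT_MS")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(5_000); // default acquire timeout from the guidelines
    PoolEnv {
        max_size,
        acquire_timeout: Duration::from_millis(timeout_ms),
    }
}

fn main() {
    let cfg = pool_env();
    println!(
        "max_size={} acquire_timeout_ms={}",
        cfg.max_size,
        cfg.acquire_timeout.as_millis()
    );
}
```

Parsing these once at startup (and passing the struct into the pool builder) keeps the defaults in one place instead of scattering `env::var` calls across subsystems.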
Minimal setup sketch:

```rust
use std::time::Duration;

use deadpool_postgres::{Config as PgConfig, ManagerConfig, RecyclingMethod};
use miette::{IntoDiagnostic, Result};
use tokio_postgres::NoTls; // or a TLS connector

pub struct DbPool(deadpool_postgres::Pool);

pub fn build_db_pool() -> Result<DbPool> {
    let mut cfg = PgConfig::new();
    cfg.dbname = None; // use url
    cfg.user = None; // prefer url/env
    cfg.password = None;
    cfg.host = None;
    cfg.port = None;
    cfg.manager = Some(ManagerConfig { recycling_method: RecyclingMethod::Fast });

    // Use the URL from the environment; deadpool-postgres supports this via `from_env` as well.
    let cfg = PgConfig::from_env("DATABASE").unwrap_or(cfg);

    let pool = cfg
        .create_pool(Some(deadpool_postgres::Runtime::Tokio1), NoTls)
        .into_diagnostic()?;

    // Optionally wrap with a newtype to pass around.
    Ok(DbPool(pool))
}

pub async fn with_client<F, Fut, T>(pool: &DbPool, f: F) -> Result<T>
where
    F: FnOnce(deadpool_postgres::Client) -> Fut,
    Fut: std::future::Future<Output = Result<T>>,
{
    use tokio::time::timeout;

    let acquire_ms: u64 = std::env::var("DB_POOL_TIMEOUT_MS")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(5_000);

    let client = timeout(Duration::from_millis(acquire_ms), pool.0.get())
        .await
        .into_diagnostic()? // acquire timed out
        .into_diagnostic()?; // pool error

    f(client).await
}
```

Notes:

- Use one pool per service. Do not create pools per request.
- If using SeaORM, prefer constructing `Database::connect` from a `tokio_postgres::Client` via `sqlx` feature compatibility where applicable, or keep SeaORM for ORM work and use raw pooled clients for admin/utility tasks.

## 4. Blob Storage — S3 and Filesystem

We support both S3-compatible object storage and a local filesystem backend.

- Prefer a storage abstraction in `common` with a trait like `BlobStore` and two implementations:
  - `S3BlobStore` using the official `aws-sdk-s3` client (or a vetted S3-compatible client).
  - `FsBlobStore` rooted at `BLOB_FS_ROOT` for local/dev and tests.
- Selection is via configuration (env or CLI). Keep keys and endpoints out of the repo.
- For S3, follow best practices:
  - Use instance/role credentials where available; otherwise use env creds.
  - Set timeouts, retries, and backoff via the client config.
  - Keep bucket and path layout stable; include job IDs in keys for traceability.
- For the filesystem, ensure directories are created atomically and validate paths to avoid traversal issues.

## 5. Argument Parsing — Clap

- All binaries use `clap` for flags/subcommands. Derive-based APIs (`#[derive(Parser)]`) are strongly preferred for consistency and help text quality.
- Align flags across services where semantics match (e.g., `--log-level`, `--database-url`, `--s3-endpoint`).
- Emit `--help` that documents env var fallbacks where appropriate.

Example skeleton:

```rust
use clap::Parser;

#[derive(Parser, Debug)]
#[command(name = "solstice-orchestrator", version, about)]
struct Opts {
    /// Postgres connection string
    #[arg(long, env = "DATABASE_URL")]
    database_url: String,

    /// OTLP endpoint (e.g., http://localhost:4317)
    #[arg(long, env = "OTEL_EXPORTER_OTLP_ENDPOINT")]
    otlp: Option<String>,
}

fn main() -> miette::Result<()> {
    let _t = common::telemetry::init_tracing("solstice-orchestrator")?;
    let opts = Opts::parse();
    // ...
    Ok(())
}
```

## 6. Testing — How We Configure and Run Tests

- Unit tests: colocated in module files under `#[cfg(test)]`.
- Integration tests: per-crate `tests/` directory. Each `*.rs` file compiles as a separate test binary.
- Doc tests: keep examples correct, or mark them `no_run` if they require external services.
- Workspace commands we actually verified:
  - Run all tests: `cargo test --workspace`
  - Run a single crate's integration test (example we executed): `cargo test -p agent --test smoke`

Adding a new integration test:

1. Create `crates/<crate>/tests/<name>.rs`.
2. Write tests using the public API of the crate; avoid `#[cfg(test)]` internals.
3. Use `#[tokio::test(flavor = "multi_thread")]` when an async runtime is needed.
4. Gate external dependencies (DB, S3) behind env flags, or mark tests `#[ignore]` by default.

Example minimal integration test (we created and validated this during documentation):

```rust
#[test]
fn smoke_passes() {
    assert_eq!(2 + 2, 4);
}
```

Running subsets and filters:

- By crate: `cargo test -p orchestrator`
- By test target: `cargo test -p agent --test smoke`
- By name filter: `cargo test smoke`

## 7. Code Style and Conventions

- Error design:
  - Domain errors via `thiserror`. Convert them to `miette::Report` at boundaries (CLI/service entry) with context.
  - Prefer `Result<T, E>` where `E: Into<miette::Report>` for top-level flows.
- Telemetry:
  - Every request/job gets a root span; child spans for significant phases. Include IDs in span fields, not only in logs.
  - When logging sensitive data, strip or hash it before emitting.
- Async and runtime:
  - Tokio everywhere for async services. Use a single runtime per process; don't nest runtimes.
  - Use `tracing` instrument macros (`#[instrument]`) on important async fns.
- Crate versions:
  - Always prefer the most recent compatible releases for `miette`, `thiserror`, `tracing`, `tracing-subscriber`, `opentelemetry`, `opentelemetry-otlp`, `tracing-opentelemetry`, `sea-orm`, `sea-orm-cli`, `aws-sdk-s3`, and `clap`.
  - Avoid pinning minor/patch versions unless required for reproducibility or to work around regressions.

## 8. Local Development Tips

- Reproducible tasks: for interactions with external systems, prefer containerized dependencies (e.g., Postgres, MinIO) or `devcontainer`/`nix` flows if provided in the future.
- Migration safety: never auto-apply migrations in production without a maintenance window and a backup. In dev, `refresh` is acceptable.
- Storage backends: provide a no-op or filesystem fallback so developers can run without cloud credentials.
- Observability: keep logs structured and at `info` by default. For deep debugging, use `RUST_LOG=trace` with sampling to avoid log storms.

## 9. Troubleshooting Quick Notes

- No logs emitted: ensure the tracing subscriber is initialized exactly once; double-init will panic. Also check `RUST_LOG` filters.
- OTLP export fails: verify `OTEL_EXPORTER_OTLP_ENDPOINT` and that an OTLP collector (e.g., `otelcol`) is reachable. Fall back to console-only tracing if needed.
- DB connection errors: validate `DATABASE_URL` and SSL/TLS requirements. Confirm the service can resolve the host and reach the port.
- S3 errors: check credentials and bucket permissions; verify the endpoint (for MinIO, use path-style addressing or specific region settings).

## 10. Documentation Routing

- Architecture and AI-generated summaries: place in `docs/ai/` with a timestamp prefix.
- This guidelines file intentionally lives under `.junie/guidelines.md` for developer tooling.

## 11. Messaging — RabbitMQ (lapin)

We use `lapin` (an AMQP 0-9-1 client) for RabbitMQ access with Tokio. Keep producers and consumers simple and observable; centralize connection setup in the `common` crate where practical and inject channels into subsystems. This channel is used for asynchronous communication. For direct, synchronous service-to-service RPC, use gRPC via `tonic` (see §12).

Configuration (env; per-service may extend):

- `AMQP_URL` (e.g., `amqp://user:pass@host:5672/%2f`, or `amqps://...` for TLS)
- `AMQP_PREFETCH` (QoS prefetch; default 32–256 depending on workload)
- `AMQP_EXCHANGE` (default exchange name; often the empty string for the default direct exchange)
- `AMQP_QUEUE` (queue name)
- `AMQP_ROUTING_KEY` (routing key when publishing)

Guidelines:

- Establish one `Connection` per process with automatic heartbeats; create dedicated `Channel`s per producer/consumer task.
- Declare exchanges/queues idempotently at startup with `durable = true` and `auto_delete = false` unless explicitly ephemeral.
- Set channel QoS with `basic_qos(prefetch_count)` to control backpressure. Use ack/nack to preserve at-least-once delivery.
- Prefer publisher confirms (`confirm_select`) and handle `BasicReturn` for unroutable messages when `mandatory` is set.
- Instrument with `tracing`: tag spans with `exchange`, `queue`, `routing_key`, and message IDs; avoid logging bodies.
- Reconnection: on a connection/channel error, back off with jitter and recreate the connection/channels; ensure consumers re-declare topology.

Minimal producer example:

```rust
use lapin::{options::*, types::FieldTable, BasicProperties, Connection, ConnectionProperties};
use miette::{IntoDiagnostic as _, Result};
use tokio_amqp::*; // enables the Tokio reactor for lapin

pub async fn publish_one(msg: &[u8]) -> Result<()> {
    let url = std::env::var("AMQP_URL").unwrap_or_else(|_| "amqp://127.0.0.1:5672/%2f".into());
    let exchange = std::env::var("AMQP_EXCHANGE").unwrap_or_default();
    let routing_key = std::env::var("AMQP_ROUTING_KEY").unwrap_or_else(|_| "jobs".into());

    let conn = Connection::connect(&url, ConnectionProperties::default().with_tokio())
        .await
        .into_diagnostic()?;
    let channel = conn.create_channel().await.into_diagnostic()?;

    // Optional: declare the exchange if non-empty.
    if !exchange.is_empty() {
        channel
            .exchange_declare(
                &exchange,
                lapin::ExchangeKind::Direct,
                ExchangeDeclareOptions { durable: true, auto_delete: false, ..Default::default() },
                FieldTable::default(),
            )
            .await
            .into_diagnostic()?;
    }

    // Publisher confirms
    channel.confirm_select(ConfirmSelectOptions::default()).await.into_diagnostic()?;

    let confirm = channel
        .basic_publish(
            &exchange,
            &routing_key,
            BasicPublishOptions { mandatory: true, ..Default::default() },
            msg,
            BasicProperties::default().with_content_type("application/octet-stream".into()),
        )
        .await
        .into_diagnostic()?;
    confirm.await.into_diagnostic()?; // wait for the broker confirm

    Ok(())
}
```

Minimal consumer example:

```rust
use futures_util::StreamExt as _; // `Consumer` is a `Stream`
use lapin::{options::*, types::FieldTable, Connection, ConnectionProperties};
use miette::{IntoDiagnostic as _, Result};
use tokio_amqp::*;

pub async fn consume() -> Result<()> {
    let url = std::env::var("AMQP_URL").unwrap_or_else(|_| "amqp://127.0.0.1:5672/%2f".into());
    let queue = std::env::var("AMQP_QUEUE").unwrap_or_else(|_| "jobs".into());
    let prefetch: u16 = std::env::var("AMQP_PREFETCH").ok().and_then(|s| s.parse().ok()).unwrap_or(64);

    let conn = Connection::connect(&url, ConnectionProperties::default().with_tokio())
        .await
        .into_diagnostic()?;
    let channel = conn.create_channel().await.into_diagnostic()?;

    channel
        .queue_declare(
            &queue,
            QueueDeclareOptions { durable: true, auto_delete: false, ..Default::default() },
            FieldTable::default(),
        )
        .await
        .into_diagnostic()?;
    channel.basic_qos(prefetch, BasicQosOptions { global: false }).await.into_diagnostic()?;

    let mut consumer = channel
        .basic_consume(
            &queue,
            "worker",
            BasicConsumeOptions { no_ack: false, ..Default::default() },
            FieldTable::default(),
        )
        .await
        .into_diagnostic()?;

    while let Some(delivery) = consumer.next().await {
        let delivery = delivery.into_diagnostic()?;
        // process delivery.data ...
        channel.basic_ack(delivery.delivery_tag, BasicAckOptions::default()).await.into_diagnostic()?;
    }

    Ok(())
}
```

Notes:

- Use TLS (`amqps://`) where brokers require it; configure certificates via the underlying TLS connector if needed.
- Keep message payloads schematized (e.g., JSON/CBOR/Protobuf) and versioned; include an explicit `content_type` and a version header where applicable.

## 12. RPC — gRPC with tonic

We use gRPC for direct, synchronous service-to-service communication and standardize on the Rust `tonic` stack with `prost` for code generation. Prefer RabbitMQ (see §11) for asynchronous workflows, fan-out, or buffering; prefer gRPC when a caller needs an immediate response, strong request/response semantics, deadlines, and backpressure at the transport layer.

Guidance:

- Crates: `tonic`, `prost`, `prost-types`, optionally `tonic-health` (liveness/readiness) and `tonic-reflection` (dev only).
- Service boundaries: define protobuf packages per domain (`orchestrator.v1`, `agent.v1`). Version packages; keep changes backward compatible (add fields with new tags; do not reuse or rename tags).
- Errors: map domain errors to `tonic::Status` with an appropriate `Code` (e.g., `InvalidArgument`, `NotFound`, `FailedPrecondition`, `Unavailable`, `Internal`). Preserve rich diagnostics in logs via `miette` and `tracing`; avoid leaking internals to clients.
- Deadlines and cancellation: require callers to set deadlines; servers must honor the caller's deadline (`grpc-timeout`) and cancellation. Set sensible server timeouts.
- Observability: propagate W3C TraceContext over gRPC metadata and create a span per RPC. Emit attributes for `rpc.system = "grpc"`, `rpc.service`, and `rpc.method`, plus stable IDs (`job_id`) as fields.
- Security: prefer TLS with `rustls`. Use mTLS where feasible, or bearer tokens in metadata (e.g., `authorization: Bearer ...`). Rotate certs without restarts where possible.
- Operations: configure keepalive, max message sizes, and concurrency limits.
  Use streaming RPCs for log streaming and large artifact transfers where applicable.

Common env configuration (typical; per-service may extend):

- `GRPC_ADDR` or `GRPC_HOST`/`GRPC_PORT` (listen/connect endpoint, e.g., `0.0.0.0:50051`)
- `GRPC_TLS_CERT`, `GRPC_TLS_KEY`, `GRPC_TLS_CA` (PEM paths for TLS/mTLS)
- `GRPC_KEEPALIVE_MS` (e.g., 20_000), `GRPC_MAX_MESSAGE_MB` (e.g., 32), `GRPC_TIMEOUT_MS` (client default)

Minimal server skeleton (centralize wiring in `common`):

```rust
use miette::{IntoDiagnostic as _, Result};
use tonic::{transport::Server, Request, Response, Status};
use tracing::{info, instrument};

// Assume a prost-generated module:
pub mod orchestrator {
    pub mod v1 {
        include!("orchestrator.v1.rs");
    }
}
use orchestrator::v1::{
    job_service_server::{JobService, JobServiceServer},
    StartJobRequest, StartJobResponse,
};

pub struct JobSvc;

#[tonic::async_trait]
impl JobService for JobSvc {
    #[instrument(
        name = "rpc_start_job",
        skip(self, req),
        fields(rpc.system = "grpc", rpc.service = "orchestrator.v1.JobService", rpc.method = "StartJob")
    )]
    async fn start_job(&self, req: Request<StartJobRequest>) -> Result<Response<StartJobResponse>, Status> {
        let r = req.into_inner();
        info!(job_id = %r.job_id, "start_job request");
        // ... do work, map domain errors to Status ...
        Ok(Response::new(StartJobResponse { accepted: true }))
    }
}

pub async fn serve(addr: std::net::SocketAddr) -> Result<()> {
    let _t = common::telemetry::init_tracing("solstice-orchestrator")?;
    Server::builder()
        // Server-side deadline; caller `grpc-timeout` values are honored by tonic.
        .timeout(std::time::Duration::from_secs(30))
        .add_service(JobServiceServer::new(JobSvc))
        .serve(addr)
        .await
        .into_diagnostic()
}
```

Minimal client sketch (with a timeout; TLS optional):

```rust
use miette::{IntoDiagnostic as _, Result};
use tonic::{transport::{Channel, ClientTlsConfig, Endpoint}, Code, Request};

use orchestrator::v1::{job_service_client::JobServiceClient, StartJobRequest};

pub async fn start_job(addr: &str, job_id: String) -> Result<bool> {
    let ep = Endpoint::from_shared(format!("https://{addr}"))
        .into_diagnostic()?
        .tcp_keepalive(Some(std::time::Duration::from_secs(20)));
    // let ep = ep.tls_config(ClientTlsConfig::new()).into_diagnostic()?; // enable when TLS is configured

    let channel: Channel = ep
        .connect_timeout(std::time::Duration::from_secs(3))
        .connect()
        .await
        .into_diagnostic()?;
    let mut client = JobServiceClient::new(channel);

    let mut req = Request::new(StartJobRequest { job_id });
    req.set_timeout(std::time::Duration::from_secs(10)); // propagated as `grpc-timeout`

    match client.start_job(req).await {
        Ok(resp) => Ok(resp.into_inner().accepted),
        Err(status) if status.code() == Code::Unavailable => Ok(false), // map transient unavailability
        Err(e) => Err(miette::miette!("gRPC error: {e}")),
    }
}
```

Health and reflection:

- Expose `grpc.health.v1.Health` via `tonic-health` for k8s/consumers. Include a readiness check (DB pool, AMQP connection) before reporting `SERVING`.
- Enable `tonic-reflection` only in dev/test to assist tooling; disable it in prod.

Testing notes:

- Use `#[tokio::test(flavor = "multi_thread")]` and bind the server to `127.0.0.1:0` (ephemeral port) for integration tests.
- Assert deadlines and cancellation by setting short timeouts on the client `Request` and verifying `Status::deadline_exceeded`.

Cross-reference:

- Asynchronous communication: RabbitMQ via `lapin` (§11).
- Direct synchronous RPC: gRPC via `tonic` (this section).
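To tie the env configuration in §12 together, resolving the listen endpoint from `GRPC_ADDR` (with a `GRPC_HOST`/`GRPC_PORT` fallback) needs only `std`. A sketch with illustrative defaults (`grpc_addr` is a hypothetical helper, not an existing workspace API):

```rust
use std::net::SocketAddr;

/// Resolve the gRPC listen address from `GRPC_ADDR`, falling back to
/// `GRPC_HOST`/`GRPC_PORT`. The defaults below are illustrative only.
pub fn grpc_addr() -> Result<SocketAddr, std::net::AddrParseError> {
    let addr = std::env::var("GRPC_ADDR").unwrap_or_else(|_| {
        let host = std::env::var("GRPC_HOST").unwrap_or_else(|_| "0.0.0.0".into());
        let port = std::env::var("GRPC_PORT").unwrap_or_else(|_| "50051".into());
        format!("{host}:{port}")
    });
    addr.parse()
}

fn main() {
    match grpc_addr() {
        Ok(addr) => println!("gRPC listening on {addr}"),
        Err(e) => eprintln!("invalid gRPC address: {e}"),
    }
}
```

Failing fast on a malformed address at startup (rather than inside `serve`) gives a clearer diagnostic when the env is misconfigured.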