Solstice CI — Engineering Guidelines
This document records project-specific build, test, and development conventions for the Solstice CI workspace. It is written for experienced Rust developers and focuses only on details that are unique to this repository.
- Workspace root: `solstice-ci/`
- Crates: `crates/agent`, `crates/orchestrator`, `crates/forge-integration`, `crates/github-integration`, `crates/common`
- Rust edition: 2024
- Docs policy: Any AI-generated markdown summaries not explicitly requested in a prompt must live under `docs/ai/` with a timestamp prefix in the filename (e.g., `2025-10-25-some-topic.md`).
1. Build and Configuration
- Use the stable toolchain unless explicitly noted. The workspace uses the 2024 edition; keep `rustup` and `cargo` updated.
- Top-level build:
  - Build everything: `cargo build --workspace`
  - Run individual binaries during development using `cargo run -p <crate>`.
- Lints and formatting follow the default Rust style unless a crate specifies otherwise. Prefer `cargo fmt` and `cargo clippy --workspace --all-targets --all-features` before committing.
- Secrets and credentials are never committed. For local runs, use environment variables or a `.env` provider (do not add `.env` to VCS). In CI/deployments, use a secret store (e.g., Vault, KMS) — see the Integration layer notes.
Common configuration environment variables (pattern; per-service variables may diverge):
- Logging and tracing: `RUST_LOG` (or use `tracing-subscriber` filters), `OTEL_EXPORTER_OTLP_ENDPOINT` (e.g., `http://localhost:4317`), `OTEL_SERVICE_NAME` (e.g., `solstice-orchestrator`)
- Database (Postgres via SeaORM): `DATABASE_URL=postgres://user:pass@host:5432/solstice`
- Object storage (S3-compatible) and filesystem: `S3_ENDPOINT`, `AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_REGION`, `S3_BUCKET`; `BLOB_FS_ROOT` (filesystem-backed blob store root)
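To keep per-service config code uniform, a small parsing helper is handy. This is a sketch, not part of the workspace today; the helper names (`parse_or`, `env_or`) are illustrative:

```rust
use std::str::FromStr;

/// Parse an optional raw value, falling back to a default when the value
/// is absent or unparsable. Split out from env access so it is testable.
pub fn parse_or<T: FromStr>(raw: Option<String>, default: T) -> T {
    raw.and_then(|s| s.parse().ok()).unwrap_or(default)
}

/// Read and parse an env var with a default (e.g., DB_POOL_TIMEOUT_MS).
pub fn env_or<T: FromStr>(key: &str, default: T) -> T {
    parse_or(std::env::var(key).ok(), default)
}
```

Services can then write `let timeout_ms: u64 = env_or("DB_POOL_TIMEOUT_MS", 5_000);` instead of repeating the `var`/`parse`/`unwrap_or` chain.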
2. Error Handling, Tracing, and Telemetry
We standardize on the miette + thiserror error pattern and tracing with OpenTelemetry.
- `thiserror` is used for domain error enums/structs; avoid stringly-typed errors.
- Wrap top-level errors with `miette::Report` for rich diagnostic output in CLIs and service logs. Prefer eyre-style ergonomics via `miette::Result<T>` where appropriate.
- Use `tracing` over `log`. Never use ad-hoc `println!` in services; `println!` remains acceptable in intentionally minimal binaries or for very early bootstrap before subscribers are set.
- Emit spans/resources with OpenTelemetry. The shared initialization lives in the `common` crate so all binaries get consistent wiring.
Recommended initialization (placed in crates/common as a single public function and called from each service main):
```rust
// common/src/telemetry.rs (example; keep in `common`)
pub struct TelemetryGuard {
    // Dropping this flushes the non-blocking log writer; call
    // opentelemetry::global::shutdown_tracer_provider() on exit to flush OTEL.
    _guard: Option<tracing_appender::non_blocking::WorkerGuard>,
}

pub fn init_tracing(service_name: &str) -> miette::Result<TelemetryGuard> {
    use miette::IntoDiagnostic;
    use opentelemetry::sdk::{propagation::TraceContextPropagator, Resource};
    use opentelemetry::sdk::trace as sdktrace;
    use opentelemetry::KeyValue;
    use opentelemetry_otlp::WithExportConfig;
    use tracing_subscriber::{fmt, layer::SubscriberExt, util::SubscriberInitExt, EnvFilter};

    // Resource describing this service.
    let resource = Resource::new(vec![KeyValue::new("service.name", service_name.to_string())]);

    // OTLP exporter (gRPC); honors OTEL_EXPORTER_OTLP_ENDPOINT from the environment.
    let tracer = opentelemetry_otlp::new_pipeline()
        .tracing()
        .with_trace_config(sdktrace::Config::default().with_resource(resource))
        .with_exporter(opentelemetry_otlp::new_exporter().tonic())
        .install_batch(opentelemetry::runtime::Tokio)
        .into_diagnostic()?;

    // Optional non-blocking writer (example) — keep simple unless needed.
    let (nb_writer, guard) = tracing_appender::non_blocking(std::io::stderr());
    let fmt_layer = fmt::layer()
        .with_target(false)
        .with_writer(nb_writer)
        .with_ansi(atty::is(atty::Stream::Stderr));
    let otel_layer = tracing_opentelemetry::layer().with_tracer(tracer);
    let filter = EnvFilter::try_from_default_env().unwrap_or_else(|_| EnvFilter::new("info"));

    opentelemetry::global::set_text_map_propagator(TraceContextPropagator::new());
    tracing_subscriber::registry()
        .with(filter)
        .with(fmt_layer)
        .with(otel_layer)
        .init();

    Ok(TelemetryGuard { _guard: Some(guard) })
}
```
In each service binary:
```rust
fn main() -> miette::Result<()> {
    let _telemetry = common::telemetry::init_tracing("solstice-orchestrator")?;
    // ...
    Ok(())
}
```
Guidance:
- Favor spans around orchestration lifecycles (VM provisioning, job execution, webhook processing). Include high-cardinality attributes judiciously; prefer stable keys (e.g., `job_id`) and sample IDs for verbose content.
- When returning errors from service entrypoints, prefer `miette::Report` with context to produce actionable diagnostics.
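The span guidance above might look like the following sketch (assuming the `tracing` crate is wired up via `init_tracing`; the function and field names are illustrative):

```rust
use tracing::{info, instrument};

/// Root span per job; `job_id` is a stable key suitable for indexing.
#[instrument(name = "job_execution", skip(payload), fields(job_id = %job_id))]
async fn execute_job(job_id: &str, payload: &[u8]) -> miette::Result<()> {
    // Child span for a significant phase.
    let provision = tracing::info_span!("vm_provisioning");
    {
        let _enter = provision.enter(); // keep guards out of .await points
        // ... provision VM ...
    }
    info!(bytes = payload.len(), "executing job");
    // ... run job ...
    Ok(())
}
```

Note that span guards must not be held across `.await` points; use `#[instrument]` or `Future::instrument` for async phases instead.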
3. Database Access — SeaORM + Postgres
- Use SeaORM for all relational access; keep raw SQL to migrations or perf-critical paths only.
- Model organization:
  - Entities live in a dedicated `entities` module/crate generated by SeaORM tooling or handwritten as needed.
  - Queries should be implemented in small, testable functions; avoid mixing business logic with query composition.
- Migrations:
  - Create a migrations crate (e.g., `crates/migration`) using SeaORM CLI and point it at Postgres via `DATABASE_URL`.
  - Typical workflow:
    - `sea-orm-cli migrate init`
    - `sea-orm-cli migrate generate <name>`
    - Edit up/down migrations.
    - `sea-orm-cli migrate up` (or `refresh` in dev)
  - In services, run migrations on boot behind a feature flag or a dedicated admin command.
- Connection management:
  - Create a connection pool per service at startup (Tokio + async). Inject the pool into subsystems rather than using globals.
  - Use timeouts and statement logging (via `tracing`) for observability.
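A small, testable query function in the style described above might look like this sketch (the `entities::job` module and its `Status` column are assumptions for illustration, not existing code):

```rust
use sea_orm::{ColumnTrait, DatabaseConnection, DbErr, EntityTrait, QueryFilter};

// Assumes a generated entity module `entities::job` with a `Status` column.
use crate::entities::job;

/// Query composition only; callers apply business logic to the results.
pub async fn find_queued_jobs(db: &DatabaseConnection) -> Result<Vec<job::Model>, DbErr> {
    job::Entity::find()
        .filter(job::Column::Status.eq("queued"))
        .all(db)
        .await
}
```

Keeping functions this narrow makes them easy to exercise against a containerized Postgres in integration tests.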
3.1 Connection Pooling — deadpool (Postgres)
We standardize on deadpool for async connection pooling to Postgres, using deadpool-postgres (+ tokio-postgres) alongside SeaORM. Services construct a single pool at startup and pass it into subsystems.
Guidelines:
- Configuration via env: `DATABASE_URL` (Postgres connection string), `DB_POOL_MAX_SIZE` (e.g., 16–64 depending on workload), `DB_POOL_TIMEOUT_MS` (acquire timeout; default 5_000 ms)
- Enable TLS when required by your Postgres deployment (use `rustls` variants where possible). Avoid embedding credentials in code; prefer env/secret stores.
- Instrument queries with `tracing`. Prefer statement logging at `debug` with sampling in high-traffic paths.
Minimal setup sketch:
```rust
use std::time::Duration;

use deadpool_postgres::{Config as PgConfig, ManagerConfig, RecyclingMethod};
use miette::{IntoDiagnostic, Result};
use tokio_postgres::NoTls; // or a TLS connector

pub struct DbPool(deadpool_postgres::Pool);

pub fn build_db_pool() -> Result<DbPool> {
    // Prefer configuration from env (DATABASE_* variables); fall back to defaults.
    let mut cfg = PgConfig::from_env("DATABASE").unwrap_or_else(|_| PgConfig::new());
    cfg.manager = Some(ManagerConfig { recycling_method: RecyclingMethod::Fast });
    let pool = cfg
        .create_pool(Some(deadpool_postgres::Runtime::Tokio1), NoTls)
        .into_diagnostic()?;
    // Wrap with a newtype to pass around.
    Ok(DbPool(pool))
}

pub async fn with_client<F, Fut, T>(pool: &DbPool, f: F) -> Result<T>
where
    F: FnOnce(deadpool_postgres::Client) -> Fut,
    Fut: std::future::Future<Output = Result<T>>,
{
    use tokio::time::timeout;
    let timeout_ms: u64 = std::env::var("DB_POOL_TIMEOUT_MS")
        .ok()
        .and_then(|s| s.parse().ok())
        .unwrap_or(5_000);
    let client = timeout(Duration::from_millis(timeout_ms), pool.0.get())
        .await
        .into_diagnostic()? // acquire timed out
        .into_diagnostic()?; // pool error
    f(client).await
}
```
Notes:
- Use one pool per service. Do not create pools per request.
- SeaORM manages its own (sqlx-based) pool via `Database::connect`; it does not consume `tokio_postgres` clients. Keep SeaORM for ORM access and use raw deadpool-postgres clients for admin/utility tasks where the ORM gets in the way.
4. Blob Storage — S3 and Filesystem
We support both S3-compatible object storage and a local filesystem backend.
- Prefer a storage abstraction in `common` with a trait like `BlobStore` and two implementations:
  - `S3BlobStore` using the official `aws-sdk-s3` client (or a vetted S3-compatible client).
  - `FsBlobStore` rooted at `BLOB_FS_ROOT` for local/dev and tests.
- Selection is via configuration (env or CLI). Keep keys and endpoints out of the repo.
- For S3, follow best practices:
  - Use instance/role credentials where available; otherwise use env creds.
  - Set timeouts, retries, and backoff via the client config.
  - Keep bucket and path layout stable; include job IDs in keys for traceability.
- For filesystem, ensure directories are created atomically and validate paths to avoid traversal issues.
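The filesystem backend's path validation and atomic writes can be sketched as follows. This is illustrative, synchronous for brevity (a real `FsBlobStore` would be async), and the method names are assumptions:

```rust
use std::fs;
use std::io;
use std::path::{Component, Path, PathBuf};

/// Filesystem-backed blob store rooted at BLOB_FS_ROOT (illustrative sketch).
pub struct FsBlobStore {
    root: PathBuf,
}

impl FsBlobStore {
    pub fn new(root: impl Into<PathBuf>) -> Self {
        Self { root: root.into() }
    }

    /// Reject empty keys, absolute paths, and `..` components to prevent traversal.
    fn resolve(&self, key: &str) -> io::Result<PathBuf> {
        let all_normal = Path::new(key).components().all(|c| matches!(c, Component::Normal(_)));
        if key.is_empty() || !all_normal {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "invalid blob key"));
        }
        Ok(self.root.join(key))
    }

    pub fn put(&self, key: &str, data: &[u8]) -> io::Result<()> {
        let path = self.resolve(key)?;
        if let Some(parent) = path.parent() {
            fs::create_dir_all(parent)?; // idempotent, safe under concurrency
        }
        // Write to a sibling temp file, then rename for atomic visibility.
        // (Note: the `.tmp` suffix could collide with a real key in production code.)
        let tmp = path.with_extension("tmp");
        fs::write(&tmp, data)?;
        fs::rename(&tmp, &path)
    }

    pub fn get(&self, key: &str) -> io::Result<Vec<u8>> {
        fs::read(self.resolve(key)?)
    }
}
```

The rename-based write means readers never observe a partially written blob, which matters when jobs poll for artifacts.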
5. Argument Parsing — Clap
- All binaries use `clap` for flags/subcommands. Derive-based APIs (`#[derive(Parser)]`) are strongly preferred for consistency and help text quality.
- Align flags across services where semantics match (e.g., `--log-level`, `--database-url`, `--s3-endpoint`).
- Emit `--help` output that documents env var fallbacks where appropriate.
Example skeleton:
```rust
use clap::Parser;

#[derive(Parser, Debug)]
#[command(name = "solstice-orchestrator", version, about)]
struct Opts {
    /// Postgres connection string
    #[arg(long, env = "DATABASE_URL")]
    database_url: String,
    /// OTLP endpoint (e.g., http://localhost:4317)
    #[arg(long, env = "OTEL_EXPORTER_OTLP_ENDPOINT")]
    otlp: Option<String>,
}

fn main() -> miette::Result<()> {
    let _t = common::telemetry::init_tracing("solstice-orchestrator")?;
    let opts = Opts::parse();
    // ...
    Ok(())
}
```
6. Testing — How We Configure and Run Tests
- Unit tests: colocated in module files under `#[cfg(test)]`.
- Integration tests: per-crate `tests/` directory. Each `*.rs` file compiles as a separate test binary.
- Doc tests: keep examples correct or mark them `no_run` if they require external services.
- Workspace commands we actually verified:
  - Run all tests: `cargo test --workspace`
  - Run a single crate's integration test (example we executed): `cargo test -p agent --test smoke`
Adding a new integration test:
- Create `crates/<crate>/tests/<name>.rs`.
- Write tests using the public API of the crate; avoid reaching into `#[cfg(test)]` internals.
- Use `#[tokio::test(flavor = "multi_thread")]` when an async runtime is needed.
- Gate external dependencies (DB, S3) behind env flags or mark tests `#[ignore]` by default.
Example minimal integration test (we created and validated this during documentation):
```rust
#[test]
fn smoke_passes() {
    assert_eq!(2 + 2, 4);
}
```
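Env-gating external dependencies, as recommended above, can be sketched like this. The `SOLSTICE_TEST_EXTERNAL` flag name is an assumption for illustration:

```rust
/// True when external-dependency tests are enabled (illustrative flag name).
fn external_deps_enabled() -> bool {
    std::env::var("SOLSTICE_TEST_EXTERNAL").map(|v| v == "1").unwrap_or(false)
}

#[test]
fn db_roundtrip() {
    if !external_deps_enabled() {
        eprintln!("skipping: set SOLSTICE_TEST_EXTERNAL=1 to run");
        return;
    }
    // ... connect via DATABASE_URL and exercise the DB ...
}
```

An early `return` keeps the test green in default runs; use `#[ignore]` instead if you want skipped tests to be visible in the test summary.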
Running subsets and filters:
- By crate: `cargo test -p orchestrator`
- By test target: `cargo test -p agent --test smoke`
- By name filter: `cargo test smoke`
7. Code Style and Conventions
- Error design:
  - Domain errors via `thiserror`. Convert them to `miette::Report` at boundaries (CLI/service entry) with context.
  - Prefer `Result<T, E>` where `E: Into<miette::Report>` for top-level flows.
- Telemetry:
  - Every request/job gets a root span; child spans for significant phases. Include IDs in span fields, not only logs.
  - When logging sensitive data, strip or hash before emitting.
- Async and runtime:
  - Tokio everywhere for async services. Use a single runtime per process; don't nest.
  - Use `tracing` instrument macros (`#[instrument]`) for important async fns.
- Crate versions:
  - Always prefer the most recent compatible releases for `miette`, `thiserror`, `tracing`, `tracing-subscriber`, `opentelemetry`, `opentelemetry-otlp`, `tracing-opentelemetry`, `sea-orm`, `sea-orm-cli`, `aws-sdk-s3`, and `clap`.
  - Avoid pinning minor/patch versions unless required for reproducibility or to work around regressions.
8. Local Development Tips
- Reproducible tasks: For interactions with external systems, prefer containerized dependencies (e.g., Postgres, MinIO) or `devcontainer`/`nix` flows if provided in the future.
- Migration safety: Never auto-apply migrations in production without a maintenance window and backup. In dev, `refresh` is acceptable.
- Storage backends: Provide a no-op or filesystem fallback so developers can run without cloud credentials.
- Observability: Keep logs structured and at `info` by default. For deep debugging, use `RUST_LOG=trace` with sampling to avoid log storms.
9. Troubleshooting Quick Notes
- No logs emitted: Ensure the tracing subscriber is initialized exactly once; double-init will panic. Also check `RUST_LOG` filters.
- OTLP export fails: Verify `OTEL_EXPORTER_OTLP_ENDPOINT` and that an OTLP collector (e.g., `otelcol`) is reachable. Fall back to console-only tracing if needed.
- DB connection errors: Validate `DATABASE_URL` and SSL/TLS requirements. Confirm the service can resolve the host and reach the port.
- S3 errors: Check credentials and bucket permissions; verify the endpoint (for MinIO, use path-style addressing or specific region settings).
10. Documentation Routing
- Architecture and AI-generated summaries: place in `docs/ai/` with a timestamp prefix.
- This guidelines file intentionally lives under `.junie/guidelines.md` for developer tooling.
11. Messaging — RabbitMQ (lapin)
We use lapin (AMQP 0-9-1 client) for RabbitMQ access with Tokio. Keep producers and consumers simple and observable; centralize connection setup in the common crate where practical and inject channels into subsystems. This channel is used for asynchronous communication. For direct, synchronous service-to-service RPC, use gRPC via tonic (see §12).
Configuration (env; per-service may extend):
- `AMQP_URL` (e.g., `amqp://user:pass@host:5672/%2f`, or `amqps://...` for TLS)
- `AMQP_PREFETCH` (QoS prefetch; default 32–256 depending on workload)
- `AMQP_EXCHANGE` (default exchange name; often the empty string for the default direct exchange)
- `AMQP_QUEUE` (queue name)
- `AMQP_ROUTING_KEY` (routing key when publishing)
Guidelines:
- Establish one `Connection` per process with automatic heartbeats; create dedicated `Channel`s per producer/consumer task.
- Declare exchanges/queues idempotently at startup with `durable = true` and `auto_delete = false` unless explicitly ephemeral.
- Set channel QoS with `basic_qos(prefetch_count)` to control backpressure. Use ack/nack to preserve at-least-once delivery.
- Prefer publisher confirms (`confirm_select`) and handle `BasicReturn` for unroutable messages when `mandatory` is set.
- Instrument with `tracing`: tag spans with `exchange`, `queue`, `routing_key`, and message IDs; avoid logging bodies.
- Reconnection: on connection/channel error, back off with jitter and recreate the connection/channels; ensure consumers re-declare topology.
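The reconnection backoff in the last bullet can be sketched with the standard library alone. This is illustrative; real code might use the `rand` crate (or `backoff`/`tokio-retry`) instead of the cheap clock-based jitter shown here:

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Exponential backoff with jitter for reconnect loops (illustrative).
/// Attempt 0 waits ~`base`; each retry doubles, capped at `max`. Jitter
/// (0..25% of the delay) keeps a fleet from reconnecting in lockstep.
pub fn backoff_with_jitter(attempt: u32, base: Duration, max: Duration) -> Duration {
    let factor = 1u32.checked_shl(attempt.min(16)).unwrap_or(u32::MAX);
    let capped = base.saturating_mul(factor).min(max);
    // Cheap jitter source: sub-second clock nanos.
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .subsec_nanos() as u64;
    let quarter_ms = (capped.as_millis() as u64) / 4;
    let jitter = Duration::from_millis(if quarter_ms == 0 { 0 } else { nanos % quarter_ms });
    capped + jitter
}
```

A consumer loop would sleep for `backoff_with_jitter(attempt, Duration::from_millis(100), Duration::from_secs(30))` after each failure and reset `attempt` once topology is re-declared successfully.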
Minimal producer example:
```rust
use lapin::{options::*, types::FieldTable, BasicProperties, Connection, ConnectionProperties};
use miette::{IntoDiagnostic as _, Result};
use tokio_amqp::*; // enables the Tokio reactor for lapin

pub async fn publish_one(msg: &[u8]) -> Result<()> {
    let url = std::env::var("AMQP_URL").unwrap_or_else(|_| "amqp://127.0.0.1:5672/%2f".into());
    let exchange = std::env::var("AMQP_EXCHANGE").unwrap_or_default();
    let routing_key = std::env::var("AMQP_ROUTING_KEY").unwrap_or_else(|_| "jobs".into());

    let conn = Connection::connect(&url, ConnectionProperties::default().with_tokio())
        .await
        .into_diagnostic()?;
    let channel = conn.create_channel().await.into_diagnostic()?;

    // Optional: declare the exchange if non-empty.
    if !exchange.is_empty() {
        channel
            .exchange_declare(
                &exchange,
                lapin::ExchangeKind::Direct,
                ExchangeDeclareOptions { durable: true, auto_delete: false, ..Default::default() },
                FieldTable::default(),
            )
            .await
            .into_diagnostic()?;
    }

    // Publisher confirms
    channel.confirm_select(ConfirmSelectOptions::default()).await.into_diagnostic()?;

    let confirm = channel
        .basic_publish(
            &exchange,
            &routing_key,
            BasicPublishOptions { mandatory: true, ..Default::default() },
            msg,
            BasicProperties::default().with_content_type("application/octet-stream".into()),
        )
        .await
        .into_diagnostic()?;
    confirm.await.into_diagnostic()?; // wait for the broker's confirm

    Ok(())
}
```
Minimal consumer example:
```rust
use futures_util::StreamExt as _; // `consumer.next()` comes from StreamExt
use lapin::{options::*, types::FieldTable, Connection, ConnectionProperties};
use miette::{IntoDiagnostic as _, Result};
use tokio_amqp::*;

pub async fn consume() -> Result<()> {
    let url = std::env::var("AMQP_URL").unwrap_or_else(|_| "amqp://127.0.0.1:5672/%2f".into());
    let queue = std::env::var("AMQP_QUEUE").unwrap_or_else(|_| "jobs".into());
    let prefetch: u16 = std::env::var("AMQP_PREFETCH").ok().and_then(|s| s.parse().ok()).unwrap_or(64);

    let conn = Connection::connect(&url, ConnectionProperties::default().with_tokio())
        .await
        .into_diagnostic()?;
    let channel = conn.create_channel().await.into_diagnostic()?;

    channel
        .queue_declare(
            &queue,
            QueueDeclareOptions { durable: true, auto_delete: false, ..Default::default() },
            FieldTable::default(),
        )
        .await
        .into_diagnostic()?;
    channel.basic_qos(prefetch, BasicQosOptions { global: false }).await.into_diagnostic()?;

    let mut consumer = channel
        .basic_consume(
            &queue,
            "worker",
            BasicConsumeOptions { no_ack: false, ..Default::default() },
            FieldTable::default(),
        )
        .await
        .into_diagnostic()?;

    while let Some(delivery) = consumer.next().await {
        let delivery = delivery.into_diagnostic()?;
        // process delivery.data ...
        channel.basic_ack(delivery.delivery_tag, BasicAckOptions::default()).await.into_diagnostic()?;
    }
    Ok(())
}
```
Notes:
- Use TLS (`amqps://`) where brokers require it; configure certificates via the underlying TLS connector if needed.
- Keep message payloads schematized (e.g., JSON/CBOR/Protobuf) and versioned; include an explicit `content_type` and a version header where applicable.
12. RPC — gRPC with tonic
We use gRPC for direct, synchronous service-to-service communication and standardize on the Rust tonic stack with prost for code generation. Prefer RabbitMQ (see §11) for asynchronous workflows, fan-out, or buffering; prefer gRPC when a caller needs an immediate response, strong request/response semantics, deadlines, and backpressure at the transport layer.
Guidance:
- Crates: `tonic`, `prost`, `prost-types`, optionally `tonic-health` (liveness/readiness) and `tonic-reflection` (dev only).
- Service boundaries: Define protobuf packages per domain (`orchestrator.v1`, `agent.v1`). Version packages; keep changes backward compatible (add fields with new tags; do not reuse or rename tags).
- Errors: Map domain errors to `tonic::Status` with an appropriate `Code` (e.g., `InvalidArgument`, `NotFound`, `FailedPrecondition`, `Unavailable`, `Internal`). Preserve rich diagnostics in logs via `miette` and `tracing`; avoid leaking internals to clients.
- Deadlines and cancellation: Require callers to set deadlines (propagated as `grpc-timeout` metadata); servers must tolerate cancellation, since the handler future is dropped when the caller gives up. Set sensible server timeouts.
- Observability: Propagate W3C TraceContext over gRPC metadata and create spans per RPC. Emit attributes for `rpc.system = "grpc"`, `rpc.service`, `rpc.method`, and stable IDs (`job_id`) as fields.
- Security: Prefer TLS with `rustls`. Use mTLS where feasible, or bearer tokens in metadata (e.g., `authorization: Bearer ...`). Rotate certs without restarts where possible.
- Operations: Configure keepalive, max message sizes, and concurrency. Use streaming RPCs for log streaming and large artifact transfers when applicable.
Common env configuration (typical; per-service may extend):
- `GRPC_ADDR` or `GRPC_HOST`/`GRPC_PORT` (listen/connect endpoint, e.g., `0.0.0.0:50051`)
- `GRPC_TLS_CERT`, `GRPC_TLS_KEY`, `GRPC_TLS_CA` (PEM paths for TLS/mTLS)
- `GRPC_KEEPALIVE_MS` (e.g., 20_000), `GRPC_MAX_MESSAGE_MB` (e.g., 32), `GRPC_TIMEOUT_MS` (client default)
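Resolving the listen address from these variables might look like the following sketch; the precedence (`GRPC_ADDR` wins over `GRPC_HOST`/`GRPC_PORT`, with `0.0.0.0:50051` as the final fallback) is an assumption, not an established convention:

```rust
use std::net::SocketAddr;

/// Resolve the gRPC listen address: full `addr` wins, then host/port parts,
/// then defaults. Takes Options so the env lookup stays at the call site.
pub fn grpc_listen_addr_from(
    addr: Option<&str>,
    host: Option<&str>,
    port: Option<&str>,
) -> Result<SocketAddr, std::net::AddrParseError> {
    match addr {
        Some(a) => a.parse(),
        None => {
            let host = host.unwrap_or("0.0.0.0");
            let port = port.unwrap_or("50051");
            format!("{host}:{port}").parse()
        }
    }
}
```

At startup a service would call it as `grpc_listen_addr_from(env::var("GRPC_ADDR").ok().as_deref(), ...)` and fail fast with a `miette` diagnostic on parse errors.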
Minimal server skeleton (centralize wiring in common):
```rust
use miette::{IntoDiagnostic as _, Result};
use tonic::{transport::Server, Request, Response, Status};
use tracing::{info, instrument};

// Assume a prost-generated module: pub mod orchestrator { pub mod v1 { include!("orchestrator.v1.rs"); } }
use orchestrator::v1::{
    job_service_server::{JobService, JobServiceServer},
    StartJobRequest, StartJobResponse,
};

pub struct JobSvc;

#[tonic::async_trait]
impl JobService for JobSvc {
    #[instrument(name = "rpc_start_job", skip(self, req), fields(rpc.system = "grpc", rpc.service = "orchestrator.v1.JobService", rpc.method = "StartJob"))]
    async fn start_job(&self, req: Request<StartJobRequest>) -> Result<Response<StartJobResponse>, Status> {
        // The caller's deadline travels as `grpc-timeout` metadata; when the
        // client gives up, this future is dropped, so long-running work must
        // be cancellation-safe (no half-applied state on drop).
        let r = req.into_inner();
        info!(job_id = %r.job_id, "start_job request");
        // ... do work, map domain errors to Status ...
        Ok(Response::new(StartJobResponse { accepted: true }))
    }
}

pub async fn serve(addr: std::net::SocketAddr) -> Result<()> {
    let _t = common::telemetry::init_tracing("solstice-orchestrator")?;
    Server::builder()
        .add_service(JobServiceServer::new(JobSvc))
        .serve(addr)
        .await
        .into_diagnostic()
}
```
Minimal client sketch (with timeout and TLS optional):
```rust
use miette::{IntoDiagnostic as _, Result};
use tonic::{transport::{Channel, ClientTlsConfig, Endpoint}, Code, Request};

use orchestrator::v1::{job_service_client::JobServiceClient, StartJobRequest};

pub async fn start_job(addr: &str, job_id: String) -> Result<bool> {
    let ep = Endpoint::from_shared(format!("https://{addr}"))
        .into_diagnostic()?
        .tcp_keepalive(Some(std::time::Duration::from_secs(20)));
    // let ep = ep.tls_config(ClientTlsConfig::new()).into_diagnostic()?; // enable when TLS configured
    let channel: Channel = ep
        .connect_timeout(std::time::Duration::from_secs(3))
        .connect()
        .await
        .into_diagnostic()?;
    let mut client = JobServiceClient::new(channel);
    let mut req = Request::new(StartJobRequest { job_id });
    req.set_timeout(std::time::Duration::from_secs(10));
    match client.start_job(req).await {
        Ok(resp) => Ok(resp.into_inner().accepted),
        Err(status) if status.code() == Code::Unavailable => Ok(false), // map transient
        Err(e) => Err(miette::miette!("gRPC error: {e}")),
    }
}
```
Health and reflection:
- Expose `grpc.health.v1.Health` via `tonic-health` for k8s/consumers. Include a readiness check (DB pool, AMQP connection) before reporting `SERVING`.
- Enable `tonic-reflection` only in dev/test to assist tooling; disable it in prod.
Testing notes:
- Use `#[tokio::test(flavor = "multi_thread")]` and bind the server to `127.0.0.1:0` (ephemeral port) for integration tests.
- Assert deadlines and cancellation by setting short timeouts on the client `Request` and verifying `Status::deadline_exceeded`.
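The ephemeral-port trick in the first bullet can be isolated into a small helper (illustrative; in a tonic test you would hand the bound listener, or its address, to the server and point the client at it):

```rust
use std::net::{SocketAddr, TcpListener};

/// Bind to 127.0.0.1:0 and return the port the OS picked, so parallel
/// test binaries never collide on a fixed port.
pub fn ephemeral_addr() -> std::io::Result<SocketAddr> {
    let listener = TcpListener::bind("127.0.0.1:0")?;
    listener.local_addr()
}
```

Note that dropping the listener before the server binds leaves a small race window; passing the live listener into the server (e.g., via `Server::serve_with_incoming`) avoids it.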
Cross-reference:
- Asynchronous communication: RabbitMQ via `lapin` (§11).
- Direct synchronous RPC: gRPC via `tonic` (this section).