This document records project-specific build, test, and development conventions for the Solstice CI workspace. It is written for experienced Rust developers and focuses only on details that are unique to this repository.
- Docs policy: Any AI-generated markdown summaries not explicitly requested in a prompt must live under `docs/ai/` with a timestamp prefix in the filename (e.g., `2025-10-25-some-topic.md`).
## 1. Build and Configuration
- Use the stable toolchain unless explicitly noted. The workspace uses the 2024 edition; keep `rustup` and `cargo` updated.
- Top-level build:
- Build everything: `cargo build --workspace`
- Run individual binaries during development using `cargo run -p <crate>`.
- Lints and formatting follow the default Rust style unless a crate specifies otherwise. Prefer `cargo fmt` and `cargo clippy --workspace --all-targets --all-features` before committing.
- Secrets and credentials are never committed. For local runs, use environment variables or a `.env` provider (do not add `.env` to VCS). In CI/deployments, use a secret store (e.g., Vault, KMS) — see the Integration layer notes.
Common configuration environment variables (pattern; per-service variables may diverge):
- `BLOB_FS_ROOT` (filesystem-backed blob store root)
## 2. Error Handling, Tracing, and Telemetry
We standardize on the `miette` + `thiserror` error pattern and `tracing` with OpenTelemetry.
- `thiserror` is used for domain error enums/structs; avoid stringly-typed errors.
- Wrap top-level errors with `miette::Report` for rich diagnostic output in CLIs and service logs. Prefer `eyre`-style ergonomics via `miette::Result<T>` where appropriate.
- Use `tracing` over `log`. Avoid ad-hoc `println!` in services; it remains acceptable in intentionally minimal binaries or for very early bootstrap before subscribers are set.
- Emit spans/resources with OpenTelemetry. The shared initialization lives in the `common` crate so all binaries get consistent wiring.
Recommended initialization (placed in `crates/common` as a single public function and called from each service `main`):
```rust
// Call site in a service `main`; `init_tracing` lives in `crates/common`.
fn main() -> miette::Result<()> {
    // The returned guard keeps subscribers/exporters alive; dropping it on
    // shutdown flushes pending spans.
    let _telemetry = common::telemetry::init_tracing("solstice-orchestrator")?;
    // ...
    Ok(())
}
```
Guidance:
- Favor spans around orchestration lifecycles (VM provisioning, job execution, webhook processing). Include high-cardinality attributes judiciously; prefer stable keys (e.g., job_id) and sample IDs for verbose content.
- When returning errors from service entrypoints, prefer `miette::Report` with context to produce actionable diagnostics.
## 3. Database Access — SeaORM + Postgres
- Use SeaORM for all relational access; keep raw SQL to migrations or perf-critical paths only.
- Model organization:
- Entities live in a dedicated `entities` module/crate generated by SeaORM tooling or handwritten as needed.
- Queries should be implemented in small, testable functions; avoid mixing business logic with query composition.
- Migrations:
- Create a migrations crate (e.g., `crates/migration`) using SeaORM CLI and point it at Postgres via `DATABASE_URL`.
- Typical workflow:
- `sea-orm-cli migrate init`
- `sea-orm-cli migrate generate <name>`
- Edit up/down migrations.
- `sea-orm-cli migrate up` (or `refresh` in dev)
- In services, run migrations on boot behind a feature flag or dedicated admin command.
- Connection management:
- Create a connection pool per service at startup (Tokio + async). Inject the pool into subsystems rather than using globals.
- Use timeouts and statement logging (via `tracing`) for observability.
### 3.1 Connection Pooling — deadpool (Postgres)
We standardize on `deadpool` for async connection pooling to Postgres, using `deadpool-postgres` (+ `tokio-postgres`) alongside SeaORM. Services construct a single pool at startup and pass it into subsystems.
Guidelines:
- Configuration via env:
- `DATABASE_URL` (Postgres connection string)
- `DB_POOL_MAX_SIZE` (e.g., 16–64 depending on workload)
- Enable TLS when required by your Postgres deployment (use `rustls` variants where possible). Avoid embedding credentials in code; prefer env/secret stores.
- Instrument queries with `tracing`. Prefer statement logging at `debug` with sampling in high-traffic paths.
Minimal setup sketch:
```rust
use deadpool_postgres::{Config as PgConfig, ManagerConfig, RecyclingMethod, Runtime};
use miette::IntoDiagnostic as _;
use tokio_postgres::NoTls;

fn build_pool() -> miette::Result<deadpool_postgres::Pool> {
    let mut cfg = PgConfig::new();
    cfg.url = Some(std::env::var("DATABASE_URL").into_diagnostic()?);
    cfg.manager = Some(ManagerConfig { recycling_method: RecyclingMethod::Fast });
    cfg.create_pool(Some(Runtime::Tokio1), NoTls).into_diagnostic()
}
```
- Use one pool per service. Do not create pools per request.
- SeaORM manages its own `sqlx`-based pool internally via `Database::connect`; keep SeaORM for ORM access and use raw `deadpool-postgres` clients for admin/utility tasks that need direct SQL.
## 4. Blob Storage — S3 and Filesystem
We support both S3-compatible object storage and a local filesystem backend.
- Prefer a storage abstraction in `common` with a trait like `BlobStore` and two implementations:
-`S3BlobStore` using the official `aws-sdk-s3` client (or a vetted S3-compatible client).
-`FsBlobStore` rooted at `BLOB_FS_ROOT` for local/dev and tests.
- Selection is via configuration (env or CLI). Keep keys and endpoints out of the repo.
- For S3, follow best practices:
- Use instance/role credentials where available; otherwise use env creds.
- Set timeouts, retries, and backoff via the client config.
- Keep bucket and path layout stable; include job IDs in keys for traceability.
- For filesystem, ensure directories are created atomically and validate paths to avoid traversal issues.
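A minimal sketch of the path-validation half of `FsBlobStore` (the shape is assumed; the real `common` trait may differ):

```rust
use std::io;
use std::path::{Component, Path, PathBuf};

/// Hypothetical filesystem-backed store rooted at `BLOB_FS_ROOT`.
pub struct FsBlobStore {
    root: PathBuf,
}

impl FsBlobStore {
    pub fn new(root: impl Into<PathBuf>) -> Self {
        Self { root: root.into() }
    }

    /// Resolve a key under the root, rejecting empty keys, absolute keys,
    /// and `..` traversal before touching the filesystem.
    pub fn resolve(&self, key: &str) -> io::Result<PathBuf> {
        let rel = Path::new(key);
        let safe = rel.components().all(|c| matches!(c, Component::Normal(_)));
        if key.is_empty() || !safe {
            return Err(io::Error::new(io::ErrorKind::InvalidInput, "unsafe blob key"));
        }
        Ok(self.root.join(rel))
    }
}
```

Keeping validation in one pure function makes the traversal check unit-testable without touching disk.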
## 5. Argument Parsing — Clap
- All binaries use `clap` for flags/subcommands. Derive-based APIs (`#[derive(Parser)]`) are strongly preferred for consistency and help text quality.
- Align flags across services where semantics match (e.g., `--log-level`, `--database-url`, `--s3-endpoint`).
- Emit `--help` that documents env var fallbacks where appropriate.
Typical entrypoint sketch:
```rust
use clap::Parser;

#[derive(Parser)]
struct Opts {
    /// Log level; falls back to the RUST_LOG env var.
    #[arg(long, env = "RUST_LOG", default_value = "info")]
    log_level: String,
}

fn main() -> miette::Result<()> {
    let _t = common::telemetry::init_tracing("solstice-orchestrator")?;
    let opts = Opts::parse();
    // ...
    Ok(())
}
```
## 6. Testing — How We Configure and Run Tests
- Unit tests: colocated in module files under `#[cfg(test)]`.
- Integration tests: per-crate `tests/` directory. Each `*.rs` compiles as a separate test binary.
- Doc tests: keep examples correct or mark them `no_run` if they require external services.
- Workspace commands we actually verified:
- Run all tests: `cargo test --workspace`
- Run a single crate’s integration test (example we executed): `cargo test -p agent --test smoke`
Adding a new integration test:
1. Create `crates/<crate>/tests/<name>.rs`.
2. Write tests using the public API of the crate; avoid `#[cfg(test)]` internals.
3. Use `#[tokio::test(flavor = "multi_thread")]` when async runtime is needed.
4. Gate external dependencies (DB, S3) behind env flags or mark tests `ignore` by default.
Example minimal integration test (we created and validated this during documentation):
```rust
#[test]
fn smoke_passes() {
assert_eq!(2 + 2, 4);
}
```
Running subsets and filters:
- By crate: `cargo test -p orchestrator`
- By test target: `cargo test -p agent --test smoke`
- By name filter: `cargo test smoke`
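Gating external dependencies (step 4 above) can be sketched like this; `RUN_DB_TESTS` is an illustrative flag name:

```rust
// tests/db_smoke.rs (hypothetical): keep `cargo test` hermetic by default.
fn external_tests_enabled() -> bool {
    std::env::var("RUN_DB_TESTS").map(|v| v == "1").unwrap_or(false)
}

#[test]
fn db_roundtrip() {
    if !external_tests_enabled() {
        eprintln!("skipping: set RUN_DB_TESTS=1 to run against a real database");
        return;
    }
    // ... connect via DATABASE_URL and exercise queries ...
}
```

Early-return skipping keeps the test green in CI without a database; `#[ignore]` is the alternative when you want the skip visible in test output.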
## 7. Code Style and Conventions
- Error design:
- Domain errors via `thiserror`. Convert them to `miette::Report` at boundaries (CLI/service entry) with context.
- Prefer `Result<T, E>` where `E: Into<miette::Report>` for top-level flows.
- Telemetry:
- Every request/job gets a root span; child spans for significant phases. Include IDs in span fields, not only logs.
- When logging sensitive data, strip or hash before emitting.
- Async and runtime:
- Tokio everywhere for async services. Use a single runtime per process; don’t nest.
- Use `tracing` instrument macros (`#[instrument]`) for important async fns.
- Crate versions:
- Always prefer the most recent compatible releases for `miette`, `thiserror`, `tracing`, `tracing-subscriber`, `opentelemetry`, `opentelemetry-otlp`, `tracing-opentelemetry`, `sea-orm`, `sea-orm-cli`, `aws-sdk-s3`, and `clap`.
- Avoid pinning minor/patch versions unless required for reproducibility or to work around regressions.
## 8. Local Development Tips
- Reproducible tasks: For interactions with external systems, prefer containerized dependencies (e.g., Postgres, MinIO) or `devcontainer`/`nix` flows if provided in the future.
- Migration safety: Never auto-apply migrations in production without a maintenance window and backup. In dev, `refresh` is acceptable.
- Storage backends: Provide a no-op or filesystem fallback so developers can run without cloud credentials.
- Observability: Keep logs structured and at `info` by default. For deep debugging, use `RUST_LOG=trace` with sampling to avoid log storms.
## 9. Troubleshooting Quick Notes
- No logs emitted: Ensure the tracing subscriber is initialized exactly once; double-init will panic. Also check `RUST_LOG` filters.
- OTLP export fails: Verify `OTEL_EXPORTER_OTLP_ENDPOINT` and that an OTLP collector (e.g., `otelcol`) is reachable. Fall back to console-only tracing if needed.
- DB connection errors: Validate `DATABASE_URL` and SSL/TLS requirements. Confirm the service can resolve the host and reach the port.
- S3 errors: Check credentials and bucket permissions; verify endpoint (for MinIO use path-style or specific region settings).
## 10. Documentation Routing
- Architecture and AI-generated summaries: place in `docs/ai/` with a timestamp prefix.
- This guidelines file intentionally lives under `.junie/guidelines.md` for developer tooling.
## 11. Messaging — RabbitMQ (lapin)
We use `lapin` (AMQP 0-9-1 client) for RabbitMQ access with Tokio. Keep producers and consumers simple and observable; centralize connection setup in the `common` crate where practical and inject channels into subsystems. This channel is used for asynchronous communication. For direct, synchronous service-to-service RPC, use gRPC via `tonic` (see §12).
Configuration (env; per-service may extend):
- `AMQP_URL` (e.g., `amqp://user:pass@host:5672/%2f` or `amqps://...` for TLS)
- `AMQP_PREFETCH` (QoS prefetch; default 32–256 depending on workload)
- `AMQP_EXCHANGE` (default exchange name; often empty-string for default direct exchange)
- `AMQP_QUEUE` (queue name)
- `AMQP_ROUTING_KEY` (routing key when publishing)
Guidelines:
- Establish one `Connection` per process with automatic heartbeats; create dedicated `Channel`s per producer/consumer task.
- Declare exchanges/queues idempotently at startup with `durable = true` and `auto_delete = false` unless explicitly ephemeral.
- Set channel QoS with `basic_qos(prefetch_count)` to control backpressure. Use ack/nack to preserve at-least-once delivery.
- Prefer publisher confirms (`confirm_select`) and handle `BasicReturn` for unroutable messages when `mandatory` is set.
- Instrument with `tracing`: tag spans with `exchange`, `queue`, `routing_key`, and message IDs; avoid logging bodies.
- Reconnection: on connection/channel error, back off with jitter and recreate connection/channels; ensure consumers re-declare topology.
Minimal producer example:
```rust
use lapin::{options::*, BasicProperties, Connection, ConnectionProperties};
use miette::{IntoDiagnostic as _, Result};

async fn publish(payload: &[u8]) -> Result<()> {
    let url = std::env::var("AMQP_URL").into_diagnostic()?;
    // lapin 2.x integrates with the Tokio reactor by default; the old
    // `tokio-amqp` shim crate is no longer needed.
    let conn = Connection::connect(&url, ConnectionProperties::default())
        .await
        .into_diagnostic()?;
    let channel = conn.create_channel().await.into_diagnostic()?;
    channel
        .confirm_select(ConfirmSelectOptions::default())
        .await
        .into_diagnostic()?;
    channel
        .basic_publish(
            "",     // default exchange (AMQP_EXCHANGE)
            "jobs", // routing key (AMQP_ROUTING_KEY); name is illustrative
            BasicPublishOptions::default(),
            payload,
            BasicProperties::default().with_content_type("application/json".into()),
        )
        .await
        .into_diagnostic()? // publish accepted by the channel
        .await
        .into_diagnostic()?; // broker confirm (confirm_select above)
    Ok(())
}
```
- Use TLS (`amqps://`) where brokers require it; configure certificates via the underlying TLS connector if needed.
- Keep message payloads schematized (e.g., JSON/CBOR/Protobuf) and versioned; include an explicit `content_type` and version header where applicable.
## 12. RPC — gRPC with tonic
We use gRPC for direct, synchronous service-to-service communication and standardize on the Rust `tonic` stack with `prost` for code generation. Prefer RabbitMQ (see §11) for asynchronous workflows, fan-out, or buffering; prefer gRPC when a caller needs an immediate response, strong request/response semantics, deadlines, and backpressure at the transport layer.
- Service boundaries: Define protobuf packages per domain (`orchestrator.v1`, `agent.v1`). Version packages and make only backward-compatible changes (add fields with new tags; never reuse or renumber tags).
- Errors: Map domain errors to `tonic::Status` with appropriate `Code` (e.g., `InvalidArgument`, `NotFound`, `FailedPrecondition`, `Unavailable`, `Internal`). Preserve rich diagnostics in logs via `miette` and `tracing`; avoid leaking internals to clients.
- Deadlines and cancellation: Require callers to set deadlines; servers must honor client deadlines (propagated as `grpc-timeout` metadata) and request cancellation. Set sensible server timeouts.
- Observability: Propagate W3C TraceContext over gRPC metadata and create spans per RPC. Emit attributes for `rpc.system = "grpc"`, `rpc.service`, `rpc.method`, stable IDs (job_id) as fields.
- Security: Prefer TLS with `rustls`. Use mTLS where feasible, or bearer tokens in metadata (e.g., `authorization: Bearer ...`). Rotate certs without restarts where possible.
- Operations: Configure keepalive, max message sizes, and concurrency. Use streaming RPCs for log streaming and large artifact transfers when applicable.
Common env configuration (typical; per-service may extend):
- `GRPC_ADDR` or `GRPC_HOST`/`GRPC_PORT` (listen/connect endpoint, e.g., `0.0.0.0:50051`)
- `GRPC_TLS_CERT`, `GRPC_TLS_KEY`, `GRPC_TLS_CA` (PEM paths for TLS/mTLS)
- Expose `grpc.health.v1.Health` via `tonic-health` for k8s/consumers. Include a readiness check (DB pool, AMQP connection) before reporting `SERVING`.
- Enable `tonic-reflection` only in dev/test to assist tooling; disable in prod.
Testing notes:
- Use `#[tokio::test(flavor = "multi_thread")]` and bind the server to `127.0.0.1:0` (ephemeral port) for integration tests.
- Assert deadlines and cancellation by setting short timeouts on the client `Request` and verifying `Status::deadline_exceeded`.
Cross-reference:
- Asynchronous communication: RabbitMQ via `lapin` (§11).
- Direct synchronous RPC: gRPC via `tonic` (this section).