Add ADR-016: GPU pipeline future-proofing for cloud gaming

Ensure today's CPU-first design doesn't block a future zero-copy
GPU encoding path (DMA-BUF → VAAPI/NVENC, sub-5ms encode latency).

Key design constraints:
- FrameSource enum: Cpu and DmaBuf variants (build Cpu now, DmaBuf later)
- FrameEncoder trait: accepts both, encoder picks preferred source
- Pipelined render→encode→send (no synchronous blocking)
- Adaptive frame cadence: OnDamage (desktop) vs Fixed (gaming)
- Damage tracking forks: pixel diff for CPU, Wayland hints for GPU

No GPU code built now -- just trait interfaces that preserve the path.
Author: Till Wegmueller, 2026-03-29 13:29:32 +02:00
Parent: 643c4f042d
Commit: f702cbf4e7

# ADR-016: GPU Pipeline Future-Proofing (Cloud Gaming Path)
## Status
Accepted
## Context
WayRay's current design prioritizes the no-GPU headless path (PixmanRenderer + CPU encoding). This is correct for the primary thin client use case. However, a cloud gaming scenario -- GPU-rendered games streamed to thin clients -- is a natural future extension. We must ensure today's design decisions don't close that door.
### The Cloud Gaming Pipeline
```
Current (CPU, works in zones):
  App → wl_shm (CPU) → Pixman composite (CPU) → memcpy → CPU encode → QUIC
  Latency: ~15-30ms encode
  Quality: good for desktop, too slow for gaming

Future (GPU, zero-copy):
  Game → DMA-BUF (GPU) → GLES composite (GPU) → DMA-BUF → VAAPI/NVENC (GPU) → QUIC
  Latency: ~3-5ms encode
  Quality: 1080p60 or 4K60 with hardware encoding
  CPU: barely involved
```
The zero-copy GPU path means pixels never touch CPU memory: the game renders to a DMA-BUF, the compositor composites into another DMA-BUF, and the hardware encoder reads that DMA-BUF directly. This is how Sunshine/Moonlight, GeForce NOW, and Steam Remote Play achieve sub-20ms motion-to-photon latency.
### What Could Block This
If we make any of these mistakes now, GPU encoding becomes a painful retrofit later:
| Mistake | Why it blocks GPU path |
|---------|----------------------|
| Encoder takes only `&[u8]` pixel data | Forces GPU→CPU readback, kills zero-copy |
| Frame pipeline is synchronous | Can't pipeline render→encode→send |
| Damage tracking assumes CPU-accessible buffers | Can't diff GPU-resident frames |
| Frame cadence is damage-only | Gaming needs consistent 60/120fps |
| Compositor hardcodes PixmanRenderer | Can't composite on GPU |
## Decision
We adopt the following design constraints to preserve the GPU path.
### 1. Frame Source Abstraction
The encoder must accept both CPU buffers and GPU buffer handles:
```rust
/// A frame ready for encoding. Can be CPU pixels or a GPU buffer handle.
enum FrameSource<'a> {
    /// CPU-accessible pixel data (from PixmanRenderer / ExportMem)
    Cpu {
        data: &'a [u8],
        width: u32,
        height: u32,
        stride: u32,
        format: PixelFormat,
    },
    /// GPU buffer handle (from GlesRenderer / DRM).
    /// Encoder can import this directly without a CPU copy.
    DmaBuf {
        fd: OwnedFd,
        width: u32,
        height: u32,
        stride: u32,
        format: DrmFourcc,
        modifier: DrmModifier,
    },
}
```
A CPU-based encoder (zstd diff, software x264) processes `FrameSource::Cpu`.
A hardware encoder (VAAPI, NVENC) prefers `FrameSource::DmaBuf` (zero-copy) but can also accept `FrameSource::Cpu` (upload first, slower).
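That dispatch can be sketched with a trimmed-down `FrameSource` (only the fields needed here; `encode_path` is an illustrative helper, not part of the ADR's interface). A hardware-style encoder takes the zero-copy route when handed a `DmaBuf` and falls back to an upload for `Cpu`:

```rust
/// Simplified sketch of the two-variant frame source.
enum FrameSource<'a> {
    Cpu { data: &'a [u8], width: u32 },
    DmaBuf { width: u32 },
}

/// How a hardware-style encoder would route each variant:
/// zero-copy import for DMA-BUF, upload-then-encode for CPU pixels.
fn encode_path(source: &FrameSource) -> &'static str {
    match source {
        FrameSource::DmaBuf { .. } => "zero-copy import",
        FrameSource::Cpu { .. } => "upload then encode",
    }
}
```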
### 2. Encoder Trait
```rust
trait FrameEncoder: Send {
    /// What frame sources this encoder supports
    fn supported_sources(&self) -> &[FrameSourceType];

    /// Encode a frame, returning encoded data for transmission.
    /// `damage` hints which regions changed (encoder may ignore if doing full-frame).
    fn encode(
        &mut self,
        source: FrameSource<'_>,
        damage: &[Rectangle],
        frame_seq: u64,
    ) -> Result<EncodedFrame>;

    /// Signal that encoding parameters should adapt
    /// (e.g., client reported decode time too high)
    fn adapt(&mut self, feedback: EncoderFeedback);
}

enum FrameSourceType {
    Cpu,
    DmaBuf,
}
```
Planned encoder implementations:
| Encoder | Accepts | Use Case |
|---------|---------|----------|
| `ZstdDiffEncoder` | Cpu | Desktop, lossless text, LAN |
| `SoftwareH264Encoder` (x264) | Cpu | Desktop, lossy, WAN |
| `VaapiH264Encoder` | DmaBuf (preferred), Cpu | GPU-accelerated, gaming |
| `NvencH264Encoder` | DmaBuf (preferred), Cpu | NVIDIA GPU, gaming |
| `VaapiAv1Encoder` | DmaBuf (preferred), Cpu | Future, better quality |
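Selection among these encoders can key off `supported_sources`. A hypothetical sketch (the `EncoderInfo` registry and `select_encoder` are illustrative, not part of the ADR's interface): walk a preference-ordered list and pick the first encoder that supports the source type the renderer actually produces.

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum FrameSourceType {
    Cpu,
    DmaBuf,
}

/// Illustrative registry entry: an encoder name plus the source
/// types it supports (mirroring the table above).
struct EncoderInfo {
    name: &'static str,
    supported: &'static [FrameSourceType],
}

/// Pick the first registered encoder that supports the available source.
/// Registration order expresses preference.
fn select_encoder(encoders: &[EncoderInfo], available: FrameSourceType) -> Option<&'static str> {
    encoders
        .iter()
        .find(|e| e.supported.contains(&available))
        .map(|e| e.name)
}
```

With a CPU-first registration order, a CPU frame lands on the zstd encoder today, while a future DMA-BUF frame would fall through to the VAAPI entry without touching the CPU encoders.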
### 3. Pipelined Frame Submission
The render→encode→send pipeline must be **asynchronous and pipelined**, not synchronous:
```
Synchronous (WRONG -- blocks on encode):
Frame N:   [render]──[encode]──[send]
Frame N+1:                            [render]──[encode]──[send]

Pipelined (RIGHT -- overlap work):
Frame N:   [render]──[encode]──[send]
Frame N+1:           [render]──[encode]──[send]
Frame N+2:                     [render]──[encode]──[send]
```
Implementation: the compositor submits frames to an encoding channel. The encoder runs on a separate thread (or GPU async queue). Encoded frames are submitted to the QUIC sender. Each stage can work on a different frame simultaneously.
```rust
// Compositor render loop (calloop)
fn on_frame_rendered(&mut self, frame: FrameSource, damage: Vec<Rectangle>) {
    // Non-blocking: submit to the encoder channel
    self.encoder_tx.send(EncodeRequest { frame, damage, seq: self.next_seq() });
}

// Encoder thread: owns the encoder and drains the channel
fn encoder_loop(
    mut encoder: Box<dyn FrameEncoder>,
    rx: Receiver<EncodeRequest>,
    tx: Sender<EncodedFrame>,
) -> Result<()> {
    for req in rx {
        let encoded = encoder.encode(req.frame, &req.damage, req.seq)?;
        tx.send(encoded)?; // Submit to network sender
    }
    Ok(())
}
```
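The stage wiring itself is plain channels and threads. A minimal, runnable sketch (names like `spawn_encoder_stage` are illustrative; sequence numbers stand in for `FrameSource`, and a formatted string stands in for `EncodedFrame`):

```rust
use std::sync::mpsc;
use std::thread;

/// Spawn a stand-in encoder stage: the render loop pushes frame sequence
/// numbers into one channel, the encoder thread drains it, and "encoded
/// frames" come out on a second channel for the network sender.
fn spawn_encoder_stage() -> (mpsc::Sender<u64>, mpsc::Receiver<String>) {
    let (render_tx, render_rx) = mpsc::channel::<u64>();
    let (net_tx, net_rx) = mpsc::channel::<String>();
    thread::spawn(move || {
        for seq in render_rx {
            // Stand-in for FrameEncoder::encode on a real FrameSource
            let _ = net_tx.send(format!("encoded frame {seq}"));
        }
    });
    (render_tx, net_rx)
}
```

Because each stage blocks only on its own channel, the compositor can render frame N+1 while frame N is still being encoded, which is exactly the overlap the diagram shows.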
### 4. Adaptive Frame Cadence
Two scheduling modes, selectable at runtime:
```rust
enum FrameCadence {
    /// Render only when Wayland clients commit new content.
    /// Coalesce rapid commits within a window (e.g., 4ms).
    /// Ideal for desktop: saves CPU when nothing changes.
    OnDamage { coalesce_ms: u32 },

    /// Render at a fixed rate regardless of client commits.
    /// Resubmit the last frame if nothing changed (keeps encoder fed).
    /// Needed for gaming: consistent framerate, no stutter.
    Fixed { fps: u32 },
}
```
The compositor defaults to `OnDamage`. When a surface is detected as "high frame rate" (e.g., commits faster than 30fps consistently), the compositor can switch to `Fixed` mode for that output. This could also be triggered by configuration (`wradm session set-cadence gaming`).
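The "high frame rate" detection could be as simple as averaging recent commit intervals. A hypothetical sketch (the heuristic, the 33 ms threshold, and `pick_cadence` are illustrative assumptions, not decided policy; the enum is re-declared here so the sketch is self-contained):

```rust
#[derive(Debug, PartialEq)]
enum FrameCadence {
    OnDamage { coalesce_ms: u32 },
    Fixed { fps: u32 },
}

/// If the mean interval between client commits over a sample window is
/// under ~33 ms (i.e., faster than 30 fps), treat the surface as
/// high frame rate and switch the output to a fixed cadence.
fn pick_cadence(commit_intervals_ms: &[u32]) -> FrameCadence {
    if commit_intervals_ms.is_empty() {
        return FrameCadence::OnDamage { coalesce_ms: 4 };
    }
    let avg = commit_intervals_ms.iter().sum::<u32>() / commit_intervals_ms.len() as u32;
    if avg < 33 {
        FrameCadence::Fixed { fps: 60 }
    } else {
        FrameCadence::OnDamage { coalesce_ms: 4 }
    }
}
```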
### 5. Damage Tracking Must Support Both Paths
For CPU buffers, we diff pixels directly (XOR previous frame, zstd compress diff).
For GPU buffers, we can't cheaply read pixels for diffing. Instead:
- **Wayland damage regions**: Every `wl_surface.commit()` includes damage hints from the client. These are already tracked by Smithay's `OutputDamageTracker`.
- **Full-frame encoding**: Hardware encoders (H.264, AV1) handle their own temporal redundancy via P-frames and reference frames. They don't need pixel-level diffs from us.
So the damage path forks:
```
CPU path: pixel diff (XOR + zstd) → needs pixel access     → FrameSource::Cpu
GPU path: Wayland damage hints    → no pixel access needed → FrameSource::DmaBuf
          (H.264 temporal coding in the encoder handles the rest)
```
Both paths are valid. The encoder trait abstracts this -- a zstd encoder uses damage for pixel diff regions, a VAAPI encoder uses damage to decide keyframe frequency or quality.
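The CPU-side diff is mechanically simple. A sketch (zstd is not pulled in here; `changed_bytes` is an illustrative helper): XOR against the previous frame yields a buffer that is zero wherever nothing changed, and zstd then compresses those zero runs extremely well.

```rust
/// XOR the previous frame against the current one. Unchanged bytes
/// become 0, so the result is highly compressible.
fn xor_diff(prev: &[u8], curr: &[u8]) -> Vec<u8> {
    prev.iter().zip(curr).map(|(a, b)| a ^ b).collect()
}

/// Count bytes that actually changed -- a rough proxy for how much
/// damage the diff carries before compression.
fn changed_bytes(diff: &[u8]) -> usize {
    diff.iter().filter(|&&b| b != 0).count()
}
```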
### 6. GPU Partitioning (Infrastructure, Not WayRay)
For multi-session gaming, the GPU must be shared:
| Technology | What it does | Platform |
|-----------|-------------|----------|
| SR-IOV | Hardware GPU partitioning | Intel (Flex/Arc), some AMD |
| NVIDIA MIG | Multi-Instance GPU slicing | NVIDIA A100/A30/H100 |
| NVIDIA vGPU | Virtual GPU via hypervisor | NVIDIA datacenter GPUs |
| Intel GVT-g | Mediated GPU passthrough | Intel integrated GPUs |
| Virtio-GPU | Paravirtualized GPU for VMs | QEMU/KVM |
This is infrastructure configuration, not compositor code. WayRay just needs to see a `/dev/dri/renderD*` device in its environment. Whether that device is a physical GPU, an SR-IOV VF, a MIG slice, or a vGPU instance is transparent.
**WayRay's responsibility**: detect available GPU, use it if present, fall back gracefully if not. Already covered by ADR-005 (dual renderer).
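Detection reduces to scanning for render nodes. A hypothetical sketch (`find_render_nodes` is illustrative; ADR-005's actual renderer selection is not shown here). Whether a node is a physical GPU, an SR-IOV VF, or a MIG slice is invisible at this level, which is exactly the point:

```rust
use std::path::Path;

/// Scan a DRI directory (normally /dev/dri) for render nodes
/// (renderD128, renderD129, ...). An empty result means no GPU:
/// fall back to the PixmanRenderer path.
fn find_render_nodes(dri_dir: &Path) -> Vec<String> {
    std::fs::read_dir(dri_dir)
        .map(|entries| {
            entries
                .filter_map(|e| e.ok())
                .filter_map(|e| e.file_name().into_string().ok())
                .filter(|name| name.starts_with("renderD"))
                .collect()
        })
        .unwrap_or_default() // missing /dev/dri == no GPU, not an error
}
```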
## What We Build Now vs Later
### Build Now (Phase 0-2)
- `FrameSource` enum with `Cpu` variant only (no DmaBuf yet)
- `FrameEncoder` trait with the full interface (including `supported_sources`)
- `ZstdDiffEncoder` and `SoftwareH264Encoder` (Cpu only)
- Pipelined encode channel (encoder runs on separate thread)
- `FrameCadence::OnDamage` as default
- PixmanRenderer path
### Build Later (when GPU gaming is prioritized)
- Add `FrameSource::DmaBuf` variant
- `VaapiH264Encoder` / `NvencEncoder` (import DMA-BUF, zero-copy encode)
- `FrameCadence::Fixed` mode
- GlesRenderer path with DMA-BUF export to encoder
- Client-side frame interpolation for network jitter compensation
- Input prediction (client-side speculative input processing)
### Never Build (out of scope)
- GPU partitioning infrastructure (SR-IOV, MIG setup)
- Game streaming as a separate product (WayRay is a compositor, not GeForce NOW)
- Client-side GPU rendering (the client stays dumb)
## Rationale
- **Trait-based encoder** with `FrameSource` enum is the critical abstraction. It costs nothing now (we only implement the Cpu variant) but preserves the zero-copy DMA-BUF path for later.
- **Pipelined encoding** improves desktop performance too (compositor doesn't block on encode), so it's not wasted work.
- **Adaptive frame cadence** is a minor addition that enables gaming without disrupting desktop behavior.
- **No premature GPU code**: we don't write VAAPI/NVENC encoders now. We just ensure the interfaces don't preclude them.
## Consequences
- `FrameSource` enum adds one level of indirection to the encoding path (trivial cost)
- Pipelined encoding adds threading complexity (but improves performance)
- Must resist the temptation to simplify the encoder trait to `fn encode(&[u8]) -> Vec<u8>` -- the DmaBuf variant must stay in the design even if unimplemented
- Documentation must explain the GPU future path so contributors don't accidentally close the door