# ADR-016: GPU Pipeline Future-Proofing (Cloud Gaming Path)

## Status

Accepted

## Context

WayRay's current design prioritizes the no-GPU headless path (PixmanRenderer + CPU encoding). This is correct for the primary thin client use case. However, a cloud gaming scenario -- GPU-rendered games streamed to thin clients -- is a natural future extension. We must ensure today's design decisions don't close that door.

### The Cloud Gaming Pipeline

```
Current (CPU, works in zones):
  App → wl_shm (CPU) → Pixman composite (CPU) → memcpy → CPU encode → QUIC
  Latency: ~15-30ms encode
  Quality: good for desktop, too slow for gaming

Future (GPU, zero-copy):
  Game → DMA-BUF (GPU) → GLES composite (GPU) → DMA-BUF → VAAPI/NVENC (GPU) → QUIC
  Latency: ~3-5ms encode
  Quality: 1080p60 or 4K60 with hardware encoding
  CPU: barely involved
```

The zero-copy GPU path means pixels never touch CPU memory: the game renders to a DMA-BUF, the compositor composites into another DMA-BUF, and the hardware encoder reads that DMA-BUF directly. This is how Sunshine/Moonlight, GeForce NOW, and Steam Remote Play achieve sub-20ms motion-to-photon latency.

### What Could Block This

If we make any of these mistakes now, GPU encoding becomes a painful retrofit later:

| Mistake | Why it blocks GPU path |
|---------|------------------------|
| Encoder takes only `&[u8]` pixel data | Forces GPU→CPU readback, kills zero-copy |
| Frame pipeline is synchronous | Can't pipeline render→encode→send |
| Damage tracking assumes CPU-accessible buffers | Can't diff GPU-resident frames |
| Frame cadence is damage-only | Gaming needs consistent 60/120fps |
| Compositor hardcodes PixmanRenderer | Can't composite on GPU |

## Decision

Design constraints to preserve the GPU path:

### 1. Frame Source Abstraction

The encoder must accept both CPU buffers and GPU buffer handles:

```rust
/// A frame ready for encoding. Can be CPU pixels or a GPU buffer handle.
enum FrameSource<'a> {
    /// CPU-accessible pixel data (from PixmanRenderer / ExportMem)
    Cpu {
        data: &'a [u8],
        width: u32,
        height: u32,
        stride: u32,
        format: PixelFormat,
    },
    /// GPU buffer handle (from GlesRenderer / DRM).
    /// The encoder can import this directly without a CPU copy.
    DmaBuf {
        fd: OwnedFd,
        width: u32,
        height: u32,
        stride: u32,
        format: DrmFourcc,
        modifier: DrmModifier,
    },
}
```

A CPU-based encoder (zstd diff, software x264) processes `FrameSource::Cpu`. A hardware encoder (VAAPI, NVENC) prefers `FrameSource::DmaBuf` (zero-copy) but can also accept `FrameSource::Cpu` (upload first, slower).

### 2. Encoder Trait

```rust
trait FrameEncoder: Send {
    /// What frame sources this encoder supports.
    fn supported_sources(&self) -> &[FrameSourceType];

    /// Encode a frame, returning encoded data for transmission.
    /// `damage` hints which regions changed (the encoder may ignore it
    /// when doing full-frame encoding).
    fn encode(
        &mut self,
        source: FrameSource<'_>,
        damage: &[Rectangle],
        frame_seq: u64,
    ) -> Result<EncodedFrame, EncodeError>;

    /// Signal that encoding parameters should adapt
    /// (e.g., the client reported that decode time is too high).
    fn adapt(&mut self, feedback: EncoderFeedback);
}

enum FrameSourceType {
    Cpu,
    DmaBuf,
}
```

Planned encoder implementations:

| Encoder | Accepts | Use Case |
|---------|---------|----------|
| `ZstdDiffEncoder` | Cpu | Desktop, lossless text, LAN |
| `SoftwareH264Encoder` (x264) | Cpu | Desktop, lossy, WAN |
| `VaapiH264Encoder` | DmaBuf (preferred), Cpu | GPU-accelerated, gaming |
| `NvencH264Encoder` | DmaBuf (preferred), Cpu | NVIDIA GPU, gaming |
| `VaapiAv1Encoder` | DmaBuf (preferred), Cpu | Future, better quality |

### 3. Pipelined Frame Submission

The render→encode→send pipeline must be **asynchronous and pipelined**, not synchronous:

```
Synchronous (WRONG -- blocks on encode):
  Frame N:   [render]──[encode]──[send]
  Frame N+1:                            [render]──[encode]──[send]

Pipelined (RIGHT -- overlap work):
  Frame N:   [render]──[encode]──[send]
  Frame N+1:           [render]──[encode]──[send]
  Frame N+2:                     [render]──[encode]──[send]
```

Implementation: the compositor submits frames to an encoding channel. The encoder runs on a separate thread (or a GPU async queue). Encoded frames are submitted to the QUIC sender. Each stage can work on a different frame simultaneously.

```rust
// Compositor render loop (calloop)
fn on_frame_rendered(&mut self, frame: FrameSource, damage: Vec<Rectangle>) {
    // Non-blocking: submit to the encoder channel
    self.encoder_tx.send(EncodeRequest {
        frame,
        damage,
        seq: self.next_seq(),
    });
}

// Encoder thread
fn encoder_loop(
    mut encoder: Box<dyn FrameEncoder>,
    rx: Receiver<EncodeRequest>,
    tx: Sender<EncodedFrame>,
) {
    for req in rx {
        if let Ok(encoded) = encoder.encode(req.frame, &req.damage, req.seq) {
            tx.send(encoded); // Submit to the network sender
        }
    }
}
```

### 4. Adaptive Frame Cadence

Two scheduling modes, selectable at runtime:

```rust
enum FrameCadence {
    /// Render only when Wayland clients commit new content.
    /// Coalesce rapid commits within a window (e.g., 4ms).
    /// Ideal for desktop: saves CPU when nothing changes.
    OnDamage { coalesce_ms: u32 },

    /// Render at a fixed rate regardless of client commits.
    /// Resubmit the last frame if nothing changed (keeps the encoder fed).
    /// Needed for gaming: consistent framerate, no stutter.
    Fixed { fps: u32 },
}
```

The compositor defaults to `OnDamage`. When a surface is detected as "high frame rate" (e.g., it consistently commits faster than 30fps), the compositor can switch that output to `Fixed` mode. This could also be triggered by configuration (`wradm session set-cadence gaming`).

### 5. Damage Tracking Must Support Both Paths

For CPU buffers, we diff pixels directly (XOR against the previous frame, then zstd-compress the diff).
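A minimal sketch of that diff step, using only the standard library; `xor_diff` and `is_unchanged` are illustrative names, not WayRay APIs, and the zstd compression pass is left as a comment:

```rust
/// XOR the frames byte-by-byte: unchanged pixels become zero runs,
/// which compress extremely well under zstd.
fn xor_diff(prev: &[u8], curr: &[u8]) -> Vec<u8> {
    prev.iter().zip(curr).map(|(p, c)| p ^ c).collect()
}

/// An all-zero diff means the frame is unchanged and can be skipped.
fn is_unchanged(diff: &[u8]) -> bool {
    diff.iter().all(|&b| b == 0)
}

fn main() {
    let prev = vec![0x10u8; 16];
    let mut curr = prev.clone();
    curr[4] = 0x1f; // one "pixel" changed

    let diff = xor_diff(&prev, &curr);
    assert!(!is_unchanged(&diff));
    assert_eq!(diff[4], 0x10 ^ 0x1f);
    // A zstd pass over `diff` would then shrink the zero runs;
    // the client reconstructs with curr = prev XOR diff.
}
```

The XOR makes the diff trivially invertible, so the client needs only the previous frame and the compressed diff to reconstruct the current one.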
For GPU buffers, we can't cheaply read pixels back for diffing. Instead:

- **Wayland damage regions**: every `wl_surface.commit()` includes damage hints from the client. These are already tracked by Smithay's `OutputDamageTracker`.
- **Full-frame encoding**: hardware encoders (H.264, AV1) handle their own temporal redundancy via P-frames and reference frames. They don't need pixel-level diffs from us.

So the damage path forks:

```
CPU path: pixel diff (XOR + zstd) → needs pixel access     → FrameSource::Cpu
GPU path: Wayland damage hints    → no pixel access needed → FrameSource::DmaBuf
          + H.264 temporal coding   (encoder handles the rest)
```

Both paths are valid. The encoder trait abstracts this -- a zstd encoder uses damage to pick pixel-diff regions, while a VAAPI encoder uses damage to decide keyframe frequency or quality.

### 6. GPU Partitioning (Infrastructure, Not WayRay)

For multi-session gaming, the GPU must be shared:

| Technology | What it does | Platform |
|------------|--------------|----------|
| SR-IOV | Hardware GPU partitioning | Intel (Flex/Arc), some AMD |
| NVIDIA MIG | Multi-Instance GPU slicing | NVIDIA A100/A30/H100 |
| NVIDIA vGPU | Virtual GPU via hypervisor | NVIDIA datacenter GPUs |
| Intel GVT-g | Mediated GPU passthrough | Intel integrated GPUs |
| Virtio-GPU | Paravirtualized GPU for VMs | QEMU/KVM |

This is infrastructure configuration, not compositor code. WayRay just needs to see a `/dev/dri/renderD*` device in its environment. Whether that device is a physical GPU, an SR-IOV VF, a MIG slice, or a vGPU instance is transparent.

**WayRay's responsibility**: detect an available GPU, use it if present, and fall back gracefully if not. Already covered by ADR-005 (dual renderer).
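A sketch of that graceful fallback, assuming detection amounts to scanning a DRI directory for a render node; the function name and directory parameter are illustrative, not WayRay APIs:

```rust
use std::path::{Path, PathBuf};

/// Look for a DRM render node (renderD128, renderD129, ...) in the
/// given directory. Returns None when no GPU is visible, in which
/// case the compositor stays on the PixmanRenderer path.
fn find_render_node(dri_dir: &Path) -> Option<PathBuf> {
    let entries = std::fs::read_dir(dri_dir).ok()?;
    entries
        .filter_map(|e| e.ok())
        .map(|e| e.path())
        .find(|p| {
            p.file_name()
                .and_then(|n| n.to_str())
                .map_or(false, |n| n.starts_with("renderD"))
        })
}

fn main() {
    // On a real host this would be /dev/dri; a GPU-less zone simply
    // gets None and never touches GPU code paths.
    match find_render_node(Path::new("/dev/dri")) {
        Some(node) => println!("GPU render node: {}", node.display()),
        None => println!("no GPU, using PixmanRenderer"),
    }
}
```

Because the scan only cares about the device node's name, an SR-IOV VF, MIG slice, or vGPU instance is indistinguishable from a physical GPU, exactly as the table above requires.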
## What We Build Now vs Later

### Build Now (Phase 0-2)

- `FrameSource` enum with the `Cpu` variant only (no `DmaBuf` yet)
- `FrameEncoder` trait with the full interface (including `supported_sources`)
- `ZstdDiffEncoder` and `SoftwareH264Encoder` (Cpu only)
- Pipelined encode channel (encoder runs on a separate thread)
- `FrameCadence::OnDamage` as the default
- PixmanRenderer path

### Build Later (when GPU gaming is prioritized)

- `FrameSource::DmaBuf` variant
- `VaapiH264Encoder` / `NvencH264Encoder` (import DMA-BUF, zero-copy encode)
- `FrameCadence::Fixed` mode
- GlesRenderer path with DMA-BUF export to the encoder
- Client-side frame interpolation for network jitter compensation
- Input prediction (client-side speculative input processing)

### Never Build (out of scope)

- GPU partitioning infrastructure (SR-IOV, MIG setup)
- Game streaming as a separate product (WayRay is a compositor, not GeForce NOW)
- Client-side GPU rendering (the client stays dumb)

## Rationale

- The **trait-based encoder** with the `FrameSource` enum is the critical abstraction. It costs nothing now (we only implement the `Cpu` variant) but preserves the zero-copy DMA-BUF path for later.
- **Pipelined encoding** improves desktop performance too (the compositor doesn't block on encode), so it's not wasted work.
- **Adaptive frame cadence** is a minor addition that enables gaming without disrupting desktop behavior.
- **No premature GPU code**: we don't write VAAPI/NVENC encoders now. We just ensure the interfaces don't preclude them.

## Consequences

- The `FrameSource` enum adds one level of indirection to the encoding path (trivial cost)
- Pipelined encoding adds threading complexity (but improves performance)
- We must resist the temptation to simplify the encoder trait to `fn encode(&[u8]) -> Vec<u8>` -- the `DmaBuf` variant must stay in the design even if unimplemented
- Documentation must explain the GPU future path so contributors don't accidentally close the door
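The pipelined encode channel listed under Build Now can be sketched with `std::sync::mpsc` and a stub encoder standing in for the `FrameEncoder` trait; all names here are illustrative, not WayRay APIs:

```rust
use std::sync::mpsc;
use std::thread;

struct EncodeRequest {
    pixels: Vec<u8>,
    seq: u64,
}

struct EncodedFrame {
    payload: Vec<u8>,
    seq: u64,
}

/// Spawn the encoder stage on its own thread. The loop ends when the
/// compositor drops its sender, closing the channel.
fn spawn_encoder(
    rx: mpsc::Receiver<EncodeRequest>,
    tx: mpsc::Sender<EncodedFrame>,
) -> thread::JoinHandle<()> {
    thread::spawn(move || {
        for req in rx {
            // Stub "encode": a real implementation would run
            // zstd/x264 here before forwarding to the QUIC sender.
            let payload = req.pixels;
            let _ = tx.send(EncodedFrame { payload, seq: req.seq });
        }
    })
}

fn main() {
    let (frame_tx, frame_rx) = mpsc::channel();
    let (net_tx, net_rx) = mpsc::channel();
    let encoder = spawn_encoder(frame_rx, net_tx);

    // Compositor side: submitting a frame never blocks on encoding.
    for seq in 0..3 {
        frame_tx
            .send(EncodeRequest { pixels: vec![0u8; 64], seq })
            .unwrap();
    }
    drop(frame_tx); // closing the channel ends the encoder loop

    // Network side: frames arrive in submission order.
    let seqs: Vec<u64> = net_rx.iter().map(|f| f.seq).collect();
    assert_eq!(seqs, vec![0, 1, 2]);
    encoder.join().unwrap();
}
```

This is the threading complexity the Consequences section accepts: one extra thread and two channels buy a render loop that never stalls on encoding.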