wayray/docs/ai/adr/016-gpu-pipeline-future-proofing.md
Till Wegmueller f702cbf4e7
Add ADR-016: GPU pipeline future-proofing for cloud gaming
Ensure today's CPU-first design doesn't block a future zero-copy
GPU encoding path (DMA-BUF → VAAPI/NVENC, sub-5ms encode latency).

Key design constraints:
- FrameSource enum: Cpu and DmaBuf variants (build Cpu now, DmaBuf later)
- FrameEncoder trait: accepts both, encoder picks preferred source
- Pipelined render→encode→send (no synchronous blocking)
- Adaptive frame cadence: OnDamage (desktop) vs Fixed (gaming)
- Damage tracking forks: pixel diff for CPU, Wayland hints for GPU

No GPU code built now -- just trait interfaces that preserve the path.
2026-03-29 13:29:32 +02:00
ADR-016: GPU Pipeline Future-Proofing (Cloud Gaming Path)

Status

Accepted

Context

WayRay's current design prioritizes the no-GPU headless path (PixmanRenderer + CPU encoding). This is correct for the primary thin client use case. However, a cloud gaming scenario -- GPU-rendered games streamed to thin clients -- is a natural future extension. We must ensure today's design decisions don't close that door.

The Cloud Gaming Pipeline

Current (CPU, works in zones):
  App → wl_shm (CPU) → Pixman composite (CPU) → memcpy → CPU encode → QUIC
  Latency: ~15-30ms encode
  Quality: good for desktop, too slow for gaming

Future (GPU, zero-copy):
  Game → DMA-BUF (GPU) → GLES composite (GPU) → DMA-BUF → VAAPI/NVENC (GPU) → QUIC
  Latency: ~3-5ms encode
  Quality: 1080p60 or 4K60 with hardware encoding
  CPU: barely involved

The zero-copy GPU path means pixels never touch CPU memory: the game renders to a DMA-BUF, the compositor composites into another DMA-BUF, and the hardware encoder reads that DMA-BUF directly. This is how Sunshine/Moonlight, GeForce NOW, and Steam Remote Play achieve sub-20ms motion-to-photon latency.

What Could Block This

If we make any of these mistakes now, GPU encoding becomes a painful retrofit later:

| Mistake | Why it blocks the GPU path |
| --- | --- |
| Encoder takes only `&[u8]` pixel data | Forces GPU→CPU readback, kills zero-copy |
| Frame pipeline is synchronous | Can't pipeline render→encode→send |
| Damage tracking assumes CPU-accessible buffers | Can't diff GPU-resident frames |
| Frame cadence is damage-only | Gaming needs consistent 60/120 fps |
| Compositor hardcodes PixmanRenderer | Can't composite on GPU |

Decision

Design constraints to preserve the GPU path:

1. Frame Source Abstraction

The encoder must accept both CPU buffers and GPU buffer handles:

/// A frame ready for encoding. Can be CPU pixels or a GPU buffer handle.
enum FrameSource<'a> {
    /// CPU-accessible pixel data (from PixmanRenderer / ExportMem)
    Cpu {
        data: &'a [u8],
        width: u32,
        height: u32,
        stride: u32,
        format: PixelFormat,
    },

    /// GPU buffer handle (from GlesRenderer / DRM)
    /// Encoder can import this directly without CPU copy
    DmaBuf {
        fd: OwnedFd,
        width: u32,
        height: u32,
        stride: u32,
        format: DrmFourcc,
        modifier: DrmModifier,
    },
}

A CPU-based encoder (zstd diff, software x264) processes FrameSource::Cpu. A hardware encoder (VAAPI, NVENC) prefers FrameSource::DmaBuf (zero-copy) but can also accept FrameSource::Cpu (upload first, slower).
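The dispatch a hardware encoder would perform can be sketched as a match on the variant. This is a minimal illustration with the ADR's types pared down (no stride, format, or DRM modifier fields); `encode_path` is a hypothetical helper, not part of the proposed trait:

```rust
/// Simplified stand-ins for the ADR's FrameSource; PixelFormat, stride,
/// and the DRM types are omitted for brevity.
enum FrameSource<'a> {
    Cpu { data: &'a [u8], width: u32, height: u32 },
    DmaBuf { fd: i32, width: u32, height: u32 },
}

/// How a hypothetical hardware encoder dispatches on the source:
/// zero-copy import for DmaBuf, an extra upload step for Cpu.
fn encode_path(src: &FrameSource<'_>) -> &'static str {
    match src {
        FrameSource::DmaBuf { .. } => "zero-copy: import dmabuf into encoder",
        FrameSource::Cpu { .. } => "slow path: upload pixels to GPU first",
    }
}
```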

2. Encoder Trait

trait FrameEncoder: Send {
    /// What frame sources this encoder supports
    fn supported_sources(&self) -> &[FrameSourceType];

    /// Encode a frame, returning encoded data for transmission.
    /// `damage` hints which regions changed (encoder may ignore if doing full-frame).
    fn encode(
        &mut self,
        source: FrameSource<'_>,
        damage: &[Rectangle],
        frame_seq: u64,
    ) -> Result<EncodedFrame>;

    /// Signal that encoding parameters should adapt
    /// (e.g., client reported decode time too high)
    fn adapt(&mut self, feedback: EncoderFeedback);
}

enum FrameSourceType {
    Cpu,
    DmaBuf,
}

Planned encoder implementations:

| Encoder | Accepts | Use case |
| --- | --- | --- |
| ZstdDiffEncoder | Cpu | Desktop, lossless text, LAN |
| SoftwareH264Encoder (x264) | Cpu | Desktop, lossy, WAN |
| VaapiH264Encoder | DmaBuf (preferred), Cpu | GPU-accelerated, gaming |
| NvencH264Encoder | DmaBuf (preferred), Cpu | NVIDIA GPU, gaming |
| VaapiAv1Encoder | DmaBuf (preferred), Cpu | Future, better quality |
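With `supported_sources()` in the trait, encoder selection reduces to scanning a priority-ordered list for the first encoder that accepts what the renderer can produce. A hedged sketch (the `select_encoder` helper and the list shape are illustrative, not part of the ADR's interface):

```rust
#[derive(Clone, Copy, PartialEq, Debug)]
enum FrameSourceType {
    Cpu,
    DmaBuf,
}

/// Walk a priority-ordered encoder list and return the first whose
/// supported sources include what the renderer can produce.
fn select_encoder<'a>(
    encoders: &[(&'a str, &[FrameSourceType])],
    available: FrameSourceType,
) -> Option<&'a str> {
    encoders
        .iter()
        .find(|(_, supported)| supported.contains(&available))
        .map(|(name, _)| *name)
}
```

For example, with VAAPI listed before zstd, a DMA-BUF-capable renderer picks the hardware encoder while a CPU-only zone still finds a match further down the list.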

3. Pipelined Frame Submission

The render→encode→send pipeline must be asynchronous and pipelined, not synchronous:

Synchronous (WRONG -- blocks on encode):
  Frame N:   [render]──[encode]──[send]
  Frame N+1:                            [render]──[encode]──[send]

Pipelined (RIGHT -- overlap work):
  Frame N:   [render]──[encode]──[send]
  Frame N+1:        [render]──[encode]──[send]
  Frame N+2:             [render]──[encode]──[send]

Implementation: the compositor submits frames to an encoding channel. The encoder runs on a separate thread (or GPU async queue). Encoded frames are submitted to the QUIC sender. Each stage can work on a different frame simultaneously.

// Compositor render loop (calloop)
fn on_frame_rendered(&mut self, frame: FrameSource, damage: Vec<Rectangle>) {
    // Non-blocking: submit to the encoder channel. If the encoder has
    // shut down, dropping the frame beats stalling the compositor.
    let _ = self.encoder_tx.send(EncodeRequest {
        frame,
        damage,
        seq: self.next_seq(),
    });
}

// Encoder thread: owns the encoder and drains the request channel
fn encoder_loop(
    mut encoder: impl FrameEncoder,
    rx: Receiver<EncodeRequest>,
    tx: Sender<EncodedFrame>,
) {
    for req in rx {
        // Skip frames that fail to encode rather than killing the thread
        if let Ok(encoded) = encoder.encode(req.frame, &req.damage, req.seq) {
            let _ = tx.send(encoded); // hand off to the QUIC sender
        }
    }
}
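The overlap can be demonstrated with plain std channels. This is a minimal stand-in where "encoding" is just a tag transform rather than a real `FrameEncoder`; the point is only that the render loop never blocks on encode:

```rust
use std::sync::mpsc;
use std::thread;

/// Minimal render -> encode -> send pipeline sketch using std channels.
/// Frame payloads are u64 sequence numbers for illustration.
fn run_pipeline(frames: Vec<u64>) -> Vec<u64> {
    let (enc_tx, enc_rx) = mpsc::channel::<u64>();
    let (net_tx, net_rx) = mpsc::channel::<u64>();

    // Encoder stage: runs concurrently with the render loop below.
    let encoder = thread::spawn(move || {
        for seq in enc_rx {
            // Pretend-encode: tag the frame so the test can see it passed through
            net_tx.send(seq * 10).unwrap();
        }
        // net_tx drops here, closing the network channel
    });

    // "Render loop": submits frames without waiting for encode to finish
    for seq in frames {
        enc_tx.send(seq).unwrap();
    }
    drop(enc_tx); // close the channel so the encoder thread exits

    let out: Vec<u64> = net_rx.iter().collect();
    encoder.join().unwrap();
    out
}
```

Because each stage owns one end of a channel, frames stay ordered per sender while the stages run in parallel, which is exactly the property the diagram above describes.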

4. Adaptive Frame Cadence

Two scheduling modes, selectable at runtime:

enum FrameCadence {
    /// Render only when Wayland clients commit new content.
    /// Coalesce rapid commits within a window (e.g., 4ms).
    /// Ideal for desktop: saves CPU when nothing changes.
    OnDamage { coalesce_ms: u32 },

    /// Render at a fixed rate regardless of client commits.
    /// Resubmit the last frame if nothing changed (keeps encoder fed).
    /// Needed for gaming: consistent framerate, no stutter.
    Fixed { fps: u32 },
}

The compositor defaults to OnDamage. When a surface is detected as "high frame rate" (e.g., commits faster than 30fps consistently), the compositor can switch to Fixed mode for that output. This could also be triggered by configuration (wradm session set-cadence gaming).
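The switch-over heuristic could look roughly like this; the 30-commit threshold, 60 fps target, and 4 ms coalesce window are illustrative numbers taken from the text above, not mandated values:

```rust
/// Sketch of the runtime cadence switch described above.
#[derive(Debug, PartialEq)]
enum FrameCadence {
    OnDamage { coalesce_ms: u32 },
    Fixed { fps: u32 },
}

fn pick_cadence(commits_last_second: u32) -> FrameCadence {
    if commits_last_second > 30 {
        // Surface behaves like a game or video: keep the encoder fed
        FrameCadence::Fixed { fps: 60 }
    } else {
        // Desktop idle/typing: render only on damage, coalesce bursts
        FrameCadence::OnDamage { coalesce_ms: 4 }
    }
}
```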

5. Damage Tracking Must Support Both Paths

For CPU buffers, we diff pixels directly (XOR previous frame, zstd compress diff). For GPU buffers, we can't cheaply read pixels for diffing. Instead:

  • Wayland damage regions: Every wl_surface.commit() includes damage hints from the client. These are already tracked by Smithay's OutputDamageTracker.
  • Full-frame encoding: Hardware encoders (H.264, AV1) handle their own temporal redundancy via P-frames and reference frames. They don't need pixel-level diffs from us.

So the damage path forks:

CPU path:  pixel diff (XOR + zstd)  → needs pixel access    → FrameSource::Cpu
GPU path:  Wayland damage hints     → no pixel access needed → FrameSource::DmaBuf
           + H.264 temporal coding    encoder handles the rest

Both paths are valid. The encoder trait abstracts this -- a zstd encoder uses damage for pixel diff regions, a VAAPI encoder uses damage to decide keyframe frequency or quality.
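The CPU side of the fork is cheap to illustrate. XOR-ing the new frame against the previous one zeroes every unchanged byte, and runs of zeros are what zstd then compresses to almost nothing (the zstd step itself is omitted here):

```rust
/// CPU-path sketch: XOR the new frame against the previous one.
/// Unchanged pixels become zero bytes; only changed regions carry entropy,
/// so the subsequent zstd pass compresses idle-desktop frames extremely well.
fn xor_diff(prev: &[u8], next: &[u8]) -> Vec<u8> {
    assert_eq!(prev.len(), next.len(), "frames must share the same layout");
    prev.iter().zip(next).map(|(a, b)| a ^ b).collect()
}
```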

6. GPU Partitioning (Infrastructure, Not WayRay)

For multi-session gaming, the GPU must be shared:

| Technology | What it does | Platform |
| --- | --- | --- |
| SR-IOV | Hardware GPU partitioning | Intel (Flex/Arc), some AMD |
| NVIDIA MIG | Multi-Instance GPU slicing | NVIDIA A100/A30/H100 |
| NVIDIA vGPU | Virtual GPU via hypervisor | NVIDIA datacenter GPUs |
| Intel GVT-g | Mediated GPU passthrough | Intel integrated GPUs |
| Virtio-GPU | Paravirtualized GPU for VMs | QEMU/KVM |

This is infrastructure configuration, not compositor code. WayRay just needs to see a /dev/dri/renderD* device in its environment. Whether that device is a physical GPU, an SR-IOV VF, a MIG slice, or a vGPU instance is transparent.

WayRay's responsibility: detect available GPU, use it if present, fall back gracefully if not. Already covered by ADR-005 (dual renderer).
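Detection itself is little more than a directory scan. A hedged sketch (real code would also open the node and query capabilities, per ADR-005; `has_render_node` is an illustrative helper, not an existing WayRay function):

```rust
use std::path::Path;

/// Look for a DRM render node (renderD128, renderD129, ...) under the
/// given directory, typically /dev/dri. Returns false if the directory
/// is missing or contains no render nodes, triggering the CPU fallback.
fn has_render_node(dri_dir: &Path) -> bool {
    std::fs::read_dir(dri_dir)
        .map(|entries| {
            entries
                .flatten()
                .any(|e| e.file_name().to_string_lossy().starts_with("renderD"))
        })
        .unwrap_or(false)
}
```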

What We Build Now vs Later

Build Now (Phase 0-2)

  • FrameSource enum with Cpu variant only (no DmaBuf yet)
  • FrameEncoder trait with the full interface (including supported_sources)
  • ZstdDiffEncoder and SoftwareH264Encoder (Cpu only)
  • Pipelined encode channel (encoder runs on separate thread)
  • FrameCadence::OnDamage as default
  • PixmanRenderer path

Build Later (when GPU gaming is prioritized)

  • Add FrameSource::DmaBuf variant
  • VaapiH264Encoder / NvencEncoder (import DMA-BUF, zero-copy encode)
  • FrameCadence::Fixed mode
  • GlesRenderer path with DMA-BUF export to encoder
  • Client-side frame interpolation for network jitter compensation
  • Input prediction (client-side speculative input processing)

Never Build (out of scope)

  • GPU partitioning infrastructure (SR-IOV, MIG setup)
  • Game streaming as a separate product (WayRay is a compositor, not GeForce NOW)
  • Client-side GPU rendering (the client stays dumb)

Rationale

  • Trait-based encoder with FrameSource enum is the critical abstraction. It costs nothing now (we only implement the Cpu variant) but preserves the zero-copy DMA-BUF path for later.
  • Pipelined encoding improves desktop performance too (compositor doesn't block on encode), so it's not wasted work.
  • Adaptive frame cadence is a minor addition that enables gaming without disrupting desktop behavior.
  • No premature GPU code: we don't write VAAPI/NVENC encoders now. We just ensure the interfaces don't preclude them.

Consequences

  • FrameSource enum adds one level of indirection to the encoding path (trivial cost)
  • Pipelined encoding adds threading complexity (but improves performance)
  • Must resist the temptation to simplify the encoder trait to fn encode(&[u8]) -> Vec<u8> -- the DmaBuf variant must stay in the design even if unimplemented
  • Documentation must explain the GPU future path so contributors don't accidentally close the door