Summary

Ensure today's CPU-first design doesn't block a future zero-copy GPU encoding path (DMA-BUF → VAAPI/NVENC, sub-5ms encode latency). Key design constraints:

- FrameSource enum: Cpu and DmaBuf variants (build Cpu now, DmaBuf later)
- FrameEncoder trait: accepts both, encoder picks preferred source
- Pipelined render→encode→send (no synchronous blocking)
- Adaptive frame cadence: OnDamage (desktop) vs Fixed (gaming)
- Damage tracking forks: pixel diff for CPU, Wayland hints for GPU

No GPU code built now -- just trait interfaces that preserve the path.
ADR-016: GPU Pipeline Future-Proofing (Cloud Gaming Path)
Status
Accepted
Context
WayRay's current design prioritizes the no-GPU headless path (PixmanRenderer + CPU encoding). This is correct for the primary thin client use case. However, a cloud gaming scenario -- GPU-rendered games streamed to thin clients -- is a natural future extension. We must ensure today's design decisions don't close that door.
The Cloud Gaming Pipeline
Current (CPU, works in zones):

```
App → wl_shm (CPU) → Pixman composite (CPU) → memcpy → CPU encode → QUIC
  Latency: ~15-30ms encode
  Quality: good for desktop, too slow for gaming
```

Future (GPU, zero-copy):

```
Game → DMA-BUF (GPU) → GLES composite (GPU) → DMA-BUF → VAAPI/NVENC (GPU) → QUIC
  Latency: ~3-5ms encode
  Quality: 1080p60 or 4K60 with hardware encoding
  CPU: barely involved
```
The zero-copy GPU path means pixels never touch CPU memory: the game renders to a DMA-BUF, the compositor composites into another DMA-BUF, and the hardware encoder reads that DMA-BUF directly. This is how Sunshine/Moonlight, GeForce NOW, and Steam Remote Play achieve sub-20ms motion-to-photon latency.
What Could Block This
If we make any of these mistakes now, GPU encoding becomes a painful retrofit later:
| Mistake | Why it blocks GPU path |
|---|---|
| Encoder takes only `&[u8]` pixel data | Forces GPU→CPU readback, kills zero-copy |
| Frame pipeline is synchronous | Can't pipeline render→encode→send |
| Damage tracking assumes CPU-accessible buffers | Can't diff GPU-resident frames |
| Frame cadence is damage-only | Gaming needs consistent 60/120fps |
| Compositor hardcodes PixmanRenderer | Can't composite on GPU |
Decision
Design constraints to preserve the GPU path:
1. Frame Source Abstraction
The encoder must accept both CPU buffers and GPU buffer handles:
```rust
/// A frame ready for encoding. Can be CPU pixels or a GPU buffer handle.
enum FrameSource<'a> {
    /// CPU-accessible pixel data (from PixmanRenderer / ExportMem)
    Cpu {
        data: &'a [u8],
        width: u32,
        height: u32,
        stride: u32,
        format: PixelFormat,
    },
    /// GPU buffer handle (from GlesRenderer / DRM).
    /// Encoder can import this directly without CPU copy.
    DmaBuf {
        fd: OwnedFd,
        width: u32,
        height: u32,
        stride: u32,
        format: DrmFourcc,
        modifier: DrmModifier,
    },
}
```
A CPU-based encoder (zstd diff, software x264) processes `FrameSource::Cpu`.
A hardware encoder (VAAPI, NVENC) prefers `FrameSource::DmaBuf` (zero-copy) but can also accept `FrameSource::Cpu` (upload first, slower).
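The selection logic this implies can be sketched as a small negotiation step: the compositor offers the source types it can produce, and the first one the encoder supports wins. This is a minimal sketch with simplified stand-in types (`FrameSourceType` mirrors the enum below; the `negotiate` helper is hypothetical, not part of WayRay's actual API):

```rust
// Simplified stand-in for the FrameSourceType enum described in this ADR.
#[derive(Debug, Clone, Copy, PartialEq)]
enum FrameSourceType {
    Cpu,
    DmaBuf,
}

/// Pick the first source type the encoder supports, in the order the
/// compositor can produce them (DmaBuf listed first when a GPU path exists).
fn negotiate(
    available: &[FrameSourceType],
    supported: &[FrameSourceType],
) -> Option<FrameSourceType> {
    available.iter().copied().find(|s| supported.contains(s))
}

fn main() {
    // A VAAPI-style encoder supports both; with a GPU present, DmaBuf wins.
    let vaapi = [FrameSourceType::DmaBuf, FrameSourceType::Cpu];
    let offered = [FrameSourceType::DmaBuf, FrameSourceType::Cpu];
    assert_eq!(negotiate(&offered, &vaapi), Some(FrameSourceType::DmaBuf));

    // A zstd-style encoder supports only Cpu; the headless path still works.
    let zstd = [FrameSourceType::Cpu];
    assert_eq!(negotiate(&[FrameSourceType::Cpu], &zstd), Some(FrameSourceType::Cpu));
}
```

The point of the ordering is that the compositor, not the encoder, expresses preference: today it only ever offers `Cpu`, and adding `DmaBuf` later changes the offer list, not the trait.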
2. Encoder Trait
```rust
trait FrameEncoder: Send {
    /// What frame sources this encoder supports
    fn supported_sources(&self) -> &[FrameSourceType];

    /// Encode a frame, returning encoded data for transmission.
    /// `damage` hints which regions changed (encoder may ignore if doing full-frame).
    fn encode(
        &mut self,
        source: FrameSource<'_>,
        damage: &[Rectangle],
        frame_seq: u64,
    ) -> Result<EncodedFrame>;

    /// Signal that encoding parameters should adapt
    /// (e.g., client reported decode time too high)
    fn adapt(&mut self, feedback: EncoderFeedback);
}

enum FrameSourceType {
    Cpu,
    DmaBuf,
}
```
Planned encoder implementations:
| Encoder | Accepts | Use Case |
|---|---|---|
| `ZstdDiffEncoder` | `Cpu` | Desktop, lossless text, LAN |
| `SoftwareH264Encoder` (x264) | `Cpu` | Desktop, lossy, WAN |
| `VaapiH264Encoder` | `DmaBuf` (preferred), `Cpu` | GPU-accelerated, gaming |
| `NvencH264Encoder` | `DmaBuf` (preferred), `Cpu` | NVIDIA GPU, gaming |
| `VaapiAv1Encoder` | `DmaBuf` (preferred), `Cpu` | Future, better quality |
3. Pipelined Frame Submission
The render→encode→send pipeline must be asynchronous and pipelined, not synchronous:
Synchronous (WRONG -- blocks on encode):

```
Frame N:   [render]──[encode]──[send]
Frame N+1:                           [render]──[encode]──[send]
```

Pipelined (RIGHT -- overlap work):

```
Frame N:   [render]──[encode]──[send]
Frame N+1:           [render]──[encode]──[send]
Frame N+2:                     [render]──[encode]──[send]
```
Implementation: the compositor submits frames to an encoding channel. The encoder runs on a separate thread (or GPU async queue). Encoded frames are submitted to the QUIC sender. Each stage can work on a different frame simultaneously.
```rust
// Compositor render loop (calloop)
fn on_frame_rendered(&mut self, frame: FrameSource<'_>, damage: Vec<Rectangle>) {
    // Non-blocking: submit to encoder channel
    let seq = self.next_seq();
    self.encoder_tx.send(EncodeRequest { frame, damage, seq });
}

// Encoder thread: owns the encoder, drains the request channel
fn encoder_loop(
    mut encoder: Box<dyn FrameEncoder>,
    rx: Receiver<EncodeRequest>,
    tx: Sender<EncodedFrame>,
) -> Result<()> {
    for req in rx {
        let encoded = encoder.encode(req.frame, &req.damage, req.seq)?;
        tx.send(encoded)?; // Submit to network sender
    }
    Ok(())
}
```
4. Adaptive Frame Cadence
Two scheduling modes, selectable at runtime:
```rust
enum FrameCadence {
    /// Render only when Wayland clients commit new content.
    /// Coalesce rapid commits within a window (e.g., 4ms).
    /// Ideal for desktop: saves CPU when nothing changes.
    OnDamage { coalesce_ms: u32 },

    /// Render at a fixed rate regardless of client commits.
    /// Resubmit the last frame if nothing changed (keeps encoder fed).
    /// Needed for gaming: consistent framerate, no stutter.
    Fixed { fps: u32 },
}
```
The compositor defaults to `OnDamage`. When a surface is detected as "high frame rate" (e.g., commits faster than 30fps consistently), the compositor can switch to `Fixed` mode for that output. This could also be triggered by configuration (`wradm session set-cadence gaming`).
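The "commits faster than 30fps consistently" heuristic can be sketched as a pure decision function over a commit-rate sample. This is a minimal sketch; the threshold, window, and target fps here are illustrative values, and the real policy would be configurable:

```rust
// Simplified stand-in for the FrameCadence enum described in this ADR.
#[derive(Debug, PartialEq)]
enum FrameCadence {
    OnDamage { coalesce_ms: u32 },
    Fixed { fps: u32 },
}

/// If a surface sustained more than 30 commits/sec over the sample window,
/// treat it as high frame rate and switch the output to Fixed cadence.
/// (Illustrative thresholds: 30 commits/sec trigger, 60fps target.)
fn pick_cadence(commits_in_window: u32, window_secs: u32) -> FrameCadence {
    if commits_in_window > 30 * window_secs {
        FrameCadence::Fixed { fps: 60 }
    } else {
        FrameCadence::OnDamage { coalesce_ms: 4 }
    }
}

fn main() {
    // A game committing at ~60fps over a 5s window → Fixed cadence.
    assert_eq!(pick_cadence(300, 5), FrameCadence::Fixed { fps: 60 });
    // An idle desktop surface → stay on damage-driven rendering.
    assert_eq!(pick_cadence(10, 5), FrameCadence::OnDamage { coalesce_ms: 4 });
}
```

Keeping this a pure function of observed commit rate makes the switch easy to test and easy to override from configuration.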
5. Damage Tracking Must Support Both Paths
For CPU buffers, we diff pixels directly (XOR previous frame, zstd compress diff). For GPU buffers, we can't cheaply read pixels for diffing. Instead:
- Wayland damage regions: Every `wl_surface.commit()` includes damage hints from the client. These are already tracked by Smithay's `OutputDamageTracker`.
- Full-frame encoding: Hardware encoders (H.264, AV1) handle their own temporal redundancy via P-frames and reference frames. They don't need pixel-level diffs from us.
So the damage path forks:
- CPU path: pixel diff (XOR + zstd) → needs pixel access → `FrameSource::Cpu`
- GPU path: Wayland damage hints → no pixel access needed → `FrameSource::DmaBuf`; the encoder's temporal coding (H.264 P-frames) handles the rest
Both paths are valid. The encoder trait abstracts this -- a zstd encoder uses damage for pixel diff regions, a VAAPI encoder uses damage to decide keyframe frequency or quality.
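The CPU-path diff is worth making concrete, since it is why `FrameSource::Cpu` must expose raw pixels. This is a minimal sketch over a flat byte buffer (the real encoder would diff per damage rectangle and feed the result to zstd):

```rust
/// XOR two equal-length pixel buffers. Unchanged bytes XOR to zero, so the
/// result is mostly zeros for small damage regions and compresses extremely
/// well. XOR is also its own inverse, so the client applies the same
/// operation to reconstruct the new frame.
fn xor_diff(prev: &[u8], curr: &[u8]) -> Vec<u8> {
    prev.iter().zip(curr).map(|(a, b)| a ^ b).collect()
}

fn main() {
    let prev = [0x10, 0x20, 0x30, 0x40];
    let curr = [0x10, 0x21, 0x30, 0x40]; // one byte changed

    let diff = xor_diff(&prev, &curr);
    assert_eq!(diff, vec![0x00, 0x01, 0x00, 0x00]); // sparse → zstd-friendly

    // Client side: XOR the diff back onto the previous frame.
    let restored = xor_diff(&prev, &diff);
    assert_eq!(restored, curr.to_vec());
}
```

Nothing analogous is possible on the GPU path without a readback, which is exactly why that path leans on Wayland damage hints and the codec's own temporal coding instead.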
6. GPU Partitioning (Infrastructure, Not WayRay)
For multi-session gaming, the GPU must be shared:
| Technology | What it does | Platform |
|---|---|---|
| SR-IOV | Hardware GPU partitioning | Intel (Flex/Arc), some AMD |
| NVIDIA MIG | Multi-Instance GPU slicing | NVIDIA A100/A30/H100 |
| NVIDIA vGPU | Virtual GPU via hypervisor | NVIDIA datacenter GPUs |
| Intel GVT-g | Mediated GPU passthrough | Intel integrated GPUs |
| Virtio-GPU | Paravirtualized GPU for VMs | QEMU/KVM |
This is infrastructure configuration, not compositor code. WayRay just needs to see a /dev/dri/renderD* device in its environment. Whether that device is a physical GPU, an SR-IOV VF, a MIG slice, or a vGPU instance is transparent.
WayRay's responsibility: detect available GPU, use it if present, fall back gracefully if not. Already covered by ADR-005 (dual renderer).
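The "detect available GPU" step reduces to probing for a DRM render node. This is a sketch of one plausible probe (the `find_render_node` helper is hypothetical; actual renderer selection is governed by ADR-005):

```rust
use std::path::PathBuf;

/// Return the first /dev/dri/renderD* node, if any. Whether that node is a
/// physical GPU, an SR-IOV VF, a MIG slice, or a vGPU instance is invisible
/// at this layer -- any render node enables the GPU path.
fn find_render_node() -> Option<PathBuf> {
    let dir = std::fs::read_dir("/dev/dri").ok()?;
    let mut nodes: Vec<PathBuf> = dir
        .filter_map(|e| e.ok())
        .map(|e| e.path())
        .filter(|p| {
            p.file_name()
                .and_then(|n| n.to_str())
                .map_or(false, |n| n.starts_with("renderD"))
        })
        .collect();
    nodes.sort();
    nodes.into_iter().next()
}

fn main() {
    match find_render_node() {
        Some(node) => println!("GPU path available via {}", node.display()),
        None => println!("No render node; falling back to PixmanRenderer"),
    }
}
```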
What We Build Now vs Later
Build Now (Phase 0-2)
- `FrameSource` enum with `Cpu` variant only (no DmaBuf yet)
- `FrameEncoder` trait with the full interface (including `supported_sources`)
- `ZstdDiffEncoder` and `SoftwareH264Encoder` (Cpu only)
- Pipelined encode channel (encoder runs on separate thread)
- `FrameCadence::OnDamage` as default
- PixmanRenderer path
Build Later (when GPU gaming is prioritized)
- Add `FrameSource::DmaBuf` variant
- `VaapiH264Encoder` / `NvencEncoder` (import DMA-BUF, zero-copy encode)
- `FrameCadence::Fixed` mode
- GlesRenderer path with DMA-BUF export to encoder
- Client-side frame interpolation for network jitter compensation
- Input prediction (client-side speculative input processing)
Never Build (out of scope)
- GPU partitioning infrastructure (SR-IOV, MIG setup)
- Game streaming as a separate product (WayRay is a compositor, not GeForce NOW)
- Client-side GPU rendering (the client stays dumb)
Rationale
- Trait-based encoder with `FrameSource` enum is the critical abstraction. It costs nothing now (we only implement the Cpu variant) but preserves the zero-copy DMA-BUF path for later.
- Pipelined encoding improves desktop performance too (compositor doesn't block on encode), so it's not wasted work.
- Adaptive frame cadence is a minor addition that enables gaming without disrupting desktop behavior.
- No premature GPU code: we don't write VAAPI/NVENC encoders now. We just ensure the interfaces don't preclude them.
Consequences
- `FrameSource` enum adds one level of indirection to the encoding path (trivial cost)
- Pipelined encoding adds threading complexity (but improves performance)
- Must resist the temptation to simplify the encoder trait to `fn encode(&[u8]) -> Vec<u8>` -- the DmaBuf variant must stay in the design even if unimplemented
- Documentation must explain the GPU future path so contributors don't accidentally close the door