mirror of
https://github.com/CloudNebulaProject/wayray.git
synced 2026-04-10 21:20:40 +00:00
Add ADR-016: GPU pipeline future-proofing for cloud gaming
Ensure today's CPU-first design doesn't block a future zero-copy GPU encoding path (DMA-BUF → VAAPI/NVENC, sub-5ms encode latency). Key design constraints:

- FrameSource enum: Cpu and DmaBuf variants (build Cpu now, DmaBuf later)
- FrameEncoder trait: accepts both; the encoder picks its preferred source
- Pipelined render→encode→send (no synchronous blocking)
- Adaptive frame cadence: OnDamage (desktop) vs Fixed (gaming)
- Damage tracking forks: pixel diff for CPU, Wayland hints for GPU

No GPU code built now -- just trait interfaces that preserve the path.
This commit is contained in:
parent
643c4f042d
commit
f702cbf4e7
1 changed file with 233 additions and 0 deletions
233
docs/ai/adr/016-gpu-pipeline-future-proofing.md
Normal file
# ADR-016: GPU Pipeline Future-Proofing (Cloud Gaming Path)

## Status

Accepted

## Context

WayRay's current design prioritizes the no-GPU headless path (PixmanRenderer + CPU encoding). This is correct for the primary thin client use case. However, a cloud gaming scenario -- GPU-rendered games streamed to thin clients -- is a natural future extension. We must ensure today's design decisions don't close that door.

### The Cloud Gaming Pipeline

```
Current (CPU, works in zones):
App → wl_shm (CPU) → Pixman composite (CPU) → memcpy → CPU encode → QUIC
Latency: ~15-30ms encode
Quality: good for desktop, too slow for gaming

Future (GPU, zero-copy):
Game → DMA-BUF (GPU) → GLES composite (GPU) → DMA-BUF → VAAPI/NVENC (GPU) → QUIC
Latency: ~3-5ms encode
Quality: 1080p60 or 4K60 with hardware encoding
CPU: barely involved
```

The zero-copy GPU path means pixels never touch CPU memory: the game renders to a DMA-BUF, the compositor composites into another DMA-BUF, and the hardware encoder reads that DMA-BUF directly. This is how Sunshine/Moonlight, GeForce NOW, and Steam Remote Play achieve sub-20ms motion-to-photon latency.

### What Could Block This

If we make any of these mistakes now, GPU encoding becomes a painful retrofit later:

| Mistake | Why it blocks GPU path |
|---------|------------------------|
| Encoder takes only `&[u8]` pixel data | Forces GPU→CPU readback, kills zero-copy |
| Frame pipeline is synchronous | Can't pipeline render→encode→send |
| Damage tracking assumes CPU-accessible buffers | Can't diff GPU-resident frames |
| Frame cadence is damage-only | Gaming needs consistent 60/120fps |
| Compositor hardcodes PixmanRenderer | Can't composite on GPU |

## Decision

The following design constraints preserve the GPU path.

### 1. Frame Source Abstraction

The encoder must accept both CPU buffers and GPU buffer handles:

```rust
/// A frame ready for encoding. Can be CPU pixels or a GPU buffer handle.
enum FrameSource<'a> {
    /// CPU-accessible pixel data (from PixmanRenderer / ExportMem)
    Cpu {
        data: &'a [u8],
        width: u32,
        height: u32,
        stride: u32,
        format: PixelFormat,
    },

    /// GPU buffer handle (from GlesRenderer / DRM).
    /// Encoder can import this directly without a CPU copy.
    DmaBuf {
        fd: OwnedFd,
        width: u32,
        height: u32,
        stride: u32,
        format: DrmFourcc,
        modifier: DrmModifier,
    },
}
```

A CPU-based encoder (zstd diff, software x264) processes `FrameSource::Cpu`.
A hardware encoder (VAAPI, NVENC) prefers `FrameSource::DmaBuf` (zero-copy) but can also accept `FrameSource::Cpu` (upload first, slower).
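The dispatch can be exercised with a minimal, self-contained sketch. This is a simplification: the real variants carry `OwnedFd`, `PixelFormat`, `DrmFourcc`, and `DrmModifier`; here `fd` is a plain `i32` and the format fields are dropped, and the helper methods are illustrative, not part of the ADR's interface.

```rust
// Simplified stand-ins for the real FrameSource variants (see caveats above).
#[derive(Debug)]
enum FrameSource<'a> {
    Cpu { data: &'a [u8], height: u32, stride: u32 },
    DmaBuf { fd: i32, height: u32, stride: u32 },
}

impl FrameSource<'_> {
    /// Bytes a CPU copy of this frame occupies: stride × height.
    fn byte_size(&self) -> usize {
        match self {
            FrameSource::Cpu { height, stride, .. }
            | FrameSource::DmaBuf { height, stride, .. } => {
                (*stride as usize) * (*height as usize)
            }
        }
    }

    /// Which encode path a hardware encoder takes for this source.
    fn encode_path(&self) -> &'static str {
        match self {
            FrameSource::Cpu { .. } => "upload-then-encode (extra copy)",
            FrameSource::DmaBuf { .. } => "zero-copy import",
        }
    }
}

fn main() {
    let pixels = vec![0u8; 4 * 1920 * 1080]; // XRGB8888, 1080p
    let frame = FrameSource::Cpu { data: &pixels, height: 1080, stride: 4 * 1920 };
    println!("{} bytes via {}", frame.byte_size(), frame.encode_path());
    // → 8294400 bytes via upload-then-encode (extra copy)
}
```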

### 2. Encoder Trait

```rust
trait FrameEncoder: Send {
    /// Frame source types this encoder supports.
    fn supported_sources(&self) -> &[FrameSourceType];

    /// Encode a frame, returning encoded data for transmission.
    /// `damage` hints which regions changed (encoder may ignore it
    /// when doing full-frame encoding).
    fn encode(
        &mut self,
        source: FrameSource<'_>,
        damage: &[Rectangle],
        frame_seq: u64,
    ) -> Result<EncodedFrame>;

    /// Signal that encoding parameters should adapt
    /// (e.g., client reported decode time too high).
    fn adapt(&mut self, feedback: EncoderFeedback);
}

enum FrameSourceType {
    Cpu,
    DmaBuf,
}
```

Planned encoder implementations:

| Encoder | Accepts | Use Case |
|---------|---------|----------|
| `ZstdDiffEncoder` | Cpu | Desktop, lossless text, LAN |
| `SoftwareH264Encoder` (x264) | Cpu | Desktop, lossy, WAN |
| `VaapiH264Encoder` | DmaBuf (preferred), Cpu | GPU-accelerated, gaming |
| `NvencH264Encoder` | DmaBuf (preferred), Cpu | NVIDIA GPU, gaming |
| `VaapiAv1Encoder` | DmaBuf (preferred), Cpu | Future, better quality |
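`supported_sources` exists so the compositor can match an encoder to what the active renderer produces. A hedged sketch of that selection step (the `pick_encoder` helper and preference-ordered list are assumptions, not part of the ADR):

```rust
#[derive(Debug, PartialEq, Clone, Copy)]
enum FrameSourceType {
    Cpu,
    DmaBuf,
}

/// First encoder in preference order that accepts the source the
/// renderer can produce.
fn pick_encoder<'a>(
    produced: FrameSourceType,
    encoders: &[(&'a str, &[FrameSourceType])],
) -> Option<&'a str> {
    encoders
        .iter()
        .find(|(_, supported)| supported.contains(&produced))
        .map(|(name, _)| *name)
}

fn main() {
    // Preference order: hardware first, software fallback. Filtering out
    // hardware encoders when no /dev/dri device exists is assumed to
    // happen before this step.
    let encoders: &[(&str, &[FrameSourceType])] = &[
        ("VaapiH264Encoder", &[FrameSourceType::DmaBuf, FrameSourceType::Cpu][..]),
        ("SoftwareH264Encoder", &[FrameSourceType::Cpu][..]),
        ("ZstdDiffEncoder", &[FrameSourceType::Cpu][..]),
    ];
    // A GlesRenderer exporting DMA-BUFs lands on the hardware encoder.
    println!("{:?}", pick_encoder(FrameSourceType::DmaBuf, encoders)); // Some("VaapiH264Encoder")
}
```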

### 3. Pipelined Frame Submission

The render→encode→send pipeline must be **asynchronous and pipelined**, not synchronous:

```
Synchronous (WRONG -- blocks on encode):
Frame N:   [render]──[encode]──[send]
Frame N+1:                            [render]──[encode]──[send]

Pipelined (RIGHT -- overlap work):
Frame N:   [render]──[encode]──[send]
Frame N+1:           [render]──[encode]──[send]
Frame N+2:                     [render]──[encode]──[send]
```

Implementation: the compositor submits frames to an encoding channel. The encoder runs on a separate thread (or GPU async queue). Encoded frames are submitted to the QUIC sender. Each stage can work on a different frame simultaneously.

```rust
// Compositor render loop (calloop)
fn on_frame_rendered(&mut self, frame: FrameSource, damage: Vec<Rectangle>) {
    // Non-blocking: submit to the encoder channel
    let _ = self.encoder_tx.send(EncodeRequest { frame, damage, seq: self.next_seq() });
}

// Encoder thread: owns the encoder and drains the request channel
fn encoder_loop(
    mut encoder: Box<dyn FrameEncoder>,
    rx: Receiver<EncodeRequest>,
    tx: Sender<EncodedFrame>,
) {
    for req in rx {
        match encoder.encode(req.frame, &req.damage, req.seq) {
            // Submit to the network sender
            Ok(encoded) => { let _ = tx.send(encoded); }
            // Dropping a frame is acceptable; log and keep going
            Err(e) => eprintln!("encode failed for frame {}: {e}", req.seq),
        }
    }
}
```

### 4. Adaptive Frame Cadence

Two scheduling modes, selectable at runtime:

```rust
enum FrameCadence {
    /// Render only when Wayland clients commit new content.
    /// Coalesce rapid commits within a window (e.g., 4ms).
    /// Ideal for desktop: saves CPU when nothing changes.
    OnDamage { coalesce_ms: u32 },

    /// Render at a fixed rate regardless of client commits.
    /// Resubmit the last frame if nothing changed (keeps the encoder fed).
    /// Needed for gaming: consistent framerate, no stutter.
    Fixed { fps: u32 },
}
```

The compositor defaults to `OnDamage`. When a surface is detected as "high frame rate" (e.g., it consistently commits faster than 30fps), the compositor can switch that output to `Fixed` mode. This could also be triggered by configuration (`wradm session set-cadence gaming`).
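The switch-over logic can be sketched as follows; the 30 commits/sec threshold and 60fps target come from the prose above, but the helper names and the one-second measurement window are assumptions, not settled policy.

```rust
// Sketch of the cadence decision (hypothetical helpers, assumed thresholds).
enum FrameCadence {
    OnDamage { coalesce_ms: u32 },
    Fixed { fps: u32 },
}

impl FrameCadence {
    /// Nanoseconds between renders in Fixed mode; None means damage-driven.
    fn frame_interval_ns(&self) -> Option<u64> {
        match self {
            FrameCadence::Fixed { fps } => Some(1_000_000_000 / u64::from(*fps)),
            FrameCadence::OnDamage { .. } => None,
        }
    }
}

/// Promote an output to Fixed cadence when its surfaces sustain a high commit rate.
fn choose_cadence(commits_last_second: u32) -> FrameCadence {
    if commits_last_second > 30 {
        FrameCadence::Fixed { fps: 60 }
    } else {
        FrameCadence::OnDamage { coalesce_ms: 4 }
    }
}

fn main() {
    // A game committing ~58 times/sec gets a fixed 60fps schedule.
    println!("{:?}", choose_cadence(58).frame_interval_ns()); // Some(16666666)
}
```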

### 5. Damage Tracking Must Support Both Paths

For CPU buffers, we diff pixels directly (XOR against the previous frame, zstd-compress the diff).
For GPU buffers, we can't cheaply read pixels back for diffing. Instead:

- **Wayland damage regions**: every `wl_surface.commit()` includes damage hints from the client. These are already tracked by Smithay's `OutputDamageTracker`.
- **Full-frame encoding**: hardware encoders (H.264, AV1) handle their own temporal redundancy via P-frames and reference frames. They don't need pixel-level diffs from us.

So the damage path forks:

```
CPU path: pixel diff (XOR + zstd) → needs pixel access → FrameSource::Cpu
GPU path: Wayland damage hints → no pixel access needed → FrameSource::DmaBuf
          + H.264 temporal coding (the encoder handles the rest)
```

Both paths are valid. The encoder trait abstracts this -- a zstd encoder uses damage for pixel diff regions, a VAAPI encoder uses damage to decide keyframe frequency or quality.
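The CPU-path idea is easy to see in miniature: XOR leaves unchanged bytes as zero, which is exactly what makes the output highly compressible. A toy sketch (the real pipeline diffs and compresses only the damaged regions, not whole buffers):

```rust
// Toy of the CPU damage path: XOR current frame against the previous one.
fn xor_diff(prev: &[u8], cur: &[u8]) -> Vec<u8> {
    prev.iter().zip(cur).map(|(p, c)| p ^ c).collect()
}

fn main() {
    let prev = [0u8, 10, 20, 30];
    let cur = [0u8, 10, 99, 30];
    let diff = xor_diff(&prev, &cur);
    // Unchanged bytes XOR to zero; only real changes survive,
    // so zstd sees long zero runs and compresses the diff well.
    let changed = diff.iter().filter(|b| **b != 0).count();
    println!("{changed} byte(s) changed"); // 1 byte(s) changed
}
```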

### 6. GPU Partitioning (Infrastructure, Not WayRay)

For multi-session gaming, the GPU must be shared:

| Technology | What it does | Platform |
|------------|--------------|----------|
| SR-IOV | Hardware GPU partitioning | Intel (Flex/Arc), some AMD |
| NVIDIA MIG | Multi-Instance GPU slicing | NVIDIA A100/A30/H100 |
| NVIDIA vGPU | Virtual GPU via hypervisor | NVIDIA datacenter GPUs |
| Intel GVT-g | Mediated GPU passthrough | Intel integrated GPUs |
| Virtio-GPU | Paravirtualized GPU for VMs | QEMU/KVM |

This is infrastructure configuration, not compositor code. WayRay just needs to see a `/dev/dri/renderD*` device in its environment. Whether that device is a physical GPU, an SR-IOV VF, a MIG slice, or a vGPU instance is transparent.

**WayRay's responsibility**: detect an available GPU, use it if present, and fall back gracefully if not. Already covered by ADR-005 (dual renderer).

## What We Build Now vs Later

### Build Now (Phase 0-2)

- `FrameSource` enum with the `Cpu` variant only (no DmaBuf yet)
- `FrameEncoder` trait with the full interface (including `supported_sources`)
- `ZstdDiffEncoder` and `SoftwareH264Encoder` (Cpu only)
- Pipelined encode channel (encoder runs on a separate thread)
- `FrameCadence::OnDamage` as the default
- PixmanRenderer path

### Build Later (when GPU gaming is prioritized)

- Add the `FrameSource::DmaBuf` variant
- `VaapiH264Encoder` / `NvencEncoder` (import DMA-BUF, zero-copy encode)
- `FrameCadence::Fixed` mode
- GlesRenderer path with DMA-BUF export to the encoder
- Client-side frame interpolation for network jitter compensation
- Input prediction (client-side speculative input processing)

### Never Build (out of scope)

- GPU partitioning infrastructure (SR-IOV, MIG setup)
- Game streaming as a separate product (WayRay is a compositor, not GeForce NOW)
- Client-side GPU rendering (the client stays dumb)

## Rationale

- The **trait-based encoder** with the `FrameSource` enum is the critical abstraction. It costs nothing now (we only implement the Cpu variant) but preserves the zero-copy DMA-BUF path for later.
- **Pipelined encoding** improves desktop performance too (the compositor doesn't block on encode), so it's not wasted work.
- **Adaptive frame cadence** is a minor addition that enables gaming without disrupting desktop behavior.
- **No premature GPU code**: we don't write VAAPI/NVENC encoders now. We just ensure the interfaces don't preclude them.

## Consequences

- The `FrameSource` enum adds one level of indirection to the encoding path (trivial cost).
- Pipelined encoding adds threading complexity (but improves performance).
- We must resist the temptation to simplify the encoder trait to `fn encode(&[u8]) -> Vec<u8>` -- the DmaBuf variant must stay in the design even if unimplemented.
- Documentation must explain the GPU future path so contributors don't accidentally close the door.