WI: Implement visual testing primitives (screenshot + assertion + input injection) #184

New issue

Closed

opened 2026-05-24 05:13:25 +00:00 by toasterson · 0 comments

toasterson commented

2026-05-24 05:13:25 +00:00

Owner

Goal

Implement screenshot capture, vision model verification via Ollama cloud, and input injection (keyboard + mouse) for GUI testing.

Background

Once VMs boot with Sway, we need to capture screenshots, assert visual correctness via a vision-capable LLM, and inject keyboard/mouse events. This WI delivers the three core visual testing primitives.

🔗 Remote Sources (all on code.aopc.cloud)

vm-manager:

https://code.aopc.cloud/CloudNebulaProject/vm-manager/src/branch/main/crates/vm-manager/src/ssh.rs — SSH connection via ssh2 crate, key management
https://code.aopc.cloud/CloudNebulaProject/vm-manager/src/branch/main/crates/vm-manager/src/console.rs — ConsoleTailer for text polling from serial console

Anima workspace:

https://code.aopc.cloud/toasterson/Anima/src/branch/main/crates/anima-qa/src/actions.rs — extend with visual primitives
New module: crates/anima-qa/src/screenshot.rs

Ollama cloud API:

Documentation: https://ollama.com/docs/api (OpenAI-compatible chat completions)
Endpoint: https://ollama.cloud/v1/chat/completions
Vision model: minicpm-v (multimodal, supports image+text)
Auth: API key via OLLAMA_CLOUD_API_KEY env variable
Request format: multimodal message with base64-encoded image in image_url content type

🏗️ Implementation Guide

Step 1: Screenshot capture

SSH into VM, set WAYLAND_DISPLAY=wayland-0, run grim -g "x,y wxh" - → read PNG from stdout → base64 encode

Step 2: Vision assertion

POST to Ollama cloud with multimodal request. Prompt: "Does this screenshot contain: {prompt}? Answer ONLY with YES or NO followed by a brief explanation." Parse YES/NO, extract confidence.

Step 3: Keyboard input

SSH execute WAYLAND_DISPLAY=wayland-0 wtype '{escaped_text}' for text, wtype -k Tab for keys

Step 4: Mouse input

SSH execute WAYLAND_DISPLAY=wayland-0 ydotool mousemove --absolute {x} {y} && ydotool click 0x00

Step 5: Wait for text

Poll SSH command output for expected text. Loop: sleep 500ms, check, timeout after N seconds.

Step 6: Register MCP tools

qa_screenshot, qa_assert_visible, qa_type, qa_click_at, qa_key_combo, qa_wait_for_text

Environment: OLLAMA_CLOUD_API_KEY and QA_VISION_MODEL (default: minicpm-v)

✅ Acceptance Criteria

Screenshot returns valid base64 PNG
Vision model identifies UI elements correctly
Text injection and mouse clicks work
Wait-for-text polls successfully

#181 — MCP server (dependency)
#183 — VM image (provides grim/wtype/ydotool)
#187-189 — test suites (consume these primitives)

## Goal Implement screenshot capture, vision model verification via Ollama cloud, and input injection (keyboard + mouse) for GUI testing. ## Background Once VMs boot with Sway, we need to capture screenshots, assert visual correctness via a vision-capable LLM, and inject keyboard/mouse events. This WI delivers the three core visual testing primitives. ## 🔗 Remote Sources (all on code.aopc.cloud) **vm-manager:** - https://code.aopc.cloud/CloudNebulaProject/vm-manager/src/branch/main/crates/vm-manager/src/ssh.rs — SSH connection via ssh2 crate, key management - https://code.aopc.cloud/CloudNebulaProject/vm-manager/src/branch/main/crates/vm-manager/src/console.rs — ConsoleTailer for text polling from serial console **Anima workspace:** - https://code.aopc.cloud/toasterson/Anima/src/branch/main/crates/anima-qa/src/actions.rs — extend with visual primitives - New module: `crates/anima-qa/src/screenshot.rs` **Ollama cloud API:** - Documentation: https://ollama.com/docs/api (OpenAI-compatible chat completions) - Endpoint: `https://ollama.cloud/v1/chat/completions` - Vision model: `minicpm-v` (multimodal, supports image+text) - Auth: API key via `OLLAMA_CLOUD_API_KEY` env variable - Request format: multimodal message with base64-encoded image in `image_url` content type ## 🏗️ Implementation Guide ### Step 1: Screenshot capture SSH into VM, set WAYLAND_DISPLAY=wayland-0, run `grim -g "x,y wxh" -` → read PNG from stdout → base64 encode ### Step 2: Vision assertion POST to Ollama cloud with multimodal request. Prompt: "Does this screenshot contain: {prompt}? Answer ONLY with YES or NO followed by a brief explanation." Parse YES/NO, extract confidence. ### Step 3: Keyboard input SSH execute `WAYLAND_DISPLAY=wayland-0 wtype '{escaped_text}'` for text, `wtype -k Tab` for keys ### Step 4: Mouse input SSH execute `WAYLAND_DISPLAY=wayland-0 ydotool mousemove --absolute {x} {y} && ydotool click 0x00` ### Step 5: Wait for text Poll SSH command output for expected text. Loop: sleep 500ms, check, timeout after N seconds. ### Step 6: Register MCP tools qa_screenshot, qa_assert_visible, qa_type, qa_click_at, qa_key_combo, qa_wait_for_text Environment: `OLLAMA_CLOUD_API_KEY` and `QA_VISION_MODEL` (default: minicpm-v) ## ✅ Acceptance Criteria - [ ] Screenshot returns valid base64 PNG - [ ] Vision model identifies UI elements correctly - [ ] Text injection and mouse clicks work - [ ] Wait-for-text polls successfully ## Related WIs - #181 — MCP server (dependency) - #183 — VM image (provides grim/wtype/ydotool) - #187-189 — test suites (consume these primitives)

toasterson referenced this issue

2026-05-24 05:22:17 +00:00

WI: Implement anima-qa MCP server (HTTP + JSON-RPC) #181

toasterson referenced this issue

2026-05-24 05:22:18 +00:00

WI: Build Wayland dev VM base image (Debian Bookworm + Wayland + Sway + grim + wtype) #183

toasterson referenced this issue

2026-05-24 05:22:19 +00:00

WI: Implement YAML test spec parser and executor #185

toasterson referenced this issue

2026-05-24 05:22:20 +00:00

WI: NeoCDE shell startup smoke test suite #187