WI: Implement visual testing primitives (screenshot + assertion + input injection) #184

Closed
opened 2026-05-24 05:13:25 +00:00 by toasterson · 0 comments
Owner

Goal

Implement screenshot capture, vision model verification via Ollama cloud, and input injection (keyboard + mouse) for GUI testing.

Background

Once VMs boot with Sway, we need to capture screenshots, assert visual correctness via a vision-capable LLM, and inject keyboard/mouse events. This WI delivers the three core visual testing primitives.

🔗 Remote Sources (all on code.aopc.cloud)

vm-manager:

Anima workspace:

Ollama cloud API:

  • Documentation: https://ollama.com/docs/api (OpenAI-compatible chat completions)
  • Endpoint: https://ollama.cloud/v1/chat/completions
  • Vision model: minicpm-v (multimodal, supports image+text)
  • Auth: API key via OLLAMA_CLOUD_API_KEY env variable
  • Request format: multimodal message with base64-encoded image in image_url content type

🏗️ Implementation Guide

Step 1: Screenshot capture

SSH into VM, set WAYLAND_DISPLAY=wayland-0, run grim -g "x,y wxh" - → read PNG from stdout → base64 encode

Step 2: Vision assertion

POST to Ollama cloud with multimodal request. Prompt: "Does this screenshot contain: {prompt}? Answer ONLY with YES or NO followed by a brief explanation." Parse YES/NO, extract confidence.

Step 3: Keyboard input

SSH execute WAYLAND_DISPLAY=wayland-0 wtype '{escaped_text}' for text, wtype -k Tab for keys

Step 4: Mouse input

SSH execute WAYLAND_DISPLAY=wayland-0 ydotool mousemove --absolute {x} {y} && ydotool click 0x00

Step 5: Wait for text

Poll SSH command output for expected text. Loop: sleep 500ms, check, timeout after N seconds.

Step 6: Register MCP tools

qa_screenshot, qa_assert_visible, qa_type, qa_click_at, qa_key_combo, qa_wait_for_text

Environment: OLLAMA_CLOUD_API_KEY and QA_VISION_MODEL (default: minicpm-v)

Acceptance Criteria

  • Screenshot returns valid base64 PNG
  • Vision model identifies UI elements correctly
  • Text injection and mouse clicks work
  • Wait-for-text polls successfully
  • #181 — MCP server (dependency)
  • #183 — VM image (provides grim/wtype/ydotool)
  • #187-189 — test suites (consume these primitives)
## Goal Implement screenshot capture, vision model verification via Ollama cloud, and input injection (keyboard + mouse) for GUI testing. ## Background Once VMs boot with Sway, we need to capture screenshots, assert visual correctness via a vision-capable LLM, and inject keyboard/mouse events. This WI delivers the three core visual testing primitives. ## 🔗 Remote Sources (all on code.aopc.cloud) **vm-manager:** - https://code.aopc.cloud/CloudNebulaProject/vm-manager/src/branch/main/crates/vm-manager/src/ssh.rs — SSH connection via ssh2 crate, key management - https://code.aopc.cloud/CloudNebulaProject/vm-manager/src/branch/main/crates/vm-manager/src/console.rs — ConsoleTailer for text polling from serial console **Anima workspace:** - https://code.aopc.cloud/toasterson/Anima/src/branch/main/crates/anima-qa/src/actions.rs — extend with visual primitives - New module: `crates/anima-qa/src/screenshot.rs` **Ollama cloud API:** - Documentation: https://ollama.com/docs/api (OpenAI-compatible chat completions) - Endpoint: `https://ollama.cloud/v1/chat/completions` - Vision model: `minicpm-v` (multimodal, supports image+text) - Auth: API key via `OLLAMA_CLOUD_API_KEY` env variable - Request format: multimodal message with base64-encoded image in `image_url` content type ## 🏗️ Implementation Guide ### Step 1: Screenshot capture SSH into VM, set WAYLAND_DISPLAY=wayland-0, run `grim -g "x,y wxh" -` → read PNG from stdout → base64 encode ### Step 2: Vision assertion POST to Ollama cloud with multimodal request. Prompt: "Does this screenshot contain: {prompt}? Answer ONLY with YES or NO followed by a brief explanation." Parse YES/NO, extract confidence. ### Step 3: Keyboard input SSH execute `WAYLAND_DISPLAY=wayland-0 wtype '{escaped_text}'` for text, `wtype -k Tab` for keys ### Step 4: Mouse input SSH execute `WAYLAND_DISPLAY=wayland-0 ydotool mousemove --absolute {x} {y} && ydotool click 0x00` ### Step 5: Wait for text Poll SSH command output for expected text. Loop: sleep 500ms, check, timeout after N seconds. ### Step 6: Register MCP tools qa_screenshot, qa_assert_visible, qa_type, qa_click_at, qa_key_combo, qa_wait_for_text Environment: `OLLAMA_CLOUD_API_KEY` and `QA_VISION_MODEL` (default: minicpm-v) ## ✅ Acceptance Criteria - [ ] Screenshot returns valid base64 PNG - [ ] Vision model identifies UI elements correctly - [ ] Text injection and mouse clicks work - [ ] Wait-for-text polls successfully ## Related WIs - #181 — MCP server (dependency) - #183 — VM image (provides grim/wtype/ydotool) - #187-189 — test suites (consume these primitives)
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
toasterson/Anima#184
No description provided.