DOM-Native vs. Screenshot Agents
The embedded agent space is split between two architectural approaches: screenshot-based (computer-using agents, or CUA, driven by vision models) and DOM-native. The choice fundamentally determines your agent's speed, accuracy, cost, and security posture.
How Screenshot Agents Work
- Capture a screenshot of the page
- Send the image to a vision model for analysis
- Receive click coordinates or action descriptions
- Execute the action via a remote browser VM
- Repeat for every step
Latency per action: 2-5 seconds (image capture + model inference + network round-trip)
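To make the cycle concrete, here is a minimal sketch of the capture, infer, execute loop. Every name in it (`ScreenshotBackend`, `captureScreenshot`, `askVisionModel`, `clickAt`) is hypothetical and stands in for whatever a given screenshot agent actually uses. The calls are written synchronously to keep the sketch short; in a real agent each of them would be an asynchronous network call, which is where the per-action latency comes from.

```typescript
// Hypothetical types: they illustrate the loop, not any real agent's API.
interface ClickAction { kind: "click"; x: number; y: number }
interface DoneAction { kind: "done" }
type VisionAction = ClickAction | DoneAction;

interface ScreenshotBackend {
  captureScreenshot(): string;                                // e.g. a base64-encoded PNG
  askVisionModel(image: string, goal: string): VisionAction;  // vision model inference
  clickAt(x: number, y: number): void;                        // runs in the remote browser VM
}

// Repeats capture -> model -> execute until the model reports completion
// or the step budget runs out. Returns the clicks that were executed.
function runScreenshotLoop(
  backend: ScreenshotBackend,
  goal: string,
  maxSteps: number
): ClickAction[] {
  const executed: ClickAction[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const image = backend.captureScreenshot();
    const action = backend.askVisionModel(image, goal);
    if (action.kind === "done") {
      break;
    }
    backend.clickAt(action.x, action.y);
    executed.push(action);
  }
  return executed;
}
```

Note that the model only ever returns pixel coordinates, so every step pays for a fresh screenshot and a fresh vision call.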
How DOM-Native Agents Work
- Parse the live DOM into a semantic tree (Smart DOM Tree)
- Send the compact tree structure to a language model
- Receive precise element selectors and actions
- Execute directly in the user's browser
- Repeat with updated DOM state
Latency per action: < 500ms (no image processing; actions execute locally in the user's browser)
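The first two steps can be sketched as a small tree walk. This is an illustration only, not Rover's Smart DOM Tree implementation: `PageNode` is a hypothetical, simplified stand-in for a DOM element so the code runs outside a browser, and the list of interactive tags and the output format are assumptions.

```typescript
// Hypothetical stand-in for a DOM element (a real agent would walk document.body).
interface PageNode {
  tag: string;
  attrs?: { [key: string]: string };
  text?: string;
  children?: PageNode[];
}

// Assumed set of tags worth describing to the model.
const INTERACTIVE_TAGS = ["a", "button", "input", "select", "textarea"];

// Emits one line per interactive element, e.g. `button#save "Save"`.
// Presentational wrappers produce no line, but their children are still visited.
function toCompactTree(node: PageNode, out: string[] = []): string[] {
  if (INTERACTIVE_TAGS.indexOf(node.tag) !== -1) {
    const attrs = node.attrs || {};
    const id = attrs["id"] ? "#" + attrs["id"] : "";
    const label = attrs["aria-label"] || node.text || attrs["name"] || "";
    out.push(node.tag + id + (label ? ' "' + label + '"' : ""));
  }
  (node.children || []).forEach(function (child) {
    toCompactTree(child, out);
  });
  return out;
}

const examplePage: PageNode = {
  tag: "div",
  children: [
    { tag: "div", attrs: { class: "hero" }, children: [{ tag: "span", text: "Welcome" }] },
    { tag: "input", attrs: { id: "email", name: "email" } },
    { tag: "button", attrs: { id: "save" }, text: "Save" },
  ],
};

// Two short lines describe the page: input#email "email" and button#save "Save"
const compactLines = toCompactTree(examplePage);
```

The model can answer with a selector such as `#save` instead of pixel coordinates, which is why this kind of representation keeps working when the layout or styling changes.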
Head-to-Head Comparison
| Dimension | Screenshot (CUA) | DOM-Native (Rover) |
|---|---|---|
| Speed | 2-5s per action | < 500ms per action |
| Token usage | ~10,000 tokens/screenshot | ~1,000 tokens/DOM tree |
| Runs where | Remote VM | User's browser |
| User data exposure | Screenshots sent to server | Only DOM structure sent |
| Handles CSS changes | Breaks (pixel coords shift) | Resilient (semantic selectors) |
| Cost per session | $0.10-0.50 (VM + vision API) | $0.01-0.05 (text model only) |
| Auth handling | Needs credential injection | Uses existing user session |
| Maintenance | High (visual regressions) | Zero (reads live DOM) |
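The per-action figures compound over a multi-step task. The sketch below multiplies the table's token and latency numbers by a step count; the 10-step task and the assumption of exactly one model call per step are illustrative, and the `StepProfile` and `estimateTask` names are made up for this example.

```typescript
// Per-step figures taken from the comparison table above.
interface StepProfile {
  tokensPerStep: number;
  minSecondsPerStep: number;
  maxSecondsPerStep: number;
}

const screenshotProfile: StepProfile = { tokensPerStep: 10000, minSecondsPerStep: 2, maxSecondsPerStep: 5 };
// "< 500ms" is modelled as an upper bound of 0.5 s with no stated lower bound.
const domNativeProfile: StepProfile = { tokensPerStep: 1000, minSecondsPerStep: 0, maxSecondsPerStep: 0.5 };

interface TaskEstimate {
  tokens: number;
  minSeconds: number;
  maxSeconds: number;
}

// Assumes one model call per step and no retries.
function estimateTask(profile: StepProfile, steps: number): TaskEstimate {
  return {
    tokens: profile.tokensPerStep * steps,
    minSeconds: profile.minSecondsPerStep * steps,
    maxSeconds: profile.maxSecondsPerStep * steps,
  };
}

const screenshotTask = estimateTask(screenshotProfile, 10); // 100,000 tokens, 20 to 50 s
const domNativeTask = estimateTask(domNativeProfile, 10);   // 10,000 tokens, at most 5 s
```

Under these assumptions a 10-step task costs roughly ten times the tokens with screenshots, and the user waits 20 to 50 seconds rather than at most 5.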
Security Implications
Screenshot agents require:
- A remote browser VM per user session
- Credential injection or session replay
- Screenshots of potentially sensitive content leaving the browser
DOM-native agents:
- Run in the user's existing browser tab
- Use the user's existing authenticated session
- Send only semantic DOM structure (not visual content)
- Are subject to the same CORS/CSP rules as your own JavaScript
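One way to picture "structure, not content" is a filter that describes form fields without forwarding what the user typed. The source does not say how Rover decides what to send, so treat this purely as an assumption-laden sketch; `FieldSnapshot`, `OutboundField`, and `describeFields` are hypothetical names.

```typescript
// Hypothetical snapshot of a form field as read from the live DOM.
interface FieldSnapshot {
  tag: string;
  type: string;
  label: string;
  value: string; // what the user typed; never forwarded by this sketch
}

// What would be sent to the model: structure only, plus a "filled in?" flag.
interface OutboundField {
  tag: string;
  type: string;
  label: string;
  hasValue: boolean;
}

// Copies the structural properties and replaces the value with a boolean.
function describeFields(fields: FieldSnapshot[]): OutboundField[] {
  return fields.map(function (field) {
    return {
      tag: field.tag,
      type: field.type,
      label: field.label,
      hasValue: field.value.length > 0,
    };
  });
}

const outboundFields = describeFields([
  { tag: "input", type: "text", label: "Card number", value: "4111 1111 1111 1111" },
  { tag: "input", type: "password", label: "Password", value: "" },
]);
```

A screenshot of the same form would carry the card number as pixels, while this payload only says that a "Card number" field exists and has been filled in.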
When Each Approach Makes Sense
Screenshot agents work well for:
- Cross-site automation where you don't control the target
- Testing/QA scenarios with visual regression needs
- One-off scraping tasks
DOM-native agents excel at:
- First-party embedded experiences
- High-frequency user interactions
- Privacy-sensitive environments
- Production deployments where speed and cost matter
Conclusion
For embedded use cases — where the agent lives on your site and helps your users — DOM-native is the clear winner. It's faster, cheaper, more secure, and requires zero maintenance as your site evolves.
Rover is the first production-ready DOM-native embedded agent. One script tag, no remote infrastructure.
