An agent that can only see HTML inside a headless browser can do a narrow slice of real work. Most of what a human does at a computer involves multiple desktop apps, drag-and-drop between them, native dialogs, file pickers, OS-level permission prompts, system settings, and the occasional installer. Remio gives your AI agent a real operating-system desktop — the same one a human would use — with native input, full screen access, and an end-to-end encrypted session. No headless browser. No virtualized container with half the APIs missing. Just a real Mac or PC, driven by your model.
A browser sandbox is fine for an agent that only needs to fill in web forms. The moment work crosses into native apps, the sandbox runs out of road. A real desktop is the cleanest agent runtime — what a human uses, an agent can too.
An agent that lives only inside a headless browser cannot open Final Cut, cannot drag a video file from Finder into a creative tool, cannot sign a PDF in Preview, cannot run a `.dmg` or `.exe` installer, cannot grant another app screen-recording permission, cannot pick a file with the system file picker, and cannot survive an unexpected modal dialog from the OS. These are the moments where most real workflows actually live — and where the headless-browser agent quietly fails.
Virtualized containers (the "agent computer" pattern from the major labs) close some of those gaps, but bring their own. The image is usually a minimal Linux desktop with a handful of apps preinstalled; anything outside that footprint is unreachable. The container has no continuity with the user's actual work — their installed apps, their logged-in accounts, their saved files, their permission grants. Every agent run starts from a blank-slate machine that is not the user's machine.
Remio takes the other path. The agent gets the user's real desktop — the actual Mac or PC the user works on — through the same remote-desktop session a human would use. Every app the user has installed is there. Every account the user is logged in to is there. Every file is there. The agent is not a tenant of a synthetic environment; it is a guest on the user's actual workstation, for as long as the user lets it in.
Agents use the standard Remio data channel for input. The host sees them as native OS input — same path as a human user. No agent-specific privilege escalation, no kernel hooks, no accessibility shims required.
An agent calls `input.send_click(x, y)` and the host machine receives a mouse-click event from the operating system's native input subsystem — the exact same event the user's actual mouse would generate. `input.send_key("cmd+s")` arrives at the focused app as a Cmd+S keyboard shortcut. Scroll, drag, multi-touch gestures, modifier keys, function keys, media keys — all routed through the platform's own input plumbing. There is no separate code path for "agent input" on the host side; the host literally cannot tell the difference, because there is no difference.
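Because the model produces the chord string, it can be worth validating it before it reaches `input.send_key`. The sketch below is one way to do that. `parse_chord` and the modifier vocabulary are assumptions made for illustration; they are not part of the Remio SDK, and the SDK's own chord grammar may differ.

```python
# Hypothetical pre-flight check for a model-proposed chord such as "cmd+s".
# The modifier names are an assumed vocabulary, not Remio's documented set.
MODIFIERS = {"cmd", "ctrl", "alt", "shift"}


def parse_chord(chord: str) -> tuple[frozenset[str], str]:
    """Split a chord into (modifiers, key), rejecting malformed input."""
    parts = [p.strip().lower() for p in chord.split("+")]
    if not all(parts):
        # Catches "", "cmd+", and "cmd++s".
        raise ValueError(f"empty segment in chord: {chord!r}")
    *mods, key = parts
    unknown = [m for m in mods if m not in MODIFIERS]
    if unknown:
        raise ValueError(f"unknown modifier(s): {unknown}")
    return frozenset(mods), key
```

A chord that fails to parse can be sent back to the model as a tool error instead of being passed on to the host.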
This matters more than it sounds. Agent platforms that fake input by injecting events at the application layer (sending synthetic keystrokes into a specific window's event queue, or using accessibility APIs to drive UI elements directly) hit limits constantly: protected fields refuse the input, native shortcuts misfire, drag-and-drop doesn't work between apps, focus management gets confused. Routing through the OS-level input pipeline sidesteps all of that — if a human can do it with a real mouse and keyboard, the agent can do it through Remio.
The privilege boundary is the same too. The host's Accessibility and Input Monitoring permissions, granted to the Remio Host app by the human user once, cover both the human's session and the agent's session through that same app. The agent does not need its own permission grant. The agent does not have one to revoke. Revoking Remio's permissions kicks the agent out at the same time it would kick a human out — one switch, both paths.
The agent can request a full-screen capture at any moment, a region capture, an OCR transcript with bounding boxes, or the macOS or Windows accessibility tree for the focused window. The choice is the agent's: bigger context for vision-language models, smaller and faster for structured-output ones.
A vision-language model like Claude or GPT-4o usually wants pixels. `screen.capture()` returns a full PNG of the host's primary display. `screen.capture(region)` returns just the rectangle the agent asked for — useful when the model already knows roughly where to look and wants to cut prompt cost. The capture comes from the same native capture pipeline that feeds the live human stream, so the image the agent sees is the image the user would see.
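To make the region idea concrete, here is a minimal sketch of how an agent might compute the rectangle it asks for: a fixed-size box centred on a point of interest and clamped to the screen. The `Region` type and `region_around` helper are assumptions for illustration; the text above does not say what shape the `region` argument takes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    # Assumed shape: top-left corner plus size, in screen pixels.
    x: int
    y: int
    width: int
    height: int


def region_around(cx: int, cy: int, size: int,
                  screen_w: int, screen_h: int) -> Region:
    """Box of side `size` centred on (cx, cy), shifted to stay on screen."""
    w = min(size, screen_w)
    h = min(size, screen_h)
    # Shift the box back inside the screen instead of cropping it,
    # so the model always receives a capture of the same dimensions.
    x = min(max(cx - w // 2, 0), screen_w - w)
    y = min(max(cy - h // 2, 0), screen_h - h)
    return Region(x, y, w, h)
```

Keeping the box size constant near the screen edges means the image dimensions the model sees do not change from one step to the next.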
For agents that prefer structured input, `screen.ocr()` returns the on-screen text as a list of strings with bounding boxes, generated on the host using the platform's native OCR (Vision framework on macOS, Windows OCR API on Windows). The agent gets a compact text representation it can reason over without a vision model in the loop — cheaper, faster, and easier to chain into deterministic tools. Bounding boxes mean the agent can turn an OCR hit into a precise `input.send_click(x, y)` without ever sending a single pixel to its model.
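The OCR-to-click step can be sketched in a few lines. The `OcrHit` shape (text plus a left/top/width/height box) and the `click_point` helper are assumptions; the actual structure returned by `screen.ocr()` is not specified above. The sketch also assumes OCR boxes and click coordinates share one pixel space, which may need checking on scaled or HiDPI displays.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class OcrHit:
    # Assumed result shape, not the SDK's documented type.
    text: str
    x: int       # left edge, pixels
    y: int       # top edge, pixels
    width: int
    height: int


def click_point(hits: list[OcrHit], label: str) -> Optional[tuple[int, int]]:
    """Centre of the first hit whose text contains `label`, ignoring case."""
    needle = label.lower()
    for hit in hits:
        if needle in hit.text.lower():
            return hit.x + hit.width // 2, hit.y + hit.height // 2
    return None  # Label not on screen; the caller decides what to do next.
```

The returned pair could then be passed to `input.send_click(x, y)`. Returning `None` rather than raising leaves the choice of retry, scroll, or fallback to the agent's planner.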
For the highest-fidelity structured access, `accessibility.tree()` returns the OS accessibility tree for the focused window — the same tree a screen reader walks. The agent gets a labelled hierarchy of buttons, text fields, menus, and roles, with stable identifiers it can target directly. Most agent workflows mix all three: capture once to orient, OCR to read a long block of text, accessibility tree to interact with a specific control. The agent picks per query — Remio just answers.
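As an illustration of targeting a control through the tree, the sketch below searches a nested structure for nodes with a given role and label. The dictionary layout (`id`, `role`, `label`, `children`) is an assumed serialization chosen for the example; the real output of `accessibility.tree()` may use different keys.

```python
from typing import Iterator


def walk(node: dict) -> Iterator[dict]:
    """Depth-first, pre-order traversal of a nested-dict tree."""
    stack = [node]
    while stack:
        current = stack.pop()
        yield current
        # Push children reversed so they pop in their original order.
        stack.extend(reversed(current.get("children", [])))


def find(tree: dict, role: str, label: str) -> list[dict]:
    """All nodes matching both role and label, in document order."""
    return [n for n in walk(tree)
            if n.get("role") == role and n.get("label") == label]


# Example tree in the assumed layout.
window = {
    "id": "w1", "role": "window", "label": "Untitled", "children": [
        {"id": "tb", "role": "toolbar", "label": "", "children": [
            {"id": "b1", "role": "button", "label": "Save"},
            {"id": "b2", "role": "button", "label": "Open"},
        ]},
        {"id": "t1", "role": "textfield", "label": "Name"},
    ],
}
```

An iterative walk avoids recursion limits on deep trees, which complex native apps can produce.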
Every session is created by a human pairing a 4-digit PIN with the host. Agents inherit the human's session. They cannot start their own. They cannot bypass screen-recording or accessibility permissions. They cannot persist credentials. End a session and the agent is locked out until paired again.
The starting point of every Remio session is a human reading a 4-digit PIN off the host's screen and entering it on the client. There is no token an agent can hold and silently refresh. There is no API key that grants persistent access. The PIN is short-lived, the session it creates is single-use, and the moment the human ends the session, the agent's connection drops. The next agent run needs a new PIN, which means a new human-in-the-loop moment.
Within the session, the agent inherits the host's OS-level permissions, not the user's broader account credentials. If the host has not granted Remio Screen Recording permission for the secondary display, `screen.capture()` on that display returns nothing. If Accessibility permission is off, the input pipeline refuses to deliver events. The agent operates strictly within the envelope the user has granted the Remio app itself; it cannot widen that envelope and it cannot reach around it.
Nothing the agent does during a session persists on the host beyond the host's own file system. Remio does not stash credentials in a keychain for the agent to reuse later. It does not write the session's screen frames or OCR output to disk. When the human ends the session, the agent's working state on the host is whatever files the agent saved to the user's drive during the session — visible to the user, owned by the user, deletable by the user. The agent has no hidden tail.
Remio publishes a thin SDK over the data channel: `input.send_click`, `input.send_key`, `screen.capture`, `screen.ocr`, `accessibility.tree`. Wire it into any agent loop. No vendor lock-in.
The SDK is intentionally small. Five primitives cover the entire interaction surface an agent needs: two for input (click, key), two for vision (capture, OCR), one for structured UI (accessibility tree). Each maps directly onto a data-channel message; there is no clever middleware in between. An agent loop is a `while` loop around those five calls, plus whatever planning your model does in between. If you can call a Python function, you can drive Remio.
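A loop of that shape might look like the sketch below. Everything named here is an assumption made so the example runs on its own: the flat `Desktop` interface stands in for the five primitives, the action dictionaries are an invented format, and `FakeDesktop` replaces a real session. The loop is bounded by `max_steps` rather than written as a bare `while`, so a confused planner cannot run forever.

```python
from typing import Callable, Protocol


class Desktop(Protocol):
    # Hypothetical flat view of the five primitives; not the SDK's real API.
    def send_click(self, x: int, y: int) -> None: ...
    def send_key(self, chord: str) -> None: ...
    def capture(self) -> bytes: ...
    def ocr(self) -> list: ...
    def tree(self) -> dict: ...


def run_agent(desktop: Desktop, plan: Callable[[list], dict],
              max_steps: int = 20) -> int:
    """Observe, ask the planner for one action, act. Returns steps executed."""
    for step in range(max_steps):
        observation = desktop.ocr()   # Cheapest observation; swap as needed.
        action = plan(observation)    # Your model call goes here.
        if action["op"] == "done":
            return step
        if action["op"] == "click":
            desktop.send_click(action["x"], action["y"])
        elif action["op"] == "key":
            desktop.send_key(action["chord"])
        else:
            raise ValueError(f"unknown op: {action['op']!r}")
    return max_steps


class FakeDesktop:
    """In-memory stand-in that records input instead of sending it."""
    def __init__(self) -> None:
        self.log: list[tuple] = []
    def send_click(self, x: int, y: int) -> None:
        self.log.append(("click", x, y))
    def send_key(self, chord: str) -> None:
        self.log.append(("key", chord))
    def capture(self) -> bytes:
        return b""
    def ocr(self) -> list:
        return []
    def tree(self) -> dict:
        return {}


# A scripted planner in place of a model: click, save, stop.
script = iter([
    {"op": "click", "x": 10, "y": 20},
    {"op": "key", "chord": "cmd+s"},
    {"op": "done"},
])
desktop = FakeDesktop()
steps = run_agent(desktop, lambda _obs: next(script))
```

Running the planner against a recording fake like this is also a cheap way to test agent logic without pairing a real host.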
Because the SDK is model-agnostic, you can swap the LLM without changing anything on the Remio side. Today you might wrap the SDK calls in tool definitions for the Anthropic SDK and let Claude plan the workflow. Tomorrow you switch to GPT through the OpenAI SDK and the same tool definitions work. The week after, you try Gemini through the Google Gen AI SDK; same tools again. The day you ship your own model fine-tuned for desktop work, you wire it into the same five primitives and the rest of your stack does not notice.
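One way to keep tool definitions portable is to write them once in a neutral form and convert per provider. The sketch below does this for the Anthropic Messages API tool format and the OpenAI Chat Completions function-tool format. The neutral `TOOLS` list, the descriptions, and the parameter schemas are assumptions for illustration, not definitions shipped by Remio. Both providers restrict tool names to letters, digits, underscores, and dashes, so the dotted primitive names are mapped to underscore form.

```python
# Neutral definitions: name, description, JSON Schema for the arguments.
# Schemas here are assumed from the call signatures shown earlier.
TOOLS = [
    {
        "name": "input.send_click",
        "description": "Click at screen coordinates (x, y).",
        "parameters": {
            "type": "object",
            "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}},
            "required": ["x", "y"],
        },
    },
    {
        "name": "input.send_key",
        "description": "Send a key or chord such as 'cmd+s' to the focused app.",
        "parameters": {
            "type": "object",
            "properties": {"chord": {"type": "string"}},
            "required": ["chord"],
        },
    },
]


def safe_name(name: str) -> str:
    """Dots are not allowed in provider tool names; use underscores."""
    return name.replace(".", "_")


def to_anthropic(tool: dict) -> dict:
    """Shape expected in the `tools` list of an Anthropic Messages request."""
    return {
        "name": safe_name(tool["name"]),
        "description": tool["description"],
        "input_schema": tool["parameters"],
    }


def to_openai(tool: dict) -> dict:
    """Shape expected in the `tools` list of an OpenAI Chat Completions request."""
    return {
        "type": "function",
        "function": {
            "name": safe_name(tool["name"]),
            "description": tool["description"],
            "parameters": tool["parameters"],
        },
    }


anthropic_tools = [to_anthropic(t) for t in TOOLS]
openai_tools = [to_openai(t) for t in TOOLS]
```

When a model returns a tool call, the underscore name has to be mapped back to the matching primitive before dispatch; a small dictionary keyed on `safe_name` output is enough for that.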
The SDK ships as a small client library: no separate sidecar process, no auth flow to manage beyond the PIN pairing, no quota system. The agent loop runs on your machine, on your server, on your edge node, wherever your model already runs. Remio is the desktop runtime; the agent runtime is whatever you already use to call a model.
Remio is the right runtime for some agent workflows and the wrong one for others. Here is the short, honest cut.
If your agent only needs a web browser — scraping pages, filling forms, reading dashboards — use a headless browser. Playwright or Puppeteer driving Chromium is faster, cheaper, more deterministic, and easier to scale than running an entire remote-desktop session for the sake of a browser tab. The headless route owns this category, and Remio is not trying to compete in it.
If your agent needs to spin up dozens of disposable environments — one per task, fresh state each time, parallel execution — use containers. Docker, Firecracker, or one of the AI-lab "agent computer" offerings give you ephemeral Linux desktops you can fan out at scale. Remio is built around a persistent human-paired session, which is the opposite shape: low concurrency, high continuity, real user context.
Remio is the right answer when the agent needs a real OS desktop with native apps, the user's actual installed software, the user's real accounts and files, and a human-in-the-loop session boundary. Accessibility agents driving a desktop for a user with disabilities. AI-assisted QA on real OS apps. Agentic workflows that have to use a specific native tool the user already owns. Any task where "use my actual machine, with everything on it, while I supervise" is the natural framing. For those, Remio is the runtime; for everything else, pick the lighter tool.
Five questions the agent integration tends to raise — honest answers below.
Install Remio on the machine you want the agent to work on, install it on the device the agent runs from, pair the two with a 4-digit PIN, and your agent loop has hands and eyes on a real OS. Five primitives, one encrypted session, no special agent mode on the host.
macOS, iOS, iPadOS, Windows, and Android. Your model, your machine, your data.