
Designing a Desktop Overlay Platform

Grammarly works everywhere you write. Word, Google Docs, Slack, your browser, native desktop apps. It draws underlines, shows suggestion cards, and runs an AI assistant, all as a transparent overlay on top of whatever application you're using.

Making that work across macOS, Windows, Chrome Extension, Safari Extension, and the web from a single codebase is a harder problem than it looks.

I designed the platform behind it. This is about the architectural decisions and the ideas that make it work.

Why this needed to exist

The product ships on five platforms: macOS, Windows, Chrome Extension, Safari Extension, and the web. For years, each platform had its own UI implementation. Features were built multiple times by different teams. Coverage drifted. A bug fixed on one platform would resurface on another months later.

The company has far more web engineers than native engineers. But the desktop clients could only be worked on by the native teams. Every new feature (an AI assistant, a suggestion panel redesign, a growth experiment) had to be implemented separately on each platform, by people who were already stretched thin.

Two previous attempts at unification picked the wrong level of abstraction. Both worked in controlled conditions, but neither survived the pace of real development.

Server-Driven UI sent serialized UI state from the server to clients. It worked for frozen features, but versioning across multiple active renderer versions was a nightmare, the single-purpose component count ballooned, and the unconventional authoring model created a steep learning curve. Too brittle for rapid iteration.

A platform-agnostic bridge layer let native and web code call each other via RPC. Clean in theory, but the contract kept growing. Every new capability required contributions to each platform and the shared codebase. Building on existing primitives was easy; needing a new one meant touching six codebases. Too costly for rapid iteration.

Both solved for correctness. Neither solved for velocity.

The insight was that not everything needs to be shared. Accessibility APIs on Windows work nothing like accessibility APIs on macOS, which work nothing like the DOM in a browser. Text access is inherently platform-specific. But what you do with that text is the same everywhere. Checking it, showing suggestions, rendering an assistant. That's where all the feature work happens.

So the architecture draws a line. Platform-specific code handles text access and window management. Shared web code (TypeScript, React) handles everything else. The boundary is a protocol, not an abstraction layer. Each side is honest about what it can and can't do.

The constraints

Building a desktop overlay sounds simple. Float a transparent window above the editor. Draw underlines. Show a card when the user hovers.

Then you realize the window needs to be click-through, so mouse events reach the app underneath. But you still need to know when the cursor hovers over your UI. Click-through and pointer-aware at the same time — on the same window.

Then you need multiple windows (suggestions, assistant, settings) sharing state. Then you need all of it working on macOS, Windows, browser extensions, and web from the same codebase.

No single technology solves all of these.

Three layers

The system splits into three execution contexts:

HOST              Tauri (or web host)
                  Native windows, mouse tracking, shortcuts

FOREGROUND        TypeScript, one per window
                  React rendering

BACKGROUND        TypeScript, SharedWorker
                  Plugin host, business logic, state

The host is Tauri on desktop (Electron would work too). It creates native windows, tracks mouse events outside webview bounds, and registers global shortcuts. On web, an iframe manager does the equivalent.

The background is a singleton because business logic (backend connections, authentication, state) must be shared across all windows. A SharedWorker survives individual window closures and prevents duplicate connections.

The foreground is per-window because each UI panel needs its own rendering context. Every window connects to the same background, forming a hub-and-spoke model.

On desktop, a "main window" bootstraps everything. It starts at 0x0 pixels, invisible. It spawns the SharedWorker and establishes the data provider connection. Then it joins the window pool and gets reused as the first visible window. No wasted resources.

This separation is what lets a web engineer ship a feature without touching native code. They work in the foreground and background layers — TypeScript, React, familiar tools. The host layer is someone else's problem.

Execution scopes

If I had to pick the single most important idea in this architecture, it's execution scopes.

An execution scope defines a period during which a task is performed, with access to a set of resources. When the scope ends, everything inside it is cleaned up automatically. Scopes form a tree:

Global Scope (entire application lifetime)
  │
  ├── Connection Scope (per data provider connection)
  │     │
  │     ├── Document Scope (per text field being edited)
  │     │     │
  │     │     └── Interaction Scope (hover → dismiss)
  │     │
  │     └── Document Scope (another document)
  │
  └── Connection Scope (another connection)

The global scope holds authentication and the window manager. A connection scope is created when a text data provider connects. A document scope is created when the user focuses a text field. An interaction scope lives for a single hover-and-card interaction.

Resources are created when you need them. Cleaned up when you don't.
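As a minimal sketch of the idea (Scope, onDispose, and child are illustrative names, not the production API), a hierarchical scope disposes its children first, then its own resources in LIFO order, and detaches from its parent:

```typescript
// Illustrative sketch of a hierarchical execution scope.
type Cleanup = () => void

class Scope {
  private children: Scope[] = []
  private cleanups: Cleanup[] = []
  private disposed = false

  constructor(readonly name: string, private parent?: Scope) {
    parent?.children.push(this)
  }

  // Register a resource's teardown; runs when this scope disposes.
  onDispose(cleanup: Cleanup): void {
    this.cleanups.push(cleanup)
  }

  child(name: string): Scope {
    return new Scope(name, this)
  }

  // Dispose children first, then own resources in LIFO order,
  // then notify (not dispose) the parent by detaching.
  dispose(): void {
    if (this.disposed) return
    this.disposed = true
    for (const c of [...this.children].reverse()) c.dispose()
    for (const cleanup of this.cleanups.reverse()) cleanup()
    if (this.parent) {
      this.parent.children = this.parent.children.filter((c) => c !== this)
    }
  }
}

// Usage: closing a document cascades to its interaction scope.
const order: string[] = []
const globalScope = new Scope('global')
const doc = globalScope.child('document')
const interaction = doc.child('interaction')
doc.onDispose(() => order.push('document cleanup'))
interaction.onDispose(() => order.push('interaction cleanup'))
doc.dispose() // interaction cleanup runs before document cleanup
```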

Why this matters

Without scopes, you'd either create everything at startup (wasteful), manually track each resource's lifecycle (error-prone), or use a flat dependency injection container (no hierarchy, no automatic cleanup).

Scopes tie resource lifecycle to semantic application events. When a user closes a document, the document scope disposes. That cascades: the backend connection closes, the text model is cleaned up, suggestion state is cleared, UI subscriptions are cancelled.

No manual cleanup code. No forgotten teardown.

Document Scope disposes
  ├── Child interaction scopes disposed first
  ├── Backend connection closed
  ├── Text model cleaned up
  ├── UI subscriptions cancelled
  └── Parent scope notified (not disposed)

AsyncDisposableStack handles it in LIFO order. Every resource implements Symbol.dispose — subscriptions, connections, UI elements, scope controllers. JavaScript's explicit resource management proposal (using declarations and DisposableStack) was a perfect fit. Every onStart callback returns a Disposable. Every registration returns a Disposable. The scope collects them all.

This pairs with Result<T> to unify error handling and resource management. No try/catch. No finally blocks. If a registration fails, everything that succeeded before it is rolled back through the same disposal mechanism.
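A minimal sketch of that rollback behavior, using a hand-rolled stack in place of AsyncDisposableStack (which is still landing in runtimes) and a simplified stand-in Result type:

```typescript
// Sketch: registrations either all succeed or roll back in LIFO order.
// Result and Disposable_ are simplified stand-ins, not the production types.
type Result<T> = { ok: true; value: T } | { ok: false; error: Error }
type Disposable_ = { dispose(): void }

function startAll(steps: Array<() => Result<Disposable_>>): Result<Disposable_> {
  const acquired: Disposable_[] = []
  for (const step of steps) {
    const result = step()
    if (!result.ok) {
      // A step failed: dispose everything that succeeded, most recent first.
      for (const d of acquired.reverse()) d.dispose()
      return result
    }
    acquired.push(result.value)
  }
  // All steps succeeded: return one Disposable for the whole batch.
  return {
    ok: true,
    value: { dispose: () => { for (const d of acquired.reverse()) d.dispose() } },
  }
}

// Usage: two successful registrations, then a failure.
const disposalLog: string[] = []
const okStep = (name: string) => (): Result<Disposable_> => ({
  ok: true,
  value: { dispose: () => disposalLog.push(`${name} disposed`) },
})
const failStep = (): Result<Disposable_> => ({ ok: false, error: new Error('boom') })

const outcome = startAll([okStep('a'), okStep('b'), failStep])
// outcome is a failure; 'b' is rolled back before 'a'
```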

The scope is the operating system

Every feature activates within a scope. Every resource is owned by a scope. Cross-window communication happens through scope-bound entities. I started thinking of scopes as the operating system for the plugin layer:

  • Process lifecycle → Scope start/stop
  • Memory management → Disposal cascade
  • Dependency injection → DI container per scope, inheriting from parent
  • IPC → Remote entities over MessagePort
  • Isolation → Scope boundaries

The hierarchy isn't fixed. Any plugin can define new scope types:

// A plugin declares a new scope type
interface ScopeRegistry {
  readonly interaction: unknown
}

This extensibility lets ephemeral UI scopes (a hover interaction lasting milliseconds) coexist with long-lived structural scopes (the global application state) in the same tree.

Plugins

Every feature is a plugin. Roughly 30 in production. Each plugin is an independent package with code that can run in the background and in the foreground, communicating via RPC.

A plugin's activation is remarkably constrained. It can:

  1. Resolve dependencies on other plugins
  2. Register services on scope definitions
  3. Register lifecycle callbacks

That's it. No imperative "do stuff now." Everything is declarative: when this scope starts, here's what I provide.

export const activate = (context) => {
  const documentScope = context.getScopeDefinition('document')

  // Register a service on the scope definition
  documentScope.register({
    token: SuggestionServiceToken,
    useFactory: (resolver) => resolver.resolve(BackendClient).map((client) => new SuggestionService(client)),
  })

  // When a document scope starts, initialize the service
  documentScope.onStart((execution) => {
    const service = execution.resolve(SuggestionServiceToken)
    return service.connect() // returns Disposable
  })
}

The pattern: register a service on a scope definition. When a scope instance starts (a user focuses a text field), the service is created and initialized. When the scope ends (the user moves to another document), the returned Disposable cleans it up. The plugin never manages lifecycle directly. It declares what it provides and when.

This design gives you two things that matter at scale.

Loose coupling. Plugins never import each other's code. Each plugin has a public surface: just tokens and interfaces, no implementation. A plugin declares what it provides and what it consumes. The DI system wires them together at runtime.

A suggestion plugin consumes a user-info token without knowing which plugin provides it. Plugins can be swapped, extended, or disabled without cascading changes. Each plugin ships a test double, so any plugin can be tested in isolation.
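A sketch of how token-based wiring enables that (Token, Container, and UserInfoToken are illustrative names, not the production API): the consumer resolves a token, and a test registers a double under the same token without touching the consumer:

```typescript
// Sketch of token-keyed dependency injection.
interface UserInfo { name(): string }

// A token carries a type but no implementation.
class Token<T> { constructor(readonly description: string) {} }

class Container {
  private providers = new Map<Token<unknown>, () => unknown>()

  register<T>(token: Token<T>, factory: () => T): void {
    this.providers.set(token, factory)
  }

  resolve<T>(token: Token<T>): T {
    const factory = this.providers.get(token)
    if (!factory) throw new Error(`no provider for ${token.description}`)
    return factory() as T
  }
}

// The token lives in the plugin's public surface.
const UserInfoToken = new Token<UserInfo>('user-info')

// Test wiring: a double, registered under the same token the
// production plugin would use.
const container = new Container()
container.register(UserInfoToken, () => ({ name: () => 'test-user' }))

// The consuming plugin only ever sees the token.
const greeting = `Hello, ${container.resolve(UserInfoToken).name()}`
```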

Automatic hygiene. A plugin can't leak resources even if the author forgets to clean up. The scope owns the lifecycle. When the scope ends, everything the plugin registered is disposed. With 30 plugins from different teams, this is the difference between "works in isolation, breaks in production" and "works by construction."

Each plugin spans two realms, background and foreground, connected through scope-level remote entities. The background holds the data; the foreground renders the UI. They're separate execution contexts with no shared memory, but the scope system keeps them in sync. This is the entanglement story.

The data provider abstraction

A data provider gives the platform read-write access to text on screen. It implements a protocol using whatever platform APIs are available:

  • macOS desktop: Accessibility tree parsing
  • Windows desktop: UI Automation APIs
  • Chrome / Safari: DOM APIs
  • Web: Host page DOM

The protocol is the same everywhere:

documentDidFocus(request)
documentTextDidChange(request)     // OT operations
documentGeometryDidChange(request)
getDocumentText(request)
applyDocumentTextChange(request)   // OT operations
...

Text changes use operational transform operations (insert, delete, retain). The protocol is defined in an IDL and code-generated into multiple languages.
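To make the operation model concrete, here is a minimal sketch of applying a retain/insert/delete list to a string; the op shapes here are illustrative, since the real wire format is IDL-defined:

```typescript
// Sketch: applying operational-transform ops to a document string.
type Op =
  | { retain: number }   // keep the next N characters
  | { insert: string }   // insert text at the current position
  | { delete: number }   // skip (remove) the next N characters

function applyOps(text: string, ops: Op[]): string {
  let out = ''
  let pos = 0
  for (const op of ops) {
    if ('retain' in op) {
      out += text.slice(pos, pos + op.retain)
      pos += op.retain
    } else if ('insert' in op) {
      out += op.insert
    } else {
      pos += op.delete // deleted characters are not copied
    }
  }
  return out + text.slice(pos) // implicit trailing retain
}

// Fixing "teh" → "the": keep 't', delete 'eh', insert 'he'.
const fixed = applyOps('teh cat', [{ retain: 1 }, { delete: 2 }, { insert: 'he' }])
```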

A handful of methods. That's the entire contract between platform-specific code and shared application logic.

A plugin that processes suggestions doesn't know or care whether the text came from a native app via accessibility APIs or from a textarea via the DOM. This is what makes "build once, run everywhere" real — not by hiding the platform, but by making the boundary so small that platform differences don't leak through.

The data provider protocol also drives the scope lifecycle. When a data provider connects, a connection scope is created. When it sends documentDidFocus, a document scope is created. When the connection drops, the connection scope disposes and everything underneath it is cleaned up automatically. Every document scope, every interaction scope, every service.

The protocol is the heartbeat. Scopes are the response.
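A sketch of that mapping (the function names and teardown bookkeeping are illustrative): protocol events create scopes, and a dropped connection cascades teardown in LIFO order:

```typescript
// Sketch: protocol events driving scope lifecycle.
type Teardown = () => void
const documentScopes = new Map<string, Teardown>()
const lifecycleLog: string[] = []

// documentDidFocus creates a document scope (idempotent per document).
function documentDidFocus(documentId: string): void {
  if (documentScopes.has(documentId)) return
  lifecycleLog.push(`open ${documentId}`)
  documentScopes.set(documentId, () => lifecycleLog.push(`close ${documentId}`))
}

// A dropped connection disposes every document scope underneath it,
// most recently opened first.
function connectionDidDrop(): void {
  for (const teardown of [...documentScopes.values()].reverse()) teardown()
  documentScopes.clear()
}

// Usage: two documents focused, then the connection drops.
documentDidFocus('doc-1')
documentDidFocus('doc-2')
connectionDidDrop()
```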

Entanglement

The background runs in a SharedWorker. Each window runs in its own context. They need to share state, but they can't share memory.

Two scopes in separate execution contexts, no shared memory, yet kept in sync. In practice, it's eventually consistent — changes propagate over MessagePort, not instantaneously. But from a plugin author's perspective, the abstraction holds.

It works in two parts.

Structure. When a window opens, the background serializes its scope tree. Just the shape, not the data. The foreground reconstructs the same hierarchy: same scope types, same IDs, same parent-child relationships. This is cheap. New scopes created in the background after the window opens are automatically mirrored.
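A sketch of the shape-only serialization (BackgroundScope and ScopeShape are illustrative types): live services stay in the background; only ids, types, and hierarchy cross the boundary:

```typescript
// Background-side scope: holds live services, which cannot be transferred.
interface BackgroundScope {
  id: string
  type: string
  services: Map<string, unknown>
  children: BackgroundScope[]
}

// What actually crosses the MessagePort: just the tree shape.
interface ScopeShape { id: string; type: string; children: ScopeShape[] }

function toShape(scope: BackgroundScope): ScopeShape {
  return {
    id: scope.id,
    type: scope.type,
    children: scope.children.map(toShape),
  }
}

// Usage: a global → connection → document tree with one live service.
const tree: BackgroundScope = {
  id: 'g1', type: 'global',
  services: new Map([['windowManager', {}]]),
  children: [{
    id: 'c1', type: 'connection', services: new Map(),
    children: [{ id: 'd1', type: 'document', services: new Map(), children: [] }],
  }],
}
const shape = toShape(tree) // same hierarchy, no service data
```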

Access. Services registered in the background are exposed to the foreground as remote objects over MessagePort. The foreground calls methods on what looks like a local object; the calls are proxied to the background and the results come back. Both sides must be operating at the same scope level. You can't resolve a document-scoped service from the global scope.

The interesting part is that windows choose which scopes to entangle:

createWindow({
  name: 'suggestion-card',
  scopes: ['interaction'], // only this scope
})

A suggestion card window entangles one scope. Its ancestors are entangled too (the structure is always consistent), but remote objects are lazy. Nothing is resolved until a plugin asks for it, so the cost is just the tree shape. The assistant window entangles three scopes and actively uses services from all of them. A devtools window entangles zero. Complete isolation from production data.

  • Suggestion card entangles ['interaction']: card state only
  • Text overlay entangles ['document']: document data only
  • Assistant entangles ['global', 'connection', 'document']: everything
  • DevTools entangles []: nothing

Security, performance, and resource management in one mechanism. A lightweight popup initializes fast with minimal memory. A full-featured panel gets the access it needs. Nothing gets more than it asked for.

Click-through transparency

The hardest problem. The overlay must be:

  1. Visually transparent: see the app underneath
  2. Input transparent: clicks pass through
  3. Pointer-aware: know where the mouse is

Requirements 2 and 3 fight each other. If clicks pass through your window, you don't receive mouse events. But you need mouse events to know when the user hovers over an underline.

The core solution is event proxying: intercept mouse events at the OS level and forward them to the overlay window, even though the window itself isn't receiving them natively.

On top of that, we built tracking areas: rectangular regions that plugins register to declare "I care about pointer events here." Tracking areas serve two purposes: they filter which events get proxied (you don't need every mouse move across the entire screen), and they enable fast switching between click-through and opaque mode. When the cursor enters a tracking area, the window becomes interactive for that region. This is how a user can move from hovering over a text decoration (click-through window) to clicking a button on the suggestion card (opaque window) without any visible transition.
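A sketch of the tracking-area decision (Rect and the function names are illustrative): each proxied mouse event is hit-tested against registered regions to decide whether the window should be interactive:

```typescript
// Sketch: tracking areas as registered rectangles plus a hit test.
interface Rect { x: number; y: number; width: number; height: number }

const trackingAreas: Rect[] = []

// Plugins register regions they care about; the return value unregisters.
function registerTrackingArea(rect: Rect): () => void {
  trackingAreas.push(rect)
  return () => {
    const i = trackingAreas.indexOf(rect)
    if (i !== -1) trackingAreas.splice(i, 1)
  }
}

// Called per proxied OS mouse event: should the window become opaque?
function shouldBeInteractive(cursorX: number, cursorY: number): boolean {
  return trackingAreas.some(
    (r) =>
      cursorX >= r.x && cursorX < r.x + r.width &&
      cursorY >= r.y && cursorY < r.y + r.height,
  )
}

// A suggestion-card plugin registers its card's bounds.
const unregister = registerTrackingArea({ x: 100, y: 100, width: 300, height: 120 })
```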

The event interception is platform-specific:

macOS. A Core Graphics event tap intercepts mouse events system-wide on a private thread.

Windows. A low-level mouse hook captures events on a dedicated thread.

Web. The host page listens for pointer events and forwards them to the iframe via postMessage.

We tried and rejected several approaches on Windows before landing on this one. DLL injection into the browser process worked but antivirus software blocked it for a significant percentage of users. RPC-based event forwarding (like the web approach) introduced noticeable latency on lower-end machines.

What changed

The AI assistant was the first real test. Before this platform, it would have been built separately on each native client by the native teams, with the web team building their own version in parallel. Instead, one team built it as a plugin. It's rolling out across all five platforms from a single codebase.

That became the pattern. Feature teams write plugins, register services on scopes, and the platform handles the rest. They don't touch native code. They don't coordinate across platform teams. They don't even need to know whether the text came from an accessibility tree or a DOM element.

Teams that previously couldn't contribute to desktop now own features across all five platforms. The ratio of web engineers to native engineers went from a bottleneck to an advantage.

Roughly a hundred packages. 30 plugins. Five platforms. One codebase.

The ideas that generalize

I think the most useful parts of this work aren't specific to writing tools. They're structural ideas that apply to any application facing similar constraints.

Execution scopes as a hierarchical resource lifecycle primitive. Not flat DI containers, but a tree where disposal cascades and resources are tied to semantic events.

Partial entanglement for per-window access control and performance. Windows declare what they need, and nothing more gets wired up.

Event proxying with tracking areas for pointer awareness on click-through transparent windows. The constraint seems impossible until you move event interception to the OS level and let plugins declare regions of interest.

A protocol boundary between platform-specific data access and shared application logic. Not an abstraction that hides the platform. A boundary that's honest about it.

Any application that needs transparent overlays, multi-window coordination, or cross-platform rendering from a single web codebase faces the same problems. The specific answers may differ, but the shape of the questions is the same.
