Decoupled Deployment

Context: AWS Greengrass v2 · Tauri · Debian 12 · Tinker Board · WebKitGTK 2.40 · X11 · CPU rendering
Target scale: 1000 devices
Status: Architecture proposal — not yet implemented

TL;DR

The current architecture couples the Tauri app and the Greengrass management plane into a single process tree. This works at demo scale and fails at fleet scale. Decoupling means splitting them into two separate Greengrass components with different update cadences, different blast radii, and different lifecycles:

Supervisor — small, boring, rarely changes. Owns the Greengrass IPC connection. Manages the Tauri app's lifecycle.
App binary — the Tauri build itself. Changes often. Just exists on disk; supervisor decides when and how to run it.

A side effect of doing this: the bindgen-over-C++ FFI layer in the current Tauri app disappears entirely. The Tauri app stops needing the AWS IoT SDK at all.

The Problem Being Solved

What "coupled" means today

Build on Tinker Board → S3 → Greengrass component pulls binary → runs Tauri app

The Greengrass component is the app. One artifact, one lifecycle. The thing that manages updates and the thing being updated are the same process tree. A surgeon shouldn't operate on themselves.

The failure mode

Ship a Tauri build with a startup bug:

Greengrass deploys new component version → stops old Tauri → starts new Tauri
New Tauri crashes immediately
Greengrass marks component as BROKEN
Depending on deployment policy, Greengrass may roll back — or it may not, especially if the crash is slow (starts, runs 30s, then dies)
Now you have a device showing a black X11 screen
Your only remote lever is the same Greengrass deployment system that just pushed the broken thing.

If the bug corrupts a config file the supervisor reads, or eats GPU memory and wedges X, you may not even be able to push a fix because the device is too unhealthy to complete a deployment.

Device count	What this looks like
10 devices	SSH in, fix it manually
100 devices	Bad afternoon
1000 devices	Incident. The kind that wakes people up.

Secondary failure: the Tauri app holds the IPC connection

Because the Tauri app is the Greengrass component, it owns the EventStream RPC socket to the nucleus. Every Tauri restart:

Tears down the EventStream RPC connection
Re-authenticates with the nucleus (SVCUID token exchange)
Re-subscribes to all IPC topics
Loses in-flight messages that hadn't been acked

On a Tinker Board with cold WebKit cache, restart is 5–15 seconds. During that window the device is invisible to the fleet — no telemetry, no command reception, no "I'm rebooting, don't panic." Multiply by 1000 devices on a bad app version and CloudWatch shows a fleet-wide outage that's actually just the UI restarting.

Tertiary failure: the security model is awkward

Greengrass IPC authorization is per-component. The nucleus checks "is component X allowed to publish to topic Y?" based on the component's recipe. If the Tauri app talks to IPC, what component is it authenticating as? Three usual workarounds, all bad:

Run Tauri as a Greengrass component → fights the decoupling
Hardcode an SVCUID → security problem; these are meant to be per-process and rotated
Run Tauri as ggc_user → GUI runs as the Greengrass system user; nobody wants to audit that privilege model

None are clean. That itself is a signal the architecture is fighting the platform.

The Decoupled Architecture

Two Greengrass components, different lifecycles

**Component A:** app-supervisor — rarely changes (maybe quarterly)

A tiny, boring, battle-tested process. Its only jobs:

Start the Tauri app as a child process under the X11 session
Watchdog it (IPC ping, or check PID + heartbeat file)
Restart on crash, with exponential backoff
Report health to IoT Core via MQTT (alive, crashed N times, X server up/down, current app version running)
Listen on an MQTT topic for "switch to app version X" commands
Read which Tauri binary to run from a local pointer file (e.g., /opt/app/current symlink → /opt/app/versions/1.4.2/)

This is the management plane. It updates rarely. When it does update, it's done with care.
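The "restart on crash, with exponential backoff" job can be sketched roughly as below. This is a minimal illustration, not the proposed supervisor: the function names, the 1s base delay, the 60s cap, and the crash limit are all assumptions, and a real loop would also reset the backoff after the app has run healthily for a while.

```python
import subprocess
import time


def backoff_delays(base=1.0, factor=2.0, cap=60.0):
    """Yield restart delays in seconds: base, base*factor, ... capped at `cap`."""
    delay = base
    while True:
        yield delay
        delay = min(delay * factor, cap)


def supervise(argv, max_crashes=5, sleep=time.sleep):
    """Run argv, restarting it after each non-zero exit. Returns the crash count."""
    delays = backoff_delays()
    crashes = 0
    while True:
        returncode = subprocess.call(argv)  # blocks until the child exits
        if returncode == 0:
            return crashes  # clean exit: stop supervising
        crashes += 1
        if crashes >= max_crashes:
            return crashes  # give up; the caller reports health or rolls back
        sleep(next(delays))  # wait longer after every consecutive crash
```

Passing `sleep` in as a parameter keeps the loop testable without real waiting.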

**Component B:** app-binary — changes often (per release)

The actual Tauri build. Its job is just to exist on disk.

Doesn't run itself
Downloads to a versioned path like /opt/app/versions/1.4.3/
Verifies a signature
Writes a "ready" marker
Supervisor sees the marker, decides when to switch (atomic symlink flip), restarts the Tauri process pointing at the new path

The deployment of Component B can fail completely — bad download, signature mismatch, half-written files — and the supervisor is still running, still reporting to the cloud, still able to receive "roll back to 1.4.2" commands.

Process model

┌─────────────────────────────────────────────────────────┐
│ Tinker Board (Debian 12)                                │
│                                                         │
│  ┌──────────────────┐         ┌──────────────────────┐  │
│  │ Tauri app        │  JSON   │ Supervisor (Python)  │  │
│  │ (X11 session)    │◄───────►│ - Greengrass IPC     │  │
│  │ - WebKitGTK 2.40 │  Unix   │ - Watchdog Tauri     │  │
│  │ - CPU render     │  socket │ - Version switching  │  │
│  │ - No IPC code    │         │ - Health to MQTT     │  │
│  └──────────────────┘         └──────────┬───────────┘  │
│                                          │              │
│                                          │ EventStream  │
│                                          │ RPC          │
│                                ┌─────────▼───────────┐  │
│                                │ Greengrass Nucleus  │  │
│                                └─────────┬───────────┘  │
└──────────────────────────────────────────┼──────────────┘
                                           │
                                           │ MQTT/TLS
                                           ▼
                                    ┌─────────────┐
                                    │  AWS IoT    │
                                    │   Core      │
                                    └─────────────┘

On-device file layout

/opt/app/
  ├── current → versions/1.4.2     # symlink, supervisor reads this
  ├── versions/
  │   ├── 1.4.1/                   # kept for rollback
  │   ├── 1.4.2/                   # running
  │   └── 1.4.3/                   # downloaded, not yet active
  ├── ready_markers/
  │   └── 1.4.3.ready              # signature-verified, ready to switch
  └── supervisor/                  # Component A, separate lifecycle

/run/app/
  └── supervisor.sock              # Unix socket: Tauri ↔ Supervisor

Version switch protocol (local, fast, no cloud round-trip)

Component B finishes downloading → writes /opt/app/ready_markers/1.4.3.ready
Supervisor sees marker → publishes "ready to switch" health event
Supervisor receives "switch to 1.4.3" command (or auto-switches per policy)
Supervisor: SIGTERM to Tauri (graceful), wait, SIGKILL if needed
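Supervisor: in /opt/app, ln -s versions/1.4.3 current.new && mv -T current.new current (atomic symlink flip: the rename replaces the link in one step, whereas ln -sfn removes the old link before creating the new one)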
Supervisor: start Tauri from current/
Supervisor: wait for heartbeat within 60s window
If heartbeat received → mark switch successful, publish state
If no heartbeat → flip symlink back to 1.4.2, restart, publish failure

Rollback happens locally, in seconds, without needing a cloud round-trip or a new Greengrass deployment. That's the decoupling that matters.
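The flip-and-rollback steps could look roughly like the sketch below. One way to make the flip a single atomic step is to create the new link under a temporary name and rename it over `current`. The helper names and the `start_app` callback (which is assumed to start Tauri from `current/` and return True only if a heartbeat arrives within the window) are hypothetical; stopping the old process and publishing state are left to the caller.

```python
import os


def point_current_at(app_root, version):
    """Atomically repoint <app_root>/current at versions/<version>."""
    tmp = os.path.join(app_root, "current.tmp")
    if os.path.lexists(tmp):
        os.remove(tmp)  # leftover from an interrupted switch
    os.symlink(os.path.join("versions", version), tmp)
    # rename(2) swaps the link in one step; readers never see a missing link.
    os.replace(tmp, os.path.join(app_root, "current"))


def switch_version(app_root, new_version, start_app):
    """Switch to new_version; roll back locally if it never heartbeats.

    Returns the version that is active afterwards.
    """
    link = os.readlink(os.path.join(app_root, "current"))
    previous = os.path.basename(os.path.normpath(link))
    point_current_at(app_root, new_version)
    if start_app():
        return new_version
    # No heartbeat in the window: flip back and restart the old build.
    point_current_at(app_root, previous)
    start_app()
    return previous
```

Because the rollback only touches the local symlink, it needs neither the network nor a new deployment.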

Why This Deletes the bindgen Problem

The current Tauri app has this stack:

Tauri app (Rust)
  └── thin Rust FFI layer (bindgen)
       └── C++ AWS IoT SDK (statically linked on Tinker Board)
            └── Unix socket → Greengrass nucleus IPC

In the decoupled architecture, the Tauri app no longer talks to Greengrass. It talks to the supervisor over a trivial local protocol:

Tauri app (Rust)
  └── tokio::net::UnixStream + serde_json (~50 lines)
       └── /run/app/supervisor.sock
            └── Supervisor (Python) handles everything else

What gets deleted:

❌ bindgen-generated bindings for C++ headers
❌ Hand-written C shim layer (if you have one)
❌ Static linking of aws-crt-cpp, aws-c-mqtt, aws-c-io, aws-c-cal, aws-c-common, s2n-tls, libcrypto
❌ FFI exception handling (was UB anyway)
❌ std::shared_ptr / std::function cross-FFI workarounds
❌ Build-time dependency on a C++ toolchain in the Tauri build
❌ Architecture-specific linking concerns on armv7

What replaces it:

✅ tokio::net::UnixStream
✅ serde_json
✅ A schema for the local IPC messages

That's the trade. A genuinely brittle layer becomes ~50 lines of standard Rust.

The Supervisor: Implementation Choices

Language: Python

Recommendation: Python with the official awsiotsdk package. Counter-intuitive on an embedded device, but right here:

The supervisor is not a hot path. It ticks every few seconds, watches a process, publishes MQTT, listens for commands. Python is more than fast enough.
The official awsiotsdk Python package is maintained, well-documented, and the Greengrass IPC client is first-class.
No static linking nightmare. pip install in CI, ship a venv or PyInstaller bundle.
When the SDK gets a CVE → pip install --upgrade, redeploy a small component. No C++ toolchain rebuild.
Greengrass itself ships official Python component templates and examples — more working code than C++.

When to pick C++ instead:

µs-level latency requirements (this is not that)
Running on a microcontroller (this is not that)
Existing C++ codebase the team must maintain
Hard licensing constraint

For a fleet supervisor on a Tinker Board, Python is strictly less pain.
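To give a feel for how little code the Python path needs, here is a hedged sketch of publishing a health report through the IPC client shipped in awsiotsdk. The payload fields and helper names are illustrative assumptions, the topic follows the `fleet/+/state` pattern used in the recipe sketch, and the QoS is passed as the string "1" (at least once), which should be checked against the SDK version in use.

```python
import json
import time


def build_health_payload(app_version, restart_count, x_server_up):
    """Serialize a health report; the field names here are illustrative."""
    return json.dumps({
        "app_version": app_version,
        "restart_count": restart_count,
        "x_server_up": x_server_up,
        "ts": int(time.time()),
    }).encode("utf-8")


def publish_health(ipc_client, thing_name, payload, qos="1"):
    """Publish to fleet/<thing>/state via the nucleus; returns the topic used."""
    topic = f"fleet/{thing_name}/state"
    ipc_client.publish_to_iot_core(topic_name=topic, qos=qos, payload=payload)
    return topic


def connect():
    """Open the IPC connection. Only works inside a running Greengrass component."""
    # Imported lazily so the helpers above can be unit-tested off-device.
    from awsiot.greengrasscoreipc.clientv2 import GreengrassCoreIPCClientV2
    return GreengrassCoreIPCClientV2()
```

On a device, the supervisor would call `connect()` once at startup and reuse the client; Greengrass exposes the thing name to component processes through the `AWS_IOT_THING_NAME` environment variable.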

Lifecycle responsibilities

The supervisor owns:

Tauri process lifecycle — fork/exec, signal handling, restart with exponential backoff
Watchdog — heartbeat file, IPC ping, "alive but stuck" detection (CPU-pinned event loop)
Version management — read pointer file, watch for ready markers, perform atomic switch, handle rollback
Greengrass IPC — persistent EventStream RPC connection to nucleus
Telemetry — publish device shadow updates: app version, X server status, restart count, last error
Command handling — subscribe to MQTT topics for "switch version," "restart app," "reboot device"
Local protocol server — Unix socket for Tauri to send events / receive commands
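The "alive but stuck" case in the watchdog item is the one a plain PID check misses. A minimal sketch of the distinction, assuming a 30s timeout and hypothetical class and state names:

```python
import time
from enum import Enum


class AppState(Enum):
    HEALTHY = "healthy"
    STUCK = "stuck"  # process exists but heartbeats have stopped
    DEAD = "dead"    # process has exited


class HeartbeatWatchdog:
    """Tracks the last heartbeat and classifies the app's state."""

    def __init__(self, timeout_s=30.0, clock=time.monotonic):
        self._timeout_s = timeout_s
        self._clock = clock  # monotonic, so wall-clock jumps cannot fake a stall
        self._last_beat = clock()

    def beat(self):
        """Call whenever a heartbeat message arrives from the app."""
        self._last_beat = self._clock()

    def state(self, process_alive):
        if not process_alive:
            return AppState.DEAD
        if self._clock() - self._last_beat > self._timeout_s:
            return AppState.STUCK
        return AppState.HEALTHY
```

The supervisor would call `state()` on each tick with the result of its own liveness check on the child process, and restart the app on either `STUCK` or `DEAD`.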

What it does not own

Rendering anything
Touching the X11 session directly (it spawns Tauri, which handles its own X connection)
Application-level business logic
User-facing UI

The supervisor should be boring. If the supervisor needs frequent updates, something has crept into it that doesn't belong.

The Local IPC Protocol (Tauri ↔ Supervisor)

Transport

Unix domain socket at /run/app/supervisor.sock. Permissions set so the X11 session user can read/write.

Wire format

Line-delimited JSON. One JSON object per line. No framing protocol needed.

Direction: Tauri → Supervisor (events)

{"type": "heartbeat", "ts": 1715800000, "version": "1.4.2"}
{"type": "event", "name": "button_pressed", "payload": {"button_id": "start"}}
{"type": "telemetry", "metrics": {"frame_time_ms": 42, "memory_mb": 187}}
{"type": "error", "level": "warn", "message": "websocket disconnected"}

Direction: Supervisor → Tauri (commands)

{"type": "command", "name": "reload_config"}
{"type": "command", "name": "shutdown", "reason": "version_switch"}
{"type": "config_update", "config": {...}}
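On the supervisor side, decoding this format mostly means buffering until a newline arrives, since a socket read can end mid-message. A minimal sketch, assuming a hypothetical `LineDecoder` class; a production version would also cap the buffer size:

```python
import json


class LineDecoder:
    """Turns raw socket bytes into parsed messages, one JSON object per line."""

    def __init__(self):
        self._buffer = b""

    def feed(self, data):
        """Add received bytes; return the complete, well-formed messages."""
        self._buffer += data
        # Everything before the last newline is complete; the rest stays buffered.
        *lines, self._buffer = self._buffer.split(b"\n")
        messages = []
        for line in lines:
            if not line.strip():
                continue
            try:
                message = json.loads(line)
            except ValueError:
                continue  # malformed line: drop it, keep the connection
            if isinstance(message, dict) and "type" in message:
                messages.append(message)
        return messages
```

Dropping a bad line rather than closing the connection keeps one malformed message from costing the heartbeat stream.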

Why this protocol is trivially good

No SDK dependency in the Tauri app. Just tokio::net::UnixStream + serde_json.
Easy to test. nc -U /run/app/supervisor.sock lets you poke at it manually.
Portable. If Greengrass ever gets replaced, the Tauri app doesn't change. The supervisor changes.
Decoupled schemas. Tauri evolves independently of what's on the wire to AWS.

Reconnection behavior

Tauri reconnects to the socket on disconnect with exponential backoff
Supervisor accepts connections; tolerates Tauri restarts
Neither side assumes the other is alive — they reconnect and resume
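The supervisor's half of this behavior can be sketched with asyncio: each connection gets its own handler, and a disconnect simply ends that handler while the server keeps listening. The `serve` function and `on_message` callback are hypothetical names, and this sketch only covers the Tauri → Supervisor direction.

```python
import asyncio
import json


async def serve(socket_path, on_message):
    """Listen on a Unix socket and pass each decoded message to on_message."""

    async def handle(reader, writer):
        try:
            while True:
                try:
                    line = await reader.readline()
                except ValueError:
                    break  # line exceeded the stream limit: drop this client
                if not line:
                    break  # peer closed; it is expected to reconnect later
                try:
                    on_message(json.loads(line))
                except ValueError:
                    continue  # malformed line: ignore it, keep the connection
        finally:
            writer.close()

    # The returned server keeps accepting connections across client restarts.
    return await asyncio.start_unix_server(handle, path=socket_path)
```

Nothing in the server tracks whether "the" Tauri process is connected; a restarted app is just a new connection.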

Greengrass Component Recipes (Sketch)

Component A: `com.example.app-supervisor`

RecipeFormatVersion: "2020-01-25"
ComponentName: com.example.app-supervisor
ComponentVersion: "1.0.0"
ComponentDescription: "Manages Tauri app lifecycle and Greengrass IPC."
ComponentPublisher: Example Co.
ComponentDependencies:
  aws.greengrass.Nucleus:
    VersionRequirement: ">=2.0.0"
ComponentConfiguration:
  DefaultConfiguration:
    accessControl:
      aws.greengrass.ipc.mqttproxy:
        com.example.app-supervisor:pubsub:1:
          policyDescription: "Publish device state, subscribe to commands"
          operations:
            - aws.greengrass#PublishToIoTCore
            - aws.greengrass#SubscribeToIoTCore
          resources:
            - "fleet/+/state"
            - "fleet/+/command"
Manifests:
  - Platform:
      os: linux
      architecture: arm  # or aarch64 for RK3399
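    Lifecycle:
      Install:
        # Creating paths under /opt and /run needs root.
        RequiresPrivilege: true
        Script: |
          set -e
          mkdir -p /opt/app/versions /opt/app/ready_markers /opt/app/supervisor /run/app
          chmod 755 /opt/app
          # Ownership must still let the supervisor's run-as user write here.
          # Unarchive supports only NONE and ZIP, so extract the tarball here.
          # Assumes the archive's top level contains bin/supervisor.
          tar -xzf {artifacts:path}/supervisor.tar.gz -C /opt/app/supervisor
      Run:
        Script: |
          exec /opt/app/supervisor/bin/supervisor --socket /run/app/supervisor.sock
    Artifacts:
      - URI: s3://my-bucket/supervisor/1.0.0/supervisor.tar.gz
        Unarchive: NONE
        Permission:
          Read: ALL
          Execute: OWNER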

Component B: `com.example.app-binary`

RecipeFormatVersion: "2020-01-25"
ComponentName: com.example.app-binary
ComponentVersion: "1.4.3"
ComponentDescription: "Tauri app binary. Installed on disk; supervisor decides when to run."
ComponentPublisher: Example Co.
ComponentDependencies:
  com.example.app-supervisor:
    VersionRequirement: ">=1.0.0"
Manifests:
  - Platform:
      os: linux
      architecture: arm
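    Lifecycle:
      Install:
        # Writing under /opt/app needs root unless ownership was handed over.
        RequiresPrivilege: true
        Script: |
          set -e  # abort before the marker is written if any step fails
          VERSION="1.4.3"
          TARGET="/opt/app/versions/${VERSION}"
          mkdir -p "${TARGET}"
          # Unarchive supports only NONE and ZIP, so extract the tarball here.
          tar -xzf {artifacts:path}/app.tar.gz -C "${TARGET}"
          # Integrity check only; verify a real signature here as well.
          # sha256sum -c resolves file names relative to the working directory.
          (cd "${TARGET}" && sha256sum -c SHA256SUMS)
          # Write the ready marker; the supervisor watches this directory
          touch "/opt/app/ready_markers/${VERSION}.ready"
      # No Run lifecycle. This component does not run anything.
    Artifacts:
      - URI: s3://my-bucket/app/1.4.3/app.tar.gz
        Unarchive: NONE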

Note: Component B has no Run section. That's the point. It installs files and exits.
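The ordering that matters in the install step is "verify, then mark". The sketch below shows that ordering in Python under stated assumptions: it checks a `SHA256SUMS` manifest in `sha256sum` format, which proves integrity but not authenticity, so a real signature check would still be needed in front of it. The function names are hypothetical.

```python
import hashlib
import os


def file_sha256(path):
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_sha256sums(version_dir):
    """Check every entry of <version_dir>/SHA256SUMS; False on any problem."""
    checked = 0
    try:
        with open(os.path.join(version_dir, "SHA256SUMS"), encoding="utf-8") as manifest:
            for line in manifest:
                if not line.strip():
                    continue
                expected, name = line.rstrip("\n").split(maxsplit=1)
                name = name.lstrip("*")  # sha256sum marks binary mode with '*'
                if file_sha256(os.path.join(version_dir, name)) != expected.lower():
                    return False
                checked += 1
    except (OSError, ValueError):
        return False  # missing file or malformed manifest line
    return checked > 0  # an empty manifest verifies nothing


def mark_ready(app_root, version):
    """Write ready_markers/<version>.ready only after verification passes."""
    if not verify_sha256sums(os.path.join(app_root, "versions", version)):
        return False
    markers = os.path.join(app_root, "ready_markers")
    os.makedirs(markers, exist_ok=True)
    tmp = os.path.join(markers, f".{version}.tmp")
    with open(tmp, "w", encoding="utf-8") as fh:
        fh.write(version + "\n")
    os.replace(tmp, os.path.join(markers, f"{version}.ready"))  # appears atomically
    return True
```

Writing the marker through a rename means the supervisor never observes a half-written marker file.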

Two Ways to Ship Component B

Both work; pick based on how much control you want.

Option 1: Greengrass component as downloader (recommended)

The component artifact is small — a script that pulls the real binary from S3, verifies signature, drops it in /opt/app/versions/<v>/, writes the ready marker
Greengrass handles thing-group targeting, rollout rings, and reporting
Most teams pick this

Option 2: Tauri's built-in updater pointed at signed S3/CloudFront manifest

Bypasses Greengrass entirely for app updates
More flexible, but now you have two fleet management systems
You'll have to build your own rollout-ring logic
Only worth it if you need update behavior Greengrass can't express

For 1000 devices on Greengrass already, Option 1. Don't introduce a second control plane.

What This Unlocks

Direct consequences of decoupling:

Benefit	Mechanism
Local rollback in seconds	Supervisor flips symlink without cloud round-trip
Management plane survives bad app deploys	Supervisor is a different process, different component
Tauri restarts don't blink the IPC connection	Supervisor holds the connection persistently
bindgen / C++ FFI deleted	Tauri app no longer talks to AWS SDK
Independent update cadences	Supervisor updates quarterly, app updates weekly
Cleaner security model	Supervisor is the authenticated component; Tauri is a local client
Watchdog for WebKit-stuck event loop	Supervisor (separate process) can see "alive but no heartbeat"
CPU budget separation	Supervisor can be `nice`'d / pinned to a different core than rendering
Reduced build complexity	Tauri build becomes a normal Rust build, no C++ toolchain

Migration Path

Assume the current architecture is working in demo. The migration is incremental.

Phase 1: Build the supervisor in parallel (no production impact)

Implement supervisor in Python
Connect to Greengrass IPC, publish a heartbeat to a test topic
Run alongside the existing Tauri app on a dev device
Validate: connection survives, MQTT round-trip works, no resource issues

Exit criterion: supervisor runs for 72 hours without crashes, connection stays up.

Phase 2: Add the local socket protocol

Supervisor opens Unix socket, accepts connections
Modify Tauri app: add a new client that connects to the socket, sends heartbeats
Do not remove the bindgen layer yet. Both run in parallel.
Compare: do the new path and old path agree on telemetry?

Exit criterion: Tauri publishes heartbeat via supervisor for 24h, matches direct-IPC telemetry.

Phase 3: Migrate one workflow at a time

Pick the smallest IPC use case (e.g., one outbound event)
Migrate it to the new path
Remove that specific code from the bindgen layer
Repeat per workflow

Exit criterion: each migrated workflow runs for a week without regression.

Phase 4: Remove the bindgen layer entirely

Once all workflows are migrated, delete the C++ FFI code from Tauri
Delete the static linking from the build
Tauri build is now a pure Rust build

Exit criterion: clean build with no C++ deps, all tests pass.

Phase 5: Split into two Greengrass components

Up to this point the supervisor may have shipped inside the existing component
Now: split into com.example.app-supervisor and com.example.app-binary recipes
Deploy to canary group first

Exit criterion: version switch works end-to-end on canary; rollback works.

Phase 6: Roll out to fleet

Canary (5 devices) → Early (50) → Production (rest)
24h soak between each ring

Total estimated time: 4–6 weeks for a small team, mostly serialized on soak periods rather than coding.

What Can Go Wrong (And How to Catch It)

Risk	Mitigation
Supervisor itself becomes coupled / complex	Keep it ruthlessly small. If it needs frequent updates, something's wrong.
Local socket protocol grows into an RPC framework	Resist this. Line-delimited JSON, simple message types.
Supervisor crashes and Tauri keeps running with stale data	Tauri must treat socket disconnect as "no Greengrass right now" and degrade gracefully
Atomic symlink flip races with Tauri startup	Tauri must read the symlink at startup, not assume a path. Supervisor must SIGTERM before flipping.
Ready marker exists but signature was bad	Verify signature before writing the marker, not after
Version directories accumulate forever	Supervisor garbage-collects: keep current + N-1, delete the rest
Multiple supervisors started somehow	PID file with `flock` at startup
Tauri can't connect to supervisor on boot	Tauri retries with backoff; do not crash on initial socket absence
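The `flock` mitigation for duplicate supervisors is small enough to show in full. A minimal sketch, assuming a hypothetical helper name and lock file path chosen by the caller:

```python
import fcntl
import os


def acquire_single_instance_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if already held.

    The caller must keep the returned file object alive for the lifetime of
    the process; the kernel drops the lock when the file is closed or the
    process dies, so a crashed supervisor never leaves a stale lock behind.
    """
    fh = open(lock_path, "a+")
    try:
        fcntl.flock(fh, fcntl.LOCK_EX | fcntl.LOCK_NB)  # fail instead of waiting
    except BlockingIOError:
        fh.close()
        return None  # another supervisor already holds the lock
    fh.truncate(0)
    fh.write(f"{os.getpid()}\n")  # the PID is informational only
    fh.flush()
    return fh
```

A second supervisor that gets `None` back should log and exit rather than retry.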

Open Questions

Supervisor packaging format: raw Python + venv, or PyInstaller single binary? PyInstaller adds bundle size but simplifies deployment. Probably PyInstaller.
Heartbeat frequency: every 5s? every 30s? Tradeoff: detection latency vs. log noise. Start with 10s.
Rollback policy: auto-rollback after N failed heartbeats, or require explicit command from cloud? Recommend auto-rollback for safety, with an audit log to IoT Core.
N-1 retention: keep 1 previous version, or 3? Disk on Tinker Board is limited. Start with 2 (current + previous).
Supervisor → Tauri command authorization: does Tauri trust everything from the socket? Probably yes, since the socket is filesystem-permission-protected.
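For the retention question, a "current + previous" policy could be sketched as follows. It assumes version directories are named with dotted integers (as in the 1.4.x examples) and treats anything newer than `current` as a pending download that must not be deleted; the function names are hypothetical.

```python
import os
import shutil


def version_key(name):
    """Sort key for dotted numeric versions, so 1.4.10 sorts after 1.4.9."""
    return tuple(int(part) for part in name.split("."))


def collect_garbage(app_root, keep_previous=1):
    """Delete old version directories; return the names that were removed."""
    versions_dir = os.path.join(app_root, "versions")
    link = os.readlink(os.path.join(app_root, "current"))
    current = os.path.basename(os.path.normpath(link))
    installed = sorted(os.listdir(versions_dir), key=version_key)
    older = [v for v in installed if version_key(v) < version_key(current)]
    keep = {current}
    if keep_previous > 0:
        keep.update(older[-keep_previous:])  # nearest older versions, for rollback
    # Newer-than-current directories are downloads awaiting a switch: keep them.
    keep.update(v for v in installed if version_key(v) > version_key(current))
    removed = [v for v in installed if v not in keep]
    for version in removed:
        shutil.rmtree(os.path.join(versions_dir, version))
    return removed
```

Raising `keep_previous` to 3 is then a one-argument change once disk headroom on the Tinker Board has been measured.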

Out of Scope for This Document

These are mentioned for completeness but belong in their own docs:

CI/cross-compilation pipeline (prerequisite, see main architecture doc)
Fleet Provisioning by Claim (separate concern)
Thing Group rollout rings (deployment policy, not architecture)
Observability and metrics design (post-supervisor)
Disk wear, time sync, secrets management (production readiness gaps)

Companion document to the main Greengrass + Tauri Fleet Architecture notes. Update as implementation lands.
