Decoupled Deployment
Context: AWS Greengrass v2 · Tauri · Debian 12 · Tinker Board · WebKitGTK 2.40 · X11 · CPU rendering
Target scale: 1000 devices
Status: Architecture proposal — not yet implemented
TL;DR
The current architecture couples the Tauri app and the Greengrass management plane into a single process tree. This works at demo scale and fails at fleet scale. Decoupling means splitting them into two separate Greengrass components with different update cadences, different blast radii, and different lifecycles:
- Supervisor: small, boring, rarely changes. Owns the Greengrass IPC connection. Manages the Tauri app's lifecycle.
- App binary: the Tauri build itself. Changes often. Just exists on disk; the supervisor decides when and how to run it.

A side effect of doing this: the bindgen-over-C++ FFI layer in the current Tauri app disappears entirely. The Tauri app stops needing the AWS IoT SDK at all.
The Problem Being Solved
What "coupled" means today
Build on Tinker Board → S3 → Greengrass component pulls binary → runs Tauri app
The Greengrass component is the app. One artifact, one lifecycle. The thing that manages updates and the thing being updated are the same process tree. A surgeon shouldn't operate on themselves.
The failure mode
Ship a Tauri build with a startup bug:
- Greengrass deploys new component version → stops old Tauri → starts new Tauri
- New Tauri crashes immediately
- Greengrass marks component as BROKEN
- Depending on deployment policy, Greengrass may roll back, or it may not, especially if the crash is slow (starts, runs 30s, then dies)
- Now you have a device showing a black X11 screen
- Your only remote lever is the same Greengrass deployment system that just pushed the broken thing

If the bug corrupts a config file the component reads at startup, or exhausts memory and wedges X, you may not even be able to push a fix because the device is too unhealthy to complete a deployment.
| Device count | What this looks like |
|---|---|
| 10 devices | SSH in, fix it manually |
| 100 devices | Bad afternoon |
| 1000 devices | Incident. The kind that wakes people up. |
Secondary failure: the Tauri app holds the IPC connection
Because the Tauri app is the Greengrass component, it owns the EventStream RPC socket to the nucleus. Every Tauri restart:
- Tears down the EventStream RPC connection
- Re-authenticates with the nucleus (SVCUID token exchange)
- Re-subscribes to all IPC topics
- Loses in-flight messages that hadn't been acked

On a Tinker Board with cold WebKit cache, restart is 5–15 seconds. During that window the device is invisible to the fleet: no telemetry, no command reception, no "I'm rebooting, don't panic." Multiply by 1000 devices on a bad app version and CloudWatch shows a fleet-wide outage that's actually just the UI restarting.
Tertiary failure: the security model is awkward
Greengrass IPC authorization is per-component. The nucleus checks "is component X allowed to publish to topic Y?" based on the component's recipe. If the Tauri app talks to IPC, what component is it authenticating as? Three usual workarounds, all bad:
- Run Tauri as a Greengrass component → fights the decoupling
- Hardcode an SVCUID → security problem; these are meant to be per-process and rotated
- Run Tauri as ggc_user → GUI runs as the Greengrass system user; nobody wants to audit that privilege model

None are clean. That itself is a signal the architecture is fighting the platform.
The Decoupled Architecture
Two Greengrass components, different lifecycles
**Component A:** app-supervisor, rarely changes (maybe quarterly)
A tiny, boring, battle-tested process. Its only jobs:
- Start the Tauri app as a child process under the X11 session
- Watchdog it (IPC ping, or check PID + heartbeat file)
- Restart on crash, with exponential backoff
- Report health to IoT Core via MQTT (alive, crashed N times, X server up/down, current app version running)
- Listen on an MQTT topic for "switch to app version X" commands
- Read which Tauri binary to run from a local pointer file (e.g., /opt/app/current symlink → /opt/app/versions/1.4.2/)

This is the management plane. It updates rarely. When it does update, it's done with care.
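
The "restart on crash, with exponential backoff" job above could be sketched as follows. This is a minimal illustration, not the supervisor's actual code: the names `backoff_delays` and `RestartPolicy`, the 1 s base, the 60 s cap, and the 300 s "healthy run" threshold are all assumptions chosen for the example.

```python
def backoff_delays(base=1.0, factor=2.0, cap=60.0):
    """Yield restart delays in seconds: base, base*factor, ... capped at cap."""
    delay = base
    while True:
        yield min(delay, cap)
        delay = min(delay * factor, cap)


class RestartPolicy:
    """Tracks consecutive crashes; a long healthy run resets the backoff."""

    def __init__(self, healthy_after=300.0, **backoff_kwargs):
        self.healthy_after = healthy_after      # hypothetical threshold
        self._backoff_kwargs = backoff_kwargs
        self._delays = backoff_delays(**backoff_kwargs)
        self.crash_count = 0

    def on_exit(self, uptime_seconds):
        """Call when the app exits; returns seconds to wait before restarting."""
        if uptime_seconds >= self.healthy_after:
            # The app ran long enough to count as healthy: start over.
            self._delays = backoff_delays(**self._backoff_kwargs)
            self.crash_count = 0
        self.crash_count += 1
        return next(self._delays)
```

The crash counter doubles as the "crashed N times" field in the health report. Resetting after a healthy run keeps one crash per day from slowly ratcheting the delay up to the cap.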
**Component B:** app-binary, changes often (per release)
The actual Tauri build. Its job is just to exist on disk.
- Doesn't run itself
- Downloads to a versioned path like /opt/app/versions/1.4.3/
- Verifies a signature
- Writes a "ready" marker
- Supervisor sees the marker, decides when to switch (atomic symlink flip), restarts the Tauri process pointing at the new path

The deployment of Component B can fail completely (bad download, signature mismatch, half-written files) and the supervisor is still running, still reporting to the cloud, still able to receive "roll back to 1.4.2" commands.
Process model
┌─────────────────────────────────────────────────────────┐
│  Tinker Board (Debian 12)                               │
│                                                         │
│  ┌──────────────────┐         ┌──────────────────────┐  │
│  │ Tauri app        │  JSON   │ Supervisor (Python)  │  │
│  │ (X11 session)    │◄───────►│ - Greengrass IPC     │  │
│  │ - WebKitGTK 2.40 │  Unix   │ - Watchdog Tauri     │  │
│  │ - CPU render     │ socket  │ - Version switching  │  │
│  │ - No IPC code    │         │ - Health to MQTT     │  │
│  └──────────────────┘         └──────────┬───────────┘  │
│                                          │              │
│                                          │ EventStream  │
│                                          │ RPC          │
│                                ┌─────────▼───────────┐  │
│                                │ Greengrass Nucleus  │  │
│                                └─────────┬───────────┘  │
└──────────────────────────────────────────┼──────────────┘
                                           │
                                           │ MQTT/TLS
                                           ▼
                                    ┌─────────────┐
                                    │   AWS IoT   │
                                    │    Core     │
                                    └─────────────┘
On-device file layout
/opt/app/
├── current → versions/1.4.2 # symlink, supervisor reads this
├── versions/
│ ├── 1.4.1/ # kept for rollback
│ ├── 1.4.2/ # running
│ └── 1.4.3/ # downloaded, not yet active
├── ready_markers/
│ └── 1.4.3.ready # signature-verified, ready to switch
└── supervisor/ # Component A, separate lifecycle
/run/app/
└── supervisor.sock # Unix socket: Tauri ↔ Supervisor
Version switch protocol (local, fast, no cloud round-trip)
- Component B finishes downloading → writes /opt/app/ready_markers/1.4.3.ready
- Supervisor sees marker → publishes "ready to switch" health event
- Supervisor receives "switch to 1.4.3" command (or auto-switches per policy)
- Supervisor: SIGTERM to Tauri (graceful), wait, SIGKILL if needed
- Supervisor: ln -sfn versions/1.4.3 current (atomic symlink flip)
- Supervisor: start Tauri from current/
- Supervisor: wait for heartbeat within 60s window
- If heartbeat received → mark switch successful, publish state
- If no heartbeat → flip symlink back to 1.4.2, restart, publish failure

Rollback happens locally, in seconds, without needing a cloud round-trip or a new Greengrass deployment. That's the decoupling that matters.
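
The switch-and-rollback steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the supervisor's real code: `flip_symlink` and `switch_version` are hypothetical names, and the `stop_app`, `start_app`, and `wait_for_heartbeat` callables stand in for process management and heartbeat plumbing the document does not specify. The flip builds a temporary symlink and renames it over `current`, which is one way to get an atomic replace from Python.

```python
import os


def flip_symlink(link_path, target):
    """Point link_path at target by renaming a temporary symlink over it."""
    tmp = f"{link_path}.tmp"
    if os.path.lexists(tmp):
        os.unlink(tmp)              # leftover from an interrupted flip
    os.symlink(target, tmp)
    os.replace(tmp, link_path)      # rename(2): atomic on one filesystem


def switch_version(app_root, new_version, stop_app, start_app,
                   wait_for_heartbeat, timeout=60.0):
    """Switch to new_version; roll back locally if no heartbeat arrives.

    Returns True if the new version came up, False if it was rolled back.
    """
    if not os.path.isdir(os.path.join(app_root, "versions", new_version)):
        raise FileNotFoundError(new_version)
    link = os.path.join(app_root, "current")
    previous = os.readlink(link)    # e.g. "versions/1.4.2"

    stop_app()                      # SIGTERM, wait, SIGKILL if needed
    flip_symlink(link, os.path.join("versions", new_version))
    start_app()
    if wait_for_heartbeat(timeout):
        return True

    # No heartbeat inside the window: restore the previous version.
    stop_app()
    flip_symlink(link, previous)
    start_app()
    return False
```

Passing the process and heartbeat operations in as callables keeps the switch logic testable on a developer machine without X11 or a real Tauri binary.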
Why This Deletes the bindgen Problem
The current Tauri app has this stack:
Tauri app (Rust)
└── thin Rust FFI layer (bindgen)
└── C++ AWS IoT SDK (statically linked on Tinker Board)
└── Unix socket → Greengrass nucleus IPC
In the decoupled architecture, the Tauri app no longer talks to Greengrass. It talks to the supervisor over a trivial local protocol:
Tauri app (Rust)
└── tokio::net::UnixStream + serde_json (~50 lines)
└── /run/app/supervisor.sock
└── Supervisor (Python) handles everything else
What gets deleted:
- ❌ bindgen-generated bindings for C++ headers
- ❌ Hand-written C shim layer (if you have one)
- ❌ Static linking of aws-crt-cpp, aws-c-mqtt, aws-c-io, aws-c-cal, aws-c-common, s2n-tls, libcrypto
- ❌ FFI exception handling (was UB anyway)
- ❌ std::shared_ptr / std::function cross-FFI workarounds
- ❌ Build-time dependency on a C++ toolchain in the Tauri build
- ❌ Architecture-specific linking concerns on armv7

What replaces it:
- ✅ tokio::net::UnixStream
- ✅ serde_json
- ✅ A schema for the local IPC messages

That's the trade. A genuinely brittle layer becomes ~50 lines of standard Rust.
The Supervisor: Implementation Choices
Language: Python
Recommendation: Python with the official awsiotsdk package.
Counter-intuitive on an embedded device, but right here:
- The supervisor is not a hot path. It ticks every few seconds, watches a process, publishes MQTT, listens for commands. Python is more than fast enough.
- The official awsiotsdk Python package is maintained, well-documented, and the Greengrass IPC client is first-class.
- No static linking nightmare. pip install in CI, ship a venv or PyInstaller bundle.
- When the SDK gets a CVE → pip install --upgrade, redeploy a small component. No C++ toolchain rebuild.
- Greengrass itself ships official Python component templates and examples, so there is more working code to start from than for C++.
When to pick C++ instead:
- µs-level latency requirements (this is not that)
- Running on a microcontroller (this is not that)
- Existing C++ codebase the team must maintain
- Hard licensing constraint

For a fleet supervisor on a Tinker Board, Python is strictly less pain.
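
To give a feel for how little code the Python route needs, here is a hedged sketch of the health-report path. It assumes the `ipc_client` argument is a `GreengrassCoreIPCClientV2` from the `awsiotsdk` package (module `awsiot.greengrasscoreipc.clientv2`) and calls its `publish_to_iot_core` operation. The SDK is deliberately not imported, so the sketch runs without it. The helper names, the payload fields, and the `fleet/<thing>/state` topic shape (borrowed from the recipe sketch in this document) are illustrative assumptions.

```python
import json
import time


def build_state_payload(app_version, restart_count, x_server_up,
                        last_error=None, now=None):
    """Serialize the supervisor's health report as a JSON string."""
    return json.dumps({
        "ts": int(time.time() if now is None else now),
        "app_version": app_version,
        "restart_count": restart_count,
        "x_server_up": x_server_up,
        "last_error": last_error,
    })


def publish_state(ipc_client, thing_name, payload):
    """Publish one health report through the Greengrass MQTT proxy.

    ipc_client is assumed to expose publish_to_iot_core(topic_name=...,
    qos=..., payload=...), as GreengrassCoreIPCClientV2 does.
    """
    topic = f"fleet/{thing_name}/state"
    # "1" is assumed to equal the SDK's QOS.AT_LEAST_ONCE constant.
    ipc_client.publish_to_iot_core(topic_name=topic, qos="1", payload=payload)
    return topic
```

On a device the client would be constructed once at supervisor startup and reused, which is what lets the connection outlive Tauri restarts.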
Lifecycle responsibilities
The supervisor owns:
- Tauri process lifecycle: fork/exec, signal handling, restart with exponential backoff
- Watchdog: heartbeat file, IPC ping, "alive but stuck" detection (CPU-pinned event loop)
- Version management: read pointer file, watch for ready markers, perform atomic switch, handle rollback
- Greengrass IPC: persistent EventStream RPC connection to nucleus
- Telemetry: publish device shadow updates (app version, X server status, restart count, last error)
- Command handling: subscribe to MQTT topics for "switch version," "restart app," "reboot device"
- Local protocol server: Unix socket for Tauri to send events / receive commands
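
The watchdog responsibility has to tell "crashed" apart from "alive but stuck". A minimal sketch of that distinction follows. `HeartbeatWatchdog` is a hypothetical name, and the 30 s timeout (three missed heartbeats at a 10 s interval) is an assumption, not a decided value.

```python
import time


class HeartbeatWatchdog:
    """Classifies the app as healthy, stuck, or crashed."""

    def __init__(self, timeout=30.0, clock=time.monotonic):
        self._timeout = timeout
        self._clock = clock          # injectable so tests need not sleep
        self._last_beat = clock()

    def beat(self):
        """Record a heartbeat received from the app."""
        self._last_beat = self._clock()

    def check(self, process_alive):
        """Return 'crashed', 'stuck', or 'healthy'."""
        if not process_alive:
            return "crashed"
        if self._clock() - self._last_beat > self._timeout:
            # PID exists but nothing has arrived: event loop is wedged.
            return "stuck"
        return "healthy"
```

Using a monotonic clock matters on devices whose wall clock can jump when time sync lands after boot.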
What it does not own
- Rendering anything
- Touching the X11 session directly (it spawns Tauri, which handles its own X connection)
- Application-level business logic
- User-facing UI

The supervisor should be boring. If the supervisor needs frequent updates, something has crept into it that doesn't belong.
The Local IPC Protocol (Tauri ↔ Supervisor)
Transport
Unix domain socket at /run/app/supervisor.sock. Permissions set so the X11 session user can read/write.
Wire format
Line-delimited JSON. One JSON object per line. No framing protocol needed.
Direction: Tauri → Supervisor (events)
{"type": "heartbeat", "ts": 1715800000, "version": "1.4.2"}
{"type": "event", "name": "button_pressed", "payload": {"button_id": "start"}}
{"type": "telemetry", "metrics": {"frame_time_ms": 42, "memory_mb": 187}}
{"type": "error", "level": "warn", "message": "websocket disconnected"}
Direction: Supervisor → Tauri (commands)
{"type": "command", "name": "reload_config"}
{"type": "command", "name": "shutdown", "reason": "version_switch"}
{"type": "config_update", "config": {...}}
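
On the supervisor side, reading and writing this wire format needs little more than the standard library. The sketch below is illustrative: `encode_message` and `LineDecoder` are hypothetical names, and the 64 KiB line limit is an assumed safety bound, not part of the protocol as specified.

```python
import json


def encode_message(msg):
    """Serialize one message as a single newline-terminated JSON line."""
    return (json.dumps(msg, separators=(",", ":")) + "\n").encode("utf-8")


class LineDecoder:
    """Buffers raw socket bytes and returns complete, well-formed messages."""

    def __init__(self, max_line=65536):
        self._buf = b""
        self._max_line = max_line    # assumed limit; drops runaway input

    def feed(self, data):
        """Add received bytes; return a list of decoded message dicts."""
        self._buf += data
        messages = []
        while b"\n" in self._buf:
            line, self._buf = self._buf.split(b"\n", 1)
            if not line.strip():
                continue
            try:
                obj = json.loads(line)
            except ValueError:       # malformed JSON or invalid UTF-8
                continue
            if isinstance(obj, dict) and "type" in obj:
                messages.append(obj)
        if len(self._buf) > self._max_line:
            self._buf = b""          # oversized partial line: discard it
        return messages
```

Because `json.dumps` escapes newlines inside strings, one object always occupies exactly one line, which is why no extra framing is needed. Malformed lines are dropped rather than raised so that a buggy app build cannot crash the supervisor.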
Why this protocol is trivially good
- No SDK dependency in the Tauri app. Just tokio::net::UnixStream + serde_json.
- Easy to test. nc -U /run/app/supervisor.sock lets you poke at it manually.
- Portable. If Greengrass ever gets replaced, the Tauri app doesn't change. The supervisor changes.
- Decoupled schemas. Tauri evolves independently of what's on the wire to AWS.
Reconnection behavior
- Tauri reconnects to the socket on disconnect with exponential backoff
- Supervisor accepts connections; tolerates Tauri restarts
- Neither side assumes the other is alive: they reconnect and resume
Greengrass Component Recipes (Sketch)
Component A: com.example.app-supervisor
RecipeFormatVersion: "2020-01-25"
ComponentName: com.example.app-supervisor
ComponentVersion: "1.0.0"
ComponentDescription: "Manages Tauri app lifecycle and Greengrass IPC."
ComponentPublisher: Example Co.
ComponentDependencies:
  aws.greengrass.Nucleus:
    VersionRequirement: ">=2.0.0"
ComponentConfiguration:
  DefaultConfiguration:
    accessControl:
      aws.greengrass.ipc.mqttproxy:
        com.example.app-supervisor:pubsub:1:
          policyDescription: "Publish device state, subscribe to commands"
          operations:
            - aws.greengrass#PublishToIoTCore
            - aws.greengrass#SubscribeToIoTCore
          resources:
            - "fleet/+/state"
            - "fleet/+/command"
Manifests:
  - Platform:
      os: linux
      architecture: arm # or aarch64 for RK3399
    Lifecycle:
      Install:
        # Writing under /opt and /run needs root
        RequiresPrivilege: true
        Script: |
          # /run is tmpfs: /run/app must also be recreated after a reboot
          mkdir -p /opt/app/versions /opt/app/ready_markers /opt/app/supervisor /run/app
          # Unpack manually: the recipe's Unarchive option accepts only NONE or ZIP.
          # Assumes the tarball has bin/supervisor at its root.
          tar -xzf {artifacts:path}/supervisor.tar.gz -C /opt/app/supervisor
          chmod 755 /opt/app
      Run:
        Script: |
          exec /opt/app/supervisor/bin/supervisor --socket /run/app/supervisor.sock
    Artifacts:
      - URI: s3://my-bucket/supervisor/1.0.0/supervisor.tar.gz
        Unarchive: NONE
        Permission:
          Read: ALL
          Execute: OWNER
Component B: com.example.app-binary
RecipeFormatVersion: "2020-01-25"
ComponentName: com.example.app-binary
ComponentVersion: "1.4.3"
ComponentDescription: "Tauri app binary. Installed on disk; supervisor decides when to run."
ComponentPublisher: Example Co.
ComponentDependencies:
  com.example.app-supervisor:
    VersionRequirement: ">=1.0.0"
Manifests:
  - Platform:
      os: linux
      architecture: arm
    Lifecycle:
      Install:
        # Writing under /opt needs root
        RequiresPrivilege: true
        Script: |
          # Abort before the ready marker is written if any step fails
          set -e
          VERSION="1.4.3"
          TARGET="/opt/app/versions/${VERSION}"
          mkdir -p "${TARGET}"
          # Unpack manually: the recipe's Unarchive option accepts only NONE or ZIP
          tar -xzf {artifacts:path}/app.tar.gz -C "${TARGET}"
          # Verify signature here; sha256sum below only checks integrity
          # against the manifest bundled in the archive
          (cd "${TARGET}" && sha256sum -c SHA256SUMS)
          # Write ready marker; supervisor watches this directory
          touch "/opt/app/ready_markers/${VERSION}.ready"
      # No Run lifecycle. This component does not run anything.
    Artifacts:
      - URI: s3://my-bucket/app/1.4.3/app.tar.gz
        Unarchive: NONE
Note: Component B has no Run section. That's the point. It installs files and exits.
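
If the install step were written in Python rather than shell, the "verify first, then mark ready" ordering could look like the sketch below. It is an illustration only: `verify_sha256sums` and `mark_ready` are hypothetical helpers, and the `SHA256SUMS` layout is assumed to follow the `sha256sum` output format. A checksum manifest shipped inside the same archive detects corruption, not tampering, so a real signature check would still be needed on top.

```python
import hashlib
import os


def verify_sha256sums(version_dir, sums_name="SHA256SUMS"):
    """Return True only if every file listed in the manifest matches."""
    try:
        with open(os.path.join(version_dir, sums_name), encoding="utf-8") as f:
            lines = [line for line in f if line.strip()]
    except OSError:
        return False
    if not lines:
        return False
    for line in lines:
        parts = line.split(None, 1)          # "<hex digest>  <file name>"
        if len(parts) != 2:
            return False
        expected, name = parts[0].lower(), parts[1].strip().lstrip("*")
        path = os.path.join(version_dir, name)
        if not os.path.isfile(path):
            return False
        digest = hashlib.sha256()
        with open(path, "rb") as blob:
            for chunk in iter(lambda: blob.read(65536), b""):
                digest.update(chunk)
        if digest.hexdigest() != expected:
            return False
    return True


def mark_ready(app_root, version):
    """Write the ready marker only after verification succeeds."""
    if not verify_sha256sums(os.path.join(app_root, "versions", version)):
        return False
    marker_dir = os.path.join(app_root, "ready_markers")
    os.makedirs(marker_dir, exist_ok=True)
    marker = os.path.join(marker_dir, f"{version}.ready")
    tmp = marker + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        f.write(version + "\n")
    os.replace(tmp, marker)   # marker appears atomically, never half-written
    return True
```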
Two Ways to Ship Component B
Both work; pick based on how much control you want.
Option 1: Greengrass component as downloader (recommended)
- The component artifact is small: a script that pulls the real binary from S3, verifies signature, drops it in /opt/app/versions/<v>/, writes the ready marker
- Greengrass handles thing-group targeting, rollout rings, and reporting
- Most teams pick this
Option 2: Tauri's built-in updater pointed at signed S3/CloudFront manifest
- Bypasses Greengrass entirely for app updates
- More flexible, but now you have two fleet management systems
- You'll have to build your own rollout-ring logic
- Only worth it if you need update behavior Greengrass can't express

For 1000 devices on Greengrass already, Option 1. Don't introduce a second control plane.
What This Unlocks
Direct consequences of decoupling:
| Benefit | Mechanism |
|---|---|
| Local rollback in seconds | Supervisor flips symlink without cloud round-trip |
| Management plane survives bad app deploys | Supervisor is a different process, different component |
| Tauri restarts don't blink the IPC connection | Supervisor holds the connection persistently |
| bindgen / C++ FFI deleted | Tauri app no longer talks to AWS SDK |
| Independent update cadences | Supervisor updates quarterly, app updates weekly |
| Cleaner security model | Supervisor is the authenticated component; Tauri is a local client |
| Watchdog for WebKit-stuck event loop | Supervisor (separate process) can see "alive but no heartbeat" |
| CPU budget separation | Supervisor can be nice'd / pinned to a different core than rendering |
| Reduced build complexity | Tauri build becomes a normal Rust build, no C++ toolchain |
Migration Path
Assume the current architecture is working in demo. The migration is incremental.
Phase 1: Build the supervisor in parallel (no production impact)
- Implement supervisor in Python
- Connect to Greengrass IPC, publish a heartbeat to a test topic
- Run alongside the existing Tauri app on a dev device
- Validate: connection survives, MQTT round-trip works, no resource issues
Exit criterion: supervisor runs for 72 hours without crashes, connection stays up.
Phase 2: Add the local socket protocol
- Supervisor opens Unix socket, accepts connections
- Modify Tauri app: add a new client that connects to the socket, sends heartbeats
- Do not remove the bindgen layer yet. Both run in parallel.
- Compare: do the new path and old path agree on telemetry?
Exit criterion: Tauri publishes heartbeat via supervisor for 24h, matches direct-IPC telemetry.
Phase 3: Migrate one workflow at a time
- Pick the smallest IPC use case (e.g., one outbound event)
- Migrate it to the new path
- Remove that specific code from the bindgen layer
- Repeat per workflow
Exit criterion: each migrated workflow runs for a week without regression.
Phase 4: Remove the bindgen layer entirely
- Once all workflows are migrated, delete the C++ FFI code from Tauri
- Delete the static linking from the build
- Tauri build is now a pure Rust build
Exit criterion: clean build with no C++ deps, all tests pass.
Phase 5: Split into two Greengrass components
- Up to this point the supervisor may have shipped inside the existing component
- Now: split into com.example.app-supervisor and com.example.app-binary recipes
- Deploy to canary group first
Exit criterion: version switch works end-to-end on canary; rollback works.
Phase 6: Roll out to fleet
- Canary (5 devices) → Early (50) → Production (rest)
- 24h soak between each ring
Total estimated time: 4–6 weeks for a small team, mostly serialized on soak periods rather than coding.
What Can Go Wrong (And How to Catch It)
| Risk | Mitigation |
|---|---|
| Supervisor itself becomes coupled / complex | Keep it ruthlessly small. If it needs frequent updates, something's wrong. |
| Local socket protocol grows into an RPC framework | Resist this. Line-delimited JSON, simple message types. |
| Supervisor crashes and Tauri keeps running with stale data | Tauri must treat socket disconnect as "no Greengrass right now" and degrade gracefully |
| Atomic symlink flip races with Tauri startup | Tauri must read the symlink at startup, not assume a path. Supervisor must SIGTERM before flipping. |
| Ready marker exists but signature was bad | Verify signature before writing the marker, not after |
| Version directories accumulate forever | Supervisor garbage-collects: keep current + N-1, delete the rest |
| Multiple supervisors started somehow | PID file with flock at startup |
| Tauri can't connect to supervisor on boot | Tauri retries with backoff; do not crash on initial socket absence |
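
The garbage-collection mitigation in the table above could be sketched as follows. This is a minimal illustration: `collect_garbage` is a hypothetical name, version directories are assumed to be named with dotted integers such as 1.4.2, and versions newer than the running one are kept on the assumption that they are staged for a future switch.

```python
import os
import shutil


def _version_key(name):
    """Turn '1.4.2' into (1, 4, 2); raises ValueError for other names."""
    return tuple(int(part) for part in name.split("."))


def collect_garbage(app_root, keep_previous=1):
    """Delete old version directories; return the names removed.

    Keeps the running version, anything newer (staged but not active),
    and the keep_previous most recent older versions for rollback.
    """
    versions_dir = os.path.join(app_root, "versions")
    current = os.path.basename(
        os.path.realpath(os.path.join(app_root, "current")))
    try:
        current_key = _version_key(current)
    except ValueError:
        return []                     # unknown state: delete nothing

    older = []
    for name in os.listdir(versions_dir):
        try:
            key = _version_key(name)
        except ValueError:
            continue                  # not a version directory: leave it
        if key < current_key:
            older.append((key, name))

    older.sort(reverse=True)          # newest of the old versions first
    removed = []
    for _, name in older[keep_previous:]:
        shutil.rmtree(os.path.join(versions_dir, name))
        removed.append(name)
    return sorted(removed)
```

Refusing to delete anything when the `current` link cannot be parsed is deliberate: a cleanup job should never be the thing that removes the last working version.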
Open Questions
- Supervisor packaging format: raw Python + venv, or PyInstaller single binary? PyInstaller adds bundle size but simplifies deployment. Probably PyInstaller.
- Heartbeat frequency: every 5s? every 30s? Tradeoff: detection latency vs. log noise. Start with 10s.
- Rollback policy: auto-rollback after N failed heartbeats, or require explicit command from cloud? Recommend auto-rollback for safety, with an audit log to IoT Core.
- N-1 retention: keep 1 previous version, or 3? Disk on Tinker Board is limited. Start with 2 (current + previous).
- Supervisor → Tauri command authorization: does Tauri trust everything from the socket? Probably yes, since the socket is filesystem-permission-protected.
Out of Scope for This Document
These are mentioned for completeness but belong in their own docs:
- CI/cross-compilation pipeline (prerequisite, see main architecture doc)
- Fleet Provisioning by Claim (separate concern)
- Thing Group rollout rings (deployment policy, not architecture)
- Observability and metrics design (post-supervisor)
- Disk wear, time sync, secrets management (production readiness gaps)

Companion document to the main Greengrass + Tauri Fleet Architecture notes. Update as implementation lands.