
Health Monitoring & Watchdog

The watchdog is the part of Sanctum that doesn’t sleep. Every ten minutes it checks every service, attempts to fix what’s broken, and tells you about whatever it couldn’t fix. It is, in the most literal sense, a program that watches other programs and judges them. If it ever develops opinions about your uptime, unplug everything.

The watchdog — tireless guardian of twenty-something services

  1. Check health — The watchdog runs the full health test suite against all monitored services.
  2. Evaluate failures — Any failing checks are collected with their severity and diagnostic output.
  3. Auto-fix — If auto_fix is enabled, the watchdog invokes service-doctor --fix for each failing service.
  4. Settle delay — Waits for the configured settle_delay period to allow services to stabilize after repair.
  5. Re-check — Runs the health suite again to confirm whether repairs succeeded.
  6. Notify — Sends notifications for any services that remain in a failed state after the fix attempt.
```
┌──────────────┐     ┌───────────┐      ┌───────────┐
│ Health Check │────→│ Failures? │─No──→│ All Clear │
└──────────────┘     └───────────┘      └───────────┘
                           │ Yes
                           ▼
                    ┌──────────────┐
                    │   service-   │
                    │ doctor --fix │
                    └──────────────┘
                           │
                           ▼
                    ┌──────────────┐
                    │ Settle Delay │
                    └──────────────┘
                           │
                           ▼
                    ┌──────────────┐     ┌───────┐
                    │   Re-check   │─OK─→│ Fixed │
                    └──────────────┘     └───────┘
                           │ Still failing
                           ▼
                    ┌──────────────┐
                    │    Notify    │
                    └──────────────┘
```

Most of the time, the happy path is: check, all clear, go back to sleep. On the less happy days, the watchdog detects a failure, summons the doctor, waits, checks again, and either nods in satisfaction or sends you a message on Signal that your evening is about to get worse.
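The cycle above can be sketched as a single pass in Python. This is a minimal sketch, not Sanctum's actual implementation: `check_all`, `fix_service`, and `notify` are hypothetical stand-ins for the real health suite, service-doctor invocation, and notification channels.

```python
import time

def run_watchdog_cycle(check_all, fix_service, notify, auto_fix=True, settle_delay=30):
    """One watchdog pass: check, fix, settle, re-check, notify.

    check_all() returns {service_name: passed}; fix_service(name) and
    notify(name) are callables. All three are hypothetical stand-ins.
    """
    failures = [name for name, ok in check_all().items() if not ok]
    if not failures:
        return []                      # happy path: all clear, go back to sleep
    if auto_fix:
        for name in failures:
            fix_service(name)          # e.g. service-doctor --fix <name>
        time.sleep(settle_delay)       # let repaired services stabilize
    still_failing = [name for name, ok in check_all().items() if not ok]
    for name in still_failing:
        notify(name)                   # fire all notification channels
    return still_failing
```

Note that only services still failing after the settle delay trigger a notification; a successful repair stays quiet.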

The watchdog runs every 10 minutes via a LaunchAgent:

| Property      | Value                  |
|---------------|------------------------|
| Label         | `com.sanctum.watchdog` |
| StartInterval | 600 (seconds)          |
| RunAtLoad     | true                   |

The first execution occurs at LaunchAgent load time (typically at login or boot), then repeats at the configured interval.
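A LaunchAgent plist with these keys would look roughly like this (the script path is illustrative, not the repo's actual location):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN"
  "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
    <key>Label</key>
    <string>com.sanctum.watchdog</string>
    <key>ProgramArguments</key>
    <array>
        <!-- illustrative path; the real script location isn't shown here -->
        <string>/usr/local/bin/sanctum-watchdog</string>
    </array>
    <key>StartInterval</key>
    <integer>600</integer>
    <key>RunAtLoad</key>
    <true/>
</dict>
</plist>
```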

To avoid flooding notification channels during extended outages — because nothing helps you fix a problem like your phone buzzing every ten minutes about it — the watchdog implements deduplication. When a service fails and a notification is sent, subsequent failures of the same service are suppressed for the duration of the dedup_window.

The dedup state is persisted between invocations in a timestamp file — the watchdog process exits after each run, so it can't hold state in memory. Each service failure is keyed by service name, and the dedup window is evaluated against the last notification time for that key.

First failure of "openclaw-gateway" → Notify ✓
Same failure 5 min later → Suppressed (within 30 min window)
Same failure 35 min later → Notify ✓ (window expired)
Different service fails → Notify ✓ (independent key)
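The four cases above reduce to one check against a per-service timestamp. A minimal sketch, assuming a JSON timestamp file (the actual file format isn't documented here):

```python
import json
import time
from pathlib import Path

def should_notify(service, state_path, now=None, window=1800):
    """Return True if a notification for `service` is due, updating the
    timestamp file. One timestamp per service name (the dedup key)."""
    now = time.time() if now is None else now
    path = Path(state_path)
    state = json.loads(path.read_text()) if path.exists() else {}
    last = state.get(service)
    if last is not None and now - last < window:
        return False                     # suppressed: within dedup window
    state[service] = now                 # record this notification time
    path.write_text(json.dumps(state))
    return True
```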

All watchdog settings live in instance.yaml under the services.watchdog key:

```yaml
services:
  watchdog:
    enabled: true
    settle_delay: 30    # seconds to wait after fix attempt
    auto_fix: true      # enable service-doctor auto-repair
    dedup_window: 1800  # seconds (30 min) to suppress repeat alerts
```

| Setting        | Type    | Default | Description                                   |
|----------------|---------|---------|-----------------------------------------------|
| `enabled`      | boolean | true    | Enable or disable the watchdog entirely       |
| `settle_delay` | number  | 30      | Seconds to wait after repair before re-checking |
| `auto_fix`     | boolean | true    | Whether to invoke service-doctor on failures  |
| `dedup_window` | number  | 1800    | Seconds to suppress duplicate notifications   |

The watchdog delivers alerts through three independent channels. All channels fire in parallel when a notification is triggered. Redundancy is the point — if one channel is down, the others still reach you.

Uses osascript to display a native macOS notification with the service name and failure summary. These appear in Notification Center and are useful when physically at the machine.

Writes alert data to the health API, which the command center dashboard polls. Active alerts appear as a banner at the top of the dashboard with severity coloring and a timestamp.

Sends a message to a configured Signal group using the apple-toolkit skill’s Signal integration. This provides mobile-reachable alerts for critical failures when away from the LAN. Nothing quite like getting a Signal message from your house at midnight telling you the gateway is down.

The service-doctor skill is the repair engine invoked by the watchdog. It knows how to restart LaunchAgents, restart systemd services on the VM via SSH, restart Docker containers, and reset network interfaces. It is a blunt instrument with a narrow range of repair strategies, but most failures in this stack respond to “have you tried turning it off and on again.”

When auto_fix is enabled, the watchdog passes each failing service to service-doctor with the --fix flag. The doctor applies the appropriate repair action based on the service type and returns a success or failure status.
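The handoff to the doctor is a plain subprocess call. A sketch, assuming the service name is passed as a positional argument (only the `--fix` flag itself is documented above):

```python
import subprocess

def attempt_fix(service_name, doctor_cmd="service-doctor"):
    """Invoke service-doctor for one failing service.

    Passing the service name positionally is an assumption about the
    CLI's shape; success/failure is read from the exit code.
    """
    result = subprocess.run(
        [doctor_cmd, "--fix", service_name],
        capture_output=True, text=True, timeout=120,
    )
    return result.returncode == 0
```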

The test suite covers all critical Sanctum subsystems:

| Test                | What It Checks                                 |
|---------------------|------------------------------------------------|
| VM reachable        | SSH connectivity to 10.10.10.10                |
| Bridge100 IP        | bridge100 interface has IP 10.10.10.1          |
| Gateway (Mac)       | Port 1977 responds                             |
| Gateway (VM)        | Port on VM responds via SSH                    |
| Home Assistant      | Port 8123 returns HTTP 200                     |
| Docker running      | Docker daemon is responsive                    |
| Tailscale connected | `tailscale status` succeeds                    |
| Cloudflare tunnel   | Tunnel process is alive                        |
| LM Studio           | Port 1234 responds with model list             |
| XTTS server         | Port 8008 responds                             |
| Firewalla bridge    | Port 1984 responds                             |
| Memory sentinel     | System free memory above critical threshold    |

Tests return structured results with pass/fail status, latency, and diagnostic messages that feed into both the notification system and the dashboard. Twelve tests. Every ten minutes. That’s 1,728 health checks a day. Your house gets more checkups than you do.
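A port-based check with that structured shape can be sketched like this (field names are illustrative, not Sanctum's actual result schema):

```python
import socket
import time

def check_port(name, host, port, timeout=3.0):
    """TCP probe returning pass/fail, latency, and a diagnostic message."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            latency_ms = (time.monotonic() - start) * 1000
            return {"test": name, "pass": True,
                    "latency_ms": round(latency_ms, 1), "detail": "port open"}
    except OSError as exc:
        # Connection refused, timeout, unreachable host, etc.
        return {"test": name, "pass": False,
                "latency_ms": None, "detail": str(exc)}
```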

At 3:39 AM on March 23, 2026, two rogue Python processes leaked 59 GB on the 64 GB Mac Mini. WindowServer hit a watchdog timeout and the entire machine crashed. The service-graph kept checking ports and HTTP endpoints while the system slowly asphyxiated underneath it — because knowing that a service is up doesn’t help when there’s no RAM left to run it.

The memory sentinel exists because of that night. It runs as a pre-flight check before the service-graph, because the service-graph itself spawns Python, threads, and SSH connections — all of which consume memory. Running diagnostics on a memory-starved system is the monitoring equivalent of performing surgery on a patient who is actively on fire.

The sentinel reads kern.memorystatus_level (the kernel’s view of free memory, accounting for compression) and scans per-process RSS via ps. It classifies the system into four levels:

| Level     | System Free | Per-Process RSS | Action                                                 |
|-----------|-------------|-----------------|--------------------------------------------------------|
| OK        | ≥ 20%       | < 8 GB          | Exit 0, all clear                                      |
| Warning   | 15–19%      | 8–12 GB         | Exit 0, log only                                       |
| Critical  | 8–14%       | 12–20 GB        | Exit 1, SIGTERM top offender (5 s grace, then SIGKILL) |
| Emergency | < 8%        | > 20 GB         | Exit 1, immediate SIGKILL                              |
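The classification reduces to a few threshold comparisons. A sketch using the table's numbers; taking the more severe of the two signals (system free vs. per-process RSS) is an assumption about how the sentinel combines them:

```python
def classify_memory(free_pct, top_rss_gb):
    """Map system free % and largest per-process RSS (GB) to a level."""
    if free_pct < 8 or top_rss_gb > 20:
        return "emergency"   # exit 1, immediate SIGKILL
    if free_pct < 15 or top_rss_gb >= 12:
        return "critical"    # exit 1, SIGTERM then SIGKILL after grace
    if free_pct < 20 or top_rss_gb >= 8:
        return "warning"     # exit 0, log only
    return "ok"              # exit 0, all clear
```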

The “top offender” is the highest-RSS process not on the safelist. Safelisted processes are never killed regardless of consumption:

  • kernel_task, WindowServer, launchd — killing these is the crash
  • com.apple.Virtualization.VirtualMachine, QEMULauncher — RSS is guest-mapped memory, not a leak
  • mds_stores, corespotlightd, mediaanalysisd — system indexers that self-regulate

Everything else — MLX server, LM Studio, node processes, Docker, Python agents — is fair game. Managed services get killed and the service-graph handles the restart. Unmanaged processes get killed and get to think about what they did.
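Picking the top offender is then a max-by-RSS over everything not safelisted. A sketch; the `ps` parsing that produces the `(name, rss_gb)` pairs is omitted:

```python
SAFELIST = {
    "kernel_task", "WindowServer", "launchd",
    "com.apple.Virtualization.VirtualMachine", "QEMULauncher",
    "mds_stores", "corespotlightd", "mediaanalysisd",
}

def top_offender(processes, safelist=SAFELIST):
    """Return the (name, rss_gb) pair with the highest RSS that is not
    on the safelist, or None if every candidate is protected."""
    candidates = [(name, rss) for name, rss in processes if name not in safelist]
    return max(candidates, key=lambda p: p[1], default=None)
```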

The sentinel runs as Step 0.5 in the watchdog cycle — after log rotation, before the service-graph:

Watchdog check starting
→ Memory sentinel pre-flight (--json)
→ If emergency/critical: --kill, notify, sleep 3s
→ If warning: log only
→ Step 1: service-graph check-all (existing flow)

If the sentinel kills a process, it sends a critical notification via all channels (macOS, dashboard, Signal) and waits three seconds for the system to reclaim memory before the service-graph runs.

```yaml
services:
  memory_sentinel:
    enabled: true
    thresholds:
      warning_free_pct: 15
      critical_free_pct: 8
      emergency_free_pct: 5
      per_process_warn_gb: 8
      per_process_kill_gb: 12
    auto_kill: true
    kill_grace_seconds: 5
    safelist:
      - "com.apple.Virtualization.VirtualMachine"
      - "QEMULauncher"
      - "WindowServer"
      - "kernel_task"
      - "mds_stores"
      - "corespotlightd"
      - "mediaanalysisd"
      - "launchd"
```

Cilghal (the health agent) can invoke the sentinel through the memory-check.sh tool in the service-doctor skill. This gives agents the ability to check memory pressure and, if authorized, kill runaway processes without SSH access to the host. The tool passes --status, --json, or --kill through to the sentinel script.