Checking If a URL Is Up Is a Solved Problem
It isn’t.
Or rather: the first version is solved. Ten lines of code, a fetch(), a status code check, done. You can write that in an afternoon and it will work perfectly — until it doesn’t.
It won’t survive a server restart. It won’t distinguish a transient network blip from a genuine outage. It has no concept of “this monitor is paused for maintenance, don’t alert.” It can’t tell you what was happening at 3 AM two months ago. And when you want to show a status page with 90-day history bars, you’ll discover that storing every check result forever means unbounded storage growth, while throwing away old results means you can’t show history.
These aren’t edge cases. They’re the actual hard parts. The naive setInterval + fetch approach collapses under the weight of the real requirements within weeks of production use.
This post is about the version that doesn’t collapse — the queue architecture that makes checks restartable, the data model that bounds storage growth by design, the incident state machine that eliminates a whole class of sync bugs, and the places where we made pragmatic tradeoffs we’d reconsider with more time.
The System at a Glance
UptimeMonitor is two processes. An API server (Express) that handles all HTTP traffic from the dashboard and status pages. A Worker (BullMQ) that runs every scheduled check, dispatches alerts, and handles SSL expiry logic. They share MongoDB and Redis — nothing else.
The separation is intentional. Check workers do slow, blocking work: TLS handshakes, TCP socket connections, spawning ping subprocesses. None of that belongs in the request path. If a check hangs for 10 seconds because the target host is unresponsive, it should tie up a worker, not an API response that has nothing to do with that monitor.
Running two processes also gives you independent scaling. If you’re running hundreds of monitors, you can add more worker instances without touching the API tier. The queue is the coordination layer; neither process needs to know the other exists.
The Queue Is the Architecture
If you take one thing from this post, it’s this: the queue is not an implementation detail. It is the architecture. Every design decision that makes the system reliable flows from this choice.
Why not cron?
The obvious approach is node-cron or a simple setInterval per monitor. Register a job when a monitor is created, fire every N minutes. This works great until the process restarts.
When the process comes back up, all those in-memory timers are gone. You have to re-register every monitor’s schedule from scratch. Miss one, and that monitor silently stops getting checked. Register one twice, and you start double-checking it.
BullMQ’s repeatable jobs store their schedule state in Redis, not process memory. When the worker restarts, it connects to Redis and the schedules are already there. The queue survives the process; the process is just a consumer.
Four job types
The system uses four queues with distinct semantics:
monitor-checks — the main queue. One repeatable job per non-heartbeat monitor, firing every interval minutes (1, 5, 15, 30, or 60). Job names are check:{monitorId} — more on why that matters in a moment.
alert-dispatch — one-shot jobs, enqueued when a monitor transitions from up to down or back. These have 3 retries with exponential backoff (5s → 25s → 125s), because alerting is the one place where failure is visible to users. The enqueue call is sketched after this list.
ssl-checks — a single daily cron at 2 AM UTC. One job, no per-monitor jobs. The worker iterates all SSL states and checks which thresholds have been crossed.
heartbeat-watchdog — delayed jobs. The monitored service calls us; we set a watchdog timer. Each valid ping removes and re-adds the watchdog job, resetting the clock. If the ping doesn’t arrive in time, the delayed job fires and marks the monitor down.
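To make the retry semantics concrete, here is a rough sketch of the alert-dispatch enqueue. The queue variable and payload fields are assumptions; note that BullMQ’s built-in exponential backoff doubles the delay, so the exact 5s/25s/125s schedule described above would need a custom backoff strategy.

// Hypothetical sketch: enqueue a one-shot alert job when a monitor changes state.
// Queue name and payload shape are assumptions, not the repo's actual code.
await alertDispatchQueue.add(
  'alert',
  { monitorId, transition: 'down' },
  {
    attempts: 3,                                   // alerting failures are visible, so retry
    backoff: { type: 'exponential', delay: 5000 }, // built-in doubling; a custom strategy could give 5s/25s/125s
    removeOnComplete: true,                        // keep Redis tidy once the alert is delivered
  }
)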
The startup reconciliation problem
This is the subtlety that most cron-based approaches miss. On every server restart, before the worker begins consuming jobs, it calls reconcileMonitorJobs():
export async function reconcileMonitorJobs(): Promise<void> {
  const monitors = await Monitor.find({
    visibility: { $ne: 'deleted' },
    status: { $ne: 'paused' },
  })
    .select('_id interval type')
    .lean()

  const nonHeartbeat = monitors.filter((m) => m.type !== 'heartbeat')
  const heartbeats = monitors.filter((m) => m.type === 'heartbeat')

  // Re-add repeatable jobs for all active monitors
  for (const m of nonHeartbeat) {
    await addMonitorJob(m._id.toString(), m.interval)
  }

  // Remove stale jobs for deleted/paused monitors
  const activeCheckIds = new Set(nonHeartbeat.map((m) => m._id.toString()))
  const existingRepeatJobs = await monitorChecksQueue.getRepeatableJobs()
  for (const job of existingRepeatJobs) {
    const monitorId = job.name.startsWith('check:') ? job.name.slice(6) : null
    if (monitorId && !activeCheckIds.has(monitorId)) {
      await monitorChecksQueue.removeRepeatableByKey(job.key)
    }
  }

  // Heartbeat watchdogs and SSL cron
  for (const m of heartbeats) {
    await addHeartbeatWatchdog(m._id.toString(), m.interval)
  }
  await setupSslDailyCron()
}
Without this, any monitor whose repeatable job has gone missing from Redis silently stops being checked. Pausing, for example, removes the job from Redis; if the monitor is later resumed and the re-add never lands, nothing is scheduled for it after a restart, and nobody notices until someone spots the stale lastCheckedAt.
Idempotency by job ID
addMonitorJob calls removeMonitorJob before adding. This makes the operation idempotent — call it twice for the same monitor and you get exactly one repeatable job, not two. The job name check:{monitorId} is what makes this work: you can look up the existing job by name, remove it, and re-add it.
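A minimal sketch of that remove-then-add pattern, assuming the monitorChecksQueue from the reconciliation snippet; the real addMonitorJob may differ in detail.

// Sketch: scheduling is idempotent because we remove any existing job with the same name first.
export async function addMonitorJob(monitorId: string, intervalMinutes: number): Promise<void> {
  const name = `check:${monitorId}`
  // Remove any existing repeatable job with this name so we never double-schedule
  for (const job of await monitorChecksQueue.getRepeatableJobs()) {
    if (job.name === name) {
      await monitorChecksQueue.removeRepeatableByKey(job.key)
    }
  }
  // Re-add; the schedule lives in Redis, so it survives worker restarts
  await monitorChecksQueue.add(
    name,
    { monitorId },
    { repeat: { every: intervalMinutes * 60 * 1000 } }
  )
}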
The same pattern applies to heartbeat watchdogs: jobId: 'watchdog:{monitorId}' means BullMQ deduplicates on insert. If the worker restarts and tries to add a watchdog that’s already queued, the existing timer keeps running.
await heartbeatWatchdogQueue.add(
  'watchdog',
  { monitorId },
  {
    jobId: `watchdog:${monitorId}`,
    delay: intervalMinutes * 60 * 1000,
  }
)
Job ID deduplication is one of BullMQ’s underappreciated features. It turns a potentially dangerous “add on startup” operation into a safe, idempotent one.
The Data Model — What to Store and for How Long
Good data models make the hard queries easy. The decisions here were deliberate.
Monitor visibility: three states, not a boolean
The visibility field is 'visible' | 'hidden' | 'deleted' — not a boolean isDeleted. When a user “deletes” a monitor, we set visibility: 'deleted' and stop scheduling checks. The document stays in the database.
Why? Because check logs and incidents reference the monitor’s _id. If you hard-delete the monitor document, every log and incident pointing to it becomes an orphan. You’d need cascading deletes everywhere, and you’d lose the ability to answer “why did we have 47 alerts in January?” after someone deleted the monitor in February.
The hidden state handles a separate case: monitors the organization wants to keep for internal checking but not show on the public status page. Status page queries filter { visibility: 'visible' }.
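As a sketch, the visibility field and the two queries it enables might look like this; everything beyond the visibility field itself is an assumed name, and other schema fields are omitted.

import { Schema, model } from 'mongoose'

// Sketch of the relevant part of the Monitor schema (other fields omitted)
const monitorSchema = new Schema({
  name: { type: String, required: true },
  visibility: { type: String, enum: ['visible', 'hidden', 'deleted'], default: 'visible' },
})
export const Monitor = model('Monitor', monitorSchema)

// "Deleting" is a status change; check logs and incidents keep pointing at a real document
export async function softDeleteMonitor(monitorId: string): Promise<void> {
  await Monitor.updateOne({ _id: monitorId }, { visibility: 'deleted' })
}

// Public status pages only ever see visible monitors
export async function listPublicMonitors(orgId: string) {
  return Monitor.find({ orgId, visibility: 'visible' }).lean()
}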
CheckLog: 90-day TTL, enforced by the database
Every single check result is stored in CheckLog. Every one. For a monitor checking every minute, that’s 1,440 documents per day. For 20 monitors, that’s 28,800 documents per day, roughly 2.6 million over the 90-day retention window.
That’s fine, because the collection has a TTL index:
checkLogSchema.index(
  { timestamp: 1 },
  { expireAfterSeconds: CHECK_LOG_TTL_DAYS * 24 * 60 * 60 }
)
CHECK_LOG_TTL_DAYS = 90. MongoDB’s background TTL process automatically removes documents older than 90 days. Storage is bounded by design, not by remembering to run a cleanup job. You never need a cron job to purge old logs. You never need to think about it again.
The query index is { monitorId: 1, timestamp: -1 } — time-range queries per monitor, descending. This is the shape of every log history query the API makes.
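For illustration, here is the compound index plus a representative read; the 24-hour window is an arbitrary example, not the API’s actual query.

// Compound index for per-monitor, time-range reads, newest first
checkLogSchema.index({ monitorId: 1, timestamp: -1 })

// A representative history read (window size is illustrative)
async function recentLogs(monitorId: string) {
  return CheckLog.find({
    monitorId,
    timestamp: { $gte: new Date(Date.now() - 24 * 60 * 60 * 1000) },
  })
    .sort({ timestamp: -1 })
    .lean()
}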
Incident as a state machine
An Incident has two fields that matter for state: startedAt (required) and resolvedAt (nullable, defaults to null).
Open incident: { resolvedAt: null }. Closed incident: { resolvedAt: Date }. That’s it.
No status enum. No 'OPEN' / 'CLOSED' / 'ACKNOWLEDGED' strings. The query for all open incidents is { resolvedAt: null }. The query to close an incident is a findOneAndUpdate setting resolvedAt: new Date().
// Open an incident when a monitor goes down
await Incident.create({
  monitorId: monitor._id,
  orgId: monitor.orgId,
  startedAt: new Date(),
  cause: error ?? `HTTP ${statusCode}`,
})

// Close it when the monitor recovers
await Incident.findOneAndUpdate(
  { monitorId: monitor._id, resolvedAt: null },
  { resolvedAt: new Date() },
  { sort: { startedAt: -1 } }
)
This eliminates an entire class of state sync bugs. There’s no enum value to get out of sync with the timestamp. There’s no “incident is CLOSED but resolvedAt is null” edge case. The state is the data.
SslState with alert deduplication
The SslState model tracks certificate expiry per monitor. The interesting field is alertsSent: number[] — an array of thresholds (30, 7, 1 days) for which alerts have already been sent.
The daily SSL job checks each certificate against these thresholds and uses $addToSet to record which thresholds triggered:
await SslState.updateOne(
  { _id: state._id },
  { $addToSet: { alertsSent: { $each: newAlerts } } }
)
$addToSet is MongoDB’s atomic set-union operator. Even if the daily job runs twice (possible in edge cases), it will never add the same threshold twice. No separate deduplication logic needed. The data model enforces the invariant.
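For context, here is a sketch of the threshold check that would feed newAlerts. The expiresAt field, the helper’s shape, and the [30, 7, 1] constant are assumptions based on the description above.

// Assumed thresholds: alert at 30, 7, and 1 days before expiry
const SSL_ALERT_THRESHOLDS = [30, 7, 1]

async function recordSslAlerts(state: { _id: unknown; expiresAt: Date; alertsSent: number[] }) {
  const daysLeft = Math.floor((state.expiresAt.getTime() - Date.now()) / 86_400_000)
  // Thresholds that are crossed and not yet recorded in alertsSent
  const newAlerts = SSL_ALERT_THRESHOLDS.filter(
    (t) => daysLeft <= t && !state.alertsSent.includes(t)
  )
  if (newAlerts.length === 0) return
  // Dispatch one alert per newly crossed threshold, then record them atomically
  await SslState.updateOne(
    { _id: state._id },
    { $addToSet: { alertsSent: { $each: newAlerts } } }
  )
}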
The Check Engine — Five Monitor Types
All five check types share the same entry point (processCheckJob) and produce the same { result, responseTime, statusCode, error } shape. The dispatch is a switch on monitor.type.
HTTP — fetch() with an AbortController timeout. Configurable expected status codes: if none are set, any 2xx/3xx is a pass. If you specify [200, 201], only those codes pass. Custom headers are merged with a User-Agent header, so the check identifies itself to the receiving server. A sketch follows the five type descriptions below.
Keyword — same as HTTP but scans the response body for a string. The keywordPresent boolean controls direction: true means “this string must be present,” false means “this string must be absent.” The absent case is useful for monitoring that an error message isn’t showing up on a page.
Ping — shell exec with platform-specific flags (-W on Linux, -t on macOS). The host is sanitized before interpolation:
if (!/^[a-zA-Z0-9.\-_]+$/.test(host)) {
  return { result: 'down', error: 'Invalid host' }
}
Port — raw TCP socket with a timeout. Connect to host:port; if the connection succeeds, the service is up. Doesn’t validate what the service says — just that something is listening. Useful for Redis, PostgreSQL, SMTP, and other non-HTTP services.
Heartbeat — the inverted model. Instead of us checking the service, the service calls us via a unique token URL (/heartbeat/:token). The BullMQ watchdog acts as a dead-man’s switch: each valid ping removes the existing delayed job and adds a new one, resetting the clock. If the clock reaches zero without a ping, the delayed job fires and the monitor is marked down.
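As promised above, here is a sketch of the HTTP check end to end. The monitor field names and the User-Agent string are assumptions; the return value follows the shared { result, responseTime, statusCode, error } shape.

// Sketch of the HTTP check: fetch with an AbortController timeout, expected-status handling
async function checkHttp(monitor: {
  url: string
  timeoutMs: number
  expectedStatusCodes?: number[]
  headers?: Record<string, string>
}) {
  const controller = new AbortController()
  const timer = setTimeout(() => controller.abort(), monitor.timeoutMs)
  const startedAt = Date.now()
  try {
    const res = await fetch(monitor.url, {
      signal: controller.signal,
      headers: { 'User-Agent': 'UptimeMonitor/1.0', ...monitor.headers },
    })
    // No explicit expectations: any 2xx/3xx passes. Otherwise, only the listed codes pass.
    const ok = monitor.expectedStatusCodes?.length
      ? monitor.expectedStatusCodes.includes(res.status)
      : res.status >= 200 && res.status < 400
    return {
      result: ok ? 'up' : 'down',
      responseTime: Date.now() - startedAt,
      statusCode: res.status,
      error: ok ? null : `Unexpected status ${res.status}`,
    }
  } catch (err) {
    // Timeouts surface here as AbortError; DNS and connection failures as fetch errors
    return { result: 'down', responseTime: Date.now() - startedAt, statusCode: null, error: String(err) }
  } finally {
    clearTimeout(timer)
  }
}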
Maintenance windows: checks still run and results are logged (data continuity is preserved — you’ll still see those bars in the history), but status transitions and alert dispatch are suppressed. When the window expires, a check naturally cleans up the expired window document.
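A sketch of how that guard could sit inside the check-result handler; the MaintenanceWindow model and the applyStatusTransition helper are assumed names, not the repo’s actual API.

// Sketch: always log the result, but suppress transitions and alerts inside a window
const now = new Date()
const inMaintenance = await MaintenanceWindow.exists({
  monitorId: monitor._id,
  startsAt: { $lte: now },
  endsAt: { $gte: now },
})

// The result is always logged, so the history bars keep their data continuity
await CheckLog.create({ monitorId: monitor._id, result, responseTime, statusCode, timestamp: now })

// Status transitions, incidents, and alert dispatch only happen outside a window
if (!inMaintenance) {
  await applyStatusTransition(monitor, result)
}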
The 90-Day History Problem — Two Aggregations in Parallel
The public status page needs two different views of the same CheckLog data:
- 30-day uptime percentage: a simple ratio, up_count / total_count, across all logs in the last 30 days.
- 90-day daily history: an array of 90 day-buckets, each 'up', 'down', or null (no data).
Both queries hit the same collection. They run together in a Promise.all — no sequential round-trips:
const [uptimeAgg, dailyAgg] = await Promise.all([
  CheckLog.aggregate([
    { $match: { monitorId: { $in: monitorIds }, timestamp: { $gte: since30d } } },
    {
      $group: {
        _id: { $toString: '$monitorId' },
        total: { $sum: 1 },
        up: { $sum: { $cond: [{ $eq: ['$result', 'up'] }, 1, 0] } },
      },
    },
  ]),
  CheckLog.aggregate([
    { $match: { monitorId: { $in: monitorIds }, timestamp: { $gte: since90d } } },
    {
      $group: {
        _id: {
          monitorId: { $toString: '$monitorId' },
          date: { $dateToString: { format: '%Y-%m-%d', date: '$timestamp' } },
        },
        hasUp: { $max: { $cond: [{ $eq: ['$result', 'up'] }, 1, 0] } },
        hasDown: { $max: { $cond: [{ $eq: ['$result', 'down'] }, 1, 0] } },
      },
    },
  ]),
])
The daily aggregation deserves explanation. The $group stage produces one document per (monitorId, date) pair. The hasUp and hasDown fields use $max of a $cond — not $sum. This is deliberate.
The question being asked is: “did any check pass today?” That’s a boolean question. $max of 0/1 gives you 1 if any check passed, 0 if none did. $sum would give you a count — and a count is the wrong type for the answer.
The day-bucket resolution logic runs in JavaScript after the aggregation:
// Build Map<monitorId, Map<dateStr, 'up'|'down'>>
const dailyByMonitorId = new Map<string, Map<string, 'up' | 'down'>>()
for (const row of dailyAgg) {
  const { monitorId, date } = row._id
  if (!dailyByMonitorId.has(monitorId)) dailyByMonitorId.set(monitorId, new Map())
  const dayStatus: 'up' | 'down' = row.hasUp === 1 ? 'up' : 'down'
  dailyByMonitorId.get(monitorId)!.set(date, dayStatus)
}

// For each monitor on the page, fill a 90-element array, newest-last
// (monitorId below is the id of the monitor currently being rendered)
const dayMap = dailyByMonitorId.get(monitorId)
const dailyHistory: Array<{ date: string; status: 'up' | 'down' | null }> = []
for (let i = 89; i >= 0; i--) {
  const d = new Date(Date.now() - i * 86_400_000)
  const dateStr = d.toISOString().slice(0, 10)
  dailyHistory.push({ date: dateStr, status: dayMap?.get(dateStr) ?? null })
}
Days with no check data at all — because the monitor didn’t exist yet, or was paused for the entire day — stay null. The status page renders them as grey bars. This is the right behavior: grey means “no data,” not “down.”
A day with 100 passing checks and 1 failing check is 'up'. That might feel wrong, but consider the alternative: a single flaky check (transient DNS hiccup at 3 AM) would turn an otherwise green day red. The hasUp === 1 decision reflects how users actually interpret daily history.
Multi-Channel Alerts — The Polymorphic Dispatch Pattern
AlertChannel has a type field ('email' | 'webhook' | 'telegram' | 'slack') and a config object. Only the relevant fields are populated per type: config.email for email channels, config.url + config.secret for webhooks, config.botToken + config.chatId for Telegram, config.slackWebhookUrl for Slack.
The dispatch worker reads channel.type and branches:
if (channel.type === 'email') {
  await sendEmail(channel.config.email!, emailSubject, emailHtml)
} else if (channel.type === 'webhook') {
  await sendWebhook(channel.config.url!, channel.config.secret, webhookPayload)
} else if (channel.type === 'telegram') {
  await sendTelegram(channel.config.botToken!, channel.config.chatId!, emailSubject)
} else if (channel.type === 'slack') {
  await sendSlack(channel.config.slackWebhookUrl!, emailSubject, webhookPayload)
}
Why not separate collections per channel type? Because MongoDB’s flexible document model and the dispatch logic are simple enough that a single collection with optional fields is less ceremony than four schemas. The polymorphic config is straightforward to reason about at this scale.
Webhook signing — webhook payloads are signed with HMAC-SHA256 and sent as X-Signature-256: sha256={hex}. Consumers can verify that the payload came from UptimeMonitor and wasn’t tampered with in transit. This mirrors GitHub’s webhook signature scheme, which most developers already know.
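A sketch of the signing step using Node’s built-in crypto module; the function name and payload shape are assumptions, while the header format comes from the text above.

import { createHmac } from 'node:crypto'

// Sign the serialized payload; the hex digest goes out as X-Signature-256: sha256={hex}
function signWebhookBody(secret: string, payload: unknown): { body: string; signature: string } {
  const body = JSON.stringify(payload)
  const hex = createHmac('sha256', secret).update(body).digest('hex')
  return { body, signature: `sha256=${hex}` }
}

Consumers recompute the HMAC over the exact raw body they received and compare it to the header value, ideally with a constant-time comparison.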
Telegram’s “chat not found” gotcha — this one bit us in production. Telegram’s bot API requires the user to send /start to the bot before the bot can message them. If you configure a Telegram channel and Telegram returns { ok: false, description: "Bad Request: chat not found" }, it’s not a credentials error — the user just hasn’t initiated the conversation with the bot. The error message is misleading. We discovered this the hard way and now surface it explicitly in the channel test flow.
Slack’s plain-text "ok" response — Slack’s incoming webhooks respond with the literal string ok (not JSON). Every other API in the dispatch stack returns JSON. The Slack handler reads res.text() and checks body !== 'ok' instead of parsing JSON. A small but real integration gotcha.
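Easy to get wrong if you reach for res.json() out of habit. A sketch, reduced to a two-argument sendSlack (the real handler described above also takes the structured payload):

async function sendSlack(webhookUrl: string, message: string): Promise<void> {
  const res = await fetch(webhookUrl, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ text: message }),
  })
  // Slack incoming webhooks answer with the literal string "ok", not JSON
  const body = await res.text()
  if (!res.ok || body !== 'ok') {
    throw new Error(`Slack webhook failed: ${res.status} ${body}`)
  }
}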
Both Telegram and Slack channels require a passing live test before they can be saved. The test runs the actual API call against your provided credentials. This enforces “don’t save broken channels” at the boundary where users are providing configuration, not at alert time when it’s too late.
Monorepo: Sharing Types Across the API Boundary
The project is a monorepo with apps/server, apps/client, and packages/shared. The shared package exports three things.
Zod schemas — used for request validation on the server and for TypeScript inference on the client. When CreateMonitorSchema gains a new required field, both the server validation and the client TypeScript type update in the same commit. The schema is the contract. There’s no “I updated the API but forgot to update the client types” class of bug.
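As a sketch, one of those shared schemas might look like this; the field names and the list of monitor types are assumptions inferred from this post, not the repo’s exact schema.

import { z } from 'zod'

// Assumed to live in packages/shared alongside the other constants
const MONITOR_TYPES = ['http', 'keyword', 'ping', 'port', 'heartbeat'] as const

export const CreateMonitorSchema = z.object({
  name: z.string().min(1),
  type: z.enum(MONITOR_TYPES),
  url: z.string().url().optional(),
  interval: z.union([z.literal(1), z.literal(5), z.literal(15), z.literal(30), z.literal(60)]),
})

// The client never writes this type by hand; it falls out of the same schema
export type CreateMonitorInput = z.infer<typeof CreateMonitorSchema>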
Constants — MONITOR_TYPES, MONITOR_STATUSES, SSL_ALERT_THRESHOLDS, CHECK_LOG_TTL_DAYS. One source of truth. The TTL index in the Mongoose schema references CHECK_LOG_TTL_DAYS from shared constants — so changing the retention period is a one-line edit that propagates everywhere.
String constants — error codes like MONITOR_NOT_FOUND, ORG_NOT_FOUND. API error responses and client-side error handling both import from the same source. No magic strings scattered across 12 files. No “I changed the error message on the server but the client is still matching the old string.”
The wiring is an npm workspace dependency: "@uptimemonitor/shared": "*" in each app’s package.json. Resolved locally at build time, no publishing to a registry needed.
// Server validation
import { CreateMonitorSchema } from '@uptimemonitor/shared/schemas'
const body = CreateMonitorSchema.parse(req.body)
// Client type inference (same schema, different import context)
import { type CreateMonitorInput } from '@uptimemonitor/shared/schemas'
The discipline required is that packages/shared has no dependencies on apps/*. It’s a leaf node in the dependency graph. Any shared logic that needs database access or business rules stays in the server app.
What We’d Reconsider
Honest section. These are things that work but have friction.
The status page aggregation queries run on the same cluster as check log inserts. Under sustained load — many monitors firing simultaneously — a slow aggregation for a public status page could contend for I/O with the insert path. A read replica is the correct fix. We haven’t hit this in practice, but the architecture doesn’t prevent it from becoming a problem at scale.
Spawning a subprocess per check is genuinely expensive. A 30-second timeout on a ping check means that subprocess could linger for 30 seconds, consuming a file descriptor and process table slot. A native ICMP library (or moving ping checks to a dedicated worker pool) would be cleaner. The current implementation is correct but not efficient at scale.
Winston structured logs are good — they tell you what happened. But they don’t tell you rates easily. How many checks are firing per minute? What’s the P95 response time for the alert dispatch queue? These are Prometheus counter/histogram questions. Structured metrics would make the system considerably easier to operate in production.
PLAN_LIMITS.free.maxMonitors = 20 is a constant in packages/shared. Fine for an internal tool or single-tenant deployment. For a multi-tenant SaaS where different organizations have different plan limits, this needs to move to a database field on the Organisation model. The current structure makes that migration mechanical, but it’s still a migration.
Uptime Monitoring Is Not a Solved Problem
The first version is 50 lines: setInterval, fetch, log the result. That version handles the demo.
The version that survives process restarts, handles maintenance windows without losing data, deduplicates SSL alerts atomically, renders 90-day history bars from raw check logs, and delivers alerts through four different channels with retry logic — that version is considerably more interesting.
The queue is what makes it reliable. Schedules live in Redis, not process memory; a restart is a non-event. The TTL index is what makes it cheap to run; storage stays bounded without operational toil. The Zod schemas shared across the API boundary are what make it safe to change; adding a field doesn’t require coordinating a server deploy with a client deploy.
The tradeoffs — single database cluster, shell-spawned pings, plan limits in constants — are real, and the right response to them depends on the scale you’re operating at. For a self-hosted tool, they’re fine. For a SaaS product serving thousands of organizations, they’re the first three items on the rewrite list.
The codebase is open source if you want to dig into the implementation: github.com/honeycoder96/Upcheck.
Engineering questions, or a different architecture you’ve used for uptime monitoring at scale? I’d like to hear about it.