How we monitor our services

Monitoring all of our services makes sure we’re aware of problems when they occur, but most importantly, it helps us detect problems in advance — before they become outages. Our main tool for this task is Prometheus, an open source time-series database. It takes a snapshot of various metrics across all of our services every few seconds, then allows you to write queries which model trends in that data. Our instance is publicly available for you to explore at metrics.sr.ht.

The Prometheus docs have some good tips for writing queries. Here’s an example:

rate(node_cpu_seconds_total{mode="user",instance="node.git.sr.ht:80"}[2m])

This returns, for each CPU core on git.sr.ht, a value between 0 and 1 which represents the fraction of the last two minutes spent in user mode. You can check out the live graph here (note that the live graph automatically selects the scale appropriate for the data, so our server is probably not as overloaded as it looks there), or take a look at this graph I generated in advance (with the correct scale):

An SVG showing CPU utilization on git.sr.ht
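
To make the 0-to-1 range concrete: node_cpu_seconds_total is a counter of CPU seconds, so its per-second rate is the fraction of wall-clock time spent in that mode. Here is a rough sketch of that arithmetic in shell, using two made-up samples. The function name rate_per_second is hypothetical, and the real rate() is more careful: it also handles counter resets and extrapolates to the edges of the window.

```shell
#!/bin/sh -eu
# Simplified stand-in for PromQL's rate(): the increase of a counter divided
# by the elapsed time. Unlike the real rate(), this ignores counter resets
# and does no extrapolation.
# usage: rate_per_second <value1> <time1> <value2> <time2>
rate_per_second() {
	awk -v v1="$1" -v t1="$2" -v v2="$3" -v t2="$4" \
		'BEGIN { printf "%.2f\n", (v2 - v1) / (t2 - t1) }'
}

# Made-up samples: 90 CPU-seconds of user time accumulated over 120 seconds
# of wall-clock time, i.e. a core busy in user mode 75% of the time.
rate_per_second 1000 0 1090 120   # prints 0.75
```

A result of 0.75 here is exactly the threshold used by the CPU alarm in this post.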

We can easily turn this into an alarm by rephrasing it as a boolean expression and writing a little bit of YAML:

- alert: High CPU usage
  expr: &cpu_gt_75pct rate(node_cpu_seconds_total{mode="user"}[2m]) > 0.75
  << : &brief
     for: 5m
     labels:
       severity: interesting
  annotations:
    summary: "Instance {{ $labels.instance }} is under high CPU usage"

This is one of the few times where I take advantage of some of YAML’s more advanced features: &cpu_gt_75pct and &brief are YAML anchors, which work as a kind of macro and allow me to easily create derivative alarms, like these ones for network activity (the *sustained and *prolonged aliases refer to anchors which are not shown here):

- alert: High network activity
  expr: &net_gt_10mibsec >
    (rate(node_network_receive_bytes_total{device=~"eth0|ens3|enp.*"}[5m]) / 1024^2
    > 10) or (
    rate(node_network_transmit_bytes_total{device=~"eth0|ens3|enp.*"}[5m]) / 1024^2
    > 10)
  << : *brief
  annotations:
    summary: "Instance {{ $labels.instance }} >10 MiB/s network use"
- alert: Sustained high network activity
  expr: *net_gt_10mibsec
  << : *sustained
  annotations:
    summary: "Instance {{ $labels.instance }} sustained >10 MiB/s network use"
- alert: Prolonged high network activity
  expr: *net_gt_10mibsec
  << : *prolonged
  annotations:
    summary: "Instance {{ $labels.instance }} prolonged >10 MiB/s network use"
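
In the expression above, dividing by 1024^2 converts bytes per second into MiB per second, and the "or" means the alarm fires if either direction exceeds the threshold. As a rough illustration of that condition outside of Prometheus, here is a small shell sketch. The function name is borrowed from the anchor for readability, and the sample rates are made up.

```shell
#!/bin/sh -eu
# Succeeds (exit status 0) if either the receive or the transmit rate, given
# in bytes per second, exceeds 10 MiB/s. This mirrors the shape of the alert
# expression; it is not how Prometheus evaluates it.
# usage: net_gt_10mibsec <rx_bytes_per_sec> <tx_bytes_per_sec>
net_gt_10mibsec() {
	awk -v rx="$1" -v tx="$2" \
		'BEGIN { mib = 1024 * 1024; exit !(rx / mib > 10 || tx / mib > 10) }'
}

# Made-up rates: 12 MiB/s received, 1 MiB/s transmitted. The receive side
# alone is enough to trip the condition.
if net_gt_10mibsec 12582912 1048576; then
	echo "alarm"
fi
```

Note that the comparison is strict, so a rate of exactly 10 MiB/s does not trip it.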

You can see all of our alarms online on metrics.sr.ht, and you can see our YAML files in our git repository. Some of our users have helpfully sent us patches to add new alarms and refine our existing ones - thanks folks! The advantages of being an open source business are numerous.

When these alarms go off, they’re routed into Alertmanager. I found this particular part of the Prometheus stack a bit more annoying than the others, but eventually we rigged it up the way I like. We combine sachet with Twilio to route urgent alarms to my cell phone, and there I have some rules set up with Automate¹ to make my phone go nuts when a matching SMS message arrives - it starts beeping and vibrating until I acknowledge the message.

“Urgent” and “important” alarms are routed into our operational mailing list, sr.ht-ops, which is also public. Browsing this list, you’ll also see some automated emails which are not related to alarms — I’ll get to these in a moment. We also have “interesting” alarms, which are routed to IRC, along with the urgent and important alarms, using alertmanager-irc-relay. This IRC channel is also public: #sr.ht.ops on irc.freenode.net.
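
The routing described above boils down to a small table: each severity level reaches everything the level below it does, plus one more destination. The following shell sketch only restates that table. It is not our Alertmanager configuration, and the destination names are labels made up for this example.

```shell
#!/bin/sh -eu
# Prints the destinations for an alarm of the given severity, one per line,
# following the routing described in the text. The destination names are
# invented for this sketch and are not Alertmanager receiver names.
# usage: destinations_for <severity>
destinations_for() {
	case "$1" in
	urgent)
		printf '%s\n' sms mailing-list irc ;;
	important)
		printf '%s\n' mailing-list irc ;;
	interesting)
		printf '%s\n' irc ;;
	*)
		echo "unknown severity: $1" >&2
		return 1 ;;
	esac
}

destinations_for important
```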

Prometheus normally works by pulling stats off of daemons, but we also want to monitor ephemeral or one-off tasks. For this purpose, we have also set up Pushgateway. We mainly use this to keep track of timestamps: the timestamp of the last backup, the last ZFS snapshot, or the expiration dates of our SSL certificates. For example, we can get the age of our ZFS snapshots in seconds with this query:

time() - zfs_last_snapshot

An SVG showing the age of ZFS snapshots across the fleet
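
Because the metric holds a Unix timestamp, subtracting it from the current time gives an age in seconds, which can then be compared against a threshold. Here is a minimal shell sketch of the same arithmetic. The function name and the one-hour threshold are assumptions for illustration, not our actual alarm settings.

```shell
#!/bin/sh -eu
# Succeeds if <timestamp> is more than <max_age_seconds> older than <now>.
# usage: is_stale <now> <timestamp> <max_age_seconds>
is_stale() {
	[ "$(( $1 - $2 ))" -gt "$3" ]
}

now="$(date -u +'%s')"
last_snapshot="$((now - 900))"   # pretend the last snapshot was 15 minutes ago

if is_stale "$now" "$last_snapshot" 3600; then
	echo "snapshot is stale"
else
	echo "snapshot is fresh"
fi
```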

This is trivially populated with a simple cronjob:

#!/bin/sh -eu
host="$(hostname -f)"
stats() {
	printf '# TYPE zfs_last_snapshot gauge\n'
	printf '# HELP zfs_last_snapshot Unix timestamp of last ZFS snapshot\n'
	# -p prints the creation time as a Unix timestamp instead of a formatted date
	ts=$(zfs list -Hp -S creation -o creation -t snapshot | head -n1)
	printf 'zfs_last_snapshot{instance="%s"} %d\n' "$host" "$ts"
}

stats | curl --data-binary @- https://push.metrics.sr.ht/metrics/job/"$host"

Our backup age is filled in at the time the backup is taken with something like this:

#!/bin/sh -eu
export BORG_REPO=#...
export BORG_PASSPHRASE=#...

backup_start="$(date -u +'%s')"

echo "borg create"
borg create \
	::git.sr.ht-"$(date +"%Y-%m-%d_%H:%M")" \
	/var/lib/git \
	#...

echo "borg prune"
borg prune \
	--keep-hourly 48 \
	--keep-daily 60 \
	--keep-weekly 24 \
	--info --stats

stats() {
	backup_end="$(date -u +'%s')"
	printf '# TYPE last_backup gauge\n'
	printf '# HELP last_backup Unix timestamp of last backup\n'
	printf 'last_backup{instance="git.sr.ht"} %d\n' "$backup_end"
	printf '# TYPE backup_duration gauge\n'
	printf '# HELP backup_duration Number of seconds most recent backup took to complete\n'
	printf 'backup_duration{instance="git.sr.ht"} %d\n' "$((backup_end-backup_start))"
}

stats | curl --data-binary @- https://push.metrics.sr.ht/metrics/job/git.sr.ht

In addition to alarming if the age gets too large, the duration allows us to graph trends in the backup cost so that we can make informed judgements about the backup schedule:

An SVG showing the duration of backups. git.sr.ht takes the longest at about 3000 seconds.

Backups are of critical importance in our ops plan, so we have redundant monitoring for them. This is where the other emails you saw on sr.ht-ops come into play. We configured ZFS zed to send emails to the mailing list on ZFS events, and left debug mode on so that it’s emailing us even when there are no errors. We set up a cronjob to run a scrub on the 1st of the month, and the resulting zed emails are posted to the ops list. We also do weekly borg backup checks, which are posted to the mailing list every Sunday night.

Reliability is important to us. SourceHut is still an alpha, but we consider a robust approach to reliability a blocker for the production phase, and we are taking the development of this approach seriously during the alpha and beta periods. You can read more about our ops work in our public operations manual. And it’s working: we have 99.99% uptime for 2020, and have only suffered one unplanned partial outage this year.² Let’s aim for 99.999% in 2021!
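
For a sense of scale, an uptime percentage can be turned into a downtime budget in minutes. The sketch below does that arithmetic in shell. The function name is made up, and it treats every minute of the year equally, which is a simplification, since how a partial outage counts against uptime is a judgement call.

```shell
#!/bin/sh -eu
# Prints the downtime, in minutes, permitted by an uptime percentage over a
# period of the given number of days.
# usage: downtime_budget <uptime_percent> <days>
downtime_budget() {
	awk -v pct="$1" -v days="$2" \
		'BEGIN { printf "%.1f\n", (100 - pct) / 100 * days * 24 * 60 }'
}

downtime_budget 99.99 366    # 2020 was a leap year; prints 52.7
downtime_budget 99.999 365   # 2021; prints 5.3
```

So 99.99% leaves a little under an hour of downtime for the whole year, while 99.999% leaves only about five minutes.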

¹ Unfortunately, Automate and Twilio are the only parts of the stack which are not open source. Replacing Twilio would be hard, but I bet Automate is low-hanging fruit. Suggestions would be welcome!
² Two pages were affected, and it lasted 46 minutes. We fixed the problem, prevented future problems of the sort, and added new alarms which would catch a broader range of problems in the future. Read the details here.