= Monitoring Stack: Prometheus and Grafana =

== The Philosophy of Observability ==

Running a complex system without monitoring is like flying a plane without instruments. You might be fine in clear weather, but when something goes wrong, you're blind. Observability gives you visibility into what's happening inside your system.

There are three pillars of observability:

*Metrics:* Numerical data over time (CPU usage, request rates, error counts)

*Logs:* Discrete events (errors, access logs, application events)

*Traces:* Request flows through multiple services (less relevant for your current setup)

Your monitoring stack focuses on metrics, using Prometheus to collect and store them, and Grafana to visualize them.

== Prometheus: Time-Series Database ==

Prometheus is a monitoring system designed for reliability and efficiency. It uses a pull model—it periodically scrapes HTTP endpoints to collect metrics, rather than waiting for applications to push data.

=== How Prometheus Works ===

*Data Model:* Prometheus stores data as time series—sequences of timestamped values identified by metric names and key-value pairs called labels.

Example:

{{{
http_requests_total{method="GET", status="200", endpoint="/api/posts"} 1027
http_requests_total{method="POST", status="201", endpoint="/api/posts"} 156
}}}

Here, `http_requests_total` is the metric name. `{method="GET", ...}` are labels. `1027` is the value—the total number of GET requests to `/api/posts` that returned 200.

*Scraping:* Prometheus has a list of targets (URLs to scrape). On a configured interval (default 15 seconds), it makes HTTP requests to each target, parses the response (which is in a simple text format), and stores the data.

*Retention:* Prometheus is designed for operational monitoring, not long-term storage. By default, it keeps data for 15 days. Older data is deleted. For long-term retention, you'd use a separate system like Thanos or Cortex.

*Querying:* Prometheus provides a query language called PromQL. You can ask questions like:
- "What was the CPU usage 5 minutes ago?"
- "What's the rate of requests per second over the last hour?"
- "Which pods are using more than 80% of their memory limit?"

=== The Prometheus Configuration ===

Prometheus is configured via a YAML file. Key sections:

*global:* Default settings like scrape interval and evaluation interval

*scrape_configs:* List of jobs to scrape

Each job has:
- *job_name:* Identifier for this group of targets
- *scrape_interval:* How often to scrape (can override global)
- *static_configs:* List of targets (host:port combinations)
- *labels:* Additional labels to attach to all metrics from these targets

Your configuration defines several jobs:
- *prometheus:* Scrapes itself (metrics about Prometheus)
- *microcosm-ufos:* Scrapes the UFOs service on port 8765
- *node-exporter:* Scrapes system metrics on port 9100
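
A minimal sketch of what that scrape configuration might look like (the ports come from the jobs above; the `localhost` targets and exact layout are assumptions):

{{{
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ["localhost:9090"]

  - job_name: microcosm-ufos
    static_configs:
      - targets: ["localhost:8765"]

  - job_name: node-exporter
    static_configs:
      - targets: ["localhost:9100"]
}}}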

=== The Text Format ===

When Prometheus scrapes a target, it expects a specific text format:

{{{
# HELP http_requests_total Total number of HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
http_requests_total{method="POST",status="201"} 156
}}}

Lines starting with `#` are comments (`HELP` provides documentation, `TYPE` declares the metric type). Other lines are metrics.

Metric types include *counters* (values that only increase), *gauges* (values that go up and down), *histograms* (distributions bucketed by value), and *summaries* (pre-computed quantiles).

=== Metric Types in Practice ===

*Counters* are for things you want to rate over time:

{{{
rate(http_requests_total[5m])  # Requests per second over last 5 minutes
}}}

*Gauges* are for current state:

{{{
node_memory_MemAvailable_bytes  # Current available memory
}}}

*Histograms* let you understand distributions:

{{{
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# 95th percentile request duration
}}}

== Grafana: Visualization and Dashboards ==

Prometheus stores data but isn't designed for human consumption. Grafana provides the user interface—charts, graphs, and dashboards that make the data understandable.

=== How Grafana Works ===

*Data Sources:* Grafana connects to various data sources. Your configuration uses Prometheus as the primary (and only) data source. Grafana makes queries to Prometheus using PromQL.

*Dashboards:* Collections of panels (visualizations) arranged on pages. Each panel is a query result rendered as a chart, table, gauge, etc.

*Provisioning:* Grafana can be configured entirely through files rather than clicking through the UI. This is important for you—your dashboards are defined as JSON files in your Nix configuration, not created interactively and lost if Grafana's database is wiped.

=== Dashboard Structure ===

A dashboard JSON file contains:

*Metadata:* Title, description, tags, timezone settings

*Panels:* Array of visualization definitions, each with:
- Type (graph, table, stat, gauge, etc.)
- Query (PromQL expression)
- Display options (colors, legends, thresholds)
- Position and size on the dashboard

*Templating:* Variables that let users filter the dashboard (e.g., select which service to view)

*Annotations:* Markers for events (deployments, incidents) that appear on graphs
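
A heavily trimmed sketch of that structure (the panel type and query are illustrative, not taken from your dashboards; real exports carry many more fields, such as `schemaVersion` and per-panel display options):

{{{
{
  "title": "System Health",
  "tags": ["provisioned"],
  "panels": [
    {
      "type": "timeseries",
      "title": "CPU usage",
      "gridPos": { "x": 0, "y": 0, "w": 12, "h": 8 },
      "targets": [
        { "expr": "1 - avg(rate(node_cpu_seconds_total{mode=\"idle\"}[5m]))" }
      ]
    }
  ]
}
}}}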

=== Your Dashboards ===

You have two provisioned dashboards:

*UFOs ATProto Overview:* Metrics from the UFOs service
- Jetstream event rates
- Database sizes
- API request rates
- Error counts

*System Health:* Overall system metrics
- CPU, memory, disk usage
- Network I/O
- Load average
- Service status

=== Provisioning vs Interactive ===

*Interactive setup:* You log into Grafana, click "Create Dashboard," build panels via the UI, save. Dashboard is stored in Grafana's database.

*Provisioning:* You write JSON files, put them in `/etc/grafana-dashboards/`, Grafana loads them on startup. Dashboards are in git, version controlled, reproducible.

*Why provision:*
- Infrastructure as code: dashboards are part of your system configuration
- Version control: track changes, review modifications
- Reproducibility: new instances get the same dashboards automatically
- Backup: dashboards survive database wipes

=== The Provisioning Configuration ===

In your Nix configuration, you specify:
- *Datasource:* How to connect to Prometheus
- *Dashboard provider:* Where to find dashboard JSON files
- *Update interval:* How often to check for new/updated dashboards

The `environment.etc` entries copy your dashboard files to `/etc/grafana-dashboards/`, where Grafana's provisioning system picks them up.
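
A sketch of how those pieces can look in Nix (option paths follow the `services.grafana.provision` module; the file names are hypothetical):

{{{
services.grafana.provision = {
  enable = true;
  # Datasource: point Grafana at the local Prometheus
  datasources.settings.datasources = [{
    name = "Prometheus";
    type = "prometheus";
    url = "http://localhost:9090";
  }];
  # Dashboard provider: load JSON files from this directory
  dashboards.settings.providers = [{
    name = "default";
    options.path = "/etc/grafana-dashboards";
  }];
};

# Copy a dashboard file into place (path is illustrative)
environment.etc."grafana-dashboards/system-health.json".source =
  ./dashboards/system-health.json;
}}}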

== Node Exporter: System Metrics ==

Node exporter is a Prometheus exporter that exposes Linux system metrics. It reads from `/proc` and `/sys` filesystems and presents the data in Prometheus format.

*What it exposes:*
- CPU usage (by mode: user, system, idle, iowait)
- Memory (total, used, free, buffers, cache)
- Disk (usage by filesystem, I/O rates)
- Network (bytes/packets sent/received, errors)
- System (load average, uptime, boot time)
- Process counts
- File descriptor usage
- And much more

*Collectors:* Node exporter is modular. You enable specific collectors based on what you need. Your configuration enables:
- *systemd:* Metrics about systemd services (are they running, restart counts)
- *processes:* Information about running processes

Other available collectors include textfile (read arbitrary metrics from files), hwmon (hardware sensors), and many more.
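
In the NixOS module, enabling the exporter with those collectors looks roughly like this (a sketch; your actual configuration may differ):

{{{
services.prometheus.exporters.node = {
  enable = true;
  port = 9100;  # matches the node-exporter scrape job
  enabledCollectors = [ "systemd" "processes" ];
};
}}}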

=== Why System Metrics Matter ===

System metrics tell you about the health of the infrastructure:

*Resource exhaustion:* Is the server running out of CPU, memory, or disk?
*I/O bottlenecks:* Is the disk saturated? Network?
*Service health:* Are services running? How often do they restart?
*Capacity planning:* Trends over time help predict when you'll need more resources
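
For example, a quick PromQL check for memory pressure using standard node exporter metrics:

{{{
# Fraction of memory still available; values near 0 mean trouble
node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
}}}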

== Service Metrics ==

Beyond system metrics, your individual services expose their own metrics:

*UFOs (Microcosm):*
- `jetstream_total_events_received` - Events ingested
- `jetstream_total_bytes_received` - Data volume
- `database_size_bytes` - Storage usage
- `query_duration_seconds` - API response times

*Lycan:*
- `firehose_lag_seconds` - How far behind real-time
- `database_query_duration_seconds` - Query performance
- `feed_requests_total` - Feed API usage

*Caddy:* (via logs, or with additional configuration)
- Request rates by endpoint
- Response codes
- Latency percentiles

These application-specific metrics are often more actionable than system metrics. A high error rate in a specific API endpoint tells you more than "CPU is at 80%."
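
For instance, assuming `jetstream_total_events_received` is a counter, its ingestion rate is:

{{{
rate(jetstream_total_events_received[5m])  # events per second over 5 minutes
}}}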

== Alerting (Not Yet Configured) ==

While you have metrics collection and visualization, you don't currently have alerting. Prometheus includes an alerting system (Alertmanager) that can send notifications when metrics cross thresholds.

*Example alerts you might configure:*
- Disk usage > 90%
- Service down for > 2 minutes
- Error rate > 1%
- Memory usage > 85%
- SSL certificate expires in < 7 days

Alerts go to email, Slack, PagerDuty, or other destinations. They're defined in Prometheus configuration, separate from the dashboards.
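
A sketch of what two of those alerts could look like as Prometheus alerting rules (thresholds from the list above; names, labels, and durations are illustrative):

{{{
groups:
  - name: basics
    rules:
      - alert: DiskAlmostFull
        expr: |
          (node_filesystem_size_bytes - node_filesystem_free_bytes)
            / node_filesystem_size_bytes > 0.90
        for: 10m
        labels:
          severity: warning
      - alert: ServiceDown
        expr: up{job="microcosm-ufos"} == 0
        for: 2m
        labels:
          severity: critical
}}}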

== Metric Retention and Storage ==

Prometheus stores metrics in its time-series database (TSDB). The storage format is efficient but not infinite.

*Storage characteristics:*
- Compressed, efficient on-disk format
- Default retention: 15 days
- Retention based on time, not data volume (unless configured otherwise)
- Queries across long time ranges become slower

*For your setup:* 15 days is probably sufficient for operational monitoring. If you need longer-term data for capacity planning or compliance, you'd:
- Increase Prometheus retention
- Use remote storage (VictoriaMetrics, Thanos, Cortex)
- Export to long-term storage systems
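
Raising retention is a one-line change in the NixOS module (`90d` is just an example value):

{{{
services.prometheus.retentionTime = "90d";
}}}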

== Grafana Anonymous Access ==

Your Grafana configuration enables anonymous access:

"auth.anonymous" = {

enabled = true;

org_role = "Viewer";

}

}}}

This means anyone can view dashboards without logging in. They get "Viewer" role—can see dashboards but can't modify them.

*Security implications:* anyone who can reach the Grafana port can browse your dashboards, and metrics can reveal details about your infrastructure (hostnames, service names, traffic patterns). That may be acceptable for a deliberately public instance, but not for anything sensitive.

To restrict access, disable anonymous access and configure authentication (via OAuth, LDAP, etc.).
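
The first step is just flipping the flag in the same settings block:

{{{
"auth.anonymous".enabled = false;  # Grafana then requires login
}}}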

== Integration with NixOS ==

The NixOS modules for Prometheus and Grafana handle the plumbing:

*Prometheus module:* generates the YAML configuration from your Nix options, creates the service user and data directory, and manages the systemd unit

*Grafana module:* writes the provisioning files for datasources and dashboards and manages the systemd unit

When you change monitoring configuration and rebuild:

1. NixOS regenerates config files

2. Services reload (Prometheus can reload without restart; Grafana typically needs restart)

3. New scrape configs take effect

4. New dashboards appear (or updated dashboards refresh)
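
The standard workflow applies:

{{{
sudo nixos-rebuild switch  # regenerate configs, then restart/reload services
}}}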

== Common Prometheus Queries ==

Here are useful PromQL queries for your setup:

*Request rate:*

{{{
rate(http_requests_total[5m])
}}}

*Error rate percentage:*

{{{
rate(http_requests_total{status=~"5.."}[5m])
  /
rate(http_requests_total[5m]) * 100
}}}

*Disk usage percentage:*

{{{
(node_filesystem_size_bytes - node_filesystem_free_bytes)
  /
node_filesystem_size_bytes * 100
}}}

*Service up/down:*

{{{
up{job="microcosm-ufos"}
# Returns 1 if service is up, 0 if down
}}}

*Top memory consumers:*

{{{
topk(10, process_resident_memory_bytes)
}}}

*Network traffic:*

{{{
rate(node_network_receive_bytes_total[5m])
}}}

== Troubleshooting Monitoring ==

"No data in Grafana":

"Metrics missing":

"Grafana won't start":

"Dashboards not appearing":

== The Observability Mindset ==

Monitoring isn't just about having pretty graphs. It's about:

*Understanding normal:* What's the baseline? What's typical load?

*Detecting anomalies:* When does something deviate from normal?

*Debugging incidents:* When things break, what metrics explain why?

*Capacity planning:* Are resources trending toward exhaustion?

*Validation:* Did that change actually improve performance?

Good monitoring helps you sleep better—you'll know when things break, and you'll have the data to fix them quickly.