= Microcosm Services: Consuming the ATProto Firehose
== Understanding the Firehose
The AT Protocol network generates a constant stream of events—every post, like, follow, profile update, and repository change is broadcast to the network. This stream is called the "firehose." It's the nervous system of the ATProto ecosystem, carrying signals from Personal Data Servers to AppViews, indexers, and other consumers.
The firehose is accessible via WebSocket connections to services called "relays" or "Jetstream" endpoints. When you connect to a firehose, you receive a real-time feed of all public activity on the network. This is both powerful and overwhelming—the volume of data is substantial.
Microcosm services are designed to consume this firehose, process the events, and make them queryable. They sit between the raw firehose and applications that need to search, filter, or analyze the data.
== Spacedust: The Event Indexer
Spacedust connects to the firehose and indexes events for later retrieval. Think of it as a search engine for the ATProto network—it listens to everything and builds indexes that let you find specific content.
=== The Consumption Pattern
Spacedust follows a typical stream processing pattern:
1. *Connect:* Open a WebSocket connection to a Jetstream endpoint
2. *Subscribe:* Specify what types of events you're interested in (or subscribe to everything)
3. *Receive:* Process events as they arrive in real-time
4. *Index:* Store relevant data in a structured format
5. *Serve:* Provide APIs to query the indexed data
The WebSocket connection is persistent and long-lived. If the connection drops, Spacedust reconnects and resumes from where it left off using a cursor.
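The steps above can be sketched as a small loop. This is a minimal illustration, not Spacedust's actual code: the `frames` iterable stands in for WebSocket messages (a real consumer would read them with a WebSocket library), and the `index` is just a dict.

```python
import json

def consume(frames, index, start_cursor=None):
    """Steps 3-5 of the pattern: receive, index, and track the cursor.

    `frames` stands in for the WebSocket message stream; `index` is any
    dict-like store. Returns the last cursor seen, for resuming later.
    """
    cursor = start_cursor
    for frame in frames:
        event = json.loads(frame)
        if start_cursor is not None and event["time_us"] <= start_cursor:
            continue  # already processed before the restart
        index[event["time_us"]] = event   # 4. index the event
        cursor = event["time_us"]         # advance the cursor
    return cursor

# Stand-in for a short burst of firehose frames.
frames = [json.dumps({"time_us": t, "kind": "commit"}) for t in (100, 200, 300)]
index = {}
cursor = consume(frames, index)   # cursor is now 300
```

On reconnect, the returned cursor is passed back in as `start_cursor`, so events delivered twice across the restart are skipped.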
=== Cursors and Event Ordering
Jetstream assigns a cursor to each event—a monotonically increasing number that represents the position in the stream. When Spacedust connects, it can specify a cursor to start from. If it doesn't specify one, it starts from the current live position (missing earlier events); supplying an older cursor replays history from that point, within whatever replay window the server retains.
This cursor mechanism is crucial for reliability. If Spacedust crashes or restarts, it can resume from its last known cursor position and not miss events. The cursor is stored persistently (usually in the data directory) so it survives restarts.
Important: Cursor values are specific to a particular Jetstream instance. If you switch from one Jetstream server to another, the cursors don't align. This is why there's a jetstreamForce option—it tells Spacedust to ignore cursor validation and accept whatever cursor you provide, useful when switching servers.
=== Indexing Strategy
What Spacedust indexes depends on its configuration, but typically includes each record's collection (for example, app.bsky.feed.post), the author's DID, the record key, the event timestamp, and selected fields from the record body.
The indexing strategy balances completeness against storage cost, and ingest throughput against query flexibility: indexing more fields enables richer queries but costs more disk and CPU per event.
=== Query Interface
Once events are indexed, Spacedust provides HTTP endpoints to query them, such as lookups by author DID, by collection, or by time range, depending on what was indexed.
These APIs power applications that need to search the network—custom clients, analytics tools, moderation systems, etc.
=== Metrics and Observability
Spacedust exposes Prometheus metrics about its operation, covering event throughput, indexing lag, and storage usage (see Metrics and Monitoring below for a fuller list).
These metrics help you understand the health of the service and plan capacity. For example, if indexing lag is increasing, you might need more CPU/disk resources or the firehose volume might be growing.
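Indexing lag can be derived directly from event timestamps: Jetstream's time_us field is microseconds since the Unix epoch, so comparing the newest indexed event against the local clock gives how far behind the consumer is. A sketch:

```python
def indexing_lag_seconds(last_event_time_us: int, now_us: int) -> float:
    """Seconds between the newest indexed event and 'now'.

    This is the number a lag gauge (like the indexing_lag_seconds metric
    mentioned in this document) would report; clamped at zero because
    clock skew can make an event appear to come from the future.
    """
    return max(0.0, (now_us - last_event_time_us) / 1_000_000)

# An event stamped 2.5 s ago means we are 2.5 s behind real time.
lag = indexing_lag_seconds(1_000_000_000, 1_002_500_000)
```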
== Slingshot: The API Layer
While Spacedust focuses on indexing, Slingshot focuses on serving. It's the HTTP API layer that clients interact with.
=== Separation of Concerns
Spacedust and Slingshot are separate services because they have different scaling characteristics: indexing is write-heavy and bound by firehose volume, while serving is read-heavy and bound by client traffic.
By separating them, you can scale, deploy, and restart each side independently: a Slingshot update doesn't interrupt indexing, and a read-traffic spike doesn't slow ingestion.
=== API Design
Slingshot provides a RESTful (or GraphQL) API for querying the data that Spacedust has indexed. The API design follows ATProto conventions: DID-based identifiers, namespaced endpoint paths, and cursor-based pagination for large result sets.
=== Caching Strategy
To handle high read loads, Slingshot implements caching: frequently requested results are kept in memory (or an external cache) so that repeated queries don't hit the index every time.
The cache invalidation strategy is crucial—when new events are indexed, relevant cache entries must be invalidated to ensure clients see fresh data.
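A minimal sketch of that invalidation pattern, assuming cached results are tagged with the collections they depend on (the tagging scheme is illustrative, not Slingshot's actual design):

```python
class TaggedCache:
    """Query cache whose entries are invalidated by collection tag."""

    def __init__(self):
        self._entries = {}   # key -> (result, tags)

    def put(self, key, result, tags):
        self._entries[key] = (result, frozenset(tags))

    def get(self, key):
        entry = self._entries.get(key)
        return entry[0] if entry else None

    def invalidate(self, tag):
        # Drop every cached result that depends on the updated collection.
        self._entries = {k: v for k, v in self._entries.items()
                         if tag not in v[1]}

cache = TaggedCache()
cache.put("recent-posts", ["post1"], tags={"app.bsky.feed.post"})
cache.put("follower-count", 42, tags={"app.bsky.graph.follow"})
cache.invalidate("app.bsky.feed.post")   # a new post was indexed
```

After the invalidation, the post query misses and is recomputed, while the unrelated follower count stays cached.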
== The Jetstream Protocol
Understanding how Jetstream works helps you configure and troubleshoot Microcosm services.
=== WebSocket Communication
Jetstream uses WebSockets because they provide a persistent, low-overhead, bidirectional connection over which the server can push events as they happen, with no client polling.
The connection starts as an HTTP request with an "Upgrade: websocket" header. If the server accepts, the connection switches to the WebSocket protocol.
=== Message Format
Events from Jetstream are JSON objects containing the author's DID, a microsecond timestamp (time_us), an event kind (such as commit, identity, or account), and, for commit events, the operation, collection, record key, and record body.
This structured format makes it easy to parse and route events to the appropriate handlers.
=== Compression
Given the volume of data, compression is important. Jetstream typically supports per-message compression (using compression extensions to the WebSocket protocol). This reduces bandwidth usage, which matters both for the server and for consumers.
=== Rate Limiting and Backpressure
If a consumer can't keep up with the firehose, Jetstream may drop the connection or the consumer may fall behind. Backpressure handling is important: buffer bursts, monitor your lag, and shed load or scale up before the buffers overflow.
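A common in-process tactic is a bounded buffer between the WebSocket reader and the indexer, so that lag shows up as a measurable queue depth and drop count instead of unbounded memory growth. A sketch using a drop-oldest policy (the policy is an assumption; some consumers prefer to block or disconnect instead):

```python
from collections import deque

class BoundedBuffer:
    """Fixed-size event buffer: when full, the oldest event is dropped
    and counted, so a drop metric can raise an alert."""

    def __init__(self, maxlen: int):
        self._q = deque(maxlen=maxlen)
        self.dropped = 0

    def push(self, event):
        if len(self._q) == self._q.maxlen:
            self.dropped += 1        # the deque discards the oldest itself
        self._q.append(event)

    def pop(self):
        return self._q.popleft() if self._q else None

buf = BoundedBuffer(maxlen=2)
for e in ("a", "b", "c"):
    buf.push(e)          # "a" is dropped when "c" arrives
```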
== Firehose Consumers vs Producers
It's important to distinguish between different types of firehose participants:
*Producers (PDS instances):* Publish events to the firehose. Your PDS is a producer.
*Relays:* Aggregate events from multiple producers and redistribute them. They act as central clearinghouses.
*Consumers (AppViews, indexers):* Subscribe to the firehose and process events. Spacedust is a consumer.
Your server hosts both producers (the PDS) and consumers (Microcosm services). They operate independently but share the network infrastructure.
== Regional Jetstream Endpoints
Bluesky operates multiple Jetstream endpoints in different geographic regions. You might connect to:
* jetstream1.us-west.bsky.network (US West Coast)
* jetstream1.us-east.bsky.network (US East Coast)

*Why region matters:* connecting to the endpoint nearest your server reduces latency and the likelihood of falling behind the stream.
*The cursor problem:* Remember that cursors are region-specific. If you switch regions, you need to either:
1. Start from the current time (miss intervening events)
2. Start from the beginning (reprocess everything)
3. Use jetstreamForce to use a cursor from another region (may cause inconsistencies)
== Data Retention and Storage
The volume of firehose data is substantial, and a busy Jetstream delivers enough events that storage fills quickly if you keep everything.
Storage strategy matters:
*Hot storage:* Recent events in fast storage (SSD) for quick querying
*Warm storage:* Older events in slower storage for occasional access
*Cold storage:* Archival storage for compliance or historical analysis
*Pruning:* Deleting old data that's no longer needed
Your configuration should specify retention policies appropriate for your use case. Not everyone needs to keep years of data.
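Pruning under such a policy is just deleting everything whose timestamp falls outside the retention window. A sketch over an in-memory index keyed by time_us (a real service would run the equivalent delete against its database):

```python
def prune(index: dict, now_us: int, retention_days: int) -> int:
    """Delete events older than the retention window; return the count."""
    cutoff = now_us - retention_days * 86_400 * 1_000_000
    stale = [t for t in index if t < cutoff]
    for t in stale:
        del index[t]
    return len(stale)

day_us = 86_400 * 1_000_000
now = 100 * day_us
index = {now - 40 * day_us: "old", now - 5 * day_us: "recent"}
removed = prune(index, now, retention_days=30)   # drops the 40-day-old event
```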
== Filtering and Selective Consumption
Not all consumers need all events. Jetstream supports filtering by collection and by repository DID, specified as query parameters when you subscribe.
Filtering reduces the volume of data you need to process and store. If you're only indexing posts, you don't need to receive and discard follows and likes.
Because filtering happens server-side, it also reduces bandwidth and load on the Jetstream infrastructure. It's considered good citizenship to filter when you can.
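Filters are expressed as query parameters on the subscribe URL. A sketch of building one; the parameter names wantedCollections, wantedDids, and cursor follow Jetstream's documented options, but verify them against the version you run:

```python
from urllib.parse import urlencode

def subscribe_url(host, collections=(), dids=(), cursor=None):
    """Build a Jetstream subscribe URL with optional filters and cursor."""
    params = [("wantedCollections", c) for c in collections]
    params += [("wantedDids", d) for d in dids]
    if cursor is not None:
        params.append(("cursor", str(cursor)))
    query = urlencode(params)
    return f"wss://{host}/subscribe" + (f"?{query}" if query else "")

url = subscribe_url("jetstream1.us-east.bsky.network",
                    collections=["app.bsky.feed.post"],
                    cursor=1700000000000000)
```

Repeating the parameter once per collection lets you subscribe to several record types while still skipping everything else.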
== Error Handling and Resilience
Firehose consumers must be resilient to:
*Network interruptions:* WebSocket connections can drop. The consumer should automatically reconnect and resume.
*Service restarts:* When you update or restart Spacedust, it should pick up where it left off using the cursor.
*Malformed events:* The firehose might occasionally contain unexpected data. Graceful error handling prevents crashes.
*Backpressure:* If indexing can't keep up, the consumer should detect this (via lag metrics) and potentially shed load or scale up.
*Jetstream downtime:* If the Jetstream endpoint is down, the consumer should retry with exponential backoff.
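Exponential backoff with a cap and jitter is the standard shape for those retries. A sketch (the base and cap values are illustrative):

```python
import random

def backoff_delay(attempt: int, base=1.0, cap=60.0, rng=random.random):
    """Delay before reconnect attempt N: base * 2^N, capped, with full
    jitter so many consumers don't reconnect in lockstep after an outage."""
    return rng() * min(cap, base * (2 ** attempt))

# Delays grow (up to the cap) but are randomized within the window.
d0 = backoff_delay(0)    # in [0, 1)
d6 = backoff_delay(6)    # in [0, 60): capped at 60 s
```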
== Integration with NixOS
The NixOS modules for Microcosm services handle systemd unit definitions, dedicated service users, data directory creation, and rendering each service's configuration from your Nix options.
The services run as dedicated users with limited privileges. They can access their data directories and make network connections, but can't modify system files or access other services' data.
Port conflicts: Spacedust and Slingshot (and other services) each need a unique local port, assigned in your configuration.
These are configured in the service settings and referenced in Caddy's reverse proxy configuration.
== Metrics and Monitoring
Microcosm services expose metrics that Prometheus can scrape:
*Spacedust metrics include:*
* jetstream_total_events_received - Total events ingested
* jetstream_total_bytes_received - Volume of data
* indexing_lag_seconds - How far behind real-time
* database_size_bytes - Storage usage
* query_duration_seconds - API response times

*Slingshot metrics include:*
* http_requests_total - Request count by endpoint and status
* http_request_duration_seconds - Response time distribution
* cache_hit_ratio - Cache effectiveness
* active_connections - Current client connections

These metrics feed into your Grafana dashboards for visualization and alerting.
== Scaling Microcosm Services
As your usage grows, you might need to scale:
*Vertical scaling:* More CPU, RAM, faster disk for a single instance
*Horizontal scaling:* Multiple Spacedust instances partitioning the firehose (each handles a subset of DIDs)
*Read replicas:* Multiple Slingshot instances all querying the same indexed data behind a load balancer
*Sharding:* Splitting data across multiple databases by time range or DID range
The modular design (Spacedust and Slingshot separate) makes many of these scaling strategies possible.
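Partitioning by DID usually means hashing the DID to a shard number, so every instance agrees on which one owns a given repository. A sketch (the hash choice is illustrative; any stable hash works):

```python
import hashlib

def shard_for_did(did: str, num_shards: int) -> int:
    """Stable DID -> shard mapping; every instance computes the same answer."""
    digest = hashlib.sha256(did.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Instance i of N processes only events where shard_for_did(did, N) == i.
s = shard_for_did("did:plc:example", 4)   # deterministic, in 0..3
```

Using a cryptographic hash rather than Python's built-in `hash` avoids per-process randomization, which would break the "every instance agrees" property.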
== Use Cases
What do you do with indexed firehose data?
*Search:* Full-text search across the network
*Analytics:* Trending topics, growth metrics, engagement analysis
*Moderation:* Detecting spam, abuse, or policy violations
*Custom feeds:* Algorithmic feeds beyond what Bluesky provides
*Archival:* Long-term preservation of public discourse
*Research:* Academic study of social network dynamics
The firehose is the raw material; Microcosm services are the refineries that turn it into useful products.