= Microcosm Services: Consuming the ATProto Firehose
== Understanding the Firehose
The AT Protocol network generates a constant stream of events—every post, like, follow, profile update, and repository change is broadcast to the network. This stream is called the "firehose." It's the nervous system of the ATProto ecosystem, carrying signals from Personal Data Servers to AppViews, indexers, and other consumers.
The firehose is accessible via WebSocket connections to services called "relays" or "Jetstream" endpoints. When you connect to a firehose, you receive a real-time feed of all public activity on the network. This is both powerful and overwhelming—the volume of data is substantial.
Microcosm services are designed to consume this firehose, process the events, and make them queryable. They sit between the raw firehose and applications that need to search, filter, or analyze the data.
== Spacedust: The Event Indexer
Spacedust connects to the firehose and indexes events for later retrieval. Think of it as a search engine for the ATProto network—it listens to everything and builds indexes that let you find specific content.
=== The Consumption Pattern
Spacedust follows a typical stream processing pattern:
1. *Connect:* Open a WebSocket connection to a Jetstream endpoint
2. *Subscribe:* Specify what types of events you're interested in (or subscribe to everything)
3. *Receive:* Process events as they arrive in real-time
4. *Index:* Store relevant data in a structured format
5. *Serve:* Provide APIs to query the indexed data
The WebSocket connection is persistent and long-lived. If the connection drops, Spacedust reconnects and resumes from where it left off using a cursor.
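The steps above can be sketched as a small loop. This is a minimal illustration, not Spacedust's actual code: the `frames` iterable stands in for WebSocket messages (a real consumer would read them with a WebSocket library), and the `index` is just a dict.

```python
import json

def consume(frames, index, start_cursor=None):
    """Steps 3-5 of the pattern: receive, index, and track the cursor.

    `frames` stands in for the WebSocket message stream; `index` is any
    dict-like store. Returns the last cursor seen, for resuming later.
    """
    cursor = start_cursor
    for frame in frames:
        event = json.loads(frame)
        if start_cursor is not None and event["time_us"] <= start_cursor:
            continue  # already processed before the restart
        index[event["time_us"]] = event   # 4. index the event
        cursor = event["time_us"]         # advance the cursor
    return cursor

# Stand-in for a short burst of firehose frames.
frames = [json.dumps({"time_us": t, "kind": "commit"}) for t in (100, 200, 300)]
index = {}
cursor = consume(frames, index)   # cursor is now 300
```

On reconnect, the returned cursor is passed back in as `start_cursor`, so events delivered twice across the restart are skipped.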
=== Cursors and Event Ordering
Jetstream assigns a cursor to each event—a monotonically increasing number that represents the position in the stream. When Spacedust connects, it can specify a cursor to start from. If it doesn't specify one, it starts from the current live position (missing earlier events); supplying an older cursor replays history from that point, within whatever replay window the server retains.
This cursor mechanism is crucial for reliability. If Spacedust crashes or restarts, it can resume from its last known cursor position and not miss events. The cursor is stored persistently (usually in the data directory) so it survives restarts.
Important: Cursor values are specific to a particular Jetstream instance. If you switch from one Jetstream server to another, the cursors don't align. This is why there's a jetstreamForce option—it tells Spacedust to ignore cursor validation and accept whatever cursor you provide, useful when switching servers.
=== Indexing Strategy
What Spacedust indexes depends on its configuration, but typically includes each record's collection (for example, app.bsky.feed.post), the author's DID, the record key, the event timestamp, and selected fields from the record body.
The indexing strategy balances completeness against storage cost, and ingest throughput against query flexibility: indexing more fields enables richer queries but costs more disk and CPU per event.
=== Query Interface
Once events are indexed, Spacedust provides HTTP endpoints to query them, such as lookups by author DID, by collection, or by time range, depending on what was indexed.
These APIs power applications that need to search the network—custom clients, analytics tools, moderation systems, etc.
=== Metrics and Observability
Spacedust exposes Prometheus metrics about its operation, covering event throughput, indexing lag, and storage usage (see Metrics and Monitoring below for a fuller list).
These metrics help you understand the health of the service and plan capacity. For example, if indexing lag is increasing, you might need more CPU/disk resources or the firehose volume might be growing.
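Indexing lag can be derived directly from event timestamps: Jetstream's time_us field is microseconds since the Unix epoch, so comparing the newest indexed event against the local clock gives how far behind the consumer is. A sketch:

```python
def indexing_lag_seconds(last_event_time_us: int, now_us: int) -> float:
    """Seconds between the newest indexed event and 'now'.

    This is the number a lag gauge (like the indexing_lag_seconds metric
    mentioned in this document) would report; clamped at zero because
    clock skew can make an event appear to come from the future.
    """
    return max(0.0, (now_us - last_event_time_us) / 1_000_000)

# An event stamped 2.5 s ago means we are 2.5 s behind real time.
lag = indexing_lag_seconds(1_000_000_000, 1_002_500_000)
```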
== Slingshot: The API Layer
While Spacedust focuses on indexing, Slingshot focuses on serving. It's the HTTP API layer that clients interact with.
=== Separation of Concerns
Spacedust and Slingshot are separate services because they have different scaling characteristics: indexing is write-heavy and bound by firehose volume, while serving is read-heavy and bound by client traffic.
By separating them, you can scale, deploy, and restart each side independently: a Slingshot update doesn't interrupt indexing, and a read-traffic spike doesn't slow ingestion.
=== API Design
Slingshot provides a RESTful (or GraphQL) API for querying the data that Spacedust has indexed. The API design follows ATProto conventions: DID-based identifiers, namespaced endpoint paths, and cursor-based pagination for large result sets.
=== Caching Strategy
To handle high read loads, Slingshot implements caching: frequently requested results are kept in memory (or an external cache) so that repeated queries don't hit the index every time.
The cache invalidation strategy is crucial—when new events are indexed, relevant cache entries must be invalidated to ensure clients see fresh data.
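A minimal sketch of that invalidation pattern, assuming cached results are tagged with the collections they depend on (the tagging scheme is illustrative, not Slingshot's actual design):

```python
class TaggedCache:
    """Query cache whose entries are invalidated by collection tag."""

    def __init__(self):
        self._entries = {}   # key -> (result, tags)

    def put(self, key, result, tags):
        self._entries[key] = (result, frozenset(tags))

    def get(self, key):
        entry = self._entries.get(key)
        return entry[0] if entry else None

    def invalidate(self, tag):
        # Drop every cached result that depends on the updated collection.
        self._entries = {k: v for k, v in self._entries.items()
                         if tag not in v[1]}

cache = TaggedCache()
cache.put("recent-posts", ["post1"], tags={"app.bsky.feed.post"})
cache.put("follower-count", 42, tags={"app.bsky.graph.follow"})
cache.invalidate("app.bsky.feed.post")   # a new post was indexed
```

After the invalidation, the post query misses and is recomputed, while the unrelated follower count stays cached.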
== The Jetstream Protocol
Understanding how Jetstream works helps you configure and troubleshoot Microcosm services.
=== WebSocket Communication
Jetstream uses WebSockets because they provide a persistent, low-overhead, bidirectional connection over which the server can push events as they happen, with no client polling.
The connection starts as an HTTP request with an "Upgrade: websocket" header. If the server accepts, the connection switches to the WebSocket protocol.
=== Message Format
Events from Jetstream are JSON objects containing the author's DID, a microsecond timestamp (time_us), an event kind (such as commit, identity, or account), and, for commit events, the operation, collection, record key, and record body.
This structured format makes it easy to parse and route events to the appropriate handlers.
=== Compression
Given the volume of data, compression is important. Jetstream typically supports per-message compression (using compression extensions to the WebSocket protocol). This reduces bandwidth usage, which matters both for the server and for consumers.
=== Rate Limiting and Backpressure
If a consumer can't keep up with the firehose, Jetstream may drop the connection or the consumer may fall behind. Backpressure handling is important: buffer bursts, monitor your lag, and shed load or scale up before the buffers overflow.
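A common in-process tactic is a bounded buffer between the WebSocket reader and the indexer, so that lag shows up as a measurable queue depth and drop count instead of unbounded memory growth. A sketch using a drop-oldest policy (the policy is an assumption; some consumers prefer to block or disconnect instead):

```python
from collections import deque

class BoundedBuffer:
    """Fixed-size event buffer: when full, the oldest event is dropped
    and counted, so a drop metric can raise an alert."""

    def __init__(self, maxlen: int):
        self._q = deque(maxlen=maxlen)
        self.dropped = 0

    def push(self, event):
        if len(self._q) == self._q.maxlen:
            self.dropped += 1        # the deque discards the oldest itself
        self._q.append(event)

    def pop(self):
        return self._q.popleft() if self._q else None

buf = BoundedBuffer(maxlen=2)
for e in ("a", "b", "c"):
    buf.push(e)          # "a" is dropped when "c" arrives
```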
== Firehose Consumers vs Producers
It's important to distinguish between different types of firehose participants:
*Producers (PDS instances):* Publish events to the firehose. Your PDS is a producer.
*Relays:* Aggregate events from multiple producers and redistribute them. They act as central clearinghouses.
*Consumers (AppViews, indexers):* Subscribe to the firehose and process events. Spacedust is a consumer.
Your server hosts both producers (the PDS) and consumers (Microcosm services). They operate independently but share the network infrastructure.
== Regional Jetstream Endpoints
Bluesky operates multiple Jetstream endpoints in different geographic regions. You might connect to:
* jetstream1.us-west.bsky.network (US West Coast)
* jetstream1.us-east.bsky.network (US East Coast)

*Why region matters:* connecting to the endpoint nearest your server reduces latency and the likelihood of falling behind the stream.
*The cursor problem:* Remember that cursors are region-specific. If you switch regions, you need to either:
1. Start from the current time (miss intervening events)
2. Start from the beginning (reprocess everything)
3. Use jetstreamForce to use a cursor from another region (may cause inconsistencies)
== Data Retention and Storage
The volume of firehose data is substantial, and a busy Jetstream delivers enough events that storage fills quickly if you keep everything.
Storage strategy matters:
*Hot storage:* Recent events in fast storage (SSD) for quick querying
*Warm storage:* Older events in slower storage for occasional access
*Cold storage:* Archival storage for compliance or historical analysis
*Pruning:* Deleting old data that's no longer needed
Your configuration should specify retention policies appropriate for your use case. Not everyone needs to keep years of data.
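Pruning under such a policy is just deleting everything whose timestamp falls outside the retention window. A sketch over an in-memory index keyed by time_us (a real service would run the equivalent delete against its database):

```python
def prune(index: dict, now_us: int, retention_days: int) -> int:
    """Delete events older than the retention window; return the count."""
    cutoff = now_us - retention_days * 86_400 * 1_000_000
    stale = [t for t in index if t < cutoff]
    for t in stale:
        del index[t]
    return len(stale)

day_us = 86_400 * 1_000_000
now = 100 * day_us
index = {now - 40 * day_us: "old", now - 5 * day_us: "recent"}
removed = prune(index, now, retention_days=30)   # drops the 40-day-old event
```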
== Filtering and Selective Consumption
Not all consumers need all events. Jetstream supports filtering by collection and by repository DID, specified as query parameters when you subscribe.
Filtering reduces the volume of data you need to process and store. If you're only indexing posts, you don't need to receive and discard follows and likes.
Because filtering happens server-side, it also reduces bandwidth and load on the Jetstream infrastructure. It's considered good citizenship to filter when you can.
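Filters are expressed as query parameters on the subscribe URL. A sketch of building one; the parameter names wantedCollections, wantedDids, and cursor follow Jetstream's documented options, but verify them against the version you run:

```python
from urllib.parse import urlencode

def subscribe_url(host, collections=(), dids=(), cursor=None):
    """Build a Jetstream subscribe URL with optional filters and cursor."""
    params = [("wantedCollections", c) for c in collections]
    params += [("wantedDids", d) for d in dids]
    if cursor is not None:
        params.append(("cursor", str(cursor)))
    query = urlencode(params)
    return f"wss://{host}/subscribe" + (f"?{query}" if query else "")

url = subscribe_url("jetstream1.us-east.bsky.network",
                    collections=["app.bsky.feed.post"],
                    cursor=1700000000000000)
```

Repeating the parameter once per collection lets you subscribe to several record types while still skipping everything else.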
== Error Handling and Resilience
Firehose consumers must be resilient to:
*Network interruptions:* WebSocket connections can drop. The consumer should automatically reconnect and resume.
*Service restarts:* When you update or restart Spacedust, it should pick up where it left off using the cursor.
*Malformed events:* The firehose might occasionally contain unexpected data. Graceful error handling prevents crashes.
*Backpressure:* If indexing can't keep up, the consumer should detect this (via lag metrics) and potentially shed load or scale up.
*Jetstream downtime:* If the Jetstream endpoint is down, the consumer should retry with exponential backoff.
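Exponential backoff with a cap and jitter is the standard shape for those retries. A sketch (the base and cap values are illustrative):

```python
import random

def backoff_delay(attempt: int, base=1.0, cap=60.0, rng=random.random):
    """Delay before reconnect attempt N: base * 2^N, capped, with full
    jitter so many consumers don't reconnect in lockstep after an outage."""
    return rng() * min(cap, base * (2 ** attempt))

# Delays grow (up to the cap) but are randomized within the window.
d0 = backoff_delay(0)    # in [0, 1)
d6 = backoff_delay(6)    # in [0, 60): capped at 60 s
```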
== Integration with NixOS
The NixOS modules for Microcosm services handle systemd unit definitions, dedicated service users, data directory creation, and rendering each service's configuration from your Nix options.
The services run as dedicated users with limited privileges. They can access their data directories and make network connections, but can't modify system files or access other services' data.
Port conflicts: Spacedust and Slingshot (and other services) each need a unique local port, assigned in your configuration.
These are configured in the service settings and referenced in Caddy's reverse proxy configuration.
== Metrics and Monitoring
Microcosm services expose metrics that Prometheus can scrape:
*Spacedust metrics include:*
* jetstream_total_events_received - Total events ingested
* jetstream_total_bytes_received - Volume of data
* indexing_lag_seconds - How far behind real-time
* database_size_bytes - Storage usage
* query_duration_seconds - API response times

*Slingshot metrics include:*
* http_requests_total - Request count by endpoint and status
* http_request_duration_seconds - Response time distribution
* cache_hit_ratio - Cache effectiveness
* active_connections - Current client connections

These metrics feed into your Grafana dashboards for visualization and alerting.
== Scaling Microcosm Services
As your usage grows, you might need to scale:
*Vertical scaling:* More CPU, RAM, faster disk for a single instance
*Horizontal scaling:* Multiple Spacedust instances partitioning the firehose (each handles a subset of DIDs)
*Read replicas:* Multiple Slingshot instances all querying the same indexed data behind a load balancer
*Sharding:* Splitting data across multiple databases by time range or DID range
The modular design (Spacedust and Slingshot separate) makes many of these scaling strategies possible.
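Partitioning by DID usually means hashing the DID to a shard number, so every instance agrees on which one owns a given repository. A sketch (the hash choice is illustrative; any stable hash works):

```python
import hashlib

def shard_for_did(did: str, num_shards: int) -> int:
    """Stable DID -> shard mapping; every instance computes the same answer."""
    digest = hashlib.sha256(did.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Instance i of N processes only events where shard_for_did(did, N) == i.
s = shard_for_did("did:plc:example", 4)   # deterministic, in 0..3
```

Using a cryptographic hash rather than Python's built-in `hash` avoids per-process randomization, which would break the "every instance agrees" property.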
== Use Cases
What do you do with indexed firehose data?
*Search:* Full-text search across the network
*Analytics:* Trending topics, growth metrics, engagement analysis
*Moderation:* Detecting spam, abuse, or policy violations
*Custom feeds:* Algorithmic feeds beyond what Bluesky provides
*Archival:* Long-term preservation of public discourse
*Research:* Academic study of social network dynamics
The firehose is the raw material; Microcosm services are the refineries that turn it into useful products.