Building Incrementally: Development Workflow

The Philosophy of Incremental Development

Building a complex NixOS configuration from scratch isn't about writing 700 lines of code and deploying it all at once. It's about starting small, getting something working, then adding complexity layer by layer. Each layer should be tested and verified before moving to the next.

This approach has several advantages:

*Immediate feedback:* You find out right away whether a change works

*Easier debugging:* When something breaks, you know exactly what changed

*Learning at each step:* You understand each component before adding the next

*Safety:* A broken intermediate step is easier to recover from than a broken final system

*Motivation:* Seeing working results keeps you motivated to continue

Phase 0: The Blank Slate

Before writing any configuration, you need:

*A machine to configure:* A VPS, a VM, or physical hardware with NixOS installed

*Basic access:* SSH working with key authentication

*A domain name:* DNS pointing to your server

*A plan:* Know what services you eventually want to run

You don't need the full plan immediately. Start with the core infrastructure, then add services one by one.

Phase 1: Minimal Bootable System

Your first goal is the smallest possible working configuration:

*What to configure:*

Basic system (bootloader, networking, timezone)
SSH access (so you can log in if something goes wrong)
One user account with sudo
That's it

*Why so minimal:*

Proves the flake mechanism works
Establishes you can connect and rebuild
Creates a known-good baseline
Takes minutes, not hours

*Testing:*

1. Build: nixos-rebuild build --flake .#snek

2. Switch: nixos-rebuild switch --flake .#snek

3. Verify: Can you still SSH in? Check that systemctl status reports no failed units (systemctl --failed lists them)

*Rollback plan:* If this breaks, you can boot from the previous generation in GRUB, or use the hosting provider's console to fix things.
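
The build, switch, verify steps above can be wrapped in a small helper. This is a minimal sketch, not a definitive script: the function name deploy_and_verify and the FLAKE_TARGET variable are made up for illustration, and it assumes you run it from the flake directory as a user who can sudo.

```shell
# Sketch of the Phase 1 loop: build, switch, then look for failed units.
# FLAKE_TARGET is a hypothetical variable; it defaults to the .#snek target used above.
FLAKE_TARGET="${FLAKE_TARGET:-.#snek}"

deploy_and_verify() {
    # 1. Build only: evaluates and builds without touching the running system.
    nixos-rebuild build --flake "$FLAKE_TARGET" || return 1

    # 2. Switch: activate the new configuration (assumes sudo is how you get root).
    sudo nixos-rebuild switch --flake "$FLAKE_TARGET" || return 1

    # 3. Verify: with --no-legend, any output means at least one unit has failed.
    failed=$(systemctl --failed --no-legend --plain)
    if [ -n "$failed" ]; then
        echo "failed units:"
        echo "$failed"
        return 1
    fi
    echo "ok: no failed units"
}
```

Run deploy_and_verify from a second terminal while keeping your existing SSH session open, so you still have a way in if the new configuration breaks logins.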

Phase 2: Add Caddy and Static Site

Once the base system works, add your edge proxy:

*What to configure:*

Caddy with rate limiting plugin
One virtual host serving a static HTML file
Firewall ports 80 and 443 open

*Why this next:*

Caddy is the foundation for all web services
Static site proves TLS and domain configuration work
Easy to verify (just visit the URL)

*Testing:*

1. Rebuild and switch

2. Create /var/www/snek.cc/index.html with some content

3. Visit https://snek.cc in a browser

4. Verify: Do you see your content? Is the certificate valid (padlock shown, no browser warning)?

*Common issues:*

DNS not propagated yet (wait or check with dig)
Firewall blocking ports
Caddy configuration syntax errors
Wrong domain in virtual host
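
The first two issues in the list above can be ruled out from any machine with dig and curl installed. The following is a rough sketch under those assumptions; the function name check_site is hypothetical, and it only tells you that something failed, not which of the remaining causes is responsible.

```shell
# Sketch: check that a domain resolves and answers over HTTPS with a valid certificate.
check_site() {
    domain="$1"

    # DNS: dig +short prints the A records, or nothing if the name does not resolve yet.
    addrs=$(dig +short A "$domain")
    if [ -z "$addrs" ]; then
        echo "dns: no A record for $domain (not propagated yet?)"
        return 1
    fi
    echo "dns: $domain resolves to:"
    echo "$addrs"

    # HTTPS: curl verifies the certificate by default; -f turns HTTP errors into failures.
    if curl -fsS -o /dev/null "https://$domain/"; then
        echo "https: ok"
    else
        echo "https: request failed (firewall, certificate, or Caddy configuration?)"
        return 1
    fi
}
```

For example, check_site snek.cc should end with "https: ok" once DNS, the firewall, and Caddy all agree.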

Phase 3: Add Secrets Management

Before adding services that need secrets, set up sops-nix:

*What to configure:*

Add sops-nix input to flake
Import sops-nix module
Create .sops.yaml with your age key
Create a test secret file and verify encryption/decryption

*Why before services:*

You don't want to add a service, realize you need secrets, then have to retrofit
Testing sops-nix standalone is easier than debugging it with a service
Gets your encryption workflow established

*Testing:*

1. Create secrets/test.yaml with a dummy secret

2. Edit with sops secrets/test.yaml, verify you can encrypt/decrypt

3. Configure NixOS to use that secret

4. Deploy and verify secret appears in /run/secrets/
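
As a sketch of the .sops.yaml step, the helper below writes a minimal rule file for one age key. The function names, the "admin" anchor name, and the secrets/*.yaml layout are assumptions for illustration; the public key is whatever age-keygen printed for you.

```shell
# Sketch: write a minimal .sops.yaml that encrypts secrets/*.yaml to one age key.
# Usage: write_sops_config age1...   (pass your own age public key)
write_sops_config() {
    pubkey="$1"
    cat > .sops.yaml <<EOF
keys:
  - &admin $pubkey
creation_rules:
  - path_regex: secrets/[^/]+\.yaml\$
    key_groups:
      - age:
          - *admin
EOF
}

# Sketch: confirm your key can read the test secret by decrypting it to /dev/null.
verify_secret_roundtrip() {
    sops -d secrets/test.yaml > /dev/null && echo "decrypt: ok"
}
```

After write_sops_config, running sops secrets/test.yaml opens an editor and encrypts on save. If verify_secret_roundtrip fails on your workstation, fix that before involving NixOS at all.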

Phase 4: Add One Service

Now add your first real service. Choose one that's:

Self-contained (no other service dependencies)
Easy to verify (has a health endpoint or obvious functionality)
Not too complex

Good candidates:

QuickDID (simple ATProto service)
Tangled Knot (but verify git clone works)

*What to configure:*

The service module configuration
Caddy virtual host for the service
Any secrets the service needs
Firewall port if needed (though most services are localhost-only)

*Testing:*

1. Rebuild with nixos-rebuild test first (temporary, non-persistent)

2. Check service status: systemctl status service-name

3. Check logs: journalctl -u service-name -f

4. Test functionality: Make a request to the service

5. If all good, switch permanently: nixos-rebuild switch
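
The test-before-switch sequence above could be scripted roughly as follows. Treat it as a sketch: test_then_switch and FLAKE_TARGET are hypothetical names, the flake target is assumed to be the .#snek one from Phase 1, and "active" is only a first approximation of "working".

```shell
# Sketch: activate temporarily, check one unit, and only then switch permanently.
# FLAKE_TARGET is a hypothetical variable defaulting to the target used in Phase 1.
FLAKE_TARGET="${FLAKE_TARGET:-.#snek}"

test_then_switch() {
    unit="$1"

    # Temporary activation: no boot entry is added, so a reboot undoes it.
    sudo nixos-rebuild test --flake "$FLAKE_TARGET" || return 1

    # Stop here if the unit did not come up, and show why.
    if ! systemctl is-active --quiet "$unit"; then
        echo "$unit is not active; last log lines:"
        journalctl -u "$unit" -n 20 --no-pager
        return 1
    fi

    # The unit is up: make the configuration persistent.
    sudo nixos-rebuild switch --flake "$FLAKE_TARGET"
}
```

Call it with the systemd unit name of the service you just added, for example test_then_switch caddy.service.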

*Verification checklist:*

Service process is running
No errors in logs
Responds correctly to requests
TLS certificate valid
Rate limiting working (test with many requests)
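
For the last item in the checklist, one rough way to "test with many requests" is to fire a burst of requests and tally the status codes. This is a sketch with a hypothetical function name and an arbitrary default of 50 requests; it assumes your rate limiter answers with a distinct status code (commonly 429) once the limit is hit.

```shell
# Sketch: send COUNT sequential requests to URL and count each HTTP status code.
probe_rate_limit() {
    url="$1"
    count="${2:-50}"   # assumed default; pick something above your configured limit
    i=0
    while [ "$i" -lt "$count" ]; do
        # Print only the status code of each response, one per line.
        curl -s -o /dev/null -w '%{http_code}\n' "$url"
        i=$((i + 1))
    done | sort | uniq -c
}
```

If every request comes back 200, either the limit is higher than the burst you sent or the rate limiting configuration is not being applied to that virtual host.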

Phase 5: Iterate and Add More Services

Repeat Phase 4 for each service:

*Order matters:* Consider dependencies:

Lycan needs PostgreSQL, so add PostgreSQL before Lycan
Services that proxy to other machines (via Tailscale) need Tailscale working first
Some services conflict on ports—don't enable both at once initially

*One at a time:* Don't add multiple new services in one rebuild. If something breaks, you won't know which service caused it.

*Verify before adding next:* Make sure the current service is fully working before moving on.

Testing Strategies

The Three Levels of Testing

*Build testing:* nixos-rebuild build

Evaluates configuration
Builds packages
Catches syntax errors and missing packages
Doesn't change running system
Use this when making significant changes

*Test activation:* nixos-rebuild test

Activates configuration temporarily
Services start, changes take effect
Not persistent—reboot reverts to previous configuration
Perfect for verifying changes work before committing
Use this for most changes

*Switch activation:* nixos-rebuild switch

Activates configuration permanently
Adds new boot entry
Previous configuration still available for rollback
Use this after verifying with test

When to Use Each

*Use build when:*

Making major structural changes
Refactoring module structure
Just want to check for syntax errors
Not ready to actually apply changes

*Use test when:*

Adding a new service
Changing service configuration
Testing firewall changes
Making any change you're not 100% confident about

*Use switch when:*

You've tested with test and everything works
Making minor, safe changes (adding a comment, updating a package)
Emergency fixes that need to persist

The Rollback Safety Net

NixOS generations are your safety net. Every switch creates a new generation.

*To see generations:*

nixos-rebuild list-generations

*To roll back immediately:*

nixos-rebuild switch --rollback

*To boot a previous generation:*

Reboot
In GRUB menu, select previous generation
System boots with that configuration
From there, troubleshoot, then fix the configuration and run nixos-rebuild switch so that a working generation becomes the default again

*When to roll back:*

Service won't start after a change
Network connectivity lost
System behaves unexpectedly
Any time you're not confident in the current state

Common Development Patterns

Pattern 1: The Minimal Test Case

When a service isn't working, create the simplest possible configuration that should work:

Remove rate limiting
Remove custom headers
Use default settings
Get it working, then add complexity back

Pattern 2: Check Logs First

When something breaks:

1. Check service logs: journalctl -u service-name -f

2. Check system logs: journalctl -xe

3. Check Caddy logs: journalctl -u caddy

Most problems are obvious in the logs.

Pattern 3: Validate Step by Step

Don't assume the whole chain works. Test each link:

1. Is the service running? systemctl status

2. Is it listening on the right port? ss -tlnp | grep PORT

3. Can you reach it locally? curl http://localhost:PORT

4. Can you reach it through Caddy? curl https://domain.com

5. Can you reach it from outside? Test from another machine

Wherever the chain breaks, that's your problem.
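
Links 1 to 4 of the chain above can be checked in order by a short script that stops at the first broken one. This is only a sketch: check_chain is a hypothetical name, and requesting "/" is an assumption, so swap in your service's health endpoint if it has one. Link 5 still has to be tested from another machine.

```shell
# Sketch: walk the chain unit -> port -> local HTTP -> Caddy and stop at the first break.
# Usage: check_chain UNIT PORT DOMAIN
check_chain() {
    unit="$1"; port="$2"; domain="$3"

    # 1. Is the service running?
    systemctl is-active --quiet "$unit" ||
        { echo "broken link 1: $unit is not running"; return 1; }

    # 2. Is something listening on the port? (matches the local address column of ss)
    ss -tln | grep -q ":$port " ||
        { echo "broken link 2: nothing listening on port $port"; return 1; }

    # 3. Does it answer locally? Any HTTP response counts as reachable here.
    curl -sS -o /dev/null "http://localhost:$port/" ||
        { echo "broken link 3: no local HTTP response on port $port"; return 1; }

    # 4. Does it answer through Caddy with a valid certificate?
    curl -sS -o /dev/null "https://$domain/" ||
        { echo "broken link 4: no HTTPS response via $domain"; return 1; }

    echo "links 1-4 ok; now repeat the last request from another machine"
}
```

The value of the ordering is that each message points at one layer: a failure at link 2 means the service configuration, while a failure at link 4 with links 1 to 3 passing points at Caddy, DNS, or the firewall.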

Pattern 4: Isolate Changes

Use git to track changes:

Commit working configuration
Make changes on a branch
Test thoroughly
Merge when verified

This lets you diff configurations and understand exactly what changed.
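
The branch workflow above might look like this in practice. It is a sketch with hypothetical helper names, and it assumes your known-good branch is called main. One detail matters with flakes: inside a Git repository, Nix only sees files that Git tracks, so a new module must at least be staged with git add before nixos-rebuild can find it.

```shell
# Sketch: isolate each change on its own branch off the known-good branch.
# BASE_BRANCH is a hypothetical variable; "main" is an assumed branch name.
BASE_BRANCH="${BASE_BRANCH:-main}"

start_change() {
    # Begin from the last known-good commit.
    git switch "$BASE_BRANCH" && git switch -c "$1"
}

save_change() {
    # Stage everything (new files included, so the flake can see them) and commit.
    git add -A && git commit -m "$1"
}

finish_change() {
    # Show what the branch changed, then merge it once it has been verified.
    git diff --stat "$BASE_BRANCH" "$1" &&
        git switch "$BASE_BRANCH" &&
        git merge --no-ff "$1"
}
```

A typical round trip would be start_change add-quickdid, edit and test, save_change "Add QuickDID", then finish_change add-quickdid.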

Debugging Workflow

When something doesn't work:

1. *Don't panic:* You can always roll back

2. *Check status:* What's failing? systemctl --failed

3. *Read logs:* What do the logs say? journalctl -u service

4. *Simplify:* Remove complexity, get back to a known state

5. *Test individually:* Test each component separately

6. *Search:* Look up error messages, check documentation

7. *Ask:* If stuck, ask for help with specific error messages

8. *Document:* When you solve it, write down what you learned

Performance Considerations

As you add more services:

*Watch resource usage:*

CPU: top, htop
Memory: free -h
Disk: df -h
Check Grafana dashboards

*Signs you need to scale:*

Services restarting due to OOM (Out Of Memory)
High load average consistently
Disk filling up
Slow response times

*Scaling strategies:*

Vertical: More CPU/RAM on same machine
Horizontal: Split services across multiple machines
Optimization: Tune service settings, add caching
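
Two of the warning signs above, a filling disk and OOM kills, are easy to check from the shell before you reach for a dashboard. The helpers below are a sketch with hypothetical names; the threshold is up to you, and the OOM check simply greps the kernel log of the current boot for the usual "Out of memory" message.

```shell
# Sketch: list mount points whose usage is at or above LIMIT percent.
# Usage: disks_over 90
disks_over() {
    limit="$1"
    # df -P gives stable columns: $5 is the capacity (e.g. 91%), $6 the mount point.
    df -P | awk -v limit="$limit" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)
            if (use + 0 >= limit + 0) print $6 " " $5
        }'
}

# Sketch: print kernel log lines from this boot that mention an out-of-memory kill.
oom_events() {
    journalctl -k -b --no-pager | grep -i 'out of memory'
}
```

For example, disks_over 90 prints nothing on a healthy machine, and oom_events printing anything at all is a strong hint that a service needs a memory limit or the machine needs more RAM.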

Documentation as You Go

Don't wait until the end to document:

Comment your configuration (explain why, not what)
Keep a log of what you did and why
Note any workarounds or special configuration
Document external dependencies (Tailscale IPs, DNS records)

This guide is that documentation—written as the system is built, explaining the reasoning behind each choice.

Final Verification

Before considering a service "done":

[ ] Service starts automatically on boot
[ ] Service responds correctly to requests
[ ] TLS certificate is valid
[ ] Rate limiting is working
[ ] Logs are being written and rotated
[ ] Monitoring is collecting metrics
[ ] Secrets are properly managed
[ ] Backups are configured (if data needs to persist)
[ ] Documentation is complete

The Journey, Not the Destination

Building this system isn't a one-time task—it's an ongoing process:

*Services evolve:* You'll update configurations, add features, optimize

*Nixpkgs updates:* You'll run nix flake update to get security patches

*Requirements change:* You'll add new services, remove old ones

*Learning continues:* You'll discover better ways to do things

The incremental approach isn't just for the initial build; it's how you maintain and evolve the system over time. Small, tested changes are safer than big-bang updates.

Embrace the process. Each working service is a victory. Each problem solved is knowledge gained. The goal isn't just a working system, but understanding how and why it works.