This guide covers the most common problems you might encounter when building or running your NixOS system, along with solutions and debugging strategies.
3. Network Connectivity Issues
6. Secrets Management Problems
8. Tailscale Connection Problems
*Symptoms:*
{{{
error: undefined variable 'somePackage'
}}}
*Causes:*
- Typo in package name
- Package not available in nixpkgs channel
- Missing `with pkgs;` or fully qualified name
*Solutions:*
{{{
# Check if the package exists
nix search nixpkgs package-name
# Use the fully qualified name in your configuration:
#   pkgs.package-name instead of just package-name
# Update nixpkgs
nix flake update
}}}
*Symptoms:*
{{{
error: infinite recursion encountered
}}}
*Causes:*
- Circular dependencies in configuration
- Using `config` in option declarations
- `rec` attribute sets with self-references
*Solutions:*
- Use `let...in` instead of `rec` for complex calculations
- Don't reference `config` at the top level of a module (for example in `imports`)
- Break circular dependencies by restructuring
=== Syntax Errors ===
*Symptoms:*
{{{
error: syntax error, unexpected ',', expecting ';'
}}}
*Common mistakes:*
*Debugging:*
{{{
# Check syntax before building
nix-instantiate --eval --strict your-file.nix
# Or use nixos-rebuild dry-build
sudo nixos-rebuild dry-build --flake .#snek
}}}
=== Lock File Issues ===
*Symptoms:*
{{{
error: flake.lock file is corrupted
}}}
*Solutions:*
{{{
# Regenerate lock file
rm flake.lock
nix flake lock
# Or update all inputs
nix flake update
}}}
== Service Won't Start ==
=== Check Service Status ===
*Basic diagnostics:*
{{{
# Check if service is enabled
systemctl is-enabled service-name
# Check if service is active
systemctl is-active service-name
# Get detailed status
systemctl status service-name
# Check for failed services
systemctl --failed
}}}
*Most important commands:*
{{{
# View recent logs
journalctl -u service-name -n 100
# Follow logs in real-time
journalctl -u service-name -f
# View all logs since boot
journalctl -u service-name -b
# Check for errors only
journalctl -u service-name -p err
}}}
=== Common Service Issues ===
*Permission Denied:*
{{{
# Check file ownership
ls -la /var/lib/service-name/
# Fix ownership
sudo chown -R service-user:service-group /var/lib/service-name/
# Check service user in config
systemctl cat service-name | grep User
}}}
*Port Already in Use:*
{{{
# Find what's using port
sudo ss -tlnp | grep :PORT
# Check service configuration for correct port
# Change port in config or stop conflicting service
}}}
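A plain `grep :PORT` also matches longer port numbers (searching for `:80` hits `:8080` too). The following is a minimal sketch of a stricter filter, assuming the usual `ss -tlnp` column layout in which the local address is the fourth field. The function name `listeners_on_port` is hypothetical.

```shell
# listeners_on_port PORT: read `ss -tlnp` output on stdin and print only the
# LISTEN rows whose local address (4th column) ends in exactly :PORT.
listeners_on_port() {
  awk -v port="$1" '
    $1 == "LISTEN" {
      n = split($4, parts, ":")   # local address, e.g. 0.0.0.0:80 or [::]:80
      if (parts[n] == port) print
    }'
}

# Usage (on the server):
#   sudo ss -tlnp | listeners_on_port 80
```

The last column of each printed row names the process holding the port, which tells you which service to stop or reconfigure.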
*Missing Environment Variables:*
{{{
# Check environment file exists
ls -la /run/secrets/
# Verify sops secrets are decrypted
sudo cat /run/secrets/SECRET_NAME
# Check systemd service environment
systemctl show service-name --property=Environment
}}}
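When a secret is consumed as an environment file, it can help to confirm that every variable the service expects is actually defined in it. This is a rough sketch that assumes a simple `KEY=value` file format. The function name `missing_env_vars` and the variable names in the usage line are hypothetical.

```shell
# missing_env_vars FILE VAR...: print each VAR that has no KEY=value line in
# FILE. Returns non-zero if anything is missing (or FILE cannot be read).
missing_env_vars() {
  file="$1"; shift
  missing=0
  for var in "$@"; do
    if ! grep -q "^${var}=" "$file" 2>/dev/null; then
      echo "missing: $var"
      missing=1
    fi
  done
  return "$missing"
}

# Usage (run as a user that can read the secret):
#   missing_env_vars /run/secrets/SECRET_NAME DATABASE_URL API_KEY
```

Only the variable names are printed, so the secret values never reach the terminal or the shell history.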
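{{{
#!/usr/bin/env bash
# Print a quick diagnostic summary for one systemd service.
# Run as root so process and socket details are visible.
SERVICE="${1:?usage: $0 service-name}"
echo "=== Service: $SERVICE ==="
echo "Status:"
systemctl status "$SERVICE" --no-pager
echo -e "\nRecent Logs:"
journalctl -u "$SERVICE" --no-pager -n 20
echo -e "\nProcess Info:"
# Ask systemd for the main PID; pgrep -f would also match this script itself
PID=$(systemctl show -p MainPID --value "$SERVICE")
if [ "${PID:-0}" -gt 0 ]; then ps -f -p "$PID"; else echo "Not running"; fi
echo -e "\nOpen Files:"
ls -la "/proc/$PID/fd" 2>/dev/null | head -10
echo -e "\nNetwork Connections:"
ss -tlnp | grep -i "$SERVICE" || echo "No listening ports"
}}}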
== Network Connectivity Issues ==
=== Can't Reach Services Externally ===
*Check from the server:*
{{{
# Is service listening?
ss -tlnp | grep :PORT
# Can you connect locally?
curl http://localhost:PORT
# Is firewall open?
sudo iptables -L -n | grep PORT
}}}
*Check DNS:*
{{{
# Does domain resolve?
dig +short your-domain.com
# Does it point to your IP?
curl ifconfig.me # Your IP
# Compare with dig output
}}}
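The comparison between the resolved addresses and your public IP can be scripted rather than done by eye. This is a minimal sketch; the function name `dns_points_here` is hypothetical, and `curl -4` is used on the assumption that you are checking A (IPv4) records.

```shell
# dns_points_here PUBLIC_IP RESOLVED: succeed if PUBLIC_IP is one of the
# newline-separated addresses in RESOLVED (whole-line, literal match).
dns_points_here() {
  printf '%s\n' "$2" | grep -qxF -- "$1"
}

# Usage:
#   if dns_points_here "$(curl -4 -s ifconfig.me)" "$(dig +short your-domain.com)"; then
#     echo "DNS points at this server"
#   else
#     echo "DNS mismatch"
#   fi
```

Matching whole lines literally avoids false positives such as `203.0.113.7` matching `203.0.113.70`.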
*Check Caddy:*
{{{
# Is Caddy running?
systemctl status caddy
# Check Caddy logs
journalctl -u caddy -f
# Test Caddy directly
curl -H "Host: your-domain.com" http://localhost
}}}
*Symptoms:* `curl: (7) Failed to connect`
*Causes:*
*Solutions:*
{{{
# Check which interface the service binds to
sudo ss -tlnp | grep SERVICE
# 0.0.0.0:PORT or [::]:PORT means it accepts external connections
# 127.0.0.1:PORT means local only, which is expected when Caddy proxies to it
# Check if the service is reachable through Caddy
curl -v https://your-domain.com
# Should show a TLS handshake, not "connection refused"
}}}
== SSL/Certificate Problems ==
=== Certificate Not Valid ===
*Symptoms:* Browser shows certificate error
*Check certificate:*
{{{
# Check certificate validity dates
echo | openssl s_client -servername your-domain.com -connect your-domain.com:443 2>/dev/null | openssl x509 -noout -dates
# Check if expired (echo closes stdin so s_client does not hang)
echo | openssl s_client -connect your-domain.com:443 -servername your-domain.com 2>/dev/null | openssl x509 -noout -checkend 0 && echo "Valid" || echo "Expired"
}}}
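*Common issues:*
*Force certificate renewal:*
{{{
# Remove Caddy's certificate cache. Verify the path first: with the default
# NixOS data directory it is usually under /var/lib/caddy/.local/share/caddy/
sudo rm -rf /var/lib/caddy/.local/share/caddy/certificates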
# Restart Caddy
sudo systemctl restart caddy
# Check logs for renewal
journalctl -u caddy -f
}}}
=== Wildcard Certificate Issues ===
*For PDS subdomains* (`*.pds.yourdomain.com`):
{{{
# Check on-demand TLS is configured
grep on_demand /etc/caddy/Caddyfile
# Verify PDS is responding to the TLS check
curl "http://localhost:2583/tls-check?domain=test.pds.yourdomain.com"
# Check Caddy logs for on-demand TLS
journalctl -u caddy | grep -i "on_demand\|tls"
}}}
*Check logs:*
{{{
journalctl -u postgresql -n 50
}}}
*Common issues:*
{{{
# Disk full?
df -h /var/lib/postgresql
# Permission issues
ls -la /var/lib/postgresql/
sudo chown -R postgres:postgres /var/lib/postgresql
# Corrupted data
# Check PostgreSQL logs for corruption errors
}}}
*Test connection:*
{{{
# As postgres user
sudo -u postgres psql -c "\l"
# Test specific database
sudo -u postgres psql lycan -c "SELECT 1"
}}}
*Check authentication:*
{{{
# Print pg_hba.conf (on NixOS the file is generated, so ask PostgreSQL where it is)
sudo cat "$(sudo -u postgres psql -tAc 'SHOW hba_file;')"
# Should have trust or peer for local connections
}}}
*Check connections:*
{{{
# Current connections
sudo -u postgres psql -c "SELECT count(*) FROM pg_stat_activity;"
# Max connections
sudo -u postgres psql -c "SHOW max_connections;"
}}}
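The two numbers are easier to judge as a ratio. Here is a small sketch that turns them into a percentage; the function name `conn_usage_pct` and the 80% warning threshold are assumptions chosen for illustration.

```shell
# conn_usage_pct CURRENT MAX: print connection usage as a whole-number percentage.
conn_usage_pct() {
  echo $(( $1 * 100 / $2 ))
}

# Usage:
#   current=$(sudo -u postgres psql -tAc "SELECT count(*) FROM pg_stat_activity;")
#   max=$(sudo -u postgres psql -tAc "SHOW max_connections;")
#   pct=$(conn_usage_pct "$current" "$max")
#   if [ "$pct" -ge 80 ]; then echo "Warning: ${pct}% of max_connections in use"; fi
```

The `-tA` flags make `psql` print the bare value without headers or alignment, so the result can be used directly in arithmetic.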
*Slow queries:*
{{{
# Check for long-running queries
sudo -u postgres psql -c "SELECT pid, now() - query_start AS duration, query FROM pg_stat_activity WHERE state = 'active' ORDER BY duration DESC;"
}}}
== Secrets Management Problems ==
*Symptoms:*
{{{
error: cannot decrypt data key: no key found
}}}
*Solutions:*
{{{
# Check .sops.yaml exists and is correct
cat /etc/nixos/secrets/.sops.yaml
# Verify the age key is available
# Check that sops.age.sshKeyPaths in your config points to a valid SSH key
# Decrypt manually to test
sops -d /etc/nixos/secrets/your-secrets.yaml
# If using an SSH key, verify it exists
ls -la /etc/ssh/ssh_host_ed25519_key
}}}
*Check:*
{{{
# Are secrets in /run/secrets/?
ls -la /run/secrets/
# Check ownership
ls -la /run/secrets/SECRET_NAME
# Verify service can read
sudo -u service-user cat /run/secrets/SECRET_NAME
}}}
*Common fixes:*
{{{
# Rebuild to re-decrypt secrets
sudo nixos-rebuild switch --flake .#snek
# Check sops service status
systemctl status sops-nix
# Manually trigger activation of the current system profile
sudo /nix/var/nix/profiles/system/bin/switch-to-configuration switch
}}}
=== High CPU Usage ===
*Identify the culprit:*
{{{
# Top processes by CPU
top -o %CPU
# Or use htop (if installed)
htop
# Check specific service
systemctl status high-cpu-service
journalctl -u high-cpu-service -f
}}}
*Common causes:*
- Lycan indexing (normal during initial sync)
- Microcosm services processing firehose
- Spindle running CI builds
=== High Memory Usage ===
*Check memory:*
{{{
# Overview
free -h
# By service
systemctl status service-name | grep Memory
# Find memory hogs
ps aux --sort=-%mem | head -10
}}}
*Out of Memory (OOM):*
{{{
# Check OOM killer logs
journalctl | grep -i "killed process"
# Set memory limits in your NixOS config:
# systemd.services.servicename.serviceConfig.MemoryMax = "1G";
}}}
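If the OOM killer has fired more than once, a tally of its victims shows which service most needs a memory limit. This is a rough sketch, assuming the kernel log wording `Killed process PID (NAME)`. The function name `oom_victims` is hypothetical.

```shell
# oom_victims: read log lines on stdin and print "COUNT NAME" for every process
# the OOM killer terminated, most frequent first.
oom_victims() {
  sed -n 's/.*[Kk]illed process [0-9][0-9]* (\([^)]*\)).*/\1/p' \
    | sort | uniq -c | sort -rn \
    | awk '{ count = $1; $1 = ""; sub(/^ /, ""); print count, $0 }'
}

# Usage:
#   journalctl | oom_victims
```

Lines that do not mention a killed process are ignored, so the whole journal can be piped in unfiltered.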
=== Disk Space Issues ===
*Check usage:*
{{{
# Overall usage
df -h
# Directory sizes (one summary line per directory)
sudo du -sh /var/lib/*/ | sort -h
# Find large files
find /var/log -type f -size +100M
}}}
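To spot nearly full filesystems without reading the whole table, the `df` output can be filtered by a usage threshold. The sketch below assumes the POSIX output format of `df -P` (usage in the fifth column, mount point in the sixth) and mount points without spaces. The function name `full_filesystems` is hypothetical.

```shell
# full_filesystems THRESHOLD: read `df -P` output on stdin and print
# "MOUNTPOINT USE%" for every filesystem at or above THRESHOLD percent.
full_filesystems() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)                 # strip the percent sign for comparison
    if (use + 0 >= limit + 0) print $6, $5
  }'
}

# Usage:
#   df -P | full_filesystems 90
```

An empty result means no filesystem has reached the threshold.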
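*Cleanup:*
{{{
# Delete old system generations, keeping the 10 most recent
sudo nix-env --delete-generations +10 --profile /nix/var/nix/profiles/system
# Collect garbage that is no longer referenced
# (adding -d would delete ALL old generations and remove your rollback targets)
sudo nix-collect-garbage
# Vacuum logs
sudo journalctl --vacuum-time=7d
# Clean service logs (if any)
sudo find /var/log -name "*.log" -type f -mtime +7 -delete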
}}}
== Tailscale Connection Problems ==
=== Can't Connect to Tailscale ===
*Check status:*
{{{
# Tailscale status
tailscale status
# If not connected
sudo tailscale up
}}}
*Check network:*
{{{
# Can reach Tailscale coordination server?
ping -c 3 login.tailscale.com
# Check routes (Tailscale keeps its routes in table 52, not the main table)
ip route show table 52
}}}
=== Can't Reach Remote Services ===
*Test connectivity:*
{{{
# Ping Tailscale IP
ping -c 3 100.x.x.x
# Test port
nc -zv 100.x.x.x PORT
# Check if in same tailnet
tailscale status | grep hostname
}}}
*Firewall issues:*
{{{
# Check if Tailscale interface is trusted
# In your NixOS config:
# networking.firewall.trustedInterfaces = [ "tailscale0" ];
}}}
== General Debugging Workflow ==
=== When Something Breaks ===
1. *Don't panic* - You can always roll back
{{{
sudo nixos-rebuild switch --rollback
}}}
2. *Identify the problem*
- What changed last?
- What are the symptoms?
- Check logs
3. *Isolate the issue*
- Is it one service or the whole system?
- Can you reproduce it?
4. *Check configuration*
- Syntax errors?
- Missing dependencies?
- Wrong values?
5. *Test fixes incrementally*
- Use `nixos-rebuild test` first
- Make one change at a time
- Verify each fix
6. *Document the solution*
- What was the root cause?
- How did you fix it?
- How to prevent it?
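The "test fixes incrementally" step can be wrapped in a small helper that stops at the first failing stage. This is a sketch rather than a recommended deployment tool: the function name `staged_rebuild` and the `REBUILD` override variable are hypothetical, and `.#snek` is the flake target used elsewhere in this guide.

```shell
# staged_rebuild FLAKE: run dry-build, then test, then switch, stopping at the
# first step that fails. Set REBUILD (e.g. REBUILD=echo) to preview the
# commands without touching the system.
staged_rebuild() {
  flake="$1"
  rebuild="${REBUILD:-sudo nixos-rebuild}"
  for step in dry-build test switch; do
    echo ">>> $step"
    # $rebuild is intentionally unquoted so "sudo nixos-rebuild" splits into words
    $rebuild "$step" --flake "$flake" || {
      echo "step '$step' failed; stopping" >&2
      return 1
    }
  done
}

# Usage:
#   staged_rebuild .#snek
#   REBUILD=echo staged_rebuild .#snek   # print the commands only
```

Because `nixos-rebuild test` activates the configuration without making it the boot default, a reboot still returns to the last switched generation if the test stage leaves the machine in a bad state.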
=== Emergency Recovery ===
*If you completely break SSH/network access:*
1. Use your provider's console/VNC
2. Boot into a previous generation from GRUB
3. Or mount the disk and fix it from rescue mode
*If system won't boot:*
1. Boot from NixOS ISO
2. Mount your partitions
3. Check logs in `/mnt/var/log/`
4. Fix configuration or restore from backup
== Getting Help ==
When asking for help, provide:
1. *Exact error message* (copy-paste)
2. *What you were trying to do*
3. *Recent changes* you made
4. *Relevant logs* (journalctl output)
5. *Configuration* (sanitized of secrets)
=== Resources ===
- [[https://nixos.org/manual/nixos/stable/#sec-troubleshooting|NixOS Manual - Troubleshooting]]
- [[https://discourse.nixos.org/|NixOS Discourse]]
- [[https://wiki.nixos.org/wiki/Troubleshooting|NixOS Wiki - Troubleshooting]]
- [[https://atproto.com/support|AT Protocol Support]]