diff --git a/CLAUDE_STATUS.md b/CLAUDE_STATUS.md
index 8010248..5011996 100644
--- a/CLAUDE_STATUS.md
+++ b/CLAUDE_STATUS.md
@@ -1,9 +1,9 @@
 # Homelab Status Tracker

-**Last Updated**: 2025-11-30 13:25:00
+**Last Updated**: 2025-11-30 17:37:00
 **Goal**: Document and commit recent infrastructure planning and integration documentation
 **Phase**: Completed
-**Current Context**: All pre-commit tasks completed successfully. Documentation committed to repository with proper security sanitization. Commit hash: a1841f1c4193b143c9fa71746929cfe3cd9cbdbe
+**Current Context**: All documentation corrections committed. Architecture updates for Debian 12 and NPM committed to repository. Latest commit hash: c16d5210709c38ccf3ef22785c23ac99a61f1703

 ---

@@ -135,6 +135,171 @@
   - Maintenance simplified (web UI vs config files)
   - Documentation accuracy: Aligned with real deployment environment

+- [x] **Commit Architecture Corrections to Repository**
+  - Status: Completed at 2025-11-30 17:37:00
+  - Owner: Librarian
+  - Action: Created commit with conventional commit message for n8n architecture corrections
+  - Result: Commit hash c16d5210709c38ccf3ef22785c23ac99a61f1703
+  - Changes: 2 files changed, 325 insertions(+), 194 deletions(-)
+    * CLAUDE_STATUS.md: Updated with detailed change log
+    * n8n/N8N-SETUP-PLAN.md: 30+ architecture corrections (Debian 12 + NPM)
+
+---
+
+---
+
+## Active Troubleshooting: n8n 502 Bad Gateway
+
+**Started**: 2025-11-30
+**Updated**: 2025-12-01
+**Status**: Ready for Deployment
+**Issue**: n8n returns 502 Bad Gateway - Root cause identified and fix script prepared
+
+### Problem Summary
+
+**Symptoms**:
+- ❌ External access: `https://n8n.apophisnetworking.net` returns 502 Bad Gateway (from mobile)
+- ❌ Internal access: Returns nginx default page or connection issues
+- ✅ Comparison: `beszel.apophisnetworking.net` works perfectly (both internal and external)
+
+**Configuration Context**:
+- n8n location: CT 113 at 192.168.2.113:5678
+- NPM location: CT 102 at 192.168.2.101
+- Beszel location: 192.168.2.102:8090 (working reference)
+- All services behind same NPM, same Cloudflare DNS setup
+
+### n8n Configuration (from /opt/n8n/.env)
+
+```bash
+# n8n Configuration
+N8N_PROTOCOL=https
+N8N_HOST=n8n.apophisnetworking.net
+N8N_PORT=5678
+N8N_PATH=/
+WEBHOOK_URL=https://n8n.apophisnetworking.net/
+
+# Database
+DB_TYPE=postgresdb
+DB_POSTGRESDB_HOST=localhost
+DB_POSTGRESDB_PORT=5432
+DB_POSTGRESDB_DATABASE=n8n_db
+DB_POSTGRESDB_USER=n8n_user
+```
+
+### NPM Proxy Host Configuration (from screenshots)
+
+**Details Tab**:
+- Domain: `n8n.apophisnetworking.net`
+- Scheme: `http`
+- Forward to: `192.168.2.113:5678`
+- Websockets: ✓ Enabled
+- Status: Online (green)
+
+**SSL Tab**:
+- Certificate: `*.apophisnetworking.net` (wildcard)
+- Force SSL: ✓ Enabled
+- HTTP/2: ✓ Enabled
+- HSTS: ✓ Enabled
+
+### Diagnostic Steps Completed
+
+- [x] **Verify n8n service status** (Lab-Operator)
+  - Status: Service in crash loop - repeatedly starting and failing
+  - Command: `systemctl status n8n` showed "activating (auto-restart)"
+
+- [x] **Review service logs** (Lab-Operator)
+  - Command: `journalctl -u n8n -n 100`
+  - Errors found: Encryption key validation failures
+  - Log showed: n8n exiting immediately after start attempt
+
+- [x] **Analyze .env configuration** (Backend-Builder)
+  - Found: `N8N_ENCRYPTION_KEY=$(openssl rand -hex 32)`
+  - Issue: .env files don't execute shell commands - this is a literal string
+  - Missing: `N8N_LISTEN_ADDRESS=0.0.0.0`
+  - Missing: `NODE_ENV=production`
+  - Password needs quoting: `DB_POSTGRESDB_PASSWORD="Nbkx4mdmay1)"`
+
+### Root Cause Analysis
+
+**PRIMARY ISSUE**: Invalid N8N_ENCRYPTION_KEY in /opt/n8n/.env
+
+**Technical Explanation**:
+The `.env` file contained `N8N_ENCRYPTION_KEY=$(openssl rand -hex 32)` which was intended to generate a random encryption key. However, `.env` files are not shell scripts - they don't execute commands. The variable was set to the **literal string** `$(openssl rand -hex 32)` instead of an actual 64-character hexadecimal key.
+
+**Impact**:
+- n8n service fails encryption key validation on startup
+- Service enters crash loop (start → fail → restart → fail)
+- NPM returns 502 Bad Gateway because backend service is down
+- Issue was NOT hairpin NAT or NPM misconfiguration (beszel works fine with same setup)
+
+**Additional Configuration Issues Identified**:
+1. Missing `N8N_LISTEN_ADDRESS=0.0.0.0` - would cause service to listen only on localhost
+2. Missing `NODE_ENV=production` - affects performance and security
+3. Database password not quoted - special characters need proper escaping
+
+### Attempted Solutions & Lessons Learned
+
+**Attempt 1-3: Heredoc Script Failures**
+- Created fix script using heredoc syntax for .env generation
+- Error: `warning: here-document at line 22 delimited by end-of-file (wanted 'ENVEOF')`
+- Root cause: Windows/Linux line ending issues when copying script from WSL to LXC container
+- Backend-Builder's first attempt incorrectly changed to SQLite (corrected to maintain PostgreSQL)
+- Lesson: Heredoc syntax fragile in cross-platform environments
+
+**Final Solution: Simple Echo-Based Script**
+- Replaced heredoc with simple `echo` statements
+- More robust to copy-paste and line ending issues
+- Avoids CRLF/LF conversion problems
+
+### Solution: Fix Script Ready for Deployment
+
+**Script Location**: `/tmp/fix_n8n_simple.sh` (on WSL, ready to transfer to CT 113)
+
+**Script Actions**:
+1. Generates proper encryption key: `ENCRYPTION_KEY=$(openssl rand -hex 32)`
+2. Backs up existing .env with timestamp: `/opt/n8n/.env.backup.YYYYMMDD_HHMMSS`
+3. Creates new .env file with corrected configuration:
+   - Actual generated encryption key (not shell command)
+   - Adds `N8N_LISTEN_ADDRESS=0.0.0.0`
+   - Adds `NODE_ENV=production`
+   - Properly quotes `DB_POSTGRESDB_PASSWORD`
+   - Maintains PostgreSQL database configuration
+4. Sets secure permissions: `chmod 600` and `chown n8n:n8n`
+5. Restarts n8n service
+6. Verifies service status and local connectivity
+
+**Reviews Completed**:
+- ✅ **Backend-Builder**: Code review APPROVED (95% confidence, technically sound)
+- ✅ **Lab-Operator**: Operational review APPROVED with safeguards documented
+  - Minimal downtime (~13 seconds)
+  - No database corruption risk
+  - Rollback procedures documented
+  - Security recommendations provided
+
+**Pre-Execution Safeguards**:
+1. Create ZFS snapshot of CT 113: `pct snapshot 113 pre-n8n-fix`
+2. Backup PostgreSQL database: `pg_dump n8n_db > /tmp/n8n_db_pre_fix_backup.sql`
+3. Verify no encrypted credentials exist (likely none since service never started)
+
+**Security Notes**:
+- Script contains hardcoded password - **delete after use**: `shred -u /tmp/fix_n8n_simple.sh`
+- Do NOT commit script to git repository
+- Encryption key properly secured in .env with 600 permissions
+
+### Next Actions
+
+- [ ] User to deploy fix script on CT 113 tomorrow (2025-12-02)
+- [ ] Test external access after fix: `https://n8n.apophisnetworking.net`
+- [ ] Verify service stability for 24 hours
+- [ ] Update this status file to RESOLVED after successful deployment
+
+### Files Referenced
+
+- `/home/jramos/homelab/n8n/N8N-SETUP-PLAN.md` - Phase 5 configuration
+- `/opt/n8n/.env` - n8n configuration (on CT 113)
+- `/tmp/fix_n8n_simple.sh` - Fix script (NOT committed to git - contains password)
+- `/data/nginx/proxy_host/*.conf` - NPM proxy configs (on CT 102)
+
 ---

 **Repository**: /home/jramos/homelab | **Branch**: main
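
The root cause and the echo-based fix described in the troubleshooting notes can be illustrated with a minimal sketch. This is not the contents of `/tmp/fix_n8n_simple.sh`, which is not part of this diff. It only contrasts a literal `$(...)` value with a key that the shell expands before writing the file, then checks the result. The scratch file from `mktemp`, the variable names, and the `od` fallback for hosts without `openssl` are all assumptions made for illustration.

```shell
#!/bin/sh
set -eu

# Hypothetical scratch file for illustration; the notes refer to /opt/n8n/.env.
ENV_FILE="$(mktemp)"

# Broken pattern: single quotes mimic what a .env parser sees. Nothing runs,
# so the value stays the literal text of the command substitution.
BROKEN_VALUE='$(openssl rand -hex 32)'
echo "Literal value a .env parser would read: ${BROKEN_VALUE}"

# Working pattern: the shell performs the command substitution first,
# so only the resulting hex string is ever written to disk.
if command -v openssl >/dev/null 2>&1; then
    ENCRYPTION_KEY="$(openssl rand -hex 32)"
else
    # Assumption: fallback for hosts without openssl (32 random bytes as hex).
    ENCRYPTION_KEY="$(od -An -tx1 -N32 /dev/urandom | tr -d ' \n')"
fi

# Plain echo statements instead of a heredoc, as in the notes' final approach.
echo "N8N_ENCRYPTION_KEY=${ENCRYPTION_KEY}" > "$ENV_FILE"
echo "N8N_LISTEN_ADDRESS=0.0.0.0" >> "$ENV_FILE"
echo "NODE_ENV=production" >> "$ENV_FILE"
chmod 600 "$ENV_FILE"

# Check 1: the stored key must be exactly 64 lowercase hex characters.
STORED_KEY="$(grep '^N8N_ENCRYPTION_KEY=' "$ENV_FILE" | cut -d= -f2-)"
if echo "$STORED_KEY" | grep -Eq '^[0-9a-f]{64}$'; then
    echo "OK: key is 64 hex characters"
else
    echo "FAIL: key is not a 64-character hex string" >&2
    exit 1
fi

# Check 2: no carriage returns, which would indicate CRLF line endings.
CR="$(printf '\r')"
if grep -q "$CR" "$ENV_FILE"; then
    echo "FAIL: carriage returns found (CRLF line endings)" >&2
    exit 1
fi
echo "OK: no CRLF line endings"
```

The same two checks could be run against a real `.env` before restarting the service. The sketch leaves the scratch file in place so it can be inspected, so remove it afterwards.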