# Homelab Infrastructure Status

**Last Updated**: 2025-12-18 17:00:00
**Export Reference**: disaster-recovery/homelab-export-20251211-144345

## Current Infrastructure Snapshot

### Proxmox Environment

- **Node**: serviceslab
- **Version**: Proxmox VE 8.4.0
- **Management IP**: 192.168.2.200
- **Architecture**: Single-node cluster
- **Total Resources**: 9 VMs, 2 Templates, 5 LXC Containers

---

## Virtual Machines (QEMU/KVM) - 9 VMs

| VM ID | Name | IP Address | Status | Purpose |
|-------|------|------------|--------|---------|
| 100 | docker-hub | 192.168.2.XXX | Running | Container registry/Docker hub mirror |
| 101 | monitoring-docker | 192.168.2.114 | Running | Monitoring stack (Grafana/Prometheus/PVE Exporter) |
| 105 | dev | - | Stopped | General-purpose development workstation |
| 106 | Ansible-Control | 192.168.2.XXX | Running | IaC orchestration, configuration management |
| 108 | CML | - | Stopped | Cisco Modeling Labs - network simulation |
| 109 | web-server-01 | 192.168.2.XXX | Running | Web application server (clustered) |
| 110 | web-server-02 | 192.168.2.XXX | Running | Load-balanced pair with web-server-01 |
| 111 | db-server-01 | 192.168.2.XXX | Running | Backend database server |
| 114 | haos | 192.168.2.XXX | Running | Home Assistant OS - smart home automation platform |

**Recent Changes**:
- Added VM 101 (monitoring-docker) for dedicated monitoring infrastructure
- Removed VM 101 (gitlab) - service decommissioned

---

## VM Templates - 2 Templates

| Template ID | Name | Purpose |
|-------------|------|---------|
| 104 | ubuntu-dev | Ubuntu development environment template for cloning |
| 107 | ubuntu-docker | Ubuntu Docker host template for rapid deployment |

**Note**: Templates are immutable base images used for cloning new VMs, not running workloads. They provide standardized configurations for consistent infrastructure provisioning.


---

## Containers (LXC) - 5 Containers

| CT ID | Name | IP Address | Status | Purpose |
|-------|------|------------|--------|---------|
| 102 | nginx | 192.168.2.101 | Running | Reverse proxy/load balancer & NPM |
| 103 | netbox | 192.168.2.XXX | Running | Network documentation/IPAM |
| 112 | twingate-connector | 192.168.2.XXX | Running | Zero-trust network access connector |
| 113 | n8n | 192.168.2.107 | Running | Workflow automation platform |
| 115 | tinyauth | 192.168.2.10 | Running | SSO authentication layer for NetBox |

**Recent Changes**:
- Added CT 115 (tinyauth) for SSO authentication integration with NetBox
- Added CT 112 (twingate-connector) for zero-trust network security
- Added CT 113 (n8n) for workflow automation
- Removed CT 112 (Anytype) - replaced by n8n

---

## Storage Architecture

| Storage Pool | Type | Total | Used | % Used | Purpose |
|--------------|------|-------|------|--------|---------|
| local | Directory | - | - | 19.11% | System files, ISOs, templates |
| local-lvm | LVM-Thin | - | - | 0.01% | VM disk images (thin provisioned) |
| Vault | NFS/Directory | - | - | 12.13% | Secure storage for sensitive data |
| PBS-Backups | PBS | - | - | 28.27% | Automated backup repository |
| iso-share | NFS/CIFS | - | - | 1.45% | Installation media library |
| localnetwork | Network Share | - | - | N/A | Shared resources across infrastructure |

**Capacity Notes**:
- PBS-Backups utilization increased to 28.27% (healthy retention)
- Vault utilization increased to 12.13% (data growth monitored)
- local storage at 19.11% (system overhead within normal range)

---

## Key Services & Stacks

### Monitoring & Observability (NEW)

**VM 101** - monitoring-docker (192.168.2.114)
- **Grafana**: Port 3000 - Visualization and dashboards
- **Prometheus**: Port 9090 - Metrics collection and time-series database
- **PVE Exporter**: Port 9221 - Proxmox VE metrics exporter
- **Documentation**: `/home/jramos/homelab/monitoring/README.md`
- **Status**: Fully operational

### Network Security (NEW)

**CT 112** - twingate-connector
- **Purpose**: Zero-trust network access
- **Type**: Lightweight connector
- **Status**: Running
- **Integration**: Connects homelab to Twingate network

### Automation & Integration

**CT 113** - n8n (192.168.2.107)
- **Purpose**: Workflow automation platform
- **Technology**: n8n.io
- **Database**: PostgreSQL 15+
- **Features**: API integration, scheduled workflows, webhook triggers
- **Documentation**: `/home/jramos/homelab/services/README.md#n8n-workflow-automation`
- **Status**: Operational (resolved database locale issues)

### Authentication & SSO

**CT 115** - tinyauth (192.168.2.10)
- **Purpose**: Lightweight SSO authentication layer
- **Technology**: TinyAuth v4 (Docker container)
- **Port**: 8000
- **Domain**: tinyauth.apophisnetworking.net
- **Integration**: Authentication gateway for NetBox via Nginx Proxy Manager
- **Security**: Bcrypt-hashed credentials, HTTPS enforcement
- **Documentation**: `/home/jramos/homelab/services/tinyauth/README.md`
- **Status**: Operational

### Infrastructure Documentation

**CT 103** - netbox
- **Purpose**: Network documentation and IPAM
- **Status**: Running (activated 2025-12-11; previously stopped for on-demand use)
- **Function**: Infrastructure source of truth

### Reverse Proxy & Load Balancing

**CT 102** - nginx (192.168.2.101)
- **Purpose**: Nginx Proxy Manager
- **Ports**: 80, 81, 443
- **Function**: SSL termination, reverse proxy, certificate management
- **Upstream Services**: All web-facing applications

### Three-Tier Application Stack

**Web Tier**:
- VM 109 (web-server-01) - Primary web server
- VM 110 (web-server-02) - Load-balanced pair

**Database Tier**:
- VM 111 (db-server-01) - Backend database

**Proxy Tier**:
- CT 102 (nginx) - Load balancer and SSL termination

### Development & Automation

**VM 106** - Ansible-Control
- **Purpose**: Infrastructure as Code orchestration
- **Tools**: Ansible, Terraform/OpenTofu (potential)
- **Status**: Running

### Container Registry

**VM 100** - docker-hub
- **Purpose**: Local Docker registry and hub mirror
- **Function**: Caching container images for faster deployments
- **Status**: Running

### Network Simulation

**VM 108** - CML
- **Purpose**: Cisco Modeling Labs
- **Function**: Network topology testing and simulation
- **Status**: Stopped (resource-intensive, on-demand use)

---

## Architecture Patterns

### Monitoring & Observability (NEW)

The infrastructure now implements a dedicated monitoring stack:
- **Metrics Collection**: Prometheus scraping Proxmox metrics via PVE Exporter
- **Visualization**: Grafana providing real-time dashboards and alerting
- **Isolation**: Dedicated VM for monitoring services (fault isolation)
- **Integration**: Ready for AlertManager, additional exporters, and integrations

**Design Decision**: VM-based deployment provides kernel-level isolation and prevents resource contention with critical infrastructure services.

### Zero-Trust Security (NEW)

Implementation of zero-trust network access principles:
- **Twingate Connector**: Lightweight connector providing secure access without VPNs
- **Container Deployment**: LXC container for minimal resource overhead
- **Network Segmentation**: Secure access to homelab from external networks

**Design Decision**: LXC container chosen for quick provisioning and low resource consumption.

### Automation-First Approach

Workflow automation and infrastructure orchestration:
- **n8n Platform**: Visual workflow builder for API integrations
- **Scheduled Tasks**: Automated backup checks, monitoring alerts, reports
- **Integration Hub**: Connects monitoring, documentation, and operational tools

**Design Decision**: PostgreSQL backend ensures data persistence and supports complex workflows.

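
To make the metrics path described under Monitoring & Observability above more concrete, here is a minimal Python sketch of the URL Prometheus would request from PVE Exporter. It assumes the exporter serves a `/pve` endpoint that takes the Proxmox host as a `target` query parameter, which is how the upstream prometheus-pve-exporter project is commonly scraped; verify this against the deployed version. The helper name `pve_scrape_url` is hypothetical, while the hosts and port come from the tables in this document.

```python
from urllib.parse import urlencode, urlunsplit

EXPORTER_HOST = "192.168.2.114"  # VM 101 (monitoring-docker), from this document
EXPORTER_PORT = 9221             # PVE Exporter port, from this document
PROXMOX_HOST = "192.168.2.200"   # Proxmox management IP, from this document


def pve_scrape_url(exporter_host, exporter_port, target):
    """Build the exporter URL for one Proxmox target.

    Assumption: the exporter exposes /pve?target=<host>. Check the
    deployed exporter's documentation before relying on this.
    """
    netloc = f"{exporter_host}:{exporter_port}"
    query = urlencode({"target": target})
    return urlunsplit(("http", netloc, "/pve", query, ""))


if __name__ == "__main__":
    # Print the URL so it can be tested manually, e.g. with curl.
    print(pve_scrape_url(EXPORTER_HOST, EXPORTER_PORT, PROXMOX_HOST))
```

Requesting the printed URL from a host on the management network is a quick way to confirm the exporter responds before debugging the Prometheus side.
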
### Tiered Application Architecture

Classic three-tier design for production-like environments:
- **Presentation Tier**: Paired web servers (109, 110) behind load balancer
- **Business Logic**: Application processing on web tier
- **Data Tier**: Dedicated database server (111) with backup strategy

**Design Decision**: Separation of concerns, scalability testing, high availability patterns.

### Selective Containerization Strategy

Hybrid approach balancing performance and resource efficiency:
- **LXC Containers**: Lightweight services (nginx, netbox, twingate, n8n)
- **Full VMs**: Complex applications, kernel dependencies, heavy workloads
- **Rationale**: LXC for ~10x lower overhead, VMs for isolation and compatibility

---

## Recent Infrastructure Changes

### 2025-12-20: Comprehensive Security Audit Completed

**Activity:** Complete infrastructure security assessment and remediation planning

**Audit Scope:**
- All Docker Compose services (Portainer, NPM, Paperless-ngx, ByteStash, Speedtest Tracker, FileBrowser)
- Proxmox VE infrastructure and API access
- Network security and segmentation
- Credential management and storage
- SSL/TLS configuration
- Container security and runtime configuration

**Findings Summary:**
- **CRITICAL (6)**: Docker socket exposure, hardcoded credentials, database passwords in git
- **HIGH (3)**: Missing SSL/TLS, weak passwords, containers running as root
- **MEDIUM (2)**: SSL verification disabled, missing authentication
- **LOW (20)**: Documentation gaps, monitoring improvements, backup encryption

**Deliverables:**
1. **Security Policy** (`SECURITY.md`): 864 lines - Comprehensive security best practices
2. **Audit Report** (`troubleshooting/SECURITY_AUDIT_2025-12-20.md`): 2,350 lines - Detailed findings and remediation plan
3. **Security Checklist** (`templates/SECURITY_CHECKLIST.md`): 750 lines - Pre-deployment validation template
4. **Validation Report** (`scripts/security/VALIDATION_REPORT.md`): 2,092 lines - Script safety assessment
5. **Container Fixes** (`scripts/security/CONTAINER_NAME_FIXES.md`): 621 lines - Container name verification
6. **Security Scripts** (8 total):
   - `verify-service-status.sh` - Service health checker
   - `backup-before-remediation.sh` - Comprehensive backup utility
   - `rotate-pve-credentials.sh` - Proxmox credential rotation
   - `rotate-paperless-password.sh` - Database password rotation
   - `rotate-bytestash-jwt.sh` - JWT secret rotation
   - `rotate-logward-credentials.sh` - Multi-service credential rotation
   - `docker-socket-proxy/docker-compose.yml` - Security proxy deployment
   - `portainer/docker-compose.socket-proxy.yml` - Portainer migration config

**Script Validation:**
- **Ready for execution**: 5/8 scripts (verify-service-status.sh, rotate-pve-credentials.sh, rotate-bytestash-jwt.sh, backup-before-remediation.sh, docker-socket-proxy)
- **Needs container name fixes**: 3/8 scripts (see CONTAINER_NAME_FIXES.md)

**4-Phase Remediation Roadmap:**
- Phase 1 (Week 1): Immediate actions - Backups, secrets migration
- Phase 2 (Weeks 2-3): Low-risk changes - Socket proxy, credential rotation
- Phase 3 (Month 2): High-risk changes - Service migrations, SSL/TLS
- Phase 4 (Quarter 1): Infrastructure - Network segmentation, scanning pipelines

**Estimated Timeline:**
- Total downtime: 6-13 minutes (sequential script execution)
- Full remediation: 8-16 weeks

**Risk Assessment:**
- Current risk: HIGH - Multiple CRITICAL vulnerabilities active
- Post-Phase 1 risk: MEDIUM - Credential exposure mitigated
- Post-Phase 3 risk: LOW - All CRITICAL/HIGH findings remediated
- Post-Phase 4 risk: VERY LOW - Defense-in-depth implemented

**Status:** Documentation complete, awaiting remediation execution approval

---

### 2025-12-18: TinyAuth SSO Deployment

**Service Deployed:** CT 115 - TinyAuth authentication layer

**Purpose:** Centralized SSO authentication for NetBox and future homelab services

**Specifications:**
- **Container**: CT 115 (LXC with Docker)
- **IP Address**: 192.168.2.10
- **Domain**: tinyauth.apophisnetworking.net
- **Port**: 8000 (external), 3000 (internal)
- **Docker Image**: ghcr.io/steveiliop56/tinyauth:v4
- **Resource Usage**: ~50-100 MB memory, <1% CPU

**Integration Architecture:**
- Internet → Nginx Proxy Manager (CT 102) → TinyAuth (CT 115) → NetBox (CT 103)
- NPM uses `auth_request` directive to validate credentials via TinyAuth
- Bcrypt-hashed password storage for security
- HTTPS enforcement via NPM SSL termination

**Issues Resolved During Deployment:**
1. **500 Internal Server Error**: Fixed Nginx advanced config syntax
2. **IP addresses not allowed**: Changed APP_URL from IP to domain
3. **Port mapping**: Corrected Docker port mapping from 8000:8000 to 8000:3000
4. **Invalid password**: Implemented bcrypt hash requirement for TinyAuth v4

**Integration Impact:**
- NetBox now protected by centralized authentication
- Foundation for extending SSO to other services (Grafana, Proxmox UI future candidates)
- Authentication logs available for security auditing

**Documentation:** Complete guide at `/home/jramos/homelab/services/tinyauth/README.md`

**Status:** ✅ Operational - Successfully authenticating NetBox access

---

### 2025-12-11: Loki-Stack Monitoring Fully Operational

**Issue Resolved:** Centralized logging pipeline now receiving syslog from UniFi router

**Root Cause:** rsyslog filter in `/etc/rsyslog.d/unifi-router.conf` was configured for wrong source IP (192.168.1.1 instead of 192.168.2.1)

**Fix Applied:** Updated rsyslog filter to match VLAN 2 gateway IP (192.168.2.1)

**Status:** ✅ Complete - Logs flowing UniFi → rsyslog → Promtail → Loki → Grafana

**Services Affected:**
- VM 101 (monitoring-docker): rsyslog configuration updated
- Loki-stack: All components operational
- Grafana: Dashboards receiving real-time syslog data

**Technical Details:** See `troubleshooting/loki-stack-bugfix.md` for complete 5-phase troubleshooting history

---

### 2025-12-11: Infrastructure Expansion & System Updates

#### Proxmox VE Platform Upgrade

- **Upgraded**: Proxmox VE 8.3.3 → 8.4.0
- **Kernel**: 6.8.12-8-pve
- **pve-manager**: 8.4.14
- **Impact**: Enhanced performance, security updates, bug fixes
- **Status**: ✅ Complete - All VMs and containers operating normally

#### New VM 114: Home Assistant OS Deployment

- **Service**: haos (Home Assistant Operating System)
- **Purpose**: Smart home automation and integration platform
- **Specifications**:
  - Memory: 4 GB (87% utilized)
  - CPU: 2 vCPUs
  - Boot Disk: 50 GB
  - Status: Running (~3 days uptime)
- **Rationale**: Centralized home automation hub for IoT device management
- **Integration**: Will integrate with monitoring stack for infrastructure metrics

#### CT 103: NetBox IPAM Activated

- **Service**: netbox (Network Documentation & IPAM)
- **Status Change**: Stopped → Running
- **Uptime**: ~3.1 days
- **Resource Usage**: 1.28 GB / 2 GB memory (64%)
- **Purpose**: Active network documentation and IP address management
- **Rationale**: Required for ongoing infrastructure expansion planning

#### Storage Utilization Trends

- **PBS-Backups**: 27.43% → 28.27% (+0.84%) - Normal backup retention growth
- **Vault (ZFS)**: 10.88% → 12.13% (+1.25%) - Data accumulation monitored
- **local**: 15.13% → 19.11% (+3.98%) - New VM deployment and system updates
- **iso-share**: 1.4% → 1.45% (+0.05%) - Minimal change
- **local-lvm**: 0.0% → 0.01% (+0.01%) - Thin provisioned storage baseline

---

### 2025-12-07: Infrastructure Documentation & Monitoring Stack

#### Additions

1. **VM 101 (monitoring-docker)**: New dedicated monitoring infrastructure
   - Grafana for visualization
   - Prometheus for metrics collection
   - PVE Exporter for Proxmox integration
   - IP: 192.168.2.114
2. **CT 112 (twingate-connector)**: Zero-trust network security
   - Lightweight connector
   - Secure remote access without VPN
3. **CT 113 (n8n)**: Workflow automation platform
   - PostgreSQL 15+ backend
   - IP: 192.168.2.107
   - Resolved database locale issues

#### Modifications

- Storage utilization updated across all pools
- PBS-Backups now at 27.43% (increased retention)
- Vault optimized to 10.88% (reduced usage)

#### Removals

- **VM 101 (gitlab)**: Decommissioned (previously at this ID)
- **CT 112 (Anytype)**: Replaced by n8n for better integration

#### Documentation Updates

- Created comprehensive monitoring stack documentation
- Updated all infrastructure tables with current VMs/CTs
- Added architecture patterns for observability and zero-trust
- Updated storage statistics
- Referenced latest export: disaster-recovery/homelab-export-20251207-120040

---

## Repository Structure

```
homelab/
  monitoring/                        # NEW: Monitoring stack configurations
    README.md                        # Comprehensive monitoring documentation
    grafana/
      docker-compose.yml
    prometheus/
      docker-compose.yml
      prometheus.yml
    pve-exporter/
      docker-compose.yml
      pve.yml
      .env
  services/                          # Docker Compose service configurations
    n8n/                             # n8n workflow automation
    netbox/                          # Network documentation & IPAM
    README.md                        # Services overview (updated)
  disaster-recovery/
    homelab-export-20251207-120040/  # Latest infrastructure export
  scripts/
    crawlers-exporters/              # Infrastructure collection scripts
    fixers/                          # Problem-solving scripts
    qol/                             # Quality of life improvements
  CLAUDE.md                          # AI assistant guidance (updated)
  INDEX.md                           # Navigation index (updated)
  README.md                          # Repository overview (updated)
  CLAUDE_STATUS.md                   # This file - current infrastructure status
```

---

## Security Status

**Latest Audit**: 2025-12-20
**Total Findings**: 31 (6 CRITICAL, 3 HIGH, 2 MEDIUM, 20 LOW)
**Remediation Status**: Planning Phase - Documentation Complete

**Critical and High-Severity Vulnerabilities**:
- Docker socket exposure (3 containers)
- Proxmox credentials in plaintext
- Database passwords in git repository
- Missing SSL/TLS for internal services
- Weak/default passwords across services
- Containers running as root

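
Findings such as plaintext credentials and database passwords committed to git can often be caught with a simple text scan before files are committed. The following Python sketch is one illustrative way to flag literal secret values in Docker Compose style text; it is not one of the audit's scripts. The function name `find_hardcoded_secrets`, the keyword list, and the sample input are all assumptions made for this example, and a real scan would also need to cover git history.

```python
import re

# Key names that suggest a secret (illustrative list, not exhaustive).
SECRET_KEY = re.compile(r"(PASSWORD|SECRET|TOKEN|API_KEY)", re.IGNORECASE)
# Matches "KEY: value", "KEY=value", and the list form "- KEY=value".
ASSIGNMENT = re.compile(r"^\s*-?\s*([A-Za-z_][A-Za-z0-9_]*)\s*[:=]\s*(.+?)\s*$")


def find_hardcoded_secrets(text):
    """Return (line_number, key) pairs where a secret-like key has a literal value.

    Values written as ${VARIABLE} are treated as safe because they are
    resolved from the environment or a .env file at deploy time.
    """
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        match = ASSIGNMENT.match(line)
        if not match:
            continue
        key = match.group(1)
        value = match.group(2).strip("'\"")
        if SECRET_KEY.search(key) and value and not value.startswith("${"):
            findings.append((lineno, key))
    return findings


# Hypothetical compose fragment used only to exercise the scanner.
SAMPLE = "\n".join([
    "services:",
    "  db:",
    "    image: postgres:15",
    "    environment:",
    "      - POSTGRES_USER=paperless",
    "      - POSTGRES_PASSWORD=hunter2",
    "      - JWT_SECRET=${JWT_SECRET}",
])

if __name__ == "__main__":
    for lineno, key in find_hardcoded_secrets(SAMPLE):
        print(f"line {lineno}: {key} has a literal value")
```

A check like this is deliberately conservative: it only inspects key names, so it can miss secrets stored under innocuous names and should complement, not replace, credential rotation and history scrubbing.
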
**Documentation**:
- Security Policy: `/home/jramos/homelab/SECURITY.md`
- Audit Report: `/home/jramos/homelab/troubleshooting/SECURITY_AUDIT_2025-12-20.md`
- Security Checklist: `/home/jramos/homelab/templates/SECURITY_CHECKLIST.md`
- Script Validation: `/home/jramos/homelab/scripts/security/VALIDATION_REPORT.md`

---

## Current Initiative: Security Audit Remediation - Q4 2025

### Goal

Remediate 31 security findings identified in comprehensive security audit (2025-12-20), addressing critical vulnerabilities in Docker socket exposure, credential management, and SSL/TLS configuration.

### Phase

Planning - Documentation Complete, Remediation Pending

### Progress Checklist

**Phase 1: Immediate Actions (Week 1) - Est. 30 min downtime**
- [x] Complete security audit (31 findings documented)
- [x] Create remediation scripts (8 scripts validated)
- [x] Document security baseline in SECURITY.md
- [ ] Backup all service configurations (`backup-before-remediation.sh`)
- [ ] Migrate secrets to .env files (ByteStash, Paperless-ngx, Speedtest Tracker)

**Phase 2: Low-Risk Changes (Weeks 2-3) - Est. 2-4 hours downtime**
- [ ] Deploy docker-socket-proxy
- [ ] Rotate Proxmox API credentials (`rotate-pve-credentials.sh`)
- [ ] Rotate database passwords (`rotate-paperless-password.sh`)
- [ ] Rotate JWT secrets (`rotate-bytestash-jwt.sh`)

**Phase 3: High-Risk Changes (Month 2) - Est. 4-8 hours downtime**
- [ ] Migrate Portainer to socket proxy
- [ ] Migrate NPM to socket proxy or remove socket access
- [ ] Remove socket mounts from Speedtest Tracker
- [ ] Implement SSL/TLS for internal services
- [ ] Enable container user namespacing

**Phase 4: Infrastructure Improvements (Quarter 1) - Est. 8-16 hours**
- [ ] Implement network segmentation (VLANs for service tiers)
- [ ] Deploy fail2ban for rate limiting
- [ ] Enable backup encryption (PBS configuration)
- [ ] Container vulnerability scanning pipeline
- [ ] Automated credential rotation system

### Context

Security audit revealed critical infrastructure vulnerabilities requiring systematic remediation. Priority on CRITICAL findings (CVSS 8.5-9.8) to reduce attack surface and prevent credential compromise.

**Risk Management**:
- Phase 1: Zero downtime (configuration changes only)
- Phase 2: Minimal downtime (credential rotation, proxy deployment)
- Phase 3: Moderate downtime (service reconfiguration)
- Phase 4: Planned maintenance windows (infrastructure changes)

**Success Metrics**:
- All CRITICAL findings remediated (6/6)
- All HIGH findings remediated (3/3)
- Secrets removed from git repository
- Docker socket access eliminated or proxied
- SSL/TLS enabled for all external services

---

## Previous Initiative: Claude Code Tool Inheritance Bug Investigation (2025-12-18)

### Goal

Investigate and document a critical bug in Claude Code CLI where sub-agents with explicit `tools:` declarations receive only a subset of their configured tools, with first and last array elements consistently dropped.

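
The reported symptom can be modeled in a few lines. The Python sketch below only illustrates the observed pattern; it is not Claude Code's actual implementation. The function names are hypothetical, and the middle entries of the example list are placeholders: only the absence of Read and Write for the scribe agent comes from the investigation notes.

```python
def tools_after_bug(declared):
    """Model the reported symptom: the first and last entries are dropped."""
    return list(declared[1:-1])


def dropped_tools(declared):
    """Return the declared tools that would be missing at runtime."""
    received = set(tools_after_bug(declared))
    return [tool for tool in declared if tool not in received]


if __name__ == "__main__":
    # Hypothetical declaration order for the scribe agent.
    scribe_tools = ["Read", "Grep", "Glob", "Edit", "Write"]
    print("received:", tools_after_bug(scribe_tools))
    print("missing: ", dropped_tools(scribe_tools))
```

Comparing a model like this against what each agent actually reports is a quick way to tell whether a given agent fits the pattern or is an exception.
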
### Phase

COMPLETED - Bug confirmed, comprehensive report generated for Anthropic

### Progress Checklist

- [x] Reproduce bug with scribe agent (confirmed: missing Read and Write)
- [x] Reproduce bug with lab-operator agent (confirmed: missing Bash and Write)
- [x] Test backend-builder agent (working correctly - exception to pattern)
- [x] Test librarian agent (working correctly - no tools: declaration)
- [x] Identify pattern: First and last tools dropped for agents with explicit tools: arrays
- [x] Document impact: Scribe cannot create docs, lab-operator cannot execute commands
- [x] Generate comprehensive bug report for Anthropic with all evidence
- [x] Update CLAUDE_STATUS.md with investigation status
- [ ] Submit bug report to Anthropic via GitHub issues

### Key Findings

**Bug Pattern**: Sub-agents with `tools: [A, B, C, D, E]` receive only `[B, C, D]` at runtime

**Affected**: scribe (no Read/Write), lab-operator (no Bash/Write)

**Unaffected**: backend-builder (exception), librarian (no tools: line)

**Workaround**: Remove `tools:` declarations to grant all tools by default

**Artifacts**:
- Bug report: `/home/jramos/homelab/troubleshooting/ANTHROPIC_BUG_REPORT_TOOL_INHERITANCE.md`
- Original report: `/home/jramos/homelab/troubleshooting/BUG_REPORT.md`
- Test agent IDs: scribe=a32bd54, lab-operator=ad681e8, backend-builder=aba15f6, librarian=a4cfeb7

### Context

Critical workflow disruption: Documentation and infrastructure operations workflows completely broken due to missing tools. This is a Claude Code CLI internal bug, not a user configuration issue.

---

## Previous Initiative: Sub-Agent Architecture Optimization (2025-12-07)

### Goal

Improve the quality and effectiveness of all sub-agent prompt definitions to match best practices identified through comprehensive Opus-powered prompt engineering analysis. Target: bring all sub-agents to the quality standard established by librarian.md (~120-340 lines with comprehensive examples, safety protocols, and decision frameworks).

### Phase

COMPLETED - All sub-agent improvements and validations finished

### Progress Checklist

- [x] Prompt engineering analysis completed (Opus model)
  - Analyzed CLAUDE.md and all 4 sub-agent files
  - Identified 5 critical issues, 12 high-impact improvements
  - Generated comprehensive improvement recommendations
- [x] scribe.md improved (29 → 340 lines)
  - Added 6 usage examples (4 positive, 2 negative redirects)
  - Implemented comprehensive responsibilities section
  - Added 3 complete ASCII diagram templates
  - Included safety protocols and decision frameworks
  - Quality now matches librarian.md standard
- [x] backend-builder.md improved (40 → 291 lines)
  - Added 6 usage examples with clear boundaries
  - Expanded core responsibilities with Ansible, Terraform, Docker Compose, Python, Shell
  - Added technology stack table and validation rules table
  - Included safety protocols for secrets and destructive operations
  - Added handoff protocol for lab-operator deployment
  - Defined clear boundaries (CREATES code, does NOT deploy)
- [x] lab-operator.md improved (37 → 193 lines)
  - Added 6 usage examples with role clarity
  - Expanded domain expertise with specific commands
  - Added command style guide (5-step pattern)
  - Included safety protocols and decision-making framework
  - Added error handling and escalation guidelines
  - Defined clear boundaries (DEPLOYS/OPERATES, does NOT create IaC)
- [x] CLAUDE.md structural fixes
  - Moved YAML frontmatter to line 1 (was at line 89)
  - Fixed trailing pipe character on line 87
  - Completed incomplete sentence about backup strategy
  - Completed incomplete sentence about storage growth
  - Removed redundant "Key Services" reference
  - Expanded status file template with actual structure and recovery instructions
- [x] Final validation and testing
  - librarian: Git status check successful, clear output format
  - scribe: File reading functional (note: reported encoding issue, likely false positive)
  - backend-builder: YAML validation successful, proper syntax checking
  - lab-operator: Directory listing successful, proper command execution
  - All agents demonstrate improved structure and clarity

### Context

**Why It Matters**: Well-designed sub-agent prompts improve task routing accuracy, execution quality, error reduction, and maintainability. The librarian.md agent (143 lines) sets the quality standard; scribe was severely underdeveloped at 29 lines before improvement.

**Next Steps**: Improve backend-builder.md and lab-operator.md using scribe.md as quality template.

---

## Previous Phase: Infrastructure Documentation Complete

### Goal

Comprehensive documentation of monitoring stack and updated infrastructure inventory.

### Phase

Documentation & Maintenance

### Completed Tasks

- [x] Created `/home/jramos/homelab/monitoring/README.md` with comprehensive monitoring documentation
- [x] Updated `CLAUDE_STATUS.md` with current infrastructure state
- [x] Documented 8 VMs, 2 Templates, and 4 LXC containers
- [x] Updated storage statistics (PBS 27.43%, Vault 10.88%, local 15.13%)
- [x] Added monitoring stack architecture and deployment procedures
- [x] Documented new services: monitoring-docker, twingate-connector, n8n
- [x] Referenced latest export: disaster-recovery/homelab-export-20251207-120040

### Remaining Documentation Tasks

- [x] Update INDEX.md with monitoring section and current VM/CT counts
- [x] Update README.md with infrastructure (8 VMs, 2 Templates, 4 LXC)
- [x] Update CLAUDE.md with architecture tables for monitoring and zero-trust
- [x] Update services/README.md with monitoring stack and twingate sections
- [x] Verify all documentation cross-references are accurate
- [ ] Test monitoring stack deployment procedures

---

## Access Information

### Management Interfaces

- **Proxmox UI**: https://192.168.2.200:8006
- **Grafana**: http://192.168.2.114:3000
- **Prometheus**: http://192.168.2.114:9090
- **Nginx Proxy Manager**: http://192.168.2.101:81
- **n8n**: http://192.168.2.107:5678
- **TinyAuth**: https://tinyauth.apophisnetworking.net (internal: http://192.168.2.10:8000)

### Key Network Segments

- **Management Network**: 192.168.2.0/24
- **Proxmox Host**: 192.168.2.200
- **Reverse Proxy**: 192.168.2.101 (CT 102)
- **TinyAuth**: 192.168.2.10 (CT 115)
- **n8n**: 192.168.2.107 (CT 113)
- **Monitoring**: 192.168.2.114 (VM 101)

---

## Maintenance Schedule

### Automated Tasks

- **Backups**: Proxmox Backup Server - Daily incremental, Weekly full
- **Monitoring Scrapes**: Prometheus - Every 30 seconds
- **Certificate Renewal**: Nginx Proxy Manager - Automatic via Let's Encrypt

### Recommended Manual Tasks

- **Weekly**: Review Grafana dashboards for anomalies
- **Monthly**: Update monitoring stack Docker images
- **Quarterly**: Review backup retention policies
- **Semi-Annual**: Kernel updates on Proxmox host and VMs

---

## Known Issues & Resolutions

### Resolved

- n8n PostgreSQL locale errors (fixed with `fix_n8n_db_c_locale.sh`)
- n8n database permissions (fixed with `fix_n8n_db_permissions.sh`)

### Active Security Vulnerabilities (2025-12-20 Audit)

**CRITICAL Severity:**

1. **Docker Socket Exposure** (CVSS 9.8)
   - Affected: Portainer, Nginx Proxy Manager, Speedtest Tracker
   - Impact: Container escape to root access
   - Remediation: Deploy docker-socket-proxy (Phase 2)
2. **Proxmox Credentials in Plaintext** (CVSS 9.1)
   - Affected: PVE Exporter `.env` and `pve.yml`
   - Impact: Full infrastructure compromise
   - Remediation: Rotate credentials, use API tokens (Phase 2)
3. **Database Passwords in Git** (CVSS 8.5)
   - Affected: Paperless-ngx, ByteStash, Speedtest Tracker
   - Impact: Credential exposure to all repository users
   - Remediation: Migrate to `.env` files, scrub git history (Phase 1)

**HIGH Severity:**

4. **Missing SSL/TLS** (CVSS 7.5)
   - Affected: Internal service communication
   - Impact: Traffic interception, credential sniffing
   - Remediation: Enable HTTPS via NPM or self-signed certs (Phase 3)
5. **Weak/Default Passwords** (CVSS 7.2)
   - Affected: Multiple services
   - Impact: Brute-force attacks, unauthorized access
   - Remediation: Generate strong passwords, implement rotation (Phase 2)
6. **Containers Running as Root** (CVSS 7.0)
   - Affected: Most Docker containers
   - Impact: Privilege escalation if container compromised
   - Remediation: Enable user namespacing, set non-root users (Phase 3)

**Remediation Timeline:** See "Security Audit Remediation - Q4 2025" initiative above

### Active Monitoring

- PVE Exporter SSL verification (set to false for self-signed certificates) - **SECURITY RISK**
- Prometheus retention policies (currently 15 days, may need adjustment)
- Security script container names need verification (3/8 scripts)

### Deferred

- NetBox container offline (on-demand service)
- Development VMs stopped (resource conservation)
- Network segmentation implementation (Phase 4)
- Backup encryption (Phase 4)

---

## Version History

- **v2.1.0** (2025-12-07): Added monitoring stack, twingate connector, updated infrastructure counts
- **v2.0.0** (2025-12-02): Repository reorganization, services migration from GitLab
- **v1.0.0** (2025-11-29): Initial infrastructure documentation

---

**Maintained by**: jramos
**Repository**: Homelab Infrastructure Configuration
**Platform**: Proxmox VE 8.4.0
**Infrastructure Scale**: 9 VMs, 2 Templates, 5 Containers
**Current Status**: Operational - Home Automation Integration Deployed