homelab/CLAUDE_STATUS.md
Jordan Ramos e481c95da4 docs(security): comprehensive security audit and remediation documentation
- Add SECURITY.md policy with credential management, Docker security, SSL/TLS guidance
- Add security audit report (2025-12-20) with 31 findings across 4 severity levels
- Add pre-deployment security checklist template
- Update CLAUDE_STATUS.md with security audit initiative
- Expand services/README.md with comprehensive security sections
- Add script validation report and container name fix guide

Audit identified 6 CRITICAL, 3 HIGH, 2 MEDIUM findings
4-phase remediation roadmap created (estimated 6-13 min downtime)
All security scripts validated and ready for execution

Related: Security Audit Q4 2025, CRITICAL-001 through CRITICAL-006

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2025-12-21 13:52:34 -07:00


Homelab Infrastructure Status

Last Updated: 2025-12-18 17:00:00
Export Reference: disaster-recovery/homelab-export-20251211-144345

Current Infrastructure Snapshot

Proxmox Environment

  • Node: serviceslab
  • Version: Proxmox VE 8.4.0
  • Management IP: 192.168.2.200
  • Architecture: Single-node cluster
  • Total Resources: 9 VMs, 2 Templates, 5 LXC Containers

Virtual Machines (QEMU/KVM) - 9 VMs

| VM ID | Name | IP Address | Status | Purpose |
|-------|------|------------|--------|---------|
| 100 | docker-hub | 192.168.2.XXX | Running | Container registry/Docker hub mirror |
| 101 | monitoring-docker | 192.168.2.114 | Running | Monitoring stack (Grafana/Prometheus/PVE Exporter) |
| 105 | dev | - | Stopped | General-purpose development workstation |
| 106 | Ansible-Control | 192.168.2.XXX | Running | IaC orchestration, configuration management |
| 108 | CML | - | Stopped | Cisco Modeling Labs - network simulation |
| 109 | web-server-01 | 192.168.2.XXX | Running | Web application server (clustered) |
| 110 | web-server-02 | 192.168.2.XXX | Running | Load-balanced pair with web-server-01 |
| 111 | db-server-01 | 192.168.2.XXX | Running | Backend database server |
| 114 | haos | 192.168.2.XXX | Running | Home Assistant OS - smart home automation platform |

Recent Changes:

  • Added VM 101 (monitoring-docker) for dedicated monitoring infrastructure
  • Removed VM 101 (gitlab) - service decommissioned

VM Templates - 2 Templates

| Template ID | Name | Purpose |
|-------------|------|---------|
| 104 | ubuntu-dev | Ubuntu development environment template for cloning |
| 107 | ubuntu-docker | Ubuntu Docker host template for rapid deployment |

Note: Templates are immutable base images used for cloning new VMs, not running workloads. They provide standardized configurations for consistent infrastructure provisioning.


Containers (LXC) - 5 Containers

| CT ID | Name | IP Address | Status | Purpose |
|-------|------|------------|--------|---------|
| 102 | nginx | 192.168.2.101 | Running | Reverse proxy/load balancer & NPM |
| 103 | netbox | 192.168.2.XXX | Running | Network documentation/IPAM |
| 112 | twingate-connector | 192.168.2.XXX | Running | Zero-trust network access connector |
| 113 | n8n | 192.168.2.107 | Running | Workflow automation platform |
| 115 | tinyauth | 192.168.2.10 | Running | SSO authentication layer for NetBox |

Recent Changes:

  • Added CT 115 (tinyauth) for SSO authentication integration with NetBox
  • Added CT 112 (twingate-connector) for zero-trust network security
  • Added CT 113 (n8n) for workflow automation
  • Removed CT 112 (Anytype) - replaced by n8n

Storage Architecture

| Storage Pool | Type | Total | Used | % Used | Purpose |
|--------------|------|-------|------|--------|---------|
| local | Directory | - | - | 19.11% | System files, ISOs, templates |
| local-lvm | LVM-Thin | - | - | 0.01% | VM disk images (thin provisioned) |
| Vault | NFS/Directory | - | - | 12.13% | Secure storage for sensitive data |
| PBS-Backups | PBS | - | - | 28.27% | Automated backup repository |
| iso-share | NFS/CIFS | - | - | 1.45% | Installation media library |
| localnetwork | Network Share | - | - | N/A | Shared resources across infrastructure |

Capacity Notes:

  • PBS-Backups utilization increased to 28.27% (healthy retention)
  • Vault utilization increased to 12.13% (data growth monitored)
  • local storage at 19.11% (system overhead within normal range)

Key Services & Stacks

Monitoring & Observability (NEW)

VM 101 - monitoring-docker (192.168.2.114)

  • Grafana: Port 3000 - Visualization and dashboards
  • Prometheus: Port 9090 - Metrics collection and time-series database
  • PVE Exporter: Port 9221 - Proxmox VE metrics exporter
  • Documentation: /home/jramos/homelab/monitoring/README.md
  • Status: Fully operational

Network Security (NEW)

CT 112 - twingate-connector

  • Purpose: Zero-trust network access
  • Type: Lightweight connector
  • Status: Running
  • Integration: Connects homelab to Twingate network

Automation & Integration

CT 113 - n8n (192.168.2.107)

  • Purpose: Workflow automation platform
  • Technology: n8n.io
  • Database: PostgreSQL 15+
  • Features: API integration, scheduled workflows, webhook triggers
  • Documentation: /home/jramos/homelab/services/README.md#n8n-workflow-automation
  • Status: Operational (resolved database locale issues)

Authentication & SSO

CT 115 - tinyauth (192.168.2.10)

  • Purpose: Lightweight SSO authentication layer
  • Technology: TinyAuth v4 (Docker container)
  • Port: 8000
  • Domain: tinyauth.apophisnetworking.net
  • Integration: Authentication gateway for NetBox via Nginx Proxy Manager
  • Security: Bcrypt-hashed credentials, HTTPS enforcement
  • Documentation: /home/jramos/homelab/services/tinyauth/README.md
  • Status: Operational

Infrastructure Documentation

CT 103 - netbox

  • Purpose: Network documentation and IPAM
  • Status: Running (activated 2025-12-11; protected by TinyAuth SSO)
  • Function: Infrastructure source of truth

Reverse Proxy & Load Balancing

CT 102 - nginx (192.168.2.101)

  • Purpose: Nginx Proxy Manager
  • Ports: 80, 81, 443
  • Function: SSL termination, reverse proxy, certificate management
  • Upstream Services: All web-facing applications

Three-Tier Application Stack

Web Tier:

  • VM 109 (web-server-01) - Primary web server
  • VM 110 (web-server-02) - Load-balanced pair

Database Tier:

  • VM 111 (db-server-01) - Backend database

Proxy Tier:

  • CT 102 (nginx) - Load balancer and SSL termination

Development & Automation

VM 106 - Ansible-Control

  • Purpose: Infrastructure as Code orchestration
  • Tools: Ansible, Terraform/OpenTofu (potential)
  • Status: Running

Container Registry

VM 100 - docker-hub

  • Purpose: Local Docker registry and hub mirror
  • Function: Caching container images for faster deployments
  • Status: Running

Network Simulation

VM 108 - CML

  • Purpose: Cisco Modeling Labs
  • Function: Network topology testing and simulation
  • Status: Stopped (resource-intensive, on-demand use)

Architecture Patterns

Monitoring & Observability (NEW)

The infrastructure now implements a comprehensive monitoring stack following industry best practices:

  • Metrics Collection: Prometheus scraping Proxmox metrics via PVE Exporter
  • Visualization: Grafana providing real-time dashboards and alerting
  • Isolation: Dedicated VM for monitoring services (fault isolation)
  • Integration: Ready for AlertManager, additional exporters, and integrations

Design Decision: VM-based deployment provides kernel-level isolation and prevents resource contention with critical infrastructure services.
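
The scrape path described above can be sketched as a URL that Prometheus would request from the exporter. This is a minimal illustration, not the configuration in `prometheus.yml`: the host, port, and Proxmox IP come from this document, while the `/pve` path and `target` query parameter are assumptions about the exporter's HTTP interface and should be checked against the deployed version.

```python
from urllib.parse import urlencode

# Values taken from this document.
EXPORTER_HOST = "192.168.2.114"   # VM 101 (monitoring-docker)
EXPORTER_PORT = 9221              # PVE Exporter port
PROXMOX_TARGET = "192.168.2.200"  # Proxmox management IP


def build_scrape_url(host: str, port: int, target: str) -> str:
    """Return the URL Prometheus would request to collect Proxmox metrics.

    The "/pve" path and "target" parameter are assumed, not confirmed
    by this repository.
    """
    query = urlencode({"target": target})
    return f"http://{host}:{port}/pve?{query}"


scrape_url = build_scrape_url(EXPORTER_HOST, EXPORTER_PORT, PROXMOX_TARGET)
print(scrape_url)
```

Requesting that URL manually from VM 101 is one way to confirm the exporter responds before debugging Prometheus itself.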

Zero-Trust Security (NEW)

Implementation of zero-trust network access principles:

  • Twingate Connector: Lightweight connector providing secure access without VPNs
  • Container Deployment: LXC container for minimal resource overhead
  • Network Segmentation: Secure access to homelab from external networks

Design Decision: LXC container chosen for quick provisioning and low resource consumption.

Automation-First Approach

Workflow automation and infrastructure orchestration:

  • n8n Platform: Visual workflow builder for API integrations
  • Scheduled Tasks: Automated backup checks, monitoring alerts, reports
  • Integration Hub: Connects monitoring, documentation, and operational tools

Design Decision: PostgreSQL backend ensures data persistence and supports complex workflows.

Tiered Application Architecture

Classic three-tier design for production-like environments:

  • Presentation Tier: Paired web servers (109, 110) behind load balancer
  • Business Logic: Application processing on web tier
  • Data Tier: Dedicated database server (111) with backup strategy

Design Decision: Separation of concerns, scalability testing, high availability patterns.

Selective Containerization Strategy

Hybrid approach balancing performance and resource efficiency:

  • LXC Containers: Lightweight services (nginx, netbox, twingate, n8n, tinyauth)
  • Full VMs: Complex applications, kernel dependencies, heavy workloads
  • Rationale: LXC for ~10x lower overhead, VMs for isolation and compatibility

Recent Infrastructure Changes

2025-12-20: Comprehensive Security Audit Completed

Activity: Complete infrastructure security assessment and remediation planning

Audit Scope:

  • All Docker Compose services (Portainer, NPM, Paperless-ngx, ByteStash, Speedtest Tracker, FileBrowser)
  • Proxmox VE infrastructure and API access
  • Network security and segmentation
  • Credential management and storage
  • SSL/TLS configuration
  • Container security and runtime configuration

Findings Summary:

  • CRITICAL (6): Docker socket exposure, hardcoded credentials, database passwords in git
  • HIGH (3): Missing SSL/TLS, weak passwords, containers running as root
  • MEDIUM (2): SSL verification disabled, missing authentication
  • LOW (20): Documentation gaps, monitoring improvements, backup encryption
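
As a quick consistency check, the severity counts above can be summed to confirm they match the 31 findings reported for the audit. The dictionary is only a restatement of the numbers in this summary.

```python
# Finding counts per severity, copied from the summary above.
findings = {"CRITICAL": 6, "HIGH": 3, "MEDIUM": 2, "LOW": 20}

total = sum(findings.values())
# Findings that the remediation roadmap treats as priority work.
urgent = findings["CRITICAL"] + findings["HIGH"]

print(f"total={total}, critical+high={urgent}")
```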

Deliverables:

  1. Security Policy (SECURITY.md): 864 lines - Comprehensive security best practices
  2. Audit Report (troubleshooting/SECURITY_AUDIT_2025-12-20.md): 2,350 lines - Detailed findings and remediation plan
  3. Security Checklist (templates/SECURITY_CHECKLIST.md): 750 lines - Pre-deployment validation template
  4. Validation Report (scripts/security/VALIDATION_REPORT.md): 2,092 lines - Script safety assessment
  5. Container Fixes (scripts/security/CONTAINER_NAME_FIXES.md): 621 lines - Container name verification
  6. Security Scripts (8 total):
    • verify-service-status.sh - Service health checker
    • backup-before-remediation.sh - Comprehensive backup utility
    • rotate-pve-credentials.sh - Proxmox credential rotation
    • rotate-paperless-password.sh - Database password rotation
    • rotate-bytestash-jwt.sh - JWT secret rotation
    • rotate-logward-credentials.sh - Multi-service credential rotation
    • docker-socket-proxy/docker-compose.yml - Security proxy deployment
    • portainer/docker-compose.socket-proxy.yml - Portainer migration config

Script Validation:

  • Ready for execution: 5/8 scripts (verify-service-status.sh, rotate-pve-credentials.sh, rotate-bytestash-jwt.sh, backup-before-remediation.sh, docker-socket-proxy)
  • Needs container name fixes: 3/8 scripts (see CONTAINER_NAME_FIXES.md)

4-Phase Remediation Roadmap:

  • Phase 1 (Week 1): Immediate actions - Backups, secrets migration
  • Phase 2 (Weeks 2-3): Low-risk changes - Socket proxy, credential rotation
  • Phase 3 (Month 2): High-risk changes - Service migrations, SSL/TLS
  • Phase 4 (Quarter 1): Infrastructure - Network segmentation, scanning pipelines

Estimated Timeline:

  • Total downtime: 6-13 minutes (sequential script execution)
  • Full remediation: 8-16 weeks

Risk Assessment:

  • Current risk: HIGH - Multiple CRITICAL vulnerabilities active
  • Post-Phase 1 risk: MEDIUM - Credential exposure mitigated
  • Post-Phase 3 risk: LOW - All CRITICAL/HIGH findings remediated
  • Post-Phase 4 risk: VERY LOW - Defense-in-depth implemented

Status: Documentation complete, awaiting remediation execution approval


2025-12-18: TinyAuth SSO Deployment

Service Deployed: CT 115 - TinyAuth authentication layer

Purpose: Centralized SSO authentication for NetBox and future homelab services

Specifications:

  • Container: CT 115 (LXC with Docker)
  • IP Address: 192.168.2.10
  • Domain: tinyauth.apophisnetworking.net
  • Port: 8000 (external), 3000 (internal)
  • Docker Image: ghcr.io/steveiliop56/tinyauth:v4
  • Resource Usage: ~50-100 MB memory, <1% CPU

Integration Architecture:

  • Internet → Nginx Proxy Manager (CT 102) → TinyAuth (CT 115) → NetBox (CT 103)
  • NPM uses auth_request directive to validate credentials via TinyAuth
  • Bcrypt-hashed password storage for security
  • HTTPS enforcement via NPM SSL termination

Issues Resolved During Deployment:

  1. 500 Internal Server Error: Fixed Nginx advanced config syntax
  2. IP addresses not allowed: Changed APP_URL from IP to domain
  3. Port mapping: Corrected Docker port mapping from 8000:8000 to 8000:3000
  4. Invalid password: Implemented bcrypt hash requirement for TinyAuth v4
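
Issue 3 is easiest to see by reading the Docker port mapping as `host_port:container_port`. The sketch below is illustrative only (the helper names are hypothetical): a mapping reaches the application only when its container side matches the port TinyAuth listens on inside the container, which the specifications above give as 3000.

```python
TINYAUTH_LISTEN_PORT = 3000  # internal port from the specifications above


def parse_port_mapping(mapping):
    """Split a 'host:container' mapping into a pair of integers."""
    host_port, container_port = mapping.split(":")
    return int(host_port), int(container_port)


def reaches_app(mapping, listen_port):
    """True if the container side of the mapping hits the app's listen port."""
    _, container_port = parse_port_mapping(mapping)
    return container_port == listen_port


for mapping in ("8000:8000", "8000:3000"):
    ok = reaches_app(mapping, TINYAUTH_LISTEN_PORT)
    print(f"{mapping}: {'reaches TinyAuth' if ok else 'nothing listening'}")
```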

Integration Impact:

  • NetBox now protected by centralized authentication
  • Foundation for extending SSO to other services (Grafana, Proxmox UI future candidates)
  • Authentication logs available for security auditing

Documentation: Complete guide at /home/jramos/homelab/services/tinyauth/README.md

Status: Operational - Successfully authenticating NetBox access


2025-12-11: Loki-Stack Monitoring Fully Operational

Issue Resolved: Centralized logging pipeline now receiving syslog from UniFi router

Root Cause: rsyslog filter in /etc/rsyslog.d/unifi-router.conf was configured for wrong source IP (192.168.1.1 instead of 192.168.2.1)

Fix Applied: Updated rsyslog filter to match VLAN 2 gateway IP (192.168.2.1)
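
The mismatch can be checked with Python's standard `ipaddress` module. This is a sketch of the reasoning, not the rsyslog filter itself: it assumes VLAN 2 corresponds to the 192.168.2.0/24 management network listed later in this file.

```python
import ipaddress

# Assumption: VLAN 2 is the 192.168.2.0/24 management network.
vlan2 = ipaddress.ip_network("192.168.2.0/24")

old_filter_ip = ipaddress.ip_address("192.168.1.1")  # original, wrong source IP
new_filter_ip = ipaddress.ip_address("192.168.2.1")  # VLAN 2 gateway

# The router sends syslog from its VLAN 2 gateway address, so a filter
# keyed on an address outside that network can never match.
old_in_vlan = old_filter_ip in vlan2
new_in_vlan = new_filter_ip in vlan2

print(f"{old_filter_ip} in {vlan2}: {old_in_vlan}")
print(f"{new_filter_ip} in {vlan2}: {new_in_vlan}")
```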

Status: Complete - Logs flowing UniFi → rsyslog → Promtail → Loki → Grafana

Services Affected:

  • VM 101 (monitoring-docker): rsyslog configuration updated
  • Loki-stack: All components operational
  • Grafana: Dashboards receiving real-time syslog data

Technical Details: See troubleshooting/loki-stack-bugfix.md for complete 5-phase troubleshooting history


2025-12-11: Infrastructure Expansion & System Updates

Proxmox VE Platform Upgrade

  • Upgraded: Proxmox VE 8.3.3 → 8.4.0
  • Kernel: 6.8.12-8-pve
  • pve-manager: 8.4.14
  • Impact: Enhanced performance, security updates, bug fixes
  • Status: Complete - All VMs and containers operating normally

New VM 114: Home Assistant OS Deployment

  • Service: haos (Home Assistant Operating System)
  • Purpose: Smart home automation and integration platform
  • Specifications:
    • Memory: 4 GB (87% utilized)
    • CPU: 2 vCPUs
    • Boot Disk: 50 GB
    • Status: Running (~3 days uptime)
  • Rationale: Centralized home automation hub for IoT device management
  • Integration: Will integrate with monitoring stack for infrastructure metrics

CT 103: NetBox IPAM Activated

  • Service: netbox (Network Documentation & IPAM)
  • Status Change: Stopped → Running
  • Uptime: ~3.1 days
  • Resource Usage: 1.28 GB / 2 GB memory (64%)
  • Purpose: Active network documentation and IP address management
  • Rationale: Required for ongoing infrastructure expansion planning

Storage Utilization Changes

  • PBS-Backups: 27.43% → 28.27% (+0.84%) - Normal backup retention growth
  • Vault (ZFS): 10.88% → 12.13% (+1.25%) - Data accumulation monitored
  • local: 15.13% → 19.11% (+3.98%) - New VM deployment and system updates
  • iso-share: 1.4% → 1.45% (+0.05%) - Minimal change
  • local-lvm: 0.0% → 0.01% (+0.01%) - Thin provisioned storage baseline
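
The deltas above can be reproduced from the before and after percentages. The sketch below uses `Decimal` so the subtraction is exact, and also picks out the pool with the largest change.

```python
from decimal import Decimal

# (before, after) utilization in percent, copied from the list above.
usage = {
    "PBS-Backups": ("27.43", "28.27"),
    "Vault": ("10.88", "12.13"),
    "local": ("15.13", "19.11"),
    "iso-share": ("1.4", "1.45"),
    "local-lvm": ("0.0", "0.01"),
}

# Percentage-point change per pool.
deltas = {
    pool: Decimal(after) - Decimal(before)
    for pool, (before, after) in usage.items()
}
largest = max(deltas, key=deltas.get)

for pool, delta in deltas.items():
    print(f"{pool}: +{delta}")
print(f"largest change: {largest}")
```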

2025-12-07: Infrastructure Documentation & Monitoring Stack

Additions

  1. VM 101 (monitoring-docker): New dedicated monitoring infrastructure

    • Grafana for visualization
    • Prometheus for metrics collection
    • PVE Exporter for Proxmox integration
    • IP: 192.168.2.114
  2. CT 112 (twingate-connector): Zero-trust network security

    • Lightweight connector
    • Secure remote access without VPN
  3. CT 113 (n8n): Workflow automation platform

    • PostgreSQL 15+ backend
    • IP: 192.168.2.107
    • Resolved database locale issues

Modifications

  • Storage utilization updated across all pools
  • PBS-Backups now at 27.43% (increased retention)
  • Vault optimized to 10.88% (reduced usage)

Removals

  • VM 101 (gitlab): Decommissioned (previously at this ID)
  • CT 112 (Anytype): Replaced by n8n for better integration

Documentation Updates

  • Created comprehensive monitoring stack documentation
  • Updated all infrastructure tables with current VMs/CTs
  • Added architecture patterns for observability and zero-trust
  • Updated storage statistics
  • Referenced latest export: disaster-recovery/homelab-export-20251207-120040

Repository Structure

homelab/
    monitoring/                      # NEW: Monitoring stack configurations
        README.md                   # Comprehensive monitoring documentation
        grafana/
            docker-compose.yml
        prometheus/
            docker-compose.yml
            prometheus.yml
        pve-exporter/
            docker-compose.yml
            pve.yml
            .env
    services/                        # Docker Compose service configurations
        n8n/                        # n8n workflow automation
        netbox/                     # Network documentation & IPAM
        README.md                   # Services overview (updated)
    disaster-recovery/
        homelab-export-20251207-120040/  # Latest infrastructure export
    scripts/
        crawlers-exporters/         # Infrastructure collection scripts
        fixers/                     # Problem-solving scripts
        qol/                        # Quality of life improvements
    CLAUDE.md                        # AI assistant guidance (updated)
    INDEX.md                         # Navigation index (updated)
    README.md                        # Repository overview (updated)
    CLAUDE_STATUS.md                # This file - current infrastructure status

Security Status

Latest Audit: 2025-12-20
Total Findings: 31 (6 CRITICAL, 3 HIGH, 2 MEDIUM, 20 LOW)
Remediation Status: Planning Phase - Documentation Complete

Critical and High-Severity Vulnerabilities:

  • Docker socket exposure (3 containers)
  • Proxmox credentials in plaintext
  • Database passwords in git repository
  • Missing SSL/TLS for internal services
  • Weak/default passwords across services
  • Containers running as root

Documentation:

  • Security Policy: /home/jramos/homelab/SECURITY.md
  • Audit Report: /home/jramos/homelab/troubleshooting/SECURITY_AUDIT_2025-12-20.md
  • Security Checklist: /home/jramos/homelab/templates/SECURITY_CHECKLIST.md
  • Script Validation: /home/jramos/homelab/scripts/security/VALIDATION_REPORT.md

Current Initiative: Security Audit Remediation - Q4 2025

Goal

Remediate 31 security findings identified in comprehensive security audit (2025-12-20), addressing critical vulnerabilities in Docker socket exposure, credential management, and SSL/TLS configuration.

Phase

Planning - Documentation Complete, Remediation Pending

Progress Checklist

Phase 1: Immediate Actions (Week 1) - Est. 30 min downtime

  • Complete security audit (31 findings documented)
  • Create remediation scripts (8 scripts validated)
  • Document security baseline in SECURITY.md
  • Backup all service configurations (backup-before-remediation.sh)
  • Migrate secrets to .env files (ByteStash, Paperless-ngx, Speedtest Tracker)

Phase 2: Low-Risk Changes (Weeks 2-3) - Est. 2-4 hours downtime

  • Deploy docker-socket-proxy
  • Rotate Proxmox API credentials (rotate-pve-credentials.sh)
  • Rotate database passwords (rotate-paperless-password.sh)
  • Rotate JWT secrets (rotate-bytestash-jwt.sh)

Phase 3: High-Risk Changes (Month 2) - Est. 4-8 hours downtime

  • Migrate Portainer to socket proxy
  • Migrate NPM to socket proxy or remove socket access
  • Remove socket mounts from Speedtest Tracker
  • Implement SSL/TLS for internal services
  • Enable container user namespacing

Phase 4: Infrastructure Improvements (Quarter 1) - Est. 8-16 hours

  • Implement network segmentation (VLANs for service tiers)
  • Deploy fail2ban for rate limiting
  • Enable backup encryption (PBS configuration)
  • Container vulnerability scanning pipeline
  • Automated credential rotation system

Context

Security audit revealed critical infrastructure vulnerabilities requiring systematic remediation. Priority on CRITICAL findings (CVSS 8.5-9.8) to reduce attack surface and prevent credential compromise.

Risk Management:

  • Phase 1: Zero downtime (configuration changes only)
  • Phase 2: Minimal downtime (credential rotation, proxy deployment)
  • Phase 3: Moderate downtime (service reconfiguration)
  • Phase 4: Planned maintenance windows (infrastructure changes)

Success Metrics:

  • All CRITICAL findings remediated (6/6)
  • All HIGH findings remediated (3/3)
  • Secrets removed from git repository
  • Docker socket access eliminated or proxied
  • SSL/TLS enabled for all external services

Previous Initiative: Claude Code Tool Inheritance Bug Investigation (2025-12-18)

Goal

Investigate and document a critical bug in Claude Code CLI where sub-agents with explicit tools: declarations receive only a subset of their configured tools, with first and last array elements consistently dropped.

Phase

COMPLETED - Bug confirmed, comprehensive report generated for Anthropic

Progress Checklist

  • Reproduce bug with scribe agent (confirmed: missing Read and Write)
  • Reproduce bug with lab-operator agent (confirmed: missing Bash and Write)
  • Test backend-builder agent (working correctly - exception to pattern)
  • Test librarian agent (working correctly - no tools: declaration)
  • Identify pattern: First and last tools dropped for agents with explicit tools: arrays
  • Document impact: Scribe cannot create docs, lab-operator cannot execute commands
  • Generate comprehensive bug report for Anthropic with all evidence
  • Update CLAUDE_STATUS.md with investigation status
  • Submit bug report to Anthropic via GitHub issues

Key Findings

Bug Pattern: Sub-agents with tools: [A, B, C, D, E] receive only [B, C, D] at runtime
Affected: scribe (no Read/Write), lab-operator (no Bash/Write)
Unaffected: backend-builder (exception), librarian (no tools: line)
Workaround: Remove tools: declarations to grant all tools by default
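
The observed pattern is equivalent to slicing off the first and last entries of the declared list. The sketch below only models the symptom; it says nothing about how Claude Code handles tools internally. The middle entries of the scribe list are hypothetical placeholders, since only the missing first and last tools (Read, Write) are documented here.

```python
def tools_received(declared):
    """Model the observed symptom: first and last declared tools are dropped."""
    return declared[1:-1]


generic = ["A", "B", "C", "D", "E"]
print(tools_received(generic))

# Hypothetical scribe declaration: only Read (first) and Write (last)
# are known from the investigation; the middle entries are placeholders.
scribe_declared = ["Read", "Grep", "Glob", "Edit", "Write"]
scribe_actual = tools_received(scribe_declared)
missing = [t for t in scribe_declared if t not in scribe_actual]
print(f"scribe missing: {missing}")
```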

Artifacts:

  • Bug report: /home/jramos/homelab/troubleshooting/ANTHROPIC_BUG_REPORT_TOOL_INHERITANCE.md
  • Original report: /home/jramos/homelab/troubleshooting/BUG_REPORT.md
  • Test agent IDs: scribe=a32bd54, lab-operator=ad681e8, backend-builder=aba15f6, librarian=a4cfeb7

Context

Critical workflow disruption: Documentation and infrastructure operations workflows completely broken due to missing tools. This is a Claude Code CLI internal bug, not a user configuration issue.


Previous Initiative: Sub-Agent Architecture Optimization (2025-12-07)

Goal

Improve the quality and effectiveness of all sub-agent prompt definitions to match best practices identified through comprehensive Opus-powered prompt engineering analysis. Target: bring all sub-agents to the quality standard established by librarian.md (~120-340 lines with comprehensive examples, safety protocols, and decision frameworks).

Phase

COMPLETED - All sub-agent improvements and validations finished

Progress Checklist

  • Prompt engineering analysis completed (Opus model)
    • Analyzed CLAUDE.md and all 4 sub-agent files
    • Identified 5 critical issues, 12 high-impact improvements
    • Generated comprehensive improvement recommendations
  • scribe.md improved (29 → 340 lines)
    • Added 6 usage examples (4 positive, 2 negative redirects)
    • Implemented comprehensive responsibilities section
    • Added 3 complete ASCII diagram templates
    • Included safety protocols and decision frameworks
    • Quality now matches librarian.md standard
  • backend-builder.md improved (40 → 291 lines)
    • Added 6 usage examples with clear boundaries
    • Expanded core responsibilities with Ansible, Terraform, Docker Compose, Python, Shell
    • Added technology stack table and validation rules table
    • Included safety protocols for secrets and destructive operations
    • Added handoff protocol for lab-operator deployment
    • Defined clear boundaries (CREATES code, does NOT deploy)
  • lab-operator.md improved (37 → 193 lines)
    • Added 6 usage examples with role clarity
    • Expanded domain expertise with specific commands
    • Added command style guide (5-step pattern)
    • Included safety protocols and decision-making framework
    • Added error handling and escalation guidelines
    • Defined clear boundaries (DEPLOYS/OPERATES, does NOT create IaC)
  • CLAUDE.md structural fixes
    • Moved YAML frontmatter to line 1 (was at line 89)
    • Fixed trailing pipe character on line 87
    • Completed incomplete sentence about backup strategy
    • Completed incomplete sentence about storage growth
    • Removed redundant "Key Services" reference
    • Expanded status file template with actual structure and recovery instructions
  • Final validation and testing
    • librarian: Git status check successful, clear output format
    • scribe: File reading functional (note: reported encoding issue, likely false positive)
    • backend-builder: YAML validation successful, proper syntax checking
    • lab-operator: Directory listing successful, proper command execution
    • All agents demonstrate improved structure and clarity

Context

Why It Matters: Well-designed sub-agent prompts improve task routing accuracy, execution quality, error reduction, and maintainability. The librarian.md agent (143 lines) sets the quality standard; scribe was severely underdeveloped at 29 lines before improvement.

Next Steps: Improve backend-builder.md and lab-operator.md using scribe.md as quality template.


Previous Phase: Infrastructure Documentation Complete

Goal

Comprehensive documentation of monitoring stack and updated infrastructure inventory.

Phase

Documentation & Maintenance

Completed Tasks

  • Created /home/jramos/homelab/monitoring/README.md with comprehensive monitoring documentation
  • Updated CLAUDE_STATUS.md with current infrastructure state
  • Documented 8 VMs, 2 Templates, and 4 LXC containers
  • Updated storage statistics (PBS 27.43%, Vault 10.88%, local 15.13%)
  • Added monitoring stack architecture and deployment procedures
  • Documented new services: monitoring-docker, twingate-connector, n8n
  • Referenced latest export: disaster-recovery/homelab-export-20251207-120040

Remaining Documentation Tasks

  • Update INDEX.md with monitoring section and current VM/CT counts
  • Update README.md with infrastructure (8 VMs, 2 Templates, 4 LXC)
  • Update CLAUDE.md with architecture tables for monitoring and zero-trust
  • Update services/README.md with monitoring stack and twingate sections
  • Verify all documentation cross-references are accurate
  • Test monitoring stack deployment procedures

Access Information

Management Interfaces

Key Network Segments

  • Management Network: 192.168.2.0/24
  • Proxmox Host: 192.168.2.200
  • Reverse Proxy: 192.168.2.101 (CT 102)
  • TinyAuth: 192.168.2.10 (CT 115)
  • n8n: 192.168.2.107 (CT 113)
  • Monitoring: 192.168.2.114 (VM 101)

Maintenance Schedule

Automated Tasks

  • Backups: Proxmox Backup Server - Daily incremental, Weekly full
  • Monitoring Scrapes: Prometheus - Every 30 seconds
  • Certificate Renewal: Nginx Proxy Manager - Automatic via Let's Encrypt

Manual Tasks

  • Weekly: Review Grafana dashboards for anomalies
  • Monthly: Update monitoring stack Docker images
  • Quarterly: Review backup retention policies
  • Semi-Annual: Kernel updates on Proxmox host and VMs

Known Issues & Resolutions

Resolved

  • n8n PostgreSQL locale errors (fixed with fix_n8n_db_c_locale.sh)
  • n8n database permissions (fixed with fix_n8n_db_permissions.sh)

Active Security Vulnerabilities (2025-12-20 Audit)

CRITICAL Severity:

  1. Docker Socket Exposure (CVSS 9.8)

    • Affected: Portainer, Nginx Proxy Manager, Speedtest Tracker
    • Impact: Container escape to root access
    • Remediation: Deploy docker-socket-proxy (Phase 2)
  2. Proxmox Credentials in Plaintext (CVSS 9.1)

    • Affected: PVE Exporter .env and pve.yml
    • Impact: Full infrastructure compromise
    • Remediation: Rotate credentials, use API tokens (Phase 2)
  3. Database Passwords in Git (CVSS 8.5)

    • Affected: Paperless-ngx, ByteStash, Speedtest Tracker
    • Impact: Credential exposure to all repository users
    • Remediation: Migrate to .env files, scrub git history (Phase 1)

HIGH Severity:

  4. Missing SSL/TLS (CVSS 7.5)

    • Affected: Internal service communication
    • Impact: Traffic interception, credential sniffing
    • Remediation: Enable HTTPS via NPM or self-signed certs (Phase 3)
  5. Weak/Default Passwords (CVSS 7.2)

    • Affected: Multiple services
    • Impact: Brute-force attacks, unauthorized access
    • Remediation: Generate strong passwords, implement rotation (Phase 2)
  6. Containers Running as Root (CVSS 7.0)

    • Affected: Most Docker containers
    • Impact: Privilege escalation if container compromised
    • Remediation: Enable user namespacing, set non-root users (Phase 3)

Remediation Timeline: See "Security Audit Remediation - Q4 2025" initiative above

Active Monitoring

  • PVE Exporter SSL verification (set to false for self-signed certificates) - SECURITY RISK
  • Prometheus retention policies (currently 15 days, may need adjustment)
  • Security script container names need verification (3/8 scripts)

Deferred

  • NetBox container offline (on-demand service)
  • Development VMs stopped (resource conservation)
  • Network segmentation implementation (Phase 4)
  • Backup encryption (Phase 4)

Version History

  • v2.1.0 (2025-12-07): Added monitoring stack, twingate connector, updated infrastructure counts
  • v2.0.0 (2025-12-02): Repository reorganization, services migration from GitLab
  • v1.0.0 (2025-11-29): Initial infrastructure documentation

Maintained by: jramos
Repository: Homelab Infrastructure Configuration
Platform: Proxmox VE 8.4.0
Infrastructure Scale: 9 VMs, 2 Templates, 5 Containers
Current Status: Operational - Home Automation Integration Deployed