
# Homelab Infrastructure Status

**Last Updated**: 2025-12-18 17:00:00
**Export Reference**: disaster-recovery/homelab-export-20251211-144345

## Current Infrastructure Snapshot

### Proxmox Environment
- **Node**: serviceslab
- **Version**: Proxmox VE 8.4.0
- **Management IP**: 192.168.2.200
- **Architecture**: Single-node cluster
- **Total Resources**: 9 VMs, 2 Templates, 5 LXC Containers

---

## Virtual Machines (QEMU/KVM) - 9 VMs

| VM ID | Name | IP Address | Status | Purpose |
|-------|------|------------|--------|---------|
| 100 | docker-hub | 192.168.2.XXX | Running | Container registry/Docker hub mirror |
| 101 | monitoring-docker | 192.168.2.114 | Running | Monitoring stack (Grafana/Prometheus/PVE Exporter) |
| 105 | dev | - | Stopped | General-purpose development workstation |
| 106 | Ansible-Control | 192.168.2.XXX | Running | IaC orchestration, configuration management |
| 108 | CML | - | Stopped | Cisco Modeling Labs - network simulation |
| 109 | web-server-01 | 192.168.2.XXX | Running | Web application server (clustered) |
| 110 | web-server-02 | 192.168.2.XXX | Running | Load-balanced pair with web-server-01 |
| 111 | db-server-01 | 192.168.2.XXX | Running | Backend database server |
| 114 | haos | 192.168.2.XXX | Running | Home Assistant OS - smart home automation platform |

**Recent Changes**:
- Added VM 101 (monitoring-docker) for dedicated monitoring infrastructure
- Removed VM 101 (gitlab) - service decommissioned

---

## VM Templates - 2 Templates

| Template ID | Name | Purpose |
|-------------|------|---------|
| 104 | ubuntu-dev | Ubuntu development environment template for cloning |
| 107 | ubuntu-docker | Ubuntu Docker host template for rapid deployment |

**Note**: Templates are immutable base images used for cloning new VMs, not running workloads. They provide standardized configurations for consistent infrastructure provisioning.

---

## Containers (LXC) - 5 Containers

| CT ID | Name | IP Address | Status | Purpose |
|-------|------|------------|--------|---------|
| 102 | nginx | 192.168.2.101 | Running | Reverse proxy/load balancer & NPM |
| 103 | netbox | 192.168.2.XXX | Running | Network documentation/IPAM |
| 112 | twingate-connector | 192.168.2.XXX | Running | Zero-trust network access connector |
| 113 | n8n | 192.168.2.107 | Running | Workflow automation platform |
| 115 | tinyauth | 192.168.2.10 | Running | SSO authentication layer for NetBox |

**Recent Changes**:
- Added CT 115 (tinyauth) for SSO authentication integration with NetBox
- Added CT 112 (twingate-connector) for zero-trust network security
- Added CT 113 (n8n) for workflow automation
- Removed CT 112 (Anytype) - replaced by n8n

---

## Storage Architecture

| Storage Pool | Type | Total | Used | % Used | Purpose |
|--------------|------|-------|------|--------|---------|
| local | Directory | - | - | 19.11% | System files, ISOs, templates |
| local-lvm | LVM-Thin | - | - | 0.01% | VM disk images (thin provisioned) |
| Vault | NFS/Directory | - | - | 12.13% | Secure storage for sensitive data |
| PBS-Backups | PBS | - | - | 28.27% | Automated backup repository |
| iso-share | NFS/CIFS | - | - | 1.45% | Installation media library |
| localnetwork | Network Share | - | - | N/A | Shared resources across infrastructure |

**Capacity Notes**:
- PBS-Backups utilization increased to 28.27% (healthy retention)
- Vault utilization increased to 12.13% (data growth monitored)
- local storage at 19.11% (system overhead within normal range)

---

## Key Services & Stacks

### Monitoring & Observability (NEW)
**VM 101** - monitoring-docker (192.168.2.114)
- **Grafana**: Port 3000 - Visualization and dashboards
- **Prometheus**: Port 9090 - Metrics collection and time-series database
- **PVE Exporter**: Port 9221 - Proxmox VE metrics exporter
- **Documentation**: `/home/jramos/homelab/monitoring/README.md`
- **Status**: Fully operational
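
A quick way to confirm that the three monitoring ports listed above accept connections is a plain TCP check. The following is a minimal sketch, not part of the repository: the helper names `port_is_open` and `check_monitoring` are illustrative, and a successful TCP connect only shows that something is listening, not that the service is healthy.

```python
import socket

# Host and ports are taken from the service list above.
MONITORING_HOST = "192.168.2.114"
MONITORING_PORTS = {"grafana": 3000, "prometheus": 9090, "pve-exporter": 9221}


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


def check_monitoring(host: str = MONITORING_HOST) -> dict[str, bool]:
    """Map each monitoring service name to whether its port accepts connections."""
    return {name: port_is_open(host, port) for name, port in MONITORING_PORTS.items()}
```

Calling `check_monitoring()` from a machine on the management network returns a dictionary such as `{"grafana": True, ...}`, which can be printed or fed into a scheduled check.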

### Network Security (NEW)
**CT 112** - twingate-connector
- **Purpose**: Zero-trust network access
- **Type**: Lightweight connector
- **Status**: Running
- **Integration**: Connects homelab to Twingate network

### Automation & Integration
**CT 113** - n8n (192.168.2.107)
- **Purpose**: Workflow automation platform
- **Technology**: n8n.io
- **Database**: PostgreSQL 15+
- **Features**: API integration, scheduled workflows, webhook triggers
- **Documentation**: `/home/jramos/homelab/services/README.md#n8n-workflow-automation`
- **Status**: Operational (resolved database locale issues)

### Authentication & SSO
**CT 115** - tinyauth (192.168.2.10)
- **Purpose**: Lightweight SSO authentication layer
- **Technology**: TinyAuth v4 (Docker container)
- **Port**: 8000
- **Domain**: tinyauth.apophisnetworking.net
- **Integration**: Authentication gateway for NetBox via Nginx Proxy Manager
- **Security**: Bcrypt-hashed credentials, HTTPS enforcement
- **Documentation**: `/home/jramos/homelab/services/tinyauth/README.md`
- **Status**: Operational

### Infrastructure Documentation
**CT 103** - netbox
- **Purpose**: Network documentation and IPAM
- **Status**: Running (activated 2025-12-11)
- **Function**: Infrastructure source of truth

### Reverse Proxy & Load Balancing
**CT 102** - nginx (192.168.2.101)
- **Purpose**: Nginx Proxy Manager
- **Ports**: 80, 81, 443
- **Function**: SSL termination, reverse proxy, certificate management
- **Upstream Services**: All web-facing applications

### Three-Tier Application Stack
**Web Tier**:
- VM 109 (web-server-01) - Primary web server
- VM 110 (web-server-02) - Load-balanced pair

**Database Tier**:
- VM 111 (db-server-01) - Backend database

**Proxy Tier**:
- CT 102 (nginx) - Load balancer and SSL termination

### Development & Automation
**VM 106** - Ansible-Control
- **Purpose**: Infrastructure as Code orchestration
- **Tools**: Ansible, Terraform/OpenTofu (potential)
- **Status**: Running

### Container Registry
**VM 100** - docker-hub
- **Purpose**: Local Docker registry and hub mirror
- **Function**: Caching container images for faster deployments
- **Status**: Running

### Network Simulation
**VM 108** - CML
- **Purpose**: Cisco Modeling Labs
- **Function**: Network topology testing and simulation
- **Status**: Stopped (resource-intensive, on-demand use)

---

## Architecture Patterns

### Monitoring & Observability (NEW)
The infrastructure now runs a dedicated monitoring stack built on Prometheus and Grafana:

- **Metrics Collection**: Prometheus scraping Proxmox metrics via PVE Exporter
- **Visualization**: Grafana providing real-time dashboards and alerting
- **Isolation**: Dedicated VM for monitoring services (fault isolation)
- **Integration**: Ready for AlertManager, additional exporters, and integrations

**Design Decision**: VM-based deployment provides kernel-level isolation and prevents resource contention with critical infrastructure services.

### Zero-Trust Security (NEW)
Implementation of zero-trust network access principles:

- **Twingate Connector**: Lightweight connector providing secure access without VPNs
- **Container Deployment**: LXC container for minimal resource overhead
- **Network Segmentation**: Secure access to homelab from external networks

**Design Decision**: LXC container chosen for quick provisioning and low resource consumption.

### Automation-First Approach
Workflow automation and infrastructure orchestration:

- **n8n Platform**: Visual workflow builder for API integrations
- **Scheduled Tasks**: Automated backup checks, monitoring alerts, reports
- **Integration Hub**: Connects monitoring, documentation, and operational tools

**Design Decision**: PostgreSQL backend ensures data persistence and supports complex workflows.

### Tiered Application Architecture
Classic three-tier design for production-like environments:

- **Presentation Tier**: Paired web servers (109, 110) behind load balancer
- **Business Logic**: Application processing on web tier
- **Data Tier**: Dedicated database server (111) with backup strategy

**Design Decision**: Separating the tiers keeps concerns isolated and lets the lab exercise scalability testing and high-availability patterns.

### Selective Containerization Strategy
Hybrid approach balancing performance and resource efficiency:

- **LXC Containers**: Lightweight services (nginx, netbox, twingate, n8n, tinyauth)
- **Full VMs**: Complex applications, kernel dependencies, heavy workloads
- **Rationale**: LXC for ~10x lower overhead, VMs for isolation and compatibility

---

## Recent Infrastructure Changes

### 2025-12-20: Comprehensive Security Audit Completed

**Activity:** Complete infrastructure security assessment and remediation planning

**Audit Scope:**
- All Docker Compose services (Portainer, NPM, Paperless-ngx, ByteStash, Speedtest Tracker, FileBrowser)
- Proxmox VE infrastructure and API access
- Network security and segmentation
- Credential management and storage
- SSL/TLS configuration
- Container security and runtime configuration

**Findings Summary:**
- **CRITICAL (6)**: Docker socket exposure, hardcoded credentials, database passwords in git
- **HIGH (3)**: Missing SSL/TLS, weak passwords, containers running as root
- **MEDIUM (2)**: SSL verification disabled, missing authentication
- **LOW (20)**: Documentation gaps, monitoring improvements, backup encryption

**Deliverables:**
1. **Security Policy** (`SECURITY.md`): 864 lines - Comprehensive security best practices
2. **Audit Report** (`troubleshooting/SECURITY_AUDIT_2025-12-20.md`): 2,350 lines - Detailed findings and remediation plan
3. **Security Checklist** (`templates/SECURITY_CHECKLIST.md`): 750 lines - Pre-deployment validation template
4. **Validation Report** (`scripts/security/VALIDATION_REPORT.md`): 2,092 lines - Script safety assessment
5. **Container Fixes** (`scripts/security/CONTAINER_NAME_FIXES.md`): 621 lines - Container name verification
6. **Security Scripts** (8 total):
   - `verify-service-status.sh` - Service health checker
   - `backup-before-remediation.sh` - Comprehensive backup utility
   - `rotate-pve-credentials.sh` - Proxmox credential rotation
   - `rotate-paperless-password.sh` - Database password rotation
   - `rotate-bytestash-jwt.sh` - JWT secret rotation
   - `rotate-logward-credentials.sh` - Multi-service credential rotation
   - `docker-socket-proxy/docker-compose.yml` - Security proxy deployment
   - `portainer/docker-compose.socket-proxy.yml` - Portainer migration config

**Script Validation:**
- **Ready for execution**: 5/8 scripts (verify-service-status.sh, rotate-pve-credentials.sh, rotate-bytestash-jwt.sh, backup-before-remediation.sh, docker-socket-proxy)
- **Needs container name fixes**: 3/8 scripts (see CONTAINER_NAME_FIXES.md)

**4-Phase Remediation Roadmap:**
- Phase 1 (Week 1): Immediate actions - Backups, secrets migration
- Phase 2 (Weeks 2-3): Low-risk changes - Socket proxy, credential rotation
- Phase 3 (Month 2): High-risk changes - Service migrations, SSL/TLS
- Phase 4 (Quarter 1): Infrastructure - Network segmentation, scanning pipelines

**Estimated Timeline:**
- Total downtime: 6-13 minutes (sequential script execution)
- Full remediation: 8-16 weeks

**Risk Assessment:**
- Current risk: HIGH - Multiple CRITICAL vulnerabilities active
- Post-Phase 1 risk: MEDIUM - Credential exposure mitigated
- Post-Phase 3 risk: LOW - All CRITICAL/HIGH findings remediated
- Post-Phase 4 risk: VERY LOW - Defense-in-depth implemented

**Status:** Documentation complete, awaiting remediation execution approval

---

### 2025-12-18: TinyAuth SSO Deployment

**Service Deployed:** CT 115 - TinyAuth authentication layer

**Purpose:** Centralized SSO authentication for NetBox and future homelab services

**Specifications:**
- **Container**: CT 115 (LXC with Docker)
- **IP Address**: 192.168.2.10
- **Domain**: tinyauth.apophisnetworking.net
- **Port**: 8000 (external), 3000 (internal)
- **Docker Image**: ghcr.io/steveiliop56/tinyauth:v4
- **Resource Usage**: ~50-100 MB memory, <1% CPU

**Integration Architecture:**
- Internet → Nginx Proxy Manager (CT 102) → TinyAuth (CT 115) → NetBox (CT 103)
- NPM uses `auth_request` directive to validate credentials via TinyAuth
- Bcrypt-hashed password storage for security
- HTTPS enforcement via NPM SSL termination
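
The `auth_request` flow above comes down to a decision on the status code of the authentication subrequest. The sketch below models the behaviour documented for nginx's `auth_request` module (2xx allows, 401/403 denies, anything else is treated as an error). It is an illustration in Python rather than the actual NPM configuration, and the names `Decision` and `auth_request_decision` are hypothetical.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"   # request is passed on to the protected upstream (NetBox)
    DENY = "deny"     # client receives the 401/403 from the auth subrequest
    ERROR = "error"   # nginx treats the subrequest result as an error


def auth_request_decision(subrequest_status: int) -> Decision:
    """Classify the auth subrequest status the way nginx's auth_request does."""
    if 200 <= subrequest_status <= 299:
        return Decision.ALLOW
    if subrequest_status in (401, 403):
        return Decision.DENY
    return Decision.ERROR
```

In this model, a logged-in user produces a 2xx from TinyAuth and reaches NetBox, while an unauthenticated user gets a 401 that the proxy can turn into a redirect to the login page.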

**Issues Resolved During Deployment:**
1. **500 Internal Server Error**: Fixed Nginx advanced config syntax
2. **IP addresses not allowed**: Changed APP_URL from IP to domain
3. **Port mapping**: Corrected Docker port mapping from 8000:8000 to 8000:3000
4. **Invalid password**: Implemented bcrypt hash requirement for TinyAuth v4
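
The port-mapping fix in item 3 follows from Docker's `HOST:CONTAINER` short syntax: the right-hand side must be the port the application listens on inside the container (3000 here), not the port exposed on the host. A minimal sketch of that check follows. The helper names are illustrative, and it handles only the two-field form, without the optional IP or protocol parts.

```python
def parse_port_mapping(mapping: str) -> tuple[int, int]:
    """Split a 'HOST:CONTAINER' mapping into (host_port, container_port)."""
    host, container = mapping.split(":")
    return int(host), int(container)


def mapping_reaches_app(mapping: str, app_listen_port: int) -> bool:
    """True if the container side of the mapping is the port the app listens on."""
    _, container_port = parse_port_mapping(mapping)
    return container_port == app_listen_port
```

With TinyAuth listening on 3000 internally, `mapping_reaches_app("8000:8000", 3000)` is false while `mapping_reaches_app("8000:3000", 3000)` is true, which matches the correction described above.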

**Integration Impact:**
- NetBox now protected by centralized authentication
- Foundation for extending SSO to other services (Grafana, Proxmox UI future candidates)
- Authentication logs available for security auditing

**Documentation:** Complete guide at `/home/jramos/homelab/services/tinyauth/README.md`

**Status:** ✅ Operational - Successfully authenticating NetBox access

---

### 2025-12-11: Loki-Stack Monitoring Fully Operational

**Issue Resolved:** Centralized logging pipeline now receiving syslog from UniFi router

**Root Cause:** rsyslog filter in `/etc/rsyslog.d/unifi-router.conf` was configured for wrong source IP (192.168.1.1 instead of 192.168.2.1)

**Fix Applied:** Updated rsyslog filter to match VLAN 2 gateway IP (192.168.2.1)

**Status:** ✅ Complete - Logs flowing UniFi → rsyslog → Promtail → Loki → Grafana

**Services Affected:**
- VM 101 (monitoring-docker): rsyslog configuration updated
- Loki-stack: All components operational
- Grafana: Dashboards receiving real-time syslog data

**Technical Details:** See `troubleshooting/loki-stack-bugfix.md` for complete 5-phase troubleshooting history
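
The root cause is easy to reproduce in isolation: a filter keyed on one source address never matches traffic from another. The sketch below uses Python's `ipaddress` module to show that the originally configured address is not even inside the 192.168.2.0/24 network the router sits on. The function name is illustrative, and this is not the rsyslog configuration itself.

```python
import ipaddress

# Network and addresses as stated in this document.
LAN = ipaddress.ip_network("192.168.2.0/24")
CONFIGURED_SOURCE = "192.168.1.1"   # value the filter originally used
ACTUAL_SOURCE = "192.168.2.1"       # gateway that actually sends the syslog traffic


def filter_matches(filter_ip: str, sender_ip: str) -> bool:
    """An exact source-IP filter matches only when both addresses are equal."""
    return ipaddress.ip_address(filter_ip) == ipaddress.ip_address(sender_ip)


def in_lan(ip: str) -> bool:
    """True if the address belongs to the 192.168.2.0/24 network."""
    return ipaddress.ip_address(ip) in LAN
```

Here `filter_matches(CONFIGURED_SOURCE, ACTUAL_SOURCE)` is false and `in_lan(CONFIGURED_SOURCE)` is also false, so the old filter could not have matched the router's messages.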

---

### 2025-12-11: Infrastructure Expansion & System Updates

#### Proxmox VE Platform Upgrade
- **Upgraded**: Proxmox VE 8.3.3 → 8.4.0
- **Kernel**: 6.8.12-8-pve
- **pve-manager**: 8.4.14
- **Impact**: Enhanced performance, security updates, bug fixes
- **Status**: ✅ Complete - All VMs and containers operating normally

#### New VM 114: Home Assistant OS Deployment
- **Service**: haos (Home Assistant Operating System)
- **Purpose**: Smart home automation and integration platform
- **Specifications**:
  - Memory: 4 GB (87% utilized)
  - CPU: 2 vCPUs
  - Boot Disk: 50 GB
  - Status: Running (~3 days uptime)
- **Rationale**: Centralized home automation hub for IoT device management
- **Integration**: Will integrate with monitoring stack for infrastructure metrics

#### CT 103: NetBox IPAM Activated
- **Service**: netbox (Network Documentation & IPAM)
- **Status Change**: Stopped → Running
- **Uptime**: ~3.1 days
- **Resource Usage**: 1.28 GB / 2 GB memory (64%)
- **Purpose**: Active network documentation and IP address management
- **Rationale**: Required for ongoing infrastructure expansion planning

#### Storage Utilization Trends
- **PBS-Backups**: 27.43% → 28.27% (+0.84%) - Normal backup retention growth
- **Vault (ZFS)**: 10.88% → 12.13% (+1.25%) - Data accumulation monitored
- **local**: 15.13% → 19.11% (+3.98%) - New VM deployment and system updates
- **iso-share**: 1.4% → 1.45% (+0.05%) - Minimal change
- **local-lvm**: 0.0% → 0.01% (+0.01%) - Thin provisioned storage baseline
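
The deltas above are differences in percentage points between the two snapshots. A small sketch of the arithmetic follows, using `Decimal` so the two-decimal figures subtract exactly. The dictionary layout and function names are illustrative.

```python
from decimal import Decimal

# (previous %, current %) per storage pool, copied from the list above.
USAGE = {
    "PBS-Backups": ("27.43", "28.27"),
    "Vault": ("10.88", "12.13"),
    "local": ("15.13", "19.11"),
    "iso-share": ("1.4", "1.45"),
    "local-lvm": ("0.0", "0.01"),
}


def delta_points(before: str, after: str) -> Decimal:
    """Change in utilization, in percentage points."""
    return Decimal(after) - Decimal(before)


def trend_report() -> dict[str, Decimal]:
    """Percentage-point change for every pool in USAGE."""
    return {pool: delta_points(b, a) for pool, (b, a) in USAGE.items()}
```

Sorting `trend_report()` by value puts `local` first, which is consistent with the note that the new VM deployment and system updates drove most of the growth.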

---

### 2025-12-07: Infrastructure Documentation & Monitoring Stack

#### Additions
1. **VM 101 (monitoring-docker)**: New dedicated monitoring infrastructure
   - Grafana for visualization
   - Prometheus for metrics collection
   - PVE Exporter for Proxmox integration
   - IP: 192.168.2.114

2. **CT 112 (twingate-connector)**: Zero-trust network security
   - Lightweight connector
   - Secure remote access without VPN

3. **CT 113 (n8n)**: Workflow automation platform
   - PostgreSQL 15+ backend
   - IP: 192.168.2.107
   - Resolved database locale issues

#### Modifications
- Storage utilization updated across all pools
- PBS-Backups now at 27.43% (increased retention)
- Vault optimized to 10.88% (reduced usage)

#### Removals
- **VM 101 (gitlab)**: Decommissioned (previously at this ID)
- **CT 112 (Anytype)**: Replaced by n8n for better integration

#### Documentation Updates
- Created comprehensive monitoring stack documentation
- Updated all infrastructure tables with current VMs/CTs
- Added architecture patterns for observability and zero-trust
- Updated storage statistics
- Referenced latest export: disaster-recovery/homelab-export-20251207-120040

---

## Repository Structure

```
homelab/
    monitoring/                      # NEW: Monitoring stack configurations
        README.md                   # Comprehensive monitoring documentation
        grafana/
            docker-compose.yml
        prometheus/
            docker-compose.yml
            prometheus.yml
        pve-exporter/
            docker-compose.yml
            pve.yml
            .env
    services/                        # Docker Compose service configurations
        n8n/                        # n8n workflow automation
        netbox/                     # Network documentation & IPAM
        README.md                   # Services overview (updated)
    disaster-recovery/
        homelab-export-20251207-120040/  # Latest infrastructure export
    scripts/
        crawlers-exporters/         # Infrastructure collection scripts
        fixers/                     # Problem-solving scripts
        qol/                        # Quality of life improvements
    CLAUDE.md                        # AI assistant guidance (updated)
    INDEX.md                         # Navigation index (updated)
    README.md                        # Repository overview (updated)
    CLAUDE_STATUS.md                # This file - current infrastructure status
```

---

## Security Status

**Latest Audit**: 2025-12-20
**Total Findings**: 31 (6 CRITICAL, 3 HIGH, 2 MEDIUM, 20 LOW)
**Remediation Status**: Planning Phase - Documentation Complete

**Critical and High Vulnerabilities**:
- Docker socket exposure (3 containers)
- Proxmox credentials in plaintext
- Database passwords in git repository
- Missing SSL/TLS for internal services
- Weak/default passwords across services
- Containers running as root

**Documentation**:
- Security Policy: `/home/jramos/homelab/SECURITY.md`
- Audit Report: `/home/jramos/homelab/troubleshooting/SECURITY_AUDIT_2025-12-20.md`
- Security Checklist: `/home/jramos/homelab/templates/SECURITY_CHECKLIST.md`
- Script Validation: `/home/jramos/homelab/scripts/security/VALIDATION_REPORT.md`

---

## Current Initiative: Security Audit Remediation - Q4 2025

### Goal
Remediate 31 security findings identified in comprehensive security audit (2025-12-20), addressing critical vulnerabilities in Docker socket exposure, credential management, and SSL/TLS configuration.

### Phase
Planning - Documentation Complete, Remediation Pending

### Progress Checklist

**Phase 1: Immediate Actions (Week 1) - Est. 30 min downtime**
- [x] Complete security audit (31 findings documented)
- [x] Create remediation scripts (8 scripts validated)
- [x] Document security baseline in SECURITY.md
- [ ] Backup all service configurations (`backup-before-remediation.sh`)
- [ ] Migrate secrets to .env files (ByteStash, Paperless-ngx, Speedtest Tracker)

**Phase 2: Low-Risk Changes (Weeks 2-3) - Est. 2-4 hours downtime**
- [ ] Deploy docker-socket-proxy
- [ ] Rotate Proxmox API credentials (`rotate-pve-credentials.sh`)
- [ ] Rotate database passwords (`rotate-paperless-password.sh`)
- [ ] Rotate JWT secrets (`rotate-bytestash-jwt.sh`)

**Phase 3: High-Risk Changes (Month 2) - Est. 4-8 hours downtime**
- [ ] Migrate Portainer to socket proxy
- [ ] Migrate NPM to socket proxy or remove socket access
- [ ] Remove socket mounts from Speedtest Tracker
- [ ] Implement SSL/TLS for internal services
- [ ] Enable container user namespacing

**Phase 4: Infrastructure Improvements (Quarter 1) - Est. 8-16 hours**
- [ ] Implement network segmentation (VLANs for service tiers)
- [ ] Deploy fail2ban for rate limiting
- [ ] Enable backup encryption (PBS configuration)
- [ ] Container vulnerability scanning pipeline
- [ ] Automated credential rotation system

### Context
Security audit revealed critical infrastructure vulnerabilities requiring systematic remediation. Priority on CRITICAL findings (CVSS 8.5-9.8) to reduce attack surface and prevent credential compromise.

**Risk Management**:
- Phase 1: Zero downtime (configuration changes only)
- Phase 2: Minimal downtime (credential rotation, proxy deployment)
- Phase 3: Moderate downtime (service reconfiguration)
- Phase 4: Planned maintenance windows (infrastructure changes)

**Success Metrics**:
- All CRITICAL findings remediated (6/6)
- All HIGH findings remediated (3/3)
- Secrets removed from git repository
- Docker socket access eliminated or proxied
- SSL/TLS enabled for all external services

---

## Previous Initiative: Claude Code Tool Inheritance Bug Investigation (2025-12-18)

### Goal
Investigate and document a critical bug in Claude Code CLI where sub-agents with explicit `tools:` declarations receive only a subset of their configured tools, with first and last array elements consistently dropped.

### Phase
COMPLETED - Bug confirmed, comprehensive report generated for Anthropic

### Progress Checklist
- [x] Reproduce bug with scribe agent (confirmed: missing Read and Write)
- [x] Reproduce bug with lab-operator agent (confirmed: missing Bash and Write)
- [x] Test backend-builder agent (working correctly - exception to pattern)
- [x] Test librarian agent (working correctly - no tools: declaration)
- [x] Identify pattern: First and last tools dropped for agents with explicit tools: arrays
- [x] Document impact: Scribe cannot create docs, lab-operator cannot execute commands
- [x] Generate comprehensive bug report for Anthropic with all evidence
- [x] Update CLAUDE_STATUS.md with investigation status
- [ ] Submit bug report to Anthropic via GitHub issues

### Key Findings
**Bug Pattern**: Sub-agents with `tools: [A, B, C, D, E]` receive only `[B, C, D]` at runtime
**Affected**: scribe (no Read/Write), lab-operator (no Bash/Write)
**Unaffected**: backend-builder (exception), librarian (no tools: line)
**Workaround**: Remove `tools:` declarations to grant all tools by default
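
The observed pattern can be written down as a simple slice, which also gives a quick way to compare a declared tool list against what an agent actually receives. This is a model of the reported symptom only. It says nothing about the cause inside Claude Code, and the function names are illustrative.

```python
def observed_runtime_tools(declared: list[str]) -> list[str]:
    """Model of the reported symptom: first and last declared tools are missing."""
    return declared[1:-1]


def missing_tools(declared: list[str], received: list[str]) -> list[str]:
    """Declared tools that did not arrive at runtime, in declaration order."""
    return [tool for tool in declared if tool not in received]
```

For a declaration of `["A", "B", "C", "D", "E"]`, the model yields `["B", "C", "D"]`, and `missing_tools` reports `["A", "E"]`, matching the pattern described above.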

**Artifacts**:
- Bug report: `/home/jramos/homelab/troubleshooting/ANTHROPIC_BUG_REPORT_TOOL_INHERITANCE.md`
- Original report: `/home/jramos/homelab/troubleshooting/BUG_REPORT.md`
- Test agent IDs: scribe=a32bd54, lab-operator=ad681e8, backend-builder=aba15f6, librarian=a4cfeb7

### Context
Critical workflow disruption: Documentation and infrastructure operations workflows completely broken due to missing tools. This is a Claude Code CLI internal bug, not a user configuration issue.

---

## Previous Initiative: Sub-Agent Architecture Optimization (2025-12-07)

### Goal
Improve the quality and effectiveness of all sub-agent prompt definitions to match best practices identified through comprehensive Opus-powered prompt engineering analysis. Target: bring all sub-agents to the quality standard established by librarian.md (~120-340 lines with comprehensive examples, safety protocols, and decision frameworks).

### Phase
COMPLETED - All sub-agent improvements and validations finished

### Progress Checklist
- [x] Prompt engineering analysis completed (Opus model)
  - Analyzed CLAUDE.md and all 4 sub-agent files
  - Identified 5 critical issues, 12 high-impact improvements
  - Generated comprehensive improvement recommendations
- [x] scribe.md improved (29 → 340 lines)
  - Added 6 usage examples (4 positive, 2 negative redirects)
  - Implemented comprehensive responsibilities section
  - Added 3 complete ASCII diagram templates
  - Included safety protocols and decision frameworks
  - Quality now matches librarian.md standard
- [x] backend-builder.md improved (40 → 291 lines)
  - Added 6 usage examples with clear boundaries
  - Expanded core responsibilities with Ansible, Terraform, Docker Compose, Python, Shell
  - Added technology stack table and validation rules table
  - Included safety protocols for secrets and destructive operations
  - Added handoff protocol for lab-operator deployment
  - Defined clear boundaries (CREATES code, does NOT deploy)
- [x] lab-operator.md improved (37 → 193 lines)
  - Added 6 usage examples with role clarity
  - Expanded domain expertise with specific commands
  - Added command style guide (5-step pattern)
  - Included safety protocols and decision-making framework
  - Added error handling and escalation guidelines
  - Defined clear boundaries (DEPLOYS/OPERATES, does NOT create IaC)
- [x] CLAUDE.md structural fixes
  - Moved YAML frontmatter to line 1 (was at line 89)
  - Fixed trailing pipe character on line 87
  - Completed incomplete sentence about backup strategy
  - Completed incomplete sentence about storage growth
  - Removed redundant "Key Services" reference
  - Expanded status file template with actual structure and recovery instructions
- [x] Final validation and testing
  - librarian: Git status check successful, clear output format
  - scribe: File reading functional (note: reported encoding issue, likely false positive)
  - backend-builder: YAML validation successful, proper syntax checking
  - lab-operator: Directory listing successful, proper command execution
  - All agents demonstrate improved structure and clarity

### Context
**Why It Matters**: Well-designed sub-agent prompts improve task routing accuracy, execution quality, error reduction, and maintainability. The librarian.md agent (143 lines) sets the quality standard; scribe was severely underdeveloped at 29 lines before improvement.

**Next Steps**: Improve backend-builder.md and lab-operator.md using scribe.md as quality template.

---

## Previous Phase: Infrastructure Documentation Complete

### Goal
Comprehensive documentation of monitoring stack and updated infrastructure inventory.

### Phase
Documentation & Maintenance

### Completed Tasks
- [x] Created `/home/jramos/homelab/monitoring/README.md` with comprehensive monitoring documentation
- [x] Updated `CLAUDE_STATUS.md` with current infrastructure state
- [x] Documented 8 VMs, 2 Templates, and 4 LXC containers
- [x] Updated storage statistics (PBS 27.43%, Vault 10.88%, local 15.13%)
- [x] Added monitoring stack architecture and deployment procedures
- [x] Documented new services: monitoring-docker, twingate-connector, n8n
- [x] Referenced latest export: disaster-recovery/homelab-export-20251207-120040

### Remaining Documentation Tasks
- [x] Update INDEX.md with monitoring section and current VM/CT counts
- [x] Update README.md with infrastructure (8 VMs, 2 Templates, 4 LXC)
- [x] Update CLAUDE.md with architecture tables for monitoring and zero-trust
- [x] Update services/README.md with monitoring stack and twingate sections
- [x] Verify all documentation cross-references are accurate
- [ ] Test monitoring stack deployment procedures

---

## Access Information

### Management Interfaces
- **Proxmox UI**: https://192.168.2.200:8006
- **Grafana**: http://192.168.2.114:3000
- **Prometheus**: http://192.168.2.114:9090
- **Nginx Proxy Manager**: http://192.168.2.101:81
- **n8n**: http://192.168.2.107:5678
- **TinyAuth**: https://tinyauth.apophisnetworking.net (internal: http://192.168.2.10:8000)

### Key Network Segments
- **Management Network**: 192.168.2.0/24
- **Proxmox Host**: 192.168.2.200
- **Reverse Proxy**: 192.168.2.101 (CT 102)
- **TinyAuth**: 192.168.2.10 (CT 115)
- **n8n**: 192.168.2.107 (CT 113)
- **Monitoring**: 192.168.2.114 (VM 101)

---

## Maintenance Schedule

### Automated Tasks
- **Backups**: Proxmox Backup Server - Daily incremental, Weekly full
- **Monitoring Scrapes**: Prometheus - Every 30 seconds
- **Certificate Renewal**: Nginx Proxy Manager - Automatic via Let's Encrypt

### Recommended Manual Tasks
- **Weekly**: Review Grafana dashboards for anomalies
- **Monthly**: Update monitoring stack Docker images
- **Quarterly**: Review backup retention policies
- **Semi-Annual**: Kernel updates on Proxmox host and VMs

---

## Known Issues & Resolutions

### Resolved
- n8n PostgreSQL locale errors (fixed with `fix_n8n_db_c_locale.sh`)
- n8n database permissions (fixed with `fix_n8n_db_permissions.sh`)

### Active Security Vulnerabilities (2025-12-20 Audit)

**CRITICAL Severity:**
1. **Docker Socket Exposure** (CVSS 9.8)
   - Affected: Portainer, Nginx Proxy Manager, Speedtest Tracker
   - Impact: Container escape to root access
   - Remediation: Deploy docker-socket-proxy (Phase 2)

2. **Proxmox Credentials in Plaintext** (CVSS 9.1)
   - Affected: PVE Exporter `.env` and `pve.yml`
   - Impact: Full infrastructure compromise
   - Remediation: Rotate credentials, use API tokens (Phase 2)

3. **Database Passwords in Git** (CVSS 8.5)
   - Affected: Paperless-ngx, ByteStash, Speedtest Tracker
   - Impact: Credential exposure to all repository users
   - Remediation: Migrate to `.env` files, scrub git history (Phase 1)

**HIGH Severity:**
4. **Missing SSL/TLS** (CVSS 7.5)
   - Affected: Internal service communication
   - Impact: Traffic interception, credential sniffing
   - Remediation: Enable HTTPS via NPM or self-signed certs (Phase 3)

5. **Weak/Default Passwords** (CVSS 7.2)
   - Affected: Multiple services
   - Impact: Brute-force attacks, unauthorized access
   - Remediation: Generate strong passwords, implement rotation (Phase 2)

6. **Containers Running as Root** (CVSS 7.0)
   - Affected: Most Docker containers
   - Impact: Privilege escalation if container compromised
   - Remediation: Enable user namespacing, set non-root users (Phase 3)

**Remediation Timeline:** See "Security Audit Remediation - Q4 2025" initiative above
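
For the secrets-in-git finding, the general pattern behind "migrate to `.env` files" is that code and configuration read credentials from the environment and fail loudly when one is missing, rather than carrying a hardcoded value. The sketch below shows that pattern under stated assumptions: the variable name `DB_PASSWORD` and the helper `require_secret` are hypothetical and do not come from any of the affected services.

```python
import os
from collections.abc import Mapping


class MissingSecretError(RuntimeError):
    """Raised when a required secret is absent or empty."""


def require_secret(name: str, environ: Mapping[str, str] = os.environ) -> str:
    """Return the named secret from the environment, or fail instead of defaulting."""
    value = environ.get(name, "")
    if not value:
        # No fallback value on purpose: a hardcoded default would recreate the finding.
        raise MissingSecretError(f"required secret {name!r} is not set")
    return value
```

A service wrapper would call `require_secret("DB_PASSWORD")` at startup, with the value supplied through a `.env` file that is excluded from version control.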

### Active Monitoring
- PVE Exporter SSL verification (set to false for self-signed certificates) - **SECURITY RISK**
- Prometheus retention policies (currently 15 days, may need adjustment)
- Security script container names need verification (3/8 scripts)

### Deferred
- NetBox container offline (on-demand service)
- Development VMs stopped (resource conservation)
- Network segmentation implementation (Phase 4)
- Backup encryption (Phase 4)

---

## Version History

- **v2.1.0** (2025-12-07): Added monitoring stack, twingate connector, updated infrastructure counts
- **v2.0.0** (2025-12-02): Repository reorganization, services migration from GitLab
- **v1.0.0** (2025-11-29): Initial infrastructure documentation

---

**Maintained by**: jramos
**Repository**: Homelab Infrastructure Configuration
**Platform**: Proxmox VE 8.4.0
**Infrastructure Scale**: 9 VMs, 2 Templates, 5 Containers
**Current Status**: Operational - Home Automation Integration Deployed