# Homelab Infrastructure Status

**Last Updated**: 2025-12-07 12:00:40
**Export Reference**: disaster-recovery/homelab-export-20251207-120040

## Current Infrastructure Snapshot

### Proxmox Environment
- **Node**: serviceslab
- **Version**: Proxmox VE 8.3.3
- **Management IP**: 192.168.2.200
- **Architecture**: Single-node cluster
- **Total Resources**: 8 VMs, 2 Templates, 4 LXC Containers

---

## Virtual Machines (QEMU/KVM) - 8 VMs

| VM ID | Name | IP Address | Status | Purpose |
|-------|------|------------|--------|---------|
| 100 | docker-hub | 192.168.2.XXX | Running | Container registry/Docker hub mirror |
| 101 | monitoring-docker | 192.168.2.114 | Running | Monitoring stack (Grafana/Prometheus/PVE Exporter) |
| 105 | dev | - | Stopped | General-purpose development workstation |
| 106 | Ansible-Control | 192.168.2.XXX | Running | IaC orchestration, configuration management |
| 108 | CML | - | Stopped | Cisco Modeling Labs - network simulation |
| 109 | web-server-01 | 192.168.2.XXX | Running | Web application server (clustered) |
| 110 | web-server-02 | 192.168.2.XXX | Running | Load-balanced pair with web-server-01 |
| 111 | db-server-01 | 192.168.2.XXX | Running | Backend database server |

**Recent Changes**:
- Added VM 101 (monitoring-docker) for dedicated monitoring infrastructure
- Removed VM 101 (gitlab) - service decommissioned

---

## VM Templates - 2 Templates

| Template ID | Name | Purpose |
|-------------|------|---------|
| 104 | ubuntu-dev | Ubuntu development environment template for cloning |
| 107 | ubuntu-docker | Ubuntu Docker host template for rapid deployment |

**Note**: Templates are immutable base images used for cloning new VMs, not running workloads. They provide standardized configurations for consistent infrastructure provisioning.

---

## Containers (LXC) - 4 Containers

| CT ID | Name | IP Address | Status | Purpose |
|-------|------|------------|--------|---------|
| 102 | nginx | 192.168.2.101 | Running | Reverse proxy/load balancer & NPM |
| 103 | netbox | 192.168.2.XXX | Stopped | Network documentation/IPAM |
| 112 | twingate-connector | 192.168.2.XXX | Running | Zero-trust network access connector |
| 113 | n8n | 192.168.2.107 | Running | Workflow automation platform |

**Recent Changes**:
- Added CT 112 (twingate-connector) for zero-trust network security
- Added CT 113 (n8n) for workflow automation
- Removed CT 112 (Anytype) - replaced by n8n
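
The counts in the section headings (8 VMs, 2 Templates, 4 Containers) can be cross-checked against the tables. The following is a minimal sketch rather than a script from this repository: `INVENTORY` and `summarize` are hypothetical names, and the IDs, names, and statuses are transcribed by hand from the tables above.

```python
from collections import Counter

# Guest inventory transcribed from the tables above: ID -> (kind, name, status).
# Templates have no run state, so their status is None.
INVENTORY = {
    100: ("vm", "docker-hub", "Running"),
    101: ("vm", "monitoring-docker", "Running"),
    102: ("ct", "nginx", "Running"),
    103: ("ct", "netbox", "Stopped"),
    104: ("template", "ubuntu-dev", None),
    105: ("vm", "dev", "Stopped"),
    106: ("vm", "Ansible-Control", "Running"),
    107: ("template", "ubuntu-docker", None),
    108: ("vm", "CML", "Stopped"),
    109: ("vm", "web-server-01", "Running"),
    110: ("vm", "web-server-02", "Running"),
    111: ("vm", "db-server-01", "Running"),
    112: ("ct", "twingate-connector", "Running"),
    113: ("ct", "n8n", "Running"),
}


def summarize(inventory):
    """Return (guests per kind, running guests per kind) as two Counters."""
    totals = Counter(kind for kind, _, _ in inventory.values())
    running = Counter(
        kind for kind, _, status in inventory.values() if status == "Running"
    )
    return totals, running


totals, running = summarize(INVENTORY)
print("totals: ", dict(totals))
print("running:", dict(running))
```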

---

## Storage Architecture

| Storage Pool | Type | Total | Used | % Used | Purpose |
|--------------|------|-------|------|--------|---------|
| local | Directory | - | - | 15.13% | System files, ISOs, templates |
| local-lvm | LVM-Thin | - | - | 0.0% | VM disk images (thin provisioned) |
| Vault | NFS/Directory | - | - | 10.88% | Secure storage for sensitive data |
| PBS-Backups | PBS | - | - | 27.43% | Automated backup repository |
| iso-share | NFS/CIFS | - | - | 1.4% | Installation media library |
| localnetwork | Network Share | - | - | N/A | Shared resources across infrastructure |

**Capacity Notes**:
- PBS-Backups utilization increased to 27.43% (healthy retention)
- Vault utilization decreased to 10.88% (space optimization)
- local storage at 15.13% (system overhead normal)
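
The capacity notes can be turned into a simple threshold check. This is an illustrative sketch only: `POOL_USAGE` is copied from the table above, while `pools_over` and the 80% warning level are assumptions introduced here, not values taken from the infrastructure.

```python
# Utilization figures copied from the storage table above (percent used).
# "localnetwork" is omitted because the table reports N/A for it.
POOL_USAGE = {
    "local": 15.13,
    "local-lvm": 0.0,
    "Vault": 10.88,
    "PBS-Backups": 27.43,
    "iso-share": 1.4,
}

# Assumed alerting threshold; pick a value that suits your own retention plans.
WARN_PCT = 80.0


def pools_over(usage, threshold_pct):
    """Return pool names whose utilization exceeds threshold_pct, highest first."""
    over = [(pct, name) for name, pct in usage.items() if pct > threshold_pct]
    return [name for _, name in sorted(over, reverse=True)]


print("over warning level:", pools_over(POOL_USAGE, WARN_PCT))
print("over 10 percent:   ", pools_over(POOL_USAGE, 10.0))
```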

---

## Key Services & Stacks

### Monitoring & Observability (NEW)
**VM 101** - monitoring-docker (192.168.2.114)
- **Grafana**: Port 3000 - Visualization and dashboards
- **Prometheus**: Port 9090 - Metrics collection and time-series database
- **PVE Exporter**: Port 9221 - Proxmox VE metrics exporter
- **Documentation**: `/home/jramos/homelab/monitoring/README.md`
- **Status**: Fully operational
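
A quick way to confirm the "Fully operational" status is to poll the health endpoints that Grafana (`/api/health`) and Prometheus (`/-/healthy`) expose upstream. The sketch below is a hedged example, not part of the monitoring stack itself: `health_url` and `is_healthy` are hypothetical helpers, and it assumes both services are reachable over plain HTTP on the ports listed above. It only prints the URLs; call `is_healthy(url)` from a machine on the management network to run the actual probe.

```python
import urllib.error
import urllib.request

MONITORING_HOST = "192.168.2.114"  # VM 101, from the list above

# service -> (port, health path). Ports come from this document; the paths are
# the upstream health endpoints and are assumed to be left at their defaults.
HEALTH_PATHS = {
    "grafana": (3000, "/api/health"),
    "prometheus": (9090, "/-/healthy"),
}


def health_url(service, host=MONITORING_HOST):
    """Build the health-check URL for one of the services in HEALTH_PATHS."""
    port, path = HEALTH_PATHS[service]
    return f"http://{host}:{port}{path}"


def is_healthy(url, timeout=3.0):
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or a non-2xx HTTP error.
        return False


for name in HEALTH_PATHS:
    print(f"{name:<11} {health_url(name)}")
```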

### Network Security (NEW)
**CT 112** - twingate-connector
- **Purpose**: Zero-trust network access
- **Type**: Lightweight connector
- **Status**: Running
- **Integration**: Connects homelab to Twingate network

### Automation & Integration
**CT 113** - n8n (192.168.2.107)
- **Purpose**: Workflow automation platform
- **Technology**: n8n.io
- **Database**: PostgreSQL 15+
- **Features**: API integration, scheduled workflows, webhook triggers
- **Documentation**: `/home/jramos/homelab/services/README.md#n8n-workflow-automation`
- **Status**: Operational (resolved database locale issues)

### Infrastructure Documentation
**CT 103** - netbox
- **Purpose**: Network documentation and IPAM
- **Status**: Stopped (on-demand use)
- **Function**: Infrastructure source of truth

### Reverse Proxy & Load Balancing
**CT 102** - nginx (192.168.2.101)
- **Purpose**: Nginx Proxy Manager
- **Ports**: 80, 81, 443
- **Function**: SSL termination, reverse proxy, certificate management
- **Upstream Services**: All web-facing applications

### Three-Tier Application Stack
**Web Tier**:
- VM 109 (web-server-01) - Primary web server
- VM 110 (web-server-02) - Load-balanced pair

**Database Tier**:
- VM 111 (db-server-01) - Backend database

**Proxy Tier**:
- CT 102 (nginx) - Load balancer and SSL termination

### Development & Automation
**VM 106** - Ansible-Control
- **Purpose**: Infrastructure as Code orchestration
- **Tools**: Ansible, Terraform/OpenTofu (potential)
- **Status**: Running

### Container Registry
**VM 100** - docker-hub
- **Purpose**: Local Docker registry and hub mirror
- **Function**: Caching container images for faster deployments
- **Status**: Running

### Network Simulation
**VM 108** - CML
- **Purpose**: Cisco Modeling Labs
- **Function**: Network topology testing and simulation
- **Status**: Stopped (resource-intensive, on-demand use)

---

## Architecture Patterns

### Monitoring & Observability (NEW)
The infrastructure now implements a comprehensive monitoring stack following industry best practices:

- **Metrics Collection**: Prometheus scraping Proxmox metrics via PVE Exporter
- **Visualization**: Grafana providing real-time dashboards and alerting
- **Isolation**: Dedicated VM for monitoring services (fault isolation)
- **Integration**: Ready for AlertManager, additional exporters, and integrations

**Design Decision**: VM-based deployment provides kernel-level isolation and prevents resource contention with critical infrastructure services.
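
The metrics path described above (Prometheus asks PVE Exporter, which in turn queries the Proxmox API) is usually expressed as a scrape job with relabeling. The repository's actual `prometheus.yml` is not reproduced in this file, so the sketch below is only an assumption about its shape: it follows the relabeling pattern from the prometheus-pve-exporter README, `build_pve_scrape_job` is a hypothetical helper, and the 30-second interval is taken from the Maintenance Schedule section. JSON is valid YAML, so the printed structure has the same layout a `prometheus.yml` would use.

```python
import json


def build_pve_scrape_job(pve_host, exporter_addr, interval="30s"):
    """Build a Prometheus scrape job that reaches a Proxmox node via PVE Exporter."""
    return {
        "job_name": "pve",
        "scrape_interval": interval,
        "metrics_path": "/pve",
        "params": {"module": ["default"]},
        "static_configs": [{"targets": [pve_host]}],
        "relabel_configs": [
            # Pass the Proxmox host to the exporter as the ?target= parameter.
            {"source_labels": ["__address__"], "target_label": "__param_target"},
            # Keep the Proxmox host as the instance label on the resulting series.
            {"source_labels": ["__param_target"], "target_label": "instance"},
            # Send the HTTP request to the exporter rather than to Proxmox itself.
            {"target_label": "__address__", "replacement": exporter_addr},
        ],
    }


# Proxmox host and exporter address as listed elsewhere in this document.
job = build_pve_scrape_job("192.168.2.200", "192.168.2.114:9221")
print(json.dumps({"scrape_configs": [job]}, indent=2))
```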

### Zero-Trust Security (NEW)
Implementation of zero-trust network access principles:

- **Twingate Connector**: Lightweight connector providing secure access without VPNs
- **Container Deployment**: LXC container for minimal resource overhead
- **Network Segmentation**: Secure access to homelab from external networks

**Design Decision**: LXC container chosen for quick provisioning and low resource consumption.

### Automation-First Approach
Workflow automation and infrastructure orchestration:

- **n8n Platform**: Visual workflow builder for API integrations
- **Scheduled Tasks**: Automated backup checks, monitoring alerts, reports
- **Integration Hub**: Connects monitoring, documentation, and operational tools

**Design Decision**: PostgreSQL backend ensures data persistence and supports complex workflows.

### Tiered Application Architecture
Classic three-tier design for production-like environments:

- **Presentation Tier**: Paired web servers (109, 110) behind load balancer
- **Business Logic**: Application processing on web tier
- **Data Tier**: Dedicated database server (111) with backup strategy

**Design Decision**: Separation of concerns, scalability testing, high availability patterns.

### Selective Containerization Strategy
Hybrid approach balancing performance and resource efficiency:

- **LXC Containers**: Stateless services (nginx, netbox, twingate, n8n)
- **Full VMs**: Complex applications, kernel dependencies, heavy workloads
- **Rationale**: LXC for ~10x lower overhead, VMs for isolation and compatibility

---

## Recent Infrastructure Changes (2025-12-07)

### Additions
1. **VM 101 (monitoring-docker)**: New dedicated monitoring infrastructure
   - Grafana for visualization
   - Prometheus for metrics collection
   - PVE Exporter for Proxmox integration
   - IP: 192.168.2.114

2. **CT 112 (twingate-connector)**: Zero-trust network security
   - Lightweight connector
   - Secure remote access without VPN

3. **CT 113 (n8n)**: Workflow automation platform
   - PostgreSQL 15+ backend
   - IP: 192.168.2.107
   - Resolved database locale issues

### Modifications
- Storage utilization updated across all pools
- PBS-Backups now at 27.43% (increased retention)
- Vault optimized to 10.88% (reduced usage)

### Removals
- **VM 101 (gitlab)**: Decommissioned (previously at this ID)
- **CT 112 (Anytype)**: Replaced by n8n for better integration

### Documentation Updates
- Created comprehensive monitoring stack documentation
- Updated all infrastructure tables with current VMs/CTs
- Added architecture patterns for observability and zero-trust
- Updated storage statistics
- Referenced latest export: disaster-recovery/homelab-export-20251207-120040

---

## Repository Structure

```
homelab/
├── monitoring/                          # NEW: Monitoring stack configurations
│   ├── README.md                        # Comprehensive monitoring documentation
│   ├── grafana/
│   │   └── docker-compose.yml
│   ├── prometheus/
│   │   ├── docker-compose.yml
│   │   └── prometheus.yml
│   └── pve-exporter/
│       ├── docker-compose.yml
│       ├── pve.yml
│       └── .env
├── services/                            # Docker Compose service configurations
│   ├── n8n/                             # n8n workflow automation
│   ├── netbox/                          # Network documentation & IPAM
│   └── README.md                        # Services overview (updated)
├── disaster-recovery/
│   └── homelab-export-20251207-120040/  # Latest infrastructure export
├── scripts/
│   ├── crawlers-exporters/              # Infrastructure collection scripts
│   ├── fixers/                          # Problem-solving scripts
│   └── qol/                             # Quality of life improvements
├── CLAUDE.md                            # AI assistant guidance (updated)
├── INDEX.md                             # Navigation index (updated)
├── README.md                            # Repository overview (updated)
└── CLAUDE_STATUS.md                     # This file - current infrastructure status
```

---

## Current Initiative: Sub-Agent Architecture Optimization (2025-12-07)

### Goal
Improve the quality and effectiveness of all sub-agent prompt definitions to match the best practices identified through a comprehensive Opus-powered prompt engineering analysis. Target: bring all sub-agents to the quality standard established by librarian.md (~120-340 lines with comprehensive examples, safety protocols, and decision frameworks).

### Phase
COMPLETED - All sub-agent improvements and validations finished

### Progress Checklist
- [x] Prompt engineering analysis completed (Opus model)
  - Analyzed CLAUDE.md and all 4 sub-agent files
  - Identified 5 critical issues, 12 high-impact improvements
  - Generated comprehensive improvement recommendations
- [x] scribe.md improved (29 → 340 lines)
  - Added 6 usage examples (4 positive, 2 negative redirects)
  - Implemented comprehensive responsibilities section
  - Added 3 complete ASCII diagram templates
  - Included safety protocols and decision frameworks
  - Quality now matches librarian.md standard
- [x] backend-builder.md improved (40 → 291 lines)
  - Added 6 usage examples with clear boundaries
  - Expanded core responsibilities with Ansible, Terraform, Docker Compose, Python, Shell
  - Added technology stack table and validation rules table
  - Included safety protocols for secrets and destructive operations
  - Added handoff protocol for lab-operator deployment
  - Defined clear boundaries (CREATES code, does NOT deploy)
- [x] lab-operator.md improved (37 → 193 lines)
  - Added 6 usage examples with role clarity
  - Expanded domain expertise with specific commands
  - Added command style guide (5-step pattern)
  - Included safety protocols and decision-making framework
  - Added error handling and escalation guidelines
  - Defined clear boundaries (DEPLOYS/OPERATES, does NOT create IaC)
- [x] CLAUDE.md structural fixes
  - Moved YAML frontmatter to line 1 (was at line 89)
  - Fixed trailing pipe character on line 87
  - Completed incomplete sentence about backup strategy
  - Completed incomplete sentence about storage growth
  - Removed redundant "Key Services" reference
  - Expanded status file template with actual structure and recovery instructions
- [x] Final validation and testing
  - librarian: Git status check successful, clear output format
  - scribe: File reading functional (note: reported encoding issue, likely false positive)
  - backend-builder: YAML validation successful, proper syntax checking
  - lab-operator: Directory listing successful, proper command execution
  - All agents demonstrate improved structure and clarity

### Context
**Why It Matters**: Well-designed sub-agent prompts improve task routing accuracy and execution quality, reduce errors, and are easier to maintain. The librarian.md agent (143 lines) sets the quality standard; scribe was severely underdeveloped at 29 lines before improvement.

**Next Steps**: None remaining for this initiative; backend-builder.md and lab-operator.md were improved using scribe.md as the quality template.

---

## Previous Phase: Infrastructure Documentation Complete

### Goal
Comprehensive documentation of monitoring stack and updated infrastructure inventory.

### Phase
Documentation & Maintenance

### Completed Tasks
- [x] Created `/home/jramos/homelab/monitoring/README.md` with comprehensive monitoring documentation
- [x] Updated `CLAUDE_STATUS.md` with current infrastructure state
- [x] Documented 8 VMs, 2 Templates, and 4 LXC containers
- [x] Updated storage statistics (PBS 27.43%, Vault 10.88%, local 15.13%)
- [x] Added monitoring stack architecture and deployment procedures
- [x] Documented new services: monitoring-docker, twingate-connector, n8n
- [x] Referenced latest export: disaster-recovery/homelab-export-20251207-120040

### Remaining Documentation Tasks
- [x] Update INDEX.md with monitoring section and current VM/CT counts
- [x] Update README.md with infrastructure (8 VMs, 2 Templates, 4 LXC)
- [x] Update CLAUDE.md with architecture tables for monitoring and zero-trust
- [x] Update services/README.md with monitoring stack and twingate sections
- [x] Verify all documentation cross-references are accurate
- [ ] Test monitoring stack deployment procedures

---

## Access Information

### Management Interfaces
- **Proxmox UI**: https://192.168.2.200:8006
- **Grafana**: http://192.168.2.114:3000
- **Prometheus**: http://192.168.2.114:9090
- **Nginx Proxy Manager**: http://192.168.2.101:81
- **n8n**: http://192.168.2.107:5678

### Key Network Segments
- **Management Network**: 192.168.2.0/24
- **Proxmox Host**: 192.168.2.200
- **Reverse Proxy**: 192.168.2.101 (CT 102)
- **n8n**: 192.168.2.107 (CT 113)
- **Monitoring**: 192.168.2.114 (VM 101)
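
The addressing above can be sanity-checked with Python's standard `ipaddress` module. This is a small illustrative sketch: `HOSTS` is transcribed from the list above, and `outside_network` and `duplicate_addresses` are hypothetical helper names introduced for the example.

```python
import ipaddress
from collections import Counter

MANAGEMENT_NET = ipaddress.ip_network("192.168.2.0/24")

# Addresses transcribed from the list above.
HOSTS = {
    "Proxmox Host": "192.168.2.200",
    "Reverse Proxy (CT 102)": "192.168.2.101",
    "n8n (CT 113)": "192.168.2.107",
    "Monitoring (VM 101)": "192.168.2.114",
}


def outside_network(hosts, network):
    """Return the names of hosts whose address is not inside the given network."""
    return sorted(
        name for name, addr in hosts.items()
        if ipaddress.ip_address(addr) not in network
    )


def duplicate_addresses(hosts):
    """Return addresses that are assigned to more than one host."""
    counts = Counter(hosts.values())
    return sorted(addr for addr, n in counts.items() if n > 1)


print("outside management network:", outside_network(HOSTS, MANAGEMENT_NET))
print("duplicate addresses:       ", duplicate_addresses(HOSTS))
```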

---

## Maintenance Schedule

### Automated Tasks
- **Backups**: Proxmox Backup Server - Daily incremental, Weekly full
- **Monitoring Scrapes**: Prometheus - Every 30 seconds
- **Certificate Renewal**: Nginx Proxy Manager - Automatic via Let's Encrypt

### Recommended Manual Tasks
- **Weekly**: Review Grafana dashboards for anomalies
- **Monthly**: Update monitoring stack Docker images
- **Quarterly**: Review backup retention policies
- **Semi-Annual**: Kernel updates on Proxmox host and VMs

---

## Known Issues & Resolutions

### Resolved
- n8n PostgreSQL locale errors (fixed with `fix_n8n_db_c_locale.sh`)
- n8n database permissions (fixed with `fix_n8n_db_permissions.sh`)

### Active Monitoring
- PVE Exporter SSL verification (set to false for self-signed certificates)
- Prometheus retention policies (currently 15 days, may need adjustment)
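
When adjusting the Prometheus retention period, the sizing rule from the Prometheus storage documentation helps: needed disk is roughly retention time in seconds × ingested samples per second × bytes per sample, with about 1 to 2 bytes per sample. The sketch below applies that rule using the 15-day retention and 30-second scrape interval stated in this document. The active series count is a purely hypothetical placeholder, so read the real figure from your own Prometheus before relying on the numbers.

```python
def estimate_tsdb_bytes(retention_days, active_series, scrape_interval_s,
                        bytes_per_sample=2.0):
    """Rough Prometheus disk estimate: retention * samples per second * bytes per sample."""
    retention_seconds = retention_days * 24 * 60 * 60
    # Each series yields one sample per scrape, so samples/s = series / interval.
    return retention_seconds * active_series * bytes_per_sample / scrape_interval_s


# Hypothetical series count, NOT measured on this homelab.
ASSUMED_ACTIVE_SERIES = 5_000
SCRAPE_INTERVAL_S = 30  # from the Maintenance Schedule section

for days in (15, 30, 90):
    size_gb = estimate_tsdb_bytes(days, ASSUMED_ACTIVE_SERIES, SCRAPE_INTERVAL_S) / 1e9
    print(f"{days:>3} days of retention -> roughly {size_gb:.2f} GB")
```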

### Deferred
- NetBox container offline (on-demand service)
- Development VMs stopped (resource conservation)

---

## Version History

- **v2.1.0** (2025-12-07): Added monitoring stack, twingate connector, updated infrastructure counts
- **v2.0.0** (2025-12-02): Repository reorganization, services migration from GitLab
- **v1.0.0** (2025-11-29): Initial infrastructure documentation

---

**Maintained by**: jramos
**Repository**: Homelab Infrastructure Configuration
**Platform**: Proxmox VE 8.3.3
**Infrastructure Scale**: 8 VMs, 2 Templates, 4 Containers
**Current Status**: Operational - Documentation Complete