feat(docs): update documentation for monitoring stack and infrastructure changes

- Update INDEX.md with VM 101 (monitoring-docker) and CT 112 (twingate-connector)
- Update README.md with monitoring and security sections
- Update CLAUDE.md with new architecture patterns
- Update services/README.md with monitoring stack documentation
- Update CLAUDE_STATUS.md with current infrastructure state
- Update infrastructure counts: 10 VMs, 4 Containers
- Update storage stats: PBS 27.43%, Vault 10.88%
- Create comprehensive monitoring/README.md
- Add .gitignore rules for monitoring sensitive files (pve.yml, .env)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-07 12:41:08 -07:00
parent 0366c63d51
commit f42eeaba92
7 changed files with 1367 additions and 1000 deletions
--- a/services/README.md
+++ b/services/README.md
@@ -132,6 +132,205 @@ cd speedtest-tracker
 docker compose up -d
 ```

+## Monitoring Stack (VM-based)
+
+**Deployment**: VM 101 (monitoring-docker) at 192.168.2.114
+**Technology**: Docker Compose
+**Components**: Grafana, Prometheus, PVE Exporter
+
+### Overview
+Monitoring and observability stack for the Proxmox homelab environment, providing real-time metrics, visualization, and alerting.
+
+### Components
+
+**Grafana** (Port 3000):
+- Visualization and dashboards
+- Pre-configured Proxmox VE dashboards
+- User authentication and RBAC
+- Alerting capabilities
+- Access: http://192.168.2.114:3000
+
+**Prometheus** (Port 9090):
+- Metrics collection and time-series database
+- PromQL query language
+- 15-day retention (configurable)
+- Service discovery
+- Access: http://192.168.2.114:9090
+
+**PVE Exporter** (Port 9221):
+- Proxmox VE metrics exporter
+- Connects to Proxmox API
+- Exports node, VM, CT, and storage metrics
+- Access: http://192.168.2.114:9221
+
+### Key Features
+- Real-time Proxmox infrastructure monitoring
+- VM and container resource utilization tracking
+- Storage pool capacity planning
+- Network traffic analysis
+- Backup job status monitoring
+- Custom alerting rules
+
+### Deployment
+
+```bash
+# Navigate to monitoring directory
+cd /home/jramos/homelab/monitoring
+
+# Deploy PVE Exporter
+cd pve-exporter
+docker compose up -d
+
+# Deploy Prometheus
+cd ../prometheus
+docker compose up -d
+
+# Deploy Grafana
+cd ../grafana
+docker compose up -d
+
+# Verify all services
+docker ps | grep -E 'grafana|prometheus|pve-exporter'
+```
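`docker ps` only shows that the containers are running. As a rough follow-up, the sketch below probes the three URLs documented in this section. The function names and the `MONITORING_HOST` variable are illustrative, and it assumes each service answers a plain GET on its base URL with a non-error status.

```bash
# Host and ports come from this README; names below are illustrative only.
MONITORING_HOST="${MONITORING_HOST:-192.168.2.114}"

# Print the base URL for a component, or return 1 for an unknown name.
stack_url() {
  case "$1" in
    grafana)      echo "http://${MONITORING_HOST}:3000" ;;
    prometheus)   echo "http://${MONITORING_HOST}:9090" ;;
    pve-exporter) echo "http://${MONITORING_HOST}:9221" ;;
    *)            return 1 ;;
  esac
}

# Probe every component; print OK/FAIL and return non-zero if any probe fails.
check_stack() {
  local component url failed=0
  for component in pve-exporter prometheus grafana; do
    url="$(stack_url "$component")"
    # --fail treats HTTP 4xx/5xx as errors; redirects count as success.
    if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
      echo "OK    $component ($url)"
    else
      echo "FAIL  $component ($url)"
      failed=1
    fi
  done
  return "$failed"
}
```

Running `check_stack` from any machine on the 192.168.2.0/24 network should list one line per component.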
+
+### Configuration
+
+**PVE Exporter**:
+- Environment file: `monitoring/pve-exporter/.env`
+- Configuration: `monitoring/pve-exporter/pve.yml`
+- Requires Proxmox API user with PVEAuditor role
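One way to satisfy the PVEAuditor requirement is a dedicated read-only user on the Proxmox host. The sketch below only prints the `pveum` commands so they can be reviewed before running. The user ID `monitoring@pve` is a placeholder, not a value taken from this repository, and password authentication is assumed.

```bash
# Print (do not run) the pveum commands for a read-only exporter user.
# "monitoring@pve" is a hypothetical user ID; use the one configured in .env.
pve_auditor_commands() {
  local user="${1:-monitoring@pve}"
  echo "pveum user add ${user} --comment 'PVE Exporter (read-only)'"
  echo "pveum passwd ${user}"
  echo "pveum acl modify / --users ${user} --roles PVEAuditor"
  echo "pveum acl list"
}
```

Call `pve_auditor_commands` and run the printed commands one at a time as root on the Proxmox host. `pveum passwd` prompts interactively, so the output is not meant to be piped into a shell.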
+
+**Prometheus**:
+- Configuration: `monitoring/prometheus/prometheus.yml`
+- Scrapes PVE Exporter every 30 seconds
+- Targets: localhost:9090, pve-exporter:9221
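For orientation, a `prometheus.yml` matching the settings above might look like the output of the sketch below. The job names and the `default` module are assumptions (the module has to match a section name in `pve.yml`), and the actual file in `monitoring/prometheus/` may differ.

```bash
# Print a minimal prometheus.yml with the 30-second interval and the two
# targets listed above. Job names and the "default" module are assumptions.
render_prometheus_config() {
  local pve_host="${1:-192.168.2.200}"   # Proxmox API host from this README
  cat <<EOF
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  - job_name: pve
    metrics_path: /pve
    params:
      module: [default]
      target: ['${pve_host}']
    static_configs:
      - targets: ['pve-exporter:9221']
EOF
}
```

The output can be compared with the deployed file, or written to a scratch path and validated with `promtool check config <file>`, which ships in the official Prometheus image.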
+
+**Grafana**:
+- Default credentials: admin/admin (change on first login)
+- Data source: Prometheus at http://prometheus:9090
+- Recommended dashboard: Grafana ID 10347 (Proxmox VE)
+
+### Maintenance
+
+```bash
+# Update images (replace <component> with pve-exporter, prometheus, or grafana)
+cd /home/jramos/homelab/monitoring/<component>
+docker compose pull
+docker compose up -d
+
+# View logs
+docker compose logs -f
+
+# Restart services
+docker compose restart
+```
+
+### Troubleshooting
+
+**PVE Exporter connection issues**:
+1. Verify Proxmox API is accessible: `curl -k https://192.168.2.200:8006`
+2. Check credentials in `.env` file
+3. Verify user has PVEAuditor role: `pveum acl list` (on Proxmox)
+
+**Grafana shows no data**:
+1. Verify Prometheus data source configuration
+2. Check Prometheus targets: http://192.168.2.114:9090/targets
+3. Test queries in Prometheus UI before using in Grafana
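The targets page in step 2 can also be checked from a terminal through the Prometheus HTTP API. This is a minimal sketch: it counts targets reported as `down` with plain `grep`, which assumes the API returns compact JSON. The function names and the `PROMETHEUS_URL` variable are illustrative.

```bash
PROMETHEUS_URL="${PROMETHEUS_URL:-http://192.168.2.114:9090}"

# Count targets marked "down" in /api/v1/targets JSON read from stdin.
# Assumes compact JSON, i.e. "health":"down" with no spaces.
count_down_targets() {
  grep -o '"health":"down"' | wc -l | tr -d '[:space:]'
}

# Fetch the live target list and print a one-line summary.
report_targets() {
  local down
  down="$(curl --silent --max-time 5 "${PROMETHEUS_URL}/api/v1/targets" | count_down_targets)"
  echo "targets down: ${down}"
}
```

If `report_targets` shows a non-zero count, the `lastError` field in the same API response usually explains why the scrape failed.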
+
+**High memory usage**:
+1. Reduce Prometheus retention period
+2. Limit Grafana concurrent queries
+3. Increase VM 101 memory allocation
+
+**Complete Documentation**: See `/home/jramos/homelab/monitoring/README.md`
+
+---
+
+## Twingate Connector
+
+**Deployment**: CT 112 (twingate-connector)
+**Technology**: LXC Container
+**Purpose**: Zero-trust network access
+
+### Overview
+Lightweight connector providing secure remote access to homelab resources without traditional VPN complexity. Part of Twingate's zero-trust network access (ZTNA) solution.
+
+### Features
+- **Zero-Trust Architecture**: Grant access to specific resources, not entire networks
+- **No VPN Required**: Simplified connection without VPN client configuration
+- **Identity-Based Access**: User and device authentication
+- **Automatic Updates**: Connector auto-updates for security patches
+- **Low Resource Overhead**: Minimal CPU and memory footprint
+
+### Architecture
+```
+External User → Twingate Cloud → Twingate Connector (CT 112) → Homelab Resources
+```
+
+### Deployment Considerations
+
+**LXC vs Docker**:
+- LXC chosen for lightweight, always-on service
+- Minimal resource consumption
+- System-level integration
+- Quick restart and recovery
+
+**Network Placement**:
+- Deployed on homelab management network (192.168.2.0/24)
+- Access to all internal resources
+- No inbound port forwarding required
+
+### Configuration
+
+The Twingate connector is configured via the Twingate Admin Console:
+
+1. **Create Connector** in Twingate Admin Console
+2. **Generate Token** for connector authentication
+3. **Deploy Container** with provided token
+4. **Configure Resources** to route through connector
+5. **Assign Users** to resources
+
+### Maintenance
+
+**Health Monitoring**:
+- Check connector status in Twingate Admin Console
+- Monitor CPU/memory usage on CT 112
+- Review connection logs
+
+**Updates**:
+- Connector auto-updates by default
+- Manual updates: Restart container or redeploy
+
+**Troubleshooting**:
+- Verify network connectivity to Twingate cloud
+- Check connector token validity
+- Review resource routing configuration
+- Ensure firewall allows outbound HTTPS
+
+### Security Best Practices
+
+1. **Least Privilege**: Grant access only to required resources
+2. **MFA Enforcement**: Require multi-factor authentication for users
+3. **Device Trust**: Enable device posture checks
+4. **Audit Logs**: Regularly review access logs in Twingate Console
+5. **Connector Isolation**: Consider dedicated network segment for connector
+
+### Integration with Homelab
+
+**Protected Resources**:
+- Proxmox Web UI (192.168.2.200:8006)
+- Grafana Monitoring (192.168.2.114:3000)
+- Nginx Proxy Manager (192.168.2.101:81)
+- n8n Workflows (192.168.2.107:5678)
+- Development VMs and services
+
+**Access Policies**:
+- Admin users: Full access to all resources
+- Monitoring users: Read-only Grafana access
+- Developers: Access to dev VMs and services
+
+---
+
 ## General Deployment Instructions

 ### Prerequisites
@@ -308,6 +507,39 @@ Several services have embedded secrets in their docker-compose.yaml files:
 2. Verify host directory ownership: `chown -R <user>:<group> /path/to/volume`
 3. Check SELinux context (if applicable): `ls -Z /path/to/volume`

+### Monitoring Stack Issues
+
+**Metrics Not Appearing**:
+1. Verify PVE Exporter can reach Proxmox API
+2. Check Prometheus scrape targets status
+3. Ensure Grafana data source is configured correctly
+4. Review retention policies (data may have expired)
+
+**Authentication Failures (PVE Exporter)**:
+1. Verify Proxmox user credentials in `.env` file
+2. Check user has PVEAuditor role
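+3. Test that the API is reachable: `curl -k https://192.168.2.200:8006/api2/json/version` (without credentials this is expected to return HTTP 401, which still shows the API is up)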
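To check the credentials themselves rather than just the endpoint, one option is to request a ticket from the Proxmox API and look at the HTTP status. The sketch assumes password authentication (API tokens use a different mechanism and are not covered), and the function names are illustrative.

```bash
PVE_API="${PVE_API:-https://192.168.2.200:8006/api2/json}"

# Translate the HTTP status of a ticket request into a short diagnosis.
explain_ticket_status() {
  case "$1" in
    200) echo "credentials accepted" ;;
    401) echo "authentication failed: check user, realm, and password" ;;
    000) echo "no response: check network path and port 8006" ;;
    *)   echo "unexpected HTTP status $1" ;;
  esac
}

# Request a ticket and explain the result.
# Usage: test_pve_login 'user@pve' 'password'
test_pve_login() {
  local status
  # curl reports 000 when no HTTP response was received at all.
  status="$(curl --silent --insecure --max-time 5 --output /dev/null \
    --write-out '%{http_code}' \
    --data-urlencode "username=$1" --data-urlencode "password=$2" \
    "${PVE_API}/access/ticket")"
  explain_ticket_status "$status"
}
```

The user name must include the realm (for example `@pve` or `@pam`). Passing a password as an argument exposes it in shell history, so this is best kept to a throwaway session.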
+
+**High Resource Usage**:
+1. Adjust Prometheus retention: `--storage.tsdb.retention.time=7d`
+2. Reduce scrape frequency in prometheus.yml
+3. Limit Grafana query concurrency
+4. Increase VM 101 resources if needed
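The retention flag in step 1 is passed on the Prometheus command line. One possible way to set it without editing the main Compose file is an override file, sketched below. It assumes the service is named `prometheus` and uses the official image's default paths. Because overriding `command` replaces the image defaults, the config-file and storage-path flags are restated.

```bash
# Print a Compose override that shortens Prometheus retention.
# Service name and container paths are assumptions about this deployment.
render_retention_override() {
  local retention="${1:-7d}"
  cat <<EOF
services:
  prometheus:
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --storage.tsdb.retention.time=${retention}
EOF
}
```

Saving the output as `docker-compose.override.yml` next to the Prometheus Compose file and running `docker compose up -d` would apply it. If the existing Compose file already sets `command`, merge the flag there instead.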
+
+### Twingate Connector Issues
+
+**Connector Offline**:
+1. Check CT 112 is running: `pct status 112`
+2. Verify network connectivity from container
+3. Check connector token validity in Twingate Console
+4. Review container logs for error messages
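Steps 1 and 4 can be combined into one check run on the Proxmox host. This is a sketch under one loud assumption: the connector runs inside CT 112 as a systemd unit named `twingate-connector`. Adjust the unit name if the connector was installed differently.

```bash
# Succeed when `pct status` output (read from stdin) reports a running CT.
ct_is_running() {
  grep -q '^status: running$'
}

# Run on the Proxmox host. The unit name "twingate-connector" is an
# assumption about how the connector was installed inside the container.
check_connector() {
  local ctid="${1:-112}"
  if ! pct status "$ctid" | ct_is_running; then
    echo "CT ${ctid} is not running; start it with: pct start ${ctid}"
    return 1
  fi
  pct exec "$ctid" -- systemctl is-active twingate-connector
  pct exec "$ctid" -- journalctl -u twingate-connector -n 50 --no-pager
}
```

`check_connector` defaults to CT 112; pass another container ID as the first argument to check a different connector.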
+
+**Cannot Access Resources**:
+1. Verify resource is configured in Twingate Console
+2. Check user has permission to access resource
+3. Ensure connector is online and healthy
+4. Verify network routes on CT 112
+
 ## Migration Notes

 ### Post-Migration Tasks
@@ -353,6 +585,7 @@ For homelab-specific questions or issues:

 ---

-**Last Updated**: 2025-12-02
+**Last Updated**: 2025-12-07
 **Maintainer**: jramos
 **Repository**: http://192.168.2.102:3060/jramos/homelab
+**Infrastructure**: 10 VMs, 4 LXC Containers