feat(docs): update documentation for monitoring stack and infrastructure changes

- Update INDEX.md with VM 101 (monitoring-docker) and CT 112 (twingate-connector)
- Update README.md with monitoring and security sections
- Update CLAUDE.md with new architecture patterns
- Update services/README.md with monitoring stack documentation
- Update CLAUDE_STATUS.md with current infrastructure state
- Update infrastructure counts: 10 VMs, 4 Containers
- Update storage stats: PBS 27.43%, Vault 10.88%
- Create comprehensive monitoring/README.md
- Add .gitignore rules for monitoring sensitive files (pve.yml, .env)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>

@@ -132,6 +132,205 @@ cd speedtest-tracker
docker compose up -d
```

## Monitoring Stack (VM-based)

**Deployment**: VM 101 (monitoring-docker) at 192.168.2.114
**Technology**: Docker Compose
**Components**: Grafana, Prometheus, PVE Exporter

### Overview
Comprehensive monitoring and observability stack for the Proxmox homelab environment providing real-time metrics, visualization, and alerting capabilities.

### Components

**Grafana** (Port 3000):
- Visualization and dashboards
- Pre-configured Proxmox VE dashboards
- User authentication and RBAC
- Alerting capabilities
- Access: http://192.168.2.114:3000

**Prometheus** (Port 9090):
- Metrics collection and time-series database
- PromQL query language
- 15-day retention (configurable)
- Service discovery
- Access: http://192.168.2.114:9090

**PVE Exporter** (Port 9221):
- Proxmox VE metrics exporter
- Connects to Proxmox API
- Exports node, VM, CT, and storage metrics
- Access: http://192.168.2.114:9221

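A quick way to confirm that all three endpoints listed above respond is a small probe script. The following is only a sketch: the function names and the `MONITORING_HOST` variable are hypothetical, `/api/health` and `/-/healthy` are the health endpoints Grafana and Prometheus normally expose, and probing `/` on the exporter assumes it serves an index page there.

```bash
# Host taken from the component list above; override via MONITORING_HOST (hypothetical variable).
MONITORING_HOST="${MONITORING_HOST:-192.168.2.114}"

# Print one "name url" pair per line.
monitoring_endpoints() {
  printf '%s %s\n' \
    grafana "http://${MONITORING_HOST}:3000/api/health" \
    prometheus "http://${MONITORING_HOST}:9090/-/healthy" \
    pve-exporter "http://${MONITORING_HOST}:9221/"
}

# Probe each endpoint with curl; return non-zero if any probe fails.
check_monitoring() {
  local name url failed=0
  while read -r name url; do
    if curl -fsS --max-time 5 -o /dev/null "$url"; then
      echo "OK   $name"
    else
      echo "FAIL $name ($url)"
      failed=1
    fi
  done <<EOF
$(monitoring_endpoints)
EOF
  return "$failed"
}
```

Running `check_monitoring` from any machine on the 192.168.2.0/24 network prints one `OK` or `FAIL` line per component.
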
### Key Features
- Real-time Proxmox infrastructure monitoring
- VM and container resource utilization tracking
- Storage pool capacity planning
- Network traffic analysis
- Backup job status monitoring
- Custom alerting rules

### Deployment

```bash
# Navigate to monitoring directory
cd /home/jramos/homelab/monitoring

# Deploy PVE Exporter
cd pve-exporter
docker compose up -d

# Deploy Prometheus
cd ../prometheus
docker compose up -d

# Deploy Grafana
cd ../grafana
docker compose up -d

# Verify all services
docker ps | grep -E 'grafana|prometheus|pve-exporter'
```

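The per-directory commands above can also be wrapped in a loop. This is a minimal sketch, assuming the three component directories each contain their own compose file; `compose_each`, `MONITORING_DIR`, and `DRY_RUN` are hypothetical names introduced for illustration.

```bash
# Base directory from the deployment steps above; override via MONITORING_DIR.
MONITORING_DIR="${MONITORING_DIR:-/home/jramos/homelab/monitoring}"

# Run "docker compose <args>" in each component directory, in the same
# order as the manual steps (exporter, then Prometheus, then Grafana).
# Set DRY_RUN=1 to print the plan without calling Docker.
compose_each() {
  local component
  for component in pve-exporter prometheus grafana; do
    echo "==> ${MONITORING_DIR}/${component}: docker compose $*"
    if [ "${DRY_RUN:-0}" != 1 ]; then
      (cd "${MONITORING_DIR}/${component}" && docker compose "$@") || return 1
    fi
  done
}
```

With this in place, `compose_each up -d` deploys the stack and `compose_each pull` fetches newer images for every component; the loop stops at the first component that fails.
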
### Configuration

**PVE Exporter**:
- Environment file: `monitoring/pve-exporter/.env`
- Configuration: `monitoring/pve-exporter/pve.yml`
- Requires Proxmox API user with PVEAuditor role

**Prometheus**:
- Configuration: `monitoring/prometheus/prometheus.yml`
- Scrapes PVE Exporter every 30 seconds
- Targets: localhost:9090, pve-exporter:9221

**Grafana**:
- Default credentials: admin/admin (change on first login)
- Data source: Prometheus at http://prometheus:9090
- Recommended dashboard: Grafana ID 10347 (Proxmox VE)

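For orientation, the Prometheus settings listed above might translate into a scrape configuration like the one rendered below. This is a sketch, not the repository's actual `prometheus.yml`: the job names, the `/pve` metrics path with `module` and `target` parameters (the pattern documented for prometheus-pve-exporter), and the use of 192.168.2.200 as the Proxmox target are assumptions.

```bash
# Proxmox API host as used elsewhere in this README; override via PVE_HOST (hypothetical variable).
PVE_HOST="${PVE_HOST:-192.168.2.200}"

# Print an illustrative prometheus.yml to stdout.
render_prometheus_config() {
  cat <<EOF
global:
  scrape_interval: 30s

scrape_configs:
  - job_name: prometheus
    static_configs:
      - targets: ['localhost:9090']

  - job_name: pve
    metrics_path: /pve
    params:
      module: [default]
      target: ['${PVE_HOST}']
    static_configs:
      - targets: ['pve-exporter:9221']
EOF
}
```

One way to sanity-check the result before using it is `render_prometheus_config > /tmp/prometheus.yml && promtool check config /tmp/prometheus.yml`, assuming `promtool` is available (it ships with Prometheus).
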
### Maintenance

```bash
# Update images
cd /home/jramos/homelab/monitoring/<component>
docker compose pull
docker compose up -d

# View logs
docker compose logs -f

# Restart services
docker compose restart
```

### Troubleshooting

**PVE Exporter connection issues**:
1. Verify Proxmox API is accessible: `curl -k https://192.168.2.200:8006`
2. Check credentials in `.env` file
3. Verify the user has the PVEAuditor role: `pveum acl list` (on Proxmox; `pveum user list` shows users but not their roles)

**Grafana shows no data**:
1. Verify Prometheus data source configuration
2. Check Prometheus targets: http://192.168.2.114:9090/targets
3. Test queries in Prometheus UI before using in Grafana

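The targets page also has a JSON equivalent at `/api/v1/targets`, which is easier to script against. The sketch below counts targets by health state; the function names and `PROMETHEUS_URL` are hypothetical, and the `grep` pattern assumes the compact JSON that Prometheus normally emits (a JSON parser such as `jq` would be more robust).

```bash
# Prometheus address from this README; override via PROMETHEUS_URL (hypothetical variable).
PROMETHEUS_URL="${PROMETHEUS_URL:-http://192.168.2.114:9090}"

# Read a /api/v1/targets JSON response on stdin and count targets by health.
summarize_target_health() {
  grep -o '"health":"[a-z]*"' | cut -d'"' -f4 | sort | uniq -c
}

# Fetch the live target list and summarize it.
check_targets() {
  curl -fsS --max-time 5 "${PROMETHEUS_URL}/api/v1/targets" | summarize_target_health
}
```

If `check_targets` reports any `down` entries, the scrape error for each target is shown on the targets page itself.
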
**High memory usage**:
1. Reduce Prometheus retention period
2. Limit Grafana concurrent queries
3. Increase VM 101 memory allocation

**Complete Documentation**: See `/home/jramos/homelab/monitoring/README.md`

---

## Twingate Connector

**Deployment**: CT 112 (twingate-connector)
**Technology**: LXC Container
**Purpose**: Zero-trust network access

### Overview
Lightweight connector providing secure remote access to homelab resources without traditional VPN complexity. Part of Twingate's zero-trust network access (ZTNA) solution.

### Features
- **Zero-Trust Architecture**: Grant access to specific resources, not entire networks
- **No VPN Required**: Simplified connection without VPN client configuration
- **Identity-Based Access**: User and device authentication
- **Automatic Updates**: Connector auto-updates for security patches
- **Low Resource Overhead**: Minimal CPU and memory footprint

### Architecture
```
External User → Twingate Cloud → Twingate Connector (CT 112) → Homelab Resources
```

### Deployment Considerations

**LXC vs Docker**:
- LXC chosen for lightweight, always-on service
- Minimal resource consumption
- System-level integration
- Quick restart and recovery

**Network Placement**:
- Deployed on homelab management network (192.168.2.0/24)
- Access to all internal resources
- No inbound port forwarding required

### Configuration

The Twingate connector is configured via the Twingate Admin Console:

1. **Create Connector** in Twingate Admin Console
2. **Generate Token** for connector authentication
3. **Deploy Container** with provided token
4. **Configure Resources** to route through connector
5. **Assign Users** to resources

### Maintenance

**Health Monitoring**:
- Check connector status in Twingate Admin Console
- Monitor CPU/memory usage on CT 112
- Review connection logs

**Updates**:
- Connector auto-updates by default
- Manual updates: Restart container or redeploy

**Troubleshooting**:
- Verify network connectivity to Twingate cloud
- Check connector token validity
- Review resource routing configuration
- Ensure firewall allows outbound HTTPS

### Security Best Practices

1. **Least Privilege**: Grant access only to required resources
2. **MFA Enforcement**: Require multi-factor authentication for users
3. **Device Trust**: Enable device posture checks
4. **Audit Logs**: Regularly review access logs in Twingate Console
5. **Connector Isolation**: Consider dedicated network segment for connector

### Integration with Homelab

**Protected Resources**:
- Proxmox Web UI (192.168.2.200:8006)
- Grafana Monitoring (192.168.2.114:3000)
- Nginx Proxy Manager (192.168.2.101:81)
- n8n Workflows (192.168.2.107:5678)
- Development VMs and services

**Access Policies**:
- Admin users: Full access to all resources
- Monitoring users: Read-only Grafana access
- Developers: Access to dev VMs and services

---

## General Deployment Instructions

### Prerequisites
@@ -308,6 +507,39 @@ Several services have embedded secrets in their docker-compose.yaml files:
2. Verify host directory ownership: `chown -R <user>:<group> /path/to/volume`
3. Check SELinux context (if applicable): `ls -Z /path/to/volume`

### Monitoring Stack Issues

**Metrics Not Appearing**:
1. Verify PVE Exporter can reach Proxmox API
2. Check Prometheus scrape targets status
3. Ensure Grafana data source is configured correctly
4. Review retention policies (data may be expired)

**Authentication Failures (PVE Exporter)**:
1. Verify Proxmox user credentials in `.env` file
2. Check user has PVEAuditor role
3. Test API access: `curl -k https://192.168.2.200:8006/api2/json/version`

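If the exporter authenticates with a Proxmox API token rather than a password (an assumption; this README does not say which), the same token can be tested directly against the version endpoint. The sketch below builds the `PVEAPIToken` authorization header that the Proxmox API accepts; the function names, `PVE_HOST`, and the example user and token ID are hypothetical.

```bash
# Proxmox API host from step 3 above; override via PVE_HOST (hypothetical variable).
PVE_HOST="${PVE_HOST:-192.168.2.200}"

# Build the Authorization header for a Proxmox API token.
# Args: user@realm, token ID, token secret.
pve_token_header() {
  printf 'Authorization: PVEAPIToken=%s!%s=%s' "$1" "$2" "$3"
}

# Call the version endpoint with the token. -k skips TLS verification,
# matching the curl command in step 3.
pve_api_version() {
  curl -ksS --max-time 5 -H "$(pve_token_header "$@")" \
    "https://${PVE_HOST}:8006/api2/json/version"
}
```

For example, `pve_api_version 'monitoring@pve' 'exporter' "$TOKEN_SECRET"` should return a JSON version object when the token is valid; an HTTP 401 response points to a wrong token or user.
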
**High Resource Usage**:
1. Adjust Prometheus retention: `--storage.tsdb.retention.time=7d`
2. Reduce scrape frequency in prometheus.yml
3. Limit Grafana query concurrency
4. Increase VM 101 resources if needed

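To judge how much a shorter retention actually saves, the Prometheus documentation gives a rough sizing rule: disk space ≈ retention in seconds × ingested samples per second × bytes per sample (typically 1 to 2 bytes). The helper below is a sketch of that arithmetic; the function name is hypothetical and the 1000 samples per second used in the examples is an illustrative figure, not a measurement from this homelab.

```bash
# Estimate TSDB disk usage in bytes.
# Args: retention in days, ingested samples per second, bytes per sample.
estimate_tsdb_bytes() {
  echo $(( $1 * 86400 * $2 * $3 ))
}
```

With these illustrative inputs, `estimate_tsdb_bytes 15 1000 2` prints `2592000000` (about 2.6 GB) and `estimate_tsdb_bytes 7 1000 2` prints `1209600000` (about 1.2 GB). The real ingestion rate can be read from Prometheus with the query `rate(prometheus_tsdb_head_samples_appended_total[5m])`.
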
### Twingate Connector Issues

**Connector Offline**:
1. Check CT 112 is running: `pct status 112`
2. Verify network connectivity from container
3. Check connector token validity in Twingate Console
4. Review container logs for error messages

**Cannot Access Resources**:
1. Verify resource is configured in Twingate Console
2. Check user has permission to access resource
3. Ensure connector is online and healthy
4. Verify network routes on CT 112

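The container-side checks in both lists can be collected into one script run on the Proxmox host. This is a sketch under assumptions: `twingate-connector` as the systemd unit name inside CT 112 is assumed rather than confirmed by this README, and the function names and `CTID` variable are hypothetical.

```bash
# Container ID from this README; override via CTID (hypothetical variable).
CTID="${CTID:-112}"

# Print the diagnostic commands, one per line.
# The unit name "twingate-connector" is an assumption.
connector_checks() {
  cat <<EOF
pct status ${CTID}
pct exec ${CTID} -- systemctl is-active twingate-connector
pct exec ${CTID} -- journalctl -u twingate-connector -n 50 --no-pager
pct exec ${CTID} -- ip route
EOF
}

# Run each check on the Proxmox host, continuing past failures.
run_connector_checks() {
  connector_checks | while read -r cmd; do
    echo "+ $cmd"
    # stdin is redirected so a command cannot consume the remaining lines.
    sh -c "$cmd" </dev/null || echo "  (failed)"
  done
}
```

Calling `connector_checks` alone prints the plan without executing anything, which is useful for reviewing the commands before running `run_connector_checks`.
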
## Migration Notes

### Post-Migration Tasks
@@ -353,6 +585,7 @@ For homelab-specific questions or issues:

---

**Last Updated**: 2025-12-07
**Maintainer**: jramos
**Repository**: http://192.168.2.102:3060/jramos/homelab
**Infrastructure**: 10 VMs, 4 LXC Containers