feat(docs): update documentation for monitoring stack and infrastructure changes

- Update INDEX.md with VM 101 (monitoring-docker) and CT 112 (twingate-connector)
- Update README.md with monitoring and security sections
- Update CLAUDE.md with new architecture patterns
- Update services/README.md with monitoring stack documentation
- Update CLAUDE_STATUS.md with current infrastructure state
- Update infrastructure counts: 10 VMs, 4 Containers
- Update storage stats: PBS 27.43%, Vault 10.88%
- Create comprehensive monitoring/README.md
- Add .gitignore rules for monitoring sensitive files (pve.yml, .env)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-12-07 12:41:08 -07:00
parent 0366c63d51
commit f42eeaba92
7 changed files with 1367 additions and 1000 deletions

monitoring/README.md (new file, 755 lines)

@@ -0,0 +1,755 @@
# Monitoring Stack
Comprehensive monitoring and observability stack for the Proxmox homelab environment, providing real-time metrics, visualization, and alerting capabilities.
## Overview
The monitoring stack consists of three primary components deployed on VM 101 (monitoring-docker) at 192.168.2.114:
- **Grafana**: Visualization and dashboards (Port 3000)
- **Prometheus**: Metrics collection and time-series database (Port 9090)
- **PVE Exporter**: Proxmox VE metrics exporter (Port 9221)
## Architecture
```
┌─────────────────────────────────────────────────────────────────┐
│                   Proxmox Host (serviceslab)                    │
│                          192.168.2.200                          │
└────────────────────────────┬────────────────────────────────────┘
                             │ API (8006)
                    ┌────────▼────────┐
                    │  PVE Exporter   │
                    │   Port: 9221    │
                    │    (VM 101)     │
                    └────────┬────────┘
                             │ Metrics
                    ┌────────▼────────┐
                    │   Prometheus    │
                    │   Port: 9090    │
                    │    (VM 101)     │
                    └────────┬────────┘
                             │ Query
                    ┌────────▼────────┐
                    │     Grafana     │
                    │   Port: 3000    │
                    │    (VM 101)     │
                    └────────┬────────┘
                             │ HTTPS
                    ┌────────▼────────┐
                    │   Nginx Proxy   │
                    │    (CT 102)     │
                    │  192.168.2.101  │
                    └─────────────────┘
```
## Components
### VM 101: monitoring-docker
**Specifications**:
- **IP Address**: 192.168.2.114
- **Operating System**: Ubuntu 22.04/24.04 LTS
- **Docker Version**: 24.0+
- **Purpose**: Dedicated monitoring infrastructure host
**Resource Allocation**:
- **CPU**: 2-4 cores
- **Memory**: 4-8 GB
- **Storage**: 50-100 GB (thin provisioned)
### Grafana
**Version**: Latest stable
**Port**: 3000
**Access**: http://192.168.2.114:3000
**Features**:
- Pre-configured Proxmox VE dashboards
- Prometheus data source integration
- User authentication and authorization
- Dashboard templating and variables
- Alerting capabilities
- Panel plugins for advanced visualizations
**Default Credentials**:
- Username: `admin`
- Password: Check `.env` file or initial setup
**Key Dashboards**:
- Proxmox Host Overview
- VM Resource Utilization
- Container Resource Utilization
- Storage Pool Metrics
- Network Traffic Analysis
### Prometheus
**Version**: Latest stable
**Port**: 9090
**Access**: http://192.168.2.114:9090
**Configuration**: `/home/jramos/homelab/monitoring/prometheus/prometheus.yml`
**Scrape Targets**:
```yaml
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'pve'
    static_configs:
      - targets: ['pve-exporter:9221']
    metrics_path: /pve
    params:
      module: [default]
      target: ['192.168.2.200']  # Proxmox host the exporter should query
```
**Features**:
- Time-series metrics database
- PromQL query language
- Service discovery
- Alert manager integration (configurable)
- Data retention policies
- Remote storage support
**Retention Policy**: 15 days (the Prometheus default; change it with the `--storage.tsdb.retention.time` flag)
### PVE Exporter
**Version**: prompve/prometheus-pve-exporter:latest
**Port**: 9221
**Access**: http://192.168.2.114:9221
**Configuration**:
- File: `/home/jramos/homelab/monitoring/pve-exporter/pve.yml`
- Environment: `/home/jramos/homelab/monitoring/pve-exporter/.env`
**Proxmox Connection**:
```yaml
default:
  user: monitoring@pve
  password: <stored in .env>
  verify_ssl: false
```
**Metrics Exported**:
- Proxmox cluster status
- Node CPU, memory, disk usage
- VM/CT status and resource usage
- Storage pool utilization
- Network interface statistics
- Backup job status
- Service health
**Environment Variables**:
- `PVE_USER`: Proxmox API user (typically `monitoring@pve`)
- `PVE_PASSWORD`: API user password
- `PVE_VERIFY_SSL`: SSL verification (false for self-signed certs)
## Deployment
### Prerequisites
1. **VM 101 Setup**:
```bash
# Install Docker and Docker Compose
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker $USER
# Verify installation
docker --version
docker compose version
```
2. **Proxmox API User**:
```bash
# On Proxmox host, create monitoring user
pveum user add monitoring@pve
pveum passwd monitoring@pve
pveum aclmod / -user monitoring@pve -role PVEAuditor
```
3. **Clone Repository**:
```bash
cd /home/jramos
git clone <repository-url> homelab
cd homelab/monitoring
```
### Configuration
1. **PVE Exporter Environment**:
```bash
cd pve-exporter
nano .env
```
Add:
```env
PVE_USER=monitoring@pve
PVE_PASSWORD=your-secure-password
PVE_VERIFY_SSL=false
```
2. **Verify Configuration Files**:
```bash
# Check PVE exporter config
cat pve-exporter/pve.yml
# Check Prometheus config
cat prometheus/prometheus.yml
```
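Before deploying, it can help to confirm that `.env` defines every variable the exporter reads. The following is a minimal sketch: `check_env_file` is a hypothetical helper name, not something shipped in this repository, and it only checks that the keys are present, not that the credentials work.

```bash
#!/usr/bin/env bash
# Hypothetical helper: verify that an env file defines the three variables
# listed under "Environment Variables".
check_env_file() {
  local file="$1" key missing=0
  for key in PVE_USER PVE_PASSWORD PVE_VERIFY_SSL; do
    # Each variable must appear at the start of a line as KEY=...
    if ! grep -q "^${key}=" "$file"; then
      echo "missing: ${key}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

After sourcing the function, `check_env_file pve-exporter/.env && echo "env OK"` reports any missing key on stderr and returns non-zero.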
### Deployment Steps
1. **Deploy PVE Exporter**:
```bash
cd /home/jramos/homelab/monitoring/pve-exporter
docker compose up -d
docker compose logs -f
```
2. **Deploy Prometheus**:
```bash
cd /home/jramos/homelab/monitoring/prometheus
docker compose up -d
docker compose logs -f
```
3. **Deploy Grafana**:
```bash
cd /home/jramos/homelab/monitoring/grafana
docker compose up -d
docker compose logs -f
```
4. **Verify All Services**:
```bash
# Check running containers
docker ps
# Test PVE Exporter
curl "http://192.168.2.114:9221/pve?target=192.168.2.200&module=default"
# Test Prometheus
curl http://192.168.2.114:9090/-/healthy
# Test Grafana
curl http://192.168.2.114:3000/api/health
```
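The three checks above can also be wrapped in a small script. This is a sketch under stated assumptions: the function names `health_url` and `check_all` and the `MONITORING_HOST` / `PVE_HOST` variables are illustrative, and the endpoints are the ones used in the verification step.

```bash
#!/usr/bin/env bash
# Hypothetical helper: probe each service and report UP or DOWN.
MONITORING_HOST="${MONITORING_HOST:-192.168.2.114}"
PVE_HOST="${PVE_HOST:-192.168.2.200}"

# Print the health-check URL for a given service name.
health_url() {
  case "$1" in
    pve-exporter) echo "http://${MONITORING_HOST}:9221/pve?target=${PVE_HOST}&module=default" ;;
    prometheus)   echo "http://${MONITORING_HOST}:9090/-/healthy" ;;
    grafana)      echo "http://${MONITORING_HOST}:3000/api/health" ;;
    *)            echo "unknown service: $1" >&2; return 1 ;;
  esac
}

# Probe every service; returns non-zero if any probe fails.
check_all() {
  local svc rc=0
  for svc in pve-exporter prometheus grafana; do
    if curl -fsS --max-time 5 -o /dev/null "$(health_url "$svc")"; then
      echo "UP    $svc"
    else
      echo "DOWN  $svc"
      rc=1
    fi
  done
  return "$rc"
}
```

Source the file and run `check_all`; the exit status makes it usable from cron or a CI job.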
### Initial Grafana Setup
1. **Access Grafana**:
- Navigate to http://192.168.2.114:3000
- Log in as `admin`, using the password from `.env` (`admin` on a fresh install if none was set)
- Change password when prompted
2. **Add Prometheus Data Source**:
- Go to Connections → Data sources (Configuration → Data Sources in older Grafana releases)
- Click "Add data source"
- Select "Prometheus"
- URL: `http://prometheus:9090`
- Click "Save & Test"
3. **Import Proxmox Dashboard**:
- Go to Dashboards → Import
- Dashboard ID: 10347 (Proxmox VE)
- Select Prometheus data source
- Click "Import"
4. **Configure Alerting** (Optional):
- Go to Alerting → Contact points (Notification channels in legacy alerting)
- Add email, Slack, or other notification methods
- Create alert rules in dashboards
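The data source from step 2 can also be provisioned from a file rather than through the UI. The sketch below writes a Grafana data source provisioning file; it assumes `grafana/provisioning` is mounted at `/etc/grafana/provisioning` inside the Grafana container, which this README does not confirm. Run it from the `monitoring` directory.

```bash
#!/usr/bin/env bash
# Sketch: declare the Prometheus data source as a provisioning file.
# Assumption: grafana/provisioning is mounted at /etc/grafana/provisioning.
mkdir -p grafana/provisioning/datasources
cat > grafana/provisioning/datasources/prometheus.yml <<'EOF'
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
EOF
```

Grafana reads provisioning files at startup, so the container needs a restart before the data source appears.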
## Network Configuration
### Internal Access
All services are accessible within the homelab network:
- **Grafana**: http://192.168.2.114:3000
- **Prometheus**: http://192.168.2.114:9090
- **PVE Exporter**: http://192.168.2.114:9221
### External Access (via Nginx Proxy Manager)
Configure reverse proxy on CT 102 (nginx at 192.168.2.101):
1. **Create Proxy Host**:
- Domain: `monitoring.yourdomain.com`
- Scheme: `http`
- Forward Hostname: `192.168.2.114`
- Forward Port: `3000`
2. **SSL Configuration**:
- Enable "Force SSL"
- Request Let's Encrypt certificate
- Enable HTTP/2
3. **Access List** (Optional):
- Create access list for authentication
- Apply to proxy host for additional security
## Maintenance
### Update Services
```bash
# Update all monitoring services
cd /home/jramos/homelab/monitoring
# Update PVE Exporter
cd pve-exporter
docker compose pull
docker compose up -d
# Update Prometheus
cd ../prometheus
docker compose pull
docker compose up -d
# Update Grafana
cd ../grafana
docker compose pull
docker compose up -d
```
### Backup Grafana Dashboards
```bash
# Backup Grafana data (no -t: a TTY would corrupt the binary stream)
docker exec grafana tar czf - /var/lib/grafana > grafana-backup-$(date +%Y%m%d).tar.gz
# Or use Grafana's provisioning
# Dashboards can be exported as JSON and stored in git
```
### Prometheus Data Retention
```bash
# Check Prometheus storage size
docker exec prometheus du -sh /prometheus
# Adjust retention in docker-compose.yml:
#   command:
#     - '--storage.tsdb.retention.time=30d'
#     - '--storage.tsdb.retention.size=50GB'
```
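When choosing a retention value, a rough disk estimate helps. The Prometheus storage documentation gives the rule of thumb `retention_seconds × samples_per_second × bytes_per_sample`, with an average of about 1 to 2 bytes per sample. The sketch below applies that formula; `estimate_tsdb_bytes` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Rough TSDB size estimate: retention seconds x samples/sec x bytes/sample.
# The default of 2 bytes is the upper end of the 1-2 bytes per sample that
# the Prometheus documentation cites as an average.
estimate_tsdb_bytes() {
  local retention_days="$1" samples_per_sec="$2" bytes_per_sample="${3:-2}"
  echo $(( retention_days * 86400 * samples_per_sec * bytes_per_sample ))
}
```

The ingestion rate can be read in the Prometheus UI with `rate(prometheus_tsdb_head_samples_appended_total[5m])`. For example, `estimate_tsdb_bytes 30 500` prints `2592000000`, or roughly 2.6 GB for 30 days at 500 samples per second.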
### View Logs
```bash
# PVE Exporter logs
cd /home/jramos/homelab/monitoring/pve-exporter
docker compose logs -f
# Prometheus logs
cd /home/jramos/homelab/monitoring/prometheus
docker compose logs -f
# Grafana logs
cd /home/jramos/homelab/monitoring/grafana
docker compose logs -f
# Or follow one container by name (each command blocks, so run one per terminal)
docker logs -f pve-exporter
docker logs -f prometheus
docker logs -f grafana
```
## Troubleshooting
### PVE Exporter Cannot Connect to Proxmox
**Symptoms**: No metrics from Proxmox, connection refused errors
**Solutions**:
1. Verify Proxmox API is accessible:
```bash
curl -k https://192.168.2.200:8006/api2/json/version
```
2. Check PVE Exporter environment variables:
```bash
cd /home/jramos/homelab/monitoring/pve-exporter
cat .env
docker compose config
```
3. Test authentication:
```bash
# From VM 101
curl -k -d "username=monitoring@pve&password=yourpassword" \
https://192.168.2.200:8006/api2/json/access/ticket
```
4. Verify user permissions on Proxmox:
```bash
# On Proxmox host
pveum user list
pveum acl list   # confirm monitoring@pve holds PVEAuditor on /
# Re-apply the role if it is missing
pveum aclmod / -user monitoring@pve -role PVEAuditor
```
### Prometheus Not Scraping Targets
**Symptoms**: Targets shown as down in Prometheus UI
**Solutions**:
1. Check Prometheus targets:
- Navigate to http://192.168.2.114:9090/targets
- Verify target status and error messages
2. Verify network connectivity:
```bash
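# The prom/prometheus image ships BusyBox wget rather than curl
docker exec prometheus wget -qO- "http://pve-exporter:9221/pve?target=192.168.2.200&module=default"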
```
3. Check Prometheus configuration:
```bash
cd /home/jramos/homelab/monitoring/prometheus
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml
```
4. Reload Prometheus configuration:
```bash
docker compose restart prometheus
```
### Grafana Shows No Data
**Symptoms**: Dashboards display "No data" or empty graphs
**Solutions**:
1. Verify Prometheus data source:
- Go to Connections → Data sources (Configuration → Data Sources in older Grafana releases)
- Test connection to Prometheus
- URL should be `http://prometheus:9090`
2. Check Prometheus has data:
- Navigate to http://192.168.2.114:9090
- Run query: `up`
- Should show all scrape targets
3. Verify dashboard queries:
- Edit panel
- Check PromQL query syntax
- Test query in Prometheus UI first
4. Check time range:
- Ensure dashboard time range includes recent data
- Confirm the range does not reach back past the Prometheus retention period
### Docker Compose Network Issues
**Symptoms**: Containers cannot communicate
**Solutions**:
1. Check Docker network:
```bash
docker network ls
docker network inspect monitoring_default
```
2. Verify container connectivity:
```bash
docker exec prometheus ping -c 3 pve-exporter
docker exec grafana ping -c 3 prometheus
```
3. Recreate network:
```bash
cd /home/jramos/homelab/monitoring
docker compose down
docker network prune
docker compose up -d
```
### High Memory Usage
**Symptoms**: VM 101 running out of memory
**Solutions**:
1. Check container memory usage:
```bash
docker stats
```
2. Reduce Prometheus retention:
```yaml
# In prometheus/docker-compose.yml
command:
  - '--storage.tsdb.retention.time=7d'
  - '--storage.tsdb.retention.size=10GB'
```
3. Limit Grafana image rendering:
```yaml
# In grafana/docker-compose.yml
environment:
  - GF_RENDERING_SERVER_URL=
  - GF_RENDERING_CALLBACK_URL=
```
4. Increase VM memory allocation in Proxmox
### SSL/TLS Certificate Errors
**Symptoms**: PVE Exporter cannot verify SSL certificate
**Solutions**:
1. Set `verify_ssl: false` in `pve.yml` (for self-signed certs)
2. Or import Proxmox CA certificate:
```bash
# Copy CA from Proxmox to VM 101
scp root@192.168.2.200:/etc/pve/pve-root-ca.pem .
# Add to trust store
sudo cp pve-root-ca.pem /usr/local/share/ca-certificates/pve-root-ca.crt
sudo update-ca-certificates
```
## Metrics Reference
### Key Proxmox Metrics
The exporter publishes one metric family per measurement and distinguishes nodes, guests, and storage by the `id` label. The names below follow the prometheus-pve-exporter README; confirm them against the `/pve` output of the deployed version.
**Node Metrics** (series with `id="node/<name>"`):
- `pve_cpu_usage_ratio`: CPU utilization (0-1)
- `pve_memory_usage_bytes`: Memory used
- `pve_memory_size_bytes`: Total memory
- `pve_disk_usage_bytes`: Root disk used
- `pve_uptime_seconds`: Node uptime
**VM/CT Metrics** (series with `id="qemu/<vmid>"` or `id="lxc/<vmid>"`):
- `pve_guest_info`: Guest information (labels: id, name, type, node)
- `pve_cpu_usage_ratio`: Guest CPU usage
- `pve_memory_usage_bytes`: Guest memory used
- `pve_disk_read_bytes`: Disk read bytes
- `pve_disk_write_bytes`: Disk write bytes
- `pve_network_receive_bytes`: Network received
- `pve_network_transmit_bytes`: Network transmitted
**Storage Metrics** (series with `id="storage/<node>/<storage>"`):
- `pve_disk_usage_bytes`: Storage used
- `pve_disk_size_bytes`: Total storage size
- `pve_storage_info`: Storage information (labels: id, node, storage)
### Useful PromQL Queries
**CPU Usage by VM**:
```promql
pve_cpu_usage_ratio{id=~"qemu/.*"} * 100
```
**Memory Usage Percentage**:
```promql
(pve_memory_usage_bytes{id=~"qemu/.*|lxc/.*"} / pve_memory_size_bytes{id=~"qemu/.*|lxc/.*"}) * 100
```
**Storage Usage Percentage**:
```promql
(pve_disk_usage_bytes{id=~"storage/.*"} / pve_disk_size_bytes{id=~"storage/.*"}) * 100
```
**Network Bandwidth (rate)**:
```promql
rate(pve_network_transmit_bytes{id=~"qemu/.*|lxc/.*"}[5m])
```
**Top 5 VMs by CPU**:
```promql
topk(5, pve_cpu_usage_ratio{id=~"qemu/.*"})
```
## Security Considerations
### API Credentials
1. **PVE Exporter `.env` file**:
- Never commit to version control
- Use strong passwords
- Restrict file permissions: `chmod 600 .env`
2. **Proxmox API User**:
- Use dedicated monitoring user
- Grant minimal required permissions (PVEAuditor role)
- Consider token-based authentication
3. **Grafana Authentication**:
- Change default admin password
- Enable OAuth/LDAP for user authentication
- Use role-based access control
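For the token-based authentication suggested above, the exporter can read a token instead of a password. The sketch below is illustrative: `write_token_env` is a hypothetical helper, the token ID `exporter` is a made-up example, and `PVE_TOKEN_NAME` / `PVE_TOKEN_VALUE` are the variable names given in the prometheus-pve-exporter README.

```bash
#!/usr/bin/env bash
# Sketch of token-based credentials for the exporter.
# On the Proxmox host, a token could be created with:
#   pveum user token add monitoring@pve exporter --privsep 0
# (--privsep 0 lets the token inherit the user's PVEAuditor role.)
#
# Hypothetical helper: write an env file that uses the token instead of a
# password.
write_token_env() {
  local file="$1" token_name="$2" token_value="$3"
  (
    umask 077  # a newly created file ends up with mode 600
    cat > "$file" <<EOF
PVE_USER=monitoring@pve
PVE_TOKEN_NAME=${token_name}
PVE_TOKEN_VALUE=${token_value}
PVE_VERIFY_SSL=false
EOF
  )
}
```

Call it as `write_token_env pve-exporter/.env exporter <token-secret>`, using the secret that `pveum` prints once when the token is created.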
### Network Security
1. **Firewall Rules**:
```bash
# On VM 101, restrict access
ufw allow from 192.168.2.0/24 to any port 3000
ufw allow from 192.168.2.0/24 to any port 9090
ufw allow from 192.168.2.0/24 to any port 9221
```
2. **Reverse Proxy**:
- Use Nginx Proxy Manager for SSL termination
- Implement access lists
- Enable fail2ban for brute force protection
3. **Docker Security**:
- Run containers as non-root users
- Use read-only filesystems where possible
- Limit container capabilities
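The Docker hardening points above could be expressed as a Compose override file, which Compose merges into `docker-compose.yml` automatically. This is a sketch only: the service name `pve-exporter` is an assumption about the existing compose file, and whether the container still starts with a read-only filesystem should be checked after applying it.

```bash
#!/usr/bin/env bash
# Sketch: hardening options in a Compose override file.
# Assumption: the service in pve-exporter/docker-compose.yml is named
# "pve-exporter".
mkdir -p pve-exporter
cat > pve-exporter/docker-compose.override.yml <<'EOF'
services:
  pve-exporter:
    read_only: true          # root filesystem mounted read-only
    cap_drop:
      - ALL                  # drop every Linux capability
    security_opt:
      - no-new-privileges:true
EOF
```

Running `docker compose config` in the `pve-exporter` directory shows the merged result before the service is restarted.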
## Performance Tuning
### Prometheus Optimization
**Scrape Interval**:
```yaml
global:
  scrape_interval: 30s  # Increase for less frequent scraping
  evaluation_interval: 30s
```
**Target Relabeling**:
```yaml
relabel_configs:
  - source_labels: [__address__]
    regex: '.*'
    action: keep  # Keep only matching targets
```
### Grafana Optimization
**Query Optimization**:
- Use recording rules in Prometheus for complex queries
- Set appropriate refresh intervals on dashboards
- Limit time range on expensive queries
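A recording rule, as suggested in the first point, might look like the sketch below. The file name `recording_rules.yml` and the rule name `pve:storage_usage:percent` are illustrative, and the metric names follow the prometheus-pve-exporter README, so they should be checked against the exporter's actual `/pve` output.

```bash
#!/usr/bin/env bash
# Sketch: precompute storage usage once per minute so dashboards can query
# an already-evaluated series. File and rule names are illustrative.
mkdir -p prometheus
cat > prometheus/recording_rules.yml <<'EOF'
groups:
  - name: proxmox_recording
    interval: 1m
    rules:
      - record: pve:storage_usage:percent
        expr: |
          100 * pve_disk_usage_bytes{id=~"storage/.*"}
              / pve_disk_size_bytes{id=~"storage/.*"}
EOF
```

The file still has to be listed under `rule_files` in `prometheus.yml` and mounted into the container; `promtool check rules` can validate it before a restart.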
**Caching** (query caching requires Grafana Enterprise or Grafana Cloud):
```ini
# In grafana.ini or environment variables
[caching]
enabled = true
ttl = 3600
```
## Advanced Configuration
### Alerting with Alertmanager
1. **Add Alertmanager to stack**:
```bash
cd /home/jramos/homelab/monitoring
# Create alertmanager directory with docker-compose.yml
```
2. **Configure alerts in Prometheus**:
```yaml
# In prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
rule_files:
  - 'alerts.yml'
```
3. **Example alert rules**:
```yaml
# alerts.yml
groups:
  - name: proxmox
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: pve_cpu_usage_ratio{id=~"node/.*"} > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.id }}"
```
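Step 1 leaves the contents of the `alertmanager` directory open. The sketch below is one possible starting point that follows the per-service layout used elsewhere in this stack. It assumes the Alertmanager container shares a Docker network with Prometheus so that `alertmanager:9093` resolves, and the `default` receiver is deliberately empty, so alerts are accepted but nothing is sent until a real receiver is configured.

```bash
#!/usr/bin/env bash
# Sketch of an alertmanager/ directory: a compose file plus a minimal config.
# The empty "default" receiver accepts alerts but delivers nothing.
mkdir -p alertmanager
cat > alertmanager/docker-compose.yml <<'EOF'
services:
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    restart: unless-stopped
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
EOF
cat > alertmanager/alertmanager.yml <<'EOF'
route:
  receiver: default
  group_by: ['alertname']
receivers:
  - name: default
EOF
```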
### Multi-Node Proxmox Cluster
For clustered Proxmox environments:
```yaml
# In pve.yml
cluster1:
  user: monitoring@pve
  password: ${PVE_PASSWORD}
  verify_ssl: false
cluster2:
  user: monitoring@pve
  password: ${PVE_PASSWORD}
  verify_ssl: false
```
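Each module defined in `pve.yml` needs a matching scrape job on the Prometheus side. The sketch below writes an example job for `cluster1` using the relabeling pattern from the prometheus-pve-exporter README; the output file name is illustrative, and a second job with `module: [cluster2]` and that cluster's address would follow the same shape.

```bash
#!/usr/bin/env bash
# Sketch: one scrape job per pve.yml module. The example file is meant to be
# copied into prometheus.yml by hand, not loaded directly.
mkdir -p prometheus
cat > prometheus/pve-cluster-scrape.example.yml <<'EOF'
scrape_configs:
  - job_name: 'pve-cluster1'
    metrics_path: /pve
    params:
      module: [cluster1]
    static_configs:
      - targets: ['192.168.2.200']  # add further Proxmox nodes here
    relabel_configs:
      # Pass each listed address to the exporter as ?target=<address>
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the Proxmox address as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the actual scrape request to the exporter
      - target_label: __address__
        replacement: pve-exporter:9221
EOF
```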
### Dashboard Provisioning
Store dashboards as code:
```bash
# Create provisioning directory
mkdir -p grafana/provisioning/dashboards
# Add provisioning config
# grafana/provisioning/dashboards/dashboards.yml
```
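The provisioning config referenced above could look like the following sketch. The provider name `homelab` and the container path `/var/lib/grafana/dashboards` are assumptions; the path must match wherever the exported dashboard JSON files are mounted in the Grafana container.

```bash
#!/usr/bin/env bash
# Sketch of grafana/provisioning/dashboards/dashboards.yml.
# Assumption: dashboard JSON files are mounted at /var/lib/grafana/dashboards.
mkdir -p grafana/provisioning/dashboards
cat > grafana/provisioning/dashboards/dashboards.yml <<'EOF'
apiVersion: 1
providers:
  - name: homelab
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
EOF
```

With this in place, dashboards exported as JSON and committed to git are loaded at startup and rescanned every 30 seconds.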
## Integration with Other Services
### n8n Workflow Automation
Create workflows in n8n (CT 113) to:
- Send alerts to Slack/Discord based on Prometheus alerts
- Generate daily/weekly infrastructure reports
- Automate backup verification checks
### NetBox IPAM
Sync monitoring targets with NetBox (CT 103):
- Automatically discover new VMs/CTs
- Update service inventory
- Link metrics to network documentation
## Additional Resources
### Documentation
- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [PVE Exporter GitHub](https://github.com/prometheus-pve/prometheus-pve-exporter)
- [Proxmox API Documentation](https://pve.proxmox.com/pve-docs/api-viewer/)
### Community Dashboards
- Grafana Dashboard 10347: Proxmox VE
- Grafana Dashboard 15356: Proxmox Cluster
- Grafana Dashboard 15362: Proxmox Summary
### Related Homelab Documentation
- [Homelab Overview](../README.md)
- [Services Documentation](../services/README.md)
- [Infrastructure Index](../INDEX.md)
- [n8n Setup Guide](../services/README.md#n8n-workflow-automation)
---
**Last Updated**: 2025-12-07
**Maintainer**: jramos
**VM**: 101 (monitoring-docker) at 192.168.2.114
**Stack Version**: Prometheus 2.x, Grafana 10.x, PVE Exporter latest