# Monitoring Stack

Comprehensive monitoring and observability stack for the Proxmox homelab environment, providing real-time metrics, visualization, and alerting capabilities.

## Overview

The monitoring stack consists of three primary components deployed on VM 101 (monitoring-docker) at 192.168.2.114:

- **Grafana**: Visualization and dashboards (Port 3000)
- **Prometheus**: Metrics collection and time-series database (Port 9090)
- **PVE Exporter**: Proxmox VE metrics exporter (Port 9221)

## Architecture

```
┌─────────────────────────────────────────────────────────────────┐
│                   Proxmox Host (serviceslab)                    │
│                          192.168.2.200                          │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             │ API (8006)
                             │
                    ┌────────▼────────┐
                    │  PVE Exporter   │
                    │   Port: 9221    │
                    │    (VM 101)     │
                    └────────┬────────┘
                             │
                             │ Metrics
                             │
                    ┌────────▼────────┐
                    │   Prometheus    │
                    │   Port: 9090    │
                    │    (VM 101)     │
                    └────────┬────────┘
                             │
                             │ Query
                             │
                    ┌────────▼────────┐
                    │     Grafana     │
                    │   Port: 3000    │
                    │    (VM 101)     │
                    └────────┬────────┘
                             │
                             │ HTTPS
                             │
                    ┌────────▼────────┐
                    │   Nginx Proxy   │
                    │    (CT 102)     │
                    │  192.168.2.101  │
                    └─────────────────┘
```

## Components

### VM 101: monitoring-docker

**Specifications**:
- **IP Address**: 192.168.2.114
- **Operating System**: Ubuntu 22.04/24.04 LTS
- **Docker Version**: 24.0+
- **Purpose**: Dedicated monitoring infrastructure host

**Resource Allocation**:
- **CPU**: 2-4 cores
- **Memory**: 4-8 GB
- **Storage**: 50-100 GB (thin provisioned)

### Grafana

- **Version**: Latest stable
- **Port**: 3000
- **Access**: http://192.168.2.114:3000

**Features**:
- Pre-configured Proxmox VE dashboards
- Prometheus data source integration
- User authentication and authorization
- Dashboard templating and variables
- Alerting capabilities
- Panel plugins for advanced visualizations

**Default Credentials**:
- Username: `admin`
- Password: Check `.env` file or initial setup

**Key Dashboards**:
- Proxmox Host Overview
- VM Resource Utilization
- Container Resource Utilization
- Storage Pool Metrics
- Network Traffic
Analysis

### Prometheus

- **Version**: Latest stable
- **Port**: 9090
- **Access**: http://192.168.2.114:9090
- **Configuration**: `/home/jramos/homelab/monitoring/prometheus/prometheus.yml`

**Scrape Targets**:
```yaml
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'pve'
    static_configs:
      - targets: ['pve-exporter:9221']
    metrics_path: /pve
    params:
      module: [default]
      # Proxmox host the exporter should query; without a target the
      # exporter queries localhost, which is the container itself here
      target: ['192.168.2.200']
```

**Features**:
- Time-series metrics database
- PromQL query language
- Service discovery
- Alertmanager integration (configurable)
- Data retention policies
- Remote storage support

**Retention Policy**: 15 days (configurable via command-line arguments)

### PVE Exporter

- **Version**: prompve/prometheus-pve-exporter:latest
- **Port**: 9221
- **Access**: http://192.168.2.114:9221

**Configuration**:
- File: `/home/jramos/homelab/monitoring/pve-exporter/pve.yml`
- Environment: `/home/jramos/homelab/monitoring/pve-exporter/.env`

**Proxmox Connection**:
```yaml
default:
  user: monitoring@pve
  password:
  verify_ssl: false
```

**Metrics Exported**:
- Proxmox cluster status
- Node CPU, memory, disk usage
- VM/CT status and resource usage
- Storage pool utilization
- Network interface statistics
- Backup job status
- Service health

**Environment Variables**:
- `PVE_USER`: Proxmox API user (typically `monitoring@pve`)
- `PVE_PASSWORD`: API user password
- `PVE_VERIFY_SSL`: SSL verification (false for self-signed certs)

## Deployment

### Prerequisites

1. **VM 101 Setup**:
   ```bash
   # Install Docker and Docker Compose
   curl -fsSL https://get.docker.com | sh
   # Log out and back in afterwards so the group change takes effect
   sudo usermod -aG docker $USER

   # Verify installation
   docker --version
   docker compose version
   ```

2. **Proxmox API User**:
   ```bash
   # On Proxmox host, create monitoring user
   pveum user add monitoring@pve
   pveum passwd monitoring@pve
   pveum aclmod / -user monitoring@pve -role PVEAuditor
   ```

3. **Clone Repository**:
   ```bash
   cd /home/jramos
   git clone homelab
   cd homelab/monitoring
   ```

### Configuration

1.
   **PVE Exporter Environment**:
   ```bash
   cd pve-exporter
   nano .env
   ```

   Add:
   ```env
   PVE_USER=monitoring@pve
   PVE_PASSWORD=your-secure-password
   PVE_VERIFY_SSL=false
   ```

2. **Verify Configuration Files**:
   ```bash
   # Return to the monitoring directory so the relative paths resolve
   cd /home/jramos/homelab/monitoring

   # Check PVE exporter config
   cat pve-exporter/pve.yml

   # Check Prometheus config
   cat prometheus/prometheus.yml
   ```

### Deployment Steps

1. **Deploy PVE Exporter**:
   ```bash
   cd /home/jramos/homelab/monitoring/pve-exporter
   docker compose up -d
   docker compose logs -f   # Ctrl+C stops following; the container keeps running
   ```

2. **Deploy Prometheus**:
   ```bash
   cd /home/jramos/homelab/monitoring/prometheus
   docker compose up -d
   docker compose logs -f
   ```

3. **Deploy Grafana**:
   ```bash
   cd /home/jramos/homelab/monitoring/grafana
   docker compose up -d
   docker compose logs -f
   ```

4. **Verify All Services**:
   ```bash
   # Check running containers
   docker ps

   # Test PVE Exporter (quote the URL so the shell does not treat & as "run in background")
   curl 'http://192.168.2.114:9221/pve?target=192.168.2.200&module=default'

   # Test Prometheus
   curl http://192.168.2.114:9090/-/healthy

   # Test Grafana
   curl http://192.168.2.114:3000/api/health
   ```

### Initial Grafana Setup

1. **Access Grafana**:
   - Navigate to http://192.168.2.114:3000
   - Log in with the default credentials (admin/admin)
   - Change the password when prompted

2. **Add Prometheus Data Source**:
   - Go to Connections → Data sources (Configuration → Data Sources in Grafana 9 and earlier)
   - Click "Add data source"
   - Select "Prometheus"
   - URL: `http://prometheus:9090`
   - Click "Save & Test"

3. **Import Proxmox Dashboard**:
   - Go to Dashboards → Import
   - Dashboard ID: 10347 (Proxmox VE)
   - Select Prometheus data source
   - Click "Import"

4. **Configure Alerting** (Optional):
   - Go to Alerting → Contact points (called Notification channels under legacy alerting)
   - Add email, Slack, or other notification methods
   - Create alert rules (Alerting → Alert rules, or from a dashboard panel)

## Network Configuration

### Internal Access

All services are accessible within the homelab network:

- **Grafana**: http://192.168.2.114:3000
- **Prometheus**: http://192.168.2.114:9090
- **PVE Exporter**: http://192.168.2.114:9221

### External Access (via Nginx Proxy Manager)

Configure reverse proxy on CT 102 (nginx at 192.168.2.101):

1.
   **Create Proxy Host**:
   - Domain: `monitoring.yourdomain.com`
   - Scheme: `http`
   - Forward Hostname: `192.168.2.114`
   - Forward Port: `3000`

2. **SSL Configuration**:
   - Enable "Force SSL"
   - Request Let's Encrypt certificate
   - Enable HTTP/2

3. **Access List** (Optional):
   - Create access list for authentication
   - Apply to proxy host for additional security

## Maintenance

### Update Services

```bash
# Update all monitoring services
cd /home/jramos/homelab/monitoring

# Update PVE Exporter
cd pve-exporter
docker compose pull
docker compose up -d

# Update Prometheus
cd ../prometheus
docker compose pull
docker compose up -d

# Update Grafana
cd ../grafana
docker compose pull
docker compose up -d
```

### Backup Grafana Dashboards

```bash
# Backup Grafana data (no -t: a TTY would corrupt the binary tar stream)
docker exec grafana tar czf - /var/lib/grafana > grafana-backup-$(date +%Y%m%d).tar.gz

# Or use Grafana's provisioning
# Dashboards can be exported as JSON and stored in git
```

### Prometheus Data Retention

```bash
# Check Prometheus storage size
docker exec prometheus du -sh /prometheus

# Adjust retention in docker-compose.yml:
# command:
#   - '--storage.tsdb.retention.time=30d'
#   - '--storage.tsdb.retention.size=50GB'
```

### View Logs

```bash
# PVE Exporter logs
cd /home/jramos/homelab/monitoring/pve-exporter
docker compose logs -f

# Prometheus logs
cd /home/jramos/homelab/monitoring/prometheus
docker compose logs -f

# Grafana logs
cd /home/jramos/homelab/monitoring/grafana
docker compose logs -f

# Or by container name from any directory
# (-f blocks, so run each command in its own terminal)
docker logs -f pve-exporter
docker logs -f prometheus
docker logs -f grafana
```

## Troubleshooting

### PVE Exporter Cannot Connect to Proxmox

**Symptoms**: No metrics from Proxmox, connection refused errors

**Solutions**:

1. Verify Proxmox API is accessible:
   ```bash
   curl -k https://192.168.2.200:8006/api2/json/version
   ```

2. Check PVE Exporter environment variables:
   ```bash
   cd /home/jramos/homelab/monitoring/pve-exporter
   cat .env
   docker compose config
   ```

3.
   Test authentication:
   ```bash
   # From VM 101 (--data-urlencode keeps special characters in the password intact)
   curl -k --data-urlencode "username=monitoring@pve" \
     --data-urlencode "password=yourpassword" \
     https://192.168.2.200:8006/api2/json/access/ticket
   ```

4. Verify user permissions on Proxmox:
   ```bash
   # On Proxmox host
   pveum user list
   # Re-apply the read-only role if it is missing
   pveum aclmod / -user monitoring@pve -role PVEAuditor
   ```

### Prometheus Not Scraping Targets

**Symptoms**: Targets shown as down in Prometheus UI

**Solutions**:

1. Check Prometheus targets:
   - Navigate to http://192.168.2.114:9090/targets
   - Verify target status and error messages

2. Verify network connectivity:
   ```bash
   # The prom/prometheus image ships BusyBox wget rather than curl
   docker exec prometheus wget -qO- 'http://pve-exporter:9221/pve?target=192.168.2.200&module=default'
   ```

3. Check Prometheus configuration:
   ```bash
   cd /home/jramos/homelab/monitoring/prometheus
   docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml
   ```

4. Reload Prometheus configuration:
   ```bash
   docker compose restart prometheus
   ```

### Grafana Shows No Data

**Symptoms**: Dashboards display "No data" or empty graphs

**Solutions**:

1. Verify Prometheus data source:
   - Go to Connections → Data sources (Configuration → Data Sources in Grafana 9 and earlier)
   - Test connection to Prometheus
   - URL should be `http://prometheus:9090`

2. Check Prometheus has data:
   - Navigate to http://192.168.2.114:9090
   - Run query: `up`
   - The result should list every scrape target

3. Verify dashboard queries:
   - Edit panel
   - Check PromQL query syntax
   - Test query in Prometheus UI first

4. Check time range:
   - Ensure the dashboard time range includes recent data
   - Confirm the requested data is not older than the Prometheus retention period

### Docker Compose Network Issues

**Symptoms**: Containers cannot communicate

**Solutions**:

1. Check Docker network:
   ```bash
   docker network ls
   docker network inspect monitoring_default
   ```

2. Verify container connectivity:
   ```bash
   # -c 3 sends three probes and exits instead of pinging forever
   docker exec prometheus ping -c 3 pve-exporter
   docker exec grafana ping -c 3 prometheus
   ```

3. Recreate network:
   ```bash
   cd /home/jramos/homelab/monitoring
   docker compose down
   docker network prune
   docker compose up -d
   ```

### High Memory Usage

**Symptoms**: VM 101 running out of memory

**Solutions**:

1.
   Check container memory usage:
   ```bash
   docker stats
   ```

2. Reduce Prometheus retention:
   ```yaml
   # In prometheus/docker-compose.yml
   command:
     - '--storage.tsdb.retention.time=7d'
     - '--storage.tsdb.retention.size=10GB'
   ```

3. Limit Grafana image rendering:
   ```yaml
   # In grafana/docker-compose.yml
   environment:
     - GF_RENDERING_SERVER_URL=
     - GF_RENDERING_CALLBACK_URL=
   ```

4. Increase VM memory allocation in Proxmox

### SSL/TLS Certificate Errors

**Symptoms**: PVE Exporter cannot verify SSL certificate

**Solutions**:

1. Set `verify_ssl: false` in `pve.yml` (for self-signed certs)

2. Or import Proxmox CA certificate:
   ```bash
   # Copy CA from Proxmox to VM 101
   scp root@192.168.2.200:/etc/pve/pve-root-ca.pem .

   # Add to trust store
   sudo cp pve-root-ca.pem /usr/local/share/ca-certificates/pve-root-ca.crt
   sudo update-ca-certificates
   ```

## Metrics Reference

### Key Proxmox Metrics

**Node Metrics**:
- `pve_node_cpu_usage_ratio`: CPU utilization (0-1)
- `pve_node_memory_usage_bytes`: Memory used
- `pve_node_memory_total_bytes`: Total memory
- `pve_node_disk_usage_bytes`: Root disk used
- `pve_node_uptime_seconds`: Node uptime

**VM/CT Metrics**:
- `pve_guest_info`: Guest information (labels: id, name, type, node)
- `pve_guest_cpu_usage_ratio`: Guest CPU usage
- `pve_guest_memory_usage_bytes`: Guest memory used
- `pve_guest_disk_read_bytes_total`: Disk read bytes
- `pve_guest_disk_write_bytes_total`: Disk write bytes
- `pve_guest_network_receive_bytes_total`: Network received
- `pve_guest_network_transmit_bytes_total`: Network transmitted

**Storage Metrics**:
- `pve_storage_usage_bytes`: Storage used
- `pve_storage_size_bytes`: Total storage size
- `pve_storage_info`: Storage information (labels: storage, type)

### Useful PromQL Queries

**CPU Usage by VM**:
```promql
pve_guest_cpu_usage_ratio{type="qemu"} * 100
```

**Memory Usage Percentage**:
```promql
(pve_guest_memory_usage_bytes / pve_guest_memory_size_bytes) * 100
```

**Storage Usage Percentage**:
```promql
(pve_storage_usage_bytes
/ pve_storage_size_bytes) * 100
```

**Network Bandwidth (rate)**:
```promql
rate(pve_guest_network_transmit_bytes_total[5m])
```

**Top 5 VMs by CPU**:
```promql
topk(5, pve_guest_cpu_usage_ratio{type="qemu"})
```

## Security Considerations

### API Credentials

1. **PVE Exporter `.env` file**:
   - Never commit to version control
   - Use strong passwords
   - Restrict file permissions: `chmod 600 .env`

2. **Proxmox API User**:
   - Use dedicated monitoring user
   - Grant minimal required permissions (PVEAuditor role)
   - Consider token-based authentication

3. **Grafana Authentication**:
   - Change default admin password
   - Enable OAuth/LDAP for user authentication
   - Use role-based access control

### Network Security

1. **Firewall Rules**:
   ```bash
   # On VM 101, restrict access to the homelab subnet
   # Note: ports published by Docker bypass ufw by default, because
   # Docker manages its own iptables rules
   sudo ufw allow from 192.168.2.0/24 to any port 3000
   sudo ufw allow from 192.168.2.0/24 to any port 9090
   sudo ufw allow from 192.168.2.0/24 to any port 9221
   ```

2. **Reverse Proxy**:
   - Use Nginx Proxy Manager for SSL termination
   - Implement access lists
   - Enable fail2ban for brute force protection

3. **Docker Security**:
   - Run containers as non-root users
   - Use read-only filesystems where possible
   - Limit container capabilities

## Performance Tuning

### Prometheus Optimization

**Scrape Interval**:
```yaml
global:
  scrape_interval: 30s  # Increase for less frequent scraping
  evaluation_interval: 30s
```

**Target Relabeling**:
```yaml
relabel_configs:
  - source_labels: [__address__]
    regex: '.*'
    action: keep  # Keep only matching targets
```

### Grafana Optimization

**Query Optimization**:
- Use recording rules in Prometheus for complex queries
- Set appropriate refresh intervals on dashboards
- Limit time range on expensive queries

**Caching**:
```ini
# In grafana.ini or environment variables
[caching]
enabled = true
ttl = 3600
```

## Advanced Configuration

### Alerting with Alertmanager

1. **Add Alertmanager to stack**:
   ```bash
   cd /home/jramos/homelab/monitoring
   # Create alertmanager directory with docker-compose.yml
   ```

2.
   **Configure alerts in Prometheus**:
   ```yaml
   # In prometheus.yml
   alerting:
     alertmanagers:
       - static_configs:
           - targets: ['alertmanager:9093']

   rule_files:
     - 'alerts.yml'
   ```

3. **Example alert rules**:
   ```yaml
   # alerts.yml
   groups:
     - name: proxmox
       interval: 30s
       rules:
         - alert: HighCPUUsage
           expr: pve_node_cpu_usage_ratio > 0.9
           for: 5m
           labels:
             severity: warning
           annotations:
             summary: "High CPU usage on {{ $labels.node }}"
   ```

### Multi-Node Proxmox Cluster

For clustered Proxmox environments:

```yaml
# In pve.yml
cluster1:
  user: monitoring@pve
  password: ${PVE_PASSWORD}
  verify_ssl: false

cluster2:
  user: monitoring@pve
  password: ${PVE_PASSWORD}
  verify_ssl: false
```

### Dashboard Provisioning

Store dashboards as code:

```bash
# Create provisioning directory
mkdir -p grafana/provisioning/dashboards

# Add provisioning config
# grafana/provisioning/dashboards/dashboards.yml
```

## Integration with Other Services

### n8n Workflow Automation

Create workflows in n8n (CT 113) to:
- Send alerts to Slack/Discord based on Prometheus alerts
- Generate daily/weekly infrastructure reports
- Automate backup verification checks

### NetBox IPAM

Sync monitoring targets with NetBox (CT 103):
- Automatically discover new VMs/CTs
- Update service inventory
- Link metrics to network documentation

## Additional Resources

### Documentation

- [Prometheus Documentation](https://prometheus.io/docs/)
- [Grafana Documentation](https://grafana.com/docs/)
- [PVE Exporter GitHub](https://github.com/prometheus-pve/prometheus-pve-exporter)
- [Proxmox API Documentation](https://pve.proxmox.com/pve-docs/api-viewer/)

### Community Dashboards

- Grafana Dashboard 10347: Proxmox VE
- Grafana Dashboard 15356: Proxmox Cluster
- Grafana Dashboard 15362: Proxmox Summary

### Related Homelab Documentation

- [Homelab Overview](../README.md)
- [Services Documentation](../services/README.md)
- [Infrastructure Index](../INDEX.md)
- [n8n Setup Guide](../services/README.md#n8n-workflow-automation)

---

**Last Updated**:
2025-12-07  
**Maintainer**: jramos  
**VM**: 101 (monitoring-docker) at 192.168.2.114  
**Stack Version**: Prometheus 2.x, Grafana 10.x, PVE Exporter latest
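
## Appendix: Health Check Sketch

The three `curl` checks from "Verify All Services" can be collected into one script that reports each service as OK or FAIL. This is a minimal sketch, not part of the repository: the function names (`endpoints`, `probe`, `check_all`), the `MONITORING_HOST` and `PVE_TARGET` variables, and the `--run` flag are all hypothetical. The addresses and paths are the ones used elsewhere in this document.

```bash
#!/bin/sh
# Hypothetical helper: probe the three monitoring endpoints from this README.
# MONITORING_HOST and PVE_TARGET are illustrative overrides, not existing settings.
HOST="${MONITORING_HOST:-192.168.2.114}"
PVE_TARGET="${PVE_TARGET:-192.168.2.200}"

# Print one "name url" pair per line.
endpoints() {
  printf '%s %s\n' \
    "pve-exporter" "http://${HOST}:9221/pve?target=${PVE_TARGET}&module=default" \
    "prometheus" "http://${HOST}:9090/-/healthy" \
    "grafana" "http://${HOST}:3000/api/health"
}

# Succeed only if the URL answers without an HTTP error within 5 seconds.
probe() {
  curl --silent --fail --max-time 5 --output /dev/null "$1"
}

# Probe every endpoint; return 1 if any of them failed, 0 otherwise.
check_all() {
  local failed=0 name url
  while read -r name url; do
    if probe "$url"; then
      echo "OK   $name"
    else
      echo "FAIL $name ($url)"
      failed=1
    fi
  done <<EOF
$(endpoints)
EOF
  return "$failed"
}

# Run the checks only when asked to, so the file can also be sourced.
if [ "${1:-}" = "--run" ]; then
  check_all
fi
```

Saved under a name of your choosing and run with `--run` from any machine on the 192.168.2.0/24 network, the script exits non-zero if any service fails, which makes it usable from cron or an n8n workflow.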