Monitoring Stack
Comprehensive monitoring and observability stack for the Proxmox homelab environment, providing real-time metrics, visualization, and alerting capabilities.
Overview
The monitoring stack consists of three primary components deployed on VM 101 (monitoring-docker) at 192.168.2.114:
- Grafana: Visualization and dashboards (Port 3000)
- Prometheus: Metrics collection and time-series database (Port 9090)
- PVE Exporter: Proxmox VE metrics exporter (Port 9221)
Architecture
┌─────────────────────────────────────────────────────────────────┐
│                   Proxmox Host (serviceslab)                    │
│                          192.168.2.200                          │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             │ API (8006)
                             │
                    ┌────────▼────────┐
                    │  PVE Exporter   │
                    │   Port: 9221    │
                    │    (VM 101)     │
                    └────────┬────────┘
                             │
                             │ Metrics
                             │
                    ┌────────▼────────┐
                    │   Prometheus    │
                    │   Port: 9090    │
                    │    (VM 101)     │
                    └────────┬────────┘
                             │
                             │ Query
                             │
                    ┌────────▼────────┐
                    │     Grafana     │
                    │   Port: 3000    │
                    │    (VM 101)     │
                    └────────┬────────┘
                             │
                             │ HTTPS
                             │
                    ┌────────▼────────┐
                    │   Nginx Proxy   │
                    │    (CT 102)     │
                    │  192.168.2.101  │
                    └─────────────────┘
Components
VM 101: monitoring-docker
Specifications:
- IP Address: 192.168.2.114
- Operating System: Ubuntu 22.04/24.04 LTS
- Docker Version: 24.0+
- Purpose: Dedicated monitoring infrastructure host
Resource Allocation:
- CPU: 2-4 cores
- Memory: 4-8 GB
- Storage: 50-100 GB (thin provisioned)
Grafana
- Version: Latest stable
- Port: 3000
- Access: http://192.168.2.114:3000
Features:
- Pre-configured Proxmox VE dashboards
- Prometheus data source integration
- User authentication and authorization
- Dashboard templating and variables
- Alerting capabilities
- Panel plugins for advanced visualizations
Default Credentials:
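- Username: admin
- Password: check the .env file; if no password is set there, Grafana's initial default is admin and you are prompted to change it at first login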
Key Dashboards:
- Proxmox Host Overview
- VM Resource Utilization
- Container Resource Utilization
- Storage Pool Metrics
- Network Traffic Analysis
Prometheus
- Version: Latest stable
- Port: 9090
- Access: http://192.168.2.114:9090
Configuration: /home/jramos/homelab/monitoring/prometheus/prometheus.yml
Scrape Targets:
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'pve'
    static_configs:
      - targets: ['pve-exporter:9221']
    metrics_path: /pve
    params:
      module: [default]
      target: ['192.168.2.200']  # Proxmox host to query; without it the exporter queries localhost
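Prometheus appends every key under `params` to `metrics_path` as a query string, so each scrape of the `pve` job is an ordinary HTTP GET that can be reproduced by hand. The helper below is a hypothetical convenience (`pve_scrape_url` and the `EXPORTER` variable are not part of any tool); it assumes the exporter at 192.168.2.114:9221 and the Proxmox host 192.168.2.200 as the target, matching the manual test used when verifying the deployment.

```bash
# Hypothetical helper: print the URL that a PVE Exporter scrape resolves to.
# Prometheus turns `metrics_path` plus each `params` entry into this query string.
pve_scrape_url() {
  local target="${1:-192.168.2.200}"          # Proxmox host the exporter should query
  local module="${2:-default}"                # module name defined in pve.yml
  local exporter="${EXPORTER:-192.168.2.114:9221}"
  printf 'http://%s/pve?target=%s&module=%s\n' "$exporter" "$target" "$module"
}

# Usage: quote the URL so the shell does not treat '&' as a background operator.
# curl -s "$(pve_scrape_url)" | head
```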
Features:
- Time-series metrics database
- PromQL query language
- Service discovery
- Alertmanager integration (configurable)
- Data retention policies
- Remote storage support
Retention Policy: 15 days (configurable via command line args)
PVE Exporter
- Version: prompve/prometheus-pve-exporter:latest
- Port: 9221
- Access: http://192.168.2.114:9221
Configuration:
- File: /home/jramos/homelab/monitoring/pve-exporter/pve.yml
- Environment: /home/jramos/homelab/monitoring/pve-exporter/.env
Proxmox Connection:
default:
  user: monitoring@pve
  password: <stored in .env>
  verify_ssl: false
Metrics Exported:
- Proxmox cluster status
- Node CPU, memory, disk usage
- VM/CT status and resource usage
- Storage pool utilization
- Network interface statistics
- Backup job status
- Service health
Environment Variables:
- PVE_USER: Proxmox API user (typically monitoring@pve)
- PVE_PASSWORD: API user password
- PVE_VERIFY_SSL: SSL verification (false for self-signed certs)
Deployment
Prerequisites
1. VM 101 Setup:
   # Install Docker and the Docker Compose plugin
   curl -fsSL https://get.docker.com | sh
   sudo usermod -aG docker $USER   # log out and back in for the group change to apply
   # Verify installation
   docker --version
   docker compose version
2. Proxmox API User:
   # On the Proxmox host, create the monitoring user
   pveum user add monitoring@pve
   pveum passwd monitoring@pve
   pveum aclmod / -user monitoring@pve -role PVEAuditor
3. Clone Repository:
   cd /home/jramos
   git clone <repository-url> homelab
   cd homelab/monitoring
Configuration
1. PVE Exporter Environment:
   cd pve-exporter
   nano .env
   Add:
   PVE_USER=monitoring@pve
   PVE_PASSWORD=your-secure-password
   PVE_VERIFY_SSL=false
2. Verify Configuration Files (run from /home/jramos/homelab/monitoring, e.g. after cd ..):
   # Check PVE exporter config
   cat pve-exporter/pve.yml
   # Check Prometheus config
   cat prometheus/prometheus.yml
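Creating the `.env` file by hand with an editor leaves it at the default umask permissions until someone remembers to `chmod` it. A minimal sketch, assuming the directory layout above, writes the three variables with owner-only permissions from the start; `write_pve_env` is a hypothetical helper name, and the password is read from stdin so it does not end up in shell history.

```bash
# Hypothetical helper: create pve-exporter/.env readable only by its owner.
# The password is read from stdin instead of being passed as an argument.
write_pve_env() {
  local dir="${1:-/home/jramos/homelab/monitoring/pve-exporter}"
  local user="${2:-monitoring@pve}"
  local password
  IFS= read -r password                       # also accepts a last line without newline
  [ -n "$password" ] || { echo "empty password" >&2; return 1; }
  # umask 077 makes a newly created file 0600
  ( umask 077
    printf 'PVE_USER=%s\nPVE_PASSWORD=%s\nPVE_VERIFY_SSL=false\n' \
      "$user" "$password" > "$dir/.env" ) || return 1
  # umask does not change an existing file, so tighten it explicitly
  chmod 600 "$dir/.env"
}

# Usage:
# read -rsp 'PVE password: ' pw; echo; printf '%s\n' "$pw" | write_pve_env; unset pw
```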
Deployment Steps
1. Deploy PVE Exporter:
   cd /home/jramos/homelab/monitoring/pve-exporter
   docker compose up -d
   docker compose logs -f   # Ctrl+C stops following; the container keeps running
2. Deploy Prometheus:
   cd /home/jramos/homelab/monitoring/prometheus
   docker compose up -d
   docker compose logs -f
3. Deploy Grafana:
   cd /home/jramos/homelab/monitoring/grafana
   docker compose up -d
   docker compose logs -f
4. Verify All Services:
   # Check running containers
   docker ps
   # Test PVE Exporter (quote the URL so the shell does not interpret '&')
   curl "http://192.168.2.114:9221/pve?target=192.168.2.200&module=default"
   # Test Prometheus
   curl http://192.168.2.114:9090/-/healthy
   # Test Grafana
   curl http://192.168.2.114:3000/api/health
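The three verification requests can be wrapped in one check that reports each endpoint and returns non-zero if any fails, which makes it reusable after updates or from cron. This is a sketch; `check_endpoints` is a hypothetical name, and it treats any non-2xx response or connection error as a failure (`curl -f`).

```bash
# Hypothetical helper: report which endpoints answer with an HTTP 2xx status.
# Returns 1 if any endpoint fails, 0 if all succeed.
check_endpoints() {
  local url rc=0
  for url in "$@"; do
    # -f: fail on HTTP errors; -sS: quiet but show errors; 5 s timeout per URL
    if curl -fsS -o /dev/null --max-time 5 "$url" 2>/dev/null; then
      echo "OK   $url"
    else
      echo "FAIL $url"
      rc=1
    fi
  done
  return "$rc"
}

# Usage (URLs from the verification step; quotes keep '&' inside the URL):
# check_endpoints \
#   "http://192.168.2.114:9221/pve?target=192.168.2.200&module=default" \
#   "http://192.168.2.114:9090/-/healthy" \
#   "http://192.168.2.114:3000/api/health"
```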
Initial Grafana Setup
1. Access Grafana:
   - Navigate to http://192.168.2.114:3000
   - Log in as admin (the default password is admin unless one is set in .env)
   - Change the password when prompted
2. Add Prometheus Data Source:
   - Go to Connections → Data sources (Configuration → Data Sources in older Grafana versions)
   - Click "Add data source"
   - Select "Prometheus"
   - URL: http://prometheus:9090 (this container name resolves only if Grafana and Prometheus share a Docker network; otherwise use http://192.168.2.114:9090)
   - Click "Save & Test"
3. Import Proxmox Dashboard:
   - Go to Dashboards → Import
   - Dashboard ID: 10347 (Proxmox VE)
   - Select the Prometheus data source
   - Click "Import"
4. Configure Alerting (Optional):
   - Go to Alerting → Contact points (Notification channels in legacy alerting)
   - Add email, Slack, or other notification methods
   - Create alert rules under Alerting → Alert rules
Network Configuration
Internal Access
All services are accessible within the homelab network:
- Grafana: http://192.168.2.114:3000
- Prometheus: http://192.168.2.114:9090
- PVE Exporter: http://192.168.2.114:9221
External Access (via Nginx Proxy Manager)
Configure reverse proxy on CT 102 (nginx at 192.168.2.101):
1. Create Proxy Host:
   - Domain: monitoring.yourdomain.com
   - Scheme: http
   - Forward Hostname: 192.168.2.114
   - Forward Port: 3000
2. SSL Configuration:
   - Enable "Force SSL"
   - Request a Let's Encrypt certificate
   - Enable HTTP/2
3. Access List (Optional):
   - Create an access list for authentication
   - Apply it to the proxy host for additional security
Maintenance
Update Services
# Update all monitoring services
cd /home/jramos/homelab/monitoring
# Update PVE Exporter
cd pve-exporter
docker compose pull
docker compose up -d
# Update Prometheus
cd ../prometheus
docker compose pull
docker compose up -d
# Update Grafana
cd ../grafana
docker compose pull
docker compose up -d
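The update procedure repeats the same two commands in each stack directory. A small loop, sketched below under the assumption that the three directories keep the names used above, runs any `docker compose` subcommand across them in dependency order and stops at the first failure. `compose_each` and `MONITORING_DIR` are hypothetical names.

```bash
# Hypothetical helper: run one `docker compose` subcommand in every stack
# directory, in dependency order (exporter -> Prometheus -> Grafana).
MONITORING_DIR="${MONITORING_DIR:-/home/jramos/homelab/monitoring}"

compose_each() {
  local stack
  for stack in pve-exporter prometheus grafana; do
    echo "==> $stack: docker compose $*"
    # Subshell keeps the caller's working directory unchanged
    ( cd "$MONITORING_DIR/$stack" && docker compose "$@" ) || return 1
  done
}

# Usage: pull new images, then recreate only containers whose image changed.
# compose_each pull && compose_each up -d
```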
Backup Grafana Dashboards
# Backup Grafana data
# No -t: a pseudo-TTY can corrupt the binary tar stream written to stdout
docker exec grafana tar czf - /var/lib/grafana > grafana-backup-$(date +%Y%m%d).tar.gz
# Or use Grafana's provisioning
# Dashboards can be exported as JSON and stored in git
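A slightly safer variant of the backup command removes a half-written archive when `docker exec` fails, so a broken file is never mistaken for a good backup. This is a sketch with a hypothetical function name; it archives relative to `/var/lib` so entries start with `grafana/`. Grafana stores its data in a SQLite database by default, and copying it while Grafana is writing can capture an inconsistent state, so stopping the container first is the more conservative choice.

```bash
# Hypothetical helper: archive /var/lib/grafana from the running container.
# docker exec is used WITHOUT -t so the gzip stream is passed through unchanged.
backup_grafana() {
  local dest="${1:-.}"
  local out="$dest/grafana-backup-$(date +%Y%m%d).tar.gz"
  if docker exec grafana tar czf - -C /var/lib grafana > "$out"; then
    echo "$out"
  else
    rm -f "$out"            # do not leave a truncated archive behind
    return 1
  fi
}

# Usage: backup_grafana /path/to/backups
# Inspect an archive: tar tzf grafana-backup-YYYYMMDD.tar.gz | head
```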
Prometheus Data Retention
# Check Prometheus storage size
docker exec prometheus du -sh /prometheus
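# Adjust retention in docker-compose.yml by adding these flags to the existing
# command: list. Overriding command replaces the image's default arguments,
# so keep --config.file and --storage.tsdb.path in the list as well:
# command:
#   - '--storage.tsdb.retention.time=30d'
#   - '--storage.tsdb.retention.size=50GB'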
View Logs
# PVE Exporter logs
cd /home/jramos/homelab/monitoring/pve-exporter
docker compose logs -f
# Prometheus logs
cd /home/jramos/homelab/monitoring/prometheus
docker compose logs -f
# Grafana logs
cd /home/jramos/homelab/monitoring/grafana
docker compose logs -f
# Or follow each container directly (each command follows one container; Ctrl+C to exit)
docker logs -f pve-exporter
docker logs -f prometheus
docker logs -f grafana
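Each `docker logs -f` above blocks until interrupted, so running them one after another never shows the three streams together. A minimal sketch for a single combined view starts one follower per container in the background and prefixes every line with the container name; `follow_logs` is a hypothetical helper and the `--tail 20` starting point is an arbitrary choice.

```bash
# Hypothetical helper: follow several containers' logs in one terminal,
# prefixing each line with the container name. Ctrl+C stops all followers.
follow_logs() {
  local name
  [ "$#" -gt 0 ] || set -- pve-exporter prometheus grafana
  for name in "$@"; do
    docker logs -f --tail 20 "$name" 2>&1 | sed "s/^/[$name] /" &
  done
  # On Ctrl+C, stop the background followers, then restore default handling
  trap 'kill $(jobs -p) 2>/dev/null' INT TERM
  wait
  trap - INT TERM
}

# Usage: follow_logs                  # all three containers
#        follow_logs prometheus       # just one
```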
Troubleshooting
PVE Exporter Cannot Connect to Proxmox
Symptoms: No metrics from Proxmox, connection refused errors
Solutions:
1. Verify the Proxmox API is accessible (an HTTP 401 "No ticket" reply still proves the port is reachable):
   curl -k https://192.168.2.200:8006/api2/json/version
2. Check PVE Exporter environment variables:
   cd /home/jramos/homelab/monitoring/pve-exporter
   cat .env
   docker compose config
3. Test authentication:
   # From VM 101
   curl -k -d "username=monitoring@pve&password=yourpassword" \
     https://192.168.2.200:8006/api2/json/access/ticket
4. Verify user permissions on Proxmox:
   # On the Proxmox host
   pveum user list
   pveum acl list
   # Re-apply the role if it is missing
   pveum aclmod / -user monitoring@pve -role PVEAuditor
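The authentication test can be turned into a yes/no check: a successful POST to `/access/ticket` returns JSON whose `data` object contains a `ticket` field, while a failed login does not. The sketch below assumes that response shape; `pve_auth_check` and `PVE_HOST` are hypothetical names. It prompts for the password instead of embedding it in the command, though the password is still passed to curl as an argument, as in the manual test.

```bash
# Hypothetical helper: request an API ticket and report whether login worked.
pve_auth_check() {
  local user="${1:-monitoring@pve}" host="${PVE_HOST:-192.168.2.200}"
  local password resp
  # Hidden prompt on a terminal; piped input also works
  IFS= read -rs -p "Password for $user: " password
  echo >&2
  # -k accepts the self-signed certificate, like the manual curl test
  resp=$(curl -sk --data-urlencode "username=$user" \
    --data-urlencode "password=$password" \
    "https://$host:8006/api2/json/access/ticket")
  if printf '%s' "$resp" | grep -q '"ticket"'; then
    echo "auth OK for $user"
  else
    echo "auth FAILED for $user" >&2
    return 1
  fi
}

# Usage (on VM 101): pve_auth_check monitoring@pve
```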
Prometheus Not Scraping Targets
Symptoms: Targets shown as down in Prometheus UI
Solutions:
1. Check Prometheus targets:
   - Navigate to http://192.168.2.114:9090/targets
   - Verify target status and error messages
2. Verify network connectivity (the Prometheus image ships BusyBox wget, not curl):
   docker exec prometheus wget -qO- "http://pve-exporter:9221/pve?target=192.168.2.200&module=default"
3. Check Prometheus configuration:
   cd /home/jramos/homelab/monitoring/prometheus
   docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml
4. Restart Prometheus to apply configuration changes:
   docker compose restart prometheus
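The Targets page can also be checked from the command line with the PromQL expression `up == 0`, which returns one series per target whose last scrape failed. The sketch runs `promtool query instant` inside the Prometheus container, so nothing extra is needed on VM 101; it assumes the container is named `prometheus` and that promtool prints nothing when no series match. `down_targets` is a hypothetical helper name.

```bash
# Hypothetical helper: list scrape targets whose `up` value is 0.
# Exit status: 0 = all up, 1 = some targets down, 2 = query failed.
down_targets() {
  local out
  out=$(docker exec prometheus \
        promtool query instant http://localhost:9090 'up == 0') || return 2
  if [ -z "$out" ]; then
    echo "all targets up"
  else
    printf '%s\n' "$out"      # one line per failing target, with job/instance labels
    return 1
  fi
}
```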
Grafana Shows No Data
Symptoms: Dashboards display "No data" or empty graphs
Solutions:
1. Verify Prometheus data source:
   - Go to Connections → Data sources
   - Test the connection to Prometheus
   - URL should be http://prometheus:9090
2. Check Prometheus has data:
   - Navigate to http://192.168.2.114:9090
   - Run the query up; it should return one series per scrape target
3. Verify dashboard queries:
   - Edit the panel
   - Check the PromQL query syntax
   - Test the query in the Prometheus UI first
4. Check time range:
   - Ensure the dashboard time range includes recent data
   - Confirm the range does not extend beyond the Prometheus retention period
Docker Compose Network Issues
Symptoms: Containers cannot communicate
Solutions:
1. Check Docker network:
   docker network ls
   docker network inspect monitoring_default   # Compose names it <project>_default
2. Verify container connectivity:
   docker exec prometheus ping -c 3 pve-exporter
   docker exec grafana ping -c 3 prometheus
3. Recreate network:
   cd /home/jramos/homelab/monitoring
   docker compose down
   docker network prune
   docker compose up -d
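Container-name URLs such as `http://pve-exporter:9221` and `http://prometheus:9090` resolve only when both containers are attached to the same user-defined network. Compose names a project's default network `<project>_default`, with the project name taken from the directory by default, so stacks started from separate directories (pve-exporter, prometheus, grafana) land on separate networks unless they join a common one. The hypothetical helper below prints the networks two containers share; empty output explains a failed ping.

```bash
# Hypothetical helper: print the Docker networks two containers have in common.
shared_networks() {
  local fmt='{{range $k, $v := .NetworkSettings.Networks}}{{println $k}}{{end}}'
  # awk NF drops the blank line docker inspect adds after each container;
  # comm -12 prints only the lines present in both sorted lists
  comm -12 <(docker inspect -f "$fmt" "$1" | awk NF | sort) \
           <(docker inspect -f "$fmt" "$2" | awk NF | sort)
}

# Usage:
# [ -n "$(shared_networks prometheus pve-exporter)" ] || echo "no shared network"
```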
High Memory Usage
Symptoms: VM 101 running out of memory
Solutions:
1. Check container memory usage:
   docker stats
2. Reduce Prometheus retention:
   # In prometheus/docker-compose.yml, add to the existing command: list
   # (keep --config.file and --storage.tsdb.path there as well)
   command:
     - '--storage.tsdb.retention.time=7d'
     - '--storage.tsdb.retention.size=10GB'
3. Limit Grafana image rendering:
   # In grafana/docker-compose.yml
   environment:
     - GF_RENDERING_SERVER_URL=
     - GF_RENDERING_CALLBACK_URL=
4. Increase the VM memory allocation in Proxmox
SSL/TLS Certificate Errors
Symptoms: PVE Exporter cannot verify SSL certificate
Solutions:
- Set verify_ssl: false in pve.yml (for self-signed certs)
- Or import the Proxmox CA certificate:
  # Copy CA from Proxmox to VM 101
  scp root@192.168.2.200:/etc/pve/pve-root-ca.pem .
  # Add to trust store
  sudo cp pve-root-ca.pem /usr/local/share/ca-certificates/pve-root-ca.crt
  sudo update-ca-certificates
  Note: the exporter runs inside a container, so the CA must also be trusted inside the container, not only in VM 101's trust store.
Metrics Reference
Key Proxmox Metrics
prometheus-pve-exporter reuses the same metric names for nodes, guests, and storage and tells them apart with the id label. Check the exporter's /pve output for the exact names exposed by your version.
Node Metrics (id="node/<node-name>"):
- pve_cpu_usage_ratio: CPU utilization ratio
- pve_memory_usage_bytes: Memory used
- pve_memory_size_bytes: Total memory
- pve_disk_usage_bytes: Root disk used
- pve_uptime_seconds: Node uptime
VM/CT Metrics (id="qemu/<vmid>" or id="lxc/<vmid>"):
- pve_guest_info: Guest information (labels include id, name, type, node)
- pve_up: Guest running (1) or stopped (0)
- pve_cpu_usage_ratio: Guest CPU usage
- pve_memory_usage_bytes / pve_memory_size_bytes: Guest memory used / allocated
- pve_disk_read_bytes / pve_disk_write_bytes: Cumulative disk bytes read / written
- pve_network_receive_bytes / pve_network_transmit_bytes: Cumulative network bytes received / transmitted
Storage Metrics (id="storage/<node>/<storage>"):
- pve_disk_usage_bytes: Storage used
- pve_disk_size_bytes: Total storage size
- pve_storage_info: Storage information (labels include id, node, storage)
Useful PromQL Queries
CPU Usage by VM:
pve_cpu_usage_ratio{id=~"qemu/.*"} * 100
Memory Usage Percentage (VMs and CTs):
pve_memory_usage_bytes{id=~"(qemu|lxc)/.*"} / pve_memory_size_bytes * 100
Storage Usage Percentage:
pve_disk_usage_bytes{id=~"storage/.*"} / pve_disk_size_bytes * 100
Network Bandwidth (rate):
rate(pve_network_transmit_bytes[5m])
Top 5 VMs by CPU:
topk(5, pve_cpu_usage_ratio{id=~"qemu/.*"})
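Any of these expressions can also be run outside the UI through Prometheus's HTTP API (`/api/v1/query`), which is convenient for scripts. PromQL contains braces, quotes and spaces, so the expression must be URL-encoded; `curl -G --data-urlencode` does that and appends it to the URL as a query string. `promql` and `PROM_URL` are hypothetical names, and the function prints the raw JSON response.

```bash
# Hypothetical helper: run an instant PromQL query and print the JSON result.
promql() {
  # -G sends the URL-encoded data as a query string on a GET request
  curl -sG --data-urlencode "query=$1" \
    "${PROM_URL:-http://192.168.2.114:9090}/api/v1/query"
}

# Usage: promql 'up == 0'
```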
Security Considerations
API Credentials
- PVE Exporter .env file:
  - Never commit it to version control
  - Use strong passwords
  - Restrict file permissions: chmod 600 .env
- Proxmox API User:
  - Use a dedicated monitoring user
  - Grant minimal required permissions (PVEAuditor role)
  - Consider token-based authentication
- Grafana Authentication:
  - Change the default admin password
  - Enable OAuth/LDAP for user authentication
  - Use role-based access control
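Token-based authentication keeps the user's password out of the exporter configuration entirely. A sketch, assuming a token ID of `exporter` (a hypothetical name): create the token on the Proxmox host with `pveum user token add monitoring@pve exporter --privsep 0`, which lets the token inherit the user's PVEAuditor rights and prints the secret once, then store it in `pve.yml` using the exporter's `token_name`/`token_value` keys. `write_pve_token_config` is likewise hypothetical. If the exporter container runs as a different UID than the file owner, a 0600 file may not be readable inside it, so adjust ownership accordingly.

```bash
# Hypothetical helper: write pve.yml for API-token authentication.
# The token secret is read from stdin so it does not appear in shell history.
write_pve_token_config() {
  local dir="${1:-/home/jramos/homelab/monitoring/pve-exporter}" token_value
  IFS= read -r token_value
  [ -n "$token_value" ] || { echo "empty token value" >&2; return 1; }
  ( umask 077
    cat > "$dir/pve.yml" <<EOF
default:
  user: monitoring@pve
  token_name: exporter
  token_value: "$token_value"
  verify_ssl: false
EOF
  ) || return 1
  chmod 600 "$dir/pve.yml"     # also tightens a pre-existing file
}
```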
Network Security
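- Firewall Rules:
  # On VM 101, allow only the homelab subnet (ufw enabled, default deny incoming)
  sudo ufw allow from 192.168.2.0/24 to any port 3000
  sudo ufw allow from 192.168.2.0/24 to any port 9090
  sudo ufw allow from 192.168.2.0/24 to any port 9221
  Note: ports published by Docker bypass ufw rules, because Docker inserts its own iptables rules ahead of them; filter those ports in the DOCKER-USER chain or bind them to a specific address in docker-compose.yml.
- Reverse Proxy:
  - Use Nginx Proxy Manager for SSL termination
  - Implement access lists
  - Enable fail2ban for brute-force protection
- Docker Security:
  - Run containers as non-root users
  - Use read-only filesystems where possible
  - Limit container capabilities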
Performance Tuning
Prometheus Optimization
Scrape Interval:
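global:
  scrape_interval: 30s      # Increase for less frequent scraping
  evaluation_interval: 30s
Target Relabeling (inside a job under scrape_configs):
relabel_configs:
  - source_labels: [__address__]
    regex: '.*'             # Replace with a narrower pattern; '.*' keeps every target
    action: keep            # Keep only targets whose address matches regex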
Grafana Optimization
Query Optimization:
- Use recording rules in Prometheus for complex queries
- Set appropriate refresh intervals on dashboards
- Limit time range on expensive queries
Caching (query caching is a Grafana Enterprise / Grafana Cloud feature):
# In grafana.ini or environment variables
[caching]
enabled = true
ttl = 3600
Advanced Configuration
Alerting with Alertmanager
1. Add Alertmanager to the stack:
   cd /home/jramos/homelab/monitoring
   # Create an alertmanager directory with its own docker-compose.yml
2. Configure alerts in Prometheus:
   # In prometheus.yml
   alerting:
     alertmanagers:
       - static_configs:
           - targets: ['alertmanager:9093']
   rule_files:
     - 'alerts.yml'
3. Example alert rules:
   # alerts.yml
   groups:
     - name: proxmox
       interval: 30s
       rules:
         - alert: HighCPUUsage
           expr: pve_cpu_usage_ratio{id=~"node/.*"} > 0.9
           for: 5m
           labels:
             severity: warning
           annotations:
             summary: "High CPU usage on {{ $labels.id }}"
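A rule file can be generated and then validated with `promtool check rules` before Prometheus is restarted. This sketch writes a rule on the built-in `up` metric, so it does not depend on exporter metric names. It assumes `alerts.yml` lives next to `prometheus.yml` in `/home/jramos/homelab/monitoring/prometheus` and that this directory is mounted at `/etc/prometheus` in the container (not confirmed by this document). `write_alert_rules` and the `TargetDown` alert are hypothetical.

```bash
# Hypothetical sketch: write alerts.yml with a target-down rule.
# The quoted 'EOF' keeps {{ $labels... }} literal for Prometheus templating.
write_alert_rules() {
  local dir="${1:-/home/jramos/homelab/monitoring/prometheus}"
  cat > "$dir/alerts.yml" <<'EOF'
groups:
  - name: proxmox
    interval: 30s
    rules:
      - alert: TargetDown
        expr: up == 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.job }} target {{ $labels.instance }} is down"
EOF
}

# Validate, then restart Prometheus to load the rules:
# docker compose exec prometheus promtool check rules /etc/prometheus/alerts.yml
```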
Multi-Node Proxmox Cluster
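A single cluster needs only one module, because the exporter reads cluster-wide data through the API of the node it targets. For multiple clusters, or nodes that need separate credentials, define one module per credential set in pve.yml and select it with the module parameter in the scrape config:
# In pve.yml
cluster1:
  user: monitoring@pve
  password: ${PVE_PASSWORD}
  verify_ssl: false
cluster2:
  user: monitoring@pve
  password: ${PVE_PASSWORD}
  verify_ssl: false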
Dashboard Provisioning
Store dashboards as code:
# Create provisioning directory
mkdir -p grafana/provisioning/dashboards
# Add provisioning config
# grafana/provisioning/dashboards/dashboards.yml
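The provisioning config referenced above is a file-based dashboard provider that tells Grafana to load JSON dashboards from a directory. A sketch, assuming `grafana/provisioning` is mounted at Grafana's default provisioning path `/etc/grafana/provisioning` and a `grafana/dashboards` folder of exported JSON is mounted at `/var/lib/grafana/dashboards` (both mounts are assumptions). The provider name `homelab` and the function name are hypothetical.

```bash
# Hypothetical sketch: create a file-based dashboard provider for Grafana.
write_dashboard_provider() {
  local base="${1:-/home/jramos/homelab/monitoring/grafana}"
  mkdir -p "$base/provisioning/dashboards" "$base/dashboards"
  cat > "$base/provisioning/dashboards/dashboards.yml" <<'EOF'
apiVersion: 1
providers:
  - name: homelab
    orgId: 1
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
EOF
}

# Usage: write_dashboard_provider, then export dashboards as JSON into
# grafana/dashboards/ and commit them to git.
```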
Integration with Other Services
n8n Workflow Automation
Create workflows in n8n (CT 113) to:
- Send alerts to Slack/Discord based on Prometheus alerts
- Generate daily/weekly infrastructure reports
- Automate backup verification checks
NetBox IPAM
Sync monitoring targets with NetBox (CT 103):
- Automatically discover new VMs/CTs
- Update service inventory
- Link metrics to network documentation
Additional Resources
Documentation
Community Dashboards
- Grafana Dashboard 10347: Proxmox VE
- Grafana Dashboard 15356: Proxmox Cluster
- Grafana Dashboard 15362: Proxmox Summary
Related Homelab Documentation
Last Updated: 2025-12-07
Maintainer: jramos
VM: 101 (monitoring-docker) at 192.168.2.114
Stack Version: Prometheus 2.x, Grafana 10.x, PVE Exporter latest