
Monitoring Stack

Comprehensive monitoring and observability stack for the Proxmox homelab environment, providing real-time metrics, visualization, and alerting capabilities.

Overview

The monitoring stack consists of three primary components deployed on VM 101 (monitoring-docker) at 192.168.2.114:

  • Grafana: Visualization and dashboards (Port 3000)
  • Prometheus: Metrics collection and time-series database (Port 9090)
  • PVE Exporter: Proxmox VE metrics exporter (Port 9221)

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Proxmox Host (serviceslab)                   │
│                         192.168.2.200                           │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             │ API (8006)
                             │
                    ┌────────▼────────┐
                    │  PVE Exporter   │
                    │   Port: 9221    │
                    │  (VM 101)       │
                    └────────┬────────┘
                             │
                             │ Metrics
                             │
                    ┌────────▼────────┐
                    │   Prometheus    │
                    │   Port: 9090    │
                    │  (VM 101)       │
                    └────────┬────────┘
                             │
                             │ Query
                             │
                    ┌────────▼────────┐
                    │     Grafana     │
                    │   Port: 3000    │
                    │  (VM 101)       │
                    └────────┬────────┘
                             │
                             │ HTTPS
                             │
                    ┌────────▼────────┐
                    │  Nginx Proxy    │
                    │   (CT 102)      │
                    │  192.168.2.101  │
                    └─────────────────┘

Components

VM 101: monitoring-docker

Specifications:

  • IP Address: 192.168.2.114
  • Operating System: Ubuntu 22.04/24.04 LTS
  • Docker Version: 24.0+
  • Purpose: Dedicated monitoring infrastructure host

Resource Allocation:

  • CPU: 2-4 cores
  • Memory: 4-8 GB
  • Storage: 50-100 GB (thin provisioned)

Grafana

Version: Latest stable
Port: 3000
Access: http://192.168.2.114:3000

Features:

  • Pre-configured Proxmox VE dashboards
  • Prometheus data source integration
  • User authentication and authorization
  • Dashboard templating and variables
  • Alerting capabilities
  • Panel plugins for advanced visualizations

Default Credentials:

  • Username: admin
  • Password: Check .env file or initial setup

Key Dashboards:

  • Proxmox Host Overview
  • VM Resource Utilization
  • Container Resource Utilization
  • Storage Pool Metrics
  • Network Traffic Analysis

Prometheus

Version: Latest stable
Port: 9090
Access: http://192.168.2.114:9090

Configuration: /home/jramos/homelab/monitoring/prometheus/prometheus.yml

Scrape Targets:

scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'pve'
    static_configs:
      - targets: ['pve-exporter:9221']
    metrics_path: /pve
    params:
      module: [default]
      target: ['192.168.2.200']  # Proxmox host the exporter should query

Features:

  • Time-series metrics database
  • PromQL query language
  • Service discovery
  • Alertmanager integration (configurable)
  • Data retention policies
  • Remote storage support

Retention Policy: 15 days (the Prometheus default; change it with the --storage.tsdb.retention.time flag)

PVE Exporter

Version: prompve/prometheus-pve-exporter:latest
Port: 9221
Access: http://192.168.2.114:9221

Configuration:

  • File: /home/jramos/homelab/monitoring/pve-exporter/pve.yml
  • Environment: /home/jramos/homelab/monitoring/pve-exporter/.env

Proxmox Connection:

default:
  user: monitoring@pve
  password: <stored in .env>
  verify_ssl: false

Metrics Exported:

  • Proxmox cluster status
  • Node CPU, memory, disk usage
  • VM/CT status and resource usage
  • Storage pool utilization
  • Network interface statistics
  • Backup job status
  • Service health

Environment Variables:

  • PVE_USER: Proxmox API user (typically monitoring@pve)
  • PVE_PASSWORD: API user password
  • PVE_VERIFY_SSL: SSL verification (false for self-signed certs)
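
A missing or empty variable in .env is an easy mistake to make before the first deployment. The sketch below checks an env file for the three variables listed above; the function name check_env_file is hypothetical and not part of the stack.

```shell
#!/usr/bin/env bash
# Report any required variable that is missing or empty in an env file.
# check_env_file is an illustrative helper, not something the exporter ships.
check_env_file() {
  local file="$1" key missing=0
  for key in PVE_USER PVE_PASSWORD PVE_VERIFY_SSL; do
    # Match "KEY=value" with at least one character after the equals sign.
    if ! grep -Eq "^${key}=.+" "$file"; then
      echo "missing: ${key}"
      missing=1
    fi
  done
  return "$missing"
}

# Example:
# check_env_file /home/jramos/homelab/monitoring/pve-exporter/.env
```

The function returns a non-zero status when anything is missing, so it can gate a deployment script.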

Deployment

Prerequisites

  1. VM 101 Setup:

    # Install Docker and Docker Compose
    curl -fsSL https://get.docker.com | sh
    sudo usermod -aG docker "$USER"
    # Log out and back in so the new group membership takes effect
    
    # Verify installation
    docker --version
    docker compose version
    
  2. Proxmox API User:

    # On Proxmox host, create monitoring user
    pveum user add monitoring@pve
    pveum passwd monitoring@pve
    pveum aclmod / -user monitoring@pve -role PVEAuditor
    
  3. Clone Repository:

    cd /home/jramos
    git clone <repository-url> homelab
    cd homelab/monitoring
    

Configuration

  1. PVE Exporter Environment:

    cd pve-exporter
    nano .env
    

    Add:

    PVE_USER=monitoring@pve
    PVE_PASSWORD=your-secure-password
    PVE_VERIFY_SSL=false
    
  2. Verify Configuration Files:

    # Check PVE exporter config
    cat pve-exporter/pve.yml
    
    # Check Prometheus config
    cat prometheus/prometheus.yml
    

Deployment Steps

  1. Deploy PVE Exporter:

    cd /home/jramos/homelab/monitoring/pve-exporter
    docker compose up -d
    docker compose logs -f
    
  2. Deploy Prometheus:

    cd /home/jramos/homelab/monitoring/prometheus
    docker compose up -d
    docker compose logs -f
    
  3. Deploy Grafana:

    cd /home/jramos/homelab/monitoring/grafana
    docker compose up -d
    docker compose logs -f
    
  4. Verify All Services:

    # Check running containers
    docker ps
    
    # Test PVE Exporter
    curl 'http://192.168.2.114:9221/pve?target=192.168.2.200&module=default'
    
    # Test Prometheus
    curl http://192.168.2.114:9090/-/healthy
    
    # Test Grafana
    curl http://192.168.2.114:3000/api/health
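
Containers can take a few seconds to start answering, so a single curl right after "docker compose up -d" may fail even when the deployment is fine. A minimal sketch of a retry loop follows; wait_for_url is a hypothetical helper, and the default attempt count and delay are arbitrary.

```shell
#!/usr/bin/env bash
# Poll a URL until curl succeeds or the attempts run out.
# wait_for_url is an illustrative helper; tune attempts and delay as needed.
wait_for_url() {
  local url="$1" attempts="${2:-10}" delay="${3:-3}" i=1
  while [ "$i" -le "$attempts" ]; do
    # -f makes curl return non-zero on HTTP error statuses;
    # -sS hides the progress meter but still prints errors.
    if curl -fsS -o /dev/null "$url"; then
      echo "ok: $url"
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "failed: $url" >&2
  return 1
}

# Example:
# wait_for_url http://192.168.2.114:9090/-/healthy
# wait_for_url http://192.168.2.114:3000/api/health
```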
    

Initial Grafana Setup

  1. Access Grafana: http://192.168.2.114:3000 (log in as admin)

  2. Add Prometheus Data Source:

    • Go to Connections → Data sources (Configuration → Data Sources before Grafana 10)
    • Click "Add data source"
    • Select "Prometheus"
    • URL: http://prometheus:9090
    • Click "Save & Test"
  3. Import Proxmox Dashboard:

    • Go to Dashboards → Import
    • Dashboard ID: 10347 (Proxmox VE)
    • Select Prometheus data source
    • Click "Import"
  4. Configure Alerting (Optional):

    • Go to Alerting → Contact points (Notification channels in Grafana 8 and earlier)
    • Add email, Slack, or other notification methods
    • Create alert rules under Alerting → Alert rules

Network Configuration

Internal Access

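All services are accessible within the homelab network:

  • Grafana: http://192.168.2.114:3000
  • Prometheus: http://192.168.2.114:9090
  • PVE Exporter: http://192.168.2.114:9221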

External Access (via Nginx Proxy Manager)

Configure reverse proxy on CT 102 (nginx at 192.168.2.101):

  1. Create Proxy Host:

    • Domain: monitoring.yourdomain.com
    • Scheme: http
    • Forward Hostname: 192.168.2.114
    • Forward Port: 3000
  2. SSL Configuration:

    • Enable "Force SSL"
    • Request Let's Encrypt certificate
    • Enable HTTP/2
  3. Access List (Optional):

    • Create access list for authentication
    • Apply to proxy host for additional security

Maintenance

Update Services

# Update all monitoring services
cd /home/jramos/homelab/monitoring

# Update PVE Exporter
cd pve-exporter
docker compose pull
docker compose up -d

# Update Prometheus
cd ../prometheus
docker compose pull
docker compose up -d

# Update Grafana
cd ../grafana
docker compose pull
docker compose up -d

Backup Grafana Dashboards

# Backup Grafana data
# No -t here: a pseudo-TTY would corrupt the binary tar stream
docker exec grafana tar czf - /var/lib/grafana > grafana-backup-$(date +%Y%m%d).tar.gz

# Or use Grafana's provisioning
# Dashboards can be exported as JSON and stored in git
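
A backup is only useful if it can be read back. As a sketch, the helper below checks that an archive is non-empty and that tar can list it; verify_backup is a hypothetical name, not an existing tool.

```shell
#!/usr/bin/env bash
# Check that a backup archive is non-empty and that tar can list its contents.
# verify_backup is an illustrative helper name.
verify_backup() {
  local archive="$1"
  # -s: the file exists and has a size greater than zero.
  [ -s "$archive" ] || { echo "empty or missing: $archive" >&2; return 1; }
  # Listing the archive decompresses the gzip stream, so truncation or
  # corruption shows up as a non-zero exit status here.
  tar tzf "$archive" > /dev/null 2>&1 || { echo "unreadable: $archive" >&2; return 1; }
  echo "ok: $archive"
}

# Example:
# verify_backup "grafana-backup-$(date +%Y%m%d).tar.gz"
```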

Prometheus Data Retention

# Check Prometheus storage size
docker exec prometheus du -sh /prometheus

# Adjust retention in docker-compose.yml:
# command:
#   - '--storage.tsdb.retention.time=30d'
#   - '--storage.tsdb.retention.size=50GB'
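
When choosing a retention value, the Prometheus documentation gives a rule of thumb: needed disk space is roughly retention time in seconds × ingested samples per second × bytes per sample, at about 1 to 2 bytes per sample. The sketch below applies that formula; the function name and the series count in the example are assumptions for illustration, not measurements from this stack.

```shell
#!/usr/bin/env bash
# Rough TSDB size estimate in bytes:
#   retention_seconds * (series / scrape_interval_seconds) * bytes_per_sample
# Arguments: active series, scrape interval (s), retention (days),
# bytes per sample (defaults to 2, the upper end of the rule of thumb).
estimate_tsdb_bytes() {
  local series="$1" interval="$2" days="$3" bytes_per_sample="${4:-2}"
  # Multiply before dividing to limit integer rounding error.
  echo $(( days * 86400 * series * bytes_per_sample / interval ))
}

# Example: 5000 series (hypothetical), 30s scrape interval, 30 days retention
# estimate_tsdb_bytes 5000 30 30    # prints 864000000 (about 0.86 GB)
```

The real series count can be read from the "TSDB Status" page in the Prometheus UI.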

View Logs

# PVE Exporter logs
cd /home/jramos/homelab/monitoring/pve-exporter
docker compose logs -f

# Prometheus logs
cd /home/jramos/homelab/monitoring/prometheus
docker compose logs -f

# Grafana logs
cd /home/jramos/homelab/monitoring/grafana
docker compose logs -f

# Or follow a single container by name (each command blocks, so run one per terminal)
docker logs -f pve-exporter
docker logs -f prometheus
docker logs -f grafana

Troubleshooting

PVE Exporter Cannot Connect to Proxmox

Symptoms: No metrics from Proxmox, connection refused errors

Solutions:

  1. Verify Proxmox API is accessible:

    curl -k https://192.168.2.200:8006/api2/json/version
    
  2. Check PVE Exporter environment variables:

    cd /home/jramos/homelab/monitoring/pve-exporter
    cat .env
    docker compose config
    
  3. Test authentication:

    # From VM 101
    curl -k -d "username=monitoring@pve&password=yourpassword" \
      https://192.168.2.200:8006/api2/json/access/ticket
    
  4. Verify user permissions on Proxmox:

    # On Proxmox host: list users and current ACLs
    pveum user list
    pveum acl list
    # Re-apply the read-only role if it is missing
    pveum aclmod / -user monitoring@pve -role PVEAuditor
    

Prometheus Not Scraping Targets

Symptoms: Targets shown as down in Prometheus UI

Solutions:

  1. Check Prometheus targets: http://192.168.2.114:9090/targets

  2. Verify network connectivity:

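    # The official Prometheus image is BusyBox-based and ships wget, not curl
    docker exec prometheus wget -qO- 'http://pve-exporter:9221/pve?target=192.168.2.200&module=default'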
    
  3. Check Prometheus configuration:

    cd /home/jramos/homelab/monitoring/prometheus
    docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml
    
  4. Reload Prometheus configuration:

    docker compose restart prometheus
    

Grafana Shows No Data

Symptoms: Dashboards display "No data" or empty graphs

Solutions:

  1. Verify Prometheus data source:

    • Go to Connections → Data sources (Configuration → Data Sources before Grafana 10)
    • Test connection to Prometheus
    • URL should be http://prometheus:9090
  2. Check Prometheus has data: run a simple query such as up at http://192.168.2.114:9090/graph

  3. Verify dashboard queries:

    • Edit panel
    • Check PromQL query syntax
    • Test query in Prometheus UI first
  4. Check time range:

    • Ensure the dashboard time range includes recent data
    • Confirm the requested range is within the Prometheus retention period

Docker Compose Network Issues

Symptoms: Containers cannot communicate

Solutions:

  1. Check Docker network:

    docker network ls
    # The network name depends on the Compose project; use the name listed above
    docker network inspect monitoring_default
    
  2. Verify container connectivity:

    docker exec prometheus ping pve-exporter
    docker exec grafana ping prometheus
    
  3. Recreate network:

    cd /home/jramos/homelab/monitoring
    docker compose down
    docker network prune
    docker compose up -d
    

High Memory Usage

Symptoms: VM 101 running out of memory

Solutions:

  1. Check container memory usage:

    docker stats
    
  2. Reduce Prometheus retention:

    # In prometheus/docker-compose.yml
    command:
      - '--storage.tsdb.retention.time=7d'
      - '--storage.tsdb.retention.size=10GB'
    
  3. Limit Grafana image rendering:

    # In grafana/docker-compose.yml
    environment:
      - GF_RENDERING_SERVER_URL=
      - GF_RENDERING_CALLBACK_URL=
    
  4. Increase VM memory allocation in Proxmox

SSL/TLS Certificate Errors

Symptoms: PVE Exporter cannot verify SSL certificate

Solutions:

  1. Set verify_ssl: false in pve.yml (for self-signed certs)
  2. Or import Proxmox CA certificate:
    # Copy CA from Proxmox to VM 101
    scp root@192.168.2.200:/etc/pve/pve-root-ca.pem .
    
    # Add to trust store
    sudo cp pve-root-ca.pem /usr/local/share/ca-certificates/pve-root-ca.crt
    sudo update-ca-certificates
    

Metrics Reference

Key Proxmox Metrics

Node Metrics:

  • pve_node_cpu_usage_ratio: CPU utilization (0-1)
  • pve_node_memory_usage_bytes: Memory used
  • pve_node_memory_total_bytes: Total memory
  • pve_node_disk_usage_bytes: Root disk used
  • pve_node_uptime_seconds: Node uptime

VM/CT Metrics:

  • pve_guest_info: Guest information (labels: id, name, type, node)
  • pve_guest_cpu_usage_ratio: Guest CPU usage
  • pve_guest_memory_usage_bytes: Guest memory used
  • pve_guest_disk_read_bytes_total: Disk read bytes
  • pve_guest_disk_write_bytes_total: Disk write bytes
  • pve_guest_network_receive_bytes_total: Network received
  • pve_guest_network_transmit_bytes_total: Network transmitted

Storage Metrics:

  • pve_storage_usage_bytes: Storage used
  • pve_storage_size_bytes: Total storage size
  • pve_storage_info: Storage information (labels: storage, type)
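
Exact metric names can differ between exporter releases, so before building dashboards or alerts it may help to list what the running exporter actually exposes. A minimal sketch, assuming the exporter returns the standard Prometheus text exposition format; list_metric_names is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Print the unique metric names found in Prometheus text exposition format on stdin.
# list_metric_names is an illustrative helper name.
list_metric_names() {
  # Drop comment lines (# HELP / # TYPE) and blank lines, then cut each sample
  # at the first "{" or space to strip labels and the value.
  grep -Ev '^(#|$)' | sed -E 's/[{ ].*$//' | sort -u
}

# Example (the URL is quoted so the shell does not treat "&" as a background operator):
# curl -s 'http://192.168.2.114:9221/pve?target=192.168.2.200&module=default' | list_metric_names
```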

Useful PromQL Queries

CPU Usage by VM:

pve_guest_cpu_usage_ratio{type="qemu"} * 100

Memory Usage Percentage:

(pve_guest_memory_usage_bytes / pve_guest_memory_size_bytes) * 100

Storage Usage Percentage:

(pve_storage_usage_bytes / pve_storage_size_bytes) * 100

Network Bandwidth (rate):

rate(pve_guest_network_transmit_bytes_total[5m])

Top 5 VMs by CPU:

topk(5, pve_guest_cpu_usage_ratio{type="qemu"})
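
These queries can also be run from the command line through the Prometheus HTTP API (/api/v1/query), which is handy for scripts and quick checks. The sketch below assumes Prometheus is reachable at the address used in this document; prom_query and the PROM_URL variable are hypothetical names.

```shell
#!/usr/bin/env bash
# Base URL of Prometheus; override by exporting PROM_URL before sourcing this.
PROM_URL="${PROM_URL:-http://192.168.2.114:9090}"

# Run an instant PromQL query and print the JSON reply.
# prom_query is an illustrative helper name.
prom_query() {
  # --get sends the data as a URL query string; --data-urlencode escapes
  # braces, quotes and spaces in the PromQL expression.
  curl -fsS --get --data-urlencode "query=$1" "${PROM_URL}/api/v1/query"
}

# Example:
# prom_query 'topk(5, pve_guest_cpu_usage_ratio{type="qemu"})'
```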

Security Considerations

API Credentials

  1. PVE Exporter .env file:

    • Never commit to version control
    • Use strong passwords
    • Restrict file permissions: chmod 600 .env
  2. Proxmox API User:

    • Use dedicated monitoring user
    • Grant minimal required permissions (PVEAuditor role)
    • Consider token-based authentication
  3. Grafana Authentication:

    • Change default admin password
    • Enable OAuth/LDAP for user authentication
    • Use role-based access control

Network Security

  1. Firewall Rules:

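    # On VM 101, restrict access to the homelab subnet.
    # Note: ports published by Docker bypass ufw by default, so also bind
    # published ports to a specific address or filter in the DOCKER-USER chain.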
    ufw allow from 192.168.2.0/24 to any port 3000
    ufw allow from 192.168.2.0/24 to any port 9090
    ufw allow from 192.168.2.0/24 to any port 9221
    
  2. Reverse Proxy:

    • Use Nginx Proxy Manager for SSL termination
    • Implement access lists
    • Enable fail2ban for brute force protection
  3. Docker Security:

    • Run containers as non-root users
    • Use read-only filesystems where possible
    • Limit container capabilities

Performance Tuning

Prometheus Optimization

Scrape Interval:

global:
  scrape_interval: 30s  # Increase for less frequent scraping
  evaluation_interval: 30s

Target Relabeling:

relabel_configs:
  - source_labels: [__address__]
    regex: '.*'   # Matches every target; narrow the regex to filter
    action: keep  # Keep only targets whose address matches the regex

Grafana Optimization

Query Optimization:

  • Use recording rules in Prometheus for complex queries
  • Set appropriate refresh intervals on dashboards
  • Limit time range on expensive queries

Caching:

# In grafana.ini or environment variables
[caching]
enabled = true
ttl = 3600

Advanced Configuration

Alerting with Alertmanager

  1. Add Alertmanager to stack:

    cd /home/jramos/homelab/monitoring
    # Create alertmanager directory with docker-compose.yml
    
  2. Configure alerts in Prometheus:

    # In prometheus.yml
    alerting:
      alertmanagers:
        - static_configs:
            - targets: ['alertmanager:9093']
    
    rule_files:
      - 'alerts.yml'
    
  3. Example alert rules:

    # alerts.yml
    groups:
      - name: proxmox
        interval: 30s
        rules:
          - alert: HighCPUUsage
            expr: pve_node_cpu_usage_ratio > 0.9
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "High CPU usage on {{ $labels.node }}"
    

Multi-Node Proxmox Cluster

For clustered Proxmox environments:

# In pve.yml
cluster1:
  user: monitoring@pve
  password: ${PVE_PASSWORD}
  verify_ssl: false

cluster2:
  user: monitoring@pve
  password: ${PVE_PASSWORD}
  verify_ssl: false

Dashboard Provisioning

Store dashboards as code:

# Create provisioning directory
mkdir -p grafana/provisioning/dashboards

# Add provisioning config
# grafana/provisioning/dashboards/dashboards.yml
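
As a sketch of what the provisioning config could contain, the helper below writes a minimal file-based dashboard provider. The provider name (homelab) and the in-container dashboard path are assumptions; they need to match the volume mounts in grafana/docker-compose.yml.

```shell
#!/usr/bin/env bash
# Write a minimal dashboard provider file for Grafana's file-based provisioning.
# write_dashboard_provider is an illustrative helper; the provider name and
# the dashboard path inside the container are assumptions.
write_dashboard_provider() {
  local dir="${1:-grafana/provisioning/dashboards}"
  mkdir -p "$dir"
  cat > "$dir/dashboards.yml" <<'EOF'
apiVersion: 1
providers:
  - name: homelab
    type: file
    disableDeletion: false
    updateIntervalSeconds: 30
    options:
      path: /var/lib/grafana/dashboards
EOF
  # Print the path of the file that was written.
  echo "$dir/dashboards.yml"
}

# Example (run from /home/jramos/homelab/monitoring):
# write_dashboard_provider
```

Dashboard JSON files exported from the UI then go into the directory mounted at the configured path.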

Integration with Other Services

n8n Workflow Automation

Create workflows in n8n (CT 113) to:

  • Send alerts to Slack/Discord based on Prometheus alerts
  • Generate daily/weekly infrastructure reports
  • Automate backup verification checks

NetBox IPAM

Sync monitoring targets with NetBox (CT 103):

  • Automatically discover new VMs/CTs
  • Update service inventory
  • Link metrics to network documentation

Additional Resources

Documentation

Community Dashboards

  • Grafana Dashboard 10347: Proxmox VE
  • Grafana Dashboard 15356: Proxmox Cluster
  • Grafana Dashboard 15362: Proxmox Summary

Last Updated: 2025-12-07
Maintainer: jramos
VM: 101 (monitoring-docker) at 192.168.2.114
Stack Version: Prometheus 2.x, Grafana 10.x, PVE Exporter latest