Files

Jordan Ramos 004e3da77c feat(agents): optimize sub-agent architecture with comprehensive prompt engineering

This commit implements a comprehensive optimization of all sub-agent prompt
definitions based on Opus-powered prompt engineering analysis. All agents now
match the quality standard established by librarian.md.

Agent Improvements:
- scribe.md: 29→340 lines (11.7x expansion)
  * Added 6 usage examples with role clarity
  * Implemented comprehensive responsibilities section
  * Added 3 complete ASCII diagram templates
  * Included safety protocols and decision frameworks

- backend-builder.md: 40→291 lines (7.3x expansion)
  * Added 6 usage examples with clear boundaries
  * Expanded core responsibilities (Ansible, Terraform, Docker, Python, Shell)
  * Added technology stack and validation rules tables
  * Included handoff protocol for lab-operator deployment
  * Defined clear boundaries (CREATES code, does NOT deploy)

- lab-operator.md: 37→193 lines (5.2x expansion)
  * Added 6 usage examples with role clarity
  * Expanded domain expertise with specific commands
  * Added command style guide (5-step pattern)
  * Included safety protocols and decision-making framework
  * Defined clear boundaries (DEPLOYS/OPERATES, does NOT create IaC)

- librarian.md: Minor formatting improvements

CLAUDE.md Fixes:
- Moved YAML frontmatter to line 1 (was incorrectly at line 89)
- Fixed trailing pipe character
- Completed incomplete sentences about backup strategy and storage growth
- Removed redundant information
- Expanded status file template with recovery instructions

Files Added:
- Claude_UPDATES.md: Comprehensive prompt engineering analysis report
- monitoring/pve-exporter/pve.yml: PVE monitoring configuration

Impact:
- Total agent documentation: 249→967 lines (288% increase)
- Usage examples: 6→24 total (300% increase)
- All agents now have comprehensive safety protocols
- Clear role boundaries prevent agent overlap
- Validation testing confirms all agents functional

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2025-12-07 22:39:40 -07:00

grafana

feat(monitoring): add Prometheus/Grafana monitoring stack

2025-12-07 12:41:22 -07:00

prometheus

feat(monitoring): add Prometheus/Grafana monitoring stack

2025-12-07 12:41:22 -07:00

pve-exporter

feat(agents): optimize sub-agent architecture with comprehensive prompt engineering

2025-12-07 22:39:40 -07:00

README.md

feat(docs): update documentation for monitoring stack and infrastructure changes

2025-12-07 12:41:08 -07:00

README.md

Monitoring Stack

Comprehensive monitoring and observability stack for the Proxmox homelab environment, providing real-time metrics, visualization, and alerting capabilities.

Overview

The monitoring stack consists of three primary components deployed on VM 101 (monitoring-docker) at 192.168.2.114:

Grafana: Visualization and dashboards (Port 3000)
Prometheus: Metrics collection and time-series database (Port 9090)
PVE Exporter: Proxmox VE metrics exporter (Port 9221)

Architecture

┌─────────────────────────────────────────────────────────────────┐
│                    Proxmox Host (serviceslab)                   │
│                         192.168.2.200                           │
└────────────────────────────┬────────────────────────────────────┘
                             │
                             │ API (8006)
                             │
                    ┌────────▼────────┐
                    │  PVE Exporter   │
                    │   Port: 9221    │
                    │  (VM 101)       │
                    └────────┬────────┘
                             │
                             │ Metrics
                             │
                    ┌────────▼────────┐
                    │   Prometheus    │
                    │   Port: 9090    │
                    │  (VM 101)       │
                    └────────┬────────┘
                             │
                             │ Query
                             │
                    ┌────────▼────────┐
                    │     Grafana     │
                    │   Port: 3000    │
                    │  (VM 101)       │
                    └────────┬────────┘
                             │
                             │ HTTPS
                             │
                    ┌────────▼────────┐
                    │  Nginx Proxy    │
                    │   (CT 102)      │
                    │  192.168.2.101  │
                    └─────────────────┘

Components

VM 101: monitoring-docker

Specifications:

IP Address: 192.168.2.114
Operating System: Ubuntu 22.04/24.04 LTS
Docker Version: 24.0+
Purpose: Dedicated monitoring infrastructure host

Resource Allocation:

CPU: 2-4 cores
Memory: 4-8 GB
Storage: 50-100 GB (thin provisioned)

Grafana

Version: Latest stable
Port: 3000
Access: http://192.168.2.114:3000

Features:

Pre-configured Proxmox VE dashboards
Prometheus data source integration
User authentication and authorization
Dashboard templating and variables
Alerting capabilities
Panel plugins for advanced visualizations

Default Credentials:

Username: admin
Password: Set in the .env file; on a fresh install without an override it is admin, and Grafana prompts for a new password at first login

Key Dashboards:

Proxmox Host Overview
VM Resource Utilization
Container Resource Utilization
Storage Pool Metrics
Network Traffic Analysis

Prometheus

Version: Latest stable
Port: 9090
Access: http://192.168.2.114:9090

Configuration: /home/jramos/homelab/monitoring/prometheus/prometheus.yml

Scrape Targets:

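scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  - job_name: 'pve'
    static_configs:
      - targets: ['pve-exporter:9221']
    metrics_path: /pve
    params:
      module: [default]
      # Proxmox host the exporter should query (same target as the curl test under Deployment)
      target: ['192.168.2.200']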

Features:

Time-series metrics database
PromQL query language
Service discovery
Alert manager integration (configurable)
Data retention policies
Remote storage support

Retention Policy: 15 days (the Prometheus default; change it with the --storage.tsdb.retention.time command-line flag)

PVE Exporter

Version: prompve/prometheus-pve-exporter:latest
Port: 9221
Access: http://192.168.2.114:9221

Configuration:

File: /home/jramos/homelab/monitoring/pve-exporter/pve.yml
Environment: /home/jramos/homelab/monitoring/pve-exporter/.env

Proxmox Connection:

default:
  user: monitoring@pve
  password: <stored in .env>
  verify_ssl: false

Metrics Exported:

Proxmox cluster status
Node CPU, memory, disk usage
VM/CT status and resource usage
Storage pool utilization
Network interface statistics
Backup job status
Service health

Environment Variables:

PVE_USER: Proxmox API user (typically monitoring@pve)
PVE_PASSWORD: API user password
PVE_VERIFY_SSL: SSL verification (false for self-signed certs)

Deployment

Prerequisites

VM 101 Setup:

# Install Docker and Docker Compose
curl -fsSL https://get.docker.com | sh
sudo usermod -aG docker "$USER"
# Log out and back in (or run `newgrp docker`) for the group change to take effect

# Verify installation
docker --version
docker compose version

Proxmox API User:

# On Proxmox host, create monitoring user
pveum user add monitoring@pve
pveum passwd monitoring@pve
pveum aclmod / -user monitoring@pve -role PVEAuditor

Clone Repository:

cd /home/jramos
git clone <repository-url> homelab
cd homelab/monitoring

Configuration

PVE Exporter Environment:

cd pve-exporter
nano .env

Add:

PVE_USER=monitoring@pve
PVE_PASSWORD=your-secure-password
PVE_VERIFY_SSL=false
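
The same file can also be written non-interactively with owner-only permissions from the start, so the password is never briefly world-readable. This is a minimal sketch, not part of the repository; the function name `write_pve_env` is hypothetical, and the three variable names are the ones listed above.

```shell
# write_pve_env FILE USER PASSWORD
# Creates FILE with mode 600 and the three PVE_* variables.
# Hypothetical helper: the name and argument order are assumptions.
write_pve_env() {
  (
    umask 077   # new files in this subshell are created owner-only
    printf 'PVE_USER=%s\nPVE_PASSWORD=%s\nPVE_VERIFY_SSL=false\n' "$2" "$3" > "$1"
  )
  chmod 600 "$1"   # also tighten permissions if the file already existed
}

# Example (run inside pve-exporter/):
#   write_pve_env .env monitoring@pve 'your-secure-password'
```

Passing the password as an argument leaves it in shell history, so reading it from a prompt or a password manager first may be preferable.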

Verify Configuration Files:

# Check PVE exporter config
cat pve-exporter/pve.yml

# Check Prometheus config
cat prometheus/prometheus.yml

Deployment Steps

Deploy PVE Exporter:

cd /home/jramos/homelab/monitoring/pve-exporter
docker compose up -d
docker compose logs -f

Deploy Prometheus:

cd /home/jramos/homelab/monitoring/prometheus
docker compose up -d
docker compose logs -f

Deploy Grafana:

cd /home/jramos/homelab/monitoring/grafana
docker compose up -d
docker compose logs -f

Verify All Services:

# Check running containers
docker ps

# Test PVE Exporter
# Quote the URL so the shell does not treat & as "run in background"
curl 'http://192.168.2.114:9221/pve?target=192.168.2.200&module=default'

# Test Prometheus
curl http://192.168.2.114:9090/-/healthy

# Test Grafana
curl http://192.168.2.114:3000/api/health
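
The three checks above can be wrapped into one repeatable probe. The sketch below assumes the addresses and ports used in this README; `MONITOR_HOST`, `PVE_HOST`, `health_url`, and `check_stack` are hypothetical names introduced only for illustration.

```shell
# Defaults are the addresses from this README; override via the environment.
MONITOR_HOST="${MONITOR_HOST:-192.168.2.114}"
PVE_HOST="${PVE_HOST:-192.168.2.200}"

# health_url SERVICE: print the URL to probe for that service
health_url() {
  case "$1" in
    grafana)      printf 'http://%s:3000/api/health\n' "$MONITOR_HOST" ;;
    prometheus)   printf 'http://%s:9090/-/healthy\n' "$MONITOR_HOST" ;;
    pve-exporter) printf 'http://%s:9221/pve?target=%s&module=default\n' "$MONITOR_HOST" "$PVE_HOST" ;;
    *)            return 1 ;;
  esac
}

# check_stack: probe each service once; return non-zero if any probe fails
check_stack() {
  local svc rc=0
  for svc in pve-exporter prometheus grafana; do
    # -f makes curl fail on HTTP errors; --max-time bounds a hung connection
    if curl -fsS --max-time 10 -o /dev/null "$(health_url "$svc")"; then
      echo "OK   $svc"
    else
      echo "FAIL $svc"
      rc=1
    fi
  done
  return "$rc"
}

# Example: check_stack && echo "monitoring stack is healthy"
```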

Initial Grafana Setup

Access Grafana:
- Navigate to http://192.168.2.114:3000
- Login with default credentials (admin/admin)
- Change password when prompted
Add Prometheus Data Source:
- Go to Configuration → Data Sources
- Click "Add data source"
- Select "Prometheus"
- URL: http://prometheus:9090
- Click "Save & Test"
Import Proxmox Dashboard:
- Go to Dashboards → Import
- Dashboard ID: 10347 (Proxmox VE)
- Select Prometheus data source
- Click "Import"
Configure Alerting (Optional):
- Go to Alerting → Contact points (Notification channels in legacy alerting)
- Add email, Slack, or other notification methods
- Create alert rules in dashboards

Network Configuration

Internal Access

All services are accessible within the homelab network:

Grafana: http://192.168.2.114:3000
Prometheus: http://192.168.2.114:9090
PVE Exporter: http://192.168.2.114:9221

External Access (via Nginx Proxy Manager)

Configure reverse proxy on CT 102 (nginx at 192.168.2.101):

Create Proxy Host:
- Domain: monitoring.yourdomain.com
- Scheme: http
- Forward Hostname: 192.168.2.114
- Forward Port: 3000
SSL Configuration:
- Enable "Force SSL"
- Request Let's Encrypt certificate
- Enable HTTP/2
Access List (Optional):
- Create access list for authentication
- Apply to proxy host for additional security

Maintenance

Update Services

# Update all monitoring services
cd /home/jramos/homelab/monitoring

# Update PVE Exporter
cd pve-exporter
docker compose pull
docker compose up -d

# Update Prometheus
cd ../prometheus
docker compose pull
docker compose up -d

# Update Grafana
cd ../grafana
docker compose pull
docker compose up -d
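
Because the three updates repeat the same two commands, they can be folded into a loop. This is a sketch under the directory layout described above; `update_stack` is a hypothetical helper name, and it stops at the first service that fails so a broken pull does not go unnoticed.

```shell
# update_stack [BASE_DIR]: pull and recreate each service, exporter first
update_stack() {
  local base="${1:-/home/jramos/homelab/monitoring}" svc
  for svc in pve-exporter prometheus grafana; do
    echo "==> $svc"
    # Subshell keeps the cd local to this iteration
    ( cd "$base/$svc" && docker compose pull && docker compose up -d ) || {
      echo "update failed for $svc" >&2
      return 1
    }
  done
}

# Example: update_stack
```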

Backup Grafana Dashboards

# Backup Grafana data (no -t: a TTY would corrupt the binary tar stream)
docker exec grafana tar czf - /var/lib/grafana > "grafana-backup-$(date +%Y%m%d).tar.gz"

# Or use Grafana's provisioning
# Dashboards can be exported as JSON and stored in git
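
A backup is only useful if it can be read back, so it may be worth listing the archive right after creating it. The helper below is a sketch; `verify_backup` is a hypothetical name, and it only confirms the file is a readable gzip-compressed tar archive, not that the Grafana data inside is complete.

```shell
# verify_backup FILE: succeed only if FILE is a non-empty, readable .tar.gz
verify_backup() {
  if [ ! -s "$1" ]; then
    echo "missing or empty: $1" >&2
    return 1
  fi
  # Listing the archive forces tar to decompress and walk every entry
  if tar -tzf "$1" >/dev/null 2>&1; then
    echo "readable: $1"
  else
    echo "corrupt or not a tar.gz: $1" >&2
    return 1
  fi
}

# Example: verify_backup "grafana-backup-$(date +%Y%m%d).tar.gz"
```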

Prometheus Data Retention

# Check Prometheus storage size
docker exec prometheus du -sh /prometheus

# Adjust retention in docker-compose.yml:
# command:
#   - '--storage.tsdb.retention.time=30d'
#   - '--storage.tsdb.retention.size=50GB'
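
To choose sensible retention values, the Prometheus storage documentation gives a rough sizing rule: disk needed ≈ retention time in seconds × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample after compression. The sketch below applies that rule; `tsdb_bytes` is a hypothetical helper, and the ingestion rate in the example is an assumed figure, not a measurement from this stack.

```shell
# tsdb_bytes DAYS SAMPLES_PER_SEC BYTES_PER_SAMPLE: print estimated on-disk bytes
tsdb_bytes() {
  echo $(( $1 * 86400 * $2 * $3 ))
}

# Example: 15 days at an assumed 1000 samples/s and 2 bytes per sample
#   tsdb_bytes 15 1000 2    # prints 2592000000 (about 2.6 GB)
```

The real ingestion rate can be read from Prometheus itself with the query `rate(prometheus_tsdb_head_samples_appended_total[5m])`.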

View Logs

# PVE Exporter logs
cd /home/jramos/homelab/monitoring/pve-exporter
docker compose logs -f

# Prometheus logs
cd /home/jramos/homelab/monitoring/prometheus
docker compose logs -f

# Grafana logs
cd /home/jramos/homelab/monitoring/grafana
docker compose logs -f

# Or follow a container by name from any directory
# (-f blocks, so run each command in its own terminal)
docker logs -f pve-exporter
docker logs -f prometheus
docker logs -f grafana

Troubleshooting

PVE Exporter Cannot Connect to Proxmox

Symptoms: No metrics from Proxmox, connection refused errors

Solutions:

Verify Proxmox API is accessible:

curl -k https://192.168.2.200:8006/api2/json/version

Check PVE Exporter environment variables:

cd /home/jramos/homelab/monitoring/pve-exporter
cat .env
docker compose config

Test authentication:

# From VM 101
curl -k -d "username=monitoring@pve&password=yourpassword" \
  https://192.168.2.200:8006/api2/json/access/ticket

Verify user permissions on Proxmox:

# On Proxmox host
pveum user list
pveum aclmod / -user monitoring@pve -role PVEAuditor

Prometheus Not Scraping Targets

Symptoms: Targets shown as down in Prometheus UI

Solutions:

Check Prometheus targets:
- Navigate to http://192.168.2.114:9090/targets
- Verify target status and error messages

Verify network connectivity:

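# The Prometheus image is BusyBox-based and ships wget rather than curl
docker exec prometheus wget -qO- 'http://pve-exporter:9221/pve?target=192.168.2.200&module=default'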

Check Prometheus configuration:

cd /home/jramos/homelab/monitoring/prometheus
docker compose exec prometheus promtool check config /etc/prometheus/prometheus.yml

Reload Prometheus configuration:
```
docker compose restart prometheus
```

Grafana Shows No Data

Symptoms: Dashboards display "No data" or empty graphs

Solutions:

Verify Prometheus data source:
- Go to Configuration → Data Sources
- Test connection to Prometheus
- URL should be http://prometheus:9090
Check Prometheus has data:
- Navigate to http://192.168.2.114:9090
- Run query: up
- Should show all scrape targets
Verify dashboard queries:
- Edit panel
- Check PromQL query syntax
- Test query in Prometheus UI first
Check time range:
- Ensure dashboard time range includes recent data
- Prometheus retention period not exceeded

Docker Compose Network Issues

Symptoms: Containers cannot communicate

Solutions:

Check Docker network:

docker network ls
docker network inspect monitoring_default

Verify container connectivity:

# -c 3 sends three probes and exits instead of pinging indefinitely
docker exec prometheus ping -c 3 pve-exporter
docker exec grafana ping -c 3 prometheus

Recreate network:

cd /home/jramos/homelab/monitoring
docker compose down
docker network prune
docker compose up -d

High Memory Usage

Symptoms: VM 101 running out of memory

Solutions:

Check container memory usage:
```
docker stats
```

Reduce Prometheus retention:

# In prometheus/docker-compose.yml
command:
  - '--storage.tsdb.retention.time=7d'
  - '--storage.tsdb.retention.size=10GB'

Limit Grafana image rendering:

# In grafana/docker-compose.yml
environment:
  - GF_RENDERING_SERVER_URL=
  - GF_RENDERING_CALLBACK_URL=

Increase VM memory allocation in Proxmox

SSL/TLS Certificate Errors

Symptoms: PVE Exporter cannot verify SSL certificate

Solutions:

Set verify_ssl: false in pve.yml (for self-signed certs)

Or import Proxmox CA certificate:

# Copy CA from Proxmox to VM 101
scp root@192.168.2.200:/etc/pve/pve-root-ca.pem .

# Add to trust store
sudo cp pve-root-ca.pem /usr/local/share/ca-certificates/pve-root-ca.crt
sudo update-ca-certificates

Metrics Reference

Key Proxmox Metrics

Node Metrics:

pve_node_cpu_usage_ratio: CPU utilization (0-1)
pve_node_memory_usage_bytes: Memory used
pve_node_memory_total_bytes: Total memory
pve_node_disk_usage_bytes: Root disk used
pve_node_uptime_seconds: Node uptime

VM/CT Metrics:

pve_guest_info: Guest information (labels: id, name, type, node)
pve_guest_cpu_usage_ratio: Guest CPU usage
pve_guest_memory_usage_bytes: Guest memory used
pve_guest_disk_read_bytes_total: Disk read bytes
pve_guest_disk_write_bytes_total: Disk write bytes
pve_guest_network_receive_bytes_total: Network received
pve_guest_network_transmit_bytes_total: Network transmitted

Storage Metrics:

pve_storage_usage_bytes: Storage used
pve_storage_size_bytes: Total storage size
pve_storage_info: Storage information (labels: storage, type)

Useful PromQL Queries

CPU Usage by VM:

pve_guest_cpu_usage_ratio{type="qemu"} * 100

Memory Usage Percentage:

(pve_guest_memory_usage_bytes / pve_guest_memory_size_bytes) * 100

Storage Usage Percentage:

(pve_storage_usage_bytes / pve_storage_size_bytes) * 100

Network Bandwidth (rate):

rate(pve_guest_network_transmit_bytes_total[5m])

Top 5 VMs by CPU:

topk(5, pve_guest_cpu_usage_ratio{type="qemu"})
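
Metric names can differ between exporter versions, so before building queries it helps to list what the running exporter actually exposes. The sketch below extracts names from the `# HELP` lines of the Prometheus text format; `metric_names` is a hypothetical helper, and the URL in the usage comment is the exporter address used elsewhere in this README.

```shell
# metric_names: read Prometheus exposition text on stdin,
# print each metric name that has a "# HELP" line, sorted and de-duplicated
metric_names() {
  awk '$1 == "#" && $2 == "HELP" { print $3 }' | sort -u
}

# Example:
#   curl -s 'http://192.168.2.114:9221/pve?target=192.168.2.200&module=default' | metric_names
```

If a query in this section returns nothing, comparing its metric name against this list is a quick first check.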

Security Considerations

API Credentials

PVE Exporter .env file:
- Never commit to version control
- Use strong passwords
- Restrict file permissions: chmod 600 .env
Proxmox API User:
- Use dedicated monitoring user
- Grant minimal required permissions (PVEAuditor role)
- Consider token-based authentication
Grafana Authentication:
- Change default admin password
- Enable OAuth/LDAP for user authentication
- Use role-based access control

Network Security

Firewall Rules:

# On VM 101, restrict access
ufw allow from 192.168.2.0/24 to any port 3000
ufw allow from 192.168.2.0/24 to any port 9090
ufw allow from 192.168.2.0/24 to any port 9221

Reverse Proxy:
- Use Nginx Proxy Manager for SSL termination
- Implement access lists
- Enable fail2ban for brute force protection
Docker Security:
- Run containers as non-root users
- Use read-only filesystems where possible
- Limit container capabilities

Performance Tuning

Prometheus Optimization

Scrape Interval:

global:
  scrape_interval: 30s  # Increase for less frequent scraping
  evaluation_interval: 30s

Target Relabeling:

relabel_configs:
  - source_labels: [__address__]
    regex: '.*'
    action: keep  # Keep only matching targets

Grafana Optimization

Query Optimization:

Use recording rules in Prometheus for complex queries
Set appropriate refresh intervals on dashboards
Limit time range on expensive queries

Caching:

# In grafana.ini or environment variables
# (query caching is a Grafana Enterprise/Cloud feature)
[caching]
enabled = true
ttl = 3600

Advanced Configuration

Alerting with Alertmanager

Add Alertmanager to stack:

cd /home/jramos/homelab/monitoring
# Create alertmanager directory with docker-compose.yml

Configure alerts in Prometheus:

# In prometheus.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

rule_files:
  - 'alerts.yml'

Example alert rules:

# alerts.yml
groups:
  - name: proxmox
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: pve_node_cpu_usage_ratio > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.node }}"

Multi-Node Proxmox Cluster

For clustered Proxmox environments:

# In pve.yml
cluster1:
  user: monitoring@pve
  password: ${PVE_PASSWORD}
  verify_ssl: false

cluster2:
  user: monitoring@pve
  password: ${PVE_PASSWORD}
  verify_ssl: false

Dashboard Provisioning

Store dashboards as code:

# Create provisioning directory
mkdir -p grafana/provisioning/dashboards

# Add provisioning config
# grafana/provisioning/dashboards/dashboards.yml

Integration with Other Services

n8n Workflow Automation

Create workflows in n8n (CT 113) to:

Send alerts to Slack/Discord based on Prometheus alerts
Generate daily/weekly infrastructure reports
Automate backup verification checks

NetBox IPAM

Sync monitoring targets with NetBox (CT 103):

Automatically discover new VMs/CTs
Update service inventory
Link metrics to network documentation

Additional Resources

Documentation

Community Dashboards

Grafana Dashboard 10347: Proxmox VE
Grafana Dashboard 15356: Proxmox Cluster
Grafana Dashboard 15362: Proxmox Summary

Last Updated: 2025-12-07
Maintainer: jramos
VM: 101 (monitoring-docker) at 192.168.2.114
Stack Version: Prometheus 2.x, Grafana 10.x, PVE Exporter latest
