
Jordan Ramos e481c95da4 docs(security): comprehensive security audit and remediation documentation

- Add SECURITY.md policy with credential management, Docker security, SSL/TLS guidance
- Add security audit report (2025-12-20) with 31 findings across 4 severity levels
- Add pre-deployment security checklist template
- Update CLAUDE_STATUS.md with security audit initiative
- Expand services/README.md with comprehensive security sections
- Add script validation report and container name fix guide

Audit identified 6 CRITICAL, 3 HIGH, 2 MEDIUM findings
4-phase remediation roadmap created (estimated 6-13 min downtime)
All security scripts validated and ready for execution

Related: Security Audit Q4 2025, CRITICAL-001 through CRITICAL-006

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

2025-12-21 13:52:34 -07:00

Security Policy

Version: 1.0
Last Updated: 2025-12-20
Effective Date: 2025-12-20

Overview

This document establishes the security policy and best practices for the homelab infrastructure environment running on Proxmox VE. The policy applies to all virtual machines (VMs), LXC containers, Docker services, and network resources deployed within the homelab.

Scope

This security policy covers:

Proxmox VE infrastructure (serviceslab node at 192.168.2.200)
All virtual machines and LXC containers
Docker containers and compose stacks
Network services and reverse proxies
Authentication and access control systems
Data storage and backup systems
Monitoring and logging infrastructure

Vulnerability Disclosure

Reporting Security Issues

Security vulnerabilities should be reported immediately to the infrastructure maintainer:

Contact: jramos
Repository: http://192.168.2.102:3060/jramos/homelab
Documentation: /home/jramos/homelab/troubleshooting/

Disclosure Process

Report: Submit vulnerability details via secure channel
Acknowledge: Receipt confirmation within 24 hours
Investigate: Assessment and validation within 72 hours
Remediate: Fix deployment based on severity (see SLA below)
Document: Post-remediation documentation in /troubleshooting/
Review: Security audit update and lessons learned

Severity Classification

Severity	Response Time	Example
CRITICAL	< 4 hours	Docker socket exposure, root credential leaks
HIGH	< 24 hours	Unencrypted credentials, missing authentication
MEDIUM	< 72 hours	Weak passwords, missing SSL/TLS
LOW	< 7 days	Informational findings, optimization opportunities
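
The response windows in the table above can be turned into a deadline calculation. The following is a minimal sketch, not part of the existing scripts; the function names `sla_hours` and `response_deadline` are hypothetical, and times are handled as epoch seconds.

```shell
# Hypothetical helper: map a severity label from the table to its response window in hours.
sla_hours() {
  case "$1" in
    CRITICAL) echo 4 ;;
    HIGH)     echo 24 ;;
    MEDIUM)   echo 72 ;;
    LOW)      echo 168 ;;   # 7 days
    *)        echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

# Print the response deadline (epoch seconds) for a finding.
# Arguments: severity label, report time in epoch seconds.
response_deadline() {
  local hours
  hours="$(sla_hours "$1")" || return 1
  echo $(( $2 + hours * 3600 ))
}

# Example (not executed here):
#   response_deadline HIGH "$(date +%s)"
```

Keeping the mapping in one function means a change to the SLA table only needs to be mirrored in one place.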

Security Best Practices

1. Credential Management

1.1 Password Requirements

Minimum Standards:

Length: 16+ characters for administrative accounts
Complexity: Mixed case, numbers, special characters
Uniqueness: No password reuse across services
Rotation: Every 90 days for privileged accounts
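
The length and complexity standards above can be checked mechanically before a credential is accepted. This is a minimal sketch under the assumption that "special characters" means POSIX punctuation; `meets_password_policy` is a hypothetical name, and uniqueness and rotation age are not checked here.

```shell
# Hypothetical helper: exit 0 only if the candidate meets the length and
# complexity standards (16+ characters, mixed case, a digit, a special character).
meets_password_policy() {
  local pw="$1"
  [ "${#pw}" -ge 16 ] || return 1                          # Length: 16+ characters
  case "$pw" in *[[:lower:]]*) ;; *) return 1 ;; esac      # at least one lowercase letter
  case "$pw" in *[[:upper:]]*) ;; *) return 1 ;; esac      # at least one uppercase letter
  case "$pw" in *[[:digit:]]*) ;; *) return 1 ;; esac      # at least one digit
  case "$pw" in *[[:punct:]]*) ;; *) return 1 ;; esac      # at least one special character
  return 0
}

# Example (not executed here):
#   meets_password_policy "$candidate" || echo "password does not meet policy"
```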

Prohibited Practices:

Default passwords (e.g., admin/admin, password, changeme)
Hardcoded credentials in docker-compose files
Plaintext passwords in configuration files
Credentials committed to version control

1.2 Secrets Management

Docker Secrets Strategy:

# BAD: Hardcoded in docker-compose.yml
environment:
  - POSTGRES_PASSWORD=mypassword123

# GOOD: Environment file (.env)
environment:
  - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}

# BETTER: Docker secrets (native in swarm mode; Compose also supports file-based secrets)
secrets:
  - postgres_password

Environment File Protection:

# Ensure .env files are gitignored
echo "*.env" >> .gitignore
echo ".env.*" >> .gitignore

# Set restrictive permissions
chmod 600 /path/to/service/.env
chown root:root /path/to/service/.env
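
Whether an existing .env file actually has the restrictive mode shown above can be verified with a small check. This is a sketch that assumes GNU coreutils `stat` (as shipped on Debian-based hosts); `env_file_is_private` is a hypothetical name.

```shell
# Hypothetical check: succeed only if the file's permission mode is exactly 600.
env_file_is_private() {
  local mode
  mode="$(stat -c '%a' "$1")" || return 1   # GNU stat: print the octal mode
  [ "$mode" = "600" ]
}

# Example (not executed here):
#   env_file_is_private /path/to/service/.env || echo "fix permissions with chmod 600"
```

To confirm the gitignore rule as well, `git check-ignore -q <path>` exits 0 when the path is ignored.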

Credential Storage Locations:

Docker service secrets: /path/to/service/.env (gitignored)
Proxmox credentials: Stored in Proxmox secret storage or .env files
Database passwords: Environment variables, rotated quarterly
API tokens: Environment variables, scoped to minimum permissions

1.3 Credential Rotation

Rotation Schedule:

Credential Type	Frequency	Tool/Script
Proxmox root/API users	90 days	`scripts/security/rotate-pve-credentials.sh`
Database passwords	90 days	`scripts/security/rotate-paperless-password.sh`
JWT secrets	90 days	`scripts/security/rotate-bytestash-jwt.sh`
Service passwords	90 days	`scripts/security/rotate-logward-credentials.sh`
SSH keys	365 days	Manual rotation via Ansible
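
The frequencies in the schedule above reduce to a simple age comparison. A minimal sketch, assuming the last rotation time is recorded somewhere as epoch seconds (how it is recorded is not specified in this policy); `rotation_due` is a hypothetical name.

```shell
# Hypothetical helper: exit 0 when a credential is due for rotation.
# Arguments: last rotation (epoch seconds), maximum age in days,
#            optional current time (epoch seconds, defaults to now).
rotation_due() {
  local last="$1" max_days="$2" now="${3:-$(date +%s)}"
  [ $(( (now - last) / 86400 )) -ge "$max_days" ]
}

# Example (not executed here):
#   rotation_due "$last_rotated" 90 && echo "rotate this credential"
```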

Rotation Workflow:

Backup: Create full backup before rotation (scripts/security/backup-before-remediation.sh)
Generate: Create new credential using password manager or openssl rand -base64 32
Update: Modify .env file or service configuration
Restart: Recreate the affected service so it picks up the new value: docker compose up -d <service> (docker compose restart does not re-read .env changes)
Verify: Test service functionality post-rotation
Document: Record rotation in /troubleshooting/ log file
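
The Generate and Update steps of the workflow above could be scripted roughly as follows. This is a sketch only, not the logic of the existing rotation scripts; `rotate_env_value` is a hypothetical name, and it assumes one `KEY=value` assignment per line in the .env file.

```shell
# Hypothetical helper: replace KEY's value in an env file, keeping a rollback copy.
# Usage: rotate_env_value FILE KEY [NEW_VALUE]
rotate_env_value() {
  local env_file="$1" key="$2" new_value tmp
  # Default to a fresh random secret, as in the Generate step.
  new_value="${3:-$(openssl rand -base64 32)}" || return 1
  cp -p "$env_file" "${env_file}.bak" || return 1          # rollback copy, same mode
  tmp="$(mktemp "${env_file}.XXXXXX")" || return 1
  # Copy every line except the old assignment, then append the new one.
  grep -v "^${key}=" "$env_file" > "$tmp" || true
  printf '%s=%s\n' "$key" "$new_value" >> "$tmp"
  chmod 600 "$tmp" && mv "$tmp" "$env_file"
}

# Example (not executed here):
#   rotate_env_value /path/to/service/.env POSTGRES_PASSWORD
```

The `.bak` copy still contains the old secret, so it should be deleted once the service has been verified. Note also that some services store the credential internally; changing the environment value alone may not change it there.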

2. Docker Security

2.1 Docker Socket Protection

CRITICAL: The Docker socket (/var/run/docker.sock) provides root-level access to the host system.

Current Exposures (as of 2025-12-20 audit):

Portainer: Direct socket mount
Nginx Proxy Manager: Direct socket mount
Speedtest Tracker: Direct socket mount

Remediation Strategy:

# INSECURE: Direct socket mount
volumes:
  - /var/run/docker.sock:/var/run/docker.sock

# SECURE: Use docker-socket-proxy
services:
  socket-proxy:
    image: tecnativa/docker-socket-proxy
    environment:
      - CONTAINERS=1
      - NETWORKS=1
      - SERVICES=1
      - TASKS=0
      - POST=0
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock:ro
    restart: unless-stopped

  portainer:
    image: portainer/portainer-ce
    environment:
      - DOCKER_HOST=tcp://socket-proxy:2375
    # No direct socket mount

Implementation Guide: See scripts/security/docker-socket-proxy/README.md
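
To see which running containers still mount the socket directly, the output of `docker inspect` can be filtered. A minimal sketch, assuming mount paths contain no spaces; `list_socket_mounts` is a hypothetical name and is not one of the repository's scripts.

```shell
# Hypothetical filter: read lines of "<container-name> <mount-source>..." on stdin
# and print the names of containers that mount /var/run/docker.sock.
list_socket_mounts() {
  awk '{
    for (i = 2; i <= NF; i++) {
      if ($i == "/var/run/docker.sock") {
        name = $1
        sub(/^\//, "", name)   # docker inspect prefixes names with "/"
        print name
        break
      }
    }
  }'
}

# Example (not executed here):
#   docker ps -q | xargs -r docker inspect \
#     --format '{{.Name}}{{range .Mounts}} {{.Source}}{{end}}' | list_socket_mounts
```

After the proxy is in place, only the socket-proxy container itself should appear in the output.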

2.2 Container User Privileges

Principle: Containers should run as non-root users whenever possible.

Current Issues (2025-12-20 audit):

Multiple containers running as root (UID 0)
Missing user: directive in docker-compose files

Remediation:

# Add to docker-compose.yml
services:
  myapp:
    image: myapp:latest
    user: "1000:1000"  # Run as non-root user
    # OR use image-specific variables
    environment:
      - PUID=1000
      - PGID=1000

Verification:

# Check running container user
docker exec <container> id

# Should show non-root user:
# uid=1000(appuser) gid=1000(appuser)

2.3 Container Hardening

Security Checklist:

Run as non-root user
Use read-only root filesystem where possible: read_only: true
Drop unnecessary capabilities: cap_drop: [ALL]
Limit resources: mem_limit, cpus
Enable no-new-privileges: security_opt: [no-new-privileges:true]
Use minimal base images (Alpine, distroless)
Scan images for vulnerabilities: trivy image <image> or docker scout cves <image> (docker scan has been retired)

Example Hardened Service:

services:
  secure-app:
    image: secure-app:latest
    user: "1000:1000"
    read_only: true
    security_opt:
      - no-new-privileges:true
    cap_drop:
      - ALL
    cap_add:
      - NET_BIND_SERVICE  # Only if needed
    mem_limit: 512m
    cpus: 0.5
    tmpfs:
      - /tmp:size=100M,mode=1777

2.4 Image Security

Best Practices:

Pin image versions: Use specific tags, not latest

image: nginx:1.25.3-alpine  # GOOD
image: nginx:latest          # BAD

Verify image signatures: Enable Docker Content Trust
```
export DOCKER_CONTENT_TRUST=1
```

Scan for vulnerabilities: Use Trivy or Grype

# Run Trivy from its container image (no local install needed)
docker run --rm aquasec/trivy image nginx:1.25.3-alpine

Use official images: Prefer verified publishers from Docker Hub
Regular updates: Monthly image update cycle
```
docker compose pull
docker compose up -d
```
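
The version-pinning practice can be spot-checked across compose files. The sketch below is a simple text filter, not a YAML parser: it assumes each image is declared on its own `image:` line, and `find_unpinned_images` is a hypothetical name.

```shell
# Hypothetical check: print image references in a compose file that use
# the "latest" tag or no tag at all. Digest-pinned references are accepted.
find_unpinned_images() {
  sed -n 's/^[[:space:]]*image:[[:space:]]*//p' "$1" |
    sed -e 's/[[:space:]]*#.*$//' -e "s/[\"']//g" |
    while IFS= read -r ref; do
      case "$ref" in
        *@sha256:*) continue ;;        # pinned by digest
      esac
      last="${ref##*/}"                # final path component, e.g. nginx:1.25.3-alpine
      case "$last" in
        *:latest) echo "$ref" ;;       # explicit latest tag
        *:*) ;;                        # any other tag is treated as pinned
        *) echo "$ref" ;;              # no tag implies latest
      esac
    done
}

# Example (not executed here):
#   find_unpinned_images docker-compose.yml
```

Taking only the final path component before looking for a tag avoids mistaking a registry port (for example `registry.local:5000/app`) for a version tag.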

3. SSL/TLS Configuration

3.1 Certificate Management

Nginx Proxy Manager (NPM):

Primary SSL termination point for external services
Let's Encrypt integration for automatic certificate renewal
Deployed on CT 102 (192.168.2.101)

Certificate Lifecycle:

Generation: Use Let's Encrypt via NPM UI (http://192.168.2.101:81)
Deployment: Automatic via NPM
Renewal: Automatic via NPM (Let's Encrypt certificates are valid for 90 days and are renewed about 30 days before expiry)
Monitoring: Check NPM dashboard for expiry warnings

Manual Certificate Installation (if needed):

# Copy certificate to service
cp /path/to/cert.pem /path/to/service/certs/
cp /path/to/key.pem /path/to/service/certs/

# Set permissions
chmod 644 /path/to/service/certs/cert.pem
chmod 600 /path/to/service/certs/key.pem
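
For manually installed certificates, which NPM does not renew, expiry can be checked from the command line. A minimal sketch using `openssl x509 -checkend`; `cert_valid_for_days` is a hypothetical name.

```shell
# Hypothetical helper: exit 0 if the certificate stays valid for at least N more days.
cert_valid_for_days() {
  local cert="$1" days="$2"
  # -checkend exits 0 when the certificate will NOT expire within the given seconds.
  openssl x509 -in "$cert" -noout -checkend "$(( days * 86400 ))" >/dev/null
}

# Example (not executed here):
#   cert_valid_for_days /path/to/service/certs/cert.pem 30 || echo "certificate expires within 30 days"
```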

3.2 SSL/TLS Best Practices

Current Gaps (2025-12-20 audit):

Internal services using HTTP (Grafana, Prometheus, PVE Exporter)
Missing HSTS headers on some NPM proxies
No TLS 1.3 enforcement

Remediation Checklist:

Enable SSL for all web UIs (Grafana, Prometheus, Portainer)
Configure NPM to force HTTPS redirects
Enable HSTS headers: Strict-Transport-Security: max-age=31536000
Disable TLS 1.0 and 1.1 (use TLS 1.2+ only)
Use strong cipher suites (Mozilla Intermediate configuration)

NPM SSL Configuration:

# Custom Nginx Configuration (NPM Advanced tab)
add_header Strict-Transport-Security "max-age=31536000; includeSubDomains" always;
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;

ssl_protocols TLSv1.2 TLSv1.3;
ssl_ciphers 'ECDHE-ECDSA-AES128-GCM-SHA256:ECDHE-RSA-AES128-GCM-SHA256';
ssl_prefer_server_ciphers on;

3.3 Internal Service SSL

Grafana HTTPS:

# /etc/grafana/grafana.ini
[server]
protocol = https
cert_file = /etc/grafana/certs/cert.pem
cert_key = /etc/grafana/certs/key.pem

Prometheus HTTPS:

# web-config.yml (passed to Prometheus with --web.config.file=web-config.yml, not part of prometheus.yml)
tls_server_config:
  cert_file: /etc/prometheus/certs/cert.pem
  key_file: /etc/prometheus/certs/key.pem

4. Network Security

4.1 Network Segmentation

Current Architecture:

Single flat network: 192.168.2.0/24
All VMs and containers on same subnet

Recommended Segmentation:

Management VLAN (VLAN 10): 192.168.10.0/24
  - Proxmox node (192.168.10.200)
  - Ansible-Control (192.168.10.106)

Services VLAN (VLAN 20): 192.168.20.0/24
  - Web servers (109, 110)
  - Database server (111)
  - Docker services

DMZ VLAN (VLAN 30): 192.168.30.0/24
  - Nginx Proxy Manager (exposed to internet)
  - Public-facing services

Monitoring VLAN (VLAN 40): 192.168.40.0/24
  - Grafana, Prometheus, PVE Exporter
  - Logging services

Implementation: Use Proxmox VLANs and firewall rules (Phase 4 remediation)

4.2 Firewall Rules

Proxmox Firewall Best Practices:

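# The firewall is managed through the API (pvesh) or /etc/pve/firewall/cluster.fw, not pveum.
# Add the allow rules before enabling the firewall to avoid locking yourself out.

# Allow management access (web UI)
pvesh create /cluster/firewall/rules --type in --action ACCEPT --proto tcp --dport 8006 --source 192.168.2.0/24 --enable 1

# Allow SSH (key-based only)
pvesh create /cluster/firewall/rules --type in --action ACCEPT --proto tcp --dport 22 --source 192.168.2.0/24 --enable 1

# Default deny incoming, then enable the datacenter firewall
pvesh set /cluster/firewall/options --policy_in DROP
pvesh set /cluster/firewall/options --enable 1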

Docker Network Isolation:

# Create isolated networks per service
networks:
  frontend:
    driver: bridge
  backend:
    driver: bridge
    internal: true  # No external access

services:
  web:
    networks:
      - frontend
      - backend

  db:
    networks:
      - backend  # Database not exposed to frontend

4.3 Rate Limiting & DDoS Protection

Current Gaps:

No rate limiting on NPM proxies
No fail2ban deployment
No intrusion detection system (IDS)

NPM Rate Limiting:

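# http context (in NPM: /data/nginx/custom/http.conf); limit_req_zone is not valid inside a server block
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;
limit_req_zone $binary_remote_addr zone=web_limit:10m rate=100r/s;

# Custom Nginx Configuration (NPM proxy host, server context)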
location /api/ {
    limit_req zone=api_limit burst=20 nodelay;
}

location / {
    limit_req zone=web_limit burst=50 nodelay;
}

Fail2ban Deployment (Phase 3 remediation):

# Install on NPM container or host
apt-get install fail2ban

# Configure jail for NPM
cat > /etc/fail2ban/jail.d/npm.conf << EOF
[npm]
enabled = true
port = http,https
filter = npm
logpath = /var/log/nginx/error.log
maxretry = 5
bantime = 3600
EOF

5. Access Control

5.1 Authentication

Multi-Factor Authentication (MFA):

Proxmox: Enable 2FA via TOTP (Google Authenticator, Authy)

# TOTP is enrolled per user in the web UI (Datacenter > Permissions > Two Factor)
# List the factors configured for a user
pveum user tfa list <user@pam>

Portainer: Enable MFA in Portainer settings
Grafana: Enable TOTP 2FA in user preferences
NPM: No native MFA (use reverse proxy authentication)

SSO Integration:

TinyAuth (CT 115) provides SSO for NetBox
Extend to other services using OAuth2/OIDC (Phase 4)

5.2 Authorization

Principle of Least Privilege:

Grant minimum required permissions
Use role-based access control (RBAC) where available
Regular access reviews (quarterly)

Proxmox Roles:

# Create limited user for monitoring
pveum user add monitor@pve
pveum acl modify / --users monitor@pve --roles PVEAuditor

Docker/Portainer Roles:

Admin: Full access to all stacks
User: Access to specific stacks only
Read-only: View-only access for monitoring

5.3 SSH Access

SSH Hardening:

# /etc/ssh/sshd_config
PermitRootLogin no
PasswordAuthentication no
PubkeyAuthentication yes
# Consider a non-standard port
Port 22
AllowUsers jramos ansible-user
MaxAuthTries 3
ClientAliveInterval 300
ClientAliveCountMax 2
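
The two settings that matter most above can be checked in a configuration file with a small script. This sketch reads the file directly and ignores `Include` and `Match` blocks; on a live host, `sshd -T` reports the effective configuration and is the more reliable source. The function names are hypothetical.

```shell
# Hypothetical helper: print the first value set for a keyword
# (sshd uses the first occurrence; keywords are case-insensitive).
sshd_setting() {
  awk -v key="$2" 'tolower($1) == tolower(key) { print tolower($2); exit }' "$1"
}

# Hypothetical check: fail if root login or password authentication is not disabled.
check_sshd_hardening() {
  local cfg="$1" rc=0 key
  for key in PermitRootLogin PasswordAuthentication; do
    if [ "$(sshd_setting "$cfg" "$key")" != "no" ]; then
      echo "FAIL: $key is not set to 'no'"
      rc=1
    fi
  done
  return "$rc"
}

# Example (not executed here):
#   check_sshd_hardening /etc/ssh/sshd_config
```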

SSH Key Management:

Use ED25519 keys: ssh-keygen -t ed25519 -C "your_email@example.com"
Rotate keys annually
Store private keys securely (password manager, SSH agent)
Distribute public keys via Ansible

6. Logging and Monitoring

6.1 Centralized Logging

Current State:

Individual service logs: docker compose logs
No centralized log aggregation

Recommended Stack (Phase 4):

Loki: Log aggregation
Promtail: Log shipping
Grafana: Log visualization

Implementation:

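# loki/docker-compose.yml
# Replace "latest" with a pinned tag before deploying (see Section 2.4)
services:
  loki:
    image: grafana/loki:latest
    # Point Loki at the mounted config file; otherwise the image default is used
    command: -config.file=/etc/loki/loki-config.yml
    ports:
      - 3100:3100
    volumes:
      - ./loki-config.yml:/etc/loki/loki-config.yml
      - loki-data:/loki

  promtail:
    image: grafana/promtail:latest
    command: -config.file=/etc/promtail/promtail-config.yml
    volumes:
      - /var/log:/var/log:ro
      - /var/lib/docker/containers:/var/lib/docker/containers:ro
      - ./promtail-config.yml:/etc/promtail/promtail-config.yml

# Named volumes must be declared at the top level
volumes:
  loki-data: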

6.2 Security Monitoring

Key Metrics to Monitor:

Failed authentication attempts (Proxmox, SSH, services)
Docker socket access events
Privilege escalation attempts
Network traffic anomalies
Resource exhaustion (CPU, memory, disk)

Alerting Rules (Prometheus):

# alerts.yml
groups:
  - name: security
    rules:
      - alert: HighFailedSSHLogins
        expr: rate(ssh_failed_login_total[5m]) > 5
        for: 5m
        annotations:
          summary: "High rate of failed SSH logins"

      - alert: DockerSocketAccess
        expr: increase(docker_socket_access_total[1h]) > 100
        annotations:
          summary: "Unusual Docker socket activity"

6.3 Audit Logging

Proxmox Audit Log:

# View Proxmox audit log
cat /var/log/pve/tasks/index

# Monitor in real-time
tail -f /var/log/pve/tasks/index

Docker Audit Logging:

# docker-compose.yml
services:
  myapp:
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "3"
        labels: "service,environment"

7. Backup and Recovery

7.1 Backup Strategy

Current Implementation:

Proxmox Backup Server (PBS) at 28.27% utilization
Automated daily incremental backups
Weekly full backups

Backup Scope:

All VMs and LXC containers
Docker volumes (manual backup via scripts)
Configuration files (version controlled in Git)

Backup Verification:

# Pre-remediation backup
/home/jramos/homelab/scripts/security/backup-before-remediation.sh

# Confirm the backup exists by listing backup groups (integrity itself is checked by verify jobs on the PBS server)
proxmox-backup-client list --repository <repo>

7.2 Encryption at Rest

Current Gaps (2025-12-20 audit):

PBS backups not encrypted
Docker volumes not encrypted
Sensitive configuration files unencrypted

Remediation (Phase 4):

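# Enable PBS client-side encryption: create a key once, then pass it to every backup
proxmox-backup-client key create /path/to/pbs-encryption.key
proxmox-backup-client backup ... --keyfile /path/to/pbs-encryption.key

# LUKS encryption for sensitive volumes
# WARNING: luksFormat destroys all existing data on the device
cryptsetup luksFormat /dev/sdb
cryptsetup luksOpen /dev/sdb encrypted-volume
mkfs.ext4 /dev/mapper/encrypted-volume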

7.3 Disaster Recovery

Recovery Time Objective (RTO): 4 hours
Recovery Point Objective (RPO): 24 hours

Recovery Procedure:

Assess Damage: Identify failed components
Restore Infrastructure: Rebuild Proxmox node if needed
Restore VMs/Containers: Use PBS restore
Restore Data: Mount backup volumes
Verify Functionality: Test all services
Document Incident: Post-mortem in /troubleshooting/

Recovery Testing: Quarterly DR drills

8. Vulnerability Management

8.1 Vulnerability Scanning

Container Scanning:

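# Install Trivy (apt-key is deprecated; store the repository key in a dedicated keyring)
wget -qO - https://aquasecurity.github.io/trivy-repo/deb/public.key | gpg --dearmor | sudo tee /usr/share/keyrings/trivy.gpg > /dev/null
echo "deb [signed-by=/usr/share/keyrings/trivy.gpg] https://aquasecurity.github.io/trivy-repo/deb $(lsb_release -sc) main" | sudo tee /etc/apt/sources.list.d/trivy.list
sudo apt-get update
sudo apt-get install trivy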

# Scan all running containers
docker ps --format '{{.Image}}' | xargs -I {} trivy image {}

# Scan docker-compose stack
trivy config docker-compose.yml

Host Scanning:

# Install OpenSCAP
apt-get install libopenscap8 openscap-scanner

# Run CIS benchmark scan
oscap xccdf eval --profile cis --results scan-results.xml /usr/share/xml/scap/ssg/content/ssg-ubuntu2004-xccdf.xml

8.2 Patch Management

Update Schedule:

Proxmox VE: Monthly (during maintenance window)
VMs/Containers: Bi-weekly (automated via Ansible)
Docker Images: Monthly (CI/CD pipeline)
Host OS: Weekly (security patches only)

Ansible Patch Playbook:

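# playbooks/patch-systems.yml
- hosts: all
  become: yes
  tasks:
    - name: Update apt cache
      apt:
        update_cache: yes

    - name: Upgrade all packages
      apt:
        upgrade: dist

    # On Debian/Ubuntu systems this file signals that an installed update needs a reboot
    - name: Check whether a reboot is required
      stat:
        path: /var/run/reboot-required
      register: reboot_required_file

    - name: Reboot if required
      reboot:
        msg: "Rebooting after patching"
      when: reboot_required_file.stat.exists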

8.3 Security Baseline Compliance

CIS Docker Benchmark:

See audit report: /home/jramos/homelab/troubleshooting/SECURITY_AUDIT_2025-12-20.md
Current compliance: ~40% (as of 2025-12-20)
Target compliance: 80% (by Q1 2026)

NIST Cybersecurity Framework:

Identify: Asset inventory (CLAUDE_STATUS.md)
Protect: Access control, encryption (this document)
Detect: Monitoring, logging (Grafana, Prometheus)
Respond: Incident response plan (Section 9)
Recover: Backup and DR (Section 7)

9. Incident Response

9.1 Incident Classification

Severity	Definition	Examples
P1 - Critical	Service outage, data breach	Proxmox node failure, credential leak
P2 - High	Degraded service, security vulnerability	Single VM down, HIGH severity finding
P3 - Medium	Non-critical issue	SSL certificate expiry warning
P4 - Low	Informational, enhancement	Log rotation, optimization

9.2 Response Procedure

Phase 1: Detection

Monitor alerts from Grafana/Prometheus
Review logs for anomalies
User-reported issues

Phase 2: Containment

Isolate affected systems (firewall rules, network disconnect)
Preserve evidence (logs, disk images)
Prevent spread (patch vulnerable services)

Phase 3: Eradication

Remove malware/backdoors
Patch vulnerabilities
Reset compromised credentials

Phase 4: Recovery

Restore from clean backups
Verify service functionality
Monitor for recurrence

Phase 5: Post-Incident

Document incident in /troubleshooting/
Update security controls
Conduct lessons learned review

9.3 Communication Plan

Internal Communication:

Incident lead: jramos
Status updates: CLAUDE_STATUS.md
Documentation: /troubleshooting/INCIDENT-YYYY-MM-DD.md

External Communication:

For homelab: Not applicable (internal environment)
For production: Define stakeholder notification procedure

10. Compliance and Auditing

10.1 Security Audits

Audit Schedule:

Quarterly: Internal security review
Annually: Comprehensive security audit
Ad-hoc: After major infrastructure changes

Audit Scope:

Credential management practices
Docker security configuration
SSL/TLS certificate status
Access control policies
Backup and recovery procedures
Vulnerability scan results

Audit Documentation:

Location: /home/jramos/homelab/troubleshooting/SECURITY_AUDIT_*.md
Latest Audit: 2025-12-20 (31 findings)
Next Audit: 2026-03-20 (Q1 2026)

10.2 Compliance Standards

Applicable Standards (for reference/practice):

CIS Docker Benchmark v1.6.0
NIST Cybersecurity Framework v1.1
OWASP Top 10 (for web services)
PCI-DSS v4.0 (if handling payment data - N/A for homelab)

Compliance Tracking:

Checklist: /home/jramos/homelab/templates/SECURITY_CHECKLIST.md
Status: CLAUDE_STATUS.md (Security Status section)
Evidence: /troubleshooting/ and /scripts/security/

10.3 Documentation Requirements

Required Security Documentation:

Security Policy (this document)
Security Audit Reports (/troubleshooting/SECURITY_AUDIT_*.md)
Pre-Deployment Security Checklist (/templates/SECURITY_CHECKLIST.md)
Credential Rotation Procedures (/scripts/security/*.sh)
Incident Response Plan (Section 9 of this document)
Network Topology Diagram (TBD in Phase 4)
Data Flow Diagrams (TBD in Phase 4)
Risk Assessment Matrix (TBD in Q1 2026)

11. Security Checklists

Pre-Deployment Security Checklist

See comprehensive checklist: /home/jramos/homelab/templates/SECURITY_CHECKLIST.md

Quick Validation:

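# Run quick security check
# The script is in the quick-validation-script section of
# /home/jramos/homelab/templates/SECURITY_CHECKLIST.md. bash cannot execute a
# Markdown anchor directly; copy that section's script into a file and run it with bash.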

Quarterly Security Review Checklist

Review and rotate all service credentials
Scan all containers for vulnerabilities (Trivy)
Update all Docker images to latest versions
Review Proxmox audit logs for anomalies
Verify backup integrity and test restore
Review firewall rules and network ACLs
Update SSL certificates (if manual)
Review user access and permissions (RBAC)
Patch Proxmox VE, VMs, and containers
Update security documentation (this file)
Conduct penetration testing (if applicable)
Review and update incident response plan

12. Security Resources

Internal Documentation

Security Audit Report: /home/jramos/homelab/troubleshooting/SECURITY_AUDIT_2025-12-20.md
Security Scripts: /home/jramos/homelab/scripts/security/
Security Checklist: /home/jramos/homelab/templates/SECURITY_CHECKLIST.md
Infrastructure Status: /home/jramos/homelab/CLAUDE_STATUS.md
Service Documentation: /home/jramos/homelab/services/README.md

External Resources

Docker Security:

Proxmox Security:

General Security:

Security Tools:

13. Change Log

Date	Version	Changes	Author
2025-12-20	1.0	Initial security policy creation following comprehensive security audit	jramos / Claude Sonnet 4.5

Document Owner: jramos
Review Frequency: Quarterly
Next Review: 2026-03-20
Classification: Internal Use
Repository: http://192.168.2.102:3060/jramos/homelab
