# Homelab Infrastructure Status

**Last Updated**: 2026-02-03
**Export Reference**: disaster-recovery/homelab-export-20251211-144345
**Current Session:** OpenClaw Deployment - VM 120

## Quick Resume (Current Session Context)

**Where We Are:** OpenClaw deployed and healthy on VM 120. Container running with full security hardening. Backups configured. Manual steps remain for NPM proxy host, Twingate resource, and Prometheus config on VM 101.

**Completed:**
- [x] Config files created (`services/openclaw/`)
- [x] VM 120 created and hardened (UFW, fail2ban, node-exporter, openclaw user)
- [x] OpenClaw container deployed and healthy (v2026.2.1)
- [x] Security verified (cap_drop ALL, non-root, read-only FS, no docker.sock)
- [x] Prometheus scrape target added to repo copy
- [x] PBS backup job created (daily 02:00, snapshot, zstd)
- [x] Application backup script + weekly cron configured
- [x] Documentation updated (README, services/README, CLAUDE_STATUS, INDEX)
- [x] node_exporter installed and serving metrics on 192.168.2.120:9100

**Manual Steps Remaining:**
- [ ] NPM: Create proxy host for openclaw.apophisnetworking.net -> 192.168.2.120:18789 (WebSocket support, SSL, TinyAuth)
- [ ] Twingate: Add resource for 192.168.2.120 ports 18789/18790/1455
- [ ] VM 101: Deploy updated prometheus.yml via Proxmox web console (SSH not configured)
- [ ] Configure at least one LLM provider API key in /opt/openclaw/.env

---

## Current Infrastructure Snapshot

### Proxmox Environment
- **Node**: serviceslab
- **Version**: Proxmox VE 8.4.0
- **Management IP**: 192.168.2.100
- **Architecture**: Single-node cluster
- **Total Resources**: 10 VMs, 2 Templates, 5 LXC Containers

---

## Virtual Machines (QEMU/KVM) - 10 VMs

| VM ID | Name | IP Address | Status | Purpose |
|-------|------|------------|--------|---------|
| 100 | docker-hub | 192.168.2.102 | Running | Container registry/Docker hub mirror |
| 101 | monitoring-docker | 192.168.2.114 | Running | Monitoring stack (Grafana/Prometheus/PVE Exporter) |
| 105 | dev | - | Stopped | General-purpose development workstation |
| 106 | Ansible-Control | 192.168.2.XXX | Running | IaC orchestration, configuration management |
| 108 | CML | - | Stopped | Cisco Modeling Labs - network simulation |
| 109 | web-server-01 | 192.168.2.XXX | Running | Web application server (clustered) |
| 110 | web-server-02 | 192.168.2.XXX | Running | Load-balanced pair with web-server-01 |
| 111 | db-server-01 | 192.168.2.XXX | Running | Backend database server |
| 114 | haos | 192.168.2.XXX | Running | Home Assistant OS - smart home automation platform |
| 120 | openclaw | 192.168.2.120 | Running | OpenClaw AI chatbot gateway |

**Recent Changes**:
- Added VM 120 (openclaw) for multi-platform AI chatbot gateway (2026-02-03)
- Added VM 101 (monitoring-docker) for dedicated monitoring infrastructure
- Removed VM 101 (gitlab) - service decommissioned

---

## VM Templates - 2 Templates

| Template ID | Name | Purpose |
|-------------|------|---------|
| 104 | ubuntu-dev | Ubuntu development environment template for cloning |
| 107 | ubuntu-docker | Ubuntu Docker host template for rapid deployment |

**Note**: Templates are immutable base images used for cloning new VMs, not running workloads. They provide standardized configurations for consistent infrastructure provisioning.

---

## Containers (LXC) - 5 Containers

| CT ID | Name | IP Address | Status | Purpose |
|-------|------|------------|--------|---------|
| 102 | nginx | 192.168.2.101 | Running | Reverse proxy/load balancer & NPM |
| 103 | netbox | 192.168.2.XXX | Running | Network documentation/IPAM |
| 112 | twingate-connector | 192.168.2.XXX | Running | Zero-trust network access connector |
| 113 | n8n | 192.168.2.113 | Running | Workflow automation platform |
| 115 | tinyauth | 192.168.2.10 | Running | SSO authentication layer for NetBox |

**Recent Changes**:
- Added CT 115 (tinyauth) for SSO authentication integration with NetBox
- Added CT 112 (twingate-connector) for zero-trust network security
- Added CT 113 (n8n) for workflow automation
- Removed CT 112 (Anytype) - replaced by n8n

---

## Storage Architecture

| Storage Pool | Type | Total | Used | % Used | Purpose |
|--------------|------|-------|------|--------|---------|
| local | Directory | - | - | 19.11% | System files, ISOs, templates |
| local-lvm | LVM-Thin | - | - | 0.01% | VM disk images (thin provisioned) |
| Vault | NFS/Directory | - | - | 12.13% | Secure storage for sensitive data |
| PBS-Backups | PBS | - | - | 28.27% | Automated backup repository |
| iso-share | NFS/CIFS | - | - | 1.45% | Installation media library |
| localnetwork | Network Share | - | - | N/A | Shared resources across infrastructure |

**Capacity Notes**:
- PBS-Backups utilization increased to 28.27% (healthy retention)
- Vault utilization increased to 12.13% (data growth monitored)
- local storage at 19.11% (system overhead within normal range)

---

## Key Services & Stacks

### Monitoring & Observability (NEW)
**VM 101** - monitoring-docker (192.168.2.114)
- **Grafana**: Port 3000 - Visualization and dashboards
- **Prometheus**: Port 9090 - Metrics collection and time-series database
- **PVE Exporter**: Port 9221 - Proxmox VE metrics exporter
- **Documentation**: `/home/jramos/homelab/monitoring/README.md`
- **Status**: Fully operational

### Network Security (NEW)
**CT 112** - twingate-connector
- **Purpose**: Zero-trust network access
- **Type**: Lightweight connector
- **Status**: Running
- **Integration**: Connects homelab to Twingate network

### Automation & Integration
**CT 113** - n8n (192.168.2.113)
- **Purpose**: Workflow automation platform
- **Technology**: n8n.io
- **Database**: PostgreSQL 16.11 with pgvector 0.8.1 (upgraded from 15+)
- **Features**: API integration, scheduled workflows, webhook triggers
- **Documentation**: `/home/jramos/homelab/services/README.md#n8n-workflow-automation`
- **Status**: Operational (resolved database locale issues)

### Authentication & SSO
**CT 115** - tinyauth (192.168.2.10)
- **Purpose**: Lightweight SSO authentication layer
- **Technology**: TinyAuth v4 (Docker container)
- **Port**: 8000
- **Domain**: tinyauth.apophisnetworking.net
- **Integration**: Authentication gateway for NetBox via Nginx Proxy Manager
- **Security**: Bcrypt-hashed credentials, HTTPS enforcement
- **Documentation**: `/home/jramos/homelab/services/tinyauth/README.md`
- **Status**: Operational

### AI Chatbot Gateway
**VM 120** - openclaw (192.168.2.120)
- **Purpose**: Multi-platform AI chatbot gateway
- **Technology**: OpenClaw (Docker container)
- **Ports**: 18789 (Gateway WS+UI), 18790 (Bridge), 1455 (OAuth)
- **Domain**: openclaw.apophisnetworking.net
- **LLM Providers**: Anthropic, OpenAI, Ollama
- **Messaging**: Discord, Telegram, Slack, WhatsApp
- **Security**: CVE-2026-25253 patched (v2026.2.1), cap_drop ALL, non-root, read-only FS
- **Documentation**: `/home/jramos/homelab/services/openclaw/README.md`
- **Status**: Operational - Container healthy

### Infrastructure Documentation
**CT 103** - netbox
- **Purpose**: Network documentation and IPAM
- **Status**: Running (activated 2025-12-11; previously stopped for on-demand use)
- **Function**: Infrastructure source of truth

### Reverse Proxy & Load Balancing
**CT 102** - nginx (192.168.2.101)
- **Purpose**: Nginx Proxy Manager
- **Ports**: 80, 81, 443
- **Function**: SSL termination, reverse proxy, certificate management
- **Upstream Services**: All web-facing applications

### Three-Tier Application Stack
**Web Tier**:
- VM 109 (web-server-01) - Primary web server
- VM 110 (web-server-02) - Load-balanced pair

**Database Tier**:
- VM 111 (db-server-01) - Backend database

**Proxy Tier**:
- CT 102 (nginx) - Load balancer and SSL termination

### Development & Automation
**VM 106** - Ansible-Control
- **Purpose**: Infrastructure as Code orchestration
- **Tools**: Ansible, Terraform/OpenTofu (potential)
- **Status**: Running

### Container Registry
**VM 100** - docker-hub
- **Purpose**: Local Docker registry and hub mirror
- **Function**: Caching container images for faster deployments
- **Status**: Running

### Network Simulation
**VM 108** - CML
- **Purpose**: Cisco Modeling Labs
- **Function**: Network topology testing and simulation
- **Status**: Stopped (resource-intensive, on-demand use)

---

## Architecture Patterns

### Monitoring & Observability (NEW)
The infrastructure now implements a comprehensive monitoring stack following industry best practices:

- **Metrics Collection**: Prometheus scraping Proxmox metrics via PVE Exporter
- **Visualization**: Grafana providing real-time dashboards and alerting
- **Isolation**: Dedicated VM for monitoring services (fault isolation)
- **Integration**: Ready for AlertManager, additional exporters, and integrations

**Design Decision**: VM-based deployment provides kernel-level isolation and prevents resource contention with critical infrastructure services.

### Zero-Trust Security (NEW)
Implementation of zero-trust network access principles:

- **Twingate Connector**: Lightweight connector providing secure access without VPNs
- **Container Deployment**: LXC container for minimal resource overhead
- **Network Segmentation**: Secure access to homelab from external networks

**Design Decision**: LXC container chosen for quick provisioning and low resource consumption.

### Automation-First Approach
Workflow automation and infrastructure orchestration:

- **n8n Platform**: Visual workflow builder for API integrations
- **Scheduled Tasks**: Automated backup checks, monitoring alerts, reports
- **Integration Hub**: Connects monitoring, documentation, and operational tools

**Design Decision**: PostgreSQL backend ensures data persistence and supports complex workflows.

### Tiered Application Architecture
Classic three-tier design for production-like environments:

- **Presentation Tier**: Paired web servers (109, 110) behind load balancer
- **Business Logic**: Application processing on web tier
- **Data Tier**: Dedicated database server (111) with backup strategy

**Design Decision**: Separation of concerns, scalability testing, high availability patterns.

### Selective Containerization Strategy
Hybrid approach balancing performance and resource efficiency:

- **LXC Containers**: Stateless services (nginx, netbox, twingate, n8n)
- **Full VMs**: Complex applications, kernel dependencies, heavy workloads
- **Rationale**: LXC for ~10x lower overhead, VMs for isolation and compatibility

---

## Recent Infrastructure Changes

### 2026-02-03: OpenClaw AI Chatbot Gateway Deployment (In Progress)

**Service**: VM 120 - OpenClaw multi-platform AI chatbot gateway

**Purpose**: Bridge messaging platforms (Discord, Telegram, Slack, WhatsApp) with LLM providers (Anthropic, OpenAI, Ollama) through a unified gateway.

**Specifications**:
- **VM**: 120 (cloned from template 107, ubuntu-docker)
- **IP**: 192.168.2.120
- **Resources**: 4 vCPUs, 16GB RAM, 50GB disk on Vault (ZFS)
- **Ports**: 18789 (Gateway WS+UI), 18790 (Bridge), 1455 (OAuth)
- **Domain**: openclaw.apophisnetworking.net
- **Image**: ghcr.io/openclaw/openclaw:2026.2.1

**Security Hardening**:
- Version >= 2026.2.1 (patches CVE-2026-25253, CVSS 8.8 1-click RCE)
- All ports bound to 127.0.0.1 (reverse proxy required)
- Docker: cap_drop ALL, no-new-privileges, read-only filesystem, non-root user (1001:1001)
- UFW: deny-all + whitelist 192.168.2.0/24 + 192.168.1.91 (desktop PC)
- fail2ban on SSH (3 retries), unattended-upgrades
- Prometheus node_exporter at port 9100
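
The Docker-level items in the hardening list can be checked mechanically. The sketch below is only an illustration: it assumes the Compose service has already been parsed into a Python dict, and the `openclaw` dict is a hypothetical stand-in built from the settings listed above, not a copy of the repository's `docker-compose.override.yml`. The function name `check_hardening` is likewise an assumption.

```python
def check_hardening(service: dict) -> list[str]:
    """Return the hardening problems found in one parsed Compose service."""
    problems = []

    # cap_drop must drop every capability.
    if [str(c).upper() for c in service.get("cap_drop", [])] != ["ALL"]:
        problems.append("cap_drop must be exactly [ALL]")

    # Compose accepts both "no-new-privileges:true" and "no-new-privileges=true".
    opts = [str(o).replace("=", ":") for o in service.get("security_opt", [])]
    if "no-new-privileges:true" not in opts:
        problems.append("security_opt must include no-new-privileges:true")

    if service.get("read_only") is not True:
        problems.append("read_only must be true")

    # The user must be a numeric UID other than 0 (root).
    uid = str(service.get("user", "")).split(":")[0]
    if not uid.isdigit() or int(uid) == 0:
        problems.append("user must be a numeric non-root UID")

    # Every published port must be bound to the loopback address.
    for mapping in service.get("ports", []):
        if not str(mapping).startswith("127.0.0.1:"):
            problems.append(f"port {mapping} is not bound to 127.0.0.1")

    # The Docker socket must never be mounted into the container.
    for volume in service.get("volumes", []):
        if "docker.sock" in str(volume):
            problems.append("docker.sock must not be mounted")

    return problems


# Hypothetical service definition mirroring the settings listed above.
openclaw = {
    "image": "ghcr.io/openclaw/openclaw:2026.2.1",
    "user": "1001:1001",
    "read_only": True,
    "cap_drop": ["ALL"],
    "security_opt": ["no-new-privileges:true"],
    "ports": [
        "127.0.0.1:18789:18789",
        "127.0.0.1:18790:18790",
        "127.0.0.1:1455:1455",
    ],
}

print(check_hardening(openclaw))
```

An empty list means every check passed. Host-level items (UFW, fail2ban, unattended-upgrades) are outside what a Compose file can express and are not covered here.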

**Completed Steps**:
- [x] Docker Compose configuration files created
- [x] Security hardening overlay (docker-compose.override.yml)
- [x] Environment variable template (.env.example)
- [x] Prometheus scrape target added
- [x] Documentation created (README, services/README, CLAUDE_STATUS, INDEX)
- [x] VM 120 Creation & SSH Setup
- [x] OS Hardening (UFW, user creation)

**Pending Steps**:
- [ ] NPM reverse proxy configuration (manual - web UI)
- [ ] Twingate resource creation (manual - admin console)
- [ ] Prometheus config on VM 101 (manual - no SSH access)
- [ ] Configure LLM provider API key in .env

**Status**: Container healthy - Manual network integration remaining

---

### 2025-12-20: Comprehensive Security Audit Completed

**Activity:** Complete infrastructure security assessment and remediation planning

**Audit Scope:**
- All Docker Compose services (Portainer, NPM, Paperless-ngx, ByteStash, Speedtest Tracker, FileBrowser)
- Proxmox VE infrastructure and API access
- Network security and segmentation
- Credential management and storage
- SSL/TLS configuration
- Container security and runtime configuration

**Findings Summary:**
- **CRITICAL (6)**: Docker socket exposure, hardcoded credentials, database passwords in git
- **HIGH (3)**: Missing SSL/TLS, weak passwords, containers running as root
- **MEDIUM (2)**: SSL verification disabled, missing authentication
- **LOW (20)**: Documentation gaps, monitoring improvements, backup encryption

**Deliverables:**
1. **Security Policy** (`SECURITY.md`): 864 lines - Comprehensive security best practices
2. **Audit Report** (`troubleshooting/SECURITY_AUDIT_2025-12-20.md`): 2,350 lines - Detailed findings and remediation plan
3. **Security Checklist** (`templates/SECURITY_CHECKLIST.md`): 750 lines - Pre-deployment validation template
4. **Validation Report** (`scripts/security/VALIDATION_REPORT.md`): 2,092 lines - Script safety assessment
5. **Container Fixes** (`scripts/security/CONTAINER_NAME_FIXES.md`): 621 lines - Container name verification
6. **Security Scripts** (8 total):
   - `verify-service-status.sh` - Service health checker
   - `backup-before-remediation.sh` - Comprehensive backup utility
   - `rotate-pve-credentials.sh` - Proxmox credential rotation
   - `rotate-paperless-password.sh` - Database password rotation
   - `rotate-bytestash-jwt.sh` - JWT secret rotation
   - `rotate-logward-credentials.sh` - Multi-service credential rotation
   - `docker-socket-proxy/docker-compose.yml` - Security proxy deployment
   - `portainer/docker-compose.socket-proxy.yml` - Portainer migration config

**Script Validation:**
- **Ready for execution**: 5/8 scripts (verify-service-status.sh, rotate-pve-credentials.sh, rotate-bytestash-jwt.sh, backup-before-remediation.sh, docker-socket-proxy)
- **Needs container name fixes**: 3/8 scripts (see CONTAINER_NAME_FIXES.md)

**4-Phase Remediation Roadmap:**
- Phase 1 (Week 1): Immediate actions - Backups, secrets migration
- Phase 2 (Weeks 2-3): Low-risk changes - Socket proxy, credential rotation
- Phase 3 (Month 2): High-risk changes - Service migrations, SSL/TLS
- Phase 4 (Quarter 1): Infrastructure - Network segmentation, scanning pipelines

**Estimated Timeline:**
- Total downtime: 6-13 minutes (sequential script execution)
- Full remediation: 8-16 weeks

**Risk Assessment:**
- Current risk: HIGH - Multiple CRITICAL vulnerabilities active
- Post-Phase 1 risk: MEDIUM - Credential exposure mitigated
- Post-Phase 3 risk: LOW - All CRITICAL/HIGH findings remediated
- Post-Phase 4 risk: VERY LOW - Defense-in-depth implemented

**Status:** Documentation complete, awaiting remediation execution approval

---

### 2025-12-18: TinyAuth SSO Deployment

**Service Deployed:** CT 115 - TinyAuth authentication layer

**Purpose:** Centralized SSO authentication for NetBox and future homelab services

**Specifications:**
- **Container**: CT 115 (LXC with Docker)
- **IP Address**: 192.168.2.10
- **Domain**: tinyauth.apophisnetworking.net
- **Port**: 8000 (external), 3000 (internal)
- **Docker Image**: ghcr.io/steveiliop56/tinyauth:v4
- **Resource Usage**: ~50-100 MB memory, <1% CPU

**Integration Architecture:**
- Internet → Nginx Proxy Manager (CT 102) → TinyAuth (CT 115) → NetBox (CT 103)
- NPM uses `auth_request` directive to validate credentials via TinyAuth
- Bcrypt-hashed password storage for security
- HTTPS enforcement via NPM SSL termination

**Issues Resolved During Deployment:**
1. **500 Internal Server Error**: Fixed Nginx advanced config syntax
2. **IP addresses not allowed**: Changed APP_URL from IP to domain
3. **Port mapping**: Corrected Docker port mapping from 8000:8000 to 8000:3000
4. **Invalid password**: Implemented bcrypt hash requirement for TinyAuth v4

**Integration Impact:**
- NetBox now protected by centralized authentication
- Foundation for extending SSO to other services (Grafana, Proxmox UI future candidates)
- Authentication logs available for security auditing

**Documentation:** Complete guide at `/home/jramos/homelab/services/tinyauth/README.md`

**Status:** ✅ Operational - Successfully authenticating NetBox access

---

### 2025-12-11: Loki-Stack Monitoring Fully Operational

**Issue Resolved:** Centralized logging pipeline now receiving syslog from UniFi router

**Root Cause:** rsyslog filter in `/etc/rsyslog.d/unifi-router.conf` was configured for wrong source IP (192.168.1.1 instead of 192.168.2.1)

**Fix Applied:** Updated rsyslog filter to match VLAN 2 gateway IP (192.168.2.1)

**Status:** ✅ Complete - Logs flowing UniFi → rsyslog → Promtail → Loki → Grafana

**Services Affected:**
- VM 101 (monitoring-docker): rsyslog configuration updated
- Loki-stack: All components operational
- Grafana: Dashboards receiving real-time syslog data

**Technical Details:** See `troubleshooting/loki-stack-bugfix.md` for complete 5-phase troubleshooting history

---

### 2025-12-11: Infrastructure Expansion & System Updates

#### Proxmox VE Platform Upgrade
- **Upgraded**: Proxmox VE 8.3.3 → 8.4.0
- **Kernel**: 6.8.12-8-pve
- **pve-manager**: 8.4.14
- **Impact**: Enhanced performance, security updates, bug fixes
- **Status**: ✅ Complete - All VMs and containers operating normally

#### New VM 114: Home Assistant OS Deployment
- **Service**: haos (Home Assistant Operating System)
- **Purpose**: Smart home automation and integration platform
- **Specifications**:
  - Memory: 4 GB (87% utilized)
  - CPU: 2 vCPUs
  - Boot Disk: 50 GB
  - Status: Running (~3 days uptime)
- **Rationale**: Centralized home automation hub for IoT device management
- **Integration**: Will integrate with monitoring stack for infrastructure metrics

#### CT 103: NetBox IPAM Activated
- **Service**: netbox (Network Documentation & IPAM)
- **Status Change**: Stopped → Running
- **Uptime**: ~3.1 days
- **Resource Usage**: 1.28 GB / 2 GB memory (64%)
- **Purpose**: Active network documentation and IP address management
- **Rationale**: Required for ongoing infrastructure expansion planning

#### Storage Utilization Trends
- **PBS-Backups**: 27.43% → 28.27% (+0.84%) - Normal backup retention growth
- **Vault (ZFS)**: 10.88% → 12.13% (+1.25%) - Data accumulation monitored
- **local**: 15.13% → 19.11% (+3.98%) - New VM deployment and system updates
- **iso-share**: 1.4% → 1.45% (+0.05%) - Minimal change
- **local-lvm**: 0.0% → 0.01% (+0.01%) - Thin provisioned storage baseline
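
The changes in parentheses are differences in percentage points (current minus previous), not relative growth. As a small worked check, the sketch below recomputes them from the figures in the list; `Decimal` is used so the subtraction is exact. The dictionary layout and the `delta` helper are illustrative choices, not part of any existing script.

```python
from decimal import Decimal

# Utilization percentages copied from the list above: (previous, current).
pools = {
    "PBS-Backups": ("27.43", "28.27"),
    "Vault": ("10.88", "12.13"),
    "local": ("15.13", "19.11"),
    "iso-share": ("1.40", "1.45"),
    "local-lvm": ("0.00", "0.01"),
}


def delta(previous: str, current: str) -> Decimal:
    """Change in utilization, expressed in percentage points."""
    return Decimal(current) - Decimal(previous)


deltas = {name: delta(prev, cur) for name, (prev, cur) in pools.items()}

for name, change in deltas.items():
    print(f"{name}: {change:+} percentage points")

# The pool that grew the most since the previous export.
fastest = max(deltas, key=deltas.get)
print("largest increase:", fastest)
```

The largest increase is on `local`, which matches the note attributing it to the new VM deployment and system updates.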

---

### 2025-12-25: RAG Vector Search - Phase 3 Complete

**Activity:** Implemented and debugged production-ready vector search system for AI-powered documentation retrieval

**Deliverables:**
1. **Production Module** (`n8n/vector_search.py`): Complete API for semantic search
   - `search_similar_documents()` - Query with natural language
   - `insert_document()` - Add documents with embeddings
   - `get_stats()` - Database statistics
   - `delete_by_repo()` - Bulk cleanup
   - CLI interface for testing and manual operations

2. **Documentation Suite:**
   - `SESSION_HANDOFF_PHASE4_READY.md` (17KB) - Comprehensive learning guide for next session
   - `PHASE3_COMPLETE.md` (12KB) - Complete debugging summary and deployment guide
   - `VECTOR_SEARCH_DEBUG.md` (4.7KB) - Technical root cause analysis
   - `VECTOR_SEARCH_COMPARISON.md` (2.5KB) - Before/after code comparison

3. **Diagnostic Scripts** (8 total):
   - Embedding storage repair, parameter binding tests, SQL validation
   - All scripts validated and preserved for reference

**Technical Achievement:**
- PostgreSQL 16.11 + pgvector 0.8.1 fully operational on CT 113
- Vector similarity search returning accurate scores (0.5765 for related concepts)
- Resolved 2 critical bugs:
  1. psycopg2 parameter handling for pgvector types (must cast in SQL, not Python)
  2. ORDER BY with vector operations (subquery pattern required)

**Validation Results:**
- Query: "How do I create snapshots of virtual machines?"
- Result: 0.5765 similarity to backup documentation
- Interpretation: Correctly identifies semantic relationship between "snapshots" and "backups"

**Infrastructure:**
- Database: n8n_db on CT 113
- Table: rag_embeddings (id, source_repo, file_path, chunk_text, embedding vector(768), metadata jsonb)
- Embedding API: Ollama at 192.168.1.81:11434 (nomic-embed-text, 768 dimensions)
- Storage overhead: ~3KB per vector, ~5KB per document total
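
The "~3KB per vector" figure follows from the vector size: pgvector stores each dimension as a 4-byte float plus 8 bytes of per-vector overhead, so a 768-dimension embedding takes 3,080 bytes. The sketch below works through that arithmetic and projects table size from the ~5KB-per-document figure above. The 10,000-chunk count is a made-up example, and the projection ignores index and TOAST overhead.

```python
DIMENSIONS = 768            # nomic-embed-text output size (from the list above)
BYTES_PER_DIMENSION = 4     # pgvector stores single-precision floats
VECTOR_OVERHEAD_BYTES = 8   # fixed per-vector overhead in pgvector
BYTES_PER_ROW = 5 * 1024    # "~5KB per document total" from the list above


def vector_bytes(dimensions: int = DIMENSIONS) -> int:
    """Storage taken by one pgvector value with the given dimension count."""
    return BYTES_PER_DIMENSION * dimensions + VECTOR_OVERHEAD_BYTES


def estimate_table_bytes(rows: int, bytes_per_row: int = BYTES_PER_ROW) -> int:
    """Rough table size for a number of stored chunks (indexes excluded)."""
    return rows * bytes_per_row


print(vector_bytes())                            # bytes per 768-dim vector
hypothetical_rows = 10_000                       # assumed chunk count, for illustration
size = estimate_table_bytes(hypothetical_rows)
print(f"{size / 1024 / 1024:.1f} MiB for {hypothetical_rows} chunks")
```

At this scale the table stays well under 100 MiB, so storage on CT 113 is unlikely to be the limiting factor for documentation-sized corpora.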

**Status:** ✅ Phase 3 Complete | Phase 4 Ready to Start
**Next Steps:** Build n8n ingestion workflow to load homelab documentation from Gitea

---

### 2025-12-07: Infrastructure Documentation & Monitoring Stack

#### Additions
1. **VM 101 (monitoring-docker)**: New dedicated monitoring infrastructure
   - Grafana for visualization
   - Prometheus for metrics collection
   - PVE Exporter for Proxmox integration
   - IP: 192.168.2.114

2. **CT 112 (twingate-connector)**: Zero-trust network security
   - Lightweight connector
   - Secure remote access without VPN

3. **CT 113 (n8n)**: Workflow automation platform
   - PostgreSQL 16.11 backend (upgraded from 15+)
   - pgvector 0.8.1 extension for vector search
   - IP: 192.168.2.113
   - Resolved database locale issues

#### Modifications
- Storage utilization updated across all pools
- PBS-Backups now at 27.43% (increased retention)
- Vault optimized to 10.88% (reduced usage)

#### Removals
- **VM 101 (gitlab)**: Decommissioned (previously at this ID)
- **CT 112 (Anytype)**: Replaced by n8n for better integration

#### Documentation Updates
- Created comprehensive monitoring stack documentation
- Updated all infrastructure tables with current VMs/CTs
- Added architecture patterns for observability and zero-trust
- Updated storage statistics
- Referenced latest export: disaster-recovery/homelab-export-20251207-120040

---

## Repository Structure

```
homelab/
    n8n/  # RAG Vector Search Implementation (NEW)
        vector_search.py  # Production module for vector operations
        SESSION_HANDOFF_PHASE4_READY.md  # Learning guide for next session
        PHASE3_COMPLETE.md  # Phase 3 debugging and achievements summary
        fix_embedding_storage.py  # Diagnostic script (embedding repair)
        test_direct_sql.py  # Diagnostic script (query testing)
        test_vector_search_working.py  # Validated working implementation
        test_parameter_binding.py  # Diagnostic script (psycopg2 debugging)
        test_pgvector_direct.sql  # Raw SQL tests for pgvector
        VECTOR_SEARCH_DEBUG.md  # Technical debugging documentation
        VECTOR_SEARCH_COMPARISON.md  # Before/after code comparison
        README_VECTOR_SEARCH.md  # Comprehensive setup guide
    monitoring/  # Monitoring stack configurations
        README.md  # Comprehensive monitoring documentation
        grafana/
            docker-compose.yml
        prometheus/
            docker-compose.yml
            prometheus.yml
        pve-exporter/
            docker-compose.yml
            pve.yml
            .env
    services/  # Docker Compose service configurations
        n8n/  # n8n workflow automation
        netbox/  # Network documentation & IPAM
        openclaw/  # OpenClaw AI chatbot gateway (VM 120)
        tinyauth/  # SSO authentication layer
        README.md  # Services overview (updated)
    disaster-recovery/
        homelab-export-20251207-120040/  # Latest infrastructure export
    scripts/
        crawlers-exporters/  # Infrastructure collection scripts
        fixers/  # Problem-solving scripts
        qol/  # Quality of life improvements
        security/  # Security audit and remediation scripts (NEW)
            verify-service-status.sh
            backup-before-remediation.sh
            rotate-*.sh  # Credential rotation scripts
            QUICK_REFERENCE.md  # Security operations guide
    troubleshooting/
        SECURITY_AUDIT_2025-12-20.md  # Comprehensive security assessment
        loki-stack-bugfix.md  # Loki logging troubleshooting
    CLAUDE.md  # AI assistant guidance (updated)
    SECURITY.md  # Security policy and best practices (NEW)
    INDEX.md  # Navigation index (updated)
    README.md  # Repository overview (updated)
    CLAUDE_STATUS.md  # This file - current infrastructure status
```

---

## Security Status

**Latest Audit**: 2025-12-20
**Total Findings**: 31 (6 CRITICAL, 3 HIGH, 2 MEDIUM, 20 LOW)
**Remediation Status**: Planning Phase - Documentation Complete

**Critical and High-Severity Vulnerabilities**:
- Docker socket exposure (3 containers)
- Proxmox credentials in plaintext
- Database passwords in git repository
- Missing SSL/TLS for internal services
- Weak/default passwords across services
- Containers running as root

**Documentation**:
- Security Policy: `/home/jramos/homelab/SECURITY.md`
- Audit Report: `/home/jramos/homelab/troubleshooting/SECURITY_AUDIT_2025-12-20.md`
- Security Checklist: `/home/jramos/homelab/templates/SECURITY_CHECKLIST.md`
- Script Validation: `/home/jramos/homelab/scripts/security/VALIDATION_REPORT.md`

---

## Current Initiative: n8n RAG Workflow for Homelab Documentation - Q4 2025

### Goal
Build an interactive n8n workflow that implements Retrieval-Augmented Generation (RAG) to query homelab documentation stored in Gitea using local AI (Ollama). This is a learning-focused project to understand RAG architecture, embeddings, vector storage, and LLM integration.

### Phase
Phase 3 Complete - Vector Storage Operational | Moving to Phase 4 - n8n Workflow Development

### Infrastructure Components
- **AI Backend**: Ollama running on Windows 11 PC (192.168.1.81)
  - Hardware: AMD 7900 GRE GPU, i7-12700KF, 32GB RAM @ 4000MHz, 2TB NVMe
  - Installation: Native Windows application (not Docker)
  - Open-WebUI: Running in Docker Desktop on same machine (port 3000)
- **Orchestrator**: n8n workflow automation (CT 113, 192.168.2.113)
- **Data Source**: Gitea repositories (192.168.2.102:3060)
  - Repositories: homelab, truenas
- **Vector Storage**: PostgreSQL 16.11 + pgvector 0.8.1 (operational on CT 113)

### Progress Checklist

**Phase 1: Network & Connectivity Setup**
- [x] Verify Gitea API accessibility (working: http://192.168.2.102:3060/api/v1)
- [x] Verify n8n instance running (CT 113, 192.168.2.113)
- [x] Configure Ollama network binding (set OLLAMA_HOST=0.0.0.0 via environment variables)
- [x] Verify Ollama API accessible from homelab (curl http://192.168.1.81:11434/api/tags)
- [x] Identify available Ollama models (LLMs: deepseek-r1:8.2B, gpt-oss:20.9B, llama3.2:3.2B, phi3:3.8B)
- [x] Pull embedding model (nomic-embed-text - 768 dimensions, 274MB)

**Phase 2: Understanding Embeddings (Learning Phase)**
- [x] Pull sample document from Gitea API
- [x] Send text to Ollama for embedding generation
- [x] Examine vector output (768-dimensional vectors for each text)
- [x] Understand semantic similarity concept (cosine similarity demo: 0.5764 for related topics)
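
For reference, the cosine similarity used in the Phase 2 demo is the dot product of two vectors divided by the product of their lengths. The sketch below is a plain-Python version for illustration; it uses toy 3-dimensional vectors rather than real 768-dimensional embeddings, and it is not the code that produced the 0.5764 score.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (range -1 to 1)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors: the second points in the same direction as the first,
# the third is perpendicular to it.
same_direction = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
perpendicular = cosine_similarity([1.0, 0.0, 0.0], [0.0, 1.0, 0.0])
print(same_direction, perpendicular)
```

Because the score depends only on direction, two texts of very different length can still score as similar if their embeddings point the same way.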

**Phase 3: Vector Storage Implementation** ✅ COMPLETE
- [x] Evaluate PostgreSQL + pgvector (uses existing n8n database)
- [x] Evaluate Qdrant (lightweight Docker deployment)
- [x] Choose storage backend based on learning goals (PostgreSQL + pgvector selected)
- [x] Install pgvector extension on CT 113 (PostgreSQL 16.11, pgvector 0.8.1)
- [x] Create rag_embeddings table with vector(768) column
- [x] Debug and fix vector insertion (corrected string→vector conversion)
- [x] Debug and fix ORDER BY issue (subquery approach working)
- [x] Verify cosine similarity search (working: 0.5765 similarity for related concepts)
- [x] Create production-ready vector_search.py module with insert/search/stats functions

**Phase 4: Build Ingestion Workflow (n8n)** - READY TO START
- [ ] Deploy vector_search.py production module to CT 113
- [ ] Test manual document insertion via CLI
- [ ] Implement text chunking strategy (500 char chunks, 100 char overlap)
- [ ] Create minimal n8n workflow: Manual Trigger → Gitea API → Chunk → Ollama → PostgreSQL
- [ ] Test workflow with single README.md file from homelab repo
- [ ] Scale to process all .md files in homelab repository
- [ ] Add error handling and deduplication logic
- [ ] Schedule automated daily ingestion runs
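
The chunking strategy in the Phase 4 checklist (500-character chunks with a 100-character overlap) could look something like the sketch below. It is a minimal character-based splitter; the function name `chunk_text` and its defaults are assumptions for illustration, and the eventual implementation may well split on Markdown headings or sentence boundaries instead.

```python
def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size chunks whose edges overlap by `overlap` chars."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # each chunk starts this far after the previous one
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text, so the tail
        # is not emitted a second time as a shorter duplicate.
        if start + chunk_size >= len(text):
            break
    return chunks


# A synthetic 1,200-character document stands in for a real README.md.
sample = "".join(chr(97 + i % 26) for i in range(1200))
chunks = chunk_text(sample)
print([len(c) for c in chunks])
```

With the defaults, consecutive chunks start 400 characters apart, so the last 100 characters of one chunk repeat at the start of the next. That repetition keeps a sentence that straddles a boundary retrievable from at least one chunk.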

**Phase 5: Build Query Workflow (n8n)** - NOT STARTED
- [ ] Create workflow: Webhook → User question
- [ ] Generate embedding for user query
- [ ] Implement vector similarity search (threshold >0.5)
- [ ] Retrieve top 3-5 relevant chunks
- [ ] Construct prompt with retrieved context
- [ ] Call Ollama LLM for answer generation (llama3.2 or deepseek-r1)
- [ ] Return formatted response with source references
- [ ] Add webhook endpoint for external integrations
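
The middle steps of the Phase 5 checklist (threshold filter, top-k selection, prompt construction) can be sketched without any network calls. The example below assumes search results arrive as dicts with `file_path` and `chunk_text` keys (column names from the `rag_embeddings` table) plus a `similarity` score; that result shape, the function names, and the prompt wording are all hypothetical.

```python
def select_context(results: list[dict], threshold: float = 0.5, top_k: int = 5) -> list[dict]:
    """Keep results scoring above the threshold, best first, at most top_k."""
    relevant = [r for r in results if r["similarity"] > threshold]
    relevant.sort(key=lambda r: r["similarity"], reverse=True)
    return relevant[:top_k]


def build_prompt(question: str, chunks: list[dict]) -> str:
    """Assemble an LLM prompt that numbers each chunk so answers can cite it."""
    context = "\n\n".join(
        f"[{i}] ({c['file_path']})\n{c['chunk_text']}"
        for i, c in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


# Made-up search results for illustration only.
results = [
    {"file_path": "README.md", "chunk_text": "PBS runs daily backups.", "similarity": 0.62},
    {"file_path": "INDEX.md", "chunk_text": "Navigation index.", "similarity": 0.31},
    {"file_path": "monitoring/README.md", "chunk_text": "Grafana dashboards.", "similarity": 0.81},
]

selected = select_context(results)
prompt = build_prompt("How are backups scheduled?", selected)
print(prompt)
```

Numbering the chunks in the prompt is one simple way to satisfy the "source references" step: the numbers in the model's answer map back to `file_path` values.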

### Context
**RAG Architecture Overview:**
1. **Ingestion Pipeline**: Gitea API → Text Chunking → Ollama Embeddings → Vector Database
2. **Query Pipeline**: User Question → Embedding → Vector Search → Context Retrieval → LLM Generation → Answer

**Phase 3 Achievements (2025-12-25):**
- ✅ PostgreSQL + pgvector fully operational on CT 113
- ✅ Vector search working with 0.5765 similarity for related concepts
- ✅ Production-ready Python module (`vector_search.py`) with insert/search/stats functions
- ✅ Debugged and resolved 2 critical issues:
  1. Embedding storage: Fixed psycopg2 parameter handling (must cast to `::vector(768)` in SQL, not Python)
  2. ORDER BY bug: Subquery approach works, CTE approach fails (use `ORDER BY similarity DESC` instead of vector operation)
|
|
|
|
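The two fixes above can be sketched as follows. This is an illustration only: the table name `documents` and the columns `id`, `content`, and `embedding` are assumptions, and the real `vector_search.py` may be structured differently.

```python
def to_vector_literal(embedding):
    """Format a sequence of numbers as pgvector's text form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# The vector is passed as a plain string and cast in SQL (fix 1).
# Similarity is computed once in a subquery and the outer query sorts on
# that column instead of repeating the vector operation (fix 2).
SEARCH_SQL = """
SELECT id, content, similarity
FROM (
    SELECT id, content, 1 - (embedding <=> %s::vector(768)) AS similarity
    FROM documents  -- hypothetical table name
) AS scored
ORDER BY similarity DESC
LIMIT %s
"""


def search_params(embedding, limit=5):
    """Build the parameter tuple for SEARCH_SQL."""
    if len(embedding) != 768:
        raise ValueError("expected a 768-dimensional embedding")
    return (to_vector_literal(embedding), limit)
```

With an open psycopg2 cursor `cur`, the query would be run as `cur.execute(SEARCH_SQL, search_params(query_embedding))`.
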
**Key Learnings:**
- ✅ Embeddings convert text to 768-dimensional vectors representing semantic meaning
- ✅ Vector databases enable semantic search (meaning-based, not keyword-based)
- ✅ pgvector cosine distance operator (`<=>`) returns distance, not similarity: 0=identical, 2=opposite (similarity = 1 - distance)
- ✅ Similarity scores: >0.7=highly relevant, 0.5-0.7=related, 0.3-0.5=somewhat related, <0.3=unrelated
- ✅ psycopg2 doesn't natively support pgvector - must format vectors as strings and cast in SQL
- ✅ Reusing vector parameters in ORDER BY causes silent failures - use subqueries instead

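To make the distance and similarity relationship concrete, here is a small pure-Python sketch. The function names are assumptions, and the band boundaries are the rule-of-thumb values listed above, not anything defined by pgvector.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Same quantity as pgvector's <=> operator: 0 = identical, 2 = opposite."""
    return 1 - cosine_similarity(a, b)


def relevance_band(similarity):
    """Map a similarity score to the bands used in these notes."""
    if similarity > 0.7:
        return "highly relevant"
    if similarity >= 0.5:
        return "related"
    if similarity >= 0.3:
        return "somewhat related"
    return "unrelated"
```

Under these bands, the 0.5765 score from the Phase 3 test query falls in "related".
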
**Technical Stack Validated:**
- Ollama API (192.168.1.81:11434) ✅ Accessible across subnets
- nomic-embed-text model ✅ 768 dimensions, fast generation
- PostgreSQL 16.11 + pgvector 0.8.1 ✅ Operators working correctly
- Python psycopg2 ✅ With workarounds for vector handling

**Success Metrics - Phase 3:**
- ✅ Successfully query "how to backup VM" and retrieve relevant homelab documentation (0.5765 similarity)
- ✅ Understand each component of the vector storage pipeline
- ✅ Create reusable Python module for n8n integration

**Next Steps - Phase 4:**
- Deploy vector_search.py to CT 113 and test CLI interface
- Create text chunking function (500 char chunks, 100 char overlap)
- Build minimal n8n workflow: Manual Trigger → Gitea API → Chunk → Ollama → PostgreSQL
- Scale to process all .md files in homelab repository
- Add error handling and deduplication logic

**Session Handoff Document:** `/home/jramos/homelab/n8n/SESSION_HANDOFF_PHASE4_READY.md`
**Learning Resources:** Step-by-step lessons with examples, mental models, troubleshooting guide

---

## Previous Initiative: Security Audit Remediation - Q4 2025

### Goal
Remediate 31 security findings identified in comprehensive security audit (2025-12-20), addressing critical vulnerabilities in Docker socket exposure, credential management, and SSL/TLS configuration.

### Phase
Planning - Documentation Complete, Remediation Pending

### Progress Checklist

**Phase 1: Immediate Actions (Week 1) - Est. 30 min downtime**
- [x] Complete security audit (31 findings documented)
- [x] Create remediation scripts (8 scripts validated)
- [x] Document security baseline in SECURITY.md
- [ ] Back up all service configurations (`backup-before-remediation.sh`)
- [ ] Migrate secrets to .env files (ByteStash, Paperless-ngx, Speedtest Tracker)

**Phase 2: Low-Risk Changes (Weeks 2-3) - Est. 2-4 hours downtime**
- [ ] Deploy docker-socket-proxy
- [ ] Rotate Proxmox API credentials (`rotate-pve-credentials.sh`)
- [ ] Rotate database passwords (`rotate-paperless-password.sh`)
- [ ] Rotate JWT secrets (`rotate-bytestash-jwt.sh`)

**Phase 3: High-Risk Changes (Month 2) - Est. 4-8 hours downtime**
- [ ] Migrate Portainer to socket proxy
- [ ] Migrate NPM to socket proxy or remove socket access
- [ ] Remove socket mounts from Speedtest Tracker
- [ ] Implement SSL/TLS for internal services
- [ ] Enable container user namespacing

**Phase 4: Infrastructure Improvements (Quarter 1) - Est. 8-16 hours**
- [ ] Implement network segmentation (VLANs for service tiers)
- [ ] Deploy fail2ban for rate limiting
- [ ] Enable backup encryption (PBS configuration)
- [ ] Container vulnerability scanning pipeline
- [ ] Automated credential rotation system

### Context
Security audit revealed critical infrastructure vulnerabilities requiring systematic remediation. Priority on CRITICAL findings (CVSS 8.5-9.8) to reduce attack surface and prevent credential compromise.

**Risk Management**:
- Phase 1: Zero downtime (configuration changes only)
- Phase 2: Minimal downtime (credential rotation, proxy deployment)
- Phase 3: Moderate downtime (service reconfiguration)
- Phase 4: Planned maintenance windows (infrastructure changes)

**Success Metrics**:
- All CRITICAL findings remediated (6/6)
- All HIGH findings remediated (3/3)
- Secrets removed from git repository
- Docker socket access eliminated or proxied
- SSL/TLS enabled for all external services

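One way to track the "Docker socket access eliminated or proxied" metric is to scan Compose files for direct socket mounts. The sketch below is a hypothetical helper, not one of the eight remediation scripts; it does a plain text match rather than parsing YAML.

```python
def find_socket_mounts(compose_text):
    """Return (line_number, line) pairs that reference the Docker socket.

    Commented-out lines are skipped. This is a text match only, so it will
    also flag the socket proxy's own (expected) mount.
    """
    hits = []
    for number, line in enumerate(compose_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue
        if "/var/run/docker.sock" in stripped:
            hits.append((number, stripped))
    return hits
```

Running it over each service's Compose file before and after Phase 3 would give a simple count of remaining direct mounts.
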
---

## Previous Initiative: Claude Code Tool Inheritance Bug Investigation (2025-12-18)

### Goal
Investigate and document a critical bug in Claude Code CLI where sub-agents with explicit `tools:` declarations receive only a subset of their configured tools, with first and last array elements consistently dropped.

### Phase
COMPLETED - Bug confirmed, comprehensive report generated for Anthropic

### Progress Checklist
- [x] Reproduce bug with scribe agent (confirmed: missing Read and Write)
- [x] Reproduce bug with lab-operator agent (confirmed: missing Bash and Write)
- [x] Test backend-builder agent (working correctly - exception to pattern)
- [x] Test librarian agent (working correctly - no tools: declaration)
- [x] Identify pattern: First and last tools dropped for agents with explicit tools: arrays
- [x] Document impact: Scribe cannot create docs, lab-operator cannot execute commands
- [x] Generate comprehensive bug report for Anthropic with all evidence
- [x] Update CLAUDE_STATUS.md with investigation status
- [ ] Submit bug report to Anthropic via GitHub issues

### Key Findings
**Bug Pattern**: Sub-agents with `tools: [A, B, C, D, E]` receive only `[B, C, D]` at runtime
**Affected**: scribe (no Read/Write), lab-operator (no Bash/Write)
**Unaffected**: backend-builder (exception), librarian (no tools: line)
**Workaround**: Remove `tools:` declarations to grant all tools by default

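The observed pattern can be checked mechanically when comparing an agent's declared tools against what it reports at runtime. The helper names below are hypothetical, and the code only models the observed symptom; it says nothing about Claude Code internals.

```python
def dropped_tools(declared, received):
    """Return declared tools that are missing at runtime, in declaration order."""
    available = set(received)
    return [tool for tool in declared if tool not in available]


def matches_first_last_pattern(declared, received):
    """True when exactly the first and last declared tools are missing."""
    return len(declared) >= 2 and list(received) == list(declared[1:-1])


# Placeholder tool names from the bug pattern above.
declared = ["A", "B", "C", "D", "E"]
received = ["B", "C", "D"]
print(dropped_tools(declared, received))               # ['A', 'E']
print(matches_first_last_pattern(declared, received))  # True
```
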
**Artifacts**:
- Bug report: `/home/jramos/homelab/troubleshooting/ANTHROPIC_BUG_REPORT_TOOL_INHERITANCE.md`
- Original report: `/home/jramos/homelab/troubleshooting/BUG_REPORT.md`
- Test agent IDs: scribe=a32bd54, lab-operator=ad681e8, backend-builder=aba15f6, librarian=a4cfeb7

### Context
Critical workflow disruption: Documentation and infrastructure operations workflows completely broken due to missing tools. This is a Claude Code CLI internal bug, not a user configuration issue.

---

## Previous Initiative: Sub-Agent Architecture Optimization (2025-12-07)

### Goal
Improve the quality and effectiveness of all sub-agent prompt definitions to match best practices identified through comprehensive Opus-powered prompt engineering analysis. Target: bring all sub-agents to the quality standard established by librarian.md (~120-340 lines with comprehensive examples, safety protocols, and decision frameworks).

### Phase
COMPLETED - All sub-agent improvements and validations finished

### Progress Checklist
- [x] Prompt engineering analysis completed (Opus model)
  - Analyzed CLAUDE.md and all 4 sub-agent files
  - Identified 5 critical issues, 12 high-impact improvements
  - Generated comprehensive improvement recommendations
- [x] scribe.md improved (29 → 340 lines)
  - Added 6 usage examples (4 positive, 2 negative redirects)
  - Implemented comprehensive responsibilities section
  - Added 3 complete ASCII diagram templates
  - Included safety protocols and decision frameworks
  - Quality now matches librarian.md standard
- [x] backend-builder.md improved (40 → 291 lines)
  - Added 6 usage examples with clear boundaries
  - Expanded core responsibilities with Ansible, Terraform, Docker Compose, Python, Shell
  - Added technology stack table and validation rules table
  - Included safety protocols for secrets and destructive operations
  - Added handoff protocol for lab-operator deployment
  - Defined clear boundaries (CREATES code, does NOT deploy)
- [x] lab-operator.md improved (37 → 193 lines)
  - Added 6 usage examples with role clarity
  - Expanded domain expertise with specific commands
  - Added command style guide (5-step pattern)
  - Included safety protocols and decision-making framework
  - Added error handling and escalation guidelines
  - Defined clear boundaries (DEPLOYS/OPERATES, does NOT create IaC)
- [x] CLAUDE.md structural fixes
  - Moved YAML frontmatter to line 1 (was at line 89)
  - Fixed trailing pipe character on line 87
  - Completed incomplete sentence about backup strategy
  - Completed incomplete sentence about storage growth
  - Removed redundant "Key Services" reference
  - Expanded status file template with actual structure and recovery instructions
- [x] Final validation and testing
  - librarian: Git status check successful, clear output format
  - scribe: File reading functional (note: reported encoding issue, likely false positive)
  - backend-builder: YAML validation successful, proper syntax checking
  - lab-operator: Directory listing successful, proper command execution
  - All agents demonstrate improved structure and clarity

### Context
**Why It Matters**: Well-designed sub-agent prompts improve task routing accuracy, execution quality, error reduction, and maintainability. The librarian.md agent (143 lines) sets the quality standard; scribe was severely underdeveloped at 29 lines before improvement.

**Next Steps**: Improve backend-builder.md and lab-operator.md using scribe.md as quality template.

---

## Previous Phase: Infrastructure Documentation Complete

### Goal
Comprehensive documentation of monitoring stack and updated infrastructure inventory.

### Phase
Documentation & Maintenance

### Completed Tasks
- [x] Created `/home/jramos/homelab/monitoring/README.md` with comprehensive monitoring documentation
- [x] Updated `CLAUDE_STATUS.md` with current infrastructure state
- [x] Documented 8 VMs, 2 Templates, and 4 LXC containers
- [x] Updated storage statistics (PBS 27.43%, Vault 10.88%, local 15.13%)
- [x] Added monitoring stack architecture and deployment procedures
- [x] Documented new services: monitoring-docker, twingate-connector, n8n
- [x] Referenced latest export: disaster-recovery/homelab-export-20251207-120040

### Remaining Documentation Tasks
- [x] Update INDEX.md with monitoring section and current VM/CT counts
- [x] Update README.md with infrastructure (8 VMs, 2 Templates, 4 LXC)
- [x] Update CLAUDE.md with architecture tables for monitoring and zero-trust
- [x] Update services/README.md with monitoring stack and twingate sections
- [x] Verify all documentation cross-references are accurate
- [ ] Test monitoring stack deployment procedures

---

## Access Information

### Management Interfaces
- **Proxmox UI**: https://192.168.2.200:8006
- **Grafana**: http://192.168.2.114:3000
- **Prometheus**: http://192.168.2.114:9090
- **Nginx Proxy Manager**: http://192.168.2.101:81
- **n8n**: http://192.168.2.113:5678
- **TinyAuth**: https://tinyauth.apophisnetworking.net (internal: http://192.168.2.10:8000)
- **OpenClaw**: https://openclaw.apophisnetworking.net (internal: http://192.168.2.120:18789)

### Key Network Segments
- **Management Network**: 192.168.2.0/24
- **Proxmox Host**: 192.168.2.200
- **Reverse Proxy**: 192.168.2.101 (CT 102)
- **TinyAuth**: 192.168.2.10 (CT 115)
- **n8n**: 192.168.2.113 (CT 113)
- **Monitoring**: 192.168.2.114 (VM 101)
- **OpenClaw**: 192.168.2.120 (VM 120)

---

## Maintenance Schedule

### Automated Tasks
- **Backups**: Proxmox Backup Server - Daily incremental, Weekly full
- **Monitoring Scrapes**: Prometheus - Every 30 seconds
- **Certificate Renewal**: Nginx Proxy Manager - Automatic via Let's Encrypt

### Recommended Manual Tasks
- **Weekly**: Review Grafana dashboards for anomalies
- **Monthly**: Update monitoring stack Docker images
- **Quarterly**: Review backup retention policies
- **Semi-Annual**: Kernel updates on Proxmox host and VMs

---

## Known Issues & Resolutions

### Resolved
- n8n PostgreSQL locale errors (fixed with `fix_n8n_db_c_locale.sh`)
- n8n database permissions (fixed with `fix_n8n_db_permissions.sh`)

### Active Security Vulnerabilities (2025-12-20 Audit)

**CRITICAL Severity:**
1. **Docker Socket Exposure** (CVSS 9.8)
   - Affected: Portainer, Nginx Proxy Manager, Speedtest Tracker
   - Impact: Container escape to root access
   - Remediation: Deploy docker-socket-proxy (Phase 2)

2. **Proxmox Credentials in Plaintext** (CVSS 9.1)
   - Affected: PVE Exporter `.env` and `pve.yml`
   - Impact: Full infrastructure compromise
   - Remediation: Rotate credentials, use API tokens (Phase 2)

3. **Database Passwords in Git** (CVSS 8.5)
   - Affected: Paperless-ngx, ByteStash, Speedtest Tracker
   - Impact: Credential exposure to all repository users
   - Remediation: Migrate to `.env` files, scrub git history (Phase 1)

**HIGH Severity:**
4. **Missing SSL/TLS** (CVSS 7.5)
   - Affected: Internal service communication
   - Impact: Traffic interception, credential sniffing
   - Remediation: Enable HTTPS via NPM or self-signed certs (Phase 3)

5. **Weak/Default Passwords** (CVSS 7.2)
   - Affected: Multiple services
   - Impact: Brute-force attacks, unauthorized access
   - Remediation: Generate strong passwords, implement rotation (Phase 2)

6. **Containers Running as Root** (CVSS 7.0)
   - Affected: Most Docker containers
   - Impact: Privilege escalation if container compromised
   - Remediation: Enable user namespacing, set non-root users (Phase 3)

**Remediation Timeline:** See "Security Audit Remediation - Q4 2025" initiative above

### Active Monitoring
- PVE Exporter SSL verification (set to false for self-signed certificates) - **SECURITY RISK**
- Prometheus retention policies (currently 15 days, may need adjustment)
- Security script container names need verification (3/8 scripts)

### Deferred
- NetBox container offline (on-demand service)
- Development VMs stopped (resource conservation)
- Network segmentation implementation (Phase 4)
- Backup encryption (Phase 4)

---

## Version History

- **v2.1.0** (2025-12-07): Added monitoring stack, twingate connector, updated infrastructure counts
- **v2.0.0** (2025-12-02): Repository reorganization, services migration from GitLab
- **v1.0.0** (2025-11-29): Initial infrastructure documentation

---

**Maintained by**: jramos
**Repository**: Homelab Infrastructure Configuration
**Platform**: Proxmox VE 8.4.0
**Infrastructure Scale**: 10 VMs, 2 Templates, 5 Containers
**Current Status**: Operational - OpenClaw Deployment In Progress