Homelab Infrastructure Status
Last Updated: 2026-02-03
Export Reference: disaster-recovery/homelab-export-20251211-144345
Current Session: OpenClaw Deployment - VM 120
Quick Resume (Current Session Context)
Where We Are: OpenClaw deployed and healthy on VM 120. Container running with full security hardening. Backups configured. Manual steps remain for NPM proxy host, Twingate resource, and Prometheus config on VM 101.
Completed:
- Config files created (services/openclaw/)
- VM 120 created and hardened (UFW, fail2ban, node-exporter, openclaw user)
- OpenClaw container deployed and healthy (v2026.2.1)
- Security verified (cap_drop ALL, non-root, read-only FS, no docker.sock)
- Prometheus scrape target added to repo copy
- PBS backup job created (daily 02:00, snapshot, zstd)
- Application backup script + weekly cron configured
- Documentation updated (README, services/README, CLAUDE_STATUS, INDEX)
- node_exporter installed and serving metrics on 192.168.2.120:9100
Manual Steps Remaining:
- NPM: Create proxy host for openclaw.apophisnetworking.net -> 192.168.2.120:18789 (WebSocket support, SSL, TinyAuth)
- Twingate: Add resource for 192.168.2.120 ports 18789/18790/1455
- VM 101: Deploy updated prometheus.yml via Proxmox web console (SSH not configured)
- Configure at least one LLM provider API key in /opt/openclaw/.env
Current Infrastructure Snapshot
Proxmox Environment
- Node: serviceslab
- Version: Proxmox VE 8.4.0
- Management IP: 192.168.2.100
- Architecture: Single-node cluster
- Total Resources: 10 VMs, 2 Templates, 5 LXC Containers
Virtual Machines (QEMU/KVM) - 10 VMs
| VM ID | Name | IP Address | Status | Purpose |
|---|---|---|---|---|
| 100 | docker-hub | 192.168.2.102 | Running | Container registry/Docker hub mirror |
| 101 | monitoring-docker | 192.168.2.114 | Running | Monitoring stack (Grafana/Prometheus/PVE Exporter) |
| 105 | dev | - | Stopped | General-purpose development workstation |
| 106 | Ansible-Control | 192.168.2.XXX | Running | IaC orchestration, configuration management |
| 108 | CML | - | Stopped | Cisco Modeling Labs - network simulation |
| 109 | web-server-01 | 192.168.2.XXX | Running | Web application server (clustered) |
| 110 | web-server-02 | 192.168.2.XXX | Running | Load-balanced pair with web-server-01 |
| 111 | db-server-01 | 192.168.2.XXX | Running | Backend database server |
| 114 | haos | 192.168.2.XXX | Running | Home Assistant OS - smart home automation platform |
| 120 | openclaw | 192.168.2.120 | Running | OpenClaw AI chatbot gateway |
Recent Changes:
- Added VM 120 (openclaw) for multi-platform AI chatbot gateway (2026-02-03)
- Added VM 101 (monitoring-docker) for dedicated monitoring infrastructure
- Removed VM 101 (gitlab) - service decommissioned
VM Templates - 2 Templates
| Template ID | Name | Purpose |
|---|---|---|
| 104 | ubuntu-dev | Ubuntu development environment template for cloning |
| 107 | ubuntu-docker | Ubuntu Docker host template for rapid deployment |
Note: Templates are immutable base images used for cloning new VMs, not running workloads. They provide standardized configurations for consistent infrastructure provisioning.
Containers (LXC) - 5 Containers
| CT ID | Name | IP Address | Status | Purpose |
|---|---|---|---|---|
| 102 | nginx | 192.168.2.101 | Running | Reverse proxy/load balancer & NPM |
| 103 | netbox | 192.168.2.XXX | Running | Network documentation/IPAM |
| 112 | twingate-connector | 192.168.2.XXX | Running | Zero-trust network access connector |
| 113 | n8n | 192.168.2.113 | Running | Workflow automation platform |
| 115 | tinyauth | 192.168.2.10 | Running | SSO authentication layer for NetBox |
Recent Changes:
- Added CT 115 (tinyauth) for SSO authentication integration with NetBox
- Added CT 112 (twingate-connector) for zero-trust network security
- Added CT 113 (n8n) for workflow automation
- Removed CT 112 (Anytype) - replaced by n8n
Storage Architecture
| Storage Pool | Type | Total | Used | % Used | Purpose |
|---|---|---|---|---|---|
| local | Directory | - | - | 19.11% | System files, ISOs, templates |
| local-lvm | LVM-Thin | - | - | 0.01% | VM disk images (thin provisioned) |
| Vault | NFS/Directory | - | - | 12.13% | Secure storage for sensitive data |
| PBS-Backups | PBS | - | - | 28.27% | Automated backup repository |
| iso-share | NFS/CIFS | - | - | 1.45% | Installation media library |
| localnetwork | Network Share | - | - | N/A | Shared resources across infrastructure |
Capacity Notes:
- PBS-Backups utilization increased to 28.27% (healthy retention)
- Vault utilization increased to 12.13% (data growth monitored)
- local storage at 19.11% (system overhead within normal range)
Key Services & Stacks
Monitoring & Observability (NEW)
VM 101 - monitoring-docker (192.168.2.114)
- Grafana: Port 3000 - Visualization and dashboards
- Prometheus: Port 9090 - Metrics collection and time-series database
- PVE Exporter: Port 9221 - Proxmox VE metrics exporter
- Documentation: /home/jramos/homelab/monitoring/README.md
- Status: Fully operational
Network Security (NEW)
CT 112 - twingate-connector
- Purpose: Zero-trust network access
- Type: Lightweight connector
- Status: Running
- Integration: Connects homelab to Twingate network
Automation & Integration
CT 113 - n8n (192.168.2.113)
- Purpose: Workflow automation platform
- Technology: n8n.io
- Database: PostgreSQL 16.11 (upgraded from 15+) with pgvector 0.8.1
- Features: API integration, scheduled workflows, webhook triggers
- Documentation: /home/jramos/homelab/services/README.md#n8n-workflow-automation
- Status: Operational (resolved database locale issues)
Authentication & SSO
CT 115 - tinyauth (192.168.2.10)
- Purpose: Lightweight SSO authentication layer
- Technology: TinyAuth v4 (Docker container)
- Port: 8000
- Domain: tinyauth.apophisnetworking.net
- Integration: Authentication gateway for NetBox via Nginx Proxy Manager
- Security: Bcrypt-hashed credentials, HTTPS enforcement
- Documentation: /home/jramos/homelab/services/tinyauth/README.md
- Status: Operational
AI Chatbot Gateway
VM 120 - openclaw (192.168.2.120)
- Purpose: Multi-platform AI chatbot gateway
- Technology: OpenClaw (Docker container)
- Ports: 18789 (Gateway WS+UI), 18790 (Bridge), 1455 (OAuth)
- Domain: openclaw.apophisnetworking.net
- LLM Providers: Anthropic, OpenAI, Ollama
- Messaging: Discord, Telegram, Slack, WhatsApp
- Security: CVE-2026-25253 patched (v2026.2.1), cap_drop ALL, non-root, read-only FS
- Documentation: /home/jramos/homelab/services/openclaw/README.md
- Status: Operational - Container healthy
Infrastructure Documentation
CT 103 - netbox
- Purpose: Network documentation and IPAM
- Status: Running (activated 2025-12-11; previously stopped for on-demand use)
- Function: Infrastructure source of truth
Reverse Proxy & Load Balancing
CT 102 - nginx (192.168.2.101)
- Purpose: Nginx Proxy Manager
- Ports: 80, 81, 443
- Function: SSL termination, reverse proxy, certificate management
- Upstream Services: All web-facing applications
Three-Tier Application Stack
Web Tier:
- VM 109 (web-server-01) - Primary web server
- VM 110 (web-server-02) - Load-balanced pair
Database Tier:
- VM 111 (db-server-01) - Backend database
Proxy Tier:
- CT 102 (nginx) - Load balancer and SSL termination
Development & Automation
VM 106 - Ansible-Control
- Purpose: Infrastructure as Code orchestration
- Tools: Ansible, Terraform/OpenTofu (potential)
- Status: Running
Container Registry
VM 100 - docker-hub
- Purpose: Local Docker registry and hub mirror
- Function: Caching container images for faster deployments
- Status: Running
Network Simulation
VM 108 - CML
- Purpose: Cisco Modeling Labs
- Function: Network topology testing and simulation
- Status: Stopped (resource-intensive, on-demand use)
Architecture Patterns
Monitoring & Observability (NEW)
The infrastructure now implements a comprehensive monitoring stack following industry best practices:
- Metrics Collection: Prometheus scraping Proxmox metrics via PVE Exporter
- Visualization: Grafana providing real-time dashboards and alerting
- Isolation: Dedicated VM for monitoring services (fault isolation)
- Integration: Ready for AlertManager, additional exporters, and integrations
Design Decision: VM-based deployment provides kernel-level isolation and prevents resource contention with critical infrastructure services.
Zero-Trust Security (NEW)
Implementation of zero-trust network access principles:
- Twingate Connector: Lightweight connector providing secure access without VPNs
- Container Deployment: LXC container for minimal resource overhead
- Network Segmentation: Secure access to homelab from external networks
Design Decision: LXC container chosen for quick provisioning and low resource consumption.
Automation-First Approach
Workflow automation and infrastructure orchestration:
- n8n Platform: Visual workflow builder for API integrations
- Scheduled Tasks: Automated backup checks, monitoring alerts, reports
- Integration Hub: Connects monitoring, documentation, and operational tools
Design Decision: PostgreSQL backend ensures data persistence and supports complex workflows.
Tiered Application Architecture
Classic three-tier design for production-like environments:
- Presentation Tier: Paired web servers (109, 110) behind load balancer
- Business Logic: Application processing on web tier
- Data Tier: Dedicated database server (111) with backup strategy
Design Decision: Separation of concerns, scalability testing, high availability patterns.
Selective Containerization Strategy
Hybrid approach balancing performance and resource efficiency:
- LXC Containers: Lightweight services (nginx, netbox, twingate, n8n)
- Full VMs: Complex applications, kernel dependencies, heavy workloads
- Rationale: LXC for ~10x lower overhead, VMs for isolation and compatibility
Recent Infrastructure Changes
2026-02-03: OpenClaw AI Chatbot Gateway Deployment (In Progress)
Service: VM 120 - OpenClaw multi-platform AI chatbot gateway
Purpose: Bridge messaging platforms (Discord, Telegram, Slack, WhatsApp) with LLM providers (Anthropic, OpenAI, Ollama) through a unified gateway.
Specifications:
- VM: 120 (cloned from template 107, ubuntu-docker)
- IP: 192.168.2.120
- Resources: 4 vCPUs, 16GB RAM, 50GB disk on Vault (ZFS)
- Ports: 18789 (Gateway WS+UI), 18790 (Bridge), 1455 (OAuth)
- Domain: openclaw.apophisnetworking.net
- Image: ghcr.io/openclaw/openclaw:2026.2.1
Security Hardening:
- Version >= 2026.2.1 (patches CVE-2026-25253, CVSS 8.8 1-click RCE)
- All ports bound to 127.0.0.1 (reverse proxy required)
- Docker: cap_drop ALL, no-new-privileges, read-only filesystem, non-root user (1001:1001)
- UFW: deny-all + whitelist 192.168.2.0/24 + 192.168.1.91 (desktop PC)
- fail2ban on SSH (3 retries), unattended-upgrades
- Prometheus node_exporter at port 9100
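The Docker items in the hardening list above can be checked mechanically. The following is a minimal sketch, assuming the Compose service definition is already available as a plain dict with short-syntax port strings; the function name `audit_service`, the sample dict, and the `./data:/data` volume are hypothetical and not taken from the repository.

```python
def audit_service(service: dict) -> list[str]:
    """Return a list of hardening problems found in one Compose service dict."""
    problems = []

    # cap_drop must include ALL
    if "ALL" not in [str(c).upper() for c in service.get("cap_drop", [])]:
        problems.append("cap_drop does not include ALL")

    # no-new-privileges must be enabled via security_opt (":" or "=" form)
    opts = [str(o).replace("=", ":") for o in service.get("security_opt", [])]
    if "no-new-privileges:true" not in opts:
        problems.append("no-new-privileges is not enabled")

    # the root filesystem must be read-only
    if service.get("read_only") is not True:
        problems.append("root filesystem is not read-only")

    # the UID part of "uid:gid" must be set and must not be root
    uid = str(service.get("user", "")).split(":")[0]
    if uid in ("", "0", "root"):
        problems.append("container runs as root")

    # short-syntax ports "host_ip:host_port:container_port" must bind to loopback
    for port in service.get("ports", []):
        if not str(port).startswith("127.0.0.1:"):
            problems.append(f"port {port} is not bound to 127.0.0.1")

    # the Docker socket must not be mounted
    for volume in service.get("volumes", []):
        if "docker.sock" in str(volume):
            problems.append("docker.sock is mounted")

    return problems


# Hypothetical service definition mirroring the hardening list above.
hardened = {
    "cap_drop": ["ALL"],
    "security_opt": ["no-new-privileges:true"],
    "read_only": True,
    "user": "1001:1001",
    "ports": ["127.0.0.1:18789:18789", "127.0.0.1:18790:18790", "127.0.0.1:1455:1455"],
    "volumes": ["./data:/data"],
}
print(audit_service(hardened))  # prints []
```

An empty list means every check passed. Each string in a non-empty list names one item that deviates from the hardening baseline.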
Completed Steps:
- Docker Compose configuration files created
- Security hardening overlay (docker-compose.override.yml)
- Environment variable template (.env.example)
- Prometheus scrape target added
- Documentation created (README, services/README, CLAUDE_STATUS, INDEX)
- VM 120 Creation & SSH Setup
- OS Hardening (UFW, user creation)
Pending Steps:
- NPM reverse proxy configuration (manual - web UI)
- Twingate resource creation (manual - admin console)
- Prometheus config on VM 101 (manual - no SSH access)
- Configure LLM provider API key in .env
Status: Container healthy - Manual network integration remaining
2025-12-20: Comprehensive Security Audit Completed
Activity: Complete infrastructure security assessment and remediation planning
Audit Scope:
- All Docker Compose services (Portainer, NPM, Paperless-ngx, ByteStash, Speedtest Tracker, FileBrowser)
- Proxmox VE infrastructure and API access
- Network security and segmentation
- Credential management and storage
- SSL/TLS configuration
- Container security and runtime configuration
Findings Summary:
- CRITICAL (6): Docker socket exposure, hardcoded credentials, database passwords in git
- HIGH (3): Missing SSL/TLS, weak passwords, containers running as root
- MEDIUM (2): SSL verification disabled, missing authentication
- LOW (20): Documentation gaps, monitoring improvements, backup encryption
Deliverables:
- Security Policy (SECURITY.md): 864 lines - Comprehensive security best practices
- Audit Report (troubleshooting/SECURITY_AUDIT_2025-12-20.md): 2,350 lines - Detailed findings and remediation plan
- Security Checklist (templates/SECURITY_CHECKLIST.md): 750 lines - Pre-deployment validation template
- Validation Report (scripts/security/VALIDATION_REPORT.md): 2,092 lines - Script safety assessment
- Container Fixes (scripts/security/CONTAINER_NAME_FIXES.md): 621 lines - Container name verification
- Security Scripts (8 total):
  - verify-service-status.sh - Service health checker
  - backup-before-remediation.sh - Comprehensive backup utility
  - rotate-pve-credentials.sh - Proxmox credential rotation
  - rotate-paperless-password.sh - Database password rotation
  - rotate-bytestash-jwt.sh - JWT secret rotation
  - rotate-logward-credentials.sh - Multi-service credential rotation
  - docker-socket-proxy/docker-compose.yml - Security proxy deployment
  - portainer/docker-compose.socket-proxy.yml - Portainer migration config
Script Validation:
- Ready for execution: 5/8 scripts (verify-service-status.sh, rotate-pve-credentials.sh, rotate-bytestash-jwt.sh, backup-before-remediation.sh, docker-socket-proxy)
- Needs container name fixes: 3/8 scripts (see CONTAINER_NAME_FIXES.md)
4-Phase Remediation Roadmap:
- Phase 1 (Week 1): Immediate actions - Backups, secrets migration
- Phase 2 (Weeks 2-3): Low-risk changes - Socket proxy, credential rotation
- Phase 3 (Month 2): High-risk changes - Service migrations, SSL/TLS
- Phase 4 (Quarter 1): Infrastructure - Network segmentation, scanning pipelines
Estimated Timeline:
- Total downtime: 6-13 minutes (sequential script execution)
- Full remediation: 8-16 weeks
Risk Assessment:
- Current risk: HIGH - Multiple CRITICAL vulnerabilities active
- Post-Phase 1 risk: MEDIUM - Credential exposure mitigated
- Post-Phase 3 risk: LOW - All CRITICAL/HIGH findings remediated
- Post-Phase 4 risk: VERY LOW - Defense-in-depth implemented
Status: Documentation complete, awaiting remediation execution approval
2025-12-18: TinyAuth SSO Deployment
Service Deployed: CT 115 - TinyAuth authentication layer
Purpose: Centralized SSO authentication for NetBox and future homelab services
Specifications:
- Container: CT 115 (LXC with Docker)
- IP Address: 192.168.2.10
- Domain: tinyauth.apophisnetworking.net
- Port: 8000 (external), 3000 (internal)
- Docker Image: ghcr.io/steveiliop56/tinyauth:v4
- Resource Usage: ~50-100 MB memory, <1% CPU
Integration Architecture:
- Internet → Nginx Proxy Manager (CT 102) → TinyAuth (CT 115) → NetBox (CT 103)
- NPM uses the auth_request directive to validate credentials via TinyAuth
- Bcrypt-hashed password storage for security
- HTTPS enforcement via NPM SSL termination
Issues Resolved During Deployment:
- 500 Internal Server Error: Fixed Nginx advanced config syntax
- IP addresses not allowed: Changed APP_URL from IP to domain
- Port mapping: Corrected Docker port mapping from 8000:8000 to 8000:3000
- Invalid password: Implemented bcrypt hash requirement for TinyAuth v4
Integration Impact:
- NetBox now protected by centralized authentication
- Foundation for extending SSO to other services (Grafana, Proxmox UI future candidates)
- Authentication logs available for security auditing
Documentation: Complete guide at /home/jramos/homelab/services/tinyauth/README.md
Status: ✅ Operational - Successfully authenticating NetBox access
2025-12-11: Loki-Stack Monitoring Fully Operational
Issue Resolved: Centralized logging pipeline now receiving syslog from UniFi router
Root Cause: rsyslog filter in /etc/rsyslog.d/unifi-router.conf was configured for wrong source IP (192.168.1.1 instead of 192.168.2.1)
Fix Applied: Updated rsyslog filter to match VLAN 2 gateway IP (192.168.2.1)
Status: ✅ Complete - Logs flowing UniFi → rsyslog → Promtail → Loki → Grafana
Services Affected:
- VM 101 (monitoring-docker): rsyslog configuration updated
- Loki-stack: All components operational
- Grafana: Dashboards receiving real-time syslog data
Technical Details: See troubleshooting/loki-stack-bugfix.md for complete 5-phase troubleshooting history
2025-12-11: Infrastructure Expansion & System Updates
Proxmox VE Platform Upgrade
- Upgraded: Proxmox VE 8.3.3 → 8.4.0
- Kernel: 6.8.12-8-pve
- pve-manager: 8.4.14
- Impact: Enhanced performance, security updates, bug fixes
- Status: ✅ Complete - All VMs and containers operating normally
New VM 114: Home Assistant OS Deployment
- Service: haos (Home Assistant Operating System)
- Purpose: Smart home automation and integration platform
- Specifications:
  - Memory: 4 GB (87% utilized)
  - CPU: 2 vCPUs
  - Boot Disk: 50 GB
- Status: Running (~3 days uptime)
- Rationale: Centralized home automation hub for IoT device management
- Integration: Will integrate with monitoring stack for infrastructure metrics
CT 103: NetBox IPAM Activated
- Service: netbox (Network Documentation & IPAM)
- Status Change: Stopped → Running
- Uptime: ~3.1 days
- Resource Usage: 1.28 GB / 2 GB memory (64%)
- Purpose: Active network documentation and IP address management
- Rationale: Required for ongoing infrastructure expansion planning
Storage Utilization Trends
- PBS-Backups: 27.43% → 28.27% (+0.84%) - Normal backup retention growth
- Vault (ZFS): 10.88% → 12.13% (+1.25%) - Data accumulation monitored
- local: 15.13% → 19.11% (+3.98%) - New VM deployment and system updates
- iso-share: 1.4% → 1.45% (+0.05%) - Minimal change
- local-lvm: 0.0% → 0.01% (+0.01%) - Thin provisioned storage baseline
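The changes listed above are differences in percentage points (current utilization minus previous utilization), not relative growth. A small sketch that recomputes them from the figures in the list; the helper name `deltas` is illustrative only.

```python
# (pool, previous %, current %) as listed above
readings = [
    ("PBS-Backups", 27.43, 28.27),
    ("Vault", 10.88, 12.13),
    ("local", 15.13, 19.11),
    ("iso-share", 1.40, 1.45),
    ("local-lvm", 0.00, 0.01),
]


def deltas(rows):
    """Return {pool: change in percentage points}, rounded to 2 decimals."""
    return {name: round(after - before, 2) for name, before, after in rows}


changes = deltas(readings)
fastest = max(changes, key=changes.get)  # pool with the largest increase
print(fastest, changes[fastest])  # local 3.98
```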
2025-12-25: RAG Vector Search - Phase 3 Complete
Activity: Implemented and debugged production-ready vector search system for AI-powered documentation retrieval
Deliverables:
- Production Module (n8n/vector_search.py): Complete API for semantic search
  - search_similar_documents() - Query with natural language
  - insert_document() - Add documents with embeddings
  - get_stats() - Database statistics
  - delete_by_repo() - Bulk cleanup
  - CLI interface for testing and manual operations
- Documentation Suite:
  - SESSION_HANDOFF_PHASE4_READY.md (17KB) - Comprehensive learning guide for next session
  - PHASE3_COMPLETE.md (12KB) - Complete debugging summary and deployment guide
  - VECTOR_SEARCH_DEBUG.md (4.7KB) - Technical root cause analysis
  - VECTOR_SEARCH_COMPARISON.md (2.5KB) - Before/after code comparison
- Diagnostic Scripts (8 total):
  - Embedding storage repair, parameter binding tests, SQL validation
  - All scripts validated and preserved for reference
Technical Achievement:
- PostgreSQL 16.11 + pgvector 0.8.1 fully operational on CT 113
- Vector similarity search returning accurate scores (0.5765 for related concepts)
- Resolved 2 critical bugs:
- psycopg2 parameter handling for pgvector types (must cast in SQL, not Python)
- ORDER BY with vector operations (subquery pattern required)
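Both fixes can be illustrated without a database connection. This is a sketch only, assuming the rag_embeddings columns described in this entry and psycopg2-style `%s` placeholders; the helper names `to_vector_literal` and `build_search` are hypothetical, and the real vector_search.py may be structured differently.

```python
def to_vector_literal(embedding):
    """Format a sequence of floats as a pgvector text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# Fix 1: the vector is passed as a plain string and cast in SQL (%s::vector(768)).
# Fix 2: the distance is computed once in a subquery; the outer query sorts on
# the resulting similarity column instead of repeating the vector expression.
SEARCH_SQL = """
SELECT file_path, chunk_text, similarity
FROM (
    SELECT file_path,
           chunk_text,
           1 - (embedding <=> %s::vector(768)) AS similarity
    FROM rag_embeddings
) AS scored
ORDER BY similarity DESC
LIMIT %s
"""


def build_search(embedding, limit=5):
    """Return (sql, params) suitable for cursor.execute(sql, params)."""
    if len(embedding) != 768:
        raise ValueError("expected a 768-dimensional embedding")
    return SEARCH_SQL, (to_vector_literal(embedding), limit)


sql, params = build_search([0.0] * 768, limit=3)
print(to_vector_literal([0.1, 0.2]))  # [0.1,0.2]
```

Because the vector appears only once in the statement, the parameter tuple stays at two values (the literal and the limit).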
Validation Results:
- Query: "How do I create snapshots of virtual machines?"
- Result: 0.5765 similarity to backup documentation
- Interpretation: Correctly identifies semantic relationship between "snapshots" and "backups"
Infrastructure:
- Database: n8n_db on CT 113
- Table: rag_embeddings (id, source_repo, file_path, chunk_text, embedding vector(768), metadata jsonb)
- Embedding API: Ollama at 192.168.1.81:11434 (nomic-embed-text, 768 dimensions)
- Storage overhead: ~3KB per vector, ~5KB per document total
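The ~3KB per-vector figure follows from pgvector's documented storage cost of 4 bytes per dimension plus 8 bytes of overhead per vector. A short worked calculation; the 10,000-row count at the end is a hypothetical illustration, not a measurement from CT 113.

```python
DIMENSIONS = 768       # nomic-embed-text output size
BYTES_PER_FLOAT = 4    # pgvector stores single-precision floats
HEADER_BYTES = 8       # per-vector overhead documented by pgvector

vector_bytes = DIMENSIONS * BYTES_PER_FLOAT + HEADER_BYTES
print(vector_bytes)                   # 3080
print(round(vector_bytes / 1024, 2))  # 3.01


def estimated_table_bytes(rows, per_row_bytes=5 * 1024):
    """Rough table size using the ~5KB-per-row figure quoted above."""
    return rows * per_row_bytes


# Hypothetical corpus size, for scale only.
print(estimated_table_bytes(10_000))  # 51200000
```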
Status: ✅ Phase 3 Complete | Phase 4 Ready to Start
Next Steps: Build n8n ingestion workflow to load homelab documentation from Gitea
2025-12-07: Infrastructure Documentation & Monitoring Stack
Additions
- VM 101 (monitoring-docker): New dedicated monitoring infrastructure
  - Grafana for visualization
  - Prometheus for metrics collection
  - PVE Exporter for Proxmox integration
  - IP: 192.168.2.114
- CT 112 (twingate-connector): Zero-trust network security
  - Lightweight connector
  - Secure remote access without VPN
- CT 113 (n8n): Workflow automation platform
  - PostgreSQL 16.11 backend (upgraded from 15+)
  - pgvector 0.8.1 extension for vector search
  - IP: 192.168.2.113
  - Resolved database locale issues
Modifications
- Storage utilization updated across all pools
- PBS-Backups now at 27.43% (increased retention)
- Vault optimized to 10.88% (reduced usage)
Removals
- VM 101 (gitlab): Decommissioned (previously at this ID)
- CT 112 (Anytype): Replaced by n8n for better integration
Documentation Updates
- Created comprehensive monitoring stack documentation
- Updated all infrastructure tables with current VMs/CTs
- Added architecture patterns for observability and zero-trust
- Updated storage statistics
- Referenced latest export: disaster-recovery/homelab-export-20251207-120040
Repository Structure
homelab/
    n8n/ # RAG Vector Search Implementation (NEW)
        vector_search.py # Production module for vector operations
        SESSION_HANDOFF_PHASE4_READY.md # Learning guide for next session
        PHASE3_COMPLETE.md # Phase 3 debugging and achievements summary
        fix_embedding_storage.py # Diagnostic script (embedding repair)
        test_direct_sql.py # Diagnostic script (query testing)
        test_vector_search_working.py # Validated working implementation
        test_parameter_binding.py # Diagnostic script (psycopg2 debugging)
        test_pgvector_direct.sql # Raw SQL tests for pgvector
        VECTOR_SEARCH_DEBUG.md # Technical debugging documentation
        VECTOR_SEARCH_COMPARISON.md # Before/after code comparison
        README_VECTOR_SEARCH.md # Comprehensive setup guide
    monitoring/ # Monitoring stack configurations
        README.md # Comprehensive monitoring documentation
        grafana/
            docker-compose.yml
        prometheus/
            docker-compose.yml
            prometheus.yml
        pve-exporter/
            docker-compose.yml
            pve.yml
            .env
    services/ # Docker Compose service configurations
        n8n/ # n8n workflow automation
        netbox/ # Network documentation & IPAM
        openclaw/ # OpenClaw AI chatbot gateway (VM 120)
        tinyauth/ # SSO authentication layer
        README.md # Services overview (updated)
    disaster-recovery/
        homelab-export-20251207-120040/ # Latest infrastructure export
    scripts/
        crawlers-exporters/ # Infrastructure collection scripts
        fixers/ # Problem-solving scripts
        qol/ # Quality of life improvements
        security/ # Security audit and remediation scripts (NEW)
            verify-service-status.sh
            backup-before-remediation.sh
            rotate-*.sh # Credential rotation scripts
            QUICK_REFERENCE.md # Security operations guide
    troubleshooting/
        SECURITY_AUDIT_2025-12-20.md # Comprehensive security assessment
        loki-stack-bugfix.md # Loki logging troubleshooting
    CLAUDE.md # AI assistant guidance (updated)
    SECURITY.md # Security policy and best practices (NEW)
    INDEX.md # Navigation index (updated)
    README.md # Repository overview (updated)
    CLAUDE_STATUS.md # This file - current infrastructure status
Security Status
Latest Audit: 2025-12-20
Total Findings: 31 (6 CRITICAL, 3 HIGH, 2 MEDIUM, 20 LOW)
Remediation Status: Planning Phase - Documentation Complete
Critical and High Findings:
- Docker socket exposure (3 containers)
- Proxmox credentials in plaintext
- Database passwords in git repository
- Missing SSL/TLS for internal services
- Weak/default passwords across services
- Containers running as root
Documentation:
- Security Policy: /home/jramos/homelab/SECURITY.md
- Audit Report: /home/jramos/homelab/troubleshooting/SECURITY_AUDIT_2025-12-20.md
- Security Checklist: /home/jramos/homelab/templates/SECURITY_CHECKLIST.md
- Script Validation: /home/jramos/homelab/scripts/security/VALIDATION_REPORT.md
Current Initiative: n8n RAG Workflow for Homelab Documentation - Q4 2025
Goal
Build an interactive n8n workflow that implements Retrieval-Augmented Generation (RAG) to query homelab documentation stored in Gitea using local AI (Ollama). This is a learning-focused project to understand RAG architecture, embeddings, vector storage, and LLM integration.
Phase
Phase 3 Complete - Vector Storage Operational | Moving to Phase 4 - n8n Workflow Development
Infrastructure Components
- AI Backend: Ollama running on Windows 11 PC (192.168.1.81)
  - Hardware: AMD 7900 GRE GPU, i7-12700KF, 32GB RAM @ 4000MHz, 2TB NVMe
  - Installation: Native Windows application (not Docker)
  - Open-WebUI: Running in Docker Desktop on same machine (port 3000)
- Orchestrator: n8n workflow automation (CT 113, 192.168.2.113)
- Data Source: Gitea repositories (192.168.2.102:3060)
  - Repositories: homelab, truenas
- Vector Storage: PostgreSQL 16.11 + pgvector 0.8.1 (operational on CT 113)
Progress Checklist
Phase 1: Network & Connectivity Setup
- Verify Gitea API accessibility (working: http://192.168.2.102:3060/api/v1)
- Verify n8n instance running (CT 113, 192.168.2.113)
- Configure Ollama network binding (set OLLAMA_HOST=0.0.0.0 via environment variables)
- Verify Ollama API accessible from homelab (curl http://192.168.1.81:11434/api/tags)
- Identify available Ollama models (LLMs: deepseek-r1:8.2B, gpt-oss:20.9B, llama3.2:3.2B, phi3:3.8B)
- Pull embedding model (nomic-embed-text - 768 dimensions, 274MB)
Phase 2: Understanding Embeddings (Learning Phase)
- Pull sample document from Gitea API
- Send text to Ollama for embedding generation
- Examine vector output (768-dimensional vectors for each text)
- Understand semantic similarity concept (cosine similarity demo: 0.5764 for related topics)
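The embedding step in this phase amounts to one HTTP call. The sketch below assumes the installed Ollama version serves the `/api/embeddings` endpoint with a `model` and `prompt` JSON body; the function names are hypothetical, and the request is only built here, not sent, so the example runs without network access.

```python
import json
import urllib.request

OLLAMA_URL = "http://192.168.1.81:11434"  # Ollama host from the checklist above


def build_embedding_request(text, model="nomic-embed-text"):
    """Build (but do not send) a POST request for Ollama's embeddings endpoint."""
    body = json.dumps({"model": model, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_embedding(text, timeout=30):
    """Send the request and return the embedding list from the JSON response."""
    request = build_embedding_request(text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["embedding"]


req = build_embedding_request("How do I back up a VM?")
print(req.full_url)  # http://192.168.1.81:11434/api/embeddings
```

Calling `fetch_embedding("some text")` from a host that can reach the Ollama machine should return a list of 768 floats for nomic-embed-text.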
Phase 3: Vector Storage Implementation ✅ COMPLETE
- Evaluate PostgreSQL + pgvector (uses existing n8n database)
- Evaluate Qdrant (lightweight Docker deployment)
- Choose storage backend based on learning goals (PostgreSQL + pgvector selected)
- Install pgvector extension on CT 113 (PostgreSQL 16.11, pgvector 0.8.1)
- Create rag_embeddings table with vector(768) column
- Debug and fix vector insertion (corrected string→vector conversion)
- Debug and fix ORDER BY issue (subquery approach working)
- Verify cosine similarity search (working: 0.5765 similarity for related concepts)
- Create production-ready vector_search.py module with insert/search/stats functions
Phase 4: Build Ingestion Workflow (n8n) - READY TO START
- Deploy vector_search.py production module to CT 113
- Test manual document insertion via CLI
- Implement text chunking strategy (500 char chunks, 100 char overlap)
- Create minimal n8n workflow: Manual Trigger → Gitea API → Chunk → Ollama → PostgreSQL
- Test workflow with single README.md file from homelab repo
- Scale to process all .md files in homelab repository
- Add error handling and deduplication logic
- Schedule automated daily ingestion runs
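The chunking step in the checklist above (500-character chunks with 100-character overlap) can be sketched as a sliding window. This is one possible implementation, not the project's; `chunk_text` is a hypothetical name, and it splits on raw character offsets without regard to word or Markdown boundaries.

```python
def chunk_text(text, size=500, overlap=100):
    """Split text into windows of `size` chars, each sharing `overlap` chars with the previous one."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(str(i % 10) for i in range(1200))
chunks = chunk_text(sample)
print([len(c) for c in chunks])  # [500, 500, 400]
```

With these defaults the window advances 400 characters at a time, so the last 100 characters of each chunk reappear at the start of the next one.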
Phase 5: Build Query Workflow (n8n) - NOT STARTED
- Create workflow: Webhook → User question
- Generate embedding for user query
- Implement vector similarity search (threshold >0.5)
- Retrieve top 3-5 relevant chunks
- Construct prompt with retrieved context
- Call Ollama LLM for answer generation (llama3.2 or deepseek-r1)
- Return formatted response with source references
- Add webhook endpoint for external integrations
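The retrieval and prompt-construction steps above can be sketched in plain Python. The result dictionaries, file paths, and prompt wording below are hypothetical; in the workflow the results would come from the vector search and the prompt would be sent to the Ollama LLM.

```python
def select_context(results, threshold=0.5, top_k=5):
    """Keep results above the similarity threshold, best first, at most top_k."""
    kept = [r for r in results if r["similarity"] > threshold]
    kept.sort(key=lambda r: r["similarity"], reverse=True)
    return kept[:top_k]


def build_prompt(question, chunks):
    """Assemble an LLM prompt that labels each chunk with its source file."""
    context = "\n\n".join(
        f"[{i}] ({c['file_path']})\n{c['chunk_text']}"
        for i, c in enumerate(chunks, 1)
    )
    return (
        "Answer the question using only the context below. "
        "Cite sources by their bracketed number.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical search results, for illustration only.
results = [
    {"file_path": "docs/backups.md", "chunk_text": "PBS runs nightly backups.", "similarity": 0.58},
    {"file_path": "docs/network.md", "chunk_text": "VLAN 2 hosts the lab.", "similarity": 0.31},
]
chunks = select_context(results)
prompt = build_prompt("How do I back up a VM?", chunks)
```

Numbering the chunks in the prompt gives the model something concrete to cite, which supports the "source references" step in the checklist.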
Context
RAG Architecture Overview:
- Ingestion Pipeline: Gitea API → Text Chunking → Ollama Embeddings → Vector Database
- Query Pipeline: User Question → Embedding → Vector Search → Context Retrieval → LLM Generation → Answer
Phase 3 Achievements (2025-12-25):
- ✅ PostgreSQL + pgvector fully operational on CT 113
- ✅ Vector search working with 0.5765 similarity for related concepts
- ✅ Production-ready Python module (vector_search.py) with insert/search/stats functions
- ✅ Debugged and resolved 2 critical issues:
  - Embedding storage: Fixed psycopg2 parameter handling (must cast to ::vector(768) in SQL, not Python)
  - ORDER BY bug: Subquery approach works, CTE approach fails (use ORDER BY similarity DESC instead of vector operation)
Key Learnings:
- ✅ Embeddings convert text to 768-dimensional vectors representing semantic meaning
- ✅ Vector databases enable semantic search (meaning-based, not keyword-based)
- ✅ pgvector cosine distance operator (<=>) returns distance, not similarity: 0=identical, 2=opposite (similarity = 1 - distance)
- ✅ Similarity scores: >0.7=highly relevant, 0.5-0.7=related, 0.3-0.5=somewhat related, <0.3=unrelated
- ✅ psycopg2 doesn't natively support pgvector - must format vectors as strings and cast in SQL
- ✅ Reusing vector parameters in ORDER BY causes silent failures - use subqueries instead
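The distance scale and the relevance bands listed above can be reproduced in a few lines of pure Python. This is an illustrative sketch rather than code from vector_search.py; the function names are hypothetical, and the handling of the exact band boundaries (0.7, 0.5, 0.3) is an assumption because the list does not specify it.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / norm


def cosine_distance(a, b):
    """The quantity pgvector's <=> operator returns: 1 - cosine similarity."""
    return 1 - cosine_similarity(a, b)


def relevance_band(similarity):
    """Map a similarity score to the bands listed above (boundary handling assumed)."""
    if similarity > 0.7:
        return "highly relevant"
    if similarity >= 0.5:
        return "related"
    if similarity >= 0.3:
        return "somewhat related"
    return "unrelated"


print(cosine_distance([1, 0], [1, 0]))   # 0.0
print(cosine_distance([1, 0], [-1, 0]))  # 2.0
print(relevance_band(0.5765))            # related
```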
Technical Stack Validated:
- Ollama API (192.168.1.81:11434) ✅ Accessible across subnets
- nomic-embed-text model ✅ 768 dimensions, fast generation
- PostgreSQL 16.11 + pgvector 0.8.1 ✅ Operators working correctly
- Python psycopg2 ✅ With workarounds for vector handling
Success Metrics - Phase 3:
- ✅ Successfully query "how to backup VM" and retrieve relevant homelab documentation (0.5765 similarity)
- ✅ Understand each component of the vector storage pipeline
- ✅ Create reusable Python module for n8n integration
Next Steps - Phase 4:
- Deploy vector_search.py to CT 113 and test CLI interface
- Create text chunking function (500 char chunks, 100 char overlap)
- Build minimal n8n workflow: Manual Trigger → Gitea API → Chunk → Ollama → PostgreSQL
- Scale to process all .md files in homelab repository
- Add error handling and deduplication logic
Session Handoff Document: /home/jramos/homelab/n8n/SESSION_HANDOFF_PHASE4_READY.md
Learning Resources: Step-by-step lessons with examples, mental models, troubleshooting guide
Previous Initiative: Security Audit Remediation - Q4 2025
Goal
Remediate 31 security findings identified in comprehensive security audit (2025-12-20), addressing critical vulnerabilities in Docker socket exposure, credential management, and SSL/TLS configuration.
Phase
Planning - Documentation Complete, Remediation Pending
Progress Checklist
Phase 1: Immediate Actions (Week 1) - Est. 30 min downtime
- Complete security audit (31 findings documented)
- Create remediation scripts (8 scripts validated)
- Document security baseline in SECURITY.md
- Backup all service configurations (backup-before-remediation.sh)
- Migrate secrets to .env files (ByteStash, Paperless-ngx, Speedtest Tracker)
Phase 2: Low-Risk Changes (Weeks 2-3) - Est. 2-4 hours downtime
- Deploy docker-socket-proxy
- Rotate Proxmox API credentials (rotate-pve-credentials.sh)
- Rotate database passwords (rotate-paperless-password.sh)
- Rotate JWT secrets (rotate-bytestash-jwt.sh)
Phase 3: High-Risk Changes (Month 2) - Est. 4-8 hours downtime
- Migrate Portainer to socket proxy
- Migrate NPM to socket proxy or remove socket access
- Remove socket mounts from Speedtest Tracker
- Implement SSL/TLS for internal services
- Enable container user namespacing
Phase 4: Infrastructure Improvements (Quarter 1) - Est. 8-16 hours
- Implement network segmentation (VLANs for service tiers)
- Deploy fail2ban for rate limiting
- Enable backup encryption (PBS configuration)
- Container vulnerability scanning pipeline
- Automated credential rotation system
Context
Security audit revealed critical infrastructure vulnerabilities requiring systematic remediation. Priority on CRITICAL findings (CVSS 8.5-9.8) to reduce attack surface and prevent credential compromise.
Risk Management:
- Phase 1: Zero downtime (configuration changes only)
- Phase 2: Minimal downtime (credential rotation, proxy deployment)
- Phase 3: Moderate downtime (service reconfiguration)
- Phase 4: Planned maintenance windows (infrastructure changes)
Success Metrics:
- All CRITICAL findings remediated (6/6)
- All HIGH findings remediated (3/3)
- Secrets removed from git repository
- Docker socket access eliminated or proxied
- SSL/TLS enabled for all external services
Previous Initiative: Claude Code Tool Inheritance Bug Investigation (2025-12-18)
Goal
Investigate and document a critical bug in Claude Code CLI where sub-agents with explicit tools: declarations receive only a subset of their configured tools, with first and last array elements consistently dropped.
Phase
COMPLETED - Bug confirmed, comprehensive report generated for Anthropic
Progress Checklist
- Reproduce bug with scribe agent (confirmed: missing Read and Write)
- Reproduce bug with lab-operator agent (confirmed: missing Bash and Write)
- Test backend-builder agent (working correctly - exception to pattern)
- Test librarian agent (working correctly - no tools: declaration)
- Identify pattern: First and last tools dropped for agents with explicit tools: arrays
- Document impact: Scribe cannot create docs, lab-operator cannot execute commands
- Generate comprehensive bug report for Anthropic with all evidence
- Update CLAUDE_STATUS.md with investigation status
- Submit bug report to Anthropic via GitHub issues
Key Findings
Bug Pattern: Sub-agents with tools: [A, B, C, D, E] receive only [B, C, D] at runtime
Affected: scribe (no Read/Write), lab-operator (no Bash/Write)
Unaffected: backend-builder (exception), librarian (no tools: line)
Workaround: Remove tools: declarations to grant all tools by default
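The observed pattern can be modeled in a few lines. This is only a description of the symptom as reported above, not Claude Code's actual implementation, and the example tool list is hypothetical (the real scribe declaration is not reproduced here).

```python
def observed_tools(declared: list[str]) -> list[str]:
    """Model the reported symptom: the first and last declared tools vanish."""
    return declared[1:-1]


def missing_tools(declared: list[str]) -> list[str]:
    """Tools that were declared but are absent at runtime under this model."""
    received = set(observed_tools(declared))
    return [tool for tool in declared if tool not in received]


# Hypothetical declaration with Read first and Write last.
scribe_declared = ["Read", "Grep", "Glob", "Edit", "Write"]
print(observed_tools(scribe_declared))  # ['Grep', 'Glob', 'Edit']
print(missing_tools(scribe_declared))   # ['Read', 'Write']
```

Under this model any agent whose critical tools sit at either end of the array loses them, which matches the scribe and lab-operator findings.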
Artifacts:
- Bug report: /home/jramos/homelab/troubleshooting/ANTHROPIC_BUG_REPORT_TOOL_INHERITANCE.md
- Original report: /home/jramos/homelab/troubleshooting/BUG_REPORT.md
- Test agent IDs: scribe=a32bd54, lab-operator=ad681e8, backend-builder=aba15f6, librarian=a4cfeb7
Context
Critical workflow disruption: Documentation and infrastructure operations workflows completely broken due to missing tools. This is a Claude Code CLI internal bug, not a user configuration issue.
Previous Initiative: Sub-Agent Architecture Optimization (2025-12-07)
Goal
Improve the quality and effectiveness of all sub-agent prompt definitions to match best practices identified through comprehensive Opus-powered prompt engineering analysis. Target: bring all sub-agents to the quality standard established by librarian.md (~120-340 lines with comprehensive examples, safety protocols, and decision frameworks).
Phase
COMPLETED - All sub-agent improvements and validations finished
Progress Checklist
- Prompt engineering analysis completed (Opus model)
  - Analyzed CLAUDE.md and all 4 sub-agent files
  - Identified 5 critical issues, 12 high-impact improvements
  - Generated comprehensive improvement recommendations
- scribe.md improved (29 → 340 lines)
  - Added 6 usage examples (4 positive, 2 negative redirects)
  - Implemented comprehensive responsibilities section
  - Added 3 complete ASCII diagram templates
  - Included safety protocols and decision frameworks
  - Quality now matches librarian.md standard
- backend-builder.md improved (40 → 291 lines)
  - Added 6 usage examples with clear boundaries
  - Expanded core responsibilities with Ansible, Terraform, Docker Compose, Python, Shell
  - Added technology stack table and validation rules table
  - Included safety protocols for secrets and destructive operations
  - Added handoff protocol for lab-operator deployment
  - Defined clear boundaries (CREATES code, does NOT deploy)
- lab-operator.md improved (37 → 193 lines)
  - Added 6 usage examples with role clarity
  - Expanded domain expertise with specific commands
  - Added command style guide (5-step pattern)
  - Included safety protocols and decision-making framework
  - Added error handling and escalation guidelines
  - Defined clear boundaries (DEPLOYS/OPERATES, does NOT create IaC)
- CLAUDE.md structural fixes
  - Moved YAML frontmatter to line 1 (was at line 89)
  - Fixed trailing pipe character on line 87
  - Completed incomplete sentence about backup strategy
  - Completed incomplete sentence about storage growth
  - Removed redundant "Key Services" reference
  - Expanded status file template with actual structure and recovery instructions
- Final validation and testing
  - librarian: Git status check successful, clear output format
  - scribe: File reading functional (note: reported encoding issue, likely false positive)
  - backend-builder: YAML validation successful, proper syntax checking
  - lab-operator: Directory listing successful, proper command execution
  - All agents demonstrate improved structure and clarity
Context
Why It Matters: Well-designed sub-agent prompts improve task routing accuracy, execution quality, error reduction, and maintainability. The librarian.md agent (143 lines) sets the quality standard; scribe was severely underdeveloped at 29 lines before improvement.
Next Steps: Improve backend-builder.md and lab-operator.md using scribe.md as quality template.
Previous Phase: Infrastructure Documentation Complete
Goal
Comprehensive documentation of monitoring stack and updated infrastructure inventory.
Phase
Documentation & Maintenance
Completed Tasks
- Created /home/jramos/homelab/monitoring/README.md with comprehensive monitoring documentation
- Updated CLAUDE_STATUS.md with current infrastructure state
- Documented 8 VMs, 2 Templates, and 4 LXC containers
- Updated storage statistics (PBS 27.43%, Vault 10.88%, local 15.13%)
- Added monitoring stack architecture and deployment procedures
- Documented new services: monitoring-docker, twingate-connector, n8n
- Referenced latest export: disaster-recovery/homelab-export-20251207-120040
Remaining Documentation Tasks
- Update INDEX.md with monitoring section and current VM/CT counts
- Update README.md with infrastructure (8 VMs, 2 Templates, 4 LXC)
- Update CLAUDE.md with architecture tables for monitoring and zero-trust
- Update services/README.md with monitoring stack and twingate sections
- Verify all documentation cross-references are accurate
- Test monitoring stack deployment procedures
Access Information
Management Interfaces
- Proxmox UI: https://192.168.2.200:8006
- Grafana: http://192.168.2.114:3000
- Prometheus: http://192.168.2.114:9090
- Nginx Proxy Manager: http://192.168.2.101:81
- n8n: http://192.168.2.113:5678
- TinyAuth: https://tinyauth.apophisnetworking.net (internal: http://192.168.2.10:8000)
- OpenClaw: https://openclaw.apophisnetworking.net (internal: http://192.168.2.120:18789)
Key Network Segments
- Management Network: 192.168.2.0/24
- Proxmox Host: 192.168.2.200
- Reverse Proxy: 192.168.2.101 (CT 102)
- TinyAuth: 192.168.2.10 (CT 115)
- n8n: 192.168.2.113 (CT 113)
- Monitoring: 192.168.2.114 (VM 101)
- OpenClaw: 192.168.2.120 (VM 120)
Maintenance Schedule
Automated Tasks
- Backups: Proxmox Backup Server - Daily incremental, Weekly full
- Monitoring Scrapes: Prometheus - Every 30 seconds
- Certificate Renewal: Nginx Proxy Manager - Automatic via Let's Encrypt
Recommended Manual Tasks
- Weekly: Review Grafana dashboards for anomalies
- Monthly: Update monitoring stack Docker images
- Quarterly: Review backup retention policies
- Semi-Annual: Kernel updates on Proxmox host and VMs
Known Issues & Resolutions
Resolved
- n8n PostgreSQL locale errors (fixed with fix_n8n_db_c_locale.sh)
- n8n database permissions (fixed with fix_n8n_db_permissions.sh)
Active Security Vulnerabilities (2025-12-20 Audit)
CRITICAL Severity:
1. Docker Socket Exposure (CVSS 9.8)
   - Affected: Portainer, Nginx Proxy Manager, Speedtest Tracker
   - Impact: Container escape to root access
   - Remediation: Deploy docker-socket-proxy (Phase 2)
2. Proxmox Credentials in Plaintext (CVSS 9.1)
   - Affected: PVE Exporter .env and pve.yml
   - Impact: Full infrastructure compromise
   - Remediation: Rotate credentials, use API tokens (Phase 2)
3. Database Passwords in Git (CVSS 8.5)
   - Affected: Paperless-ngx, ByteStash, Speedtest Tracker
   - Impact: Credential exposure to all repository users
   - Remediation: Migrate to .env files, scrub git history (Phase 1)
HIGH Severity:
4. Missing SSL/TLS (CVSS 7.5)
   - Affected: Internal service communication
   - Impact: Traffic interception, credential sniffing
   - Remediation: Enable HTTPS via NPM or self-signed certs (Phase 3)
5. Weak/Default Passwords (CVSS 7.2)
   - Affected: Multiple services
   - Impact: Brute-force attacks, unauthorized access
   - Remediation: Generate strong passwords, implement rotation (Phase 2)
6. Containers Running as Root (CVSS 7.0)
   - Affected: Most Docker containers
   - Impact: Privilege escalation if container compromised
   - Remediation: Enable user namespacing, set non-root users (Phase 3)
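Two of the findings above (socket mounts and root containers) can be checked mechanically once a Compose file has been parsed into a dictionary. The following is a rough sketch under stated assumptions: the function name, the finding strings, and the sample services are hypothetical and not taken from the repository, and a missing user key only means the image default applies, which may or may not be root.

```python
DOCKER_SOCKET = "/var/run/docker.sock"


def audit_services(compose: dict) -> dict[str, list[str]]:
    """Map each service name to the findings that apply to it."""
    findings: dict[str, list[str]] = {}
    for name, service in (compose.get("services") or {}).items():
        issues = []
        for volume in service.get("volumes") or []:
            # Short syntax is "host:container[:mode]"; long syntax is a mapping.
            if isinstance(volume, dict):
                source = volume.get("source", "")
            else:
                source = str(volume).split(":")[0]
            if source == DOCKER_SOCKET:
                issues.append("mounts docker.sock")
        # "user" may be "uid", "uid:gid", or a name; only the first part matters.
        user = str(service.get("user", "")).split(":")[0]
        if user in ("", "0", "root"):
            issues.append("no explicit non-root user")
        if issues:
            findings[name] = issues
    return findings


# Illustrative input, shaped like the output of a YAML parser.
example = {
    "services": {
        "portainer": {"volumes": ["/var/run/docker.sock:/var/run/docker.sock"]},
        "app": {"user": "1000:1000", "volumes": ["./data:/data"]},
    }
}
print(audit_services(example))
```

The function takes an already-parsed dictionary so that it stays within the standard library; loading the YAML itself would need a third-party parser.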
Remediation Timeline: See "Security Audit Remediation - Q4 2025" initiative above
Active Monitoring
- PVE Exporter SSL verification (set to false for self-signed certificates) - SECURITY RISK
- Prometheus retention policies (currently 15 days, may need adjustment)
- Security script container names need verification (3/8 scripts)
Deferred
- NetBox container offline (on-demand service)
- Development VMs stopped (resource conservation)
- Network segmentation implementation (Phase 4)
- Backup encryption (Phase 4)
Version History
- v2.1.0 (2025-12-07): Added monitoring stack, twingate connector, updated infrastructure counts
- v2.0.0 (2025-12-02): Repository reorganization, services migration from GitLab
- v1.0.0 (2025-11-29): Initial infrastructure documentation
Maintained by: jramos
Repository: Homelab Infrastructure Configuration
Platform: Proxmox VE 8.4.0
Infrastructure Scale: 10 VMs, 2 Templates, 5 Containers
Current Status: Operational - OpenClaw Deployment In Progress