homelab/CLAUDE_STATUS.md
Jordan Ramos e08951de21 feat(openclaw): deploy OpenClaw AI chatbot gateway on VM 120
- Add Docker Compose configs with security hardening (cap_drop ALL, non-root, read-only FS)
- Add Prometheus node_exporter scrape target for 192.168.2.120:9100
- Update services/README.md, INDEX.md, and CLAUDE_STATUS.md with VM 120
- Image pinned to v2026.2.1 (patches CVE-2026-25253)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-02-03 18:14:58 -07:00

Homelab Infrastructure Status

Last Updated: 2026-02-03
Export Reference: disaster-recovery/homelab-export-20251211-144345
Current Session: OpenClaw Deployment - VM 120

Quick Resume (Current Session Context)

Where We Are: OpenClaw deployed and healthy on VM 120. Container running with full security hardening. Backups configured. Manual steps remain for NPM proxy host, Twingate resource, and Prometheus config on VM 101.

Completed:

  • Config files created (services/openclaw/)
  • VM 120 created and hardened (UFW, fail2ban, node-exporter, openclaw user)
  • OpenClaw container deployed and healthy (v2026.2.1)
  • Security verified (cap_drop ALL, non-root, read-only FS, no docker.sock)
  • Prometheus scrape target added to repo copy
  • PBS backup job created (daily 02:00, snapshot, zstd)
  • Application backup script + weekly cron configured
  • Documentation updated (README, services/README, CLAUDE_STATUS, INDEX)
  • node_exporter installed and serving metrics on 192.168.2.120:9100

Manual Steps Remaining:

  • NPM: Create proxy host for openclaw.apophisnetworking.net -> 192.168.2.120:18789 (WebSocket support, SSL, TinyAuth)
  • Twingate: Add resource for 192.168.2.120 ports 18789/18790/1455
  • VM 101: Deploy updated prometheus.yml via Proxmox web console (SSH not configured)
  • Configure at least one LLM provider API key in /opt/openclaw/.env

Current Infrastructure Snapshot

Proxmox Environment

  • Node: serviceslab
  • Version: Proxmox VE 8.4.0
  • Management IP: 192.168.2.100
  • Architecture: Single-node cluster
  • Total Resources: 10 VMs, 2 Templates, 5 LXC Containers

Virtual Machines (QEMU/KVM) - 10 VMs

| VM ID | Name | IP Address | Status | Purpose |
|-------|------|------------|--------|---------|
| 100 | docker-hub | 192.168.2.102 | Running | Container registry/Docker hub mirror |
| 101 | monitoring-docker | 192.168.2.114 | Running | Monitoring stack (Grafana/Prometheus/PVE Exporter) |
| 105 | dev | - | Stopped | General-purpose development workstation |
| 106 | Ansible-Control | 192.168.2.XXX | Running | IaC orchestration, configuration management |
| 108 | CML | - | Stopped | Cisco Modeling Labs - network simulation |
| 109 | web-server-01 | 192.168.2.XXX | Running | Web application server (clustered) |
| 110 | web-server-02 | 192.168.2.XXX | Running | Load-balanced pair with web-server-01 |
| 111 | db-server-01 | 192.168.2.XXX | Running | Backend database server |
| 114 | haos | 192.168.2.XXX | Running | Home Assistant OS - smart home automation platform |
| 120 | openclaw | 192.168.2.120 | Running | OpenClaw AI chatbot gateway |

Recent Changes:

  • Added VM 120 (openclaw) for multi-platform AI chatbot gateway (2026-02-03)
  • Added VM 101 (monitoring-docker) for dedicated monitoring infrastructure
  • Removed VM 101 (gitlab) - service decommissioned

VM Templates - 2 Templates

| Template ID | Name | Purpose |
|-------------|------|---------|
| 104 | ubuntu-dev | Ubuntu development environment template for cloning |
| 107 | ubuntu-docker | Ubuntu Docker host template for rapid deployment |

Note: Templates are immutable base images used for cloning new VMs, not running workloads. They provide standardized configurations for consistent infrastructure provisioning.


Containers (LXC) - 5 Containers

| CT ID | Name | IP Address | Status | Purpose |
|-------|------|------------|--------|---------|
| 102 | nginx | 192.168.2.101 | Running | Reverse proxy/load balancer & NPM |
| 103 | netbox | 192.168.2.XXX | Running | Network documentation/IPAM |
| 112 | twingate-connector | 192.168.2.XXX | Running | Zero-trust network access connector |
| 113 | n8n | 192.168.2.113 | Running | Workflow automation platform |
| 115 | tinyauth | 192.168.2.10 | Running | SSO authentication layer for NetBox |

Recent Changes:

  • Added CT 115 (tinyauth) for SSO authentication integration with NetBox
  • Added CT 112 (twingate-connector) for zero-trust network security
  • Added CT 113 (n8n) for workflow automation
  • Removed CT 112 (Anytype) - replaced by n8n

Storage Architecture

| Storage Pool | Type | Total | Used | % Used | Purpose |
|--------------|------|-------|------|--------|---------|
| local | Directory | - | - | 19.11% | System files, ISOs, templates |
| local-lvm | LVM-Thin | - | - | 0.01% | VM disk images (thin provisioned) |
| Vault | NFS/Directory | - | - | 12.13% | Secure storage for sensitive data |
| PBS-Backups | PBS | - | - | 28.27% | Automated backup repository |
| iso-share | NFS/CIFS | - | - | 1.45% | Installation media library |
| localnetwork | Network Share | - | - | N/A | Shared resources across infrastructure |

Capacity Notes:

  • PBS-Backups utilization increased to 28.27% (healthy retention)
  • Vault utilization increased to 12.13% (data growth monitored)
  • local storage at 19.11% (system overhead within normal range)

Key Services & Stacks

Monitoring & Observability (NEW)

VM 101 - monitoring-docker (192.168.2.114)

  • Grafana: Port 3000 - Visualization and dashboards
  • Prometheus: Port 9090 - Metrics collection and time-series database
  • PVE Exporter: Port 9221 - Proxmox VE metrics exporter
  • Documentation: /home/jramos/homelab/monitoring/README.md
  • Status: Fully operational

Network Security (NEW)

CT 112 - twingate-connector

  • Purpose: Zero-trust network access
  • Type: Lightweight connector
  • Status: Running
  • Integration: Connects homelab to Twingate network

Automation & Integration

CT 113 - n8n (192.168.2.113)

  • Purpose: Workflow automation platform
  • Technology: n8n.io
  • Database: PostgreSQL 16.11 (upgraded from 15+) with pgvector 0.8.1
  • Features: API integration, scheduled workflows, webhook triggers
  • Documentation: /home/jramos/homelab/services/README.md#n8n-workflow-automation
  • Status: Operational (resolved database locale issues)

Authentication & SSO

CT 115 - tinyauth (192.168.2.10)

  • Purpose: Lightweight SSO authentication layer
  • Technology: TinyAuth v4 (Docker container)
  • Port: 8000
  • Domain: tinyauth.apophisnetworking.net
  • Integration: Authentication gateway for NetBox via Nginx Proxy Manager
  • Security: Bcrypt-hashed credentials, HTTPS enforcement
  • Documentation: /home/jramos/homelab/services/tinyauth/README.md
  • Status: Operational

AI Chatbot Gateway

VM 120 - openclaw (192.168.2.120)

  • Purpose: Multi-platform AI chatbot gateway
  • Technology: OpenClaw (Docker container)
  • Ports: 18789 (Gateway WS+UI), 18790 (Bridge), 1455 (OAuth)
  • Domain: openclaw.apophisnetworking.net
  • LLM Providers: Anthropic, OpenAI, Ollama
  • Messaging: Discord, Telegram, Slack, WhatsApp
  • Security: CVE-2026-25253 patched (v2026.2.1), cap_drop ALL, non-root, read-only FS
  • Documentation: /home/jramos/homelab/services/openclaw/README.md
  • Status: Operational - Container healthy

Infrastructure Documentation

CT 103 - netbox

  • Purpose: Network documentation and IPAM
  • Status: Running (activated 2025-12-11; previously stopped for on-demand use)
  • Function: Infrastructure source of truth

Reverse Proxy & Load Balancing

CT 102 - nginx (192.168.2.101)

  • Purpose: Nginx Proxy Manager
  • Ports: 80, 81, 443
  • Function: SSL termination, reverse proxy, certificate management
  • Upstream Services: All web-facing applications

Three-Tier Application Stack

Web Tier:

  • VM 109 (web-server-01) - Primary web server
  • VM 110 (web-server-02) - Load-balanced pair

Database Tier:

  • VM 111 (db-server-01) - Backend database

Proxy Tier:

  • CT 102 (nginx) - Load balancer and SSL termination

Development & Automation

VM 106 - Ansible-Control

  • Purpose: Infrastructure as Code orchestration
  • Tools: Ansible, Terraform/OpenTofu (potential)
  • Status: Running

Container Registry

VM 100 - docker-hub

  • Purpose: Local Docker registry and hub mirror
  • Function: Caching container images for faster deployments
  • Status: Running

Network Simulation

VM 108 - CML

  • Purpose: Cisco Modeling Labs
  • Function: Network topology testing and simulation
  • Status: Stopped (resource-intensive, on-demand use)

Architecture Patterns

Monitoring & Observability (NEW)

The infrastructure now implements a comprehensive monitoring stack following industry best practices:

  • Metrics Collection: Prometheus scraping Proxmox metrics via PVE Exporter
  • Visualization: Grafana providing real-time dashboards and alerting
  • Isolation: Dedicated VM for monitoring services (fault isolation)
  • Integration: Ready for AlertManager, additional exporters, and integrations

Design Decision: VM-based deployment provides kernel-level isolation and prevents resource contention with critical infrastructure services.

Zero-Trust Security (NEW)

Implementation of zero-trust network access principles:

  • Twingate Connector: Lightweight connector providing secure access without VPNs
  • Container Deployment: LXC container for minimal resource overhead
  • Network Segmentation: Secure access to homelab from external networks

Design Decision: LXC container chosen for quick provisioning and low resource consumption.

Automation-First Approach

Workflow automation and infrastructure orchestration:

  • n8n Platform: Visual workflow builder for API integrations
  • Scheduled Tasks: Automated backup checks, monitoring alerts, reports
  • Integration Hub: Connects monitoring, documentation, and operational tools

Design Decision: PostgreSQL backend ensures data persistence and supports complex workflows.

Tiered Application Architecture

Classic three-tier design for production-like environments:

  • Presentation Tier: Paired web servers (109, 110) behind load balancer
  • Business Logic: Application processing on web tier
  • Data Tier: Dedicated database server (111) with backup strategy

Design Decision: Separation of concerns, scalability testing, high availability patterns.

Selective Containerization Strategy

Hybrid approach balancing performance and resource efficiency:

  • LXC Containers: Lightweight services (nginx, netbox, twingate, n8n)
  • Full VMs: Complex applications, kernel dependencies, heavy workloads
  • Rationale: LXC for ~10x lower overhead, VMs for isolation and compatibility

Recent Infrastructure Changes

2026-02-03: OpenClaw AI Chatbot Gateway Deployment (In Progress)

Service: VM 120 - OpenClaw multi-platform AI chatbot gateway

Purpose: Bridge messaging platforms (Discord, Telegram, Slack, WhatsApp) with LLM providers (Anthropic, OpenAI, Ollama) through a unified gateway.

Specifications:

  • VM: 120 (cloned from template 107, ubuntu-docker)
  • IP: 192.168.2.120
  • Resources: 4 vCPUs, 16GB RAM, 50GB disk on Vault (ZFS)
  • Ports: 18789 (Gateway WS+UI), 18790 (Bridge), 1455 (OAuth)
  • Domain: openclaw.apophisnetworking.net
  • Image: ghcr.io/openclaw/openclaw:2026.2.1

Security Hardening:

  • Version >= 2026.2.1 (patches CVE-2026-25253, CVSS 8.8 1-click RCE)
  • All ports bound to 127.0.0.1 (reverse proxy required)
  • Docker: cap_drop ALL, no-new-privileges, read-only filesystem, non-root user (1001:1001)
  • UFW: deny-all + whitelist 192.168.2.0/24 + 192.168.1.91 (desktop PC)
  • fail2ban on SSH (3 retries), unattended-upgrades
  • Prometheus node_exporter at port 9100

Completed Steps:

  • Docker Compose configuration files created
  • Security hardening overlay (docker-compose.override.yml)
  • Environment variable template (.env.example)
  • Prometheus scrape target added
  • Documentation created (README, services/README, CLAUDE_STATUS, INDEX)
  • VM 120 Creation & SSH Setup
  • OS Hardening (UFW, user creation)

Pending Steps:

  • NPM reverse proxy configuration (manual - web UI)
  • Twingate resource creation (manual - admin console)
  • Prometheus config on VM 101 (manual - no SSH access)
  • Configure LLM provider API key in .env

Status: Container healthy - Manual network integration remaining


2025-12-20: Comprehensive Security Audit Completed

Activity: Complete infrastructure security assessment and remediation planning

Audit Scope:

  • All Docker Compose services (Portainer, NPM, Paperless-ngx, ByteStash, Speedtest Tracker, FileBrowser)
  • Proxmox VE infrastructure and API access
  • Network security and segmentation
  • Credential management and storage
  • SSL/TLS configuration
  • Container security and runtime configuration

Findings Summary:

  • CRITICAL (6): Docker socket exposure, hardcoded credentials, database passwords in git
  • HIGH (3): Missing SSL/TLS, weak passwords, containers running as root
  • MEDIUM (2): SSL verification disabled, missing authentication
  • LOW (20): Documentation gaps, monitoring improvements, backup encryption

Deliverables:

  1. Security Policy (SECURITY.md): 864 lines - Comprehensive security best practices
  2. Audit Report (troubleshooting/SECURITY_AUDIT_2025-12-20.md): 2,350 lines - Detailed findings and remediation plan
  3. Security Checklist (templates/SECURITY_CHECKLIST.md): 750 lines - Pre-deployment validation template
  4. Validation Report (scripts/security/VALIDATION_REPORT.md): 2,092 lines - Script safety assessment
  5. Container Fixes (scripts/security/CONTAINER_NAME_FIXES.md): 621 lines - Container name verification
  6. Security Scripts (8 total):
    • verify-service-status.sh - Service health checker
    • backup-before-remediation.sh - Comprehensive backup utility
    • rotate-pve-credentials.sh - Proxmox credential rotation
    • rotate-paperless-password.sh - Database password rotation
    • rotate-bytestash-jwt.sh - JWT secret rotation
    • rotate-logward-credentials.sh - Multi-service credential rotation
    • docker-socket-proxy/docker-compose.yml - Security proxy deployment
    • portainer/docker-compose.socket-proxy.yml - Portainer migration config

Script Validation:

  • Ready for execution: 5/8 scripts (verify-service-status.sh, rotate-pve-credentials.sh, rotate-bytestash-jwt.sh, backup-before-remediation.sh, docker-socket-proxy)
  • Needs container name fixes: 3/8 scripts (see CONTAINER_NAME_FIXES.md)

4-Phase Remediation Roadmap:

  • Phase 1 (Week 1): Immediate actions - Backups, secrets migration
  • Phase 2 (Weeks 2-3): Low-risk changes - Socket proxy, credential rotation
  • Phase 3 (Month 2): High-risk changes - Service migrations, SSL/TLS
  • Phase 4 (Quarter 1): Infrastructure - Network segmentation, scanning pipelines

Estimated Timeline:

  • Total downtime: 6-13 minutes (sequential script execution)
  • Full remediation: 8-16 weeks

Risk Assessment:

  • Current risk: HIGH - Multiple CRITICAL vulnerabilities active
  • Post-Phase 1 risk: MEDIUM - Credential exposure mitigated
  • Post-Phase 3 risk: LOW - All CRITICAL/HIGH findings remediated
  • Post-Phase 4 risk: VERY LOW - Defense-in-depth implemented

Status: Documentation complete, awaiting remediation execution approval


2025-12-18: TinyAuth SSO Deployment

Service Deployed: CT 115 - TinyAuth authentication layer

Purpose: Centralized SSO authentication for NetBox and future homelab services

Specifications:

  • Container: CT 115 (LXC with Docker)
  • IP Address: 192.168.2.10
  • Domain: tinyauth.apophisnetworking.net
  • Port: 8000 (external), 3000 (internal)
  • Docker Image: ghcr.io/steveiliop56/tinyauth:v4
  • Resource Usage: ~50-100 MB memory, <1% CPU

Integration Architecture:

  • Internet → Nginx Proxy Manager (CT 102) → TinyAuth (CT 115) → NetBox (CT 103)
  • NPM uses auth_request directive to validate credentials via TinyAuth
  • Bcrypt-hashed password storage for security
  • HTTPS enforcement via NPM SSL termination

Issues Resolved During Deployment:

  1. 500 Internal Server Error: Fixed Nginx advanced config syntax
  2. IP addresses not allowed: Changed APP_URL from IP to domain
  3. Port mapping: Corrected Docker port mapping from 8000:8000 to 8000:3000
  4. Invalid password: Implemented bcrypt hash requirement for TinyAuth v4

Integration Impact:

  • NetBox now protected by centralized authentication
  • Foundation for extending SSO to other services (Grafana, Proxmox UI future candidates)
  • Authentication logs available for security auditing

Documentation: Complete guide at /home/jramos/homelab/services/tinyauth/README.md

Status: Operational - Successfully authenticating NetBox access


2025-12-11: Loki-Stack Monitoring Fully Operational

Issue Resolved: Centralized logging pipeline now receiving syslog from UniFi router

Root Cause: rsyslog filter in /etc/rsyslog.d/unifi-router.conf was configured for wrong source IP (192.168.1.1 instead of 192.168.2.1)

Fix Applied: Updated rsyslog filter to match VLAN 2 gateway IP (192.168.2.1)

Status: Complete - Logs flowing UniFi → rsyslog → Promtail → Loki → Grafana

Services Affected:

  • VM 101 (monitoring-docker): rsyslog configuration updated
  • Loki-stack: All components operational
  • Grafana: Dashboards receiving real-time syslog data

Technical Details: See troubleshooting/loki-stack-bugfix.md for complete 5-phase troubleshooting history


2025-12-11: Infrastructure Expansion & System Updates

Proxmox VE Platform Upgrade

  • Upgraded: Proxmox VE 8.3.3 → 8.4.0
  • Kernel: 6.8.12-8-pve
  • pve-manager: 8.4.14
  • Impact: Enhanced performance, security updates, bug fixes
  • Status: Complete - All VMs and containers operating normally

New VM 114: Home Assistant OS Deployment

  • Service: haos (Home Assistant Operating System)
  • Purpose: Smart home automation and integration platform
  • Specifications:
    • Memory: 4 GB (87% utilized)
    • CPU: 2 vCPUs
    • Boot Disk: 50 GB
    • Status: Running (~3 days uptime)
  • Rationale: Centralized home automation hub for IoT device management
  • Integration: Will integrate with monitoring stack for infrastructure metrics

CT 103: NetBox IPAM Activated

  • Service: netbox (Network Documentation & IPAM)
  • Status Change: Stopped → Running
  • Uptime: ~3.1 days
  • Resource Usage: 1.28 GB / 2 GB memory (64%)
  • Purpose: Active network documentation and IP address management
  • Rationale: Required for ongoing infrastructure expansion planning

Storage Utilization Changes

  • PBS-Backups: 27.43% → 28.27% (+0.84%) - Normal backup retention growth
  • Vault (ZFS): 10.88% → 12.13% (+1.25%) - Data accumulation monitored
  • local: 15.13% → 19.11% (+3.98%) - New VM deployment and system updates
  • iso-share: 1.4% → 1.45% (+0.05%) - Minimal change
  • local-lvm: 0.0% → 0.01% (+0.01%) - Thin provisioned storage baseline

2025-12-25: RAG Vector Search - Phase 3 Complete

Activity: Implemented and debugged production-ready vector search system for AI-powered documentation retrieval

Deliverables:

  1. Production Module (n8n/vector_search.py): Complete API for semantic search

    • search_similar_documents() - Query with natural language
    • insert_document() - Add documents with embeddings
    • get_stats() - Database statistics
    • delete_by_repo() - Bulk cleanup
    • CLI interface for testing and manual operations
  2. Documentation Suite:

    • SESSION_HANDOFF_PHASE4_READY.md (17KB) - Comprehensive learning guide for next session
    • PHASE3_COMPLETE.md (12KB) - Complete debugging summary and deployment guide
    • VECTOR_SEARCH_DEBUG.md (4.7KB) - Technical root cause analysis
    • VECTOR_SEARCH_COMPARISON.md (2.5KB) - Before/after code comparison
  3. Diagnostic Scripts (8 total):

    • Embedding storage repair, parameter binding tests, SQL validation
    • All scripts validated and preserved for reference

Technical Achievement:

  • PostgreSQL 16.11 + pgvector 0.8.1 fully operational on CT 113
  • Vector similarity search returning accurate scores (0.5765 for related concepts)
  • Resolved 2 critical bugs:
    1. psycopg2 parameter handling for pgvector types (must cast in SQL, not Python)
    2. ORDER BY with vector operations (subquery pattern required)
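
The two fixes above can be illustrated with a minimal sketch. It assumes the rag_embeddings columns listed under Infrastructure and psycopg2-style %s placeholders. The helper names (to_vector_literal, build_search_sql) are hypothetical and are not taken from vector_search.py. The vector is serialized to a string in Python and cast with ::vector(768) in SQL, and the similarity is computed once in a subquery so the outer ORDER BY sorts on a plain column.

```python
from typing import Sequence

EMBEDDING_DIM = 768  # nomic-embed-text output size, per this document


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Serialize an embedding as a pgvector text literal, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def build_search_sql(limit: int = 5) -> str:
    """Return a parameterized query; the caller passes (vector_literal,).

    Hypothetical helper: the inner query computes similarity once, and the
    outer query orders by that column instead of repeating the vector operation.
    """
    return (
        "SELECT file_path, chunk_text, similarity FROM ("
        "  SELECT file_path, chunk_text,"
        f"         1 - (embedding <=> %s::vector({EMBEDDING_DIM})) AS similarity"
        "  FROM rag_embeddings"
        ") AS scored "
        "ORDER BY similarity DESC "
        f"LIMIT {int(limit)}"
    )


# Intended use with an open psycopg2 cursor (not executed here):
#   cursor.execute(build_search_sql(3), (to_vector_literal(query_embedding),))
```

Keeping the cast in SQL means the driver only ever sees an ordinary string parameter, which is the behavior the first fix relies on.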

Validation Results:

  • Query: "How do I create snapshots of virtual machines?"
  • Result: 0.5765 similarity to backup documentation
  • Interpretation: Correctly identifies semantic relationship between "snapshots" and "backups"

Infrastructure:

  • Database: n8n_db on CT 113
  • Table: rag_embeddings (id, source_repo, file_path, chunk_text, embedding vector(768), metadata jsonb)
  • Embedding API: Ollama at 192.168.1.81:11434 (nomic-embed-text, 768 dimensions)
  • Storage overhead: ~3KB per vector, ~5KB per document total
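
The ~3KB per vector figure matches a back-of-the-envelope estimate. The sketch below assumes pgvector's documented layout for single-precision vectors (4 bytes per dimension plus 8 bytes of per-value overhead); the function name is illustrative only.

```python
def vector_storage_bytes(dimensions: int) -> int:
    """Approximate size of one pgvector `vector` value.

    Assumption: 4 bytes per single-precision dimension plus 8 bytes of
    per-value overhead. Row headers, text columns, and indexes are excluded.
    """
    return 4 * dimensions + 8


size = vector_storage_bytes(768)
print(f"{size} bytes (~{size / 1024:.1f} KiB) per 768-dimension embedding")
```

The remaining ~2KB of the per-document total would then come from the chunk text, metadata, and row overhead.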

Status: Phase 3 Complete | Phase 4 Ready to Start
Next Steps: Build n8n ingestion workflow to load homelab documentation from Gitea


2025-12-07: Infrastructure Documentation & Monitoring Stack

Additions

  1. VM 101 (monitoring-docker): New dedicated monitoring infrastructure

    • Grafana for visualization
    • Prometheus for metrics collection
    • PVE Exporter for Proxmox integration
    • IP: 192.168.2.114
  2. CT 112 (twingate-connector): Zero-trust network security

    • Lightweight connector
    • Secure remote access without VPN
  3. CT 113 (n8n): Workflow automation platform

    • PostgreSQL 16.11 backend (upgraded from 15+)
    • pgvector 0.8.1 extension for vector search
    • IP: 192.168.2.113
    • Resolved database locale issues

Modifications

  • Storage utilization updated across all pools
  • PBS-Backups now at 27.43% (increased retention)
  • Vault optimized to 10.88% (reduced usage)

Removals

  • VM 101 (gitlab): Decommissioned (previously at this ID)
  • CT 112 (Anytype): Replaced by n8n for better integration

Documentation Updates

  • Created comprehensive monitoring stack documentation
  • Updated all infrastructure tables with current VMs/CTs
  • Added architecture patterns for observability and zero-trust
  • Updated storage statistics
  • Referenced latest export: disaster-recovery/homelab-export-20251207-120040

Repository Structure

homelab/
    n8n/                             # RAG Vector Search Implementation (NEW)
        vector_search.py            # Production module for vector operations
        SESSION_HANDOFF_PHASE4_READY.md  # Learning guide for next session
        PHASE3_COMPLETE.md          # Phase 3 debugging and achievements summary
        fix_embedding_storage.py    # Diagnostic script (embedding repair)
        test_direct_sql.py          # Diagnostic script (query testing)
        test_vector_search_working.py  # Validated working implementation
        test_parameter_binding.py   # Diagnostic script (psycopg2 debugging)
        test_pgvector_direct.sql    # Raw SQL tests for pgvector
        VECTOR_SEARCH_DEBUG.md      # Technical debugging documentation
        VECTOR_SEARCH_COMPARISON.md # Before/after code comparison
        README_VECTOR_SEARCH.md     # Comprehensive setup guide
    monitoring/                      # Monitoring stack configurations
        README.md                   # Comprehensive monitoring documentation
        grafana/
            docker-compose.yml
        prometheus/
            docker-compose.yml
            prometheus.yml
        pve-exporter/
            docker-compose.yml
            pve.yml
            .env
    services/                        # Docker Compose service configurations
        n8n/                        # n8n workflow automation
        netbox/                     # Network documentation & IPAM
        openclaw/                   # OpenClaw AI chatbot gateway (VM 120)
        tinyauth/                   # SSO authentication layer
        README.md                   # Services overview (updated)
    disaster-recovery/
        homelab-export-20251207-120040/  # Latest infrastructure export
    scripts/
        crawlers-exporters/         # Infrastructure collection scripts
        fixers/                     # Problem-solving scripts
        qol/                        # Quality of life improvements
        security/                   # Security audit and remediation scripts (NEW)
            verify-service-status.sh
            backup-before-remediation.sh
            rotate-*.sh             # Credential rotation scripts
            QUICK_REFERENCE.md      # Security operations guide
    troubleshooting/
        SECURITY_AUDIT_2025-12-20.md  # Comprehensive security assessment
        loki-stack-bugfix.md        # Loki logging troubleshooting
    CLAUDE.md                        # AI assistant guidance (updated)
    SECURITY.md                      # Security policy and best practices (NEW)
    INDEX.md                         # Navigation index (updated)
    README.md                        # Repository overview (updated)
    CLAUDE_STATUS.md                # This file - current infrastructure status

Security Status

Latest Audit: 2025-12-20
Total Findings: 31 (6 CRITICAL, 3 HIGH, 2 MEDIUM, 20 LOW)
Remediation Status: Planning Phase - Documentation Complete

Critical and High Vulnerabilities:

  • Docker socket exposure (3 containers)
  • Proxmox credentials in plaintext
  • Database passwords in git repository
  • Missing SSL/TLS for internal services
  • Weak/default passwords across services
  • Containers running as root

Documentation:

  • Security Policy: /home/jramos/homelab/SECURITY.md
  • Audit Report: /home/jramos/homelab/troubleshooting/SECURITY_AUDIT_2025-12-20.md
  • Security Checklist: /home/jramos/homelab/templates/SECURITY_CHECKLIST.md
  • Script Validation: /home/jramos/homelab/scripts/security/VALIDATION_REPORT.md

Current Initiative: n8n RAG Workflow for Homelab Documentation - Q4 2025

Goal

Build an interactive n8n workflow that implements Retrieval-Augmented Generation (RAG) to query homelab documentation stored in Gitea using local AI (Ollama). This is a learning-focused project to understand RAG architecture, embeddings, vector storage, and LLM integration.

Phase

Phase 3 Complete - Vector Storage Operational | Moving to Phase 4 - n8n Workflow Development

Infrastructure Components

  • AI Backend: Ollama running on Windows 11 PC (192.168.1.81)
    • Hardware: AMD 7900 GRE GPU, i7-12700KF, 32GB RAM @ 4000MHz, 2TB NVMe
    • Installation: Native Windows application (not Docker)
    • Open-WebUI: Running in Docker Desktop on same machine (port 3000)
  • Orchestrator: n8n workflow automation (CT 113, 192.168.2.113)
  • Data Source: Gitea repositories (192.168.2.102:3060)
    • Repositories: homelab, truenas
  • Vector Storage: PostgreSQL 16.11 + pgvector 0.8.1 (operational on CT 113)

Progress Checklist

Phase 1: Network & Connectivity Setup

  • Verify Gitea API accessibility (working: http://192.168.2.102:3060/api/v1)
  • Verify n8n instance running (CT 113, 192.168.2.113)
  • Configure Ollama network binding (set OLLAMA_HOST=0.0.0.0 via environment variables)
  • Verify Ollama API accessible from homelab (curl http://192.168.1.81:11434/api/tags)
  • Identify available Ollama models (LLMs: deepseek-r1:8.2B, gpt-oss:20.9B, llama3.2:3.2B, phi3:3.8B)
  • Pull embedding model (nomic-embed-text - 768 dimensions, 274MB)

Phase 2: Understanding Embeddings (Learning Phase)

  • Pull sample document from Gitea API
  • Send text to Ollama for embedding generation
  • Examine vector output (768-dimensional vectors for each text)
  • Understand semantic similarity concept (cosine similarity demo: 0.5764 for related topics)
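
The embedding step in this phase could look roughly like the sketch below. It assumes Ollama's /api/embeddings endpoint, which takes a JSON body with "model" and "prompt" and returns a JSON object with an "embedding" list. The function names are hypothetical; the host and model come from this document.

```python
import json
import urllib.request

OLLAMA_URL = "http://192.168.1.81:11434"  # Ollama host from this document
EMBED_MODEL = "nomic-embed-text"


def build_embedding_request(text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for one embedding."""
    payload = json.dumps({"model": EMBED_MODEL, "prompt": text}).encode("utf-8")
    return urllib.request.Request(
        f"{OLLAMA_URL}/api/embeddings",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_embedding(text: str, timeout: float = 30.0) -> list[float]:
    """Send the request and return the embedding vector from the response."""
    request = build_embedding_request(text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["embedding"]
```

Calling fetch_embedding("How do I back up a VM?") from a host that can reach the Ollama machine should return a list of 768 floats for nomic-embed-text.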

Phase 3: Vector Storage Implementation - COMPLETE

  • Evaluate PostgreSQL + pgvector (uses existing n8n database)
  • Evaluate Qdrant (lightweight Docker deployment)
  • Choose storage backend based on learning goals (PostgreSQL + pgvector selected)
  • Install pgvector extension on CT 113 (PostgreSQL 16.11, pgvector 0.8.1)
  • Create rag_embeddings table with vector(768) column
  • Debug and fix vector insertion (corrected string→vector conversion)
  • Debug and fix ORDER BY issue (subquery approach working)
  • Verify cosine similarity search (working: 0.5765 similarity for related concepts)
  • Create production-ready vector_search.py module with insert/search/stats functions

Phase 4: Build Ingestion Workflow (n8n) - READY TO START

  • Deploy vector_search.py production module to CT 113
  • Test manual document insertion via CLI
  • Implement text chunking strategy (500 char chunks, 100 char overlap)
  • Create minimal n8n workflow: Manual Trigger → Gitea API → Chunk → Ollama → PostgreSQL
  • Test workflow with single README.md file from homelab repo
  • Scale to process all .md files in homelab repository
  • Add error handling and deduplication logic
  • Schedule automated daily ingestion runs
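
The chunking and deduplication items above might be sketched as follows. This is a minimal character-based splitter using the 500/100 sizes from the checklist. Both function names and the SHA-256 key scheme are assumptions for illustration, not the planned implementation.

```python
import hashlib


def chunk_text(text: str, chunk_size: int = 500, overlap: int = 100) -> list[str]:
    """Split text into fixed-size character chunks with a sliding overlap."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this chunk already reaches the end of the text
    return chunks


def chunk_key(file_path: str, chunk: str) -> str:
    """Stable deduplication key: hash of the file path plus chunk contents."""
    return hashlib.sha256(f"{file_path}\n{chunk}".encode("utf-8")).hexdigest()
```

With these defaults each chunk starts 400 characters after the previous one, so the last 100 characters of one chunk repeat at the start of the next. An ingestion run could skip any chunk whose key is already stored.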

Phase 5: Build Query Workflow (n8n) - NOT STARTED

  • Create workflow: Webhook → User question
  • Generate embedding for user query
  • Implement vector similarity search (threshold >0.5)
  • Retrieve top 3-5 relevant chunks
  • Construct prompt with retrieved context
  • Call Ollama LLM for answer generation (llama3.2 or deepseek-r1)
  • Return formatted response with source references
  • Add webhook endpoint for external integrations
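
The retrieval and prompt-construction steps above can be sketched in plain Python before they are wired into n8n. The Hit type, function names, and prompt wording are all hypothetical; only the >0.5 threshold and the top 3-5 cap come from the checklist.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One search result row (hypothetical shape)."""
    file_path: str
    chunk_text: str
    similarity: float


def select_context(hits: list[Hit], threshold: float = 0.5, top_k: int = 5) -> list[Hit]:
    """Keep hits above the threshold, best first, capped at top_k."""
    relevant = [h for h in hits if h.similarity > threshold]
    return sorted(relevant, key=lambda h: h.similarity, reverse=True)[:top_k]


def build_prompt(question: str, context: list[Hit]) -> str:
    """Assemble an LLM prompt that quotes each chunk with its source path."""
    sources = "\n\n".join(f"[{h.file_path}]\n{h.chunk_text}" for h in context)
    return (
        "Answer the question using only the context below. "
        "Cite the source paths you used.\n\n"
        f"Context:\n{sources}\n\n"
        f"Question: {question}\n"
    )
```

Tagging each chunk with its file path in the prompt is one simple way to let the model return source references alongside the answer.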

Context

RAG Architecture Overview:

  1. Ingestion Pipeline: Gitea API → Text Chunking → Ollama Embeddings → Vector Database
  2. Query Pipeline: User Question → Embedding → Vector Search → Context Retrieval → LLM Generation → Answer

Phase 3 Achievements (2025-12-25):

  • PostgreSQL + pgvector fully operational on CT 113
  • Vector search working with 0.5765 similarity for related concepts
  • Production-ready Python module (vector_search.py) with insert/search/stats functions
  • Debugged and resolved 2 critical issues:
    1. Embedding storage: Fixed psycopg2 parameter handling (must cast to ::vector(768) in SQL, not Python)
    2. ORDER BY bug: The subquery approach works while the CTE approach fails (sort with ORDER BY similarity DESC on the computed column instead of repeating the vector operation)

Key Learnings:

  • Embeddings convert text to 768-dimensional vectors representing semantic meaning
  • Vector databases enable semantic search (meaning-based, not keyword-based)
  • pgvector's cosine distance operator (<=>) returns a distance, not a similarity: 0 = identical direction, 2 = opposite; cosine similarity = 1 - distance
  • Similarity scores: >0.7=highly relevant, 0.5-0.7=related, 0.3-0.5=somewhat related, <0.3=unrelated
  • psycopg2 doesn't natively support pgvector - must format vectors as strings and cast in SQL
  • Reusing vector parameters in ORDER BY causes silent failures - use subqueries instead
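
The relationship between cosine distance and the similarity scores quoted in this document can be checked with a small pure-Python sketch. The function names are illustrative, and the exact handling of band boundaries (for example, whether 0.5 counts as "related") is an assumption.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot / (norm_a * norm_b)


def cosine_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Same quantity as pgvector's <=> operator: 1 - cosine similarity."""
    return 1.0 - cosine_similarity(a, b)


def relevance_band(similarity: float) -> str:
    """Map a similarity score to the bands listed above (boundaries assumed)."""
    if similarity > 0.7:
        return "highly relevant"
    if similarity >= 0.5:
        return "related"
    if similarity >= 0.3:
        return "somewhat related"
    return "unrelated"
```

Under this mapping the 0.5765 score from the Phase 3 validation falls in the "related" band.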

Technical Stack Validated:

  • Ollama API (192.168.1.81:11434): accessible across subnets
  • nomic-embed-text model: 768 dimensions, fast generation
  • PostgreSQL 16.11 + pgvector 0.8.1: operators working correctly
  • Python psycopg2: works with workarounds for vector handling

Success Metrics - Phase 3:

  • Successfully query "how to backup VM" and retrieve relevant homelab documentation (0.5765 similarity)
  • Understand each component of the vector storage pipeline
  • Create reusable Python module for n8n integration

Next Steps - Phase 4:

  • Deploy vector_search.py to CT 113 and test CLI interface
  • Create text chunking function (500 char chunks, 100 char overlap)
  • Build minimal n8n workflow: Manual Trigger → Gitea API → Chunk → Ollama → PostgreSQL
  • Scale to process all .md files in homelab repository
  • Add error handling and deduplication logic

Session Handoff Document: /home/jramos/homelab/n8n/SESSION_HANDOFF_PHASE4_READY.md
Learning Resources: Step-by-step lessons with examples, mental models, troubleshooting guide


Previous Initiative: Security Audit Remediation - Q4 2025

Goal

Remediate 31 security findings identified in comprehensive security audit (2025-12-20), addressing critical vulnerabilities in Docker socket exposure, credential management, and SSL/TLS configuration.

Phase

Planning - Documentation Complete, Remediation Pending

Progress Checklist

Phase 1: Immediate Actions (Week 1) - Est. 30 min downtime

  • Complete security audit (31 findings documented)
  • Create remediation scripts (8 scripts validated)
  • Document security baseline in SECURITY.md
  • Backup all service configurations (backup-before-remediation.sh)
  • Migrate secrets to .env files (ByteStash, Paperless-ngx, Speedtest Tracker)

Phase 2: Low-Risk Changes (Weeks 2-3) - Est. 2-4 hours downtime

  • Deploy docker-socket-proxy
  • Rotate Proxmox API credentials (rotate-pve-credentials.sh)
  • Rotate database passwords (rotate-paperless-password.sh)
  • Rotate JWT secrets (rotate-bytestash-jwt.sh)

Phase 3: High-Risk Changes (Month 2) - Est. 4-8 hours downtime

  • Migrate Portainer to socket proxy
  • Migrate NPM to socket proxy or remove socket access
  • Remove socket mounts from Speedtest Tracker
  • Implement SSL/TLS for internal services
  • Enable container user namespacing

Phase 4: Infrastructure Improvements (Quarter 1) - Est. 8-16 hours

  • Implement network segmentation (VLANs for service tiers)
  • Deploy fail2ban for rate limiting
  • Enable backup encryption (PBS configuration)
  • Container vulnerability scanning pipeline
  • Automated credential rotation system

Context

Security audit revealed critical infrastructure vulnerabilities requiring systematic remediation. Priority on CRITICAL findings (CVSS 8.5-9.8) to reduce attack surface and prevent credential compromise.

Risk Management:

  • Phase 1: Zero downtime (configuration changes only)
  • Phase 2: Minimal downtime (credential rotation, proxy deployment)
  • Phase 3: Moderate downtime (service reconfiguration)
  • Phase 4: Planned maintenance windows (infrastructure changes)

Success Metrics:

  • All CRITICAL findings remediated (6/6)
  • All HIGH findings remediated (3/3)
  • Secrets removed from git repository
  • Docker socket access eliminated or proxied
  • SSL/TLS enabled for all external services

Previous Initiative: Claude Code Tool Inheritance Bug Investigation (2025-12-18)

Goal

Investigate and document a critical bug in Claude Code CLI where sub-agents with explicit tools: declarations receive only a subset of their configured tools, with first and last array elements consistently dropped.

Phase

COMPLETED - Bug confirmed, comprehensive report generated for Anthropic

Progress Checklist

  • Reproduce bug with scribe agent (confirmed: missing Read and Write)
  • Reproduce bug with lab-operator agent (confirmed: missing Bash and Write)
  • Test backend-builder agent (working correctly - exception to pattern)
  • Test librarian agent (working correctly - no tools: declaration)
  • Identify pattern: First and last tools dropped for agents with explicit tools: arrays
  • Document impact: Scribe cannot create docs, lab-operator cannot execute commands
  • Generate comprehensive bug report for Anthropic with all evidence
  • Update CLAUDE_STATUS.md with investigation status
  • Submit bug report to Anthropic via GitHub issues

Key Findings

  • Bug Pattern: Sub-agents with tools: [A, B, C, D, E] receive only [B, C, D] at runtime
  • Affected: scribe (no Read/Write), lab-operator (no Bash/Write)
  • Unaffected: backend-builder (exception), librarian (no tools: line)
  • Workaround: Remove tools: declarations to grant all tools by default
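
The reported pattern can be checked mechanically by comparing an agent's declared tool list with the tools it actually reports at runtime. The sketch below uses the placeholder names A through E from the pattern description, and both helper functions are hypothetical; it does not model Claude Code internals:

```python
def missing_tools(declared, received):
    """Return declared tools that are absent at runtime, in declaration order."""
    received_set = set(received)
    return [tool for tool in declared if tool not in received_set]


def matches_first_last_pattern(declared, received):
    """True when exactly the first and last declared tools were dropped."""
    declared = list(declared)
    return len(declared) >= 2 and list(received) == declared[1:-1]


declared = ["A", "B", "C", "D", "E"]
received = ["B", "C", "D"]
print(missing_tools(declared, received))
print(matches_first_last_pattern(declared, received))
```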

Artifacts:

  • Bug report: /home/jramos/homelab/troubleshooting/ANTHROPIC_BUG_REPORT_TOOL_INHERITANCE.md
  • Original report: /home/jramos/homelab/troubleshooting/BUG_REPORT.md
  • Test agent IDs: scribe=a32bd54, lab-operator=ad681e8, backend-builder=aba15f6, librarian=a4cfeb7

Context

Critical workflow disruption: Documentation and infrastructure operations workflows completely broken due to missing tools. This is a Claude Code CLI internal bug, not a user configuration issue.


Previous Initiative: Sub-Agent Architecture Optimization (2025-12-07)

Goal

Improve the quality and effectiveness of all sub-agent prompt definitions to match best practices identified through comprehensive Opus-powered prompt engineering analysis. Target: bring all sub-agents to the quality standard established by librarian.md (~120-340 lines with comprehensive examples, safety protocols, and decision frameworks).

Phase

COMPLETED - All sub-agent improvements and validations finished

Progress Checklist

  • Prompt engineering analysis completed (Opus model)
    • Analyzed CLAUDE.md and all 4 sub-agent files
    • Identified 5 critical issues, 12 high-impact improvements
    • Generated comprehensive improvement recommendations
  • scribe.md improved (29 → 340 lines)
    • Added 6 usage examples (4 positive, 2 negative redirects)
    • Implemented comprehensive responsibilities section
    • Added 3 complete ASCII diagram templates
    • Included safety protocols and decision frameworks
    • Quality now matches librarian.md standard
  • backend-builder.md improved (40 → 291 lines)
    • Added 6 usage examples with clear boundaries
    • Expanded core responsibilities with Ansible, Terraform, Docker Compose, Python, Shell
    • Added technology stack table and validation rules table
    • Included safety protocols for secrets and destructive operations
    • Added handoff protocol for lab-operator deployment
    • Defined clear boundaries (CREATES code, does NOT deploy)
  • lab-operator.md improved (37 → 193 lines)
    • Added 6 usage examples with role clarity
    • Expanded domain expertise with specific commands
    • Added command style guide (5-step pattern)
    • Included safety protocols and decision-making framework
    • Added error handling and escalation guidelines
    • Defined clear boundaries (DEPLOYS/OPERATES, does NOT create IaC)
  • CLAUDE.md structural fixes
    • Moved YAML frontmatter to line 1 (was at line 89)
    • Fixed trailing pipe character on line 87
    • Completed incomplete sentence about backup strategy
    • Completed incomplete sentence about storage growth
    • Removed redundant "Key Services" reference
    • Expanded status file template with actual structure and recovery instructions
  • Final validation and testing
    • librarian: Git status check successful, clear output format
    • scribe: File reading functional (note: reported encoding issue, likely false positive)
    • backend-builder: YAML validation successful, proper syntax checking
    • lab-operator: Directory listing successful, proper command execution
    • All agents demonstrate improved structure and clarity

Context

Why It Matters: Well-designed sub-agent prompts improve task routing accuracy, execution quality, error reduction, and maintainability. The librarian.md agent (143 lines) sets the quality standard; scribe was severely underdeveloped at 29 lines before improvement.

Next Steps: Improve backend-builder.md and lab-operator.md using scribe.md as quality template.


Previous Phase: Infrastructure Documentation Complete

Goal

Comprehensive documentation of monitoring stack and updated infrastructure inventory.

Phase

Documentation & Maintenance

Completed Tasks

  • Created /home/jramos/homelab/monitoring/README.md with comprehensive monitoring documentation
  • Updated CLAUDE_STATUS.md with current infrastructure state
  • Documented 8 VMs, 2 Templates, and 4 LXC containers
  • Updated storage statistics (PBS 27.43%, Vault 10.88%, local 15.13%)
  • Added monitoring stack architecture and deployment procedures
  • Documented new services: monitoring-docker, twingate-connector, n8n
  • Referenced latest export: disaster-recovery/homelab-export-20251207-120040

Remaining Documentation Tasks

  • Update INDEX.md with monitoring section and current VM/CT counts
  • Update README.md with infrastructure (8 VMs, 2 Templates, 4 LXC)
  • Update CLAUDE.md with architecture tables for monitoring and zero-trust
  • Update services/README.md with monitoring stack and twingate sections
  • Verify all documentation cross-references are accurate
  • Test monitoring stack deployment procedures

Access Information

Management Interfaces

Key Network Segments

  • Management Network: 192.168.2.0/24
  • Proxmox Host: 192.168.2.200
  • Reverse Proxy: 192.168.2.101 (CT 102)
  • TinyAuth: 192.168.2.10 (CT 115)
  • n8n: 192.168.2.113 (CT 113)
  • Monitoring: 192.168.2.114 (VM 101)
  • OpenClaw: 192.168.2.120 (VM 120)
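
A small sanity check for the address list above: every listed host should fall inside the 192.168.2.0/24 management network. The sketch uses only Python's standard `ipaddress` module; the `HOSTS` dictionary simply restates the addresses from the list, and the function name is an assumption:

```python
import ipaddress

MANAGEMENT_NET = ipaddress.ip_network("192.168.2.0/24")

# Addresses copied from the "Key Network Segments" list.
HOSTS = {
    "Proxmox Host": "192.168.2.200",
    "Reverse Proxy": "192.168.2.101",
    "TinyAuth": "192.168.2.10",
    "n8n": "192.168.2.113",
    "Monitoring": "192.168.2.114",
    "OpenClaw": "192.168.2.120",
}


def outside_network(hosts, network):
    """Return the names of hosts whose address is not inside the network."""
    return sorted(
        name
        for name, address in hosts.items()
        if ipaddress.ip_address(address) not in network
    )


print(outside_network(HOSTS, MANAGEMENT_NET))
```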

Maintenance Schedule

Automated Tasks

  • Backups: Proxmox Backup Server - Daily incremental, Weekly full
  • Monitoring Scrapes: Prometheus - Every 30 seconds
  • Certificate Renewal: Nginx Proxy Manager - Automatic via Let's Encrypt

Manual Tasks

  • Weekly: Review Grafana dashboards for anomalies
  • Monthly: Update monitoring stack Docker images
  • Quarterly: Review backup retention policies
  • Semi-Annual: Kernel updates on Proxmox host and VMs

Known Issues & Resolutions

Resolved

  • n8n PostgreSQL locale errors (fixed with fix_n8n_db_c_locale.sh)
  • n8n database permissions (fixed with fix_n8n_db_permissions.sh)

Active Security Vulnerabilities (2025-12-20 Audit)

CRITICAL Severity:

  1. Docker Socket Exposure (CVSS 9.8)

    • Affected: Portainer, Nginx Proxy Manager, Speedtest Tracker
    • Impact: Container escape to root access
    • Remediation: Deploy docker-socket-proxy (Phase 2)
  2. Proxmox Credentials in Plaintext (CVSS 9.1)

    • Affected: PVE Exporter .env and pve.yml
    • Impact: Full infrastructure compromise
    • Remediation: Rotate credentials, use API tokens (Phase 2)
  3. Database Passwords in Git (CVSS 8.5)

    • Affected: Paperless-ngx, ByteStash, Speedtest Tracker
    • Impact: Credential exposure to all repository users
    • Remediation: Migrate to .env files, scrub git history (Phase 1)

HIGH Severity:

  4. Missing SSL/TLS (CVSS 7.5)

    • Affected: Internal service communication
    • Impact: Traffic interception, credential sniffing
    • Remediation: Enable HTTPS via NPM or self-signed certs (Phase 3)
  5. Weak/Default Passwords (CVSS 7.2)

    • Affected: Multiple services
    • Impact: Brute-force attacks, unauthorized access
    • Remediation: Generate strong passwords, implement rotation (Phase 2)
  6. Containers Running as Root (CVSS 7.0)

    • Affected: Most Docker containers
    • Impact: Privilege escalation if container compromised
    • Remediation: Enable user namespacing, set non-root users (Phase 3)

Remediation Timeline: See "Security Audit Remediation - Q4 2025" initiative above

Active Monitoring

  • PVE Exporter SSL verification (set to false for self-signed certificates) - SECURITY RISK
  • Prometheus retention policies (currently 15 days, may need adjustment)
  • Security script container names need verification (3/8 scripts)
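
When weighing the retention adjustment noted above, it helps to know how many samples each time series accumulates. This is a back-of-the-envelope sketch using the 15-day retention and the 30-second scrape interval stated in this document; the function name is hypothetical, and actual disk use also depends on the number of series and on compression:

```python
def samples_per_series(retention_days, scrape_interval_seconds):
    """Number of samples one series accumulates over the retention window."""
    if retention_days <= 0 or scrape_interval_seconds <= 0:
        raise ValueError("both arguments must be positive")
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds // scrape_interval_seconds


# 15 days at one scrape every 30 seconds:
print(samples_per_series(15, 30))  # 43200
```

Doubling retention to 30 days doubles this figure to 86,400 samples per series, so storage grows roughly linearly with the retention window.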

Deferred

  • NetBox container offline (on-demand service)
  • Development VMs stopped (resource conservation)
  • Network segmentation implementation (Phase 4)
  • Backup encryption (Phase 4)

Version History

  • v2.1.0 (2025-12-07): Added monitoring stack, twingate connector, updated infrastructure counts
  • v2.0.0 (2025-12-02): Repository reorganization, services migration from GitLab
  • v1.0.0 (2025-11-29): Initial infrastructure documentation

Maintained by: jramos
Repository: Homelab Infrastructure Configuration
Platform: Proxmox VE 8.4.0
Infrastructure Scale: 10 VMs, 2 Templates, 5 Containers
Current Status: Operational - OpenClaw Deployment In Progress