From 892684c46e6622009774c030b880b370a292a64c Mon Sep 17 00:00:00 2001
From: Jordan Ramos
Date: Thu, 11 Dec 2025 13:56:27 -0700
Subject: [PATCH] feat(monitoring): resolve Loki-stack syslog ingestion with rsyslog filter fix
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Fixed critical issue preventing UniFi router logs from reaching Loki/Promtail/Grafana.

Root Cause:
- rsyslog filter in /etc/rsyslog.d/unifi-router.conf filtered for 192.168.1.1
- VM 101 on VLAN 2, actual source IP is 192.168.2.1 (VLAN 2 gateway)
- Filter silently rejected all incoming syslog traffic

Solution:
- Updated rsyslog filter from 192.168.1.1 to 192.168.2.1
- Logs now flow: UniFi → rsyslog → Promtail → Loki → Grafana

Changes:
- Add services/loki-stack/* - Complete Loki/Promtail/Grafana stack configs
- Add services/logward/* - Logward service configuration
- Update troubleshooting/loki-stack-bugfix.md - Complete 5-phase resolution
- Update CLAUDE_STATUS.md - Document 2025-12-11 resolution
- Update sub-agents/scribe.md - Agent improvements
- Remove services/promtail-config.yml - Duplicate file cleanup

Status: ✅ Monitoring stack fully operational, syslog ingestion active

Technical Details: See troubleshooting/loki-stack-bugfix.md for complete analysis

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Sonnet 4.5
---
 CLAUDE_STATUS.md                         |  25 +++-
 services/logward/.env.example            |  62 ++++++++
 services/logward/docker-compose.yml      | 174 ++++++++++++++++++++++
 services/loki-stack/docker-compose.yml   |  33 +++++
 services/loki-stack/loki-config.yaml     |  35 +++++
 services/loki-stack/promtail-config.yaml |  22 +++
 sub-agents/scribe.md                     |   2 +-
 troubleshooting/loki-stack-bugfix.md     | 176 +++++++++++++++++++++++
 8 files changed, 526 insertions(+), 3 deletions(-)
 create mode 100644 services/logward/.env.example
 create mode 100644 services/logward/docker-compose.yml
 create mode 100644 services/loki-stack/docker-compose.yml
 create mode 100644 services/loki-stack/loki-config.yaml
 create mode 100644 services/loki-stack/promtail-config.yaml
 create mode 100644 troubleshooting/loki-stack-bugfix.md

diff --git a/CLAUDE_STATUS.md b/CLAUDE_STATUS.md
index d9eebe6..eec6279 100644
--- a/CLAUDE_STATUS.md
+++ b/CLAUDE_STATUS.md
@@ -196,9 +196,30 @@ Hybrid approach balancing performance and resource efficiency:
 
 ---
 
-## Recent Infrastructure Changes (2025-12-07)
+## Recent Infrastructure Changes
 
-### Additions
+### 2025-12-11: Loki-Stack Monitoring Fully Operational
+
+**Issue Resolved:** Centralized logging pipeline now receiving syslog from UniFi router
+
+**Root Cause:** rsyslog filter in `/etc/rsyslog.d/unifi-router.conf` was configured for wrong source IP (192.168.1.1 instead of 192.168.2.1)
+
+**Fix Applied:** Updated rsyslog filter to match VLAN 2 gateway IP (192.168.2.1)
+
+**Status:** ✅ Complete - Logs flowing UniFi → rsyslog → Promtail → Loki → Grafana
+
+**Services Affected:**
+- VM 101 (monitoring-docker): rsyslog configuration updated
+- Loki-stack: All components operational
+- Grafana: Dashboards receiving real-time syslog data
+
+**Technical Details:** See `troubleshooting/loki-stack-bugfix.md` for complete 5-phase troubleshooting history
+
+---
+
+### 2025-12-07: Infrastructure Documentation & Monitoring Stack
+
+#### Additions
 1. **VM 101 (monitoring-docker)**: New dedicated monitoring infrastructure
    - Grafana for visualization
    - Prometheus for metrics collection
diff --git a/services/logward/.env.example b/services/logward/.env.example
new file mode 100644
index 0000000..64f3d4c
--- /dev/null
+++ b/services/logward/.env.example
@@ -0,0 +1,62 @@
+# Database
+DATABASE_URL=postgresql://logward:password@localhost:5432/logward
+DB_NAME=logward
+DB_USER=logward
+DB_PASSWORD=Nbkx4mdmay1)
+
+# Redis
+REDIS_PASSWORD=Nbkx4mdmay1)
+REDIS_URL=redis://:Nbkx4mdmay1)@localhost:6379
+
+# API
+API_KEY_SECRET=XEZV6seqamKGb1JaCBCYGLopC9xMC9d8
+PORT=8080
+HOST=0.0.0.0
+
+# SMTP (configure for email alerts)
+SMTP_HOST=smtp.example.com
+SMTP_PORT=587
+SMTP_USER=your_email@example.com
+SMTP_PASS=your_smtp_password
+SMTP_FROM=noreply@logward.local
+
+# Rate Limiting
+RATE_LIMIT_MAX=1000
+RATE_LIMIT_WINDOW=60000
+
+# Environment
+NODE_ENV=development
+
+# Internal Logging (Self-Monitoring)
+# Enable/disable internal logging (logs LogWard's own requests/errors)
+INTERNAL_LOGGING_ENABLED=true
+
+# API key for internal logging project (auto-generated on first run if not set)
+# After first run, copy the generated key from console output and set it here
+# INTERNAL_API_KEY=lp_your_generated_api_key_here
+
+# API URL for internal logging (defaults to API_URL if not set)
+# INTERNAL_LOGGING_API_URL=http://localhost:8080
+
+# Service name (distinguishes backend from worker in logs)
+# Backend: logward-backend (default)
+# Worker: logward-worker
+SERVICE_NAME=logward-backend
+
+# Frontend (SvelteKit)
+# Public API URL for frontend to connect to backend
+PUBLIC_API_URL=http://localhost:8080
+
+# GitHub API Token (optional - for SigmaHQ integration)
+# Without token: 60 requests/hour rate limit
+# With token: 5000 requests/hour rate limit
+# Create token at: https://github.com/settings/tokens (no scopes needed for public repos)
+# GITHUB_TOKEN=ghp_your_github_personal_access_token_here
+
+# Docker Images (optional - specify custom images or versions)
+# By default, uses latest from Docker Hub
+# Available registries:
+# - Docker Hub: logward/backend:latest, logward/frontend:latest
+# - GHCR: ghcr.io/logward-dev/logward-backend:latest, ghcr.io/logward-dev/logward-frontend:latest
+# LOGWARD_BACKEND_IMAGE=logward/backend:0.2.4
+# LOGWARD_FRONTEND_IMAGE=logward/frontend:0.2.4
diff --git a/services/logward/docker-compose.yml b/services/logward/docker-compose.yml
new file mode 100644
index 0000000..698e368
--- /dev/null
+++ b/services/logward/docker-compose.yml
@@ -0,0 +1,174 @@
+version: '3.8'
+
+services:
+  postgres:
+    image: timescale/timescaledb:latest-pg16
+    container_name: logward-postgres
+    environment:
+      POSTGRES_DB: ${DB_NAME}
+      POSTGRES_USER: ${DB_USER}
+      POSTGRES_PASSWORD: ${DB_PASSWORD}
+    ports:
+      - "5432:5432"
+    volumes:
+      - postgres_data:/var/lib/postgresql/data
+    command:
+      - "postgres"
+      - "-c"
+      - "max_connections=100"
+      - "-c"
+      - "shared_buffers=256MB"
+      - "-c"
+      - "effective_cache_size=768MB"
+      - "-c"
+      - "work_mem=16MB"
+      - "-c"
+      - "maintenance_work_mem=128MB"
+      # Parallel query settings for faster aggregations
+      - "-c"
+      - "max_parallel_workers_per_gather=4"
+      - "-c"
+      - "max_parallel_workers=8"
+      - "-c"
+      - "parallel_tuple_cost=0.01"
+      - "-c"
+      - "parallel_setup_cost=100"
+      - "-c"
+      - "min_parallel_table_scan_size=8MB"
+      # Write-ahead log tuning for ingestion
+      - "-c"
+      - "wal_buffers=16MB"
+      - "-c"
+      - "checkpoint_completion_target=0.9"
+      # Logging for slow queries (>100ms)
+      - "-c"
+      - "log_min_duration_statement=100"
+    healthcheck:
+      test: ["CMD-SHELL", "pg_isready -U ${DB_USER}"]
+      interval: 10s
+      timeout: 5s
+      retries: 5
+    restart: unless-stopped
+    networks:
+      - logward-network
+
+  redis:
+    image: redis:7-alpine
+    container_name: logward-redis
+    command: redis-server --requirepass ${REDIS_PASSWORD}
+    ports:
+      - "6379:6379"
+    volumes:
+      - redis_data:/data
+    healthcheck:
+      test: ["CMD", "sh", "-c", "redis-cli -a $${REDIS_PASSWORD} ping | grep -q PONG"]
+      interval: 10s
+      timeout: 3s
+      retries: 5
+    restart: unless-stopped
+    networks:
+      - logward-network
+
+  backend:
+    image: ${LOGWARD_BACKEND_IMAGE:-logward/backend:latest}
+    container_name: logward-backend
+    ports:
+      - "8080:8080"
+    environment:
+      NODE_ENV: production
+      DATABASE_URL: postgresql://${DB_USER}:${DB_PASSWORD}@postgres:5432/${DB_NAME}
+      DATABASE_HOST: postgres
+      DB_USER: ${DB_USER}
+      REDIS_URL: redis://:${REDIS_PASSWORD}@redis:6379
+      API_KEY_SECRET: ${API_KEY_SECRET}
+      PORT: 8080
+      HOST: 0.0.0.0
+      SMTP_HOST: ${SMTP_HOST:-}
+      SMTP_PORT: ${SMTP_PORT:-587}
+      SMTP_USER: ${SMTP_USER:-}
+      SMTP_PASS: ${SMTP_PASS:-}
+      SMTP_FROM: ${SMTP_FROM:-noreply@logward.local}
+      INTERNAL_LOGGING_ENABLED: ${INTERNAL_LOGGING_ENABLED:-false}
+      INTERNAL_API_KEY: ${INTERNAL_API_KEY:-}
+      SERVICE_NAME: logward-backend
+    depends_on:
+      postgres:
+        condition: service_healthy
+      redis:
+        condition: service_healthy
+    restart: unless-stopped
+    networks:
+      - logward-network
+
+  worker:
+    image: ${LOGWARD_BACKEND_IMAGE:-logward/backend:latest}
+    container_name: logward-worker
+    command: ["worker"]
+    healthcheck:
+      disable: true
+    environment:
+      NODE_ENV: production
+      DATABASE_URL: postgresql://${DB_USER}:${DB_PASSWORD}@postgres:5432/${DB_NAME}
+      DATABASE_HOST: postgres
+      DB_USER: ${DB_USER}
+      REDIS_URL: redis://:${REDIS_PASSWORD}@redis:6379
+      API_KEY_SECRET: ${API_KEY_SECRET}
+      SMTP_HOST: ${SMTP_HOST:-}
+      SMTP_PORT: ${SMTP_PORT:-587}
+      SMTP_USER: ${SMTP_USER:-}
+      SMTP_PASS: ${SMTP_PASS:-}
+      SMTP_FROM: ${SMTP_FROM:-noreply@logward.local}
+      INTERNAL_LOGGING_ENABLED: ${INTERNAL_LOGGING_ENABLED:-false}
+      INTERNAL_API_KEY: ${INTERNAL_API_KEY:-}
+      SERVICE_NAME: logward-worker
+    depends_on:
+      backend:
+        condition: service_healthy
+      redis:
+        condition: service_healthy
+    restart: unless-stopped
+    networks:
+      - logward-network
+
+  frontend:
+    image: ${LOGWARD_FRONTEND_IMAGE:-logward/frontend:latest}
+    container_name: logward-frontend
+    ports:
+      - "3001:3001"
+    environment:
+      NODE_ENV: production
+      PUBLIC_API_URL: ${PUBLIC_API_URL:-http://localhost:8080}
+    depends_on:
+      - backend
+    restart: unless-stopped
+    networks:
+      - logward-network
+
+  fluent-bit:
+    image: fluent/fluent-bit:latest
+    container_name: logward-fluent-bit
+    volumes:
+      - ./fluent-bit.conf:/fluent-bit/etc/fluent-bit.conf:ro
+      - ./parsers.conf:/fluent-bit/etc/parsers.conf:ro
+      - ./extract_container_id.lua:/fluent-bit/etc/extract_container_id.lua:ro
+      - ./wrap_logs.lua:/fluent-bit/etc/wrap_logs.lua:ro
+      - /var/lib/docker/containers:/var/lib/docker/containers:ro
+      - /var/run/docker.sock:/var/run/docker.sock:ro
+    environment:
+      LOGWARD_API_KEY: ${FLUENT_BIT_API_KEY:-}
+      LOGWARD_API_HOST: backend
+    depends_on:
+      - backend
+    restart: unless-stopped
+    networks:
+      - logward-network
+
+volumes:
+  postgres_data:
+    driver: local
+  redis_data:
+    driver: local
+
+networks:
+  logward-network:
+    driver: bridge
diff --git a/services/loki-stack/docker-compose.yml b/services/loki-stack/docker-compose.yml
new file mode 100644
index 0000000..fbdef75
--- /dev/null
+++ b/services/loki-stack/docker-compose.yml
@@ -0,0 +1,33 @@
+
+version: '3.8'
+
+services:
+  loki:
+    image: grafana/loki:latest
+    container_name: loki
+    ports:
+      - "3100:3100"
+    volumes:
+      - /home/server-admin/loki-stack/loki-config.yaml:/etc/loki/local-config.yaml
+    command: -config.file=/etc/loki/local-config.yaml
+    networks:
+      - monitoring-net
+    restart: unless-stopped
+
+  promtail:
+    image: grafana/promtail:latest
+    container_name: promtail
+    volumes:
+      - /home/server-admin/loki-stack/promtail-config.yaml:/etc/promtail/config.yaml
+    ports:
+      - "1514:1514" # Syslog port exposed to the host
+      - "9080:9080"
+    command: -config.file=/etc/promtail/config.yaml
+    networks:
+      - monitoring-net
+    restart: unless-stopped
+
+networks:
+  monitoring-net:
+    external: true
+
diff --git a/services/loki-stack/loki-config.yaml b/services/loki-stack/loki-config.yaml
new file mode 100644
index 0000000..88d1536
--- /dev/null
+++ b/services/loki-stack/loki-config.yaml
@@ -0,0 +1,35 @@
+auth_enabled: false
+
+server:
+  http_listen_port: 3100
+  grpc_listen_port: 9096
+
+common:
+  instance_addr: 127.0.0.1
+  path_prefix: /loki
+  storage:
+    filesystem:
+      chunks_directory: /loki/chunks
+      rules_directory: /loki/rules
+  replication_factor: 1
+  ring:
+    kvstore:
+      store: inmemory
+
+schema_config:
+  configs:
+    - from: 2020-10-24
+      store: tsdb
+      object_store: filesystem
+      schema: v13
+      index:
+        prefix: index_
+        period: 24h
+
+compactor:
+  working_directory: /loki/boltdb-shipper-compactor
+  retention_enabled: true
+  delete_request_store: filesystem # <--- This fixes the error you are seeing
+
+limits_config:
+  retention_period: 336h
diff --git a/services/loki-stack/promtail-config.yaml b/services/loki-stack/promtail-config.yaml
new file mode 100644
index 0000000..4ba697a
--- /dev/null
+++ b/services/loki-stack/promtail-config.yaml
@@ -0,0 +1,22 @@
+server:
+  http_listen_port: 9080
+  grpc_listen_port: 0
+
+positions:
+  filename: /tmp/positions.yaml
+
+clients:
+  - url: http://loki:3100/loki/api/v1/push
+
+scrape_configs:
+  - job_name: syslog_ingest
+    syslog:
+      listen_address: 0.0.0.0:1514
+      listen_protocol: tcp # We only listen on TCP now
+      idle_timeout: 60s
+      label_structured_data: yes
+      labels:
+        job: "syslog_combined" # One job for both Proxmox and UniFi
+    relabel_configs:
+      - source_labels: ['__syslog_message_hostname']
+        target_label: 'host'
diff --git a/sub-agents/scribe.md b/sub-agents/scribe.md
index df88ca0..a0e902e 100644
--- a/sub-agents/scribe.md
+++ b/sub-agents/scribe.md
@@ -7,7 +7,7 @@ description: >
   documentation with current infrastructure state, and educational deep-dives on homelab technologies like reverse proxies, containerization, or monitoring stacks.
 tools: [Read, Grep, Glob, Edit, Write]
-model: sonnet
+model: haiku-4.5
 color: blue
 ---
 
diff --git a/troubleshooting/loki-stack-bugfix.md b/troubleshooting/loki-stack-bugfix.md
new file mode 100644
index 0000000..a82eca3
--- /dev/null
+++ b/troubleshooting/loki-stack-bugfix.md
@@ -0,0 +1,176 @@
+Here is a summary of the troubleshooting session to build your centralized logging stack.
+
+1. The Objective
+Create a monitoring stack on Proxmox using Loki (database) and Promtail (log collector) to ingest logs from:
+
+Proxmox Host: Via TCP (Reliable).
+
+UniFi Dream Router: Via UDP (Legacy RFC3164 format).
+
+2. The Final Architecture
+Because Promtail strictly enforces modern log standards (RFC5424) and UniFi sends "dirty" legacy logs (RFC3164), we adopted a "Translator" Architecture.
+
+UniFi Router: Sends UDP logs to the Host VM.
+
+Host Rsyslog: Catches UDP, converts it to valid TCP, and forwards it to Docker.
+
+Promtail: Receives clean TCP logs and pushes them to Loki.
+
+3. Troubleshooting Timeline
+Phase 1: Loki Instability
+The Issue: Loki kept crashing with "Schema" and "Compactor" errors.
+
+The Cause: You were using a legacy configuration file with the modern Loki v3.0 image.
+
+The Fix: Updated the Loki config to use schema: v13, tsdb, and added the required delete_request_store.
+
+Phase 2: Proxmox Log Ingestion (TCP)
+The Issue: Promtail threw "Parsing Errors" when receiving logs from Proxmox.
+
+The Cause: Proxmox defaults to an older syslog format.
+
+The Fix: Reconfigured Proxmox (/etc/rsyslog.conf) to use the template RSYSLOG_SyslogProtocol23Format (RFC5424).
+
+Phase 3: The UniFi UDP Saga (The Main Blocker)
+The Issue: Promtail rejected UniFi logs.
+
+Attempt 1: We added format: rfc3164 to the Promtail config.
+
+Result: Crash (field format not found).
+
+Attempt 2: We upgraded Promtail from v2.9 to v3.0.
+
+Result: Crash persisted.
+
+Discovery: Promtail v3.0 still does not support legacy format toggles in the syslog receiver.
+
+The Final Fix: We moved the UDP listener out of Docker and onto the Host OS (rsyslog), letting the Host handle the "dirty" UDP work and forward clean TCP to Promtail.
+
+Phase 4: The "Ghost" Configuration
+The Issue: Promtail logs showed it trying to connect to 192.168.2.25 even though your config file said http://loki:3100.
+
+The Cause: Docker was holding onto an old version of the configuration file.
+
+The Fix: Used docker-compose down followed by docker-compose up -d (instead of just restart) to force a refresh of the volume mounts.
+
+4. The "Golden State" Configuration
+These are the settings that finally worked.
+
+A. Docker Compose (docker-compose.yml)
+
+Promtail Ports: Only TCP 1514:1514 mapped (UDP removed to prevent conflicts).
+
+Volumes: Confirmed mapping ./promtail-config.yaml:/etc/promtail/config.yaml.
+
+B. Promtail Config (promtail-config.yaml)
+
+Clients: url: http://loki:3100/loki/api/v1/push (Using internal Docker DNS).
+
+Scrape Config: Single job listening on tcp.
+
+YAML
+
+syslog:
+  listen_address: 0.0.0.0:1514
+  listen_protocol: tcp
+C. Host Rsyslog (/etc/rsyslog.conf)
+
+Inputs: imudp enabled on port 1514.
+
+Forwarding: Rule added to send all UDP traffic to 127.0.0.1:1514 via TCP.
+
+---
+
+## FINAL RESOLUTION - 2025-12-11
+
+### Root Cause Identified
+**IP address mismatch in rsyslog forwarding filter**
+
+**Problem:** `/etc/rsyslog.d/unifi-router.conf` on VM 101 was filtering for the wrong source IP
+- Filter was configured for: `192.168.1.1` (incorrect)
+- Actual source IP: `192.168.2.1` (VLAN 2 gateway interface)
+
+**Explanation:** VM 101 is on VLAN 2 (192.168.2.x subnet). When the UniFi router sends syslog to 192.168.2.114, it uses its VLAN 2 interface IP (192.168.2.1) as the source address. The rsyslog filter was silently rejecting all incoming logs due to this IP mismatch.
+
+### Solution Implemented
+
+**File Modified:** `/etc/rsyslog.d/unifi-router.conf` on VM 101
+
+**Change:**
+```bash
+# Before (WRONG):
+if $fromhost-ip == '192.168.1.1' then {
+
+# After (CORRECT):
+if $fromhost-ip == '192.168.2.1' then {
+```
+
+**Complete corrected configuration:**
+```bash
+# UniFi Router - VLAN 2 interface
+if $fromhost-ip == '192.168.2.1' then {
+    action(type="omfwd" Target="127.0.0.1" Port="1514" Protocol="tcp" Template="RSYSLOG_SyslogProtocol23Format")
+    stop
+}
+```
+
+**Service restart:**
+```bash
+sudo systemctl restart rsyslog
+sudo systemctl status rsyslog
+```
+
+**Result:** ✅ Logs immediately began flowing: UniFi router → rsyslog → Promtail → Loki → Grafana
+
+### Verification Steps
+```bash
+# 1. Verify UDP listener (rsyslog)
+sudo ss -tulnp | grep 1514
+# Expected: udp UNCONN users:(("rsyslogd"))
+
+# 2. Verify TCP listener (Promtail)
+sudo ss -tulnp | grep 1514
+# Expected: tcp LISTEN users:(("docker-proxy"))
+
+# 3. Monitor Promtail ingestion
+docker logs promtail --tail 50 -f
+# Expected: "Successfully sent batch" messages
+
+# 4. Test log injection
+logger -n 127.0.0.1 -P 1514 "Test from monitoring-docker host"
+```
+
+### Troubleshooting Phases Summary
+
+This was a **5-phase troubleshooting effort**:
+
+1. **Phase 1:** Fixed Loki schema errors (v13, tsdb, delete_request_store)
+2. **Phase 2:** Fixed Proxmox log parsing (RSYSLOG_SyslogProtocol23Format)
+3. **Phase 3:** Moved UDP listener from Docker to Host rsyslog (Promtail doesn't support RFC3164)
+4. **Phase 4:** Fixed "ghost" configuration (192.168.2.25 stale config in Docker volumes)
+5. **Phase 5:** ✅ Corrected rsyslog filter IP from 192.168.1.1 to 192.168.2.1
+
+### Data Flow Diagram
+```
+UniFi Router (192.168.2.1)
+    ↓ UDP syslog port 1514
+Host rsyslog (192.168.2.114:1514 UDP)
+    ↓ TCP forward (RFC5424 format)
+Docker Promtail (127.0.0.1:1514 TCP)
+    ↓ HTTP push
+Loki (loki:3100)
+    ↓ Query
+Grafana (192.168.2.114:3000)
+```
+
+### Key Technical Details
+- **VLAN Topology:** VM 101 on VLAN 2, router uses 192.168.2.1 interface for that subnet
+- **rsyslog Template:** RSYSLOG_SyslogProtocol23Format (RFC5424) - required by Promtail
+- **Port Binding:** UDP 1514 (rsyslog) and TCP 1514 (Promtail) coexist on same port number, different protocols
+- **Stop Directive:** Prevents duplicate logging to local files after forwarding
+
+### Status
+- **Monitoring Stack:** ✅ Fully operational
+- **Log Ingestion:** ✅ Active
+- **Grafana Dashboards:** ✅ Receiving data
+- **Resolution Date:** 2025-12-11
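
Reviewer note (not part of the patch): the Phase 5 failure came from a `$fromhost-ip` filter naming an address on a different subnet from the one VM 101 listens on. The sketch below is one possible pre-flight check for that mistake. The function names `same_slash24` and `check_filter_ip` are hypothetical, and the check assumes /24 subnets, which matches the "192.168.2.x" wording in the bugfix notes but is not confirmed by the patch. It is only a heuristic: a filter IP on another subnet is a hint to look closer, not proof of a bug.

```bash
#!/usr/bin/env bash
# Hypothetical helpers, not part of this patch. Assumes /24 subnets.

# Succeeds when two dotted-quad IPv4 addresses share their first three octets.
same_slash24() {
  [ "${1%.*}" = "${2%.*}" ]
}

# $1 = address the syslog host listens on, $2 = IP used in the rsyslog filter.
# Prints a verdict; returns 0 when both share a /24 and 1 otherwise.
check_filter_ip() {
  if same_slash24 "$1" "$2"; then
    echo "ok: $2 shares a /24 with $1"
  else
    echo "suspect: $2 is outside the /24 of $1"
    return 1
  fi
}
```

With the addresses from this patch, `check_filter_ip 192.168.2.114 192.168.1.1` would flag the old filter value and `check_filter_ip 192.168.2.114 192.168.2.1` would accept the corrected one. To see the real source address instead of inferring it, a short capture on VM 101 such as `sudo tcpdump -ni any udp port 1514 -c 5` should show it, assuming tcpdump is installed there.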