feat: RSO observation system, child safety, Discord adapter, Telegram watchdog, email attachments

Core agent improvements:
- RSO (Relevance Scoring & Observation) system: interaction_logger, memory_scorer, signal_detector
- Memory access logging (memory_access_log table) for relevance scoring; high-signal turn detection
- Rich conversation storage for notable turns; compact_conversation truncates long user messages
- Task-type classifier (query/action/analysis/creative) for observation tagging
- Nested sub-agent visibility: deep delegations now register against the main agent's manager

Child safety (Gabriel profile):
- child_safety.py: filtering, audit logging, prompt constants for restricted sessions
- .kiro/specs/child-safety-profile: requirements, design, tasks specs
- GABRIEL_BOT_PROPOSAL.md: initial proposal doc
- Reduced context window (10 msgs) and tutor-mode identity for restricted users

Telegram adapter:
- Polling watchdog: auto-restarts updater if polling drops unexpectedly
- get_me() with exponential-backoff retry on NetworkError at startup
- Correct stop() ordering: signal watchdog before cancelling tasks

Email / Gmail:
- send_email: supports file attachments (attachments list param)
- get_email: surfaces attachment metadata in response

Scheduled tasks / weather:
- Remove OpenWeatherMap API calls from morning-weather task; use wttr.in exclusively
- New scheduled tasks and scheduler state persistence

Discord:
- adapters/discord/__init__.py scaffold
- discord-plugin: MCP plugin for Claude Code Discord integration (server.ts, skills, config)

Infrastructure:
- n8n workflow exports (garvis_webhook, content_pipeline variants)
- memory_workspace: context, homelab-repo-updates, weekly observation summaries, error logs
- UCS C240 migration plan doc
- requirements.txt: new deps
- .claude/settings.json, fix_hooks.py: hook/permission tuning
2026-04-23 07:54:01 -06:00
parent 1232490c3b
commit 916f86725d
70 changed files with 10945 additions and 187 deletions
--- a/memory_workspace/observation/summaries/memory-scores-2026-04-20.json
+++ b/memory_workspace/observation/summaries/memory-scores-2026-04-20.json
--- a/memory_workspace/observation/summaries/week-2026-14.md
+++ b/memory_workspace/observation/summaries/week-2026-14.md
@@ -0,0 +1,134 @@
+# Weekly Reflection Report — Week 14 (2026-03-30 → 2026-04-05)
+
+## Overview
+
+| Metric | Value |
+|--------|-------|
+| Total interactions | 81 |
+| Total signals | 88 |
+| Total errors | 8 |
+| Timeouts (30min limit) | 7 |
+| Avg response time | 80.0s |
+| Max response time | 659.6s (11 min) |
+| Min response time | 11.5s |
+| Slow (>60s) | 34 (41%) |
+| Positive signals | 12 (14%) |
+| Negative signals | 9 (10%) |
+| Corrections followed | 3 |
+
+## Task Breakdown
+
+| Type | Count | % |
+|------|-------|---|
+| Query | 53 | 65% |
+| Creative | 13 | 16% |
+| Analysis | 9 | 11% |
+| Action | 6 | 7% |
+
+| Complexity | Count | % |
+|------------|-------|---|
+| Complex | 36 | 44% |
+| Simple | 24 | 30% |
+| Moderate | 21 | 26% |
+
+## Top Tools Used
+
+| Tool | Calls |
+|------|-------|
+| Bash | 225 |
+| Read | 163 |
+| Glob | 68 |
+| SSH Execute | 43 |
+| Gitea Read File | 39 |
+| File System Read | 22 |
+| Grep | 22 |
+| WebSearch | 22 |
+| Gitea List Files | 18 |
+| TodoWrite | 15 |
+| Task (sub-agents) | 14 |
+| Search Vault | 13 |
+
+---
+
+## Q1: What Went Well?
+
+**Positive signal rate held at 14%** — 12 of 88 signals were explicitly positive, which tracks with Jordan's communication style (he doesn't hand out gold stars, so 14% is actually decent).
+
+**Infrastructure diagnostics were a strength.** The Apollo/Sunshine log analysis, resolution debugging, and Proxmox SSH operations all completed efficiently. SSH Execute was used 43 times without a single SSH-related error — the connection to Proxmox and monitoring VMs is rock solid.
+
+**Gitea integration performed well.** 39 file reads + 18 directory listings for code review tasks (CVE dashboard, etc.) completed without errors. The tool chain of `gitea_list_files` → `gitea_read_file` is now a reliable pattern for repo analysis.
+
+**Simple queries were fast.** Min response time of 11.5s shows that when the task is straightforward, the system responds efficiently. The 24 simple-complexity tasks likely averaged well under the 80s mean.
+
+---
+
+## Q2: What Went Wrong?
+
+**Timeouts are the headline problem.** 7 of 8 errors were 30-minute timeout kills. That's an 8.6% timeout rate across 81 interactions, which is far too high.
+
+Breakdown of error causes:
+- **4 timeouts (Apr 3–4)**: All had `WebFetch` as last tool used. WebFetch is hanging on certain URLs and never returning, burning the entire 30-minute budget.
+- **1 timeout (Apr 2)**: `delegate_task` — sub-agent spawned but didn't complete within budget.
+- **1 timeout (Apr 2)**: `run_command` — likely a long-running shell command without timeout.
+- **1 crash (Apr 4)**: Exit code 3221225786 — a Windows-specific process crash (0xC000013A = Ctrl+C termination or similar).
+
+**41% of interactions exceeded 60 seconds.** The average of 80s is dragged up by the long tail, but even so — 34 of 81 interactions taking over a minute indicates systemic sluggishness on complex tasks.
+
+**The 659s interaction** ("What's the error. This is twice you've timed out...") is ironic — Jordan was complaining about timeouts, and the response itself nearly timed out. That's a bad look.
+
+**Negative signal rate at 10%** with 3 corrections. The corrections suggest I'm sometimes heading in the wrong direction before Jordan steers me back.
+
+---
+
+## Q3: What Patterns Emerged?
+
+**Query-dominant workload (65%).** Jordan primarily uses Garvis for information retrieval and analysis — checking configs, reading logs, reviewing code. Creative tasks (16%) include documentation and report generation. Pure actions (7%) are rare.
+
+**High complexity ratio.** 44% of tasks rated complex. This aligns with the slow response times — Jordan isn't asking simple questions, he's asking for multi-file analysis and cross-system diagnostics.
+
+**Bash dominance (225 calls).** Bash is used about 1.4× as often as the next tool (Read, 163 calls), averaging roughly 2.8 calls per interaction. This makes sense given the infra-heavy workload, but it also means shell execution efficiency directly impacts overall performance.
+
+**Read-heavy pattern.** Read (163) + Glob (68) + Grep (22) = 253 file-reading operations. That's 3× the total interactions — averaging ~3 file reads per task. Code review and config analysis tasks are file-IO bound.
+
+**WebFetch is a liability.** It appears 22 times in tool usage but is the last tool in 4 of 7 timeouts. That is roughly an 18% failure rate (4 of 22 calls) when it's the primary operation.
+
+---
+
+## Q4: What Is Being Wasted?
+
+**~3.5 hours of compute burned on timeouts.** 7 timeouts × 30 minutes = 210 minutes of wall-clock time where I was running but producing nothing. That's time Jordan was waiting.
+
+**WebFetch retry loops.** The Apr 3–4 timeouts all show WebFetch as the culprit — likely the same or similar URLs being retried without a circuit breaker. Each retry burns another 30 minutes.
+
+**The 659s interaction was salvageable.** An 11-minute response that started with "What's the error" could have been broken into a quick acknowledgment + background investigation. Instead, Jordan waited 11 minutes for what was probably a diagnostic dump.
+
+**Zettelkasten daily review is stale.** The same 3 fleeting notes (from March 18 and April 2) appear every review cycle. The task runs daily but produces no new value until Jordan actually processes them. Consider: auto-skip notes older than 7 days, or batch-prompt less frequently.
+
+---
+
+## Q5: Recommendations
+
+### 1. `[config]` Add WebFetch timeout/circuit breaker
+**Data:** 4 of 7 timeouts (57%) were WebFetch hangs. WebFetch has an ~18% failure rate.
+**Action:** Implement a 30-second timeout on WebFetch calls. After 2 failed fetches in a session, switch to alternative tools (Bash curl, or skip). This alone would have prevented 4 of 7 timeouts this week.
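
A minimal sketch of the circuit-breaker behavior described in this recommendation, assuming the fetch tool can be wrapped in a plain Python callable. `FetchCircuitBreaker`, `primary`, and `fallback` are hypothetical names used for illustration and are not part of the committed code; the 30-second timeout and the 2-failure threshold are taken from the action above.

```python
from typing import Callable

# Hypothetical signature: a fetcher takes (url, timeout_s) and returns text.
Fetcher = Callable[[str, float], str]


class FetchCircuitBreaker:
    """Per-session guard: after max_failures failed fetches, skip the primary tool."""

    def __init__(self, max_failures: int = 2, timeout_s: float = 30.0) -> None:
        self.max_failures = max_failures
        self.timeout_s = timeout_s
        self.failures = 0

    @property
    def is_open(self) -> bool:
        # "Open" means the primary fetcher is no longer trusted this session.
        return self.failures >= self.max_failures

    def fetch(self, url: str, primary: Fetcher, fallback: Fetcher) -> str:
        if not self.is_open:
            try:
                return primary(url, self.timeout_s)
            except OSError:  # includes TimeoutError and connection errors
                self.failures += 1  # count the failure, then fall through
        return fallback(url, self.timeout_s)
```

Each failed primary call is answered by the fallback right away, so the caller always gets a result; once the breaker is open, the primary is not attempted again for the rest of the session.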
+
+### 2. `[prompt]` Break complex tasks into checkpoint responses
+**Data:** 34 of 81 interactions (41%) exceeded 60s. Average is 80s.
+**Action:** For any task estimated to take >60s, send an immediate acknowledgment ("On it — checking X, Y, Z") then work in stages. Jordan shouldn't stare at a spinner for 11 minutes. The 659s interaction is the poster child for this.
+
+### 3. `[tool_usage]` Prefer Bash curl over WebFetch for known-unreliable URLs
+**Data:** 4 WebFetch timeouts on Apr 3–4, all during the same type of operation.
+**Action:** For web content fetching, use `Bash` with `curl --max-time 15` as the primary approach. Fall back to WebFetch only when HTML-to-markdown processing is specifically needed.
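
One way the `curl --max-time 15` approach could be driven from Python is sketched below. The helper names `build_curl_command` and `fetch_with_curl` are hypothetical; only the `--max-time 15` setting comes from the action above, and the other curl flags are illustrative choices.

```python
import subprocess
from typing import List


def build_curl_command(url: str, max_time_s: int = 15) -> List[str]:
    """Build a curl invocation that gives up after max_time_s seconds."""
    return [
        "curl",
        "--silent",       # no progress meter
        "--show-error",   # still print errors when silent
        "--location",     # follow redirects
        "--max-time", str(max_time_s),
        url,
    ]


def fetch_with_curl(url: str, max_time_s: int = 15) -> str:
    """Run curl and return the response body.

    Raises subprocess.CalledProcessError on a non-zero exit and
    subprocess.TimeoutExpired if the outer guard trips.
    """
    result = subprocess.run(
        build_curl_command(url, max_time_s),
        capture_output=True,
        text=True,
        check=True,
        timeout=max_time_s + 5,  # outer guard in case curl itself hangs
    )
    return result.stdout
```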
+
+### 4. `[memory]` Auto-archive stale fleeting notes
+**Data:** 3 fleeting notes have persisted across 14+ daily review cycles without being processed.
+**Action:** After 7 days unprocessed, automatically move fleeting notes to an "archive/stale" tag and stop surfacing them in daily reviews. Resurface weekly instead, or prompt Jordan once with "These have been sitting for 2 weeks — bulk delete?"
+
+### 5. `[config]` Add sub-agent timeout guard
+**Data:** 1 timeout from `delegate_task` running unchecked for 30 minutes.
+**Action:** Set a 5-minute hard timeout on delegated sub-agents. If a sub-agent hasn't returned in 5 minutes, kill it and report partial results. The watchdog exists in concept but clearly didn't catch this one.
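
The "kill it and report partial results" behavior could look roughly like the following, assuming sub-agents run as asyncio coroutines. That is an assumption about the runtime, and `run_with_deadline` and the `partial` list convention are hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def run_with_deadline(
    worker: Callable[[List[str]], Awaitable[None]],
    timeout_s: float = 300.0,  # the 5-minute limit proposed above
) -> Dict[str, object]:
    """Run worker, cancelling it at the deadline and keeping partial results."""
    partial: List[str] = []  # the worker appends findings here as it goes
    try:
        await asyncio.wait_for(worker(partial), timeout=timeout_s)
        status = "completed"
    except asyncio.TimeoutError:
        status = "timed_out"  # wait_for has already cancelled the worker
    return {"status": status, "results": partial}
```

Because the worker writes into a list owned by the caller, whatever it produced before cancellation is still available to report.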
+
+---
+
+*Report generated: 2026-04-05T20:00 MST*
+*Next review: Week 15 (2026-04-12)*
--- a/memory_workspace/observation/summaries/week-2026-15.md
+++ b/memory_workspace/observation/summaries/week-2026-15.md
@@ -0,0 +1,109 @@
+# RSO Weekly Reflection — Week 15 (2026-04-06 → 2026-04-12)
+
+## Summary
+
+| Metric | Value |
+|---|---|
+| Total interactions | 72 |
+| Total signals | 74 |
+| Positive signals | 12 (16%) |
+| Negative signals | 9 (12%) |
+| Corrections followed | 5 (7%) |
+| Errors | 1 |
+| Timeouts | 1 |
+| Avg response time | 82.1s |
+| Max response time | 397.5s |
+| Slow interactions (>60s) | 29 (40%) |
+
+---
+
+## Q1: What went well?
+
+**Positive signal rate held at 16%** — 12 of 74 signals were explicitly positive, meaning roughly 1 in 6 interactions earned direct approval. Given Jordan's communication style (he tends not to praise unless something genuinely landed), this is a reasonable baseline.
+
+**Query-type tasks dominated (58%)** and completed reliably — 42 of 72 interactions were queries (weather checks, vault reviews, article analysis). These are the bread-and-butter tasks where tool chains are predictable and delivery is fast.
+
+**SSH execution was the workhorse** — 158 `ssh_execute` calls across the week, covering Twingate updates, Proxmox management, and infrastructure checks. Zero SSH-related errors logged, meaning the homelab connectivity pipeline is solid.
+
+**Tool diversity was high** — 12+ distinct tools used regularly, indicating the full MCP toolkit is being exercised rather than falling back to a narrow subset.
+
+---
+
+## Q2: What went wrong?
+
+**40% of interactions were slow (>60s)** — 29 of 72 interactions exceeded 60 seconds. This is the single biggest issue. The average duration was 82.1s, dragged up by several interactions exceeding 5 minutes.
+
+**Top offenders by duration:**
+- 397s — "Where's the plan?" — likely a complex planning/search task that spiraled
+- 380s — Clipboard/TikTok data entry scoping — creative task with ambiguous requirements
+- 318s — A bare "yes" confirmation that triggered a 5+ minute execution chain
+- 302s — Git pull/check workflow — waiting on sequential operations
+
+**1 timeout (30-minute hard limit)** on April 8 — Agent SDK killed a task after 39 messages. Last tool was `TodoWrite` with 5 different tools in play. This was likely a complex multi-step task that kept spawning sub-steps without converging.
+
+**9 negative signals + 5 corrections** — 19% of signals indicated dissatisfaction or course correction. That's nearly 1 in 5 responses needing adjustment, which is too high.
+
+---
+
+## Q3: What patterns emerged?
+
+**Task type distribution:**
+- Query: 42 (58%) — weather, vault reviews, lookups
+- Creative: 15 (21%) — article analysis, planning, content generation
+- Analysis: 10 (14%) — technical assessments, comparisons
+- Action: 5 (7%) — actual infrastructure changes (Twingate update, etc.)
+
+**Complexity split:**
+- Simple: 34 (47%)
+- Complex: 28 (39%)
+- Moderate: 10 (14%)
+
+This is a bimodal distribution — tasks are either quick lookups or deep multi-tool operations. Very few land in the middle. The "moderate" category is underrepresented, suggesting Jordan either asks simple questions or launches full projects with little in between.
+
+**Tool chain patterns:**
+- `Read → Bash → ssh_execute` — standard infrastructure management chain
+- `search_vault → read_file` — zettelkasten review pattern (repeated 3+ times this week for the same 3 fleeting notes)
+- `WebSearch → web_fetch → Read` — article analysis chain
+- `gitea_list_files → gitea_read_file` — code review/repo exploration
+
+**Recurring task:** The daily zettelkasten review ran 3 times this week, each time surfacing the same 3 unprocessed fleeting notes. The review itself works; the processing step is stalled on Jordan's decision.
+
+---
+
+## Q4: What is being wasted?
+
+**Zettelkasten review overhead** — 3 reviews this week, ~60-90s each, for the same 3 notes that haven't been actioned in 25 days. Estimated 3-4 minutes of compute time this week producing identical output. The reviews are generating recommendations Jordan isn't acting on.
+
+**Weather report redundancy** — Multiple weather checks this week using the same dual-fetch pattern (OpenWeatherMap fails on "Centennial" every time, wttr.in succeeds every time). ~30s wasted per check on the OpenWeatherMap call that will never work.
+
+**Slow "yes" confirmations** — Two interactions where a simple "yes" triggered 240-318s execution chains. These likely involve complex multi-step operations where the confirmation kicks off a long sequential pipeline. The work itself may be necessary, but the duration suggests opportunities for parallelization.
+
+**Read tool overuse** — 193 Read calls (highest of any tool). Some of this is necessary context-loading, but the volume suggests repeated reads of the same files across interactions rather than caching/remembering content from earlier in the session.
+
+---
+
+## Q5: Recommendations
+
+### 1. `config` — Remove OpenWeatherMap from weather workflow
+**Data:** OpenWeatherMap fails on "Centennial, CO" in 100% of attempts (3+ this week, consistent across all prior weeks). Every weather request wastes ~10-15s on a guaranteed failure.
+**Action:** Update weather logic to skip OpenWeatherMap entirely for Centennial and go straight to wttr.in, or use "Denver, CO" as the OpenWeatherMap fallback.
+
+### 2. `prompt` — Auto-process stale fleeting notes after 3 reviews
+**Data:** 3 zettelkasten reviews this week produced identical output for 3 notes that have been fleeting for 25+ days. 3-4 minutes of total compute wasted on repeated recommendations.
+**Action:** After the 3rd review with no action, auto-propose a batch action ("I'll merge notes 1+2 into a permanent note and archive note 3 — say 'no' to stop me"). Shift from passive recommendation to opt-out execution.
+
+### 3. `tool_usage` — Parallelize confirmation-triggered workflows
+**Data:** 2 interactions where a "yes" confirmation led to 240-318s sequential execution. 40% of all interactions exceeded 60s.
+**Action:** When a "yes" triggers multiple independent operations, use `delegate_task` or parallel tool calls instead of sequential execution. Target: reduce the 40% slow-interaction rate to <25%.
+
+### 4. `memory` — Cache repeated file reads within sessions
+**Data:** 193 Read calls — highest tool count, exceeding even Bash (186). Many are likely re-reads of the same files (MEMORY.md, SOUL.md, user profiles) across multi-turn conversations.
+**Action:** When a file has been read earlier in the same session and hasn't been modified, reference the cached content instead of re-reading. Won't help across sessions but reduces intra-session overhead.
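
A small sketch of intra-session read caching with an "unmodified" check, under the assumption that file reads can be routed through a Python helper. `SessionFileCache` is a hypothetical name, and using modification time plus size as the change signal is one possible choice, not something the report specifies.

```python
import os
from typing import Dict, Tuple


class SessionFileCache:
    """Caches file contents for one session, invalidated when the file's
    modification time or size changes."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[Tuple[int, int], str]] = {}
        self.disk_reads = 0  # counter for observing cache effectiveness

    def read(self, path: str) -> str:
        stat = os.stat(path)
        fingerprint = (stat.st_mtime_ns, stat.st_size)
        cached = self._entries.get(path)
        if cached is not None and cached[0] == fingerprint:
            return cached[1]  # unchanged since last read: serve from memory
        with open(path, "r", encoding="utf-8") as handle:
            content = handle.read()
        self.disk_reads += 1
        self._entries[path] = (fingerprint, content)
        return content
```

The cache lives only as long as the object does, which matches the note above that this reduces intra-session overhead and does nothing across sessions.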
+
+### 5. `prompt` — Reduce combined negative and correction signal rate from 19% to <10%
+**Data:** 9 negative + 5 correction signals out of 74 total (19%). Nearly 1 in 5 responses needed adjustment.
+**Action:** Review the 9 negative-signal interactions to identify common triggers. Likely causes: over-explaining when action was wanted, or misreading task scope. Specific patterns to investigate next week.
+
+---
+
+*Generated: 2026-04-12 | Next review: 2026-04-19*
--- a/memory_workspace/observation/summaries/week-2026-17.md
+++ b/memory_workspace/observation/summaries/week-2026-17.md
@@ -0,0 +1,124 @@
+# RSO Weekly Reflection — Week 17 (2026-04-14 → 2026-04-20)
+
+## Summary Statistics
+
+| Metric | Value |
+|--------|-------|
+| Total interactions | 80 |
+| Total signals | 78 |
+| Errors / Timeouts | 0 / 0 |
+| Avg duration | 55.9s |
+| Max duration | 438.8s |
+| Slow (>60s) | 16 (20%) |
+| Positive signals | 5 (6.4%) |
+| Negative signals | 5 (6.4%) |
+| Corrections followed | 3 |
+
+**Task types**: query (55), creative (11), action (8), analysis (6)
+**Complexity**: simple (53), complex (20), moderate (7)
+
+---
+
+## Q1: What Went Well?
+
+- **Zero errors and zero timeouts** — a clean week from an infrastructure stability standpoint. No tool failures, no dropped connections.
+- **Simple tasks dominated** (53 of 80 = 66%) and completed within acceptable latency for the majority.
+- **5 explicit positive signals** were received, and neutral follow-ups made up the overwhelming majority (66 of 78 = 85%), indicating Jordan generally accepted outputs without needing refinement.
+- **Tool diversity** was high — 12+ distinct tools actively used, demonstrating the MCP ecosystem is functioning end-to-end (SSH, file system, search, web fetch, Bash, delegation).
+- **Delegation via Task agent** used 20 times — appropriate offloading of complex sub-tasks to parallel agents.
+
+---
+
+## Q2: What Went Wrong?
+
+- **20% of interactions exceeded 60s** (16 of 80) — one in five requests ran slow. The worst offender was 438s (7+ minutes) for the RSO weekly reflection itself.
+- **5 negative signals and 3 corrections.** Negative signals alone are 6.4% of the 78 signals; combined with the corrections and 2 refinement requests, 10 of 78 signals (12.8%) indicated suboptimal first-response quality.
+- **Complex tasks (25%) drove disproportionate latency**: the top 10 slowest interactions averaged ~230s and were all complex/analysis tasks (repo analysis, tax research, configuration parsing).
+- **No recurring error patterns** (0 errors), but the slow-task concentration suggests architectural limits are being hit on multi-file analysis tasks.
+
+---
+
+## Q3: What Patterns Emerged?
+
+### Task Distribution
+- **Queries dominate** (69% of all interactions) — Jordan uses Garvis primarily as a lookup/research tool, not an action executor.
+- **Creative tasks** (14%) are the second most common — writing, drafting, ideation.
+- **Actions** (10%) and **analysis** (8%) are minority use cases but account for most of the slow interactions.
+
+### Tool Usage Chains
+- **Bash (75) + Read (74) + mcp__file_system__read_file (47)** — the "investigate" pattern. Nearly every interaction involves reading something.
+- **mcp__file_system__list_directory (42)** — heavy directory traversal, often preceding file reads. Suggests exploration-before-action is the dominant workflow.
+- **TodoWrite (23)** — used in ~29% of interactions, indicating multi-step tasks are common.
+- **Task delegation (20)** — healthy delegation rate for complex subtasks.
+- **search_vault (19)** — memory/zettelkasten lookups are a core pattern.
+
+### Emerging Anti-Patterns
+- The RSO reflection itself is the single slowest task (438s). It's recursive overhead.
+- Repo analysis tasks (CVE dashboard, Kira configs) consistently exceed 150s — these are the prime delegation candidates.
+
+---
+
+## Q4: What Is Being Wasted?
+
+### Slow Interactions
+- **16 interactions >60s consumed ~56 minutes** of total processing time. If halved, that's 28 minutes of latency savings per week.
+- The 438s RSO reflection and 425s input-validation analysis together consumed 14+ minutes, roughly a quarter of the ~56 minutes spent on slow interactions.
+
+### Redundant Patterns
+- **Bash (75) + mcp__file_system__run_command (22)** — two tools serving overlapping purposes. 22 uses of `run_command` could potentially be consolidated with Bash.
+- **Read (74) + mcp__file_system__read_file (47)** — 121 combined file reads. Some of these may be re-reads of the same files within a session.
+
+### Memory Waste
+- **73 of 75 memory files scored as stale** — 97% of indexed memory is not being actively referenced.
+- **2 archive candidates** with scores below -10 (ages 56–61 days): daily logs from February containing IP addresses, credentials, and status references that are now outdated.
+- The memory workspace has accumulated operational debt — most daily memory entries become noise after ~30 days.
+
+### Scheduled Tasks
+- The "daily API usage and cost report" appears repeatedly in memory context, but there is no evidence of it producing actionable output this week.
+
+---
+
+## Q5: Recommendations
+
+### 1. `tool_usage` — Consolidate file-read tools
+**Evidence**: 74 `Read` + 47 `mcp__file_system__read_file` = 121 file reads across 80 interactions. Standardize on one tool per context to reduce overhead.
+**Action**: Default to Claude Code `Read` for local files; reserve `mcp__file_system__read_file` for MCP-only contexts (sub-agents, delegated tasks).
+
+### 2. `prompt` — Break complex analysis tasks into delegation chains
+**Evidence**: 6 of the top 10 slowest interactions (150–438s) involved multi-file repo analysis. Those at the top of that range exceed the 5-minute (300s) agent timeout risk threshold.
+**Action**: For any task involving >3 files or repo-wide analysis, immediately delegate to a sub-agent with a scoped prompt rather than running inline.
+
+### 3. `memory` — Archive stale memory files (>30 days, score < -9)
+**Evidence**: 73 of 75 files (97%) scored stale. Top 10 archive candidates average score -10.2 with ages 33–61 days. None are being referenced in current interactions.
+**Action**: Move files with score < -9 and age > 45 days to `memory_workspace/archive/`. Retain only the last 30 days of daily logs in active memory. This would archive ~10 files immediately.
+
+### 4. `config` — Optimize the RSO reflection pipeline itself
+**Evidence**: The weekly reflection is the single slowest task at 438s (7.3 min). It's recursive: the observation system's most expensive operation is observing itself.
+**Action**: Pre-compute stats via a lightweight scheduled script (cron/daily) that writes a summary JSON. The weekly reflection then reads pre-computed data instead of parsing raw JSONL each time.
+
+### 5. `prompt` — Improve first-response quality to reduce corrections
+**Evidence**: 3 corrections + 2 refinements + 5 negative signals = 10 of 78 signals (12.8%) indicated the first response missed the mark.
+**Action**: For complex/moderate tasks, add a brief "understanding check" before executing — restate the interpreted request in one line before proceeding. This front-loads alignment and should reduce correction rate.
+
+---
+
+## Memory Scorer Output
+
+| Metric | Value |
+|--------|-------|
+| Files scored | 75 |
+| Core memory | 0 |
+| Active memory | 0 |
+| Archive candidates | 2 |
+| Stale candidates | 73 |
+
+**Top archive candidates:**
+- `memory/2026-02-18.md` — score: -12.1, age: 61d
+- `memory/2026-02-23.md` — score: -11.6, age: 56d
+- `memory/2026-03-01.md` — score: -11.0, age: 50d
+- `memory/2026-02-22.md` — score: -10.7, age: 57d
+- `memory/2026-02-26.md` — score: -10.3, age: 53d
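
As an illustration, the archive rule from recommendation 3 (score < -9 and age > 45 days) can be applied to the five candidates listed above. `ScoredFile` and `select_for_archive` are hypothetical names; the scores and ages are copied from the list.

```python
from typing import List, NamedTuple


class ScoredFile(NamedTuple):
    path: str
    score: float
    age_days: int


def select_for_archive(
    files: List[ScoredFile], max_score: float = -9.0, min_age_days: int = 45
) -> List[str]:
    """Return paths whose score is below max_score and age is above min_age_days."""
    return [
        f.path for f in files
        if f.score < max_score and f.age_days > min_age_days
    ]


# Values taken from the "Top archive candidates" list.
candidates = [
    ScoredFile("memory/2026-02-18.md", -12.1, 61),
    ScoredFile("memory/2026-02-23.md", -11.6, 56),
    ScoredFile("memory/2026-03-01.md", -11.0, 50),
    ScoredFile("memory/2026-02-22.md", -10.7, 57),
    ScoredFile("memory/2026-02-26.md", -10.3, 53),
]

to_archive = select_for_archive(candidates)
```

All five listed files satisfy both conditions, so each would be moved under this rule.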
+
+---
+
+*Generated: 2026-04-20 | Agent: RSO Weekly Reflection | Week 17*