SUB_AGENT_WATCHDOG_STATUS.md

# Sub-Agent Watchdog Implementation Status

## ✅ Completed

### 1. SubAgentManager Class (`sub_agent_manager.py`)
- Tracks sub-agent state (ID, task, timestamps, completion status)
- Background watchdog thread checks for hung agents every 30s
- Detects hangs: an agent with no recorded activity for 5+ minutes (300 s, matching `timeout_seconds=300`) is treated as hung
- Cleanup: marks hung agents as failed
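
The behavior listed above can be sketched as follows. This is a minimal illustration, not the contents of `sub_agent_manager.py`: the method names come from the snippets in this document, while the record fields, the `check_interval` and `clock` parameters, the `get_status()` helper, and all internals are assumptions made for the example.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class SubAgentRecord:
    agent_id: str
    task: str
    started_at: float
    last_activity: float
    status: str = "running"  # "running", "completed", or "failed"
    result: Optional[str] = None
    error: Optional[str] = None


class SubAgentManager:
    """Sketch only: internals are assumptions, not the real implementation."""

    def __init__(self, timeout_seconds: float = 300,
                 check_interval: float = 30,  # assumed parameter name
                 clock: Callable[[], float] = time.monotonic):  # assumed, eases testing
        self.timeout_seconds = timeout_seconds
        self.check_interval = check_interval
        self._clock = clock
        self._agents: Dict[str, SubAgentRecord] = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def register_sub_agent(self, agent_id: str, task: str) -> None:
        now = self._clock()
        with self._lock:
            self._agents[agent_id] = SubAgentRecord(agent_id, task, now, now)

    def update_activity(self, agent_id: str) -> None:
        with self._lock:
            rec = self._agents.get(agent_id)
            if rec is not None and rec.status == "running":
                rec.last_activity = self._clock()

    def mark_complete(self, agent_id: str, result: Optional[str] = None,
                      error: Optional[str] = None) -> None:
        with self._lock:
            rec = self._agents.get(agent_id)
            if rec is not None:
                rec.status = "failed" if error is not None else "completed"
                rec.result, rec.error = result, error

    def get_status(self, agent_id: str) -> Optional[str]:  # hypothetical helper
        with self._lock:
            rec = self._agents.get(agent_id)
            return rec.status if rec is not None else None

    def get_hung_agents(self) -> List[str]:
        now = self._clock()
        with self._lock:
            return [r.agent_id for r in self._agents.values()
                    if r.status == "running"
                    and now - r.last_activity >= self.timeout_seconds]

    def cleanup_agent(self, agent_id: str) -> None:
        # "Cleanup" here means marking the agent as failed
        self.mark_complete(agent_id, error="hung: no activity within timeout")

    def start_watchdog(self) -> None:
        if self._thread is not None and self._thread.is_alive():
            return  # already running
        self._stop.clear()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop_watchdog(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join(timeout=5)

    def _run(self) -> None:
        # Event.wait() doubles as an interruptible sleep between checks
        while not self._stop.wait(self.check_interval):
            for agent_id in self.get_hung_agents():
                self.cleanup_agent(agent_id)


# Usage with a fake clock, so a hang is simulated without waiting 5 minutes
fake_now = [0.0]
manager = SubAgentManager(timeout_seconds=300, clock=lambda: fake_now[0])
manager.register_sub_agent("agent-1", "summarize logs")
fake_now[0] = 301.0
print(manager.get_hung_agents())  # ['agent-1']
```

Injecting the clock is a design choice for this sketch: it lets hang detection be exercised in a test without sleeping for the full timeout.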

### 2. Agent Integration (`agent.py`)
- Import added: `from sub_agent_manager import SubAgentManager`
- Initialized in `Agent.__init__`:
  ```python
  self.sub_agent_manager = SubAgentManager(timeout_seconds=300)
  if not is_sub_agent:
      self.sub_agent_manager.start_watchdog()
  ```
- Registration in `spawn_sub_agent()`:
  ```python
  if agent_id and not self.is_sub_agent:
      self.sub_agent_manager.register_sub_agent(agent_id, specialist_prompt[:100])
  ```

## ⏳ Still Needed

### 3. Activity Updates During Execution
**Challenge**: The Agent SDK handles tool calls internally, so there is no obvious injection point for progress tracking.

**Options**:
1. Add activity updates in `llm_interface._agent_sdk_chat()` message loop
2. Hook into tool execution callbacks (if Agent SDK supports them)
3. Poll conversation history length as proxy for activity

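**Recommended**: Update activity inside the message receive loop (option 1):
```python
# In llm_interface.py, inside the `async for message` loop of _agent_sdk_chat():
# current_agent_id is assumed to be passed in: the running sub-agent's ID,
# or None when the main agent is chatting.
manager = getattr(agent, "sub_agent_manager", None)
if manager is not None and current_agent_id is not None:
    # Each received message counts as activity for the current sub-agent
    manager.update_activity(current_agent_id)
```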

### 4. Mark Complete After Execution
Add after sub-agent chat completes:
```python
# In agent.py, delegate() method:
try:
    result = sub_agent.chat(task, username=username)
    self.sub_agent_manager.mark_complete(agent_id, result=result)
except Exception as e:
    self.sub_agent_manager.mark_complete(agent_id, error=str(e))
    raise
```

### 5. Retry Logic for Hung Agents
Detect hung agents and restart task:
```python
# In llm_interface.py or agent.py:
hung_agents = self.sub_agent_manager.get_hung_agents()
if hung_agents:
    logger.error(f"Detected {len(hung_agents)} hung sub-agents - restarting")
    for agent_id in hung_agents:
        self.sub_agent_manager.cleanup_agent(agent_id)
    # Retry the original request
    return self.chat(original_message, ...)  # Requires saving original context
```
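
The retry above needs the original request to be saved somewhere, and it should be bounded so that a task which always hangs is not restarted forever. A minimal sketch of that bookkeeping follows. `PendingTask`, `RetryTracker`, and `max_retries` are hypothetical names introduced for illustration and do not exist in the codebase.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass
class PendingTask:
    """Original request context, kept so a hung task can be re-issued."""
    message: str
    username: Optional[str] = None
    attempts: int = 0  # number of retries already made


class RetryTracker:
    def __init__(self, max_retries: int = 2):
        self.max_retries = max_retries
        self._pending: Dict[str, PendingTask] = {}

    def remember(self, agent_id: str, task: PendingTask) -> None:
        # Call when a sub-agent is spawned (or re-spawned for a retry)
        self._pending[agent_id] = task

    def forget(self, agent_id: str) -> None:
        # Call when a sub-agent finishes normally
        self._pending.pop(agent_id, None)

    def take_retryable(self, hung_agent_ids: Iterable[str]) -> List[PendingTask]:
        """Remove hung agents from tracking; return tasks still within budget."""
        retryable = []
        for agent_id in hung_agent_ids:
            task = self._pending.pop(agent_id, None)
            if task is not None and task.attempts < self.max_retries:
                task.attempts += 1
                retryable.append(task)
        return retryable


tracker = RetryTracker(max_retries=1)
tracker.remember("agent-1", PendingTask("summarize logs", username="alice"))

first = tracker.take_retryable(["agent-1"])
print([t.message for t in first])  # ['summarize logs']

# The retried task runs under a new agent ID and carries its attempt count
tracker.remember("agent-2", first[0])
print(tracker.take_retryable(["agent-2"]))  # [] (retry budget exhausted)
```

The caller would pass the result of `get_hung_agents()` to `take_retryable()` and re-issue each returned task through the normal delegation path.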

## Next Steps

1. **Test current implementation**: Restart bot, verify watchdog starts
2. **Add activity tracking**: Integrate into message receive loop
3. **Add completion marking**: Hook into delegate() method
4. **Add retry logic**: Detect hangs and restart tasks
5. **Test with hung agent**: Create artificial hang to verify detection

## Known Issues

- Hook error blocks Write/Edit tools (workaround: use Bash for file operations)
- `CLAUDE_PLUGIN_ROOT` env var points to stale plugin hash `261ce4fba4f2`
- No current mechanism to save original task context for retry