Files
ajarbot/SUB_AGENT_WATCHDOG_STATUS.md
Jordan Ramos 916f86725d feat: RSO observation system, child safety, Discord adapter, Telegram watchdog, email attachments
Core agent improvements:
- RSO (Relevance Scoring & Observation) system: interaction_logger, memory_scorer, signal_detector
- Memory access logging (memory_access_log table) for relevance scoring; high-signal turn detection
- Rich conversation storage for notable turns; compact_conversation truncates long user messages
- Task-type classifier (query/action/analysis/creative) for observation tagging
- Nested sub-agent visibility: deep delegations now register against the main agent's manager

Child safety (Gabriel profile):
- child_safety.py: filtering, audit logging, prompt constants for restricted sessions
- .kiro/specs/child-safety-profile: requirements, design, tasks specs
- GABRIEL_BOT_PROPOSAL.md: initial proposal doc
- Reduced context window (10 msgs) and tutor-mode identity for restricted users

Telegram adapter:
- Polling watchdog: auto-restarts updater if polling drops unexpectedly
- get_me() with exponential-backoff retry on NetworkError at startup
- Correct stop() ordering: signal watchdog before cancelling tasks

Email / Gmail:
- send_email: supports file attachments (attachments list param)
- get_email: surfaces attachment metadata in response

Scheduled tasks / weather:
- Remove OpenWeatherMap API calls from morning-weather task; use wttr.in exclusively
- New scheduled tasks and scheduler state persistence

Discord:
- adapters/discord/__init__.py scaffold
- discord-plugin: MCP plugin for Claude Code Discord integration (server.ts, skills, config)

Infrastructure:
- n8n workflow exports (garvis_webhook, content_pipeline variants)
- memory_workspace: context, homelab-repo-updates, weekly observation summaries, error logs
- UCS C240 migration plan doc
- requirements.txt: new deps
- .claude/settings.json, fix_hooks.py: hook/permission tuning
2026-04-23 07:54:01 -06:00

2.8 KiB

Sub-Agent Watchdog Implementation Status

Completed

1. SubAgentManager Class (sub_agent_manager.py)

  • Tracks sub-agent state (ID, task, timestamps, completion status)
  • Background watchdog thread checks for hung agents every 30s
  • Detects hangs: no activity for 5+ minutes
  • Cleanup: marks hung agents as failed

2. Agent Integration (agent.py)

  • Import added: from sub_agent_manager import SubAgentManager
  • Initialized in Agent.__init__:
    self.sub_agent_manager = SubAgentManager(timeout_seconds=300)
    if not is_sub_agent:
        self.sub_agent_manager.start_watchdog()
    
  • Registration in spawn_sub_agent():
    if agent_id and not self.is_sub_agent:
        self.sub_agent_manager.register_sub_agent(agent_id, specialist_prompt[:100])
    

Still Needed

3. Activity Updates During Execution

Challenge: Agent SDK handles tool calls internally - no clear injection point for progress tracking.

Options:

  1. Add activity updates in llm_interface._agent_sdk_chat() message loop
  2. Hook into tool execution callbacks (if Agent SDK supports them)
  3. Poll conversation history length as proxy for activity

Recommended: Add in message receive loop:

# In llm_interface.py, _agent_sdk_chat(), async for message loop:
if hasattr(agent, 'sub_agent_manager'):
    # Update activity for current sub-agent if applicable
    agent.sub_agent_manager.update_activity(current_agent_id)

4. Mark Complete After Execution

Add after sub-agent chat completes:

# In agent.py, delegate() method:
try:
    result = sub_agent.chat(task, username=username)
    self.sub_agent_manager.mark_complete(agent_id, result=result)
except Exception as e:
    self.sub_agent_manager.mark_complete(agent_id, error=str(e))
    raise

5. Retry Logic for Hung Agents

Detect hung agents and restart task:

# In llm_interface.py or agent.py:
hung_agents = self.sub_agent_manager.get_hung_agents()
if hung_agents:
    logger.error(f"Detected {len(hung_agents)} hung sub-agents - restarting")
    for agent_id in hung_agents:
        self.sub_agent_manager.cleanup_agent(agent_id)
    # Retry the original request
    return self.chat(original_message, ...)  # Requires saving original context

Next Steps

  1. Test current implementation: Restart bot, verify watchdog starts
  2. Add activity tracking: Integrate into message receive loop
  3. Add completion marking: Hook into delegate() method
  4. Add retry logic: Detect hangs and restart tasks
  5. Test with hung agent: Create artificial hang to verify detection

Known Issues

  • Hook error blocks Write/Edit tools (workaround: use Bash for file operations)
  • CLAUDE_PLUGIN_ROOT env var points to stale plugin hash 261ce4fba4f2
  • No current mechanism to save original task context for retry