# Sub-Agent Watchdog Implementation Status

## ✅ Completed

### 1. SubAgentManager Class (`sub_agent_manager.py`)

- Tracks sub-agent state (ID, task, timestamps, completion status)
- Background watchdog thread checks for hung agents every 30s
- Detects hangs: no activity for 5+ minutes
- Cleanup: marks hung agents as failed

### 2. Agent Integration (`agent.py`)

- Import added: `from sub_agent_manager import SubAgentManager`
- Initialized in `Agent.__init__`:

  ```python
  self.sub_agent_manager = SubAgentManager(timeout_seconds=300)
  if not is_sub_agent:
      self.sub_agent_manager.start_watchdog()
  ```

- Registration in `spawn_sub_agent()`:

  ```python
  if agent_id and not self.is_sub_agent:
      self.sub_agent_manager.register_sub_agent(agent_id, specialist_prompt[:100])
  ```

## ⏳ Still Needed

### 3. Activity Updates During Execution

**Challenge**: Agent SDK handles tool calls internally - no clear injection point for progress tracking.

**Options**:

1. Add activity updates in `llm_interface._agent_sdk_chat()` message loop
2. Hook into tool execution callbacks (if Agent SDK supports them)
3. Poll conversation history length as proxy for activity

**Recommended**: Add in message receive loop:

```python
# In llm_interface.py, _agent_sdk_chat(), async for message loop:
if hasattr(agent, 'sub_agent_manager'):
    # Update activity for current sub-agent if applicable
    agent.sub_agent_manager.update_activity(current_agent_id)
```

### 4. Mark Complete After Execution

Add after sub-agent chat completes:

```python
# In agent.py, delegate() method:
try:
    result = sub_agent.chat(task, username=username)
    self.sub_agent_manager.mark_complete(agent_id, result=result)
except Exception as e:
    self.sub_agent_manager.mark_complete(agent_id, error=str(e))
    raise
```

### 5. Retry Logic for Hung Agents

Detect hung agents and restart task:

```python
# In llm_interface.py or agent.py:
hung_agents = self.sub_agent_manager.get_hung_agents()
if hung_agents:
    logger.error(f"Detected {len(hung_agents)} hung sub-agents - restarting")
    for agent_id in hung_agents:
        self.sub_agent_manager.cleanup_agent(agent_id)
    # Retry the original request
    return self.chat(original_message, ...)  # Requires saving original context
```

## Next Steps

1. **Test current implementation**: Restart bot, verify watchdog starts
2. **Add activity tracking**: Integrate into message receive loop
3. **Add completion marking**: Hook into delegate() method
4. **Add retry logic**: Detect hangs and restart tasks
5. **Test with hung agent**: Create artificial hang to verify detection

## Known Issues

- Hook error blocks Write/Edit tools (workaround: use Bash for file operations)
- `CLAUDE_PLUGIN_ROOT` env var points to stale plugin hash `261ce4fba4f2`
- No current mechanism to save original task context for retry
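
## Appendix: SubAgentManager Sketch

For reference while wiring up the remaining steps, the manager's behavior described in this document (state tracking, a periodic watchdog, hang detection after a timeout, cleanup by marking agents failed) could look roughly like the sketch below. This is not the contents of `sub_agent_manager.py`. The method names mirror the calls shown in this document; the record fields, the `check_interval` parameter, `stop_watchdog()`, and the injectable `clock` are assumptions added for illustration and testability.

```python
import threading
import time
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class SubAgentRecord:
    # Hypothetical record layout; the real class may store different fields.
    agent_id: str
    task: str
    started_at: float
    last_activity: float
    completed: bool = False
    result: Optional[str] = None
    error: Optional[str] = None


class SubAgentManager:
    def __init__(self, timeout_seconds: float = 300, check_interval: float = 30,
                 clock: Callable[[], float] = time.monotonic):
        self.timeout_seconds = timeout_seconds
        self.check_interval = check_interval  # assumed name for the 30s poll
        self._clock = clock                   # injectable so tests can fake time
        self._agents: Dict[str, SubAgentRecord] = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def register_sub_agent(self, agent_id: str, task: str) -> None:
        now = self._clock()
        with self._lock:
            self._agents[agent_id] = SubAgentRecord(agent_id, task, now, now)

    def update_activity(self, agent_id: str) -> None:
        with self._lock:
            record = self._agents.get(agent_id)
            if record is not None and not record.completed:
                record.last_activity = self._clock()

    def mark_complete(self, agent_id: str, result: Optional[str] = None,
                      error: Optional[str] = None) -> None:
        with self._lock:
            record = self._agents.get(agent_id)
            if record is not None:
                record.completed = True
                record.result = result
                record.error = error

    def get_hung_agents(self) -> List[str]:
        # An agent counts as hung if it is unfinished and has been idle
        # for at least timeout_seconds.
        now = self._clock()
        with self._lock:
            return [r.agent_id for r in self._agents.values()
                    if not r.completed
                    and now - r.last_activity >= self.timeout_seconds]

    def cleanup_agent(self, agent_id: str) -> None:
        # "Cleanup" here only marks the agent as failed.
        self.mark_complete(agent_id, error="hung: no activity before timeout")

    def start_watchdog(self) -> None:
        if self._thread is not None and self._thread.is_alive():
            return
        self._stop.clear()
        self._thread = threading.Thread(target=self._watch, daemon=True)
        self._thread.start()

    def stop_watchdog(self) -> None:
        self._stop.set()
        if self._thread is not None:
            self._thread.join()

    def _watch(self) -> None:
        # Event.wait returns False on timeout and True once stop is requested.
        while not self._stop.wait(self.check_interval):
            for agent_id in self.get_hung_agents():
                self.cleanup_agent(agent_id)
```

Using `threading.Event.wait()` as the sleep lets the watchdog exit promptly on shutdown instead of blocking for a full interval. Since each record already keeps the `task` string, storing the full task there (rather than the first 100 characters) would be one possible way to retain context for a retry.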