# Sub-Agent Watchdog Implementation Status

## ✅ Completed

### 1. SubAgentManager Class (`sub_agent_manager.py`)

- Tracks sub-agent state (ID, task, timestamps, completion status)
- Background watchdog thread checks for hung agents every 30 seconds
- Detects hangs: an agent with no activity for 5+ minutes is treated as hung
- Cleanup: marks hung agents as failed
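
The behaviour listed above could be implemented roughly as follows. This is a minimal sketch, not the contents of `sub_agent_manager.py`: the method names `register_sub_agent`, `update_activity`, `mark_complete`, `get_hung_agents`, and `start_watchdog` and the `timeout_seconds` parameter appear elsewhere in this document, while `SubAgentRecord`, `check_interval`, `check_once`, `stop_watchdog`, and all field names are assumptions introduced for illustration.

```python
import threading
import time
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class SubAgentRecord:
    """Per-agent state; field names are illustrative, not the real schema."""
    agent_id: str
    task: str
    started_at: float = field(default_factory=time.monotonic)
    last_activity: float = field(default_factory=time.monotonic)
    completed: bool = False
    result: Optional[str] = None
    error: Optional[str] = None


class SubAgentManager:
    def __init__(self, timeout_seconds: float = 300, check_interval: float = 30):
        self.timeout_seconds = timeout_seconds
        self.check_interval = check_interval  # assumed name for the 30 s period
        self._agents: Dict[str, SubAgentRecord] = {}
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread: Optional[threading.Thread] = None

    def register_sub_agent(self, agent_id: str, task: str) -> None:
        with self._lock:
            self._agents[agent_id] = SubAgentRecord(agent_id, task)

    def update_activity(self, agent_id: str) -> None:
        with self._lock:
            record = self._agents.get(agent_id)
            if record is not None:
                record.last_activity = time.monotonic()

    def mark_complete(self, agent_id: str, result=None, error=None) -> None:
        with self._lock:
            record = self._agents.get(agent_id)
            if record is not None:
                record.completed = True
                record.result, record.error = result, error

    def _is_hung(self, record: SubAgentRecord, now: float) -> bool:
        return not record.completed and now - record.last_activity > self.timeout_seconds

    def get_hung_agents(self) -> List[str]:
        now = time.monotonic()
        with self._lock:
            return [r.agent_id for r in self._agents.values() if self._is_hung(r, now)]

    def check_once(self) -> List[str]:
        """One watchdog pass: mark every hung agent as failed."""
        now = time.monotonic()
        with self._lock:
            hung = [r for r in self._agents.values() if self._is_hung(r, now)]
            for record in hung:
                record.completed = True
                record.error = "hung: no activity within timeout"
        return [r.agent_id for r in hung]

    def start_watchdog(self) -> None:
        if self._thread is None:  # starting twice is a no-op
            self._thread = threading.Thread(target=self._run, daemon=True)
            self._thread.start()

    def stop_watchdog(self) -> None:
        self._stop.set()

    def _run(self) -> None:
        # Event.wait returns False on timeout, so it doubles as an interruptible sleep.
        while not self._stop.wait(self.check_interval):
            self.check_once()
```

Using `time.monotonic()` rather than wall-clock time keeps the hang check immune to system clock adjustments, and the daemon thread ensures the watchdog never blocks interpreter shutdown.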

### 2. Agent Integration (`agent.py`)

- Import added: `from sub_agent_manager import SubAgentManager`
- Initialized in `Agent.__init__`:

  ```python
  self.sub_agent_manager = SubAgentManager(timeout_seconds=300)
  if not is_sub_agent:
      self.sub_agent_manager.start_watchdog()
  ```

- Registration in `spawn_sub_agent()`:

  ```python
  if agent_id and not self.is_sub_agent:
      self.sub_agent_manager.register_sub_agent(agent_id, specialist_prompt[:100])
  ```

## ⏳ Still Needed

### 3. Activity Updates During Execution

**Challenge**: The Agent SDK handles tool calls internally, so there is no clear injection point for progress tracking.

**Options**:

1. Add activity updates in the `llm_interface._agent_sdk_chat()` message loop
2. Hook into tool execution callbacks (if the Agent SDK supports them)
3. Poll conversation history length as a proxy for activity

**Recommended**: Add the update to the message receive loop:

```python
# In llm_interface.py, inside the `async for message` loop of _agent_sdk_chat():
if hasattr(agent, 'sub_agent_manager'):
    # Update activity for the current sub-agent, if applicable
    agent.sub_agent_manager.update_activity(current_agent_id)
```
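
To make the recommended placement concrete, here is a self-contained sketch of a message loop that reports activity once per received message. The Agent SDK stream is replaced by a stand-in async generator, and `fake_sdk_stream`, `consume_with_heartbeat`, `on_activity`, and `run_demo` are hypothetical names, not part of `llm_interface.py` or the SDK.

```python
import asyncio
from typing import AsyncIterator, Callable, List


async def fake_sdk_stream(n: int) -> AsyncIterator[str]:
    """Stand-in for the SDK's message stream (hypothetical)."""
    for i in range(n):
        await asyncio.sleep(0)  # yield control, as a real network stream would
        yield f"message {i}"


async def consume_with_heartbeat(
    stream: AsyncIterator[str], on_activity: Callable[[], None]
) -> List[str]:
    """Collect messages, treating each one as proof the agent is alive."""
    messages: List[str] = []
    async for message in stream:
        on_activity()
        messages.append(message)
    return messages


def run_demo() -> int:
    """Count how many heartbeats a three-message stream produces."""
    beats: List[int] = []
    asyncio.run(consume_with_heartbeat(fake_sdk_stream(3), lambda: beats.append(1)))
    return len(beats)
```

In the real loop, `on_activity` would presumably wrap `agent.sub_agent_manager.update_activity(current_agent_id)`. Note that this only records activity when a message arrives, so a single long-running tool call that emits no messages could still exceed the timeout.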

### 4. Mark Complete After Execution

Add after the sub-agent chat completes:

```python
# In agent.py, delegate() method:
try:
    result = sub_agent.chat(task, username=username)
    self.sub_agent_manager.mark_complete(agent_id, result=result)
except Exception as e:
    self.sub_agent_manager.mark_complete(agent_id, error=str(e))
    raise
```
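
One way to keep this bookkeeping out of `delegate()` itself is a small context manager. The sketch below is an assumption about how that might look: `tracked_run` and `RecordingManager` are hypothetical, and only the `mark_complete(agent_id, result=..., error=...)` call shape is taken from the snippet above.

```python
from contextlib import contextmanager


class RecordingManager:
    """Stand-in that records mark_complete calls (hypothetical test double)."""

    def __init__(self):
        self.calls = []

    def mark_complete(self, agent_id, result=None, error=None):
        self.calls.append((agent_id, result, error))


@contextmanager
def tracked_run(manager, agent_id):
    """Mark the agent complete exactly once, on success or on failure."""
    outcome = {}  # the caller stores its result here
    try:
        yield outcome
    except Exception as exc:
        manager.mark_complete(agent_id, error=str(exc))
        raise
    else:
        # Runs only if the body raised nothing, so a failure inside
        # mark_complete itself cannot trigger a second mark_complete call.
        manager.mark_complete(agent_id, result=outcome.get("result"))
```

Usage would be `with tracked_run(self.sub_agent_manager, agent_id) as out: out["result"] = sub_agent.chat(task, username=username)`. Like the snippet above, this catches `Exception` only, so a `KeyboardInterrupt` or task cancellation would leave the agent unmarked until the watchdog times it out.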

### 5. Retry Logic for Hung Agents

Detect hung agents and restart the task:

```python
# In llm_interface.py or agent.py:
hung_agents = self.sub_agent_manager.get_hung_agents()
if hung_agents:
    logger.error(f"Detected {len(hung_agents)} hung sub-agents - restarting")
    for agent_id in hung_agents:
        self.sub_agent_manager.cleanup_agent(agent_id)
    # Retry the original request
    return self.chat(original_message, ...)  # Requires saving original context
```
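
The retry above depends on the original message being available, and an unbounded retry could loop forever if the task hangs every time. A possible shape for both pieces is sketched below; `RetryLedger`, `SavedTask`, `max_retries`, and `run_task` are hypothetical names, and nothing here exists in the codebase yet.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class SavedTask:
    """Original request kept so a hung agent's work can be replayed."""
    message: str
    retries: int = 0


class RetryLedger:
    def __init__(self, max_retries: int = 2):
        self.max_retries = max_retries
        self._tasks: Dict[str, SavedTask] = {}

    def save(self, agent_id: str, message: str) -> None:
        """Call at registration time, before the sub-agent starts."""
        self._tasks[agent_id] = SavedTask(message)

    def retry_hung(
        self, hung_ids: Iterable[str], run_task: Callable[[str], str]
    ) -> Dict[str, str]:
        """Re-run saved tasks for hung agents, skipping exhausted ones."""
        results: Dict[str, str] = {}
        for agent_id in hung_ids:
            task = self._tasks.get(agent_id)
            if task is None or task.retries >= self.max_retries:
                continue  # nothing saved, or retry budget already spent
            task.retries += 1
            results[agent_id] = run_task(task.message)
        return results
```

Here `run_task` stands in for whatever re-dispatches the work (for example a wrapper around `self.chat`), and `hung_ids` would come from `get_hung_agents()`.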

## Next Steps

1. **Test current implementation**: Restart the bot and verify that the watchdog starts
2. **Add activity tracking**: Integrate into the message receive loop
3. **Add completion marking**: Hook into the `delegate()` method
4. **Add retry logic**: Detect hangs and restart tasks
5. **Test with hung agent**: Create an artificial hang to verify detection

## Known Issues

- Hook error blocks Write/Edit tools (workaround: use Bash for file operations)
- `CLAUDE_PLUGIN_ROOT` env var points to stale plugin hash `261ce4fba4f2`
- No current mechanism to save original task context for retry