**Problem**: A fixed 10-minute timeout kills legitimately slow operations
(e.g., 5-minute web searches), while infinite loops waste resources until it fires.
**Solution**: Dual-timeout strategy that distinguishes slow from stuck:
1. **Idle timeout** (5 min): No progress = kill
   - Tracks message_count growth via heartbeat
   - Only resets timer when count increases
   - Slow web searches keep progressing → allowed
2. **Total timeout** (15 min): Hard cap safety net
   - Prevents runaway tasks from consuming resources forever
   - Allows legitimately slow operations to complete
3. **Loop detection**: Kills after 5+ errors
   - Tracks error_count and last_error
   - Detects repetitive failures quickly
   - Independent of time-based checks
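The three checks above can be sketched as one pure function. This is a minimal illustration, not the code in SubAgentManager.get_hung_agents: the names AgentSnapshot, hang_reason, started_at and last_progress_at are assumptions made for the example, and only the thresholds (300s, 900s, 5 errors) and the idle message wording come from this commit message.

```python
from dataclasses import dataclass
from typing import Optional

IDLE_TIMEOUT_S = 300    # 5 min without message_count growth
TOTAL_TIMEOUT_S = 900   # 15 min hard cap
MAX_ERRORS = 5          # loop-detection threshold ("5+ errors")


@dataclass
class AgentSnapshot:
    """Hypothetical stand-in for the fields SubAgentState tracks."""
    started_at: float
    last_progress_at: float
    message_count: int = 0
    error_count: int = 0
    last_error: Optional[str] = None


def hang_reason(s: AgentSnapshot, now: float) -> Optional[str]:
    """Return a diagnostic string if the agent should be killed, else None."""
    # Loop detection does not depend on elapsed time, so check it first.
    if s.error_count >= MAX_ERRORS:
        return f"Loop detected: {s.error_count} errors (last: {s.last_error})"
    total = now - s.started_at
    if total > TOTAL_TIMEOUT_S:
        return f"Total timeout: ran for {total:.0f}s"
    idle = now - s.last_progress_at
    if idle > IDLE_TIMEOUT_S:
        return (f"Idle timeout: No progress for {idle:.0f}s "
                f"({s.message_count} messages)")
    return None
```

The order of the checks in this sketch is a design choice for the example only; the commit states just that all three checks exist and that loop detection is independent of the timers.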
**Key Changes**:
- SubAgentState: Add message_count, error_count tracking fields
- SubAgentManager.__init__: Dual timeout params (idle=300s, total=900s)
- SubAgentManager.update_activity: Accepts message_count, smart timer reset
- SubAgentManager.update_error: NEW - tracks errors for loop detection
- SubAgentManager.get_hung_agents: 3-check system (idle/total/loop)
- SubAgentManager.cleanup_agent: Detailed error messages by type
- agent.py heartbeat: Passes sub_agent.llm.message_count every 10s
- mcp_tools._DELEGATE_TIMEOUT: Increased to 900s (15 min)
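A rough sketch of the "smart timer reset" and error tracking listed above, assuming the heartbeat reports the current message count on every tick. TrackedState, last_progress_at and the free functions are hypothetical names for illustration; the real methods live on SubAgentManager and may differ. The sketch also assumes every reported error increments error_count, which this commit message does not spell out.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TrackedState:
    """Hypothetical subset of SubAgentState used for progress tracking."""
    message_count: int = 0
    error_count: int = 0
    last_error: Optional[str] = None
    last_progress_at: float = field(default_factory=time.monotonic)


def update_activity(state: TrackedState, message_count: int,
                    now: Optional[float] = None) -> bool:
    """Reset the idle timer only when message_count has grown."""
    now = time.monotonic() if now is None else now
    if message_count > state.message_count:
        state.message_count = message_count
        state.last_progress_at = now
        return True
    # A heartbeat with no new messages is not progress: leave the timer alone.
    return False


def update_error(state: TrackedState, error: str) -> int:
    """Record an error for loop detection and return the running count."""
    state.error_count += 1
    state.last_error = error
    return state.error_count
```

Because a heartbeat with an unchanged count leaves last_progress_at untouched, a stuck agent that is still "alive" cannot keep itself from timing out.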
**Impact**:
- Slow operations (5-12 min with progress) complete successfully
- Infinite loops killed within about 5 min via the idle timeout, or sooner via error detection
- Clear diagnostics: "Idle timeout: No progress for 305s (23 messages)"
- Zero config needed - adaptive behavior works automatically
**Example**: CVE research taking 5 min with 117 messages now completes
instead of timing out at 10 min. Loop with repeated errors killed at 3 min.
See ADAPTIVE_TIMEOUT_SYSTEM.md for full specification and scenarios.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Root Cause Analysis:
- delegate_task used run_in_executor with default ThreadPoolExecutor (8-12 threads)
- Each delegation blocked one thread for 2-8 minutes (full sub-agent conversation)
- After 6-8 parallel delegations, pool exhausted → all work hung
- Tool tracking used hasattr(block, 'type') but ToolUseBlock has no .type attribute
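The last root cause can be reproduced in isolation. The classes below are local stand-ins defined only for this example, not the claude_agent_sdk types; they mirror the one property stated above, that ToolUseBlock has no .type attribute. The field names id, name, input and text are assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class TextBlock:
    """Stand-in for the SDK text block (illustrative only)."""
    text: str


@dataclass
class ToolUseBlock:
    """Stand-in for the SDK tool-use block; note there is no .type field."""
    id: str
    name: str
    input: dict = field(default_factory=dict)


def count_tools_hasattr(blocks) -> int:
    # Old-style check: never matches because the attribute does not exist.
    return sum(1 for b in blocks
               if hasattr(b, "type") and b.type == "tool_use")


def count_tools_isinstance(blocks) -> int:
    # Class-based check: matches on the block's type, not on an attribute.
    return sum(1 for b in blocks if isinstance(b, ToolUseBlock))


blocks = [
    TextBlock("searching"),
    ToolUseBlock("t1", "web_search"),
    ToolUseBlock("t2", "read_file"),
]
```

With these stand-ins the attribute-based counter reports 0 tool calls while the class-based counter reports 2, which matches the tool_calls=0 symptom.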
Changes:
1. mcp_tools.py: Replace thread pool with dedicated threads
   - Each delegate_task creates a dedicated daemon thread with an isolated event loop
   - Uses asyncio.Future + loop.call_soon_threadsafe for result communication
   - Added semaphore to limit concurrent delegations (4 max)
   - Eliminates pool exhaustion; parallelism is bounded by the semaphore, not by pool size
2. llm_interface.py: Fix tool tracking
   - Added TextBlock/ToolUseBlock imports from claude_agent_sdk
   - Replaced hasattr(block, 'type') checks with isinstance() checks
   - Fixes tool_calls=0 bug (now correctly tracks tools used)
3. agent.py: Event loop isolation and thread safety
   - Added defensive sub_agent.llm._event_loop = None in spawn_sub_agent
   - Ensures sub-agents use asyncio.run() fallback with isolated loops
   - Generate unique agent IDs with timestamps to prevent caching race conditions
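The dedicated-thread pattern from change 1 might look roughly like the sketch below. It is an illustration under assumptions, not the code in mcp_tools.py: run_in_dedicated_thread, coro_factory and slots are hypothetical names, and the real delegate_task wiring is not shown. Only the building blocks (daemon thread, isolated loop via asyncio.run, asyncio.Future, loop.call_soon_threadsafe, a semaphore of 4) come from this commit message.

```python
import asyncio
import threading

MAX_CONCURRENT = 4  # concurrent delegation limit stated in this commit


async def run_in_dedicated_thread(coro_factory, slots: asyncio.Semaphore):
    """Run coro_factory() on its own daemon thread with its own event loop."""
    caller_loop = asyncio.get_running_loop()
    result: asyncio.Future = caller_loop.create_future()

    def _deliver(setter, value):
        # Runs on the caller's loop; skip if the caller was cancelled.
        if not result.done():
            setter(value)

    def worker():
        try:
            # asyncio.run creates and closes a fresh loop in this thread.
            value = asyncio.run(coro_factory())
        except Exception as exc:
            caller_loop.call_soon_threadsafe(_deliver, result.set_exception, exc)
        else:
            caller_loop.call_soon_threadsafe(_deliver, result.set_result, value)

    async with slots:  # wait for one of the MAX_CONCURRENT slots
        threading.Thread(target=worker, daemon=True).start()
        return await result
```

The caller only awaits a future, so no shared pool thread is held while a sub-agent conversation runs; the semaphore should be created inside the running loop, e.g. asyncio.Semaphore(MAX_CONCURRENT).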
Impact:
- Fixes 6-8 message hang pattern (no more 10-minute timeouts)
- Enables parallel sub-agent execution via delegate_task
- Tool tracking now reports accurate tool usage counts
- All sub-agents remain in Agent SDK mode (as required)
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>