Fix critical performance issues: thread pool exhaustion and tool tracking

Root Cause Analysis:
- delegate_task used run_in_executor with the default ThreadPoolExecutor (8-12 threads here; the default size is min(32, cpu_count + 4))
- Each delegation blocked one thread for 2-8 minutes (full sub-agent conversation)
- After 6-8 parallel delegations, pool exhausted → all work hung
- Tool tracking used hasattr(block, 'type') but ToolUseBlock has no .type attribute
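The pool exhaustion described above can be reproduced in miniature. This is a sketch only: the pool size of 2 and the long_delegation function are hypothetical stand-ins for the real executor and for a multi-minute sub-agent conversation.

```python
import threading
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 2                       # stand-in for the 8-12 thread default pool
release = threading.Event()         # keeps every "delegation" blocked until set
started = threading.Semaphore(0)    # counts delegations that have begun running


def long_delegation(i):
    """Hypothetical stand-in for a delegation that holds its thread for minutes."""
    started.release()
    release.wait()
    return i


with ThreadPoolExecutor(max_workers=POOL_SIZE) as pool:
    futures = [pool.submit(long_delegation, i) for i in range(POOL_SIZE + 1)]
    # Wait until every worker thread is occupied by a blocked delegation.
    for _ in range(POOL_SIZE):
        started.acquire()
    # The extra task is stuck in the queue: neither running nor finished.
    last_is_queued = not futures[-1].running() and not futures[-1].done()
    release.set()                   # unblock everything so the pool can drain
    results = [f.result() for f in futures]
```

Once every worker is held by a long-running call, any further work submitted to the same executor waits in the queue, which matches the hang seen after 6-8 delegations.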

Changes:

1. mcp_tools.py: Replace thread pool with dedicated threads
   - Each delegate_task creates dedicated daemon thread with isolated event loop
   - Uses asyncio.Future + loop.call_soon_threadsafe for result communication
   - Added semaphore to limit concurrent delegations (4 max)
   - Eliminates pool exhaustion: delegations no longer occupy the shared executor, and concurrency is bounded only by the semaphore

2. llm_interface.py: Fix tool tracking
   - Added TextBlock/ToolUseBlock imports from claude_agent_sdk
   - Replaced hasattr(block, 'type') checks with isinstance() checks
   - Fixes tool_calls=0 bug (now correctly tracks tools used)
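The tool_calls=0 bug can be illustrated with stand-in classes. The TextBlock and ToolUseBlock dataclasses below are assumptions made so the sketch runs without claude_agent_sdk installed; their fields are illustrative, and the only property taken from this commit is that ToolUseBlock has no .type attribute.

```python
from dataclasses import dataclass, field


@dataclass
class TextBlock:
    """Stand-in for the SDK's TextBlock (fields are assumed)."""
    text: str


@dataclass
class ToolUseBlock:
    """Stand-in for the SDK's ToolUseBlock; note there is no .type attribute."""
    id: str
    name: str
    input: dict = field(default_factory=dict)


def count_tool_calls_old(blocks):
    # Old check: never matches, because the blocks carry no .type attribute.
    return sum(1 for b in blocks if hasattr(b, "type") and b.type == "tool_use")


def count_tool_calls(blocks):
    # New check: dispatch on the block class instead of a string tag.
    return sum(1 for b in blocks if isinstance(b, ToolUseBlock))


blocks = [
    TextBlock("Looking that up."),
    ToolUseBlock("1", "read_file", {"path": "a.txt"}),
    ToolUseBlock("2", "grep", {"pattern": "TODO"}),
]
old_count = count_tool_calls_old(blocks)
new_count = count_tool_calls(blocks)
```

The hasattr guard fails silently rather than raising, which is why the old code reported zero tool calls instead of crashing.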

3. agent.py: Event loop isolation and thread safety
   - Added defensive sub_agent.llm._event_loop = None in spawn_sub_agent
   - Ensures sub-agents use asyncio.run() fallback with isolated loops
   - Generate unique agent IDs with timestamps to prevent caching race conditions
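One way to build the timestamped agent IDs mentioned above, as a sketch only. The make_agent_id helper and the ID format are hypothetical, and the process-wide counter is an added assumption that guards against two IDs being created within the same clock tick.

```python
import itertools
import time

_counter = itertools.count()  # assumed tie-breaker, not taken from agent.py


def make_agent_id(base="sub-agent"):
    """Return an ID that differs on every call, even with an identical base name."""
    return f"{base}-{time.time_ns()}-{next(_counter)}"


ids = [make_agent_id() for _ in range(1000)]
```

Because no two sub-agents share an ID, a cache keyed on the ID cannot hand one sub-agent state that belongs to another.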

Impact:
- Fixes 6-8 message hang pattern (no more 10-minute timeouts)
- Enables parallel sub-agent execution via delegate_task
- Tool tracking now reports accurate tool usage counts
- All sub-agents remain in Agent SDK mode (as required)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-03-03 20:48:43 -07:00
parent cc7e623d74
commit a8f3ed40a8
3 changed files with 2101 additions and 27 deletions
