**Problem**: A fixed 10-minute timeout kills legitimately slow operations
(e.g., 5-minute web searches) while letting infinite loops waste resources
until the timeout fires.
**Solution**: Dual-timeout strategy that distinguishes slow from stuck:
1. **Idle timeout** (5 min): No progress = kill
- Tracks message_count growth via heartbeat
- Only resets timer when count increases
- Slow web searches keep progressing → allowed
2. **Total timeout** (15 min): Hard cap safety net
- Prevents runaway tasks from consuming resources forever
- Allows legitimately slow operations to complete
3. **Loop detection**: Kills after 5+ errors
- Tracks error_count and last_error
- Detects repetitive failures quickly
- Independent of time-based checks
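The three checks above can be sketched as follows. This is a minimal illustration, not the code in SubAgentManager: `AgentState`, `update_activity`, and `hung_reason` are hypothetical stand-ins, the order in which the checks run is an assumption, and only the thresholds (300s idle, 900s total, 5 errors) come from this commit.

```python
from dataclasses import dataclass
from typing import Optional

IDLE_TIMEOUT_S = 300.0   # 5 min without progress
TOTAL_TIMEOUT_S = 900.0  # 15 min hard cap
MAX_ERRORS = 5           # loop detection threshold


@dataclass
class AgentState:
    """Hypothetical stand-in for SubAgentState."""
    started_at: float
    last_progress_at: float
    message_count: int = 0
    error_count: int = 0
    last_error: Optional[str] = None


def update_activity(state: AgentState, message_count: int, now: float) -> None:
    """Reset the idle timer only when the message count has grown."""
    if message_count > state.message_count:
        state.message_count = message_count
        state.last_progress_at = now


def hung_reason(state: AgentState, now: float) -> Optional[str]:
    """Return a diagnostic string if the agent should be killed, else None."""
    if state.error_count >= MAX_ERRORS:
        return f"Loop detected: {state.error_count} errors (last: {state.last_error})"
    total = now - state.started_at
    if total > TOTAL_TIMEOUT_S:
        return f"Total timeout: ran for {total:.0f}s"
    idle = now - state.last_progress_at
    if idle > IDLE_TIMEOUT_S:
        return (f"Idle timeout: No progress for {idle:.0f}s "
                f"({state.message_count} messages)")
    return None
```

A heartbeat that reports an unchanged count leaves `last_progress_at` untouched, which is what lets a slow but progressing task survive while a stuck one trips the idle check.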
**Key Changes**:
- SubAgentState: Add message_count, error_count tracking fields
- SubAgentManager.__init__: Dual timeout params (idle=300s, total=900s)
- SubAgentManager.update_activity: Accepts message_count, smart timer reset
- SubAgentManager.update_error: NEW - tracks errors for loop detection
- SubAgentManager.get_hung_agents: 3-check system (idle/total/loop)
- SubAgentManager.cleanup_agent: Detailed error messages by type
- agent.py heartbeat: Passes sub_agent.llm.message_count every 10s
- mcp_tools._DELEGATE_TIMEOUT: Increased to 900s (15 min)
**Impact**:
- Slow operations (5-12 min with progress) complete successfully
- Infinite loops killed within about 5 min via idle timeout, or sooner via error detection
- Clear diagnostics: "Idle timeout: No progress for 305s (23 messages)"
- Zero config needed - adaptive behavior works automatically
**Example**: CVE research taking 5 min with 117 messages now completes
instead of timing out at 10 min. Loop with repeated errors killed at 3 min.
See ADAPTIVE_TIMEOUT_SYSTEM.md for full specification and scenarios.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
Root Cause Analysis:
- delegate_task used run_in_executor with default ThreadPoolExecutor (8-12 threads)
- Each delegation blocked one thread for 2-8 minutes (full sub-agent conversation)
- After 6-8 parallel delegations, pool exhausted → all work hung
- Tool tracking used hasattr(block, 'type') but ToolUseBlock has no .type attribute
Changes:
1. mcp_tools.py: Replace thread pool with dedicated threads
- Each delegate_task creates dedicated daemon thread with isolated event loop
- Uses asyncio.Future + loop.call_soon_threadsafe for result communication
- Added semaphore to limit concurrent delegations (4 max)
- Eliminates pool exhaustion; parallel delegations no longer depend on free pool threads (concurrency is bounded only by the semaphore)
2. llm_interface.py: Fix tool tracking
- Added TextBlock/ToolUseBlock imports from claude_agent_sdk
- Replaced hasattr(block, 'type') checks with isinstance() checks
- Fixes tool_calls=0 bug (now correctly tracks tools used)
3. agent.py: Event loop isolation and thread safety
- Added defensive sub_agent.llm._event_loop = None in spawn_sub_agent
- Ensures sub-agents use asyncio.run() fallback with isolated loops
- Generate unique agent IDs with timestamps to prevent caching race conditions
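The dedicated-thread pattern from change 1 can be sketched roughly as below. This is an illustration under assumptions, not the code in mcp_tools.py: `run_sub_agent`, `delegate`, and `run_many` are hypothetical names, and the sub-agent conversation is replaced by a short sleep. Only the building blocks (daemon thread, isolated event loop, asyncio.Future, loop.call_soon_threadsafe, semaphore of 4) come from this commit.

```python
import asyncio
import threading

MAX_CONCURRENT_DELEGATIONS = 4  # limit stated in this commit


def run_sub_agent(task_id: int) -> str:
    """Hypothetical stand-in for a full, blocking sub-agent conversation."""
    async def conversation() -> str:
        await asyncio.sleep(0.01)
        return f"result:{task_id}"
    # asyncio.run creates a fresh event loop that belongs to this thread only.
    return asyncio.run(conversation())


def _set_result(future: asyncio.Future, value) -> None:
    if not future.done():
        future.set_result(value)


def _set_exception(future: asyncio.Future, exc: BaseException) -> None:
    if not future.done():
        future.set_exception(exc)


async def delegate(task_id: int, semaphore: asyncio.Semaphore) -> str:
    loop = asyncio.get_running_loop()
    future: asyncio.Future = loop.create_future()

    def worker() -> None:
        # Runs in the dedicated thread; results are handed back to the
        # caller's loop via call_soon_threadsafe, never set directly.
        try:
            result = run_sub_agent(task_id)
        except Exception as exc:
            loop.call_soon_threadsafe(_set_exception, future, exc)
        else:
            loop.call_soon_threadsafe(_set_result, future, result)

    async with semaphore:  # at most 4 worker threads alive at once
        threading.Thread(target=worker, daemon=True).start()
        return await future


async def run_many(n: int) -> list:
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_DELEGATIONS)
    return await asyncio.gather(*(delegate(i, semaphore) for i in range(n)))
```

Because each delegation owns its thread, a long conversation no longer holds a slot in the shared executor; the semaphore, not the pool size, decides how many run at once.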
Impact:
- Fixes 6-8 message hang pattern (no more 10-minute timeouts)
- Enables parallel sub-agent execution via delegate_task
- Tool tracking now reports accurate tool usage counts
- All sub-agents remain in Agent SDK mode (as required)
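The tool-tracking fix can be illustrated with a small sketch. The two dataclasses below are hypothetical stand-ins that only mirror the property described above (no `.type` attribute); real code would import TextBlock and ToolUseBlock from claude_agent_sdk as this commit does. The function names are also illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class TextBlock:
    """Hypothetical stand-in; note there is no .type attribute."""
    text: str


@dataclass
class ToolUseBlock:
    """Hypothetical stand-in; note there is no .type attribute."""
    id: str
    name: str
    input: dict = field(default_factory=dict)


def count_tool_calls_old(blocks) -> int:
    # Old approach: hasattr(block, 'type') is always False for these
    # classes, so every tool call is silently skipped.
    return sum(1 for b in blocks if hasattr(b, "type") and b.type == "tool_use")


def count_tool_calls(blocks) -> int:
    # New approach: check the class itself.
    return sum(1 for b in blocks if isinstance(b, ToolUseBlock))
```

With one text block and two tool-use blocks, the old check counts no tool calls while the isinstance check counts both, which matches the tool_calls=0 symptom.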
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
**Problem**: User got generic "Sorry, I encountered an error" (80 chars)
instead of the detailed timeout message with progress info and suggestions.
**Root Cause**: agent.py error handlers were replacing exception messages
with hardcoded generic text, discarding the detailed timeout info from
llm_interface.py.
**Solution**:
1. TimeoutError handler: Use str(e) to preserve detailed message from
llm_interface.py (message count, last tool, suggestions)
2. General Exception handlers: Include actual error text (limited to 500
chars) instead of "Please try again"
3. Applied to both Agent SDK and Direct API code paths
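A minimal sketch of the handler behavior described in the steps above. `format_error` and the message prefix are assumptions for illustration; only the use of str(e) for timeouts and the 500-character limit come from this commit.

```python
MAX_ERROR_CHARS = 500  # limit stated in this commit


def format_error(exc: BaseException) -> str:
    """Build the user-facing message without discarding the error detail."""
    if isinstance(exc, TimeoutError):
        # Pass the detailed timeout message through untouched.
        return str(exc)
    # Fall back to the exception class name if the message is empty.
    detail = str(exc) or type(exc).__name__
    # Hypothetical prefix; the actual wording in agent.py may differ.
    return f"Sorry, I encountered an error: {detail[:MAX_ERROR_CHARS]}"
```

Keeping the formatting in one helper is one way to apply the same behavior to both the Agent SDK and Direct API code paths.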
**Impact**: Users now see the actual error details including:
- Progress when task timed out (message count, last tool used)
- Actionable suggestions (break into sub-tasks, use delegate_task)
- Actual error messages for debugging instead of generic text
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add sub_agent_manager.register_sub_agent() call when agent_id provided
- Add missing return statement (method was returning None)
- Fixes watchdog tracking so it works once delegation is implemented
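The shape of the fix, as a hedged sketch: all names except `register_sub_agent` are hypothetical, and the real method and its signature are not shown in this commit message.

```python
class SubAgent:
    """Hypothetical placeholder for a spawned sub-agent."""


class SubAgentManager:
    """Hypothetical minimal manager; only register_sub_agent is named in the commit."""

    def __init__(self) -> None:
        self.registered = {}

    def register_sub_agent(self, agent_id: str, agent: SubAgent) -> None:
        # Assumed signature: the real one may take different arguments.
        self.registered[agent_id] = agent


def create_sub_agent(manager: SubAgentManager, agent_id=None) -> SubAgent:
    sub_agent = SubAgent()
    if agent_id is not None:
        # Register so the watchdog can track this agent.
        manager.register_sub_agent(agent_id, sub_agent)
    # Without this return the caller received None.
    return sub_agent
```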
Bug found during investigation of why watchdog didn't engage during
parallel task test. Root cause was no MCP tool for delegation, but
this bug would have prevented tracking even if delegation worked.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
- Add SelfHealingSystem with error observation infrastructure
- Capture errors with full context: type, message, stack trace, intent, inputs
- Log to MEMORY.md with deduplication (max 3 attempts per error signature)
- Integrate error capture in agent, tools, runtime, and scheduler
- Non-invasive: preserves all existing error handling behavior
- Foundation for future diagnosis and auto-fixing capabilities
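Deduplication by error signature could look roughly like the sketch below. It rests on assumptions: `ErrorObserver`, `error_signature`, and the hash-based signature are hypothetical, "max 3 attempts" is read here as "record at most three occurrences per signature", and an in-memory list stands in for MEMORY.md.

```python
import hashlib
import traceback
from collections import Counter

MAX_ATTEMPTS = 3  # per error signature, as stated in this commit


def error_signature(error_type: str, message: str) -> str:
    """Hypothetical signature: short hash of error type plus message."""
    digest = hashlib.sha256(f"{error_type}:{message}".encode("utf-8"))
    return digest.hexdigest()[:12]


class ErrorObserver:
    """Observation only: records errors and never alters how they are handled."""

    def __init__(self) -> None:
        self.attempts = Counter()
        self.log = []  # stand-in for entries appended to MEMORY.md

    def capture(self, exc: BaseException, intent: str = "") -> bool:
        """Record the error; return False if this signature is already capped."""
        sig = error_signature(type(exc).__name__, str(exc))
        if self.attempts[sig] >= MAX_ATTEMPTS:
            return False
        self.attempts[sig] += 1
        self.log.append({
            "signature": sig,
            "type": type(exc).__name__,
            "message": str(exc),
            "intent": intent,
            "stack": "".join(
                traceback.format_exception(type(exc), exc, exc.__traceback__)
            ),
        })
        return True
```

Callers would invoke `capture` inside their existing except blocks and then continue with the original handling unchanged, which keeps the integration non-invasive.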
Phase 1 of 4-phase rollout - observation only, no auto-fixing yet.
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>