Implement adaptive timeout system with activity-based loop detection
**Problem**: Fixed 10-minute timeout kills legitimately slow operations (e.g., 5-minute web searches) while infinite loops waste resources. **Solution**: Dual-timeout strategy that distinguishes slow from stuck: 1. **Idle timeout** (5 min): No progress = kill - Tracks message_count growth via heartbeat - Only resets timer when count increases - Slow web searches keep progressing → allowed 2. **Total timeout** (15 min): Hard cap safety net - Prevents runaway tasks from consuming resources forever - Allows legitimately slow operations to complete 3. **Loop detection**: Kills after 5+ errors - Tracks error_count and last_error - Detects repetitive failures quickly - Independent of time-based checks **Key Changes**: - SubAgentState: Add message_count, error_count tracking fields - SubAgentManager.__init__: Dual timeout params (idle=300s, total=900s) - SubAgentManager.update_activity: Accepts message_count, smart timer reset - SubAgentManager.update_error: NEW - tracks errors for loop detection - SubAgentManager.get_hung_agents: 3-check system (idle/total/loop) - SubAgentManager.cleanup_agent: Detailed error messages by type - agent.py heartbeat: Passes sub_agent.llm.message_count every 10s - mcp_tools._DELEGATE_TIMEOUT: Increased to 900s (15 min) **Impact**: - Slow operations (5-12 min with progress) complete successfully - Infinite loops killed in <5 min via idle timeout or error detection - Clear diagnostics: "Idle timeout: No progress for 305s (23 messages)" - Zero config needed - adaptive behavior works automatically **Example**: CVE research taking 5 min with 117 messages now completes instead of timing out at 10 min. Loop with repeated errors killed at 3 min. See ADAPTIVE_TIMEOUT_SYSTEM.md for full specification and scenarios. Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
This commit is contained in:
4
agent.py
4
agent.py
@@ -135,7 +135,9 @@ class Agent:
|
||||
def heartbeat():
|
||||
while heartbeat_running[0]:
|
||||
if retry_id and not self.is_sub_agent:
|
||||
self.sub_agent_manager.update_activity(retry_id)
|
||||
# Pass message count to detect progress (vs idle heartbeat)
|
||||
msg_count = getattr(sub_agent.llm, 'message_count', 0)
|
||||
self.sub_agent_manager.update_activity(retry_id, message_count=msg_count)
|
||||
time.sleep(10)
|
||||
|
||||
heartbeat_thread = None
|
||||
|
||||
Reference in New Issue
Block a user