ajarbot/ADAPTIVE_TIMEOUT_SYSTEM.md

# Adaptive Timeout System - Implementation Summary

## Overview

Replaced the simple fixed timeout with an **activity-based adaptive timeout** that distinguishes between:
- **Slow but active operations** (web searches, complex analysis) - allowed to continue
- **Stuck/looping operations** (repeated errors, no progress) - terminated quickly

## Key Changes

### 1. SubAgentManager Enhancements ([sub_agent_manager.py](sub_agent_manager.py))

#### New Tracking Fields (SubAgentState)
```python
# Loop detection fields
message_count: int = 0           # Current message count
last_message_count: int = 0      # Previous message count (for progress detection)
error_count: int = 0             # Number of errors encountered
last_error: Optional[str] = None # Last error message (for loop detection)
```

#### Dual Timeout Strategy
- **Idle timeout**: 5 minutes (default) - no progress (message count unchanged)
- **Total timeout**: 15 minutes (default) - hard cap even for legitimate slow tasks
- **Loop detection**: Kills once the error count exceeds 5, regardless of elapsed time
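
The three conditions can be read as one ordered decision. The following is a minimal sketch, not the code in `sub_agent_manager.py`: the function name `timeout_reason`, the module-level constants, and the strict `>` comparisons are assumptions (the `>` on the error count mirrors the comparison shown under "Adjust Loop Detection Threshold").

```python
from typing import Optional

IDLE_TIMEOUT_S = 300    # 5 min without message-count growth
TOTAL_TIMEOUT_S = 900   # 15 min hard cap
MAX_ERRORS = 5          # assumed threshold, compared with ">" as in the config snippet


def timeout_reason(elapsed_s: float, idle_s: float, error_count: int) -> Optional[str]:
    """Return which condition fired ("total", "idle", "loop"), or None to keep running."""
    if elapsed_s > TOTAL_TIMEOUT_S:   # hard cap applies even to tasks that still progress
        return "total"
    if idle_s > IDLE_TIMEOUT_S:       # no new messages for too long
        return "idle"
    if error_count > MAX_ERRORS:      # repeated errors, independent of elapsed time
        return "loop"
    return None
```

Checking the total timeout first means a task that is both over the cap and idle is reported as a total timeout, which matches the check order listed for `get_hung_agents()`.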

#### Updated Methods
1. **`__init__(idle_timeout_seconds=300, total_timeout_seconds=900)`**
   - Configurable idle and total timeouts
   - Idle = distinguishes slow from stuck
   - Total = safety net for runaway tasks

2. **`update_activity(agent_id, message_count=None)`**
   - Now accepts optional message_count parameter
   - Only updates `last_activity` timestamp if message count *increased*
   - Heartbeat without message count = basic keepalive (doesn't reset idle timer)

3. **`update_error(agent_id, error_message)`** - NEW
   - Tracks error count for loop detection
   - Warns after 3+ errors
   - Stores last error for debugging

4. **`get_hung_agents()`**
   - Check 1: Total timeout (hard cap at 15 min)
   - Check 2: Idle timeout (no progress for 5 min)
   - Check 3: Loop detection (more than 5 errors)
   - Logs which condition triggered for each hung agent

5. **`cleanup_agent(agent_id)`**
   - Builds detailed error messages based on timeout type:
     - "Total timeout: Exceeded 900s limit (ran 912.3s, 47 messages)"
     - "Idle timeout: No progress for 305.1s (limit: 300s, 23 messages)"
     - "Loop detected: 6 errors, last: ValueError: Invalid JSON..."
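
The progress-gated behaviour of `update_activity` and the error bookkeeping of `update_error` can be sketched as follows. This is an illustration under stated assumptions, not the actual `SubAgentManager`: `StateSketch`, `ActivityTrackerSketch`, `register`, and the injectable `clock` are hypothetical names, and `>= 3` is one reading of "warns after 3+ errors".

```python
import logging
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional

log = logging.getLogger("sub_agent_manager_sketch")


@dataclass
class StateSketch:
    """Hypothetical stand-in for SubAgentState (only the fields used here)."""
    last_activity: float
    message_count: int = 0
    last_message_count: int = 0
    error_count: int = 0
    last_error: Optional[str] = None


class ActivityTrackerSketch:
    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self.clock = clock  # injectable so tests need no real waiting
        self.agents: Dict[str, StateSketch] = {}

    def register(self, agent_id: str) -> None:
        self.agents[agent_id] = StateSketch(last_activity=self.clock())

    def update_activity(self, agent_id: str, message_count: Optional[int] = None) -> None:
        state = self.agents.get(agent_id)
        if state is None or message_count is None:
            return  # bare keepalive: the idle timer keeps running
        if message_count > state.message_count:
            state.last_message_count = state.message_count
            state.message_count = message_count
            state.last_activity = self.clock()  # only real progress resets the timer

    def update_error(self, agent_id: str, error_message: str) -> None:
        state = self.agents.get(agent_id)
        if state is None:
            return
        state.error_count += 1
        state.last_error = error_message  # kept for diagnostics
        if state.error_count >= 3:  # assumed reading of "warns after 3+ errors"
            log.warning("agent %s: %d errors, last: %s",
                        agent_id, state.error_count, error_message)
```

Passing a fake clock (for example `clock=lambda: now[0]`) lets the idle behaviour be exercised without sleeping for minutes.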

### 2. Agent Heartbeat Enhancement ([agent.py](agent.py):135-139)

```python
def heartbeat():
    while heartbeat_running[0]:
        if retry_id and not self.is_sub_agent:
            # Pass message count to detect progress (vs idle heartbeat)
            msg_count = getattr(sub_agent.llm, 'message_count', 0)
            self.sub_agent_manager.update_activity(retry_id, message_count=msg_count)
        time.sleep(10)
```

**How it works**:
- Heartbeat runs every 10 seconds
- Reads current message count from sub-agent's LLM interface
- Only resets idle timer if message count increased since last check
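
For reference, the same loop can be written as a standalone function. This is a sketch of an alternative shape, not the code in `agent.py`: it swaps the `heartbeat_running[0]` flag for a `threading.Event`, and `run_heartbeat`, `get_message_count`, and `report` are hypothetical names.

```python
import threading
from typing import Callable


def run_heartbeat(get_message_count: Callable[[], int],
                  report: Callable[[int], None],
                  stop: threading.Event,
                  interval_s: float = 10.0) -> None:
    """Report the current message count every interval_s seconds until stop is set."""
    while not stop.is_set():
        report(get_message_count())
        # Event.wait returns as soon as stop is set, so shutdown does not
        # have to sit out the remainder of a full sleep interval.
        stop.wait(interval_s)
```

In this shape, `report` would wrap a call such as `update_activity(retry_id, message_count=n)`, and the function would run in a daemon thread that the caller ends with `stop.set()`.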

### 3. MCP Tools Update ([mcp_tools.py](mcp_tools.py):85)

```python
_DELEGATE_TIMEOUT = 900  # 15 minutes total timeout (hard cap for legitimately slow tasks)
```

Changed from 600s (10 min) to 900s (15 min) to accommodate slow operations while still having a safety net.

## How It Works - Example Scenarios

### Scenario 1: Slow Web Search (5 minutes)
```
[00:00] Starting CVE research...                     [msg_count: 0]
[00:30] WebSearch: CVE-2024-1234...                  [msg_count: 5] → activity updated
[01:00] WebFetch: https://nvd.nist.gov/...           [msg_count: 12] → activity updated
[02:00] Analyzing vulnerability details...           [msg_count: 23] → activity updated
[04:30] Compiling report...                          [msg_count: 45] → activity updated
[05:00] Done! [117 messages total]                   [msg_count: 117] → completed

Result: ALLOWED (continuous message count growth = active progress)
```

### Scenario 2: Infinite Loop (about 6 minutes to detection)
```
[00:00] Trying to parse JSON...                      [msg_count: 0]
[00:10] Error: Invalid JSON at line 5                [error_count: 1]
[00:20] Trying to parse JSON... (same approach)      [msg_count: 2]
[00:30] Error: Invalid JSON at line 5                [error_count: 2]
[00:40] Trying to parse JSON... (same approach)      [msg_count: 4]
[00:50] Error: Invalid JSON at line 5                [error_count: 3]
[no new messages for 5 minutes]                      [msg_count: 6, unchanged]

Result: KILLED at about 5:50
- Idle timeout triggered (no progress for >5min)
- Loop detection would fire sooner if the error count kept climbing past 5 (only 3 errors here)
```

### Scenario 3: Complex Analysis (12 minutes)
```
[00:00] Starting deep code analysis...               [msg_count: 0]
[02:00] Analyzing module 1/10...                     [msg_count: 35] → activity updated
[04:00] Analyzing module 3/10...                     [msg_count: 67] → activity updated
[06:00] Analyzing module 5/10...                     [msg_count: 103] → activity updated
[08:00] Analyzing module 7/10...                     [msg_count: 145] → activity updated
[10:00] Analyzing module 9/10...                     [msg_count: 182] → activity updated
[12:00] Done! [223 messages total]                   [msg_count: 223] → completed

Result: ALLOWED (continuous progress, under 15min total limit)
```

### Scenario 4: Runaway Task (16 minutes)
```
[00:00-15:00] Very slow but making progress...        [msg_count growing]
[15:00] Still working... (no progress since 14:55)   [msg_count: 412, unchanged]

Result: KILLED at 15:00
- Total timeout triggered (exceeded 15min hard cap)
- Error: "Total timeout: Exceeded 900s limit (ran 900.2s, 412 messages)"
```
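
The scenarios above can be replayed against the three checks without waiting in real time. The following is a rough simulation sketch, not part of the implementation: `replay`, the `Observation` tuple, and the 10-second polling tick are assumptions, so kill times are rounded to the next tick and may differ by a few seconds from the timestamps in the scenarios.

```python
from typing import List, Optional, Tuple

# (time_s, message_count, error_count) as a heartbeat would observe them
Observation = Tuple[int, int, int]


def replay(observations: List[Observation], end_s: int, idle_s: int = 300,
           total_s: int = 900, max_errors: int = 5,
           tick_s: int = 10) -> Optional[Tuple[int, str]]:
    """Return (kill_time_s, reason), or None if the task reaches end_s alive."""
    pending = sorted(observations)
    msg_count = errors = last_progress = 0
    for now in range(0, end_s + 1, tick_s):
        # Apply everything observed up to this tick.
        while pending and pending[0][0] <= now:
            _, count, errs = pending.pop(0)
            if count > msg_count:  # only growth counts as progress
                msg_count, last_progress = count, now
            errors = max(errors, errs)
        if now > total_s:
            return now, "total"
        if now - last_progress > idle_s:
            return now, "idle"
        if errors > max_errors:
            return now, "loop"
    return None


# Scenario 1: steady growth, finishes at 5:00 and is never killed.
slow_search = [(30, 5, 0), (60, 12, 0), (120, 23, 0), (270, 45, 0), (300, 117, 0)]
# Scenario 2: three errors, then silence after 0:50.
stuck_loop = [(10, 0, 1), (20, 2, 1), (30, 2, 2), (40, 4, 2), (50, 6, 3)]
# Scenario 4: progress every 10 s, but the task would need 16 minutes.
runaway = [(t, t // 2, 0) for t in range(10, 900, 10)]
```

Calling `replay(slow_search, end_s=300)` returns `None`, `replay(stuck_loop, end_s=960)` reports an idle kill, and `replay(runaway, end_s=960)` reports a total-timeout kill on the first tick past the 900 s cap.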

## Configuration

### Adjust Timeouts
```python
# In agent.py __init__:
self.sub_agent_manager = SubAgentManager(
    idle_timeout_seconds=600,   # 10 min idle (for very slow tools)
    total_timeout_seconds=1800  # 30 min total (for massive tasks)
)
```

### Adjust Loop Detection Threshold
```python
# In sub_agent_manager.py get_hung_agents():
if state.error_count > 10:  # Change from 5 to 10
    hung.append(agent_id)
```

## Benefits

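1. **Fewer false positives**: Slow tools that show progress (message count growing) won't hit the idle timeout; only the 15 min hard cap can stop them
2. **Fast loop detection**: Stuck loops are caught after 5 min without progress or once the error count exceeds 5 (whichever comes first)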
3. **Clear diagnostics**: Error messages show exactly why task was killed
4. **Configurable**: Easy to adjust thresholds for different use cases

## Testing Checklist

- [x] **Slow web search**: 5 min CVE research completes successfully
- [ ] **Infinite loop**: Simulated loop killed within 5 min
- [ ] **Complex analysis**: 12 min task with steady progress completes
- [ ] **Runaway task**: 16 min task killed at 15 min hard cap
- [ ] **Error loop**: Task with 6+ repeated errors killed quickly

## Files Modified

1. [sub_agent_manager.py](sub_agent_manager.py) - Core adaptive timeout logic
2. [agent.py](agent.py) - Heartbeat passes message count
3. [mcp_tools.py](mcp_tools.py) - Total timeout increased to 15 min

---

**Status**: FULLY IMPLEMENTED, READY FOR TESTING
**Impact**: Should eliminate false timeouts while catching real loops faster