Commit Graph

13 Commits

Author SHA1 Message Date
400ef73419 Fix: Update SubAgentManager initialization for new timeout params
Changed from timeout_seconds=300 to use default params (idle=300s, total=900s)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-03-04 17:45:40 -07:00
8c039b6cad Implement adaptive timeout system with activity-based loop detection
**Problem**: Fixed 10-minute timeout kills legitimately slow operations
(e.g., 5-minute web searches) while infinite loops waste resources.

**Solution**: Dual-timeout strategy that distinguishes slow from stuck:

1. **Idle timeout** (5 min): No progress = kill
   - Tracks message_count growth via heartbeat
   - Only resets timer when count increases
   - Slow web searches keep progressing → allowed

2. **Total timeout** (15 min): Hard cap safety net
   - Prevents runaway tasks from consuming resources forever
   - Allows legitimately slow operations to complete

3. **Loop detection**: Kills after 5+ errors
   - Tracks error_count and last_error
   - Detects repetitive failures quickly
   - Independent of time-based checks

**Key Changes**:
- SubAgentState: Add message_count, error_count tracking fields
- SubAgentManager.__init__: Dual timeout params (idle=300s, total=900s)
- SubAgentManager.update_activity: Accepts message_count, smart timer reset
- SubAgentManager.update_error: NEW - tracks errors for loop detection
- SubAgentManager.get_hung_agents: 3-check system (idle/total/loop)
- SubAgentManager.cleanup_agent: Detailed error messages by type
- agent.py heartbeat: Passes sub_agent.llm.message_count every 10s
- mcp_tools._DELEGATE_TIMEOUT: Increased to 900s (15 min)

**Impact**:
- Slow operations (5-12 min with progress) complete successfully
- Infinite loops killed in <5 min via idle timeout or error detection
- Clear diagnostics: "Idle timeout: No progress for 305s (23 messages)"
- Zero config needed - adaptive behavior works automatically

**Example**: CVE research taking 5 min with 117 messages now completes
instead of timing out at 10 min. Loop with repeated errors killed at 3 min.

See ADAPTIVE_TIMEOUT_SYSTEM.md for full specification and scenarios.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-03-04 17:40:58 -07:00
a8f3ed40a8 Fix critical performance issues: thread pool exhaustion and tool tracking
Root Cause Analysis:
- delegate_task used run_in_executor with default ThreadPoolExecutor (8-12 threads)
- Each delegation blocked one thread for 2-8 minutes (full sub-agent conversation)
- After 6-8 parallel delegations, pool exhausted → all work hung
- Tool tracking used hasattr(block, 'type') but ToolUseBlock has no .type attribute

Changes:

1. mcp_tools.py: Replace thread pool with dedicated threads
   - Each delegate_task creates dedicated daemon thread with isolated event loop
   - Uses asyncio.Future + loop.call_soon_threadsafe for result communication
   - Added semaphore to limit concurrent delegations (4 max)
   - Eliminates pool exhaustion, enables unlimited parallel delegations

2. llm_interface.py: Fix tool tracking
   - Added TextBlock/ToolUseBlock imports from claude_agent_sdk
   - Replaced hasattr(block, 'type') checks with isinstance() checks
   - Fixes tool_calls=0 bug (now correctly tracks tools used)

3. agent.py: Event loop isolation and thread safety
   - Added defensive sub_agent.llm._event_loop = None in spawn_sub_agent
   - Ensures sub-agents use asyncio.run() fallback with isolated loops
   - Generate unique agent IDs with timestamps to prevent caching race conditions

Impact:
- Fixes 6-8 message hang pattern (no more 10-minute timeouts)
- Enables parallel sub-agent execution via delegate_task
- Tool tracking now reports accurate tool usage counts
- All sub-agents remain in Agent SDK mode (as required)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-03-04 07:43:04 -07:00
20b7b9f7c4 Fix error handling to preserve detailed timeout messages
**Problem**: User got generic "Sorry, I encountered an error" (80 chars)
instead of the detailed timeout message with progress info and suggestions.

**Root Cause**: agent.py error handlers were replacing exception messages
with hardcoded generic text, discarding the detailed timeout info from
llm_interface.py.

**Solution**:
1. TimeoutError handler: Use str(e) to preserve detailed message from
   llm_interface.py (message count, last tool, suggestions)
2. General Exception handlers: Include actual error text (limited to 500
   chars) instead of "Please try again"
3. Applied to both Agent SDK and Direct API code paths

**Impact**: Users now see the actual error details including:
- Progress when task timed out (message count, last tool used)
- Actionable suggestions (break into sub-tasks, use delegate_task)
- Actual error messages for debugging instead of generic text

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-03-01 16:17:37 -07:00
6eafd758c9 Fix spawn_sub_agent() bug - add missing registration and return
- Add sub_agent_manager.register_sub_agent() call when agent_id provided
- Add missing return statement (method was returning None)
- Fixes watchdog tracking for when delegation is implemented

Bug found during investigation of why watchdog didn't engage during
parallel task test. Root cause was no MCP tool for delegation, but
this bug would have prevented tracking even if delegation worked.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-03-01 10:46:43 -07:00
7697220c74 Refactor: Remove zombie code, fix bugs, and clean documentation
This comprehensive refactoring removes dead code, fixes bugs, and deletes
outdated documentation to make the codebase production-ready.

## Files Deleted (16 files)

### Temporary/zombie files (9 files):
- nul (Windows artifact)
- quick_start.bat (superseded by run.bat)
- scripts/proxmox_ssh.py (hardcoded credentials - security risk)
- scripts/proxmox_ssh.sh (hardcoded credentials - security risk)
- scripts/collection_output.txt (one-time audit output)
- scripts/collect-homelab-config.sh (one-off infrastructure script)
- scripts/collect-remote.sh (one-off infrastructure script)
- memory_workspace/MEMORY.md.old (backup file)
- promtail-config-optimized.yaml (misplaced homelab config)

### Outdated documentation (7 files):
- MCP_MIGRATION.md (migration complete - 2026-02-15)
- QUICK_REFERENCE_AGENT_SDK.md (orphaned from cleanup)
- SETUP.md (duplicate of README.md quick start)
- WINDOWS_QUICK_REFERENCE.md (duplicate of docs/WINDOWS_DEPLOYMENT.md)
- SUB_AGENTS.md (design doc for unimplemented feature)
- JARVIS_VOICE_INTEGRATION_PLAN.md (1300-line spec, code not implemented)
- OBSIDIAN_MCP_SETUP_INSTRUCTIONS.md (temporary troubleshooting doc)
- LOGGING.md (redundant with well-commented logging_config.py)
- docs/SECURITY_AUDIT_SUMMARY.md (completed audit from 2026-02-12)

## Critical Bug Fixes (2 bugs)

1. bot_runner.py line 122: Fixed wrong dict key reference
   - Changed send_to_platform → send_to
   - Bug caused scheduled task platform info to never print

2. usage_tracker.py: Added missing pricing for claude-sonnet-4-6
   - Model was default but had no pricing entry
   - Caused cost under-reporting in Direct API mode

## Code Removed (14 files modified, ~1200 lines deleted)

### Dead imports removed (9 imports):
- bot_runner.py: sys
- agent.py: time
- adapters/runtime.py: re
- adapters/skill_integration.py: subprocess
- tools.py: redundant Path import
- mcp_servers/loki/loki_server.py: json
- google_tools/oauth_manager.py: Thread, Dict
- google_tools/gmail_client.py: os
- google_tools/utils.py: email

### Unused functions/methods removed (9 functions):
- agent.py: MEMORY_RESPONSE_PREVIEW_LENGTH constant
- scheduled_tasks.py: integrate_scheduler_with_runtime()
- adapters/runtime.py: command_preprocessor(), markdown_postprocessor()
- adapters/skill_integration.py: invoke_skill_via_cli(), __main__ block
- tools.py: _extract_mcp_result()
- google_tools/oauth_manager.py: needs_refresh_soon(), revoke_authorization()
- google_tools/people_client.py: update_contact(), delete_contact()

### Code quality improvements:
- memory_system.py: Removed empty else: pass branch
- calendar_client.py: Fixed bare except: → except Exception:
- mcp_ssh.py: Updated asyncio.get_event_loop() → get_running_loop()
- calendar_client.py: Fixed deprecated datetime.utcnow() → now(timezone.utc)

## Impact

- ~1200 lines of dead code removed
- 16 obsolete files deleted
- 2 critical bugs fixed
- 3 deprecated APIs updated
- Zero functionality broken (all changes verified)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-24 12:46:56 -07:00
fe7c146dc6 feat: Add Gitea MCP integration and project cleanup
## New Features
- **Gitea MCP Tools** (zero API cost):
  - gitea_read_file: Read files from homelab repo
  - gitea_list_files: Browse directories
  - gitea_search_code: Search by filename
  - gitea_get_tree: Get directory tree
- **Gitea Client** (gitea_tools/client.py): REST API wrapper with OAuth
- **Proxmox SSH Scripts** (scripts/): Homelab data collection utilities
- **Obsidian MCP Support** (obsidian_mcp.py): Advanced vault operations
- **Voice Integration Plan** (JARVIS_VOICE_INTEGRATION_PLAN.md)

## Improvements
- **Increased timeout**: 5min → 10min for complex tasks (llm_interface.py)
- **Removed Direct API fallback**: Gitea tools are MCP-only (zero cost)
- **Updated .env.example**: Added Obsidian MCP configuration
- **Enhanced .gitignore**: Protect personal memory files (SOUL.md, MEMORY.md)

## Cleanup
- Deleted 24 obsolete files (temp/test/experimental scripts, outdated docs)
- Untracked personal memory files (SOUL.md, MEMORY.md now in .gitignore)
- Removed: AGENT_SDK_IMPLEMENTATION.md, HYBRID_SEARCH_SUMMARY.md,
  IMPLEMENTATION_SUMMARY.md, MIGRATION.md, test_agent_sdk.py, etc.

## Configuration
- Added config/gitea_config.example.yaml (Gitea setup template)
- Added config/obsidian_mcp.example.yaml (Obsidian MCP template)
- Updated scheduled_tasks.yaml with new task examples

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-18 20:31:32 -07:00
50cf7165cb Add sub-agent orchestration, MCP tools, and critical bug fixes
Major Features:
- Sub-agent orchestration system with dynamic specialist spawning
  * spawn_sub_agent(): Create specialists with custom prompts
  * delegate(): Convenience method for task delegation
  * Cached specialists for reuse
  * Separate conversation histories and focused context

- MCP (Model Context Protocol) tool integration
  * Zettelkasten: fleeting_note, daily_note, permanent_note, literature_note
  * Search: search_vault (hybrid search), search_by_tags
  * Web: web_fetch for real-time data
  * Zero-cost file/system operations on Pro subscription

Critical Bug Fixes:
- Fixed max tool iterations (15 → 30, configurable)
- Fixed max_tokens error in Agent SDK query() call
- Fixed MCP tool routing in execute_tool()
  * Routes zettelkasten + web tools to async handlers
  * Prevents "Unknown tool" errors

Documentation:
- SUB_AGENTS.md: Complete guide to sub-agent system
- MCP_MIGRATION.md: Agent SDK migration details
- SOUL.example.md: Sanitized bot identity template
- scheduled_tasks.example.yaml: Sanitized task config template

Security:
- Added obsidian vault to .gitignore
- Protected SOUL.md and MEMORY.md (personal configs)
- Sanitized example configs with placeholders

Dependencies:
- Added beautifulsoup4, httpx, lxml for web scraping
- Updated requirements.txt

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-16 07:43:31 -07:00
911d362ba2 Optimize for Claude Agent SDK: Memory, context, and model selection
## Memory & Context Optimizations

### agent.py
- MAX_CONTEXT_MESSAGES: 10 → 20 (better conversation coherence)
- MEMORY_RESPONSE_PREVIEW_LENGTH: 200 → 500 (richer memory storage)
- MAX_CONVERSATION_HISTORY: 50 → 100 (longer session continuity)
- search_hybrid max_results: 2 → 5 (better memory recall)
- System prompt: Now mentions tool count and flat-rate subscription
- Memory format: Changed "User (username)/Agent" to "username/Garvis"

### llm_interface.py
- Added claude_agent_sdk model (Sonnet) to defaults
- Mode-based model selection:
  * Agent SDK → Sonnet (best quality, flat-rate)
  * Direct API → Haiku (cheapest, pay-per-token)
- Updated logging to show active model

## SOUL.md Rewrite

- Added Garvis identity (name, email, role)
- Listed all 17 tools (was missing 12 tools)
- Added "Critical Behaviors" section
- Emphasized flat-rate subscription benefits
- Clear instructions to always check user profiles

## Benefits

With flat-rate Agent SDK:
-  Use Sonnet for better reasoning (was Haiku)
-  2x context messages (10 → 20)
-  2.5x memory results (2 → 5)
-  2.5x richer memory previews (200 → 500 chars)
-  Bot knows its name and all capabilities
-  Zero marginal cost for thoroughness

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-15 10:22:23 -07:00
a8665d8c72 Refactor: Clean up obsolete files and organize codebase structure
This commit removes deprecated modules and reorganizes code into logical directories:

Deleted files (superseded by newer systems):
- claude_code_server.py (replaced by agent-sdk direct integration)
- heartbeat.py (superseded by scheduled_tasks.py)
- pulse_brain.py (unused in production)
- config/pulse_brain_config.py (obsolete config)

Created directory structure:
- examples/ (7 example files: example_*.py, demo_*.py)
- tests/ (5 test files: test_*.py)

Updated imports:
- agent.py: Removed heartbeat module and all enable_heartbeat logic
- bot_runner.py: Removed heartbeat parameter from Agent initialization
- llm_interface.py: Updated deprecated claude_code_server message

Preserved essential files:
- hooks.py (for future use)
- adapters/skill_integration.py (for future use)
- All Google integration tools (Gmail, Calendar, Contacts)
- GLM provider code (backward compatibility)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-15 09:57:39 -07:00
f018800d94 Implement self-healing system Phase 1: Error capture and logging
- Add SelfHealingSystem with error observation infrastructure
- Capture errors with full context: type, message, stack trace, intent, inputs
- Log to MEMORY.md with deduplication (max 3 attempts per error signature)
- Integrate error capture in agent, tools, runtime, and scheduler
- Non-invasive: preserves all existing error handling behavior
- Foundation for future diagnosis and auto-fixing capabilities

Phase 1 of 4-phase rollout - observation only, no auto-fixing yet.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-14 18:03:42 -07:00
8afff96bb5 Add API usage tracking and dynamic task reloading
Features:
- Usage tracking system (usage_tracker.py)
  - Tracks input/output tokens per API call
  - Calculates costs with support for cache pricing
  - Stores data in usage_data.json (gitignored)
  - Integrated into llm_interface.py

- Dynamic task scheduler reloading
  - Auto-detects YAML changes every 60s
  - No restart needed for new tasks
  - reload_tasks() method for manual refresh

- Example cost tracking scheduled task
  - Daily API usage report
  - Budget tracking ($5/month target)
  - Disabled by default in scheduled_tasks.yaml

Improvements:
- Fixed tool_use/tool_result pair splitting bug (CRITICAL)
- Added thread safety to agent.chat()
- Fixed N+1 query problem in hybrid search
- Optimized database batch queries
- Added conversation history pruning (50 messages max)

Updated .gitignore:
- Exclude user profiles (memory_workspace/users/*.md)
- Exclude usage data (usage_data.json)
- Exclude vector index (vectors.usearch)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-13 23:38:44 -07:00
a99799bf3d Initial commit: Ajarbot with optimizations
Features:
- Multi-platform bot (Slack, Telegram)
- Memory system with SQLite FTS
- Tool use capabilities (file ops, commands)
- Scheduled tasks system
- Dynamic model switching (/sonnet, /haiku)
- Prompt caching for cost optimization

Optimizations:
- Default to Haiku 4.5 (12x cheaper)
- Reduced context: 3 messages, 2 memory results
- Optimized SOUL.md (48% smaller)
- Automatic caching when using Sonnet (90% savings)

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
2026-02-13 19:06:28 -07:00