Jordan Ramos fe7c146dc6 feat: Add Gitea MCP integration and project cleanup
2026-02-18 20:31:32 -07:00


Jarvis Voice Integration Plan

Executive Summary

This document provides a comprehensive plan for adding ElevenLabs text-to-speech capabilities to Ajarbot, enabling Garvis to deliver occasional voice responses with a British AI assistant personality (the "Jarvis - Robot" voice). The integration follows the existing codebase patterns: an MCP tool with zero API cost when unused, a lazy-loaded client for the ElevenLabs API, platform-specific audio delivery via Telegram voice notes and Slack file uploads, and a character budget tracker to stay within the free tier's 10,000 characters per month.

Key decisions:

  • Architecture: Hybrid MCP tool (voice generation) + adapter-level audio delivery
  • Voice: ElevenLabs pre-made "Jarvis - Robot" (ID: WWtyH2oxeOp9yZwK8ERD)
  • Trigger model: Explicit user commands and optional LLM-driven autonomous voice for high-impact moments
  • Cost: Free tier (10,000 chars/month) -- sufficient for casual use (roughly 40-50 short voice messages)

Table of Contents

  1. Architecture Design
  2. Implementation Plan
  3. ElevenLabs Setup Guide
  4. Configuration
  5. File-by-File Changes
  6. Voice Trigger Logic
  7. Platform Delivery
  8. Cost Monitoring
  9. Testing Strategy
  10. Edge Cases and Error Handling
  11. Troubleshooting
  12. Future Enhancements

1. Architecture Design

1.1 Why Hybrid (MCP Tool + Adapter Extension)

The existing codebase uses two tool paradigms:

  • MCP tools (mcp_tools.py): Zero API cost when unused, registered via @tool decorator, run in-process
  • Traditional tools (tools.py): Google/weather tools requiring external API calls

Voice generation naturally splits into two concerns:

| Concern | Component | Rationale |
|---|---|---|
| Text-to-speech generation | MCP tool in mcp_tools.py | Follows the pattern of web_fetch: makes an external HTTP call but runs as an MCP tool. Zero cost when the tool is not invoked. Lazy-loads the ElevenLabs client. |
| Audio delivery to platform | Adapter-level method on BaseAdapter | Telegram needs send_voice() (OGG/Opus); Slack needs files_upload_v2() (MP3). The adapter already owns the platform connection, so a send_voice_message() method is the cleanest separation. |

1.2 Component Diagram

User says: "Garvis, say that in your voice"
        |
        v
  [Agent / LLM] -----> decides to use speak_text tool
        |
        v
  [MCP Tool: speak_text]
        |  1. Validates character budget
        |  2. Calls ElevenLabs TTS API
        |  3. Returns audio bytes + metadata
        v
  [AdapterRuntime._process_message]
        |  Detects voice attachment in response metadata
        |  Routes to adapter.send_voice_message()
        v
  [TelegramAdapter.send_voice_message]  or  [SlackAdapter.send_voice_message]
        |  Sends OGG voice note                  Uploads MP3 file snippet
        v
  User receives voice message in chat

1.3 Why NOT a Standalone Traditional Tool

Traditional tools in tools.py return plain strings. Voice requires returning binary audio data plus metadata (format, duration, character count). The MCP tool pattern supports structured return values and integrates naturally with the Agent SDK's tool execution pipeline. Additionally, the MCP tool is never loaded or called unless the LLM decides to use it, matching the "zero cost when unused" principle from SOUL.md.

1.4 Data Flow for Voice Responses

The voice tool follows a two-phase approach:

Phase 1 - Generation (MCP Tool):

  1. LLM calls speak_text tool with the text to speak
  2. Tool checks character budget (reject if would exceed monthly limit)
  3. Tool calls ElevenLabs API, receives MP3 audio bytes
  4. Tool saves audio to a temporary file (temp/voice/voice_{timestamp}.mp3)
  5. Tool returns success message with file path and metadata

Phase 2 - Delivery (Adapter Runtime):

  1. Agent's text response includes a voice marker: [VOICE:temp/voice/voice_12345.mp3]
  2. Runtime postprocessor detects the marker
  3. Runtime calls adapter.send_voice_message(channel_id, audio_path)
  4. Adapter sends platform-native voice message
  5. Temporary file is cleaned up

This two-phase approach avoids passing binary data through the LLM response chain and uses the existing postprocessor pattern from adapters/runtime.py.


2. Implementation Plan

2.1 Overview of Changes

| File | Change Type | Description |
|---|---|---|
| elevenlabs_client.py | NEW | ElevenLabs API client (TTS, usage tracking) |
| mcp_tools.py | MODIFY | Add speak_text MCP tool |
| adapters/base.py | MODIFY | Add send_voice_message() to BaseAdapter |
| adapters/telegram/adapter.py | MODIFY | Implement send_voice_message() using send_voice() |
| adapters/slack/adapter.py | MODIFY | Implement send_voice_message() using files_upload_v2() |
| adapters/runtime.py | MODIFY | Add voice postprocessor to detect and deliver voice messages |
| memory_workspace/SOUL.md | MODIFY | Add speak_text tool documentation and voice personality notes |
| llm_interface.py | MODIFY | Add speak_text to allowed_tools list |
| .env.example | MODIFY | Add ElevenLabs configuration variables |
| .gitignore | MODIFY | Add temp voice files |
| config/voice_preferences.yaml | NEW | Per-user voice preferences (optional) |

2.2 Dependencies

pip install elevenlabs    # Official Python SDK
pip install pydub         # Audio format conversion (MP3 -> OGG/Opus for Telegram)

Note: pydub requires ffmpeg installed on the system for OGG/Opus conversion. On Windows: choco install ffmpeg or download from https://ffmpeg.org/download.html.
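Because the ffmpeg dependency is easy to miss, a runtime preflight check can be useful. A minimal sketch (the helper name ogg_conversion_available is illustrative, not part of the codebase):

```python
import shutil


def ogg_conversion_available() -> bool:
    """Best-effort check that MP3 -> OGG/Opus conversion will work.

    pydub is a thin wrapper: it needs the ffmpeg binary on PATH to do the
    actual transcoding, so check both the import and the binary.
    """
    try:
        import pydub  # noqa: F401
    except ImportError:
        return False
    return shutil.which("ffmpeg") is not None
```

If this returns False, the Telegram adapter can still fall back to sending the MP3 directly (see section 5.5).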


3. ElevenLabs Setup Guide

3.1 Account Creation

  1. Go to https://elevenlabs.io and sign up for a free account
  2. Verify your email address
  3. Navigate to Profile → API Keys (click your avatar, top right)
  4. Copy your API key

3.2 Voice Selection

The pre-made "Jarvis - Robot" voice is ideal for this use case:

  • Voice ID: WWtyH2oxeOp9yZwK8ERD
  • Character: British, robotic, AI assistant personality
  • Quality: High quality even on free tier
  • No voice cloning needed: Pre-made voices are available immediately

To verify the voice ID or browse alternatives:

  1. Go to https://elevenlabs.io/voice-library
  2. Search for "Jarvis"
  3. Click the voice to preview it
  4. The voice ID is in the URL or available via API

3.3 Free Tier Limits

| Limit | Value |
|---|---|
| Characters per month | 10,000 |
| Max characters per request | 2,500 |
| Custom voices | 3 |
| Commercial use | No |
| Audio quality | Standard |
| Concurrent requests | 2 |

Budget math: At ~200 characters per voice message (average sentence), 10,000 chars allows roughly 50 voice messages per month -- more than enough for "here and there" casual use.
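The arithmetic behind that estimate:

```python
# Free-tier budget math from the plan above.
MONTHLY_CHARS = 10_000
AVG_MESSAGE_CHARS = 200  # a short spoken sentence or two

print(MONTHLY_CHARS // AVG_MESSAGE_CHARS)  # → 50 messages per month
```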

3.4 API Key Configuration

Add to .env:

# ElevenLabs Voice (Optional)
ELEVENLABS_API_KEY=your-api-key-here
ELEVENLABS_VOICE_ID=WWtyH2oxeOp9yZwK8ERD

4. Configuration

4.1 Environment Variables

# === ElevenLabs Voice Configuration ===

# API key from https://elevenlabs.io (required for voice features)
ELEVENLABS_API_KEY=your-api-key-here

# Voice ID - default: Jarvis - Robot (British AI assistant)
ELEVENLABS_VOICE_ID=WWtyH2oxeOp9yZwK8ERD

# Model ID - default: eleven_multilingual_v2 (best quality)
# Options: eleven_multilingual_v2, eleven_turbo_v2_5 (faster, lower quality)
ELEVENLABS_MODEL_ID=eleven_multilingual_v2

# Monthly character budget (default: 9000, leaves 1000 char buffer from 10k limit)
ELEVENLABS_MONTHLY_BUDGET=9000

# Output format (default: mp3_44100_128 - good quality, reasonable size)
ELEVENLABS_OUTPUT_FORMAT=mp3_44100_128

# Enable/disable voice features globally
ELEVENLABS_ENABLED=true

4.2 Per-User Voice Preferences (Optional)

File: config/voice_preferences.yaml

# Voice preferences per user
# Users can enable/disable voice responses and set preferences

defaults:
  voice_enabled: true
  voice_mode: "explicit"  # "explicit" = only on request, "auto" = LLM decides
  max_chars_per_message: 500  # Limit text length for voice to save budget

users:
  jordan:
    voice_enabled: true
    voice_mode: "auto"  # Garvis can decide when to use voice
    preferred_voice: "WWtyH2oxeOp9yZwK8ERD"  # Jarvis - Robot

4.3 Voice Trigger Modes

| Mode | Behavior | Configuration |
|---|---|---|
| explicit | Voice only when the user explicitly requests it (e.g., "say that out loud", "voice response") | Default; safest for budget |
| auto | LLM decides when voice adds value (greetings, dramatic moments, short quips) | Set voice_mode: auto for the user |
| disabled | No voice at all | Set voice_enabled: false |
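A sketch of how these preferences might resolve at runtime. Plain dicts stand in for the parsed YAML, and resolve_voice_prefs is a hypothetical helper, not an existing function:

```python
from typing import Any, Dict

# In-memory mirror of config/voice_preferences.yaml (illustrative).
PREFS = {
    "defaults": {
        "voice_enabled": True,
        "voice_mode": "explicit",
        "max_chars_per_message": 500,
    },
    "users": {
        "jordan": {"voice_enabled": True, "voice_mode": "auto"},
    },
}


def resolve_voice_prefs(user: str, prefs: Dict[str, Any] = PREFS) -> Dict[str, Any]:
    """Merge a user's overrides on top of the defaults block."""
    merged = dict(prefs.get("defaults", {}))
    merged.update(prefs.get("users", {}).get(user, {}))
    return merged


print(resolve_voice_prefs("jordan")["voice_mode"])   # → auto
print(resolve_voice_prefs("unknown")["voice_mode"])  # → explicit
```

Unknown users simply fall through to the defaults, so the "explicit" mode is the safe baseline.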

5. File-by-File Changes

5.1 NEW: elevenlabs_client.py

This module handles all ElevenLabs API interaction, usage tracking, and audio format management.

"""ElevenLabs TTS client for Garvis voice capabilities.

Handles text-to-speech generation, character budget tracking,
and audio format management. Lazy-loaded -- zero cost when unused.
"""

import json
import os
import time
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, Optional

import httpx

# Budget tracking file
_USAGE_FILE = Path("config/elevenlabs_usage.json")
_TEMP_DIR = Path("temp/voice")


class ElevenLabsClient:
    """ElevenLabs TTS client with budget tracking."""

    def __init__(self) -> None:
        self.api_key = os.getenv("ELEVENLABS_API_KEY", "")
        self.voice_id = os.getenv("ELEVENLABS_VOICE_ID", "WWtyH2oxeOp9yZwK8ERD")
        self.model_id = os.getenv("ELEVENLABS_MODEL_ID", "eleven_multilingual_v2")
        self.output_format = os.getenv("ELEVENLABS_OUTPUT_FORMAT", "mp3_44100_128")
        self.monthly_budget = int(os.getenv("ELEVENLABS_MONTHLY_BUDGET", "9000"))
        self.enabled = os.getenv("ELEVENLABS_ENABLED", "true").lower() == "true"

        self._base_url = "https://api.elevenlabs.io/v1"
        _TEMP_DIR.mkdir(parents=True, exist_ok=True)

    def is_available(self) -> bool:
        """Check if ElevenLabs is configured and enabled."""
        return bool(self.enabled and self.api_key)

    def get_remaining_budget(self) -> int:
        """Get remaining character budget for this month."""
        usage = self._load_usage()
        current_month = datetime.now().strftime("%Y-%m")

        if usage.get("month") != current_month:
            # New month, reset counter
            return self.monthly_budget

        return max(0, self.monthly_budget - usage.get("chars_used", 0))

    def text_to_speech(
        self,
        text: str,
        voice_id: Optional[str] = None,
        stability: float = 0.5,
        similarity_boost: float = 0.75,
        style: float = 0.0,
        speed: float = 1.0,
    ) -> Dict[str, Any]:
        """Convert text to speech using ElevenLabs API.

        Args:
            text: Text to convert (max 2500 chars on free tier)
            voice_id: Override default voice ID
            stability: Voice stability (0.0-1.0)
            similarity_boost: Voice clarity (0.0-1.0)
            style: Style exaggeration (0.0-1.0, costs more latency)
            speed: Speech speed (0.7-1.2)

        Returns:
            Dict with: success, audio_path, chars_used, remaining_budget, duration_ms
        """
        if not self.is_available():
            return {"success": False, "error": "ElevenLabs not configured or disabled"}

        # Validate input first, then check budget
        char_count = len(text)

        if char_count == 0:
            return {"success": False, "error": "No text provided"}

        if char_count > 2500:
            return {
                "success": False,
                "error": (
                    f"Text too long ({char_count} chars). "
                    f"Free tier limit is 2500 chars per request. "
                    f"Shorten the text or split into multiple requests."
                ),
            }

        # Budget check
        remaining = self.get_remaining_budget()
        if char_count > remaining:
            return {
                "success": False,
                "error": (
                    f"Insufficient character budget. "
                    f"Need {char_count} chars, only {remaining} remaining this month. "
                    f"Budget resets on the 1st."
                ),
            }

        # Call ElevenLabs API
        vid = voice_id or self.voice_id
        url = f"{self._base_url}/text-to-speech/{vid}"

        headers = {
            "xi-api-key": self.api_key,
            "Content-Type": "application/json",
            "Accept": "audio/mpeg",
        }

        payload = {
            "text": text,
            "model_id": self.model_id,
            "voice_settings": {
                "stability": stability,
                "similarity_boost": similarity_boost,
                "style": style,
                "speed": speed,
                "use_speaker_boost": True,
            },
        }

        try:
            start_time = time.time()

            with httpx.Client(timeout=30.0) as client:
                response = client.post(
                    url,
                    headers=headers,
                    json=payload,
                    params={"output_format": self.output_format},
                )
                response.raise_for_status()

            duration_ms = (time.time() - start_time) * 1000

            # Save audio to temp file
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            audio_path = _TEMP_DIR / f"voice_{timestamp}.mp3"
            audio_path.write_bytes(response.content)

            # Update usage tracking
            self._track_usage(char_count)

            return {
                "success": True,
                "audio_path": str(audio_path),
                "chars_used": char_count,
                "remaining_budget": self.get_remaining_budget(),
                "duration_ms": round(duration_ms),
                "audio_size_bytes": len(response.content),
            }

        except httpx.HTTPStatusError as e:
            if e.response.status_code == 401:
                return {"success": False, "error": "Invalid ElevenLabs API key"}
            elif e.response.status_code == 429:
                return {"success": False, "error": "Rate limited. Wait a moment and try again."}
            else:
                return {"success": False, "error": f"ElevenLabs API error: {e.response.status_code}"}
        except httpx.TimeoutException:
            return {"success": False, "error": "ElevenLabs API request timed out (30s)"}
        except Exception as e:
            return {"success": False, "error": f"Voice generation failed: {str(e)}"}

    def _load_usage(self) -> Dict:
        """Load usage data from file."""
        if not _USAGE_FILE.exists():
            return {"month": "", "chars_used": 0, "requests": 0, "history": []}

        try:
            return json.loads(_USAGE_FILE.read_text(encoding="utf-8"))
        except Exception:
            return {"month": "", "chars_used": 0, "requests": 0, "history": []}

    def _track_usage(self, chars: int) -> None:
        """Track character usage for budget monitoring."""
        usage = self._load_usage()
        current_month = datetime.now().strftime("%Y-%m")

        if usage.get("month") != current_month:
            # New month, archive old data and reset
            usage = {
                "month": current_month,
                "chars_used": 0,
                "requests": 0,
                "history": [],
            }

        usage["chars_used"] = usage.get("chars_used", 0) + chars
        usage["requests"] = usage.get("requests", 0) + 1
        usage["history"].append({
            "timestamp": datetime.now().isoformat(),
            "chars": chars,
        })

        # Keep history manageable (last 100 entries)
        if len(usage["history"]) > 100:
            usage["history"] = usage["history"][-100:]

        _USAGE_FILE.parent.mkdir(parents=True, exist_ok=True)
        _USAGE_FILE.write_text(
            json.dumps(usage, indent=2),
            encoding="utf-8",
        )

    def get_usage_report(self) -> str:
        """Get a formatted usage report."""
        usage = self._load_usage()
        current_month = datetime.now().strftime("%Y-%m")

        if usage.get("month") != current_month:
            return (
                f"Voice Usage ({current_month}):\n"
                f"  Characters used: 0 / {self.monthly_budget}\n"
                f"  Requests: 0\n"
                f"  Budget remaining: {self.monthly_budget} chars"
            )

        chars_used = usage.get("chars_used", 0)
        remaining = max(0, self.monthly_budget - chars_used)
        pct = (chars_used / self.monthly_budget * 100) if self.monthly_budget > 0 else 0

        report = (
            f"Voice Usage ({current_month}):\n"
            f"  Characters used: {chars_used:,} / {self.monthly_budget:,} ({pct:.1f}%)\n"
            f"  Requests: {usage.get('requests', 0)}\n"
            f"  Budget remaining: {remaining:,} chars"
        )

        if pct >= 100:
            report += "\n  BUDGET EXHAUSTED: Voice disabled until next month."
        elif pct >= 80:
            report += "\n  WARNING: Approaching monthly limit!"

        return report

    @staticmethod
    def cleanup_temp_files(max_age_hours: int = 1) -> int:
        """Remove temporary voice files older than max_age_hours.

        Returns number of files cleaned up.
        """
        if not _TEMP_DIR.exists():
            return 0

        cutoff = time.time() - (max_age_hours * 3600)
        cleaned = 0

        for audio_file in _TEMP_DIR.glob("voice_*.mp3"):
            try:
                if audio_file.stat().st_mtime < cutoff:
                    audio_file.unlink()
                    cleaned += 1
            except Exception:
                continue

        return cleaned
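The month-rollover behavior is the subtlest part of this client, so here is the same logic as a standalone pure function for illustration (remaining_budget mirrors get_remaining_budget, assuming the default 9,000-char budget):

```python
from typing import Dict


def remaining_budget(usage: Dict, current_month: str, monthly_budget: int = 9000) -> int:
    """A stale month means the counter has not rolled over yet: full budget."""
    if usage.get("month") != current_month:
        return monthly_budget
    return max(0, monthly_budget - usage.get("chars_used", 0))


usage = {"month": "2026-01", "chars_used": 8500}
print(remaining_budget(usage, "2026-01"))  # → 500
print(remaining_budget(usage, "2026-02"))  # → 9000 (new month resets)
```

Note the max(0, ...) guard: if the budget is lowered mid-month below what has already been used, the remaining budget reports 0 rather than a negative number.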

5.2 MODIFY: mcp_tools.py -- Add speak_text Tool

Add after the existing tool definitions, before the file_system_server creation:

# ============================================
# ElevenLabs Voice Tool (MCP)
# ============================================
# Lazy-loaded ElevenLabs client
_elevenlabs_client: Optional[Any] = None


def _get_elevenlabs_client():
    """Lazy-load ElevenLabs client when first needed."""
    global _elevenlabs_client
    if _elevenlabs_client is None:
        try:
            from elevenlabs_client import ElevenLabsClient
            _elevenlabs_client = ElevenLabsClient()
        except ImportError:
            return None
    return _elevenlabs_client


@tool(
    name="speak_text",
    description=(
        "Convert text to speech using Garvis's voice (British AI assistant). "
        "Use this to deliver important messages, greetings, witty remarks, or "
        "when the user explicitly asks for a voice response. The audio will be "
        "sent as a voice message on the user's platform (Telegram voice note "
        "or Slack audio). Keep text concise -- budget is limited to ~9000 "
        "chars/month (free tier). Returns a voice marker that the runtime "
        "will convert to an audio message."
    ),
    input_schema={
        "text": str,  # The text to speak (max 2500 chars)
    },
)
async def speak_text_tool(args: Dict[str, Any]) -> Dict[str, Any]:
    """Generate speech from text using ElevenLabs.

    Zero-cost MCP tool when not invoked. Calls ElevenLabs API only on use.
    Returns a voice marker for the runtime to process.
    """
    text = args.get("text", "").strip()

    if not text:
        return {
            "content": [{"type": "text", "text": "Error: No text provided for speech"}],
            "isError": True,
        }

    client = _get_elevenlabs_client()
    if not client:
        return {
            "content": [{
                "type": "text",
                "text": "Error: ElevenLabs not configured. Add ELEVENLABS_API_KEY to .env",
            }],
            "isError": True,
        }

    if not client.is_available():
        return {
            "content": [{
                "type": "text",
                "text": "Error: ElevenLabs voice is disabled or not configured",
            }],
            "isError": True,
        }

    # Check budget before calling API
    remaining = client.get_remaining_budget()
    if len(text) > remaining:
        return {
            "content": [{
                "type": "text",
                "text": (
                    f"Voice budget insufficient: need {len(text)} chars, "
                    f"only {remaining} remaining this month. "
                    f"Respond with text instead."
                ),
            }],
            "isError": True,
        }

    # Generate speech
    result = client.text_to_speech(text)

    if result["success"]:
        audio_path = result["audio_path"]
        chars_used = result["chars_used"]
        remaining_budget = result["remaining_budget"]

        # Return a voice marker that the runtime postprocessor will detect.
        # The marker embeds the audio file path so the runtime can find it.
        return {
            "content": [{
                "type": "text",
                "text": (
                    f"[VOICE:{audio_path}]\n\n"
                    f"Voice message generated ({chars_used} chars, "
                    f"{remaining_budget} remaining this month). "
                    f"The audio will be delivered as a voice message."
                ),
            }],
        }
    else:
        return {
            "content": [{
                "type": "text",
                "text": f"Voice generation failed: {result['error']}. Responding with text instead.",
            }],
            "isError": True,
        }

Then add speak_text_tool to the file_system_server tools list:

file_system_server = create_sdk_mcp_server(
    name="file_system",
    version="2.1.0",  # bump version
    tools=[
        # ... existing tools ...
        # Voice tool
        speak_text_tool,
    ]
)

5.3 MODIFY: llm_interface.py -- Add to Allowed Tools

In _build_agent_sdk_options(), add "speak_text" to the allowed_tools list:

allowed_tools = [
    # ... existing tools ...
    # Voice
    "speak_text",
]

5.4 MODIFY: adapters/base.py -- Add Voice Support

Add to AdapterCapabilities:

@dataclass
class AdapterCapabilities:
    supports_threads: bool = False
    supports_reactions: bool = False
    supports_media: bool = False
    supports_files: bool = False
    supports_markdown: bool = False
    supports_voice: bool = False  # NEW
    max_message_length: int = 2000
    chunking_strategy: Optional[str] = None

Add default method to BaseAdapter:

async def send_voice_message(
    self,
    channel_id: str,
    audio_path: str,
    reply_to_id: Optional[str] = None,
    thread_id: Optional[str] = None,
    caption: Optional[str] = None,
) -> Dict[str, Any]:
    """Send a voice/audio message to the platform. Optional.

    Args:
        channel_id: Target channel/chat ID
        audio_path: Path to the audio file (MP3)
        reply_to_id: Optional message to reply to
        thread_id: Optional thread to post in
        caption: Optional text caption with the voice message

    Returns:
        Dict with at least {"success": bool}
    """
    return {"success": False, "error": "Voice not supported on this platform"}

5.5 MODIFY: adapters/telegram/adapter.py -- Implement Voice Sending

Add to capabilities:

@property
def capabilities(self) -> AdapterCapabilities:
    return AdapterCapabilities(
        supports_threads=False,
        supports_reactions=True,
        supports_media=True,
        supports_files=True,
        supports_markdown=True,
        supports_voice=True,  # NEW
        max_message_length=4096,
        chunking_strategy="markdown",
    )

Add voice sending method:

async def send_voice_message(
    self,
    channel_id: str,
    audio_path: str,
    reply_to_id: Optional[str] = None,
    thread_id: Optional[str] = None,
    caption: Optional[str] = None,
) -> Dict[str, Any]:
    """Send a voice message to Telegram.

    Telegram voice notes require OGG/Opus format. If the source is MP3,
    we convert it using pydub + ffmpeg. Telegram also accepts MP3 directly
    via send_voice() since Bot API 6.0+.
    """
    if not self.bot:
        return {"success": False, "error": "Bot not started"}

    try:
        from pathlib import Path

        audio_file = Path(audio_path)
        if not audio_file.exists():
            return {"success": False, "error": f"Audio file not found: {audio_path}"}

        chat_id = int(channel_id)
        reply_id = int(reply_to_id) if reply_to_id else None

        # Attempt OGG/Opus conversion for native voice note display.
        # Falls back to sending MP3 directly if pydub/ffmpeg not available.
        ogg_path = None
        try:
            from pydub import AudioSegment
            audio = AudioSegment.from_mp3(str(audio_file))
            ogg_path = audio_file.with_suffix(".ogg")
            audio.export(str(ogg_path), format="ogg", codec="libopus")
            voice_file = ogg_path
        except Exception:
            # pydub or ffmpeg not available; send MP3 directly
            voice_file = audio_file

        with open(voice_file, "rb") as f:
            sent = await self.bot.send_voice(
                chat_id=chat_id,
                voice=f,
                caption=caption,
                reply_to_message_id=reply_id,
            )

        # Clean up temporary OGG file
        if ogg_path and ogg_path.exists():
            try:
                ogg_path.unlink()
            except Exception:
                pass

        return {
            "success": True,
            "message_id": sent.message_id,
            "chat_id": sent.chat_id,
        }

    except TelegramError as e:
        print(f"[Telegram] Error sending voice: {e}")
        return {"success": False, "error": str(e)}
    except Exception as e:
        print(f"[Telegram] Voice send error: {e}")
        return {"success": False, "error": str(e)}

5.6 MODIFY: adapters/slack/adapter.py -- Implement Voice Sending

Add to capabilities:

@property
def capabilities(self) -> AdapterCapabilities:
    return AdapterCapabilities(
        supports_threads=True,
        supports_reactions=True,
        supports_media=True,
        supports_files=True,
        supports_markdown=True,
        supports_voice=True,  # NEW
        max_message_length=4000,
        chunking_strategy="word",
    )

Add voice sending method:

async def send_voice_message(
    self,
    channel_id: str,
    audio_path: str,
    reply_to_id: Optional[str] = None,
    thread_id: Optional[str] = None,
    caption: Optional[str] = None,
) -> Dict[str, Any]:
    """Send a voice/audio file to Slack.

    Uses files_upload_v2 (the modern file upload method).
    Slack displays MP3 files with an inline audio player.
    """
    if not self.app:
        return {"success": False, "error": "Adapter not started"}

    try:
        from pathlib import Path

        audio_file = Path(audio_path)
        if not audio_file.exists():
            return {"success": False, "error": f"Audio file not found: {audio_path}"}

        result = await self.app.client.files_upload_v2(
            channel=channel_id,
            file=str(audio_file),
            filename=f"garvis_voice_{audio_file.stem}.mp3",
            title="Garvis Voice Message",
            initial_comment=caption or "",
            thread_ts=thread_id,
        )

        return {
            "success": True,
            "file_id": result.get("file", {}).get("id", "unknown"),
        }

    except SlackApiError as e:
        error_msg = e.response["error"]
        print(f"[Slack] Error sending voice: {error_msg}")
        return {"success": False, "error": error_msg}
    except Exception as e:
        print(f"[Slack] Voice send error: {e}")
        return {"success": False, "error": str(e)}

5.7 MODIFY: adapters/runtime.py -- Voice Postprocessor

Add a voice postprocessor that detects [VOICE:path] markers in the agent's response and triggers audio delivery. Add this function and register it:

import re
from pathlib import Path


def _extract_voice_markers(text: str) -> list:
    """Extract [VOICE:path] markers from text.

    Returns list of (marker_string, audio_path) tuples.
    """
    pattern = r'\[VOICE:(.*?)\]'
    matches = re.findall(pattern, text)
    return [(f"[VOICE:{path}]", path.strip()) for path in matches]


# In AdapterRuntime._process_message(), after the agent response is received
# and before sending the text response, add voice handling:

# --- Inside _process_message, after getting `response` from agent.chat() ---

# Handle voice markers in response
voice_markers = _extract_voice_markers(response)
if voice_markers and adapter and adapter.capabilities.supports_voice:
    for marker, audio_path in voice_markers:
        # Remove the marker from the text response
        response = response.replace(marker, "").strip()

        # Send the voice message
        voice_result = await adapter.send_voice_message(
            channel_id=message.channel_id,
            audio_path=audio_path,
            reply_to_id=(
                message.metadata.get("ts")
                or str(message.metadata.get("message_id", ""))
            ),
            thread_id=message.thread_id,
        )

        if voice_result.get("success"):
            print(
                f"[{message.platform.upper()}] Voice message sent "
                f"({Path(audio_path).stat().st_size} bytes)"
            )
        else:
            print(
                f"[{message.platform.upper()}] Voice send failed: "
                f"{voice_result.get('error')}"
            )

        # Clean up temp audio file
        try:
            Path(audio_path).unlink(missing_ok=True)
        except Exception:
            pass

# If stripping the markers left the text empty, the voice message WAS the
# response -- skip the text send rather than delivering an empty message.
if voice_markers and not response.strip():
    response = ""

# Continue with existing text send logic (only if response is non-empty)

5.8 MODIFY: memory_workspace/SOUL.md -- Document Voice Tool

Add to the "Available Tools" section:

### Voice (ElevenLabs - API Cost)
- speak_text (convert text to Garvis voice message, delivered as platform voice note)

**Voice Guidelines**:
- Use voice for: greetings, witty quips, important announcements, when user asks
- Keep voice messages SHORT (1-3 sentences, under 500 chars)
- Budget: ~9,000 chars/month -- be selective
- Always provide text alongside voice (accessibility)
- Voice personality: British AI assistant, dry wit, composed confidence
- Signature phrases in voice: "Right then", "Very good, sir", "I've taken the liberty of..."
- DO NOT use voice for: long explanations, code, lists, weather reports (waste of budget)

5.9 MODIFY: .env.example -- Add ElevenLabs Section

# ========================================
# ElevenLabs Voice (Optional)
# ========================================
# Enables Jarvis-style voice responses
# Sign up: https://elevenlabs.io (free tier: 10,000 chars/month)

# API Key (from Profile + API Key page)
ELEVENLABS_API_KEY=your-api-key-here

# Voice ID - Jarvis Robot (British AI assistant)
ELEVENLABS_VOICE_ID=WWtyH2oxeOp9yZwK8ERD

# Model - eleven_multilingual_v2 (best) or eleven_turbo_v2_5 (faster)
# ELEVENLABS_MODEL_ID=eleven_multilingual_v2

# Monthly character budget (default: 9000, buffer below 10k free limit)
# ELEVENLABS_MONTHLY_BUDGET=9000

# Enable/disable voice globally
# ELEVENLABS_ENABLED=true

5.10 MODIFY: .gitignore -- Add Voice Temp Files

# Voice temp files
temp/voice/
config/elevenlabs_usage.json

6. Voice Trigger Logic

6.1 When Garvis Should Use Voice

The LLM decides when to use the speak_text tool based on SOUL.md instructions. The key triggers:

Explicit triggers (always use voice):

  • "Say that out loud"
  • "Voice response please"
  • "Tell me in your voice"
  • "Speak to me, Garvis"
  • Any message containing "voice" + "say/tell/speak/read"

Auto triggers (when voice_mode: auto, LLM decides):

  • Morning greetings: "Good morning, sir. I trust you slept well."
  • Task completion announcements: "All done. Your calendar is updated."
  • Witty remarks / personality moments
  • Important alerts: "Sir, your budget has exceeded 75%."

Never voice (even in auto mode):

  • Code blocks or technical output
  • Long responses (> 500 chars)
  • Lists, tables, structured data
  • Weather reports (text is more useful)
  • When budget is low (< 1000 chars remaining)
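In practice the LLM makes the final call, but the explicit triggers above can be approximated with a small pre-check heuristic. This is an illustrative sketch -- the helper name `is_explicit_voice_request` and phrase list are assumptions, not part of the plan's required interface:

```python
import re

# Phrases drawn from the explicit-trigger list above (hypothetical helper)
EXPLICIT_PHRASES = ("say that out loud", "voice response", "speak to me")

def is_explicit_voice_request(text: str) -> bool:
    """True when the user explicitly asks for a spoken reply."""
    lowered = text.lower()
    if any(phrase in lowered for phrase in EXPLICIT_PHRASES):
        return True
    # "voice" combined with a speech verb, e.g. "tell me in your voice"
    return "voice" in lowered and bool(re.search(r"\b(say|tell|speak|read)\b", lowered))
```

Such a check is useful for the "100 - 1000 chars remaining" budget tier, where only explicit requests should be honored regardless of what the LLM decides.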

6.2 Dual Response Pattern

When using voice, Garvis should ALWAYS send both:

  1. Voice message: The spoken audio
  2. Text message: The same (or slightly different) text version

This ensures accessibility and searchability, and the response still lands even if voice delivery fails. The text accompanies the voice naturally -- Telegram shows it as a caption, Slack shows it as a message with the audio file.

Example agent response with voice:

[VOICE:temp/voice/voice_20260217_143022.mp3]

Good morning, sir. The weather in Centennial looks rather agreeable today -- 72 degrees with clear skies. I'd recommend that light jacket you've been neglecting.

The runtime strips the [VOICE:...] marker, sends the audio, then sends the remaining text as a regular message.


7. Platform Delivery

7.1 Telegram

| Aspect     | Details |
| ---------- | ------- |
| API Method | bot.send_voice() |
| Format     | OGG/Opus (converted from MP3 via pydub) or MP3 directly |
| Display    | Native voice note player (waveform visualization) |
| Max Size   | 50 MB |
| Caption    | Supported (text alongside voice note) |
| Duration   | Auto-detected from audio metadata |

User experience: The voice message appears as a playable waveform bubble in chat. The user taps to listen. It looks and feels like a standard Telegram voice message.
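A sketch of the Telegram send path. The wrapper name and signature are assumptions; `bot` is expected to behave like python-telegram-bot's async Bot, and `reply_to_message_id` reflects its v20-era parameter name:

```python
async def send_telegram_voice(bot, chat_id, audio_path: str,
                              caption: str = "", reply_to=None) -> dict:
    """Send synthesized audio as a native Telegram voice note."""
    try:
        with open(audio_path, "rb") as audio:
            await bot.send_voice(
                chat_id=chat_id,
                voice=audio,               # OGG/Opus renders as a waveform bubble
                caption=caption or None,   # optional text alongside the note
                reply_to_message_id=reply_to,
            )
        return {"success": True}
    except Exception as exc:
        return {"success": False, "error": str(exc)}
```

Injecting `bot` keeps this testable with a mock, matching the mocked-delivery tests in section 9.2.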

7.2 Slack

| Aspect     | Details |
| ---------- | ------- |
| API Method | files_upload_v2() |
| Format     | MP3 (no conversion needed) |
| Display    | Inline audio player with play button |
| Max Size   | Determined by workspace plan |
| Caption    | Via initial_comment parameter |
| Thread     | Supported via thread_ts |

User experience: The audio appears as an uploaded file with Slack's inline audio player. The user clicks play to listen. Less native-feeling than Telegram but functional.
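The Slack side mirrors the table above. A sketch with an injected `client` (assumed to behave like slack_sdk's WebClient); the wrapper name is hypothetical:

```python
def send_slack_voice(client, channel_id: str, audio_path: str,
                     caption: str = "", thread_ts=None) -> dict:
    """Upload the MP3 so Slack renders its inline audio player."""
    try:
        client.files_upload_v2(
            channel=channel_id,
            file=audio_path,
            title="Voice message",
            initial_comment=caption,   # caption shown with the upload
            thread_ts=thread_ts,       # reply in-thread when set
        )
        return {"success": True}
    except Exception as exc:
        return {"success": False, "error": str(exc)}
```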


8. Cost Monitoring

8.1 Character Budget System

The ElevenLabsClient tracks usage in config/elevenlabs_usage.json:

{
  "month": "2026-02",
  "chars_used": 2340,
  "requests": 12,
  "history": [
    {"timestamp": "2026-02-17T14:30:22", "chars": 180},
    {"timestamp": "2026-02-17T15:45:10", "chars": 220}
  ]
}

8.2 Budget Enforcement

| Chars Remaining | Behavior |
| --------------- | -------- |
| > 3000          | Normal operation |
| 1000 - 3000     | LLM gets usage warning in tool response |
| 100 - 1000      | Only explicit voice requests honored |
| 0               | Voice tool returns error, LLM responds with text only |
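A sketch of how the tiers might be enforced inside the tool. The function name is hypothetical, and since the table leaves the 0-100 band unspecified, this sketch treats it the same as 100 - 1000 (explicit requests only):

```python
def may_speak(chars_remaining: int, explicit_request: bool) -> tuple:
    """Map remaining budget to an enforcement tier; returns (allowed, note for the LLM)."""
    if chars_remaining <= 0:
        return False, "Voice budget exhausted; respond with text only."
    if chars_remaining < 1000:
        if explicit_request:
            return True, f"Low budget: {chars_remaining} chars left; explicit request honored."
        return False, "Low budget: only explicit voice requests are honored."
    if chars_remaining <= 3000:
        return True, f"Usage warning: {chars_remaining} chars remaining this month."
    return True, ""
```

Returning a note alongside the boolean lets the tool surface the "usage warning" tier in its response without blocking synthesis.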

8.3 Monthly Budget Reset

The budget resets automatically on the 1st of each month (detected by comparing usage.month with current month string).
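The track-and-reset logic can live in one small function. A sketch against the usage-file shape from section 8.1 -- the function name, injectable path, and return value are assumptions about how ElevenLabsClient might structure this:

```python
import json
from datetime import datetime
from pathlib import Path

USAGE_FILE = Path("config/elevenlabs_usage.json")  # path from section 8.1
MONTHLY_BUDGET = 9000

def record_usage(chars: int, usage_file: Path = USAGE_FILE) -> int:
    """Record one synthesis request, resetting on a new month; return chars remaining."""
    month = datetime.now().strftime("%Y-%m")
    usage = json.loads(usage_file.read_text()) if usage_file.exists() else {}
    if usage.get("month") != month:
        # New month (or first run): start a fresh budget window
        usage = {"month": month, "chars_used": 0, "requests": 0, "history": []}
    usage["chars_used"] += chars
    usage["requests"] += 1
    usage["history"].append(
        {"timestamp": datetime.now().isoformat(timespec="seconds"), "chars": chars}
    )
    usage_file.parent.mkdir(parents=True, exist_ok=True)
    usage_file.write_text(json.dumps(usage, indent=2))
    return MONTHLY_BUDGET - usage["chars_used"]
```

Comparing month strings (rather than timestamps) makes the reset trivially safe: deleting the file or editing the month both produce a clean window, as noted in the troubleshooting section.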

8.4 Integration with Daily Cost Report

The existing daily cost report scheduled task (daily-cost-report in scheduled_tasks.yaml) can be extended to include voice usage. The agent can read config/elevenlabs_usage.json using the read_file tool and include voice stats in the report.


9. Testing Strategy

9.1 Unit Tests

# test_elevenlabs.py

def test_budget_tracking():
    """Ensure character budget is tracked correctly."""
    client = ElevenLabsClient()
    # Reset usage file
    # Track 100 chars
    # Assert remaining = budget - 100

def test_budget_rejection():
    """Ensure over-budget requests are rejected."""
    # Set budget to 50
    # Attempt to speak 100 chars
    # Assert error returned

def test_monthly_reset():
    """Ensure budget resets on new month."""
    # Write usage with month = "2026-01"
    # Check remaining in "2026-02"
    # Assert full budget available

def test_text_too_long():
    """Ensure 2500 char per-request limit is enforced."""
    # Attempt to speak 3000 chars
    # Assert error about per-request limit

def test_empty_text():
    """Ensure empty text is rejected."""

def test_temp_file_cleanup():
    """Ensure old temp files are cleaned up."""

9.2 Integration Tests

def test_voice_marker_extraction():
    """Test [VOICE:path] marker parsing."""
    text = "Hello [VOICE:temp/voice/v1.mp3] world"
    markers = _extract_voice_markers(text)
    assert len(markers) == 1
    assert markers[0][1] == "temp/voice/v1.mp3"

def test_voice_marker_removal():
    """Test that markers are cleanly removed from text."""
    text = "[VOICE:temp/v.mp3]\n\nHello, sir."
    markers = _extract_voice_markers(text)
    clean = text.replace(markers[0][0], "").strip()
    assert clean == "Hello, sir."

def test_telegram_voice_send():
    """Test Telegram voice message delivery (mock)."""
    # Mock bot.send_voice
    # Call adapter.send_voice_message
    # Assert send_voice called with correct params

def test_slack_voice_send():
    """Test Slack audio file upload (mock)."""
    # Mock app.client.files_upload_v2
    # Call adapter.send_voice_message
    # Assert upload called with correct params

9.3 Manual Testing Checklist

  • Set ELEVENLABS_API_KEY in .env
  • Send "Garvis, say hello in your voice" via Telegram
  • Verify voice note appears in Telegram chat
  • Verify voice waveform is playable
  • Verify text response also appears alongside voice
  • Send "Speak to me" via Slack (if configured)
  • Verify audio file appears in Slack with player
  • Check config/elevenlabs_usage.json for correct tracking
  • Test budget exhaustion (set budget to 10, speak > 10 chars)
  • Verify graceful fallback to text when voice fails
  • Test with ELEVENLABS_ENABLED=false -- voice tool should return error
  • Test with missing API key -- voice tool should return error
  • Test text > 2500 chars -- should reject with clear message
  • Verify temp files are cleaned up after delivery
  • Test on slow network (API timeout handling)

10. Edge Cases and Error Handling

10.1 Error Scenarios

| Scenario | Handling |
| -------- | -------- |
| No API key configured | Tool returns error, LLM responds with text |
| Invalid API key | Tool returns clear error message |
| API rate limit (429) | Tool returns "wait and retry" message |
| API timeout | Tool returns timeout error after 30s |
| Audio conversion fails (no ffmpeg) | Send MP3 directly (Telegram supports it since Bot API 6.0) |
| Budget exhausted | Tool rejects with remaining chars info |
| Temp file missing at send time | Log error, send text-only response |
| Platform doesn't support voice | Marker stripped by runtime cleanup; text still sent |
| Large text (> 2500 chars) | Tool rejects, suggests shortening |
| Empty text | Tool rejects immediately |
| Network error during API call | Tool returns error, LLM falls back to text |
| Concurrent voice requests | Each gets its own timestamp-based temp file |
| Bot restart mid-voice | Orphaned temp files cleaned up by periodic cleanup |
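The three rejection rows (budget exhausted, large text, empty text) can be front-loaded into one validation pass before any API call is made. An illustrative sketch -- the helper name is hypothetical, and the 2,500-char limit matches the per-request cap stated elsewhere in this plan:

```python
MAX_CHARS_PER_REQUEST = 2500

def validate_speak_request(text: str, chars_remaining: int):
    """Return an error message matching the scenarios above, or None if OK to synthesize."""
    if not text.strip():
        return "Empty text; nothing to speak."
    if len(text) > MAX_CHARS_PER_REQUEST:
        return (f"Text is {len(text)} chars; per-request limit is "
                f"{MAX_CHARS_PER_REQUEST}. Please shorten.")
    if chars_remaining <= 0:
        return "Monthly voice budget exhausted; respond with text instead."
    return None
```

Returning a plain error string (rather than raising) keeps the failure inside the tool response, so the LLM can read it and fall back to text.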

10.2 Graceful Degradation

The system degrades gracefully at every level:

  1. No ElevenLabs configured: Tool returns error -> LLM uses text only
  2. Budget exhausted: Tool rejects -> LLM uses text only
  3. API failure: Tool returns error -> LLM uses text only
  4. Audio conversion fails: Send MP3 instead of OGG
  5. Platform doesn't support voice: Text response still delivered
  6. Voice file cleanup fails: No impact on user; files are small

10.3 Security Considerations

  • API key: Stored in .env (gitignored), never logged
  • Audio files: Temporary, auto-cleaned, stored in temp/voice/ (gitignored)
  • Usage data: config/elevenlabs_usage.json (gitignored) -- no sensitive data
  • User content: Text sent to ElevenLabs API for synthesis -- same privacy model as sending text to any external API. ElevenLabs has a zero-retention mode (enable_logging: false) that can be enabled for additional privacy.
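The "auto-cleaned" guarantee above implies a periodic sweep of temp/voice/. A minimal sketch, assuming a one-hour retention window and the voice_* filename prefix from the Quick Reference (both are assumptions):

```python
import time
from pathlib import Path

def cleanup_voice_temp(dir_path: str = "temp/voice", max_age_s: int = 3600) -> int:
    """Delete synthesized audio older than max_age_s; returns the count removed."""
    root = Path(dir_path)
    if not root.exists():
        return 0
    removed = 0
    now = time.time()
    for f in root.glob("voice_*"):
        if now - f.stat().st_mtime > max_age_s:
            try:
                f.unlink()
                removed += 1
            except OSError:
                pass  # another worker may have deleted it first
    return removed
```

This also covers the "bot restart mid-voice" scenario from section 10.1: orphaned files from a crashed send are swept on the next cleanup pass.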

11. Troubleshooting

11.1 Common Issues

"ElevenLabs not configured"

  • Ensure ELEVENLABS_API_KEY is set in .env
  • Ensure ELEVENLABS_ENABLED is not set to false
  • Restart the bot after changing .env

"Invalid ElevenLabs API key"

  • Check key at https://elevenlabs.io (Profile + API Key)
  • Ensure no trailing whitespace in .env
  • Free tier keys work fine; no paid plan needed

Voice note shows as file instead of voice player (Telegram)

  • Install ffmpeg: choco install ffmpeg (Windows) or apt install ffmpeg (Linux)
  • Install pydub: pip install pydub
  • Without these, the MP3 is sent directly; Telegram may display it as an audio file instead of a voice note

No sound / corrupted audio

  • Check ElevenLabs dashboard for the request in usage history
  • Try changing ELEVENLABS_OUTPUT_FORMAT to mp3_22050_32 (smaller, more compatible)
  • Verify the voice ID is correct: WWtyH2oxeOp9yZwK8ERD

Budget shows exhausted but it's a new month

  • Delete config/elevenlabs_usage.json -- it will be recreated
  • The auto-reset checks the month string; manual deletion is safe

Voice works in Telegram but not Slack

  • Ensure Slack bot has files:write scope
  • Check Slack workspace file upload limits

11.2 Diagnostic Commands

You can ask Garvis directly:

  • "What's your voice budget status?" -- reads elevenlabs_usage.json
  • "Test your voice" -- triggers a short speak_text
  • "Disable voice" -- edit preferences

12. Future Enhancements

12.1 Short Term (After Initial Integration)

  • Voice-to-text (STT): Accept Telegram voice messages as input using ElevenLabs Speech-to-Text API or Whisper. User sends voice -> Garvis transcribes -> processes as text.
  • Voice preference commands: /voice on, /voice off, /voice status as Telegram commands.
  • Smart budget allocation: Reserve 20% of budget for the last week of the month.
  • Audio caching: Cache frequently spoken phrases (greetings, confirmations) to save API calls.

12.2 Medium Term

  • Custom voice cloning: Clone a custom Jarvis-like voice using ElevenLabs voice cloning (requires Starter plan at $5/month). Train on MCU JARVIS audio clips for closer personality match.
  • Scheduled voice messages: Morning briefing delivered as voice note instead of text. "Good morning, sir. Today's forecast calls for..."
  • Emotional voice modulation: Adjust stability and style parameters based on message tone (urgent = higher stability, witty = lower stability + more style).
  • Multi-language support: Use language_code parameter for occasional non-English responses.

12.3 Long Term

  • Real-time voice conversations: ElevenLabs Conversational AI SDK for live voice chat via Telegram voice calls.
  • Voice-based authentication: Recognize Jordan's voice vs. other users.
  • Ambient audio: Background music or sound effects for dramatic effect (Iron Man suit sounds).
  • Voice journal: Daily summary delivered as a podcast-style voice recording.

Implementation Order

For a smooth rollout, implement in this order:

  1. Create elevenlabs_client.py -- standalone, testable, no dependencies on existing code
  2. Add speak_text MCP tool to mcp_tools.py -- register the tool
  3. Add speak_text to llm_interface.py allowed tools -- make it discoverable
  4. Add send_voice_message() to adapters/base.py -- base interface
  5. Implement Telegram voice in adapters/telegram/adapter.py -- primary platform
  6. Add voice postprocessor to adapters/runtime.py -- wire up delivery
  7. Update SOUL.md -- teach Garvis when/how to use voice
  8. Update .env.example and .gitignore -- configuration
  9. Test end-to-end on Telegram
  10. Implement Slack voice in adapters/slack/adapter.py -- secondary platform
  11. Add budget monitoring to daily report -- observability

Estimated effort: 3-4 hours for core implementation, 1-2 hours for testing.


Quick Reference Card

Tool:          speak_text
Input:         { "text": "Hello, sir." }
API:           ElevenLabs TTS v1
Voice:         Jarvis - Robot (WWtyH2oxeOp9yZwK8ERD)
Model:         eleven_multilingual_v2
Format:        MP3 -> OGG/Opus (Telegram) or MP3 (Slack)
Budget:        9,000 chars/month (free tier: 10,000)
Max/request:   2,500 chars
Temp files:    temp/voice/voice_*.mp3
Usage file:    config/elevenlabs_usage.json
Delivery:      [VOICE:path] marker -> runtime postprocessor -> adapter.send_voice_message()
Fallback:      Text always available; voice is an enhancement