## New Features
- **Gitea MCP Tools** (zero API cost):
  - gitea_read_file: Read files from homelab repo
  - gitea_list_files: Browse directories
  - gitea_search_code: Search by filename
  - gitea_get_tree: Get directory tree
- **Gitea Client** (gitea_tools/client.py): REST API wrapper with OAuth
- **Proxmox SSH Scripts** (scripts/): Homelab data collection utilities
- **Obsidian MCP Support** (obsidian_mcp.py): Advanced vault operations
- **Voice Integration Plan** (JARVIS_VOICE_INTEGRATION_PLAN.md)

## Improvements
- **Increased timeout**: 5 min → 10 min for complex tasks (llm_interface.py)
- **Removed direct API fallback**: Gitea tools are MCP-only (zero cost)
- **Updated .env.example**: Added Obsidian MCP configuration
- **Enhanced .gitignore**: Protect personal memory files (SOUL.md, MEMORY.md)

## Cleanup
- Deleted 24 obsolete files (temp/test/experimental scripts, outdated docs)
- Untracked personal memory files (SOUL.md, MEMORY.md now in .gitignore)
- Removed: AGENT_SDK_IMPLEMENTATION.md, HYBRID_SEARCH_SUMMARY.md, IMPLEMENTATION_SUMMARY.md, MIGRATION.md, test_agent_sdk.py, etc.

## Configuration
- Added config/gitea_config.example.yaml (Gitea setup template)
- Added config/obsidian_mcp.example.yaml (Obsidian MCP template)
- Updated scheduled_tasks.yaml with new task examples

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>

---
# Jarvis Voice Integration Plan

## Executive Summary

This document provides a comprehensive plan for adding ElevenLabs text-to-speech capabilities to Ajarbot, enabling Garvis to deliver occasional voice responses with a British AI assistant personality (the "Jarvis - Robot" voice). The integration follows the existing codebase patterns: an MCP tool for zero-cost routing when unused, a lazy-loaded client for the ElevenLabs API, platform-specific audio delivery via Telegram voice notes and Slack file uploads, and a character budget tracker to stay within the free tier's 10,000 characters per month.
**Key decisions**:
- **Architecture**: Hybrid MCP tool (voice generation) + adapter-level audio delivery
- **Voice**: ElevenLabs pre-made "Jarvis - Robot" (ID: `WWtyH2oxeOp9yZwK8ERD`)
- **Trigger model**: Explicit user commands and optional LLM-driven autonomous voice for high-impact moments
- **Cost**: Free tier (10,000 chars/month) -- sufficient for casual use (roughly 40-50 short voice messages)
## Table of Contents
- Architecture Design
- Implementation Plan
- ElevenLabs Setup Guide
- Configuration
- File-by-File Changes
- Voice Trigger Logic
- Platform Delivery
- Cost Monitoring
- Testing Strategy
- Edge Cases and Error Handling
- Troubleshooting
- Future Enhancements
## 1. Architecture Design

### 1.1 Why Hybrid (MCP Tool + Adapter Extension)

The existing codebase uses two tool paradigms:
- **MCP tools** (`mcp_tools.py`): Zero API cost when unused, registered via the `@tool` decorator, run in-process
- **Traditional tools** (`tools.py`): Google/weather tools requiring external API calls
Voice generation naturally splits into two concerns:
| Concern | Component | Rationale |
|---|---|---|
| Text-to-speech generation | MCP tool in `mcp_tools.py` | Follows the pattern of `web_fetch` -- makes an external HTTP call but runs as an MCP tool. Zero cost when the tool is not invoked. Lazy-loads the ElevenLabs client. |
| Audio delivery to platform | Adapter-level method on `BaseAdapter` | Telegram needs `send_voice()` (OGG/Opus), Slack needs `files_upload_v2()` (MP3). The adapter already owns the platform connection. Adding a `send_voice_message()` method is the cleanest separation. |
### 1.2 Component Diagram

```
User says: "Garvis, say that in your voice"
        |
        v
[Agent / LLM] -----> decides to use speak_text tool
        |
        v
[MCP Tool: speak_text]
        | 1. Validates character budget
        | 2. Calls ElevenLabs TTS API
        | 3. Returns audio bytes + metadata
        v
[AdapterRuntime._process_message]
        | Detects [VOICE:...] marker in the response text
        | Routes to adapter.send_voice_message()
        v
[TelegramAdapter.send_voice_message]   or   [SlackAdapter.send_voice_message]
        | Sends OGG voice note                | Uploads MP3 file
        v
User receives voice message in chat
```
### 1.3 Why NOT a Standalone Traditional Tool

Traditional tools in `tools.py` return plain strings. Voice requires returning binary audio data plus metadata (format, duration, character count). The MCP tool pattern supports structured return values and integrates naturally with the Agent SDK's tool execution pipeline. Additionally, the MCP tool is never loaded or called unless the LLM decides to use it, matching the "zero cost when unused" principle from SOUL.md.
### 1.4 Data Flow for Voice Responses

The voice tool follows a two-phase approach:

**Phase 1 - Generation (MCP Tool)**:
1. LLM calls the `speak_text` tool with the text to speak
2. Tool checks the character budget (rejects if the request would exceed the monthly limit)
3. Tool calls the ElevenLabs API, receives MP3 audio bytes
4. Tool saves the audio to a temporary file (`temp/voice_{timestamp}.mp3`)
5. Tool returns a success message with the file path and metadata

**Phase 2 - Delivery (Adapter Runtime)**:
1. Agent's text response includes a voice marker: `[VOICE: temp/voice_12345.mp3]`
2. Runtime postprocessor detects the marker
3. Runtime calls `adapter.send_voice_message(channel_id, audio_path)`
4. Adapter sends a platform-native voice message
5. Temporary file is cleaned up

This two-phase approach avoids passing binary data through the LLM response chain and uses the existing postprocessor pattern from `adapters/runtime.py`.
## 2. Implementation Plan

### 2.1 Overview of Changes

| File | Change Type | Description |
|---|---|---|
| `elevenlabs_client.py` | NEW | ElevenLabs API client (TTS, usage tracking) |
| `mcp_tools.py` | MODIFY | Add `speak_text` MCP tool |
| `adapters/base.py` | MODIFY | Add `send_voice_message()` to `BaseAdapter` |
| `adapters/telegram/adapter.py` | MODIFY | Implement `send_voice_message()` using `send_voice()` |
| `adapters/slack/adapter.py` | MODIFY | Implement `send_voice_message()` using `files_upload_v2()` |
| `adapters/runtime.py` | MODIFY | Add voice postprocessor to detect and deliver voice messages |
| `memory_workspace/SOUL.md` | MODIFY | Add `speak_text` tool documentation and voice personality notes |
| `llm_interface.py` | MODIFY | Add `speak_text` to the `allowed_tools` list |
| `.env.example` | MODIFY | Add ElevenLabs configuration variables |
| `.gitignore` | MODIFY | Add temp voice files |
| `config/voice_preferences.yaml` | NEW | Per-user voice preferences (optional) |
### 2.2 Dependencies

```bash
pip install elevenlabs  # Official Python SDK
pip install pydub       # Audio format conversion (MP3 -> OGG/Opus for Telegram)
```

Note: `pydub` requires `ffmpeg` installed on the system for OGG/Opus conversion. On Windows: `choco install ffmpeg` or download from https://ffmpeg.org/download.html.
## 3. ElevenLabs Setup Guide

### 3.1 Account Creation

1. Go to https://elevenlabs.io and sign up for a free account
2. Verify your email address
3. Navigate to Profile + API Key (click your avatar, top right)
4. Copy your API key
### 3.2 Voice Selection

The pre-made "Jarvis - Robot" voice is ideal for this use case:
- **Voice ID**: `WWtyH2oxeOp9yZwK8ERD`
- **Character**: British, robotic, AI assistant personality
- **Quality**: High quality even on the free tier
- **No voice cloning needed**: Pre-made voices are available immediately

To verify the voice ID or browse alternatives:
1. Go to https://elevenlabs.io/voice-library
2. Search for "Jarvis"
3. Click the voice to preview it
4. The voice ID is in the URL or available via the API
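The API lookup mentioned in the last step could be sketched as follows. The `GET /v1/voices` endpoint and `xi-api-key` header come from the public ElevenLabs REST API; the helper names here are illustrative, not part of the planned codebase.

```python
# Sketch: look up a voice ID by name via the ElevenLabs voices endpoint.
import os
from typing import Optional


def find_voice_id(voices: list, name: str) -> Optional[str]:
    """Return the voice_id of the first voice whose name contains `name`."""
    for voice in voices:
        if name.lower() in voice.get("name", "").lower():
            return voice.get("voice_id")
    return None


def lookup_voice(name: str) -> Optional[str]:
    """Fetch the account's voice list and search it by name."""
    import httpx  # same HTTP library the rest of this plan uses

    resp = httpx.get(
        "https://api.elevenlabs.io/v1/voices",
        headers={"xi-api-key": os.environ["ELEVENLABS_API_KEY"]},
        timeout=15.0,
    )
    resp.raise_for_status()
    return find_voice_id(resp.json().get("voices", []), name)
```

Calling `lookup_voice("Jarvis")` with a valid key should return the ID shown above if the voice is in your library.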
### 3.3 Free Tier Limits
| Limit | Value |
|---|---|
| Characters per month | 10,000 |
| Max characters per request | 2,500 |
| Custom voices | 3 |
| Commercial use | No |
| Audio quality | Standard |
| Concurrent requests | 2 |
Budget math: At ~200 characters per voice message (average sentence), 10,000 chars allows roughly 50 voice messages per month -- more than enough for "here and there" casual use.
### 3.4 API Key Configuration

Add to `.env`:

```bash
# ElevenLabs Voice (Optional)
ELEVENLABS_API_KEY=your-api-key-here
ELEVENLABS_VOICE_ID=WWtyH2oxeOp9yZwK8ERD
```
## 4. Configuration

### 4.1 Environment Variables

```bash
# === ElevenLabs Voice Configuration ===

# API key from https://elevenlabs.io (required for voice features)
ELEVENLABS_API_KEY=your-api-key-here

# Voice ID - default: Jarvis - Robot (British AI assistant)
ELEVENLABS_VOICE_ID=WWtyH2oxeOp9yZwK8ERD

# Model ID - default: eleven_multilingual_v2 (best quality)
# Options: eleven_multilingual_v2, eleven_turbo_v2_5 (faster, lower quality)
ELEVENLABS_MODEL_ID=eleven_multilingual_v2

# Monthly character budget (default: 9000, leaves a 1000-char buffer below the 10k limit)
ELEVENLABS_MONTHLY_BUDGET=9000

# Output format (default: mp3_44100_128 - good quality, reasonable size)
ELEVENLABS_OUTPUT_FORMAT=mp3_44100_128

# Enable/disable voice features globally
ELEVENLABS_ENABLED=true
```
### 4.2 Per-User Voice Preferences (Optional)

File: `config/voice_preferences.yaml`

```yaml
# Voice preferences per user
# Users can enable/disable voice responses and set preferences
defaults:
  voice_enabled: true
  voice_mode: "explicit"       # "explicit" = only on request, "auto" = LLM decides
  max_chars_per_message: 500   # Limit text length for voice to save budget

users:
  jordan:
    voice_enabled: true
    voice_mode: "auto"         # Garvis can decide when to use voice
    preferred_voice: "WWtyH2oxeOp9yZwK8ERD"  # Jarvis - Robot
```
### 4.3 Voice Trigger Modes

| Mode | Behavior | Configuration |
|---|---|---|
| `explicit` | Voice only when the user explicitly requests it (e.g., "say that out loud", "voice response") | Default, safest for budget |
| `auto` | LLM decides when voice adds value (greetings, dramatic moments, short quips) | Set `voice_mode: auto` for the user |
| `disabled` | No voice at all | Set `voice_enabled: false` |
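One way the runtime could resolve these preferences is to merge a user's block onto the `defaults` block. This is a hypothetical sketch; the function name is not part of the existing codebase.

```python
# Hypothetical helper: merge per-user overrides onto the defaults block
# of config/voice_preferences.yaml (already parsed into a dict).
from typing import Any, Dict


def resolve_voice_prefs(config: Dict[str, Any], user: str) -> Dict[str, Any]:
    """Return defaults with the user's overrides applied on top."""
    prefs = dict(config.get("defaults", {}))
    prefs.update(config.get("users", {}).get(user, {}))
    return prefs


# Mirrors the example YAML above
config = {
    "defaults": {"voice_enabled": True, "voice_mode": "explicit", "max_chars_per_message": 500},
    "users": {"jordan": {"voice_enabled": True, "voice_mode": "auto"}},
}
```

With this merge, `jordan` gets `voice_mode: auto` while still inheriting `max_chars_per_message: 500`; unknown users fall back to the defaults.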
## 5. File-by-File Changes
### 5.1 NEW: `elevenlabs_client.py`

This module handles all ElevenLabs API interaction, usage tracking, and audio format management.

```python
"""ElevenLabs TTS client for Garvis voice capabilities.

Handles text-to-speech generation, character budget tracking,
and audio format management. Lazy-loaded -- zero cost when unused.
"""

import json
import os
import time
from datetime import datetime
from pathlib import Path
from typing import Any, Dict, Optional

import httpx

# Budget tracking file
_USAGE_FILE = Path("config/elevenlabs_usage.json")
_TEMP_DIR = Path("temp/voice")


class ElevenLabsClient:
    """ElevenLabs TTS client with budget tracking."""

    def __init__(self) -> None:
        self.api_key = os.getenv("ELEVENLABS_API_KEY", "")
        self.voice_id = os.getenv("ELEVENLABS_VOICE_ID", "WWtyH2oxeOp9yZwK8ERD")
        self.model_id = os.getenv("ELEVENLABS_MODEL_ID", "eleven_multilingual_v2")
        self.output_format = os.getenv("ELEVENLABS_OUTPUT_FORMAT", "mp3_44100_128")
        self.monthly_budget = int(os.getenv("ELEVENLABS_MONTHLY_BUDGET", "9000"))
        self.enabled = os.getenv("ELEVENLABS_ENABLED", "true").lower() == "true"
        self._base_url = "https://api.elevenlabs.io/v1"
        _TEMP_DIR.mkdir(parents=True, exist_ok=True)

    def is_available(self) -> bool:
        """Check if ElevenLabs is configured and enabled."""
        return bool(self.enabled and self.api_key)

    def get_remaining_budget(self) -> int:
        """Get the remaining character budget for this month."""
        usage = self._load_usage()
        current_month = datetime.now().strftime("%Y-%m")
        if usage.get("month") != current_month:
            # New month, reset counter
            return self.monthly_budget
        return max(0, self.monthly_budget - usage.get("chars_used", 0))

    def text_to_speech(
        self,
        text: str,
        voice_id: Optional[str] = None,
        stability: float = 0.5,
        similarity_boost: float = 0.75,
        style: float = 0.0,
        speed: float = 1.0,
    ) -> Dict[str, Any]:
        """Convert text to speech using the ElevenLabs API.

        Args:
            text: Text to convert (max 2500 chars on free tier)
            voice_id: Override default voice ID
            stability: Voice stability (0.0-1.0)
            similarity_boost: Voice clarity (0.0-1.0)
            style: Style exaggeration (0.0-1.0, adds latency)
            speed: Speech speed (0.7-1.2)

        Returns:
            Dict with: success, audio_path, chars_used, remaining_budget, duration_ms
        """
        if not self.is_available():
            return {"success": False, "error": "ElevenLabs not configured or disabled"}

        # Input and budget checks
        char_count = len(text)
        if char_count == 0:
            return {"success": False, "error": "No text provided"}
        remaining = self.get_remaining_budget()
        if char_count > remaining:
            return {
                "success": False,
                "error": (
                    f"Insufficient character budget. "
                    f"Need {char_count} chars, only {remaining} remaining this month. "
                    f"Budget resets on the 1st."
                ),
            }
        if char_count > 2500:
            return {
                "success": False,
                "error": (
                    f"Text too long ({char_count} chars). "
                    f"Free tier limit is 2500 chars per request. "
                    f"Shorten the text or split into multiple requests."
                ),
            }

        # Call ElevenLabs API
        vid = voice_id or self.voice_id
        url = f"{self._base_url}/text-to-speech/{vid}"
        headers = {
            "xi-api-key": self.api_key,
            "Content-Type": "application/json",
            "Accept": "audio/mpeg",
        }
        payload = {
            "text": text,
            "model_id": self.model_id,
            "voice_settings": {
                "stability": stability,
                "similarity_boost": similarity_boost,
                "style": style,
                "speed": speed,
                "use_speaker_boost": True,
            },
        }
        try:
            start_time = time.time()
            with httpx.Client(timeout=30.0) as client:
                response = client.post(
                    url,
                    headers=headers,
                    json=payload,
                    params={"output_format": self.output_format},
                )
                response.raise_for_status()
            duration_ms = (time.time() - start_time) * 1000

            # Save audio to a temp file
            timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")
            audio_path = _TEMP_DIR / f"voice_{timestamp}.mp3"
            audio_path.write_bytes(response.content)

            # Update usage tracking
            self._track_usage(char_count)

            return {
                "success": True,
                "audio_path": str(audio_path),
                "chars_used": char_count,
                "remaining_budget": self.get_remaining_budget(),
                "duration_ms": round(duration_ms),
                "audio_size_bytes": len(response.content),
            }
        except httpx.HTTPStatusError as e:
            if e.response.status_code == 401:
                return {"success": False, "error": "Invalid ElevenLabs API key"}
            elif e.response.status_code == 429:
                return {"success": False, "error": "Rate limited. Wait a moment and try again."}
            else:
                return {"success": False, "error": f"ElevenLabs API error: {e.response.status_code}"}
        except httpx.TimeoutException:
            return {"success": False, "error": "ElevenLabs API request timed out (30s)"}
        except Exception as e:
            return {"success": False, "error": f"Voice generation failed: {str(e)}"}

    def _load_usage(self) -> Dict:
        """Load usage data from file."""
        if not _USAGE_FILE.exists():
            return {"month": "", "chars_used": 0, "requests": 0, "history": []}
        try:
            return json.loads(_USAGE_FILE.read_text(encoding="utf-8"))
        except Exception:
            return {"month": "", "chars_used": 0, "requests": 0, "history": []}

    def _track_usage(self, chars: int) -> None:
        """Track character usage for budget monitoring."""
        usage = self._load_usage()
        current_month = datetime.now().strftime("%Y-%m")
        if usage.get("month") != current_month:
            # New month: reset counters
            usage = {
                "month": current_month,
                "chars_used": 0,
                "requests": 0,
                "history": [],
            }
        usage["chars_used"] = usage.get("chars_used", 0) + chars
        usage["requests"] = usage.get("requests", 0) + 1
        usage["history"].append({
            "timestamp": datetime.now().isoformat(),
            "chars": chars,
        })
        # Keep history manageable (last 100 entries)
        if len(usage["history"]) > 100:
            usage["history"] = usage["history"][-100:]
        _USAGE_FILE.parent.mkdir(parents=True, exist_ok=True)
        _USAGE_FILE.write_text(
            json.dumps(usage, indent=2),
            encoding="utf-8",
        )

    def get_usage_report(self) -> str:
        """Get a formatted usage report."""
        usage = self._load_usage()
        current_month = datetime.now().strftime("%Y-%m")
        if usage.get("month") != current_month:
            return (
                f"Voice Usage ({current_month}):\n"
                f"  Characters used: 0 / {self.monthly_budget}\n"
                f"  Requests: 0\n"
                f"  Budget remaining: {self.monthly_budget} chars"
            )
        chars_used = usage.get("chars_used", 0)
        remaining = max(0, self.monthly_budget - chars_used)
        pct = (chars_used / self.monthly_budget * 100) if self.monthly_budget > 0 else 0
        report = (
            f"Voice Usage ({current_month}):\n"
            f"  Characters used: {chars_used:,} / {self.monthly_budget:,} ({pct:.1f}%)\n"
            f"  Requests: {usage.get('requests', 0)}\n"
            f"  Budget remaining: {remaining:,} chars"
        )
        # Check the exhausted case first so it is not shadowed by the 80% warning
        if pct >= 100:
            report += "\n  BUDGET EXHAUSTED: Voice disabled until next month."
        elif pct >= 80:
            report += "\n  WARNING: Approaching monthly limit!"
        return report

    @staticmethod
    def cleanup_temp_files(max_age_hours: int = 1) -> int:
        """Remove temporary voice files older than max_age_hours.

        Returns the number of files cleaned up.
        """
        if not _TEMP_DIR.exists():
            return 0
        cutoff = time.time() - (max_age_hours * 3600)
        cleaned = 0
        for audio_file in _TEMP_DIR.glob("voice_*.mp3"):
            try:
                if audio_file.stat().st_mtime < cutoff:
                    audio_file.unlink()
                    cleaned += 1
            except Exception:
                continue
        return cleaned
```
### 5.2 MODIFY: `mcp_tools.py` -- Add `speak_text` Tool

Add after the existing tool definitions, before the `file_system_server` creation:

```python
# ============================================
# ElevenLabs Voice Tool (MCP)
# ============================================

# Lazy-loaded ElevenLabs client
_elevenlabs_client: Optional[Any] = None


def _get_elevenlabs_client():
    """Lazy-load the ElevenLabs client when first needed."""
    global _elevenlabs_client
    if _elevenlabs_client is None:
        try:
            from elevenlabs_client import ElevenLabsClient
            _elevenlabs_client = ElevenLabsClient()
        except ImportError:
            return None
    return _elevenlabs_client


@tool(
    name="speak_text",
    description=(
        "Convert text to speech using Garvis's voice (British AI assistant). "
        "Use this to deliver important messages, greetings, witty remarks, or "
        "when the user explicitly asks for a voice response. The audio will be "
        "sent as a voice message on the user's platform (Telegram voice note "
        "or Slack audio). Keep text concise -- budget is limited to ~9000 "
        "chars/month (free tier). Returns a voice marker that the runtime "
        "will convert to an audio message."
    ),
    input_schema={
        "text": str,  # The text to speak (max 2500 chars)
    },
)
async def speak_text_tool(args: Dict[str, Any]) -> Dict[str, Any]:
    """Generate speech from text using ElevenLabs.

    Zero-cost MCP tool when not invoked. Calls the ElevenLabs API only on use.
    Returns a voice marker for the runtime to process.
    """
    text = args.get("text", "").strip()
    if not text:
        return {
            "content": [{"type": "text", "text": "Error: No text provided for speech"}],
            "isError": True,
        }

    client = _get_elevenlabs_client()
    if not client:
        return {
            "content": [{
                "type": "text",
                "text": "Error: ElevenLabs not configured. Add ELEVENLABS_API_KEY to .env",
            }],
            "isError": True,
        }
    if not client.is_available():
        return {
            "content": [{
                "type": "text",
                "text": "Error: ElevenLabs voice is disabled or not configured",
            }],
            "isError": True,
        }

    # Check budget before calling the API
    remaining = client.get_remaining_budget()
    if len(text) > remaining:
        return {
            "content": [{
                "type": "text",
                "text": (
                    f"Voice budget insufficient: need {len(text)} chars, "
                    f"only {remaining} remaining this month. "
                    f"Respond with text instead."
                ),
            }],
            "isError": True,
        }

    # Generate speech
    result = client.text_to_speech(text)
    if result["success"]:
        audio_path = result["audio_path"]
        chars_used = result["chars_used"]
        remaining_budget = result["remaining_budget"]
        # Return a voice marker that the runtime postprocessor will detect.
        # The marker embeds the audio file path so the runtime can find it.
        return {
            "content": [{
                "type": "text",
                "text": (
                    f"[VOICE:{audio_path}]\n\n"
                    f"Voice message generated ({chars_used} chars, "
                    f"{remaining_budget} remaining this month). "
                    f"The audio will be delivered as a voice message."
                ),
            }],
        }
    else:
        return {
            "content": [{
                "type": "text",
                "text": f"Voice generation failed: {result['error']}. Responding with text instead.",
            }],
            "isError": True,
        }
```

Then add `speak_text_tool` to the `file_system_server` tools list:

```python
file_system_server = create_sdk_mcp_server(
    name="file_system",
    version="2.1.0",  # bump version
    tools=[
        # ... existing tools ...
        # Voice tool
        speak_text_tool,
    ],
)
```
### 5.3 MODIFY: `llm_interface.py` -- Add to Allowed Tools

In `_build_agent_sdk_options()`, add `"speak_text"` to the `allowed_tools` list:

```python
allowed_tools = [
    # ... existing tools ...
    # Voice
    "speak_text",
]
```
### 5.4 MODIFY: `adapters/base.py` -- Add Voice Support

Add to `AdapterCapabilities`:

```python
@dataclass
class AdapterCapabilities:
    supports_threads: bool = False
    supports_reactions: bool = False
    supports_media: bool = False
    supports_files: bool = False
    supports_markdown: bool = False
    supports_voice: bool = False  # NEW
    max_message_length: int = 2000
    chunking_strategy: Optional[str] = None
```

Add a default method to `BaseAdapter`:

```python
async def send_voice_message(
    self,
    channel_id: str,
    audio_path: str,
    reply_to_id: Optional[str] = None,
    thread_id: Optional[str] = None,
    caption: Optional[str] = None,
) -> Dict[str, Any]:
    """Send a voice/audio message to the platform. Optional.

    Args:
        channel_id: Target channel/chat ID
        audio_path: Path to the audio file (MP3)
        reply_to_id: Optional message to reply to
        thread_id: Optional thread to post in
        caption: Optional text caption with the voice message

    Returns:
        Dict with at least {"success": bool}
    """
    return {"success": False, "error": "Voice not supported on this platform"}
```
### 5.5 MODIFY: `adapters/telegram/adapter.py` -- Implement Voice Sending

Add to capabilities:

```python
@property
def capabilities(self) -> AdapterCapabilities:
    return AdapterCapabilities(
        supports_threads=False,
        supports_reactions=True,
        supports_media=True,
        supports_files=True,
        supports_markdown=True,
        supports_voice=True,  # NEW
        max_message_length=4096,
        chunking_strategy="markdown",
    )
```

Add the voice sending method:

```python
async def send_voice_message(
    self,
    channel_id: str,
    audio_path: str,
    reply_to_id: Optional[str] = None,
    thread_id: Optional[str] = None,
    caption: Optional[str] = None,
) -> Dict[str, Any]:
    """Send a voice message to Telegram.

    Telegram voice notes require OGG/Opus format. If the source is MP3,
    we convert it using pydub + ffmpeg. Telegram also accepts MP3 directly
    via send_voice() since Bot API 6.0+.
    """
    if not self.bot:
        return {"success": False, "error": "Bot not started"}
    try:
        from pathlib import Path

        audio_file = Path(audio_path)
        if not audio_file.exists():
            return {"success": False, "error": f"Audio file not found: {audio_path}"}

        chat_id = int(channel_id)
        reply_id = int(reply_to_id) if reply_to_id else None

        # Attempt OGG/Opus conversion for native voice note display.
        # Falls back to sending MP3 directly if pydub/ffmpeg is not available.
        ogg_path = None
        try:
            from pydub import AudioSegment

            audio = AudioSegment.from_mp3(str(audio_file))
            ogg_path = audio_file.with_suffix(".ogg")
            audio.export(str(ogg_path), format="ogg", codec="libopus")
            voice_file = ogg_path
        except Exception:
            # pydub or ffmpeg not available; send MP3 directly
            voice_file = audio_file

        with open(voice_file, "rb") as f:
            sent = await self.bot.send_voice(
                chat_id=chat_id,
                voice=f,
                caption=caption,
                reply_to_message_id=reply_id,
            )

        # Clean up the temporary OGG file
        if ogg_path and ogg_path.exists():
            try:
                ogg_path.unlink()
            except Exception:
                pass

        return {
            "success": True,
            "message_id": sent.message_id,
            "chat_id": sent.chat_id,
        }
    except TelegramError as e:
        print(f"[Telegram] Error sending voice: {e}")
        return {"success": False, "error": str(e)}
    except Exception as e:
        print(f"[Telegram] Voice send error: {e}")
        return {"success": False, "error": str(e)}
```
### 5.6 MODIFY: `adapters/slack/adapter.py` -- Implement Voice Sending

Add to capabilities:

```python
@property
def capabilities(self) -> AdapterCapabilities:
    return AdapterCapabilities(
        supports_threads=True,
        supports_reactions=True,
        supports_media=True,
        supports_files=True,
        supports_markdown=True,
        supports_voice=True,  # NEW
        max_message_length=4000,
        chunking_strategy="word",
    )
```

Add the voice sending method:

```python
async def send_voice_message(
    self,
    channel_id: str,
    audio_path: str,
    reply_to_id: Optional[str] = None,
    thread_id: Optional[str] = None,
    caption: Optional[str] = None,
) -> Dict[str, Any]:
    """Send a voice/audio file to Slack.

    Uses files_upload_v2 (the modern file upload method).
    Slack displays MP3 files with an inline audio player.
    """
    if not self.app:
        return {"success": False, "error": "Adapter not started"}
    try:
        from pathlib import Path

        audio_file = Path(audio_path)
        if not audio_file.exists():
            return {"success": False, "error": f"Audio file not found: {audio_path}"}

        result = await self.app.client.files_upload_v2(
            channel=channel_id,
            file=str(audio_file),
            filename=f"garvis_voice_{audio_file.stem}.mp3",
            title="Garvis Voice Message",
            initial_comment=caption or "",
            thread_ts=thread_id,
        )
        return {
            "success": True,
            "file_id": result.get("file", {}).get("id", "unknown"),
        }
    except SlackApiError as e:
        error_msg = e.response["error"]
        print(f"[Slack] Error sending voice: {error_msg}")
        return {"success": False, "error": error_msg}
    except Exception as e:
        print(f"[Slack] Voice send error: {e}")
        return {"success": False, "error": str(e)}
```
### 5.7 MODIFY: `adapters/runtime.py` -- Voice Postprocessor

Add a voice postprocessor that detects `[VOICE:path]` markers in the agent's response and triggers audio delivery. Add this function and register it:

```python
import re
from pathlib import Path


def _extract_voice_markers(text: str) -> list:
    """Extract [VOICE:path] markers from text.

    Returns a list of (marker_string, audio_path) tuples.
    """
    pattern = r'\[VOICE:(.*?)\]'
    matches = re.findall(pattern, text)
    return [(f"[VOICE:{path}]", path.strip()) for path in matches]


# In AdapterRuntime._process_message(), after the agent response is received
# and before sending the text response, add voice handling:

# --- Inside _process_message, after getting `response` from agent.chat() ---

# Handle voice markers in the response
voice_markers = _extract_voice_markers(response)
if voice_markers and adapter and adapter.capabilities.supports_voice:
    for marker, audio_path in voice_markers:
        # Remove the marker from the text response
        response = response.replace(marker, "").strip()
        # Send the voice message
        voice_result = await adapter.send_voice_message(
            channel_id=message.channel_id,
            audio_path=audio_path,
            reply_to_id=(
                message.metadata.get("ts")
                or str(message.metadata.get("message_id", ""))
            ),
            thread_id=message.thread_id,
        )
        if voice_result.get("success"):
            print(
                f"[{message.platform.upper()}] Voice message sent "
                f"({Path(audio_path).stat().st_size} bytes)"
            )
        else:
            print(
                f"[{message.platform.upper()}] Voice send failed: "
                f"{voice_result.get('error')}"
            )
        # Clean up the temp audio file
        try:
            Path(audio_path).unlink(missing_ok=True)
        except Exception:
            pass

# If the text response is now empty (it was voice-only), skip the text send
if not response.strip() and voice_markers:
    response = ""  # Don't send empty text; the voice was the response

# Continue with the existing text send logic (only if response is non-empty)
```
### 5.8 MODIFY: `memory_workspace/SOUL.md` -- Document Voice Tool

Add to the "Available Tools" section:

```markdown
### Voice (ElevenLabs - API Cost)
- speak_text (convert text to a Garvis voice message, delivered as a platform voice note)

**Voice Guidelines**:
- Use voice for: greetings, witty quips, important announcements, when the user asks
- Keep voice messages SHORT (1-3 sentences, under 500 chars)
- Budget: ~9,000 chars/month -- be selective
- Always provide text alongside voice (accessibility)
- Voice personality: British AI assistant, dry wit, composed confidence
- Signature phrases in voice: "Right then", "Very good, sir", "I've taken the liberty of..."
- DO NOT use voice for: long explanations, code, lists, weather reports (waste of budget)
```
### 5.9 MODIFY: `.env.example` -- Add ElevenLabs Section

```bash
# ========================================
# ElevenLabs Voice (Optional)
# ========================================
# Enables Jarvis-style voice responses
# Sign up: https://elevenlabs.io (free tier: 10,000 chars/month)

# API Key (from the Profile + API Key page)
ELEVENLABS_API_KEY=your-api-key-here

# Voice ID - Jarvis Robot (British AI assistant)
ELEVENLABS_VOICE_ID=WWtyH2oxeOp9yZwK8ERD

# Model - eleven_multilingual_v2 (best) or eleven_turbo_v2_5 (faster)
# ELEVENLABS_MODEL_ID=eleven_multilingual_v2

# Monthly character budget (default: 9000, buffer below the 10k free limit)
# ELEVENLABS_MONTHLY_BUDGET=9000

# Enable/disable voice globally
# ELEVENLABS_ENABLED=true
```
### 5.10 MODIFY: `.gitignore` -- Add Voice Temp Files

```
# Voice temp files
temp/voice/
config/elevenlabs_usage.json
```
## 6. Voice Trigger Logic

### 6.1 When Garvis Should Use Voice

The LLM decides when to use the `speak_text` tool based on SOUL.md instructions. The key triggers:
Explicit triggers (always use voice):
- "Say that out loud"
- "Voice response please"
- "Tell me in your voice"
- "Speak to me, Garvis"
- Any message containing "voice" + "say/tell/speak/read"
Auto triggers (when voice_mode: auto, LLM decides):
- Morning greetings: "Good morning, sir. I trust you slept well."
- Task completion announcements: "All done. Your calendar is updated."
- Witty remarks / personality moments
- Important alerts: "Sir, your budget has exceeded 75%."
Never voice (even in auto mode):
- Code blocks or technical output
- Long responses (> 500 chars)
- Lists, tables, structured data
- Weather reports (text is more useful)
- When budget is low (< 1000 chars remaining)
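The explicit triggers listed above could also be pre-checked cheaply with a heuristic like the following, e.g. to gate auto mode. In practice the LLM makes this decision; the phrase list and function name here are illustrative assumptions.

```python
# Illustrative sketch of the "voice + say/tell/speak/read" explicit-trigger
# heuristic described above.
import re

_VOICE_WORD = re.compile(r"\bvoice\b", re.IGNORECASE)
_SPEECH_VERB = re.compile(r"\b(say|tell|speak|read)\b", re.IGNORECASE)


def is_explicit_voice_request(text: str) -> bool:
    """True if the message plainly asks for a spoken response."""
    lowered = text.lower()
    phrases = ("say that out loud", "voice response", "in your voice", "speak to me")
    if any(p in lowered for p in phrases):
        return True
    # Fallback: "voice" combined with a speech verb
    return bool(_VOICE_WORD.search(text) and _SPEECH_VERB.search(text))
```

For example, "Say that out loud" and "Can you tell me in a voice message?" match, while "What's the weather?" does not.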
### 6.2 Dual Response Pattern
When using voice, Garvis should ALWAYS send both:
- Voice message: The spoken audio
- Text message: The same (or slightly different) text version
This ensures accessibility, searchability, and works even if voice delivery fails. The text accompanies the voice naturally -- Telegram shows it as a caption, Slack shows it as a message with the audio file.
Example agent response with voice:

```
[VOICE:temp/voice/voice_20260217_143022.mp3]

Good morning, sir. The weather in Centennial looks rather agreeable today -- 72 degrees with clear skies. I'd recommend that light jacket you've been neglecting.
```

The runtime strips the `[VOICE:...]` marker, sends the audio, then sends the remaining text as a regular message.
## 7. Platform Delivery

### 7.1 Telegram
| Aspect | Details |
|---|---|
| API Method | bot.send_voice() |
| Format | OGG/Opus (converted from MP3 via pydub) or MP3 directly |
| Display | Native voice note player (waveform visualization) |
| Max Size | 50 MB |
| Caption | Supported (text alongside voice note) |
| Duration | Auto-detected from audio metadata |
User experience: The voice message appears as a playable waveform bubble in chat. The user taps to listen. It looks and feels like a standard Telegram voice message.
### 7.2 Slack
| Aspect | Details |
|---|---|
| API Method | files_upload_v2() |
| Format | MP3 (no conversion needed) |
| Display | Inline audio player with play button |
| Max Size | Determined by workspace plan |
| Caption | Via initial_comment parameter |
| Thread | Supported via thread_ts |
User experience: The audio appears as an uploaded file with Slack's inline audio player. The user clicks play to listen. Less native-feeling than Telegram but functional.
## 8. Cost Monitoring

### 8.1 Character Budget System

The `ElevenLabsClient` tracks usage in `config/elevenlabs_usage.json`:

```json
{
  "month": "2026-02",
  "chars_used": 2340,
  "requests": 12,
  "history": [
    {"timestamp": "2026-02-17T14:30:22", "chars": 180},
    {"timestamp": "2026-02-17T15:45:10", "chars": 220}
  ]
}
```
### 8.2 Budget Enforcement
| Chars Remaining | Behavior |
|---|---|
| > 3000 | Normal operation |
| 1000 - 3000 | LLM gets usage warning in tool response |
| 100 - 1000 | Only explicit voice requests honored |
| 0 | Voice tool returns error, LLM responds with text only |
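The tiers in the table above can be sketched as a pure function. The tier names are illustrative (the table's 100-1000 band is treated as everything below 1000 here), not identifiers from the codebase.

```python
# Sketch: map the remaining monthly character budget to an enforcement tier.
def budget_mode(remaining_chars: int) -> str:
    """Return the enforcement tier for a given remaining budget."""
    if remaining_chars <= 0:
        return "disabled"       # voice tool returns an error; text only
    if remaining_chars < 1000:
        return "explicit_only"  # honor only explicit voice requests
    if remaining_chars <= 3000:
        return "warn"           # include a usage warning in tool responses
    return "normal"
```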
### 8.3 Monthly Budget Reset

The budget resets automatically on the 1st of each month (detected by comparing `usage.month` with the current month string).
### 8.4 Integration with Daily Cost Report

The existing daily cost report scheduled task (`daily-cost-report` in `scheduled_tasks.yaml`) can be extended to include voice usage. The agent can read `config/elevenlabs_usage.json` using the `read_file` tool and include voice stats in the report.
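A helper along these lines could format the voice line for that report. This is a hypothetical sketch: the function name and output format are assumptions, not part of the existing report code.

```python
# Hypothetical sketch: summarize config/elevenlabs_usage.json as one
# report line for the daily cost report.
import json
from pathlib import Path


def voice_usage_line(usage_file: str, monthly_budget: int = 9000) -> str:
    """Format a single summary line from the usage-tracking JSON."""
    path = Path(usage_file)
    if not path.exists():
        return "Voice: no usage recorded"
    usage = json.loads(path.read_text(encoding="utf-8"))
    used = usage.get("chars_used", 0)
    pct = (used / monthly_budget * 100) if monthly_budget else 0
    return (
        f"Voice: {used:,}/{monthly_budget:,} chars "
        f"({pct:.0f}%), {usage.get('requests', 0)} requests"
    )
```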
## 9. Testing Strategy

### 9.1 Unit Tests

```python
# test_elevenlabs.py

def test_budget_tracking():
    """Ensure character budget is tracked correctly."""
    client = ElevenLabsClient()
    # Reset usage file
    # Track 100 chars
    # Assert remaining == budget - 100


def test_budget_rejection():
    """Ensure over-budget requests are rejected."""
    # Set budget to 50
    # Attempt to speak 100 chars
    # Assert error returned


def test_monthly_reset():
    """Ensure budget resets on a new month."""
    # Write usage with month = "2026-01"
    # Check remaining in "2026-02"
    # Assert full budget available


def test_text_too_long():
    """Ensure the 2500-char per-request limit is enforced."""
    # Attempt to speak 3000 chars
    # Assert error about per-request limit


def test_empty_text():
    """Ensure empty text is rejected."""


def test_temp_file_cleanup():
    """Ensure old temp files are cleaned up."""
```
### 9.2 Integration Tests
```python
def test_voice_marker_extraction():
    """Test [VOICE:path] marker parsing."""
    text = "Hello [VOICE:temp/voice/v1.mp3] world"
    markers = _extract_voice_markers(text)
    assert len(markers) == 1
    assert markers[0][1] == "temp/voice/v1.mp3"

def test_voice_marker_removal():
    """Test that markers are cleanly removed from text."""
    text = "[VOICE:temp/v.mp3]\n\nHello, sir."
    markers = _extract_voice_markers(text)
    clean = text.replace(markers[0][0], "").strip()
    assert clean == "Hello, sir."

def test_telegram_voice_send():
    """Test Telegram voice message delivery (mock)."""
    # Mock bot.send_voice
    # Call adapter.send_voice_message
    # Assert send_voice called with correct params

def test_slack_voice_send():
    """Test Slack audio file upload (mock)."""
    # Mock app.client.files_upload_v2
    # Call adapter.send_voice_message
    # Assert upload called with correct params
```
### 9.3 Manual Testing Checklist
- Set `ELEVENLABS_API_KEY` in `.env`
- Send "Garvis, say hello in your voice" via Telegram
- Verify voice note appears in Telegram chat
- Verify voice waveform is playable
- Verify text response also appears alongside voice
- Send "Speak to me" via Slack (if configured)
- Verify audio file appears in Slack with player
- Check `config/elevenlabs_usage.json` for correct tracking
- Test budget exhaustion (set budget to 10, speak > 10 chars)
- Verify graceful fallback to text when voice fails
- Test with `ELEVENLABS_ENABLED=false` -- voice tool should return error
- Test with missing API key -- voice tool should return error
- Test text > 2500 chars -- should reject with a clear message
- Verify temp files are cleaned up after delivery
- Test on slow network (API timeout handling)
## 10. Edge Cases and Error Handling
### 10.1 Error Scenarios
| Scenario | Handling |
|---|---|
| No API key configured | Tool returns error, LLM responds with text |
| Invalid API key | Tool returns clear error message |
| API rate limit (429) | Tool returns "wait and retry" message |
| API timeout | Tool returns timeout error after 30s |
| Audio conversion fails (no ffmpeg) | Send MP3 directly (Telegram supports it since Bot API 6.0) |
| Budget exhausted | Tool rejects with remaining chars info |
| Temp file missing at send time | Log error, send text-only response |
| Platform doesn't support voice | Voice marker stays in text (removed by marker cleanup), text still sent |
| Large text (> 2500 chars) | Tool rejects, suggests shortening |
| Empty text | Tool rejects immediately |
| Network error during API call | Tool returns error, LLM falls back to text |
| Concurrent voice requests | Each gets its own timestamp-based temp file |
| Bot restart mid-voice | Orphaned temp files cleaned up by periodic cleanup |
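The tool can translate these scenarios into short error strings the LLM can act on. A sketch covering a few rows of the table; the function name and message wording are illustrative, not the actual implementation:

```python
def tts_error_message(status=None, exc=None, chars_remaining=None):
    """Map an API failure to a user-facing tool error per the table above."""
    if chars_remaining is not None and chars_remaining <= 0:
        return "Voice budget exhausted for this month; respond with text only."
    if status == 401:
        return "Invalid ElevenLabs API key."
    if status == 429:
        return "ElevenLabs rate limit hit; wait and retry, or respond with text."
    if isinstance(exc, TimeoutError):
        return "ElevenLabs request timed out after 30s; falling back to text."
    return "Voice generation failed; falling back to text."
```

Returning an error string (rather than raising) matches the pattern where the tool result is fed back to the LLM, which then decides to answer in text.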
### 10.2 Graceful Degradation
The system degrades gracefully at every level:
- No ElevenLabs configured: Tool returns error -> LLM uses text only
- Budget exhausted: Tool rejects -> LLM uses text only
- API failure: Tool returns error -> LLM uses text only
- Audio conversion fails: Send MP3 instead of OGG
- Platform doesn't support voice: Text response still delivered
- Voice file cleanup fails: No impact on user; files are small
### 10.3 Security Considerations
- API key: Stored in `.env` (gitignored), never logged
- Audio files: Temporary, auto-cleaned, stored in `temp/voice/` (gitignored)
- Usage data: `config/elevenlabs_usage.json` (gitignored) -- no sensitive data
- User content: Text sent to the ElevenLabs API for synthesis -- same privacy model as sending text to any external API. ElevenLabs has a zero-retention mode (`enable_logging: false`) that can be enabled for additional privacy.
## 11. Troubleshooting
### 11.1 Common Issues
**"ElevenLabs not configured"**
- Ensure `ELEVENLABS_API_KEY` is set in `.env`
- Ensure `ELEVENLABS_ENABLED` is not set to `false`
- Restart the bot after changing `.env`

**"Invalid ElevenLabs API key"**
- Check the key at https://elevenlabs.io (Profile + API Key)
- Ensure no trailing whitespace in `.env`
- Free tier keys work fine; no paid plan needed

**Voice note shows as file instead of voice player (Telegram)**
- Install `ffmpeg`: `choco install ffmpeg` (Windows) or `apt install ffmpeg` (Linux)
- Install `pydub`: `pip install pydub`
- Without these, MP3 is sent directly; Telegram may display it as an audio file instead of a voice note

**No sound / corrupted audio**
- Check the ElevenLabs dashboard for the request in usage history
- Try changing `ELEVENLABS_OUTPUT_FORMAT` to `mp3_22050_32` (smaller, more compatible)
- Verify the voice ID is correct: `WWtyH2oxeOp9yZwK8ERD`

**Budget shows exhausted but it's a new month**
- Delete `config/elevenlabs_usage.json` -- it will be recreated
- The auto-reset checks the month string; manual deletion is safe

**Voice works in Telegram but not Slack**
- Ensure the Slack bot has the `files:write` scope
- Check Slack workspace file upload limits
### 11.2 Diagnostic Commands
You can ask Garvis directly:
- "What's your voice budget status?" -- reads `elevenlabs_usage.json`
- "Test your voice" -- triggers a short `speak_text`
- "Disable voice" -- edit preferences
## 12. Future Enhancements
### 12.1 Short Term (After Initial Integration)
- Voice-to-text (STT): Accept Telegram voice messages as input using the ElevenLabs Speech-to-Text API or Whisper. User sends voice -> Garvis transcribes -> processes as text.
- Voice preference commands: `/voice on`, `/voice off`, `/voice status` as Telegram commands.
- Smart budget allocation: Reserve 20% of the budget for the last week of the month.
- Audio caching: Cache frequently spoken phrases (greetings, confirmations) to save API calls.
### 12.2 Medium Term
- Custom voice cloning: Clone a custom Jarvis-like voice using ElevenLabs voice cloning (requires Starter plan at $5/month). Train on MCU JARVIS audio clips for closer personality match.
- Scheduled voice messages: Morning briefing delivered as voice note instead of text. "Good morning, sir. Today's forecast calls for..."
- Emotional voice modulation: Adjust `stability` and `style` parameters based on message tone (urgent = higher stability, witty = lower stability + more style).
- Multi-language support: Use the `language_code` parameter for occasional non-English responses.
### 12.3 Long Term
- Real-time voice conversations: ElevenLabs Conversational AI SDK for live voice chat via Telegram voice calls.
- Voice-based authentication: Recognize Jordan's voice vs. other users.
- Ambient audio: Background music or sound effects for dramatic effect (Iron Man suit sounds).
- Voice journal: Daily summary delivered as a podcast-style voice recording.
## Implementation Order
For a smooth rollout, implement in this order:
1. Create `elevenlabs_client.py` -- standalone, testable, no dependencies on existing code
2. Add `speak_text` MCP tool to `mcp_tools.py` -- register the tool
3. Add `speak_text` to `llm_interface.py` allowed tools -- make it discoverable
4. Add `send_voice_message()` to `adapters/base.py` -- base interface
5. Implement Telegram voice in `adapters/telegram/adapter.py` -- primary platform
6. Add voice postprocessor to `adapters/runtime.py` -- wire up delivery
7. Update `SOUL.md` -- teach Garvis when/how to use voice
8. Update `.env.example` and `.gitignore` -- configuration
9. Test end-to-end on Telegram
10. Implement Slack voice in `adapters/slack/adapter.py` -- secondary platform
11. Add budget monitoring to the daily report -- observability
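Step 4's base interface can be a small optional hook so platforms without voice degrade cleanly. A sketch of what `adapters/base.py` might add; the class excerpt, attribute, and default behavior are assumptions, not the existing code:

```python
from abc import ABC

class BaseAdapter(ABC):
    """Excerpt: only the voice-related hook is sketched here."""

    supports_voice = False  # platforms that implement voice opt in

    async def send_voice_message(self, chat_id, audio_path, caption=None):
        """Deliver an audio file as a voice message.

        Default raises so the runtime postprocessor can catch it and
        keep the text-only response instead of failing the reply.
        """
        raise NotImplementedError(
            f"{type(self).__name__} does not support voice messages")
```

Telegram and Slack adapters override this and set `supports_voice = True`; everywhere else the `[VOICE:path]` marker is simply stripped and the text is sent as usual.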
Estimated effort: 3-4 hours for core implementation, 1-2 hours for testing.
## Quick Reference Card
```
Tool:         speak_text
Input:        { "text": "Hello, sir." }
API:          ElevenLabs TTS v1
Voice:        Jarvis - Robot (WWtyH2oxeOp9yZwK8ERD)
Model:        eleven_multilingual_v2
Format:       MP3 -> OGG/Opus (Telegram) or MP3 (Slack)
Budget:       9,000 chars/month (free tier: 10,000)
Max/request:  2,500 chars
Temp files:   temp/voice/voice_*.mp3
Usage file:   config/elevenlabs_usage.json
Delivery:     [VOICE:path] marker -> runtime postprocessor -> adapter.send_voice_message()
Fallback:     Always text; voice is an enhancement
```