Vox - Discord Text-to-Speech Bot

A Python-based Discord bot that reads text messages aloud using neural text-to-speech, with voices cloned from reference WAV files.

Project Structure

Vox/
├── bot.py                 # Main entry point, Discord bot implementation
├── config.py              # Configuration management using environment variables
├── voice_manager.py       # Voice discovery, loading, and user preferences
├── audio_effects.py       # Audio post-processing effects (7 effects)
├── audio_preprocessor.py  # Audio preprocessing for voice cloning
├── numba_config.py        # Numba JIT compiler cache configuration
├── requirements.txt       # Python dependencies
├── launch.sh              # Shell script to start the bot
├── pockettts.service      # Systemd service file for Linux deployment
├── README.md              # Comprehensive documentation
├── .env                   # Production environment configuration
├── .env.testing           # Testing environment configuration
├── .env.example           # Environment configuration template
└── voices/                # Directory for voice WAV files
    ├── preferences.json   # User voice/effect preferences (auto-generated)
    └── *.wav              # Voice reference files

Core Functionality

TTS Implementation

  • Engine: Pocket TTS (pocket-tts library) for neural text-to-speech synthesis
  • Voice Cloning: Uses reference WAV files to clone voices via model.get_state_for_audio_prompt()
  • On-demand Loading: Voices are loaded only when first needed, then cached
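
The on-demand loading described above can be sketched as a small cache keyed by voice name. This is a hypothetical illustration, not the project's code: the class name `VoiceCache` and the injected `load_state` callable are assumptions. In the bot, the loader would presumably wrap `model.get_state_for_audio_prompt()`.

```python
from pathlib import Path
from typing import Any, Callable, Dict


class VoiceCache:
    """Load a voice state the first time it is requested, then reuse it."""

    def __init__(self, voices_dir: str, load_state: Callable[[Path], Any]) -> None:
        self.voices_dir = Path(voices_dir)
        # Hypothetical loader: takes the WAV path, returns whatever the TTS
        # engine needs to speak in that voice.
        self._load_state = load_state
        self._states: Dict[str, Any] = {}

    def get(self, name: str) -> Any:
        # Only pay the loading cost on the first request for this voice.
        if name not in self._states:
            self._states[name] = self._load_state(self.voices_dir / f"{name}.wav")
        return self._states[name]
```

Passing the loader in as a callable keeps the cache independent of the TTS library, which also makes it easy to test with a stub.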

Discord Integration

  • Monitors a configured text channel for messages
  • Joins the message author's voice channel when they post in the monitored channel
  • Uses discord.FFmpegPCMAudio with piped WAV data for streaming
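
As a rough sketch of the "piped WAV data" idea, the helper below encodes float samples as an in-memory WAV file using only the standard library. The function name `pcm_to_wav_bytes` and the 24000 Hz default sample rate are assumptions for illustration and are not taken from the project.

```python
import io
import struct
import wave


def pcm_to_wav_bytes(samples, sample_rate=24000):
    """Encode float samples in [-1.0, 1.0] as a mono 16-bit WAV held in memory."""
    # Clip first so out-of-range values cannot overflow the 16-bit range.
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    frames = struct.pack(f"<{len(clipped)}h", *(int(s * 32767) for s in clipped))

    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(frames)
    buffer.seek(0)  # rewind so a reader starts at the RIFF header
    return buffer


# In the bot, a buffer like this could be handed to FFmpeg over stdin, e.g.:
#   source = discord.FFmpegPCMAudio(buffer, pipe=True)
#   voice_client.play(source)
```

Keeping the audio in memory avoids temporary files, which matters when the service runs with a restricted or read-only file system.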

Audio Processing Pipeline

Text Message → Pocket TTS → Audio Effects → Normalize → FFmpeg → Discord VC

Dependencies

Library                    Purpose
discord.py[voice]>=2.3.0   Discord bot API with voice support
pocket-tts>=0.1.0          Neural TTS engine with voice cloning
scipy>=1.10.0              Scientific computing (audio I/O)
numpy>=1.24.0              Numerical computing
librosa>=0.10.0            Audio analysis and effects
noisereduce>=3.0.0         Noise reduction preprocessing
soundfile>=0.12.0          Audio file I/O
python-dotenv>=1.0.0       Environment variable loading

System Requirements: Python 3.10+, FFmpeg

Key Modules

TTSBot (bot.py)

Main Discord bot class that extends commands.Bot. Handles:

  • Message processing and TTS queue
  • Voice channel connections
  • Slash command registration
  • Startup initialization (loads TTS model, discovers voices)

VoiceManager (voice_manager.py)

Manages voice files and user preferences:

  • Discovers voices from WAV files in voices/ directory
  • On-demand voice loading with caching
  • Per-user voice selection and effect preferences
  • Preferences persistence to JSON
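
Preference persistence of this kind might look like the sketch below. The function names and the preference layout (user ID mapped to a voice and an effects dict) are hypothetical, since the actual schema of preferences.json is not documented here. Note that JSON object keys are strings, so numeric Discord user IDs would need to be stored as strings.

```python
import json
import os
import tempfile
from pathlib import Path


def save_preferences(path, prefs):
    """Write preferences atomically so a crash cannot leave a half-written file."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    # Write to a temporary file in the same directory, then swap it into place.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as tmp:
        json.dump(prefs, tmp, indent=2)
    os.replace(tmp_name, path)


def load_preferences(path):
    """Return saved preferences, or an empty dict if the file does not exist yet."""
    path = Path(path)
    if not path.exists():
        return {}
    with path.open(encoding="utf-8") as f:
        return json.load(f)
```

For example, `save_preferences("voices/preferences.json", {"1234": {"voice": "alice", "effects": {"pitch": 2}}})` followed by `load_preferences("voices/preferences.json")` returns the same dictionary.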

AudioEffects (audio_effects.py)

Provides 7 adjustable post-processing settings (tremolo is controlled by two of them, depth and rate):

  1. Pitch (-12 to +12 semitones)
  2. Speed (0.5x to 2.0x)
  3. Echo (0-100%)
  4. Robot (0-100%) - Ring modulation
  5. Chorus (0-100%) - Multiple voice layering
  6. Tremolo Depth (0.0-1.0)
  7. Tremolo Rate (0.0-10.0 Hz)
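
To make two of these effects concrete, here is a minimal standard-library sketch of tremolo (amplitude modulation) and ring modulation. It is not the project's implementation: the function names, the 50 Hz carrier, and the mapping of the 0-100% setting to a 0.0-1.0 `mix` are assumptions, and real code would more likely operate on numpy arrays.

```python
import math


def tremolo(samples, sample_rate, depth=0.5, rate_hz=5.0):
    """Modulate volume with a slow sine; gain swings between 1 - depth and 1."""
    out = []
    for n, s in enumerate(samples):
        # Low-frequency oscillator scaled into the range [0, 1].
        lfo = 0.5 * (1.0 + math.sin(2.0 * math.pi * rate_hz * n / sample_rate))
        out.append(s * (1.0 - depth + depth * lfo))
    return out


def ring_modulate(samples, sample_rate, mix=0.5, carrier_hz=50.0):
    """Blend the dry signal with the signal multiplied by a sine carrier."""
    out = []
    for n, s in enumerate(samples):
        carrier = math.sin(2.0 * math.pi * carrier_hz * n / sample_rate)
        # mix = 0.0 leaves the input untouched; mix = 1.0 is fully modulated.
        out.append((1.0 - mix) * s + mix * s * carrier)
    return out
```

With `depth=0.0` or `mix=0.0` both functions return the input unchanged, which is a convenient way to represent an effect that is switched off.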

AudioPreprocessor (audio_preprocessor.py)

Prepares voice reference files for cloning:

  1. Load and resample to 22050 Hz
  2. Normalize volume
  3. Trim silence
  4. Reduce noise
  5. Limit length (default 15 seconds)
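
Three of the steps above (normalize, trim silence, limit length) are simple enough to sketch with plain Python lists. The function names, the 0.95 peak target, and the 0.01 silence threshold are assumptions. Resampling and noise reduction are omitted because they would rely on librosa and noisereduce.

```python
def normalize_peak(samples, target=0.95):
    """Scale the signal so its loudest sample has absolute value `target`."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)  # silent or empty input: nothing to scale
    return [s * target / peak for s in samples]


def trim_silence(samples, threshold=0.01):
    """Drop leading and trailing samples whose magnitude is at or below threshold."""
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        return []
    return list(samples[loud[0]: loud[-1] + 1])


def limit_length(samples, sample_rate, max_seconds=15.0):
    """Keep at most `max_seconds` of audio from the start of the clip."""
    return list(samples[: int(max_seconds * sample_rate)])
```

Normalizing before trimming, as in the listed order, means the silence threshold is applied to a signal with a known peak level.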

Config (config.py)

Centralized configuration management with environment-aware loading and validation.
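
As an illustration of loading with validation, the sketch below reads the variables listed under Configuration (.env) and fails fast on missing or malformed values. The `BotConfig` dataclass and `load_config` function are hypothetical names and do not reflect the actual contents of config.py.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class BotConfig:
    discord_token: str
    text_channel_id: int
    voices_dir: str = "./voices"
    default_voice: Optional[str] = None


def load_config(env: Mapping[str, str] = os.environ) -> BotConfig:
    """Build a validated config from environment variables."""
    token = env.get("DISCORD_TOKEN", "").strip()
    if not token:
        raise ValueError("DISCORD_TOKEN is required")

    raw_channel = env.get("TEXT_CHANNEL_ID", "").strip()
    if not raw_channel.isdigit():
        raise ValueError("TEXT_CHANNEL_ID must be a numeric channel ID")

    return BotConfig(
        discord_token=token,
        text_channel_id=int(raw_channel),
        voices_dir=env.get("VOICES_DIR", "./voices"),
        # Treat an empty DEFAULT_VOICE the same as an unset one.
        default_voice=env.get("DEFAULT_VOICE") or None,
    )
```

Accepting any mapping rather than reading `os.environ` directly lets tests pass a plain dict.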

Slash Commands

Command                         Description
/voice list                     Show available voices
/voice set <name>               Select your voice
/voice current                  Show current voice
/voice refresh                  Rescan for new voices
/voice preview <name>           Preview before committing
/effects list                   Show your effect settings
/effects set <effect> <value>   Adjust effects
/effects reset                  Reset to defaults

Features

  • Voice Cloning: Add new voices by placing .wav files in voices/ directory
  • Per-User Customization: Each user can have their own voice and effect preferences
  • Hot-Reload: Rescan for new voices without restart (/voice refresh)
  • Message Queue: Queues messages for sequential playback
  • Inactivity Management: Disconnects after 10 minutes of inactivity
  • Testing Support: Separate .env.testing configuration for safe development
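
The message queue and inactivity timeout listed above can be combined in a single worker loop. The following is a sketch under assumptions, not the bot's actual code: `playback_worker` and the `play` coroutine are hypothetical names, and the 600-second default simply mirrors the 10-minute figure.

```python
import asyncio


async def playback_worker(queue, play, idle_timeout=600.0):
    """Play queued messages one at a time; return after idle_timeout seconds idle."""
    while True:
        try:
            text = await asyncio.wait_for(queue.get(), timeout=idle_timeout)
        except asyncio.TimeoutError:
            # Nothing arrived in time; the caller would disconnect from voice here.
            return
        try:
            # Awaiting here is what makes playback sequential.
            await play(text)
        finally:
            queue.task_done()
```

Because each message is awaited to completion before the next `queue.get()`, messages that arrive during playback wait their turn instead of overlapping.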

Configuration (.env)

DISCORD_TOKEN=your_bot_token
TEXT_CHANNEL_ID=channel_id_to_monitor
VOICES_DIR=./voices
DEFAULT_VOICE=optional_default_voice_name

Running the Bot

# Production
python bot.py

# Testing (uses .env.testing)
python bot.py testing

# Or use the launch script
./launch.sh
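
The `testing` argument shown above implies that the bot picks its environment file from the command line. A minimal sketch of that selection follows; `select_env_file` is a hypothetical helper, and how bot.py actually parses its arguments is an assumption.

```python
import sys


def select_env_file(argv):
    """Return '.env.testing' when the first argument is 'testing', else '.env'."""
    if len(argv) > 1 and argv[1].lower() == "testing":
        return ".env.testing"
    return ".env"


if __name__ == "__main__":
    # With python-dotenv installed, the chosen file could then be loaded via:
    #   from dotenv import load_dotenv
    #   load_dotenv(select_env_file(sys.argv))
    print(select_env_file(sys.argv))
```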

For production deployment on Linux, a systemd service file (pockettts.service) is included.