docs: add HuggingFace cache troubleshooting to README

- Document HF_HOME environment variable for writable cache
- Add systemd service permission guidance for /tmp paths
- Troubleshooting steps for read-only file system errors
2026-02-26 15:56:09 -06:00
parent 85a334a57b
commit 9917d44f5d
36 changed files with 168 additions and 0 deletions

research/overview.md Executable file

@@ -0,0 +1,140 @@
# Vox - Discord Text-to-Speech Bot
A Python-based Discord bot that generates neural text-to-speech using voice cloning from reference WAV files.
## Project Structure
```
Vox/
├── bot.py # Main entry point, Discord bot implementation
├── config.py # Configuration management using environment variables
├── voice_manager.py # Voice discovery, loading, and user preferences
├── audio_effects.py # Audio post-processing effects (7 effects)
├── audio_preprocessor.py # Audio preprocessing for voice cloning
├── numba_config.py # Numba JIT compiler cache configuration
├── requirements.txt # Python dependencies
├── launch.sh # Shell script to start the bot
├── pockettts.service # Systemd service file for Linux deployment
├── README.md # Comprehensive documentation
├── .env # Production environment configuration
├── .env.testing # Testing environment configuration
├── .env.example # Environment configuration template
└── voices/ # Directory for voice WAV files
├── preferences.json # User voice/effect preferences (auto-generated)
└── *.wav # Voice reference files
```
## Core Functionality
### TTS Implementation
- **Engine**: Pocket TTS (`pocket-tts` library) for neural text-to-speech synthesis
- **Voice Cloning**: Uses reference WAV files to clone voices via `model.get_state_for_audio_prompt()`
- **On-demand Loading**: Voices are loaded only when first needed, then cached
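
On-demand loading can be sketched as a dictionary-backed cache that calls a loader only the first time a voice is requested. The `LazyVoiceCache` class and its `loader` callable are hypothetical. The loader stands in for whatever code wraps `model.get_state_for_audio_prompt()`, and this sketch is not the bot's actual implementation.

```python
from typing import Any, Callable, Dict, Optional


class LazyVoiceCache:
    """Load each voice state once, on first use, then reuse it.

    `loader` is a hypothetical stand-in for a function that turns a voice
    name into model state (e.g. by calling model.get_state_for_audio_prompt
    on that voice's reference WAV).
    """

    def __init__(self, loader: Callable[[str], Any]) -> None:
        self._loader = loader
        self._cache: Dict[str, Any] = {}

    def get(self, name: str) -> Any:
        # Only call the (expensive) loader on a cache miss.
        if name not in self._cache:
            self._cache[name] = self._loader(name)
        return self._cache[name]

    def invalidate(self, name: Optional[str] = None) -> None:
        # Drop one voice, or everything (e.g. after rescanning voices/).
        if name is None:
            self._cache.clear()
        else:
            self._cache.pop(name, None)


calls = []


def fake_loader(name: str) -> str:
    calls.append(name)
    return f"state-for-{name}"


cache = LazyVoiceCache(fake_loader)
cache.get("alice")
cache.get("alice")
print(calls)  # ['alice']
```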
### Discord Integration
- Monitors a configured text channel for messages
- Joins the author's voice channel when they post in the monitored channel
- Uses `discord.FFmpegPCMAudio` with piped WAV data for streaming
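
When WAV data is piped to FFmpeg, the synthesized audio never has to touch disk. Below is a minimal sketch of the encoding half. It assumes the audio is already a sequence of mono floats in [-1.0, 1.0] and uses only the standard-library `wave` module. The bot itself may use `scipy` or `soundfile` for this step, and `to_wav_bytes` is a hypothetical helper name.

```python
import io
import sys
import wave
from array import array
from typing import Sequence


def to_wav_bytes(samples: Sequence[float], sample_rate: int) -> bytes:
    """Encode mono float samples in [-1.0, 1.0] as 16-bit PCM WAV bytes."""
    # Clamp to [-1.0, 1.0], then scale to the signed 16-bit range.
    clamped = (max(-1.0, min(1.0, float(s))) for s in samples)
    pcm = array("h", (int(s * 32767) for s in clamped))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV sample data is little-endian
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)  # mono
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate)
        wav.writeframes(pcm.tobytes())
    return buf.getvalue()
```

You can wrap the returned bytes as `io.BytesIO(wav_bytes)` and pass them to `discord.FFmpegPCMAudio(..., pipe=True)`. FFmpeg then reads the data from its standard input instead of from a file path.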
### Audio Processing Pipeline
```
Text Message → Pocket TTS → Audio Effects → Normalize → FFmpeg → Discord VC
```
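
The stages in this diagram compose naturally as a chain of functions over a sample buffer. The sketch below assumes that each effect maps a list of float samples to a new list. It also assumes peak normalization for the "Normalize" stage, since the exact method isn't specified here. `run_pipeline`, `peak_normalize`, and `double_volume` are hypothetical names.

```python
from typing import Callable, List, Sequence

# An effect takes a list of float samples and returns a new list.
Effect = Callable[[List[float]], List[float]]


def peak_normalize(samples: List[float], target_peak: float = 0.95) -> List[float]:
    """Scale so the loudest sample reaches target_peak; silence is returned as-is."""
    peak = max((abs(s) for s in samples), default=0.0)
    if peak == 0.0:
        return list(samples)
    gain = target_peak / peak
    return [s * gain for s in samples]


def run_pipeline(tts_output: List[float], effects: Sequence[Effect]) -> List[float]:
    """Apply the user's effects in order, then normalize before handing off to FFmpeg."""
    audio = tts_output
    for effect in effects:
        audio = effect(audio)
    # Echo or chorus can push samples past [-1.0, 1.0]; normalizing last
    # brings the result back into range without hard clipping.
    return peak_normalize(audio)


def double_volume(samples: List[float]) -> List[float]:
    """Hypothetical example effect."""
    return [2.0 * s for s in samples]


out = run_pipeline([0.1, -0.4, 0.2], [double_volume])
print(max(abs(s) for s in out))  # approximately 0.95
```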
## Dependencies
| Library | Purpose |
|---------|---------|
| `discord.py[voice]>=2.3.0` | Discord bot API with voice support |
| `pocket-tts>=0.1.0` | Neural TTS engine with voice cloning |
| `scipy>=1.10.0` | Scientific computing (audio I/O) |
| `numpy>=1.24.0` | Numerical computing |
| `librosa>=0.10.0` | Audio analysis and effects |
| `noisereduce>=3.0.0` | Noise reduction preprocessing |
| `soundfile>=0.12.0` | Audio file I/O |
| `python-dotenv>=1.0.0` | Environment variable loading |

**System Requirements**: Python 3.10+, FFmpeg
## Key Modules
### `TTSBot` (bot.py)
Main Discord bot class that extends `commands.Bot`. Handles:
- Message processing and TTS queue
- Voice channel connections
- Slash command registration
- Startup initialization (loads TTS model, discovers voices)
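
The TTS queue can be sketched with `asyncio.Queue` and a single consumer task. Messages are then spoken in arrival order and never overlap. `tts_worker` and `speak` are hypothetical: `speak` stands in for synthesis plus playback. With discord.py, `VoiceClient.play()` returns immediately, so a real `speak` would have to wait, for example on an `asyncio.Event` that the `after` callback sets via `loop.call_soon_threadsafe`.

```python
import asyncio
from typing import Awaitable, Callable, List


async def tts_worker(queue: "asyncio.Queue[str]",
                     speak: Callable[[str], Awaitable[None]]) -> None:
    """Consume messages one at a time so playback never overlaps."""
    while True:
        text = await queue.get()
        try:
            await speak(text)
        except Exception as exc:  # keep the worker alive if one message fails
            print(f"TTS failed for {text!r}: {exc}")
        finally:
            queue.task_done()


async def demo() -> List[str]:
    spoken: List[str] = []

    async def fake_speak(text: str) -> None:
        await asyncio.sleep(0.01)  # stands in for synthesis + playback time
        spoken.append(text)

    queue: "asyncio.Queue[str]" = asyncio.Queue()
    worker = asyncio.create_task(tts_worker(queue, fake_speak))
    for msg in ["first", "second", "third"]:
        queue.put_nowait(msg)
    await queue.join()  # wait until every queued message has been spoken
    worker.cancel()
    try:
        await worker
    except asyncio.CancelledError:
        pass
    return spoken


print(asyncio.run(demo()))  # ['first', 'second', 'third']
```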
### `VoiceManager` (voice_manager.py)
Manages voice files and user preferences:
- Discovers voices from WAV files in `voices/` directory
- On-demand voice loading with caching
- Per-user voice selection and effect preferences
- Preferences persistence to JSON
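
Voice discovery and preference persistence can be sketched with `pathlib` and `json`. The sketch makes three assumptions: voice names are the WAV file stems, `preferences.json` maps user IDs to a dict of settings, and those IDs are stored as strings because JSON object keys must be strings. The schema and the function names are illustrative and are not taken from `voice_manager.py`.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def discover_voices(voices_dir: Path) -> List[str]:
    """Return voice names (file stems) for every .wav file in voices_dir."""
    return sorted(p.stem for p in voices_dir.glob("*.wav"))


def load_preferences(path: Path) -> Dict[str, Dict[str, Any]]:
    """Read per-user preferences; a missing file means nobody has set any yet."""
    if not path.exists():
        return {}
    return json.loads(path.read_text(encoding="utf-8"))


def save_preferences(path: Path, prefs: Dict[str, Dict[str, Any]]) -> None:
    """Write to a temp file first, then replace, so a crash can't truncate prefs."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(prefs, indent=2), encoding="utf-8")
    tmp.replace(path)
```

Rerunning `discover_voices` is all that a `/voice refresh` needs in order to pick up newly added files. On Linux, `glob("*.wav")` is case-sensitive, so files named `*.WAV` would be skipped.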
### `AudioEffects` (audio_effects.py)
Provides 7 post-processing effects:
1. **Pitch** (-12 to +12 semitones)
2. **Speed** (0.5x to 2.0x)
3. **Echo** (0-100%)
4. **Robot** (0-100%) - Ring modulation
5. **Chorus** (0-100%) - Multiple voice layering
6. **Tremolo Depth** (0.0-1.0)
7. **Tremolo Rate** (0.0-10.0 Hz)
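
Three of these effects reduce to short per-sample formulas. The minimal sketch below assumes mono float samples and a 0-100% setting mapped linearly to a 0.0-1.0 mix. The constants are illustrative assumptions, not values from `audio_effects.py`: a 30 Hz ring-modulation carrier for Robot and a single 250 ms echo tap.

```python
import math
from typing import List

# Assumed constants for illustration only; not taken from audio_effects.py.
ROBOT_CARRIER_HZ = 30.0
ECHO_DELAY_S = 0.25


def tremolo(x: List[float], sr: int, depth: float, rate_hz: float) -> List[float]:
    """Amplitude modulation. depth 0.0 = unchanged; depth 1.0 swings gain 0..1."""
    out = []
    for n, s in enumerate(x):
        lfo = 0.5 * (1.0 + math.sin(2 * math.pi * rate_hz * n / sr))  # in [0, 1]
        out.append(s * ((1.0 - depth) + depth * lfo))
    return out


def robot(x: List[float], sr: int, percent: float) -> List[float]:
    """Ring modulation: multiply by a sine carrier, blended with the dry signal."""
    mix = percent / 100.0
    return [
        (1.0 - mix) * s + mix * s * math.sin(2 * math.pi * ROBOT_CARRIER_HZ * n / sr)
        for n, s in enumerate(x)
    ]


def echo(x: List[float], sr: int, percent: float) -> List[float]:
    """Add one delayed copy of the input, scaled by the echo percentage."""
    mix = percent / 100.0
    d = int(ECHO_DELAY_S * sr)
    out = list(x) + [0.0] * d  # extend so the echo tail is not cut off
    for n, s in enumerate(x):
        out[n + d] += mix * s
    return out
```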
### `AudioPreprocessor` (audio_preprocessor.py)
Prepares voice reference files for cloning:
1. Load and resample to 22050 Hz
2. Normalize volume
3. Trim silence
4. Noise reduction
5. Limit length (default 15 seconds)
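
Steps 2, 3 and 5 need only basic arithmetic, so they can be sketched directly. The sketch assumes mono float samples already at 22050 Hz, RMS normalization to an arbitrary target level, and a fixed amplitude threshold for silence. Resampling and noise reduction are left to librosa and noisereduce (for example `librosa.load(path, sr=22050)` and `noisereduce.reduce_noise(y=..., sr=...)`) and are not reimplemented here. The function names and thresholds are hypothetical.

```python
import math
from typing import List

SAMPLE_RATE = 22050


def rms_normalize(x: List[float], target_rms: float = 0.1) -> List[float]:
    """Scale so the signal's RMS level equals target_rms (silence unchanged)."""
    if not x:
        return []
    rms = math.sqrt(sum(s * s for s in x) / len(x))
    if rms == 0.0:
        return list(x)
    gain = target_rms / rms
    return [s * gain for s in x]


def trim_silence(x: List[float], threshold: float = 0.01) -> List[float]:
    """Drop leading and trailing samples whose magnitude is below threshold."""
    loud = [i for i, s in enumerate(x) if abs(s) >= threshold]
    if not loud:
        return []
    return x[loud[0] : loud[-1] + 1]


def limit_length(x: List[float], max_seconds: float = 15.0,
                 sr: int = SAMPLE_RATE) -> List[float]:
    """Keep at most max_seconds of audio (15 s matches the documented default)."""
    return x[: int(max_seconds * sr)]


def preprocess(x: List[float]) -> List[float]:
    # Same order as the steps above, minus resampling and noise reduction.
    return limit_length(trim_silence(rms_normalize(x)))
```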
### `Config` (config.py)
Centralized configuration management: loads settings from `.env` (or `.env.testing` in testing mode) and validates them.
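
One way to implement the validation is a frozen dataclass built from an environment mapping. It fails fast when the token is missing or the channel ID isn't numeric. The variable names come from the `.env` example in this overview. The `Config.from_env` method, the `./voices` fallback, and the error messages are illustrative assumptions, not the contents of `config.py`.

```python
import os
from dataclasses import dataclass
from pathlib import Path
from typing import Mapping, Optional


@dataclass(frozen=True)
class Config:
    discord_token: str
    text_channel_id: int
    voices_dir: Path
    default_voice: Optional[str]

    @classmethod
    def from_env(cls, env: Mapping[str, str] = os.environ) -> "Config":
        """Build and validate settings from an environment mapping."""
        token = env.get("DISCORD_TOKEN", "").strip()
        if not token:
            raise ValueError("DISCORD_TOKEN is required")
        raw_channel = env.get("TEXT_CHANNEL_ID", "").strip()
        if not raw_channel.isdigit():
            raise ValueError(f"TEXT_CHANNEL_ID must be a numeric ID, got {raw_channel!r}")
        return cls(
            discord_token=token,
            text_channel_id=int(raw_channel),
            voices_dir=Path(env.get("VOICES_DIR", "./voices")),
            # An empty DEFAULT_VOICE is treated as "not set".
            default_voice=env.get("DEFAULT_VOICE") or None,
        )
```

Passing a plain dict instead of `os.environ` keeps the validation easy to test without touching real environment variables.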
## Slash Commands
| Command | Description |
|---------|-------------|
| `/voice list` | Show available voices |
| `/voice set <name>` | Select your voice |
| `/voice current` | Show current voice |
| `/voice refresh` | Rescan for new voices |
| `/voice preview <name>` | Preview a voice before selecting it |
| `/effects list` | Show your effect settings |
| `/effects set <effect> <value>` | Adjust effects |
| `/effects reset` | Reset to defaults |
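
Behind `/effects set`, each value has to be checked against the ranges listed under `AudioEffects`. The sketch below performs that check. The effect keys, the neutral defaults, and the per-user dict are hypothetical stand-ins for whatever `VoiceManager` stores. Rejecting out-of-range values instead of clamping them is a design choice that this overview doesn't specify.

```python
from typing import Dict, Tuple

# (min, max) per setting, copied from the AudioEffects list.
# The dictionary keys are hypothetical names.
EFFECT_RANGES: Dict[str, Tuple[float, float]] = {
    "pitch": (-12.0, 12.0),       # semitones
    "speed": (0.5, 2.0),          # playback-rate multiplier
    "echo": (0.0, 100.0),         # percent
    "robot": (0.0, 100.0),        # percent
    "chorus": (0.0, 100.0),       # percent
    "tremolo_depth": (0.0, 1.0),
    "tremolo_rate": (0.0, 10.0),  # Hz
}

# Assumed neutral defaults (no audible effect), restored by /effects reset.
EFFECT_DEFAULTS: Dict[str, float] = {name: 0.0 for name in EFFECT_RANGES}
EFFECT_DEFAULTS["speed"] = 1.0


def set_effect(user_effects: Dict[str, float], name: str, value: float) -> None:
    """Validate one /effects set request and store it, or raise ValueError."""
    if name not in EFFECT_RANGES:
        raise ValueError(f"Unknown effect {name!r}; choose from: {', '.join(EFFECT_RANGES)}")
    low, high = EFFECT_RANGES[name]
    if not low <= value <= high:
        raise ValueError(f"{name} must be between {low} and {high}, got {value}")
    user_effects[name] = float(value)


def reset_effects() -> Dict[str, float]:
    """Return a fresh copy of the defaults so users never share one dict."""
    return dict(EFFECT_DEFAULTS)
```

The `ValueError` message is meant to be shown to the user directly, for example as an ephemeral reply to the slash command.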
## Features
- **Voice Cloning**: Add new voices by placing `.wav` files in `voices/` directory
- **Per-User Customization**: Each user can have their own voice and effect preferences
- **Hot-Reload**: Rescan for new voices without restart (`/voice refresh`)
- **Message Queue**: Queues messages for sequential playback
- **Inactivity Management**: Leaves the voice channel after 10 minutes of inactivity
- **Testing Support**: Separate `.env.testing` configuration for safe development
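
The 10-minute inactivity rule can be sketched as a small tracker. It records when the last activity happened and reports once the timeout has elapsed. A background task, such as a discord.py `tasks.loop`, could poll it and disconnect. `InactivityTracker` is a hypothetical name. The clock is injectable so the logic can be tested without waiting.

```python
import time
from typing import Callable, Optional

INACTIVITY_TIMEOUT_S = 10 * 60  # 10 minutes, as described above


class InactivityTracker:
    def __init__(self, timeout_s: float = INACTIVITY_TIMEOUT_S,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s
        self._clock = clock  # monotonic: unaffected by system clock changes
        self._last_activity: Optional[float] = None

    def touch(self) -> None:
        """Record activity, e.g. whenever a message is queued or played."""
        self._last_activity = self._clock()

    def should_disconnect(self) -> bool:
        """True once no activity has been seen for the full timeout."""
        if self._last_activity is None:
            return False
        return self._clock() - self._last_activity >= self._timeout_s
```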
## Configuration (.env)
```env
DISCORD_TOKEN=your_bot_token
TEXT_CHANNEL_ID=channel_id_to_monitor
VOICES_DIR=./voices
DEFAULT_VOICE=optional_default_voice_name
```
## Running the Bot
```bash
# Production
python bot.py
# Testing (uses .env.testing)
python bot.py testing
# Or use the launch script
./launch.sh
```
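
The `testing` argument only needs to change which env file is read. The sketch below handles that selection, assuming that the mode is the first command-line argument, as in the examples above. `select_env_file` is a hypothetical name. Accepting an explicit `production` argument is also an assumption, since the documented usage is simply `python bot.py`.

```python
from pathlib import Path
from typing import Sequence

# Mode name -> env file. "production" as an explicit name is an assumption;
# the documented usage is `python bot.py` vs `python bot.py testing`.
ENV_FILES = {"production": ".env", "testing": ".env.testing"}


def select_env_file(argv: Sequence[str]) -> Path:
    """Pick the env file from the first command-line argument, if any."""
    mode = argv[1] if len(argv) > 1 else "production"
    if mode not in ENV_FILES:
        raise ValueError(f"Unknown mode {mode!r}; expected one of {sorted(ENV_FILES)}")
    return Path(ENV_FILES[mode])
```

In `bot.py`, the result of `select_env_file(sys.argv)` could be passed to python-dotenv's `load_dotenv()` before the configuration is built.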
For production deployment on Linux, a systemd service file (`pockettts.service`) is included.