# Whisper Dictation
Local GPU speech-to-text dictation tool. Hold a hotkey to record; release to transcribe and type the result into the active window. After the initial model download, everything runs offline: no cloud service, no API key.
## Features
- System tray icon with settings GUI (tkinter)
- Configurable hotkey, model, language, audio device
- Cross-platform: Windows and Linux builds from a single codebase
- Shared config via git (`config.json`, `vocabulary.json`)
- Machine-specific settings stored locally (audio device, GPU settings, model)
- Configurable shared paths for vocabulary and model cache (useful for dual-boot setups)
## Requirements
### Windows
- Python 3.13
- NVIDIA GPU with CUDA 12 drivers
- [PortAudio](http://www.portaudio.com/) (bundled with most Python sounddevice wheels)
- `pywin32` (for system tray and keyboard injection)
- `pyinstaller` (for building a standalone executable)
### Linux
**System packages (install via package manager):**
Arch/CachyOS:
```bash
sudo pacman -S tk libayatana-appindicator wl-clipboard xdotool
```
Debian/Ubuntu:
```bash
sudo apt install python3-tk libayatana-appindicator3-1 wl-clipboard xdotool
```
| Package | Purpose |
|---------|---------|
| `tk` | tkinter GUI (settings, log, vocabulary windows) |
| `libayatana-appindicator` | System tray icon (required for KDE/GNOME on Wayland) |
| `wl-clipboard` | Text injection on Wayland (`wl-copy`) |
| `xdotool` | Simulates Ctrl+V paste on Wayland, text typing on X11 |
**Optional (for GPU acceleration):**
Arch/CachyOS:
```bash
sudo pacman -S nvidia cuda
```
Without CUDA, the app runs on CPU. Use `int8` compute type and a smaller model (`small` or `base`) for acceptable speed on CPU.
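The CPU fallback described above could be sketched as follows. This is an illustration only: the function name `effective_settings`, the `cuda_available` flag, and the policy of stepping down to `small` are assumptions, not the app's actual logic.

```python
def effective_settings(model: str, device: str, compute_type: str,
                       cuda_available: bool) -> dict:
    """Return settings adjusted for a machine without CUDA (illustrative only)."""
    if device == "cuda" and not cuda_available:
        # Fall back to CPU with int8 precision, as recommended for CPU use.
        device, compute_type = "cpu", "int8"
        # Larger models are slow on CPU; step down to `small` (assumed policy).
        if model not in ("tiny", "base", "small"):
            model = "small"
    return {"model": model, "device": device, "compute_type": compute_type}


print(effective_settings("medium", "cuda", "float16", cuda_available=False))
# {'model': 'small', 'device': 'cpu', 'compute_type': 'int8'}
```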
**Python:**
- Python 3.10+
- PortAudio (bundled with `sounddevice` wheels)
## Installation
### Windows
```bat
install.bat
```
This creates a `.venv-windows` virtual environment, installs all dependencies and the CUDA 12 DLLs required by faster-whisper.
### Linux
```bash
chmod +x install.sh start.sh build-linux.sh
./install.sh
```
This creates a `.venv-linux` virtual environment with all dependencies and PyInstaller.
## Usage
### Windows
```bat
start.bat
```
### Linux
```bash
./start.sh
```
The app starts in the system tray. Hold the hotkey (default: `Ctrl+Shift+Space`) to record, release to transcribe and type into the active window.
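The hold-to-record behaviour can be pictured as a small state machine over key events. The sketch below is not the app's implementation: the class `PushToTalk` and its `key_down`/`key_up` methods are hypothetical, and a real build would receive key events from a platform keyboard hook.

```python
from typing import Optional


class PushToTalk:
    """Tracks pressed keys and reports when recording should start or stop."""

    def __init__(self, hotkey: str = "ctrl+shift+space") -> None:
        # "ctrl+shift+space" -> {"ctrl", "shift", "space"}
        self.combo = frozenset(hotkey.lower().split("+"))
        self.pressed: set = set()
        self.recording = False

    def key_down(self, key: str) -> Optional[str]:
        self.pressed.add(key.lower())
        # Start once every key of the combo is held down.
        if not self.recording and self.combo <= self.pressed:
            self.recording = True
            return "start"  # begin capturing audio
        return None

    def key_up(self, key: str) -> Optional[str]:
        self.pressed.discard(key.lower())
        # Releasing any key of the combo ends the recording.
        if self.recording and key.lower() in self.combo:
            self.recording = False
            return "stop"  # transcribe and type the result
        return None


ptt = PushToTalk()
events = [ptt.key_down("ctrl"), ptt.key_down("shift"),
          ptt.key_down("space"), ptt.key_up("space")]
print(events)
# [None, None, 'start', 'stop']
```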
## Build
Builds are platform-specific and output to separate directories:
- Windows: `dist/whisper-dictation-windows/`
- Linux: `dist/whisper-dictation-linux/`
### Windows
```bat
.venv-windows\Scripts\python.exe build.py
```
### Linux
```bash
./build-linux.sh
```
Both use PyInstaller to bundle the app into a standalone folder. The resulting executable can be run without a Python installation.
## Configuration
### Shared config (`config.json`, in app directory)
| Key | Default | Description |
|-----|---------|-------------|
| `hotkey` | `ctrl+shift+space` | Recording trigger |
| `language` | `de` | Transcription language (`de`, `en`, `fr`, `es`, `it`, `null` = auto) |
| `sample_rate` | `16000` | Audio sample rate in Hz |
| `vocab_path` | `""` | Path to vocabulary file (empty = local `vocabulary.json`) |
| `model_dir` | `""` | Path to shared model cache directory (empty = default HuggingFace cache) |
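
As a rough illustration of how these keys and defaults fit together, the following sketch merges a `config.json` over the defaults from the table. The names `SHARED_DEFAULTS` and `load_shared_config` are hypothetical; the app's own loader may differ.

```python
import json
from pathlib import Path

# Defaults copied from the table above.
SHARED_DEFAULTS = {
    "hotkey": "ctrl+shift+space",
    "language": "de",
    "sample_rate": 16000,
    "vocab_path": "",
    "model_dir": "",
}


def load_shared_config(path: Path) -> dict:
    """Merge config.json over the defaults; a missing file yields the defaults."""
    config = dict(SHARED_DEFAULTS)
    if path.is_file():
        with path.open(encoding="utf-8") as fh:
            config.update(json.load(fh))
    return config
```

Calling `load_shared_config(Path("config.json"))` returns a plain dictionary. A JSON `null` for `language` arrives as Python `None`, which matches the "auto" setting in the table.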
### Local config (`config_local.json`, per machine)
Stored outside the app directory to keep machine-specific settings separate:
- **Windows:** `%LOCALAPPDATA%\WhisperDictation\config_local.json`
- **Linux:** `~/.local/share/WhisperDictation/config_local.json`
| Key | Default | Description |
|-----|---------|-------------|
| `model` | `medium` | Whisper model size (`tiny`, `base`, `small`, `medium`, `large-v2`, `large-v3`) |
| `device` | `cuda` | Inference device (`cuda` or `cpu`) |
| `compute_type` | `float16` | Numeric precision: `float16` (recommended for GPU), `int8` (recommended for CPU), or `float32` |
| `audio_device` | `null` | Microphone (null = system default) |
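
The per-platform locations listed above could be resolved along these lines. The function `local_config_path` is a hypothetical helper, and the fallback used when `%LOCALAPPDATA%` is unset is an assumption.

```python
import os
import sys
from pathlib import Path


def local_config_path(platform: str = sys.platform) -> Path:
    """Return the per-machine config path for the given platform string."""
    if platform.startswith("win"):
        # %LOCALAPPDATA% is normally set on Windows; this fallback is an assumption.
        base = Path(os.environ.get("LOCALAPPDATA", Path.home() / "AppData" / "Local"))
    else:
        base = Path.home() / ".local" / "share"
    return base / "WhisperDictation" / "config_local.json"
```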
### Sharing data between Windows and Linux
On a shared drive (e.g. Ventoy USB), both builds can use the same vocabulary and model files. Set `vocab_path` and `model_dir` in the Settings UI to point to a common directory:
```
shared_data/
  vocabulary.json   <- shared vocabulary
  models/           <- shared Whisper model cache
```
Audio settings, model selection, and compute type remain per-platform in `config_local.json`.
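
A minimal sketch of the empty-value fallbacks for `vocab_path` and `model_dir`, assuming the config has already been loaded into a dictionary. The helper name `resolve_shared_paths` and the example mount point are hypothetical.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_shared_paths(config: dict, app_dir: Path) -> Tuple[Path, Optional[Path]]:
    """Map `vocab_path` and `model_dir` to concrete locations.

    An empty `vocab_path` means the local vocabulary.json in the app directory.
    An empty `model_dir` means the default HuggingFace cache (None here).
    """
    vocab_setting = config.get("vocab_path")
    model_setting = config.get("model_dir")
    vocab = Path(vocab_setting) if vocab_setting else app_dir / "vocabulary.json"
    model_dir = Path(model_setting) if model_setting else None
    return vocab, model_dir


# Example with a hypothetical shared drive mounted at /mnt/shared.
shared = {"vocab_path": "/mnt/shared/shared_data/vocabulary.json",
          "model_dir": "/mnt/shared/shared_data/models"}
print(resolve_shared_paths(shared, Path(".")))
```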
## Vocabulary
Custom vocabulary and replacements can be edited via the Settings UI or directly in `vocabulary.json`. Vocabulary words are passed to Whisper as an initial prompt to improve recognition of domain-specific terms. Replacements are applied as find/replace after transcription.
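
The two vocabulary mechanisms might look roughly like this. The layout of `vocabulary.json` is not documented here, so the `words` and `replacements` keys, the sample entries, and both helper functions are assumptions made for illustration.

```python
def build_initial_prompt(words: list[str]) -> str:
    """Join vocabulary words into a single prompt string."""
    return ", ".join(words)


def apply_replacements(text: str, replacements: dict[str, str]) -> str:
    """Apply each find/replace pair to the transcribed text, in order."""
    for find, replace in replacements.items():
        text = text.replace(find, replace)
    return text


# Hypothetical vocabulary layout; the real file format may differ.
vocabulary = {
    "words": ["PyInstaller", "CachyOS"],
    "replacements": {"pie installer": "PyInstaller"},
}

prompt = build_initial_prompt(vocabulary["words"])
# With faster-whisper, the prompt would be passed as
# model.transcribe(audio, initial_prompt=prompt).
print(apply_replacements("built with pie installer", vocabulary["replacements"]))
# built with PyInstaller
```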
## Model Download
On first start, the selected Whisper model is downloaded automatically from HuggingFace (roughly 1.5 GB for `medium`). Subsequent starts use the cached model. Set `model_dir` to share the cache between builds.
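
A sketch of how the settings could map onto faster-whisper's `WhisperModel` constructor, whose `download_root` argument controls where models are cached. The helper `whisper_model_kwargs` and the example path are hypothetical, and the actual model load is left commented out so the snippet runs without the library or a download.

```python
def whisper_model_kwargs(local_cfg: dict, shared_cfg: dict) -> dict:
    """Collect constructor arguments for faster_whisper.WhisperModel."""
    return {
        "model_size_or_path": local_cfg.get("model", "medium"),
        "device": local_cfg.get("device", "cuda"),
        "compute_type": local_cfg.get("compute_type", "float16"),
        # None lets the library use the default HuggingFace cache.
        "download_root": shared_cfg.get("model_dir") or None,
    }


kwargs = whisper_model_kwargs({}, {"model_dir": "/mnt/shared/shared_data/models"})
print(kwargs)

# from faster_whisper import WhisperModel
# model = WhisperModel(**kwargs)  # downloads on first use, then loads from the cache
```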