# Whisper Dictation

Local GPU speech-to-text dictation tool. Hold a hotkey to record, release to transcribe and type the result into the active window. After the one-time model download, it runs fully offline: no cloud, no API key.

## Features

- System tray icon with settings GUI (tkinter)
- Configurable hotkey, model, language, audio device
- Cross-platform: Windows and Linux builds from a single codebase
- Shared config via git (`config.json`, `vocabulary.json`)
- Machine-specific settings stored locally (audio device, GPU settings, model)
- Configurable shared paths for vocabulary and model cache (useful for dual-boot setups)

## Requirements

### Windows
- Python 3.13
- NVIDIA GPU with CUDA 12 drivers
- [PortAudio](http://www.portaudio.com/) (bundled with the `sounddevice` wheels on Windows)
- `pywin32` (for system tray and keyboard injection)
- `pyinstaller` (for building a standalone executable)

### Linux

**System packages (install via package manager):**

Arch/CachyOS:
```bash
sudo pacman -S tk libayatana-appindicator wl-clipboard xdotool
```

Debian/Ubuntu:
```bash
sudo apt install python3-tk libayatana-appindicator3-1 wl-clipboard xdotool
```

| Package | Purpose |
|---------|---------|
| `tk` | tkinter GUI (settings, log, vocabulary windows) |
| `libayatana-appindicator` | System tray icon (required for KDE/GNOME on Wayland) |
| `wl-clipboard` | Text injection on Wayland (`wl-copy`) |
| `xdotool` | Simulates Ctrl+V paste on Wayland, text typing on X11 |

**Optional (for GPU acceleration):**

Arch/CachyOS:
```bash
sudo pacman -S nvidia cuda
```

Without CUDA, the app runs on CPU. Use `int8` compute type and a smaller model (`small` or `base`) for acceptable speed on CPU.

**Python:**
- Python 3.10+
- PortAudio (the Linux `sounddevice` wheels do not bundle it; install `portaudio` on Arch/CachyOS or `libportaudio2` on Debian/Ubuntu if it is missing)

## Installation

### Windows

```bat
install.bat
```

This creates a `.venv-windows` virtual environment and installs all dependencies, including the CUDA 12 DLLs required by faster-whisper.

### Linux

```bash
chmod +x install.sh start.sh build-linux.sh
./install.sh
```

This creates a `.venv-linux` virtual environment with all dependencies and PyInstaller.

## Usage

### Windows
```bat
start.bat
```

### Linux
```bash
./start.sh
```

The app starts in the system tray. Hold the hotkey (default: `Ctrl+Shift+Space`) to record, release to transcribe and type into the active window.

## Build

Builds are platform-specific and output to separate directories:
- Windows: `dist/whisper-dictation-windows/`
- Linux: `dist/whisper-dictation-linux/`

### Windows
```bat
.venv-windows\Scripts\python.exe build.py
```

### Linux
```bash
./build-linux.sh
```

Both use PyInstaller to bundle the app into a standalone folder. The resulting executable can be run without a Python installation.

## Configuration

### Shared config (`config.json`, in app directory)

| Key | Default | Description |
|-----|---------|-------------|
| `hotkey` | `ctrl+shift+space` | Recording trigger |
| `language` | `de` | Transcription language (`de`, `en`, `fr`, `es`, `it`, `null` = auto) |
| `sample_rate` | `16000` | Audio sample rate in Hz |
| `vocab_path` | `""` | Path to vocabulary file (empty = local `vocabulary.json`) |
| `model_dir` | `""` | Path to shared model cache directory (empty = default Hugging Face cache) |

### Local config (`config_local.json`, per machine)

Stored outside the app directory to keep machine-specific settings separate:
- **Windows:** `%LOCALAPPDATA%\WhisperDictation\config_local.json`
- **Linux:** `~/.local/share/WhisperDictation/config_local.json`

| Key | Default | Description |
|-----|---------|-------------|
| `model` | `medium` | Whisper model size (`tiny`, `base`, `small`, `medium`, `large-v2`, `large-v3`) |
| `device` | `cuda` | Inference device (`cuda` or `cpu`) |
| `compute_type` | `float16` | Precision (`float16` for GPU, `int8` for CPU, `float32`) |
| `audio_device` | `null` | Microphone (null = system default) |
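
The split between shared and local settings can be pictured with a short Python sketch. This is not the app's actual loader: the function names (`local_config_path`, `read_json`, `load_settings`), the "missing file means empty" behavior, and the merge order are assumptions for illustration. Only the keys, defaults, and file locations come from the tables above.

```python
import json
import os
import sys
from pathlib import Path
from typing import Optional

# Defaults copied from the two configuration tables.
SHARED_DEFAULTS = {
    "hotkey": "ctrl+shift+space",
    "language": "de",
    "sample_rate": 16000,
    "vocab_path": "",
    "model_dir": "",
}
LOCAL_DEFAULTS = {
    "model": "medium",
    "device": "cuda",
    "compute_type": "float16",
    "audio_device": None,
}


def local_config_path(platform: str = sys.platform) -> Path:
    """Return the per-machine config location for the given platform."""
    if platform.startswith("win"):
        fallback = Path.home() / "AppData" / "Local"
        base = Path(os.environ.get("LOCALAPPDATA", fallback))
    else:
        base = Path.home() / ".local" / "share"
    return base / "WhisperDictation" / "config_local.json"


def read_json(path: Path) -> dict:
    """Read a JSON object; a missing file counts as empty (assumption)."""
    if not path.is_file():
        return {}
    with path.open(encoding="utf-8") as fh:
        return json.load(fh)


def load_settings(app_dir: Path, local_path: Optional[Path] = None) -> dict:
    """Merge defaults with config.json (shared) and config_local.json (local)."""
    shared = {**SHARED_DEFAULTS, **read_json(app_dir / "config.json")}
    local = {**LOCAL_DEFAULTS, **read_json(local_path or local_config_path())}
    return {**shared, **local}
```

With this layout, a CPU-only machine would only need `{"device": "cpu", "compute_type": "int8", "model": "small"}` in its `config_local.json`; everything else falls back to the defaults.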

### Sharing data between Windows and Linux

On a shared drive (e.g. Ventoy USB), both builds can use the same vocabulary and model files. Set `vocab_path` and `model_dir` in the Settings UI to point to a common directory:

```
shared_data/
  vocabulary.json   <- shared vocabulary
  models/           <- shared Whisper model cache
```

Audio settings, model selection, and compute type remain per-platform in `config_local.json`.
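
As a rough illustration of the "empty means default" rule for these two keys, the following Python sketch resolves both paths. The helper names `resolve_vocab_path` and `resolve_model_dir` are hypothetical and not taken from the app's code; returning `None` for an empty `model_dir` is assumed to mean "let the library use its default cache".

```python
from pathlib import Path
from typing import Optional


def resolve_vocab_path(settings: dict, app_dir: Path) -> Path:
    """Use vocab_path if set, otherwise vocabulary.json in the app directory."""
    configured = settings.get("vocab_path", "")
    return Path(configured) if configured else app_dir / "vocabulary.json"


def resolve_model_dir(settings: dict) -> Optional[Path]:
    """Use model_dir if set; None stands for the default Hugging Face cache."""
    configured = settings.get("model_dir", "")
    return Path(configured) if configured else None
```

On a dual-boot machine the same shared directory usually has a different mount point per OS, so each platform's settings would carry its own absolute path to it.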

## Vocabulary

Custom vocabulary and replacements can be edited via the Settings UI or directly in `vocabulary.json`. Vocabulary words are passed to Whisper as an initial prompt to improve recognition of domain-specific terms. Replacements are applied as find/replace after transcription.

## Model Download

On first start, the selected Whisper model is downloaded automatically from Hugging Face (roughly 1.5 GB for `medium`). Subsequent starts use the cached model. Set `model_dir` to share the cache between builds.
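
For orientation, this is roughly how the settings could map onto faster-whisper's `WhisperModel` constructor, which downloads the model on first use and accepts a `download_root` directory. The helper names `model_kwargs` and `load_model` are hypothetical, and wiring `model_dir` to `download_root` is an assumption about how the app uses the library.

```python
def model_kwargs(settings: dict) -> dict:
    """Translate config keys into WhisperModel keyword arguments."""
    kwargs = {
        "device": settings.get("device", "cuda"),
        "compute_type": settings.get("compute_type", "float16"),
    }
    # Only override the cache location when model_dir is non-empty.
    if settings.get("model_dir"):
        kwargs["download_root"] = settings["model_dir"]
    return kwargs


def load_model(settings: dict):
    """Create the model; the first call triggers the download if needed."""
    # Imported lazily so model_kwargs can be used without the library.
    from faster_whisper import WhisperModel

    return WhisperModel(settings.get("model", "medium"), **model_kwargs(settings))
```

Keeping the keyword mapping in its own function makes it easy to inspect what would be passed to the library before a multi-gigabyte download starts.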