whisper-dictation/README.md

# Whisper Dictation

Local GPU speech-to-text dictation tool. Hold a hotkey to record, release to transcribe and type the result into the active window. Runs fully offline — no cloud, no API key.

## Features

- System tray icon with settings GUI (tkinter)
- Configurable hotkey, model, language, audio device
- Cross-platform: Windows and Linux builds from a single codebase
- Shared config via git (`config.json`, `vocabulary.json`)
- Machine-specific settings stored locally (audio device, GPU settings, model)
- Configurable shared paths for vocabulary and model cache (useful for dual-boot setups)

## Requirements

### Windows
- Python 3.13
- NVIDIA GPU with CUDA 12 drivers
- [PortAudio](http://www.portaudio.com/) (bundled with most Python sounddevice wheels)
- `pywin32` (for system tray and keyboard injection)
- `pyinstaller` (for building a standalone executable)

### Linux

**System packages (install via package manager):**

Arch/CachyOS:
```bash
sudo pacman -S tk libayatana-appindicator wl-clipboard xdotool
```

Debian/Ubuntu:
```bash
sudo apt install python3-tk libayatana-appindicator3-1 wl-clipboard xdotool
```

| Package | Purpose |
|---------|---------|
| `tk` (`python3-tk` on Debian/Ubuntu) | tkinter GUI (settings, log, vocabulary windows) |
| `libayatana-appindicator` (`libayatana-appindicator3-1` on Debian/Ubuntu) | System tray icon (required for KDE/GNOME on Wayland) |
| `wl-clipboard` | Text injection on Wayland (`wl-copy`) |
| `xdotool` | Simulates Ctrl+V paste on Wayland, text typing on X11 |
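
As a rough sketch of how `wl-copy` and `xdotool` could be combined for text injection, the helper below builds the command lines for each session type. The function name and the use of `XDG_SESSION_TYPE` for session detection are assumptions made for illustration, not taken from the app's source.

```python
import os


def injection_commands(text, session_type=None):
    """Return the commands (as argv lists) that would inject `text`.

    Hypothetical helper: detecting the session via XDG_SESSION_TYPE
    is an assumption, not necessarily what the app does.
    """
    if session_type is None:
        session_type = os.environ.get("XDG_SESSION_TYPE", "x11")

    if session_type == "wayland":
        # Copy the text to the clipboard, then simulate Ctrl+V to paste it.
        return [
            ["wl-copy", "--", text],
            ["xdotool", "key", "ctrl+v"],
        ]
    # On X11, xdotool can type the text directly.
    return [["xdotool", "type", "--", text]]
```

Each returned list could be passed to `subprocess.run(cmd, check=True)` in order. The `--` separator keeps text that starts with a dash from being parsed as an option.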

**Optional (for GPU acceleration):**

Arch/CachyOS:
```bash
sudo pacman -S nvidia cuda
```

Without CUDA, the app runs on CPU. Use `int8` compute type and a smaller model (`small` or `base`) for acceptable speed on CPU.
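
The CPU recommendation above can be expressed as a small helper. This is only a sketch: the function name is hypothetical, and how CUDA availability is detected is left to the caller.

```python
def pick_inference_settings(cuda_available, model="medium"):
    """Choose device, compute type and model size (hypothetical helper)."""
    if cuda_available:
        return {"device": "cuda", "compute_type": "float16", "model": model}
    # CPU fallback: use int8 and cap the model size at `small`.
    cpu_model = model if model in ("tiny", "base", "small") else "small"
    return {"device": "cpu", "compute_type": "int8", "model": cpu_model}
```

The returned keys match the `model`, `device`, and `compute_type` settings stored in `config_local.json`.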

**Python:**
- Python 3.10+
- PortAudio (bundled with `sounddevice` wheels)

## Installation

### Windows

```bat
install.bat
```

This creates a `.venv-windows` virtual environment and installs all dependencies, along with the CUDA 12 DLLs required by faster-whisper.

### Linux

```bash
chmod +x install.sh start.sh build-linux.sh
./install.sh
```

This creates a `.venv-linux` virtual environment and installs all dependencies, including PyInstaller.

## Usage

### Windows
```bat
start.bat
```

### Linux
```bash
./start.sh
```

The app starts in the system tray. Hold the hotkey (default: `Ctrl+Shift+Space`) to record, release to transcribe and type into the active window.
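
The hold-to-record behaviour can be pictured as a small state machine. The class below is an illustrative sketch, not the app's actual code: `start_recording`, `stop_recording`, `transcribe`, and `type_text` are hypothetical callables supplied by the caller.

```python
class PushToTalk:
    """Minimal hold-to-record state machine (illustrative only)."""

    def __init__(self, start_recording, stop_recording, transcribe, type_text):
        self._start = start_recording
        self._stop = stop_recording
        self._transcribe = transcribe
        self._type = type_text
        self.recording = False

    def on_hotkey_press(self):
        # Key-repeat events can fire while the key is held; ignore them.
        if not self.recording:
            self.recording = True
            self._start()

    def on_hotkey_release(self):
        # A release without a matching press is ignored.
        if not self.recording:
            return
        self.recording = False
        audio = self._stop()            # returns the captured audio
        text = self._transcribe(audio)  # speech-to-text
        if text:
            self._type(text)            # inject into the active window
```

Any hotkey library that reports press and release events could drive `on_hotkey_press` and `on_hotkey_release`.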

## Build

Builds are platform-specific and output to separate directories:
- Windows: `dist/whisper-dictation-windows/`
- Linux: `dist/whisper-dictation-linux/`

### Windows
```bat
.venv-windows\Scripts\python.exe build.py
```

### Linux
```bash
./build-linux.sh
```

Both use PyInstaller to bundle the app into a standalone folder. The resulting executable can be run without a Python installation.

## Configuration

### Shared config (`config.json`, in app directory)

| Key | Default | Description |
|-----|---------|-------------|
| `hotkey` | `ctrl+shift+space` | Recording trigger |
| `language` | `de` | Transcription language (`de`, `en`, `fr`, `es`, `it`, `null` = auto) |
| `sample_rate` | `16000` | Audio sample rate in Hz |
| `vocab_path` | `""` | Path to vocabulary file (empty = local `vocabulary.json`) |
| `model_dir` | `""` | Path to shared model cache directory (empty = default HuggingFace cache) |
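
A minimal sketch of reading `config.json` with the defaults from the table above. The function and constant names are hypothetical; the app's own loader may differ.

```python
import json
from pathlib import Path

# Defaults taken from the table above.
SHARED_DEFAULTS = {
    "hotkey": "ctrl+shift+space",
    "language": "de",
    "sample_rate": 16000,
    "vocab_path": "",
    "model_dir": "",
}


def load_shared_config(path):
    """Read config.json and fill in defaults for any missing keys."""
    config = dict(SHARED_DEFAULTS)
    path = Path(path)
    if path.exists():
        config.update(json.loads(path.read_text(encoding="utf-8")))
    return config
```

A JSON `null` for `language` is loaded as Python `None`, which corresponds to automatic language detection.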

### Local config (`config_local.json`, per machine)

Stored outside the app directory to keep machine-specific settings separate:
- **Windows:** `%LOCALAPPDATA%\WhisperDictation\config_local.json`
- **Linux:** `~/.local/share/WhisperDictation/config_local.json`

| Key | Default | Description |
|-----|---------|-------------|
| `model` | `medium` | Whisper model size (`tiny`, `base`, `small`, `medium`, `large-v2`, `large-v3`) |
| `device` | `cuda` | Inference device (`cuda` or `cpu`) |
| `compute_type` | `float16` | Precision (`float16` for GPU, `int8` for CPU, `float32` for full precision on either) |
| `audio_device` | `null` | Microphone (null = system default) |
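
The per-platform locations listed above could be resolved roughly as follows. The function name and signature are hypothetical.

```python
import os
import sys
from pathlib import PurePosixPath, PureWindowsPath


def local_config_path(platform=sys.platform, env=os.environ, home=None):
    """Return the per-machine path of config_local.json (sketch)."""
    if platform.startswith("win"):
        # %LOCALAPPDATA%\WhisperDictation\config_local.json
        base = PureWindowsPath(env["LOCALAPPDATA"])
    else:
        # ~/.local/share/WhisperDictation/config_local.json
        home = home or os.path.expanduser("~")
        base = PurePosixPath(home) / ".local" / "share"
    return base / "WhisperDictation" / "config_local.json"
```

Passing `platform`, `env`, and `home` explicitly makes the helper easy to check on either operating system.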

### Sharing data between Windows and Linux

On a shared drive (e.g. Ventoy USB), both builds can use the same vocabulary and model files. Set `vocab_path` and `model_dir` in the Settings UI to point to a common directory:

```
shared_data/
  vocabulary.json    <- shared vocabulary
  models/            <- shared Whisper model cache
```

Audio settings, model selection, and compute type remain per-platform in `config_local.json`.
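
The "empty means default" rule for `vocab_path` and `model_dir` can be sketched as below. The helper name is hypothetical, and returning `None` for the model directory is an assumption about how the default HuggingFace cache would be signalled.

```python
from pathlib import Path


def resolve_shared_paths(config, app_dir):
    """Apply the 'empty string means default' rule to both shared paths."""
    # Empty vocab_path falls back to vocabulary.json in the app directory.
    vocab = config.get("vocab_path") or str(Path(app_dir) / "vocabulary.json")
    # None signals "use the default HuggingFace cache" to the caller.
    model_dir = config.get("model_dir") or None
    return vocab, model_dir
```

With both keys pointing into `shared_data/`, the Windows and Linux builds would read the same vocabulary file and model cache.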

## Vocabulary

Custom vocabulary and replacements can be edited via the Settings UI or directly in `vocabulary.json`. Vocabulary words are passed to Whisper as an initial prompt to improve recognition of domain-specific terms. Replacements are applied as find/replace after transcription.
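
The two mechanisms can be sketched as follows. The helper names and the `words`/`replacements` layout of the example data are assumptions for illustration; the real `vocabulary.json` schema may differ. Plain case-sensitive replacement is also an assumption.

```python
def build_initial_prompt(words):
    """Join vocabulary words into a single prompt string."""
    return ", ".join(words)


def apply_replacements(text, replacements):
    """Apply find/replace pairs to the transcribed text, in order."""
    for find, replace in replacements.items():
        text = text.replace(find, replace)
    return text


# Hypothetical vocabulary data.
vocab = {
    "words": ["Kubernetes", "PyInstaller"],
    "replacements": {"pie installer": "PyInstaller"},
}

prompt = build_initial_prompt(vocab["words"])
fixed = apply_replacements("built with pie installer", vocab["replacements"])
```

In faster-whisper, a prompt string like this can be passed to `WhisperModel.transcribe` through its `initial_prompt` parameter.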

## Model Download

On first start, the selected Whisper model is downloaded automatically from HuggingFace (roughly 1.5 GB for `medium`). Subsequent starts use the cached model. Set `model_dir` to share the cache between builds.
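
A hedged sketch of how the local settings and `model_dir` might map onto faster-whisper's `WhisperModel` constructor. The helper names are hypothetical; `model_size_or_path`, `device`, `compute_type`, and `download_root` are parameters of `WhisperModel`.

```python
def model_kwargs(local_config, model_dir=""):
    """Build keyword arguments for faster_whisper.WhisperModel (sketch)."""
    return {
        "model_size_or_path": local_config.get("model", "medium"),
        "device": local_config.get("device", "cuda"),
        "compute_type": local_config.get("compute_type", "float16"),
        # None lets the library fall back to the default HuggingFace cache.
        "download_root": model_dir or None,
    }


def load_model(local_config, model_dir=""):
    # Imported lazily so model_kwargs works without the package installed.
    from faster_whisper import WhisperModel

    # Downloads the model on first use, then reuses the cached copy.
    return WhisperModel(**model_kwargs(local_config, model_dir))
```

Pointing `model_dir` at the same directory from both builds means the model only has to be downloaded once.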