# Full-Text Search
## Overview
Ka-Note implements a hybrid full-text search strategy: small in-memory corpora (contexts, page titles) are filtered client-side; the large corpus (history entry text, page body) is indexed server-side using SQLite FTS5 and queried via HTTP.
## Architecture
### Search tiers
| Entity | Where | Method |
|---|---|---|
| Contexts (name) | Client only | Substring on in-memory Svelte store |
| Pages (title) | Client only | Substring on in-memory Svelte store |
| HistoryEntries (text) | Server FTS5 | Debounced HTTP GET /api/search |
| Pages (body) | Server FTS5 | Debounced HTTP GET /api/search |

History entries are the primary scaling concern (years of daily journals → tens of thousands of rows). SQLite FTS5 with BM25 ranking handles this efficiently without additional infrastructure.
### Offline fallback
When the server is unreachable, CommandBar falls back to local results only (contexts, page titles) and shows the notice "Server nicht erreichbar — nur lokale Ergebnisse" ("Server unreachable, local results only").

---
## Server
### FTS5 tables
Migration: `server/drizzle/0013_fts_search.sql`
Two virtual tables using the `unicode61` tokenizer (handles German umlauts correctly, no stemming):
- `fts_history`: content table backed by `history_entries` (columns: `text`)
- `fts_pages`: content table backed by `pages` (columns: `title`, `body`)

Both tables are populated via `INSERT INTO fts_*(fts_*) VALUES('rebuild')` on the first migration run.
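
As a rough illustration of the table setup described above, the helper below generates FTS5 statements for an external-content table. The function name `ftsDdl`, the use of `IF NOT EXISTS`, and the `content_rowid='rowid'` choice are assumptions made for this sketch; the actual contents of `0013_fts_search.sql` may differ.

```typescript
// Hypothetical helper: builds the SQL for one external-content FTS5 table.
// Assumes the source table's implicit rowid serves as content_rowid.
function ftsDdl(fts: string, source: string, columns: string[]): string[] {
  return [
    `CREATE VIRTUAL TABLE IF NOT EXISTS ${fts} USING fts5(` +
      `${columns.join(", ")}, content='${source}', content_rowid='rowid', tokenize='unicode61');`,
    // Special FTS5 command: repopulate the index from the content table.
    `INSERT INTO ${fts}(${fts}) VALUES('rebuild');`,
  ];
}

// The two tables named in this section, with their indexed columns.
const statements: string[] = [
  ...ftsDdl("fts_history", "history_entries", ["text"]),
  ...ftsDdl("fts_pages", "pages", ["title", "body"]),
];
```
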
### Index maintenance
The FTS index is updated synchronously after every write, covering all server-side write paths:

| Write path | File | FTS update |
|---|---|---|
| Sync push (primary client sync) | `sync-service.ts` `pushChanges()` | after each upsert |
| Trash / soft-delete | `routes/trash.ts` | after batch update |
| AI bundle upload (ZIP) | `ai-export-service.ts` `applyOps()` | after each op |
| AI legacy JSON upload | `ai-export-service.ts` `applyOps()` | after each op |
| Startup drift recovery | `index.ts` `setImmediate` | full rebuild if mismatch > 10 |

All paths use `better-sqlite3` prepared statements. The shared helper `applyOps()` in `ai-export-service.ts` handles both upload variants. Soft-deleted rows are removed from FTS; active rows are re-indexed via `INSERT OR REPLACE … SELECT`.
**Startup consistency check:** On each server start, row counts of `history_entries` (non-deleted) and `fts_history` are compared. If the difference exceeds 10, both FTS tables are rebuilt via `INSERT INTO fts_*(fts_*) VALUES('rebuild')`. This guards against index drift after DB restores or backup imports.
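
The decision logic of the consistency check can be sketched as follows. The `FtsDriftDb` interface and the function name `checkFtsDrift` are hypothetical stand-ins for whatever `index.ts` actually does; only the threshold of 10 and the "rebuild both tables" behavior come from the description above.

```typescript
// Hypothetical abstraction over the database calls used by the check.
interface FtsDriftDb {
  countActiveHistory(): number;  // non-deleted rows in history_entries
  countIndexedHistory(): number; // rows currently indexed in fts_history
  rebuild(table: string): void;  // would run INSERT INTO t(t) VALUES('rebuild')
}

const DRIFT_THRESHOLD = 10;

// Returns true if a rebuild was triggered.
function checkFtsDrift(db: FtsDriftDb): boolean {
  const drift = Math.abs(db.countActiveHistory() - db.countIndexedHistory());
  if (drift <= DRIFT_THRESHOLD) return false; // within tolerance, nothing to do
  // Drift is judged on history only, but both FTS tables are rebuilt.
  for (const table of ["fts_history", "fts_pages"]) db.rebuild(table);
  return true;
}
```

Tolerating a small mismatch avoids a full rebuild on every start when a few rows are briefly out of step, while still catching the large gaps that follow a restore or import.
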
### Raw SQLite access
File: `server/src/db/connection.ts`
The `better-sqlite3` instance is exported as `sqlite` alongside the Drizzle `db`. This is needed for FTS prepared statements (Drizzle has no FTS5 DSL).
### Search endpoint
```
GET /api/search?q=<query>&limit=<n>
Authorization: Bearer <token>
```
Response:
```json
{
  "history": [
    { "id": "...", "topicId": "...", "date": "2025-01-15", "snippet": "...text..." }
  ],
  "pages": [
    { "id": "...", "title": "Page Title", "snippet": "...body text..." }
  ]
}
```
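
For client code, the response shape above could be captured in types along these lines. The interface names and the `isSearchResponse` guard are illustrative assumptions, not identifiers taken from the codebase.

```typescript
// Hypothetical types mirroring the JSON response shown above.
interface HistoryHit {
  id: string;
  topicId: string;
  date: string; // ISO date, e.g. "2025-01-15"
  snippet: string;
}

interface PageHit {
  id: string;
  title: string;
  snippet: string;
}

interface SearchResponse {
  history: HistoryHit[];
  pages: PageHit[];
}

// Shallow runtime check: verifies only that both top-level arrays exist.
function isSearchResponse(value: unknown): value is SearchResponse {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return Array.isArray(v.history) && Array.isArray(v.pages);
}
```
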
- `q` must be ≥ 2 characters; shorter queries return empty results.
- `limit` is capped at 20 server-side.
- Each word in `q` is quoted and suffixed with `*` for prefix matching (`"term"*`).
- Results are ranked by BM25 (`ORDER BY rank`).
- FTS5 query errors (invalid syntax from special characters) return empty results instead of HTTP 500.
- Soft-deleted entries never appear in results because they are removed from the FTS index when they are soft-deleted.
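
The query rules in the list above could be implemented roughly as follows. The function names, the whitespace trimming, the doubling of embedded quotes, and the fallback limit of 3 are assumptions for this sketch; the real logic in the route handler may differ.

```typescript
const MIN_QUERY_LENGTH = 2;
const MAX_LIMIT = 20;

// Turns user input into an FTS5 MATCH expression, or null if the query is too short.
function buildFtsQuery(q: string): string | null {
  const trimmed = q.trim();
  if (trimmed.length < MIN_QUERY_LENGTH) return null;
  return trimmed
    .split(/\s+/)
    // Quote each word (doubling embedded quotes) and append * for prefix matching.
    .map((word) => `"${word.replace(/"/g, '""')}"*`)
    .join(" ");
}

// Parses the limit parameter and caps it at MAX_LIMIT (fallback value is assumed).
function clampLimit(raw: unknown, fallback = 3): number {
  const n = Number(raw);
  if (!Number.isInteger(n) || n < 1) return fallback;
  return Math.min(n, MAX_LIMIT);
}
```

Quoting each word keeps most special characters from being parsed as FTS5 operators; the handler would still wrap the query execution in a `try`/`catch` to return empty results on any remaining syntax error.
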

File: `server/src/routes/search.ts`

---
## Client
### Settings store
File: `client/src/lib/stores/settings.ts`
Generic key-value settings backed by a Dexie `settings` table (version 13). Provides:
- `getSetting<T>(key, default)`: async one-time read
- `setSetting<T>(key, value)`: async write
- `settingStore<T>(key, default)`: reactive Svelte store backed by `liveQuery`

The `searchResultsLimit` store (default: 3) controls how many server results are requested.
### CommandBar integration
File: `client/src/lib/components/CommandBar.svelte`
In navigate mode (query ≥ 2 chars, not starting with `/`):
1. **Immediately (sync):** Filters `$contextsQuery` and `$pagesQuery` by substring match on name/title.
2. **After 250ms debounce:** Calls `authFetch('/api/search?q=...&limit=...')` using the existing `apiClient` helper.
3. **On success:** Server results are appended after local results. Pages already found by title match are deduplicated.
4. **On error:** `isOffline = true`, a footer notice is shown, local results remain visible.
5. **Total results** are capped at 10.
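
The merge and debounce steps above can be sketched in isolation. The `Result` type and the names `mergeResults` and `debounce` are hypothetical; the component itself may structure this differently.

```typescript
// Hypothetical unified result shape for local and server hits.
interface Result {
  kind: "context" | "page" | "history";
  id: string;
  label: string;
}

const MAX_TOTAL = 10;

// Appends server hits after local hits, skipping pages already matched by title.
function mergeResults(local: Result[], server: Result[], cap = MAX_TOTAL): Result[] {
  const seenPages = new Set(local.filter((r) => r.kind === "page").map((r) => r.id));
  const extra = server.filter((r) => !(r.kind === "page" && seenPages.has(r.id)));
  return [...local, ...extra].slice(0, cap);
}

// Generic trailing-edge debounce; the server call would use a 250 ms delay.
function debounce<A extends unknown[]>(fn: (...args: A) => void, ms: number): (...args: A) => void {
  let timer: ReturnType<typeof setTimeout> | undefined;
  return (...args: A) => {
    if (timer !== undefined) clearTimeout(timer);
    timer = setTimeout(() => fn(...args), ms);
  };
}
```

Because local results are computed synchronously and server results are only appended, the list never reorders under the cursor when the debounced response arrives.
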

History results deep-link to `/context/daily-log?date=YYYY-MM-DD`.

---
## Settings
| Key | Type | Default | Description |
|---|---|---|---|
| `searchResultsLimit` | number | 3 | Max server search results per entity type |

To change: write to Dexie via `setSetting('searchResultsLimit', 5)` or add a Settings UI field.

---
## Scaling notes
- FTS5 with BM25 ranking is expected to scale well beyond the tens of thousands of rows anticipated here, so no action should be needed as data grows.
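- The `unicode61` tokenizer handles Unicode correctly. Stemming can be added later by changing `tokenize='unicode61'` to `tokenize='porter unicode61'`. Because the tokenizer of an existing FTS5 table cannot be altered in place, this needs a new migration that recreates the virtual tables and rebuilds the index. Note that the built-in `porter` stemmer targets English, so it would not stem German text properly.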
- If topic title search needs FTS in future, add `fts_topics` following the same pattern.
- Offline full-text search for history (e.g. via MiniSearch in a Web Worker) is a possible v2 enhancement.