Voice Dictation: Improvement Attempts

Date: 2026-03-30

Context

Explored improvements to the voice dictation workflow. Three tools were in use: LilySpeech, a custom Whisper push-to-talk implementation, and Claude Code (CC) `/voice` mode.

Findings

LilySpeech

Best tool for Windows apps (Obsidian, Ecco Pro, browsers). Two issues emerged recently:

“and” dropped after a spoken comma: a language-model post-processing artifact introduced by a LilySpeech update. Speaking “comma” inserts a “,” character, which creates a clause boundary, and grammar correctors tend to strip “and” at clause boundaries as a filler connector. Workaround: say “comma and” quickly, without pausing. Permanent fix: disable Smart Punctuation / Post-processing in LilySpeech Advanced Settings.
Spaces dropped in the WSL Ubuntu terminal: keystrokes sent through the Windows SendInput API don’t translate correctly into WSL terminal windows. Not fixable; LilySpeech is the wrong tool for that context.
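The dropped “and” described above can be reproduced with a toy corrector. This is only an illustration of the suspected mechanism, not LilySpeech's actual code: the function names and the regex rule are hypothetical, and the real post-processor is presumably a language model rather than a single pattern.

```python
import re

# Hypothetical toy rule, NOT LilySpeech's implementation:
# treat "and" directly after a comma as a filler connector and remove it.
FILLER_AFTER_COMMA = re.compile(r",\s+and\b\s*", flags=re.IGNORECASE)


def render_spoken_punctuation(transcript: str) -> str:
    """Replace the spoken word 'comma' with ',' attached to the previous word."""
    return re.sub(r"\s+comma\b", ",", transcript, flags=re.IGNORECASE)


def naive_cleanup(text: str) -> str:
    """Drop 'and' when it directly follows a comma (the suspected artifact)."""
    return FILLER_AFTER_COMMA.sub(", ", text)


spoken = "buy milk comma and call the bank"
with_punctuation = render_spoken_punctuation(spoken)
cleaned = naive_cleanup(with_punctuation)

print(with_punctuation)  # buy milk, and call the bank
print(cleaned)           # buy milk, call the bank
```

With post-processing disabled, the output would stop at the first printed line, which matches the permanent fix noted above.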

Whisper Push-to-Talk (custom)

10-20 second delay with no word-by-word feedback; this is inherent to the batch transcription architecture
Unreliable text insertion in WSL Ubuntu terminal (clipboard/xdotool paste doesn’t reliably target WSL windows)
Works acceptably for Obsidian and Windows apps, but the delay degrades the experience vs LilySpeech
Decision: removed entirely. Freed ~1.4 GB from ~/.voicemode/ (Whisper model + Kokoro TTS). Startup shortcut deleted, services disabled.

CC `/voice` Mode

Whisper (STT) and Kokoro (TTS) were running locally and integrated with CC’s voice mode
Hold-Space → inject text into CC input: failed for the same root cause as LilySpeech. Text injection into the WSL terminal is not reliable from Windows-side processes
Conversational `/voice` mode (speak → Claude hears via Whisper → Claude responds with Kokoro TTS): worked end-to-end, confirmed by log analysis
Not suitable for dictating long multi-sentence inputs; it changes the interaction model to a back-and-forth conversation

Outcome

Whisper and Kokoro removed from the system. `/voice` mode disabled in Claude Code settings (`voiceEnabled: false`).

Current dictation strategy:

Claude Code: LilySpeech → dictate into Windows Notepad → paste into CC terminal. Not elegant but reliable.
Obsidian / Windows apps: LilySpeech directly — best-in-class for these contexts.
Claude Code conversational: /voice mode remains available if re-enabled, useful for short back-and-forth.
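The Notepad route above depends on a manual paste. One possible variation, not tested in these notes, is to let the WSL side pull the Windows clipboard itself instead of having a Windows process push text in. The sketch below assumes WSL interop is enabled so that `powershell.exe` is reachable from Ubuntu; the function names are hypothetical, and non-ASCII text may need extra encoding handling.

```python
import subprocess
from typing import Callable, Sequence

# Assumption: WSL interop is enabled, so powershell.exe can be launched from Ubuntu.
CLIPBOARD_COMMAND = ["powershell.exe", "-NoProfile", "-Command", "Get-Clipboard"]


def _run(command: Sequence[str]) -> str:
    """Run a command and return its stdout as text."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


def read_windows_clipboard(run: Callable[[Sequence[str]], str] = _run) -> str:
    """Return the Windows clipboard text with Unix line endings.

    `run` is injectable so the function can be exercised without Windows.
    """
    raw = run(CLIPBOARD_COMMAND)
    # Normalize CRLF and drop the trailing newline PowerShell appends.
    return raw.replace("\r\n", "\n").rstrip("\n")


# Usage inside WSL (after copying dictated text from Notepad):
#   text = read_windows_clipboard()
```

This only reads the dictated text into a script; getting it into the CC prompt still requires a paste or piping step, so it reduces the manual work rather than removing it.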

Root Cause: Why WSL Terminal Blocks All Dictation Tools

All external text injection approaches (LilySpeech, Whisper hotkey, CC Hold-Space) fail in the WSL Ubuntu terminal because they rely on Windows SendInput or clipboard paste, neither of which reliably targets WSL terminal windows. This is an architectural constraint of injecting input from Windows-side processes, not a bug in any individual tool.