SoupaWhisper: How I Replaced SuperWhisper on Linux

I've been using my Linux desktop a lot more recently. The prenatal testing project I'm working on has me spending more time on the machine, and I wanted to make sure my development environment was as productive as possible. When you're deep in a project, small friction points add up quickly.

One thing I missed immediately: voice dictation.

On Mac, I use SuperWhisper constantly. I dictate to Claude, to Grok, and especially when working with AI coding tools like Claude Code. The workflow is simple - hold a key, speak, release, and the text appears. It keeps me in flow without breaking to type out longer prompts or explanations. When you're working with AI tools, you're often explaining context, describing what you want, or talking through problems. Typing all of that out breaks the conversational rhythm.

The problem? SuperWhisper doesn't exist for Linux. And the built-in dictation options are either cloud-based (privacy concerns, latency, requires internet) or ancient tools that barely work. Cursor has some built-in dictation, but Claude Code and any of the CLI tools don't have anything. I needed something that would work everywhere - in the terminal, in the browser, in any application.

So I built my own. It took less than an hour.

The solution

SoupaWhisper is a ~250-line Python script that does exactly what SuperWhisper does, powered by OpenAI's Whisper model running entirely locally via faster-whisper.

The workflow is identical to what I was used to: hold F12, speak, release. The text gets typed into whatever window is active and copied to the clipboard if I want to paste it elsewhere. Nothing fancy - just local speech-to-text that actually works.

I built this using Claude Code, which is part of why it came together so quickly. Between Claude Code handling the implementation and Linux's straightforward tooling, I was genuinely taken aback by how easy it was. Linux has all the building blocks ready to go - audio capture, clipboard management, keyboard simulation - you just need to wire them together.

Why Python?

I normally reach for Go for almost everything. It's my language of choice and I've written extensively about why. But Python made more sense here for a few reasons.

The faster-whisper bindings are Python-native, so there's no fighting with FFI or CGO bindings. Downloading and managing the Whisper models is trivial with Python's ecosystem. The code stays minimal because you're not dealing with any impedance mismatch between the AI libraries and your application code.

I'm also doing more Python work on the prenatal project right now, so staying in the same language reduces context switching. When you're bouncing between projects, having one less thing to mentally switch on helps.

Technical implementation

The architecture is dead simple. Here's the entire flow:

F12 Press → arecord (ALSA) → temp.wav → faster-whisper → xclip + xdotool

Breaking that down:

  1. Hotkey detection: pynput listens for global keyboard events. When F12 is pressed, recording starts. When it's released, recording stops (steps 1 and 4 are sketched after this list).
  2. Audio capture: arecord records 16kHz mono WAV, which is exactly what Whisper expects. No resampling or conversion needed.
  3. Transcription: faster-whisper runs the model locally and returns the text.
  4. Output: xclip copies the text to the clipboard, and xdotool types it into the active window.
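
Steps 1 and 4 are thin glue around those tools. Roughly, they look like this - a minimal sketch, where start_recording and stop_recording_and_transcribe are hypothetical stand-ins for the script's actual helpers:

import subprocess
from pynput import keyboard

recording = False

def output_text(text):
    # xclip reads the new clipboard contents from stdin
    subprocess.run(["xclip", "-selection", "clipboard"], input=text.encode())
    # xdotool types into the focused window; --clearmodifiers releases any
    # held modifiers before typing so they don't mangle the output
    subprocess.run(["xdotool", "type", "--clearmodifiers", text])

def on_press(key):
    global recording
    if key == keyboard.Key.f12 and not recording:  # guard against key repeat
        recording = True
        start_recording()  # hypothetical: spawns arecord

def on_release(key):
    global recording
    if key == keyboard.Key.f12 and recording:
        recording = False
        output_text(stop_recording_and_transcribe())  # hypothetical helper

listener = keyboard.Listener(on_press=on_press, on_release=on_release)
listener.start()
listener.join()  # keep the main thread alive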

I use arecord instead of Python audio libraries like pyaudio or sounddevice because it's already installed on most Linux systems via ALSA, it's rock solid, and it handles device selection automatically. No additional Python dependencies to worry about. The tradeoff is it's Linux-only, but that's the target platform anyway.
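
Driving arecord from Python is a single subprocess call. Something along these lines (the exact flags in the script may differ):

import subprocess

# 16-bit signed little-endian, 16kHz, mono - Whisper's native input format
proc = subprocess.Popen(
    ["arecord", "-q", "-f", "S16_LE", "-r", "16000", "-c", "1", "/tmp/soupawhisper.wav"]
)
# ...later, on hotkey release:
proc.terminate()  # arecord catches the signal and closes the WAV cleanly
proc.wait()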

Background model loading

One thing that would have been annoying: the Whisper model can take several seconds to load, especially the larger models. If you had to wait for that every time you wanted to dictate something, it would kill the workflow.

The solution is loading the model in a background thread on startup:

self.model_loaded = threading.Event()
threading.Thread(target=self._load_model, daemon=True).start()

Then waiting for it only when actually needed:

self.model_loaded.wait()  # Blocks until model is ready

This means you can start recording immediately after launch, even if the model isn't loaded yet. The transcription just waits for the model to be ready. In practice, the model loads fast enough that you rarely hit this, but it's good to have.
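
Filling that in, the loader thread and the transcription path fit together roughly like this (the model name and method names are illustrative):

def _load_model(self):
    # The heavy import and model download happen off the main thread
    from faster_whisper import WhisperModel
    self.model = WhisperModel("base.en", device="cpu", compute_type="int8")
    self.model_loaded.set()  # unblocks any transcription already waiting

def transcribe(self, wav_path):
    self.model_loaded.wait()  # only blocks if you beat the loader to it
    segments, _info = self.model.transcribe(wav_path)
    return "".join(segment.text for segment in segments).strip()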

Model performance

faster-whisper uses CTranslate2, which is significantly faster than the original OpenAI Whisper implementation. Here's what I found testing the different model sizes:

Model       Size    CPU time   Notes
tiny.en     75MB    ~1s        Basic accuracy, fine for simple phrases
base.en     150MB   ~2s        Good balance for real-time use
small.en    500MB   ~4s        Better accuracy, noticeable delay
medium.en   1.5GB   ~8s        Great accuracy, but you feel the wait
large-v3    3GB     ~15s       Best accuracy, too slow for dictation

I tried the larger models and they're noticeably more accurate. Technical terms, unusual words, and complex sentences come through better. But the delay becomes painful for real-time dictation. When you're in flow, waiting 8-15 seconds after speaking completely breaks the rhythm.

The base.en model hits the sweet spot for my use case. It's close enough to real-time that the delay doesn't break flow, and accuracy is good enough for dictating prompts and instructions to AI tools. If you're on a machine with a decent GPU and have CUDA set up, even the larger models become viable.
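
Switching models, or moving to a GPU, is just a change to how the model is constructed:

from faster_whisper import WhisperModel

# What I run day to day: small, fast, CPU-friendly
model = WhisperModel("base.en", device="cpu", compute_type="int8")

# With CUDA set up, even the big models become viable for dictation
model = WhisperModel("large-v3", device="cuda", compute_type="float16")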

The gotcha that caught me

Here's an interesting bug I hit that taught me something about Linux.

I initially used pynput to detect Ctrl+C for clean exit. Seems reasonable, right? Standard way to kill a running process.

Wrong. pynput captures global keyboard events. So pressing Ctrl+C in any application would kill SoupaWhisper. Copy something from Chrome? Dead. Interrupt a terminal command? Dead. Try to copy a code snippet? Dead.

I didn't realise quite how global these hooks are on Linux. When you're listening for keyboard events at this level, you're listening to everything happening on the system, not just your application. The fix was simple - don't do that. Use a normal SIGINT handler for manual runs and let systemd handle the service lifecycle. But it's something to be aware of if you're building similar tools.
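
The replacement is plain Unix signal handling, which only fires for signals delivered to this process - a minimal sketch:

import signal
import sys

def shutdown(signum, frame):
    # Stop the listener, clean up temp files, etc.
    sys.exit(0)

signal.signal(signal.SIGINT, shutdown)   # Ctrl+C in the terminal running the script
signal.signal(signal.SIGTERM, shutdown)  # systemctl --user stop soupawhisper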

Notification handling

A small quality-of-life thing: desktop notifications. When you're dictating, you want to know what's happening - recording, transcribing, done. But you don't want notification spam.

The solution is using a synchronous hint to replace previous notifications:

"-h", "string:x-canonical-private-synchronous:soupawhisper"

This means "Recording..." gets replaced by "Transcribing..." which gets replaced by "Copied!" - all in the same notification slot. No notification pile-up.
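
In the script that's just a notify-send call with the hint attached - roughly:

import subprocess

def notify(message):
    subprocess.run([
        "notify-send",
        "-h", "string:x-canonical-private-synchronous:soupawhisper",
        "SoupaWhisper", message,
    ])

notify("Recording...")     # replaced in place by...
notify("Transcribing...")  # ...which is replaced by...
notify("Copied!")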

Configuration

Everything's configurable via ~/.config/soupawhisper/config.ini:

[whisper]
model = base.en
device = cpu
compute_type = int8

[hotkey]
key = f12

[behavior]
auto_type = true
notifications = true

The config file is optional - there are sensible defaults for everything. But if you want to change the hotkey, use a larger model, or disable auto-typing (clipboard only), it's all there.

I use Python's built-in configparser with fallbacks, so missing config values just use defaults. No crashes if your config is incomplete.
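
The pattern is just configparser's fallback keyword - a rough sketch:

import configparser
from pathlib import Path

config = configparser.ConfigParser()
# read() silently skips files that don't exist, so no config file is fine
config.read(Path.home() / ".config" / "soupawhisper" / "config.ini")

model = config.get("whisper", "model", fallback="base.en")
hotkey = config.get("hotkey", "key", fallback="f12")
auto_type = config.getboolean("behavior", "auto_type", fallback=True)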

Running as a systemd service

I run SoupaWhisper as a systemd user service. This means it starts automatically on login, survives terminal closes, and I get proper logging via journalctl. No need to remember to start it.

The tricky part was X11 access. systemd services don't inherit your environment, so you need to explicitly set DISPLAY and XAUTHORITY:

Environment=DISPLAY=:1
Environment=XAUTHORITY=/run/user/1000/gdm/Xauthority

The installer auto-detects these from the current session, so you don't have to figure out the values yourself.
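
For reference, the generated unit ends up looking something like this (the path and the two environment values are illustrative - the installer fills in what it detects):

[Unit]
Description=SoupaWhisper voice dictation

[Service]
ExecStart=/usr/bin/python3 /home/you/soupawhisper/soupawhisper.py
Environment=DISPLAY=:1
Environment=XAUTHORITY=/run/user/1000/gdm/Xauthority
Restart=on-failure

[Install]
WantedBy=default.target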

Multi-distro support

Since I wanted this to work on any Linux system, the installer auto-detects package managers:

if command -v apt &> /dev/null; then
    echo "apt"
elif command -v dnf &> /dev/null; then
    echo "dnf"
elif command -v pacman &> /dev/null; then
    echo "pacman"
...

It then installs the right packages for each distro. Should work on Debian, Ubuntu, Fedora, Arch, and most derivatives.
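
The manager-to-packages mapping is then a case statement along these lines (the package lists here are my guess at what's needed - check install.sh for the real ones):

case "$pkg_manager" in
    apt)    sudo apt install -y xclip xdotool alsa-utils libnotify-bin python3-pip ;;
    dnf)    sudo dnf install -y xclip xdotool alsa-utils libnotify python3-pip ;;
    pacman) sudo pacman -S --noconfirm xclip xdotool alsa-utils libnotify python-pip ;;
esac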

What's not included

This is intentionally minimal. I built what I needed and nothing more:

  • Wayland support: Currently X11 only. xclip, xdotool, and pynput all need X11. Wayland support would require different tools.
  • Custom vocabulary: No way to add domain-specific terms. Whisper handles most technical terminology reasonably well, but if you're in a very specialised field, you might hit limitations.
  • Punctuation commands: You can't say "period" to insert "." or "new line" to break lines. What you say is what you get.
  • Multiple languages: Hardcoded to English models. The .en models are optimised for English and faster than the multilingual variants.

For my use case - dictating to AI tools and typing out longer thoughts - it does exactly what I need. If you need more features, the code is simple enough to extend.

Try it

git clone https://github.com/ksred/soupawhisper.git
cd soupawhisper
./install.sh

Hold F12, speak, release. That's it.

The whole thing is open source at github.com/ksred/soupawhisper. If you're on Linux and missing proper voice dictation, this might save you some time. I've been using it daily since I built it and it's become part of the workflow I don't think about anymore - which is exactly what good tooling should be.


