Talk to your local LLM: on-device voice input with Whisper

NativeLM v0.8 lets you dictate your questions — transcribed entirely on-device with Whisper (whisper.cpp), no cloud. Here's why we picked Whisper over Android's built-in recognizer, and how the Whisper model became a first-class 'Audio' entry in the model catalog.

Typing is the wrong default for a lot of NativeLM’s users. If you’re working in Hindi, Tamil, or another Indian-language script, the on-screen keyboard is slow and error-prone, and that friction sits right at the front door of the app. For non-English users, voice is the natural input, arguably the single biggest UX unlock for the India market. It’s also a hard requirement for Curio, the kids-focused product we’re building on the same engine, because kids simply can’t type.

So v0.8 adds on-device voice input: tap the mic, speak your question, and it’s transcribed locally and dropped into the chat box. Like everything else in NativeLM, the audio never leaves the phone.

The tempting shortcut we refused

The fast path on Android is the built-in SpeechRecognizer with EXTRA_PREFER_OFFLINE. It’s a few lines of code. We didn’t use it.

The reason is the wound we’d just closed. In v0.6.1 we discovered that a Google library bundled for on-device work had quietly stood up a telemetry uploader, and we had to strip it out. SpeechRecognizer is the same class of risk: it’s a Google system component whose offline behavior is patchy before Android 13, and we have no way to prove that it stays truly local. Adding it would mean re-litigating the exact telemetry question we’d settled. Not worth it.

| Option | Offline? | Multilingual | Brand fit |
| --- | --- | --- | --- |
| Android `SpeechRecognizer` | Patchy pre-13; Google component | Limited | 🔴 phone-home risk (the v0.6.1 wound) |
| Whisper on-device | ✅ Fully | ✅ 90+ languages | ✅✅ no Google, matches the engine |
| Vosk | ✅ | Per-language models | 🟢 OK, but lower quality |

Whisper won. It’s fully offline, genuinely multilingual (which pairs perfectly with NativeLM’s multilingual answers), and it owes nothing to Google Play Services. We integrated it via whisper.cpp, the mature, well-trodden native build, starting with the multilingual Whisper-tiny (~40 MB) / base (~75 MB) models — small enough to tier onto real phones.

The pipeline

The loop is deliberately simple:

[mic button] → RECORD_AUDIO → AudioRecord (16 kHz mono PCM)
   → Whisper inference (on-device) → transcript
   → populate the chat input  (you review, then send)
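
The hand-off between AudioRecord and Whisper is mostly a format conversion: whisper.cpp's C API takes mono 32-bit float samples at 16 kHz, while AudioRecord configured for 16-bit PCM yields shorts. Here is a minimal sketch of that step, not NativeLM's actual code. The names pcm16ToFloat and durationSeconds are made up for illustration.

```kotlin
// Sample rate from the pipeline above: 16 kHz mono.
const val SAMPLE_RATE_HZ = 16_000

// Hypothetical helper: scale signed 16-bit PCM into the [-1.0, 1.0) float range.
fun pcm16ToFloat(pcm: ShortArray): FloatArray =
    FloatArray(pcm.size) { i -> pcm[i] / 32768.0f }

// Hypothetical helper: how much audio a buffer of mono samples holds.
fun durationSeconds(sampleCount: Int): Double =
    sampleCount.toDouble() / SAMPLE_RATE_HZ

fun main() {
    // Stand-in for one second of recorded audio (silence).
    val recorded = ShortArray(SAMPLE_RATE_HZ)
    val samples = pcm16ToFloat(recorded)
    println("${samples.size} samples, ${durationSeconds(samples.size)} s")
}
```

Dividing by 32768 maps the most negative short to exactly -1.0 and keeps the largest positive value just under 1.0, so nothing clips during conversion.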

The mic button sits in the chat input bar. You hold (or tap) to record, see a listening state, then a “transcribing…” spinner while Whisper runs. The transcript lands in the input field rather than auto-sending — because ASR slips, especially on non-English speech, and a one-glance edit beats firing a wrong question at the model. (Auto-send is an opt-in setting.)
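
The review-before-send behavior can be modeled as a small state machine. This is a sketch under assumptions, not the app's real UI code: the state and function names (Idle, Listening, Transcribing, Reviewing, Sent, onTranscriptReady) are hypothetical.

```kotlin
// Hypothetical dictation states, mirroring the flow described above.
sealed interface DictationState
object Idle : DictationState
object Listening : DictationState
object Transcribing : DictationState
data class Reviewing(val draft: String) : DictationState  // transcript sits in the input field
data class Sent(val message: String) : DictationState     // only reached with auto-send on

// Mic pressed: start listening, but only from the idle state.
fun onMicPressed(state: DictationState): DictationState =
    if (state is Idle) Listening else state

// Recording stopped: hand the audio to Whisper and show the spinner.
fun onRecordingStopped(state: DictationState): DictationState =
    if (state is Listening) Transcribing else state

// Transcript ready: land in the input field unless the user opted into auto-send.
fun onTranscriptReady(
    state: DictationState,
    transcript: String,
    autoSend: Boolean,
): DictationState = when {
    state !is Transcribing -> state
    autoSend -> Sent(transcript)
    else -> Reviewing(transcript)
}

fun main() {
    var state: DictationState = Idle
    state = onMicPressed(state)
    state = onRecordingStopped(state)
    state = onTranscriptReady(state, "what is the capital of Kerala", autoSend = false)
    println(state) // Reviewing(draft=what is the capital of Kerala)
}
```

Keeping auto-send as an explicit parameter makes the default path (review first) the one that takes no extra configuration.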

On privacy, the rules are strict and, per the v0.6.1 lesson, verified rather than assumed: RECORD_AUDIO is held only during active dictation — no background listening — and we re-checked logcat to confirm no audio or text leaves the device.

Making “Audio” a real model category

The interesting engineering wasn’t the audio plumbing. It was resisting the urge to hard-code the Whisper model as a magic blob. NativeLM’s model catalog is the single source of truth for everything downloadable, tiered, and managed in the UI. The LLM and the embedder both live there. Whisper had to join them as a first-class citizen.

The catalog previously only knew two roles:

enum class ModelRole { LLM_PRIMARY, EMBEDDING }

So we taught it about speech:

Added ModelRole.SPEECH_TO_TEXT — the new “Audio” category.
Added a ModelFormat for Whisper (the whisper.cpp GGML variant) alongside the existing LiteRT and MediaPipe embedder formats.
Registered Whisper descriptors (tiny / base, multilingual) with proper minDeviceRamMb tiering and a SHA-256 for integrity.
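
Put together, the three changes above might look roughly like this. It is a sketch, not the real catalog: only ModelRole, ModelFormat, SPEECH_TO_TEXT, and minDeviceRamMb come from the engine. The format constants, the ModelDescriptor fields, the ids, the RAM thresholds, the hashes, and pickForDevice are placeholders for illustration.

```kotlin
enum class ModelRole { LLM_PRIMARY, EMBEDDING, SPEECH_TO_TEXT }

// Constant names are illustrative; the real enum may spell these differently.
enum class ModelFormat { LITERT, MEDIAPIPE_EMBEDDER, WHISPER_GGML }

// Hypothetical descriptor shape.
data class ModelDescriptor(
    val id: String,
    val role: ModelRole,
    val format: ModelFormat,
    val sizeMb: Int,
    val minDeviceRamMb: Int,
    val sha256: String,
)

// Sizes follow the post (~40 MB / ~75 MB); RAM thresholds and hashes are placeholders.
val whisperDescriptors = listOf(
    ModelDescriptor(
        id = "whisper-tiny-multilingual",
        role = ModelRole.SPEECH_TO_TEXT,
        format = ModelFormat.WHISPER_GGML,
        sizeMb = 40,
        minDeviceRamMb = 3_000,
        sha256 = "<placeholder>",
    ),
    ModelDescriptor(
        id = "whisper-base-multilingual",
        role = ModelRole.SPEECH_TO_TEXT,
        format = ModelFormat.WHISPER_GGML,
        sizeMb = 75,
        minDeviceRamMb = 6_000,
        sha256 = "<placeholder>",
    ),
)

// Pick the most demanding model of a role that the device can still run.
fun pickForDevice(
    catalog: List<ModelDescriptor>,
    role: ModelRole,
    deviceRamMb: Int,
): ModelDescriptor? =
    catalog
        .filter { it.role == role && it.minDeviceRamMb <= deviceRamMb }
        .maxByOrNull { it.minDeviceRamMb }

fun main() {
    val choice = pickForDevice(whisperDescriptors, ModelRole.SPEECH_TO_TEXT, deviceRamMb = 4_096)
    println(choice?.id) // whisper-tiny-multilingual
}
```

Because the role is just another enum value, anything that already iterates over the catalog picks up speech models without a special case.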

Because the catalog drives the UI, the models screen now groups by role, and an “Audio” section appears automatically next to the LLM and Embedding models: something you can see, download, and delete. Voice input is gated on its model being present, and the model downloads on first use. That download reuses the engine’s existing Ktor download manager (resume + SHA-256 validation) and the hardware-tier provider that already sizes the LLM to the device and now also decides tiny-vs-base for Whisper. No new download stack and no new tiering logic: the Whisper model rides the rails we already built.
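
The "gated on its model being present" check comes down to a file-exists test plus a hash comparison. The sketch below shows that integrity gate using only the JDK's MessageDigest. It is an illustration rather than the engine's download manager, and sha256Hex and isModelReady are hypothetical names.

```kotlin
import java.io.File
import java.security.MessageDigest

// Hypothetical helper: stream the file through SHA-256 and return lowercase hex.
fun sha256Hex(file: File): String {
    val digest = MessageDigest.getInstance("SHA-256")
    file.inputStream().use { input ->
        val buffer = ByteArray(8 * 1024)
        while (true) {
            val read = input.read(buffer)
            if (read < 0) break
            digest.update(buffer, 0, read)
        }
    }
    return digest.digest().joinToString("") { "%02x".format(it) }
}

// Hypothetical gate: voice input is enabled only if the model file exists
// and its hash matches the value recorded in the catalog.
fun isModelReady(file: File, expectedSha256: String): Boolean =
    file.isFile && sha256Hex(file).equals(expectedSha256, ignoreCase = true)

fun main() {
    // Stand-in "model" whose contents are the well-known test vector "abc".
    val model = File.createTempFile("whisper-tiny", ".bin").apply {
        writeText("abc")
        deleteOnExit()
    }
    val expected = "ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad"
    println(isModelReady(model, expected)) // true
}
```

Streaming the file in chunks keeps memory flat even for model files far larger than the Whisper ones.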

One integration, two products

We deliberately built speech-to-text down in the engine (:lib) rather than up in the app. NativeLM gets dictation; Curio inherits it for free. Kids need voice because they can’t type, and the same on-device Whisper path serves both — one native integration, two apps, zero compromise on the “everything runs on your device” promise.

Voice input is the kind of feature that’s easy to ship badly (just call the system recognizer) and a little harder to ship right — fully offline, multilingual, and provably private. For a product whose entire pitch is that your words stay on your phone, “right” was the only option.

NativeLM v0.8 with on-device voice input is live. The entire engine is open-source (AGPL-3.0).
