May 26, 2026
Seeing on-device: multimodal image input for local Gemma
litertlm-kmp v0.2.4 added vision — attach an image and the local Gemma model reasons over it, on-device. Here's how image attachments flow through the engine, why we default to the CPU vision backend, and the model gotcha that bites you on init.
Text-only is a short-lived limitation for an on-device assistant. The whole point of litertlm-kmp is to run capable Gemma models locally — and the multimodal Gemma variants can see. So v0.2.4 wired image input through the engine: attach a photo, and the local model reasons over it without a pixel ever leaving the device.
How an image flows through the engine
The engine advertises its capability up front — descriptor.supportsVision = true on multimodal models — and is initialized with a vision-aware config:
EngineConfig(visionBackend = Backend.CPU(), maxNumImages = 1)
The interesting bit is the dual path through generation. generateStream inspects the request’s attachments:
- No image? It uses the plain-string prompt overload, exactly as before. Text-only requests pay nothing for vision support.
- Image present? It filters request.attachments for Attachment.Image and sends the prompt as a Contents bundle (Content.Text + Content.ImageBytes) instead of the string overload.
Crucially, both the free-text and the structured-output (function-calling) paths route images through. So you can attach a photo and stream a description, or attach a photo and have the model emit a structured tool call grounded in what it saw.
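To make the routing concrete, here is a minimal Kotlin sketch of the dual path. Every type in it (Request, Attachment, Content, PromptInput) and the function toPromptInput are hypothetical stand-ins written for this illustration, not the library's actual declarations; only the names Attachment.Image, Content.Text and Content.ImageBytes are borrowed from the description above.

```kotlin
// Hypothetical stand-in types for illustration; the real litertlm-kmp
// declarations may differ in names, fields, and shape.
sealed interface Attachment {
    class Image(val bytes: ByteArray) : Attachment
    class Audio(val bytes: ByteArray) : Attachment
}

sealed interface Content {
    data class Text(val text: String) : Content
    class ImageBytes(val bytes: ByteArray) : Content
}

data class Request(
    val prompt: String,
    val attachments: List<Attachment> = emptyList(),
)

// Models the two call shapes: the plain-string overload and the Contents bundle.
sealed interface PromptInput {
    data class Plain(val prompt: String) : PromptInput
    data class Contents(val parts: List<Content>) : PromptInput
}

fun toPromptInput(request: Request): PromptInput {
    // Keep only image attachments; anything else is ignored by this routing step.
    val images = request.attachments.filterIsInstance<Attachment.Image>()

    // No image: stay on the plain-string path, so text-only requests are unchanged.
    if (images.isEmpty()) return PromptInput.Plain(request.prompt)

    // Image present: bundle the prompt text first, then the image bytes.
    val parts = buildList<Content> {
        add(Content.Text(request.prompt))
        images.forEach { add(Content.ImageBytes(it.bytes)) }
    }
    return PromptInput.Contents(parts)
}
```

Because the decision lives in one function, a free-text path and a structured-output path could both call it and receive the same bundle, which is the convergence described above.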
Why the CPU vision backend is the default
Backend.CPU() is a deliberate choice, not laziness. GPU vision delegates on Android vary wildly by device driver — what runs beautifully on one SoC throws on another. For a library other people ship to a long tail of real devices, that’s an unacceptable support burden. CPU vision is slower but predictable, and predictability is the right default for a dependency. (A consumer who knows their target hardware can opt into a GPU delegate; the library won’t make that gamble for everyone.)
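For a consumer who does opt into GPU, one defensive pattern is to try the preferred backend and fall back to CPU if it fails. The sketch below is an assumption-heavy illustration: VisionBackendChoice and initVisionWithFallback are hypothetical names, the initEngine lambda stands in for whatever engine construction the app performs, and it assumes a failing delegate surfaces as an exception at init rather than later during inference.

```kotlin
// Hypothetical enum for illustration; not the library's Backend type.
enum class VisionBackendChoice { CPU, GPU }

// Tries the preferred backend first and retries on CPU if initialization throws.
// Assumption: a broken GPU delegate fails at init time with an exception.
fun <T> initVisionWithFallback(
    preferred: VisionBackendChoice,
    initEngine: (VisionBackendChoice) -> T,
): T {
    // CPU is already the predictable choice, so there is nothing to fall back to.
    if (preferred == VisionBackendChoice.CPU) return initEngine(VisionBackendChoice.CPU)
    return try {
        initEngine(preferred)
    } catch (e: Exception) {
        // The GPU path failed on this device; retry on the predictable CPU path.
        initEngine(VisionBackendChoice.CPU)
    }
}
```

This keeps the gamble in the consuming app, where the target hardware is known, while the library default stays on CPU.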
The gotcha that bites you on init
The one thing that will trip you up: the loaded .litertlm model must actually carry vision-encoder weights, or initialization fails. The standard Gemma 4 E2B / E4B multimodal bundles do; a text-only model does not. Pointing a vision-configured engine at a text-only model is an init-time error, not a graceful degradation.
This is exactly why the model catalog later grew a per-model supportsVision flag: a text-only INT4 bundle on a low-end phone must init without a vision backend, while a multimodal bundle inits with one. The capability has to be described per model, not assumed globally.
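A per-model flag makes that choice mechanical. The following sketch assumes a hypothetical catalog entry (ModelDescriptor) and a simplified config (EngineConfigSketch) in which a null vision backend means "do not initialize vision"; the real EngineConfig may express the text-only case differently.

```kotlin
// Hypothetical catalog entry; only the supportsVision flag comes from the text above.
data class ModelDescriptor(val id: String, val supportsVision: Boolean)

// Simplified stand-in for the engine config. Assumption: a null backend
// means the engine is initialized without any vision support.
data class EngineConfigSketch(val visionBackend: String?, val maxNumImages: Int)

fun configFor(model: ModelDescriptor): EngineConfigSketch =
    if (model.supportsVision) {
        // Multimodal bundle: request the CPU vision backend and allow one image.
        EngineConfigSketch(visionBackend = "CPU", maxNumImages = 1)
    } else {
        // Text-only bundle: never ask for vision, so init cannot fail on missing encoder weights.
        EngineConfigSketch(visionBackend = null, maxNumImages = 0)
    }
```

Deriving the config from the descriptor, rather than from a global setting, is what lets a text-only bundle and a multimodal bundle coexist in the same catalog.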
One more deliberate edge: audio attachments are tolerated by the request API but dropped before inference. The surface accepts them so callers don’t have to special-case, but nothing pretends to process audio here — on-device speech came later, and separately, via Whisper.
The shape of the win
Multimodal vision is a small amount of code sitting on top of a careful interface: capability declared in the descriptor, attachments filtered in one place, both generation paths converging on the same Contents bundle. That’s the payoff of modeling the engine around a clean LocalAiEngine contract — adding a modality is a routing change, not a rewrite. And like everything else in NativeLM, the image is processed entirely on the phone.
litertlm-kmp v0.2.4 multimodal vision is live. The entire engine is open-source (AGPL-3.0).