Seeing on-device: multimodal image input for local Gemma

litertlm-kmp v0.2.4 added vision — attach an image and the local Gemma model reasons over it, on-device. Here's how image attachments flow through the engine, why we default to the CPU vision backend, and the model gotcha that bites you on init.

Text-only is a short-lived limitation for an on-device assistant. The whole point of litertlm-kmp is to run capable Gemma models locally — and the multimodal Gemma variants can see. So v0.2.4 wired image input through the engine: attach a photo, and the local model reasons over it without a pixel ever leaving the device.

How an image flows through the engine

The engine advertises its capability up front — descriptor.supportsVision = true on multimodal models — and is initialized with a vision-aware config:

EngineConfig(visionBackend = Backend.CPU(), maxNumImages = 1)

The interesting bit is the dual path through generation. generateStream inspects the request’s attachments:

No image? It uses the plain-string prompt overload, exactly as before. Text-only requests pay nothing for vision support.
Image present? It filters request.attachments for Attachment.Image and sends the prompt as a Contents bundle — Content.Text + Content.ImageBytes — instead of the string overload.
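To make the branch concrete, here is a minimal sketch of that decision. It is not the library's actual code: Request, Prompt, buildPrompt and the Attachment.Audio case are hypothetical stand-ins invented for illustration, and only the names Attachment.Image, Content.Text and Content.ImageBytes come from the description above. Putting the text before the image in the bundle is also an assumption.

```kotlin
// Hypothetical stand-ins for the library's types (illustration only).
sealed interface Attachment {
    class Image(val bytes: ByteArray) : Attachment
    class Audio(val bytes: ByteArray) : Attachment
}

sealed interface Content {
    data class Text(val text: String) : Content
    class ImageBytes(val bytes: ByteArray) : Content
}

class Request(val prompt: String, val attachments: List<Attachment> = emptyList())

// Assumed modelling of the two prompt shapes: plain string vs. Contents bundle.
sealed interface Prompt {
    data class Plain(val text: String) : Prompt
    data class Bundle(val contents: List<Content>) : Prompt
}

fun buildPrompt(request: Request): Prompt {
    // Keep only image attachments; every other attachment type is ignored here.
    val images = request.attachments.filterIsInstance<Attachment.Image>()

    // No image: fall back to the plain-string path, so text-only requests are unchanged.
    if (images.isEmpty()) return Prompt.Plain(request.prompt)

    // Image present: bundle the text with the image bytes (ordering is an assumption).
    val contents = buildList<Content> {
        add(Content.Text(request.prompt))
        images.forEach { add(Content.ImageBytes(it.bytes)) }
    }
    return Prompt.Bundle(contents)
}
```

Because the filter lives in one function, any generation path that calls it gets the same routing. The sketch does not enforce maxNumImages; a real implementation would presumably have to.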

Crucially, both the free-text and the structured-output (function-calling) paths carry images through to the model. So you can attach a photo and stream a description, or attach a photo and have the model emit a structured tool call grounded in what it saw.

Why the CPU vision backend is the default

Backend.CPU() is a deliberate choice, not laziness. GPU vision delegates on Android vary wildly by device driver — what runs beautifully on one SoC throws on another. For a library other people ship to a long tail of real devices, that’s an unacceptable support burden. CPU vision is slower but predictable, and predictability is the right default for a dependency. (A consumer who knows their target hardware can opt into a GPU delegate; the library won’t make that gamble for everyone.)
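As a rough illustration of "safe default, explicit opt-in", the sketch below mirrors the config shape shown earlier. The type definitions are hypothetical re-declarations so the snippet stands alone, and Backend.GPU in particular is an assumed name; check the library for the real GPU option.

```kotlin
// Hypothetical mirror of the config shape; Backend.GPU is an assumed name.
sealed interface Backend {
    class CPU : Backend
    class GPU : Backend
}

data class EngineConfig(
    // CPU is the default because it behaves the same across devices.
    val visionBackend: Backend = Backend.CPU(),
    val maxNumImages: Int = 1,
)

// Library default: the caller configures nothing and gets the predictable path.
val defaultConfig = EngineConfig()

// An app that has validated its target hardware opts in explicitly.
val gpuConfig = EngineConfig(visionBackend = Backend.GPU())
```

The point of the shape is that the risky choice has to be written out by the consumer; it never arrives implicitly through a default.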

The gotcha that bites you on init

The one thing that will trip you up: the loaded .litertlm model must actually carry vision-encoder weights, or initialization fails. The standard Gemma 4 E2B / E4B multimodal bundles do; a text-only model does not. Pointing a vision-configured engine at a text-only model is an init-time error, not graceful degradation.

This is exactly why the model catalog later grew a per-model supportsVision flag: a text-only INT4 bundle on a low-end phone must init without a vision backend, while a multimodal bundle inits with one. The capability has to be described per model, not assumed globally.
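One way to picture a per-model capability driving initialization is the sketch below. Only the supportsVision flag is named in this post; ModelDescriptor, configFor, the nullable visionBackend and the example model ids are assumptions made for illustration, not the catalog's real shape.

```kotlin
// Hypothetical re-declaration so the snippet stands alone.
sealed interface Backend {
    class CPU : Backend
}

// Hypothetical catalog entry; only supportsVision is named in the post.
data class ModelDescriptor(val id: String, val supportsVision: Boolean)

// Assumed shape: a null visionBackend means "initialize without a vision encoder".
data class EngineConfig(val visionBackend: Backend?, val maxNumImages: Int)

// Derive the engine config from the model's declared capability,
// so a text-only bundle is never initialized with a vision backend.
fun configFor(model: ModelDescriptor): EngineConfig =
    if (model.supportsVision) {
        EngineConfig(visionBackend = Backend.CPU(), maxNumImages = 1)
    } else {
        EngineConfig(visionBackend = null, maxNumImages = 0)
    }
```

With something like this, configFor(ModelDescriptor("text-only-int4", supportsVision = false)) yields a config with no vision backend, which is what a text-only bundle needs in order to init cleanly.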

One more deliberate edge: audio attachments are tolerated by the request API but dropped before inference. The surface accepts them so callers don’t have to special-case, but nothing pretends to process audio here — on-device speech came later, and separately, via Whisper.

The shape of the win

Multimodal vision is a small amount of code sitting on top of a careful interface: capability declared in the descriptor, attachments filtered in one place, both generation paths converging on the same Contents bundle. That’s the payoff of modeling the engine around a clean LocalAiEngine contract — adding a modality is a routing change, not a rewrite. And like everything else in NativeLM, the image is processed entirely on the phone.

litertlm-kmp v0.2.4 multimodal vision is live. The entire engine is open-source (AGPL-3.0).
