
May 26, 2026

Seeing on-device: multimodal image input for local Gemma

litertlm-kmp v0.2.4 added vision — attach an image and the local Gemma model reasons over it, on-device. Here's how image attachments flow through the engine, why we default to the CPU vision backend, and the model gotcha that bites you on init.

Tags: on-device-llm, android, multimodal, vision, kotlin

Text-only is a short-lived limitation for an on-device assistant. The whole point of litertlm-kmp is to run capable Gemma models locally — and the multimodal Gemma variants can see. So v0.2.4 wired image input through the engine: attach a photo, and the local model reasons over it without a pixel ever leaving the device.

How an image flows through the engine

The engine advertises its capability up front — descriptor.supportsVision = true on multimodal models — and is initialized with a vision-aware config:

EngineConfig(visionBackend = Backend.CPU(), maxNumImages = 1)

The interesting bit is the dual path through generation. generateStream inspects the request’s attachments:

  • No image? It uses the plain-string prompt overload, exactly as before. Text-only requests pay nothing for vision support.
  • Image present? It filters request.attachments for Attachment.Image and sends the prompt as a Contents bundle — Content.Text + Content.ImageBytes — instead of the string overload.
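A minimal sketch of that branching, to make the two paths concrete. The type shapes here (Request, PromptInput, and the fields on Attachment and Content) are assumptions made for illustration; only the names Attachment.Image, Content.Text, and Content.ImageBytes come from the description above, and the real library signatures may differ.

```kotlin
// Hypothetical stand-ins for the library's types; names mirror the post, shapes are assumed.
sealed interface Attachment {
    class Image(val bytes: ByteArray) : Attachment
    class Audio(val bytes: ByteArray) : Attachment
}

sealed interface Content {
    data class Text(val text: String) : Content
    class ImageBytes(val bytes: ByteArray) : Content
}

data class Request(val prompt: String, val attachments: List<Attachment> = emptyList())

// What the engine would hand to the model: a bare string or a Contents bundle.
sealed interface PromptInput {
    data class Plain(val prompt: String) : PromptInput
    data class Contents(val parts: List<Content>) : PromptInput
}

fun toPromptInput(request: Request): PromptInput {
    // Keep only image attachments; anything else is ignored on this path.
    val images = request.attachments.filterIsInstance<Attachment.Image>()

    // No image: stay on the plain-string path, so text-only requests are unchanged.
    if (images.isEmpty()) return PromptInput.Plain(request.prompt)

    // Image present: bundle the text first, then the image bytes.
    val parts = buildList<Content> {
        add(Content.Text(request.prompt))
        images.forEach { add(Content.ImageBytes(it.bytes)) }
    }
    return PromptInput.Contents(parts)
}
```

Making the decision in a single function means every generation path can call it and get the same routing.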

Crucially, both the free-text and the structured-output (function-calling) paths route images through. So you can attach a photo and stream a description, or attach a photo and have the model emit a structured tool call grounded in what it saw.

Why the CPU vision backend is the default

Backend.CPU() is a deliberate choice, not laziness. GPU vision delegates on Android vary wildly by device driver — what runs beautifully on one SoC throws on another. For a library other people ship to a long tail of real devices, that’s an unacceptable support burden. CPU vision is slower but predictable, and predictability is the right default for a dependency. (A consumer who knows their target hardware can opt into a GPU delegate; the library won’t make that gamble for everyone.)
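For a consumer who does opt in, one plausible pattern is to try the preferred backend and fall back to CPU if initialization throws. This is a sketch under assumptions: Backend.GPU and initWithFallback are hypothetical names, and it assumes a failing delegate surfaces as a catchable exception at init time, which may not hold on every device.

```kotlin
// Hypothetical backend types; only Backend.CPU() appears in the post.
sealed interface Backend {
    class CPU : Backend
    class GPU : Backend
}

// Try the preferred backend; if init throws and we were not already on CPU, retry on CPU.
fun <T> initWithFallback(preferred: Backend, init: (Backend) -> T): T =
    try {
        init(preferred)
    } catch (e: Exception) {
        if (preferred is Backend.CPU) throw e // nothing left to fall back to
        init(Backend.CPU())
    }
```

The init lambda stands in for whatever actually constructs the engine, which keeps the fallback policy separate from engine construction.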

The gotcha that bites you on init

The one thing that will trip you up: the loaded .litertlm model must actually carry vision-encoder weights, or initialization fails. The standard Gemma 4 E2B / E4B multimodal bundles do; a text-only model does not — and pointing a vision-configured engine at a text-only model is an init-time error, not a graceful degrade.

This is exactly why the model catalog later grew a per-model supportsVision flag: a text-only INT4 bundle on a low-end phone must init without a vision backend, while a multimodal bundle inits with one. The capability has to be described per model, not assumed globally.
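A sketch of how a per-model flag might drive the config. ModelDescriptor, configFor, and the nullable visionBackend are assumptions for illustration; the post confirms only the supportsVision flag and the EngineConfig(visionBackend, maxNumImages) parameters.

```kotlin
// Hypothetical shapes; supportsVision, visionBackend, maxNumImages and Backend.CPU() come from the post.
sealed interface Backend {
    class CPU : Backend
}

data class ModelDescriptor(val name: String, val supportsVision: Boolean)

// Assumption: a null visionBackend means "do not initialize a vision backend".
data class EngineConfig(val visionBackend: Backend? = null, val maxNumImages: Int = 0)

fun configFor(descriptor: ModelDescriptor): EngineConfig =
    if (descriptor.supportsVision) {
        // Multimodal bundle: request the CPU vision backend and allow one image.
        EngineConfig(visionBackend = Backend.CPU(), maxNumImages = 1)
    } else {
        // Text-only bundle: no vision backend, so init does not look for encoder weights.
        EngineConfig()
    }
```

Deriving the config from the descriptor keeps a vision-configured engine from ever being pointed at a text-only model.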

One more deliberate edge: audio attachments are tolerated by the request API but dropped before inference. The surface accepts them so callers don’t have to special-case, but nothing pretends to process audio here — on-device speech came later, and separately, via Whisper.
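The tolerate-then-drop behavior can be sketched as a filter that also counts what it discarded, so a caller could log the drop. Attachment.Audio, Filtered, and filterForInference are hypothetical names; the post does not show how the library represents audio attachments or whether it reports dropped ones.

```kotlin
// Hypothetical attachment types; the post names Attachment.Image, the Audio case is assumed.
sealed interface Attachment {
    class Image(val bytes: ByteArray) : Attachment
    class Audio(val bytes: ByteArray) : Attachment
}

data class Filtered(val images: List<Attachment.Image>, val droppedAudio: Int)

fun filterForInference(attachments: List<Attachment>): Filtered {
    // Images go on to the model; audio is accepted by the API but never reaches inference.
    val images = attachments.filterIsInstance<Attachment.Image>()
    val droppedAudio = attachments.count { it is Attachment.Audio }
    return Filtered(images, droppedAudio)
}
```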

The shape of the win

Multimodal vision is a small amount of code sitting on top of a careful interface: capability declared in the descriptor, attachments filtered in one place, both generation paths converging on the same Contents bundle. That’s the payoff of modeling the engine around a clean LocalAiEngine contract — adding a modality is a routing change, not a rewrite. And like everything else in NativeLM, the image is processed entirely on the phone.

litertlm-kmp v0.2.4 multimodal vision is live. The entire engine is open-source (AGPL-3.0).
