
May 30, 2026

Stateful KV-cache sessions for on-device Gemma on Android

How litertlm-kmp v0.3 makes multi-turn memory lossless and free — plus what an on-device CPU/GPU/NPU benchmark actually told me.

on-device-llm · android · kotlin-multiplatform · gemma

The first version of NativeLM — my on-device chat app — faked multi-turn memory. The engine created a fresh conversation per request, so it had no memory of the previous turn. To make the assistant “remember,” the app re-sent the entire conversation history as a context prompt on every message.

It worked. It was also wrong. Re-sending history means the model re-tokenizes and re-prefills the whole conversation every single turn. Time-to-first-token climbs as the chat grows, and to stay inside the context window you start silently dropping the oldest turns — so the model genuinely forgets things while appearing to remember. v0.3 fixes this properly.
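To make the forgetting concrete, here is a minimal Kotlin sketch of the trimming step. It is not the NativeLM code: Turn, roughTokens, fitToBudget, and prefillTokens are hypothetical names, and the word-count "tokenizer" is a crude stand-in for the real one.

```kotlin
// Hypothetical types for illustration; not part of litertlm-kmp or LiteRT-LM.
data class Turn(val role: String, val text: String)

// Crude stand-in for a tokenizer: one token per whitespace-separated word.
fun roughTokens(text: String): Int =
    text.trim().split(Regex("\\s+")).count { it.isNotEmpty() }

// Keep the newest turns that fit in `budget` tokens; older turns are dropped.
fun fitToBudget(history: List<Turn>, budget: Int): List<Turn> {
    var used = 0
    val kept = ArrayDeque<Turn>()
    for (turn in history.asReversed()) {
        val cost = roughTokens(turn.text)
        if (used + cost > budget) break
        used += cost
        kept.addFirst(turn)
    }
    return kept.toList()
}

// Tokens the model must prefill when the (trimmed) history is re-sent for a turn.
fun prefillTokens(history: List<Turn>, budget: Int): Int =
    fitToBudget(history, budget).sumOf { roughTokens(it.text) }
```

Once the history outgrows the budget, fitToBudget returns only the newest turns, so the earliest messages never reach the model even though the UI still shows them.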

The fix: keep the conversation alive

LiteRT-LM’s Conversation object already holds a KV cache. The bug was that the engine wrapped every call in createConversation().use { … } — creating a new conversation and closing it each turn, throwing the cache away. The fix is almost embarrassingly small: keep one Conversation alive for the life of a chat and call it repeatedly.

I wrapped that in an additive engine API:

val session = engine.openChatSession(history = priorTurns)
session.sendTurn(request).collect { state -> /* stream tokens */ }
session.cancel()   // real native interrupt, not just cancelling the Flow
session.close()

Prior turns are seeded via ConversationConfig.initialMessages and re-prefilled once when the session opens (the app shows a “Building understanding…” state for that moment). After that, each turn prefills only the new message. Multi-turn memory becomes lossless and free of repeated prefill work, and time-to-first-token (TTFT) stays flat as the conversation grows.
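The difference in prefill work can be sketched with a toy cost model. This counts tokens rather than running a model; ToySession and statelessPrefill are hypothetical names for illustration, not the litertlm-kmp API.

```kotlin
// Toy cost model, not the real engine: it counts tokens instead of running a model.
class ToySession(seedTokens: Int) {
    // Tokens currently held in the simulated KV cache.
    var cachedTokens = 0
        private set

    // Tokens prefilled by each step, in order; entry 0 is the one-time seed.
    val prefillLog = mutableListOf<Int>()

    init {
        prefill(seedTokens)
    }

    // A turn prefills only the new message; the decoded reply also lands in the cache.
    fun sendTurn(messageTokens: Int, replyTokens: Int) {
        prefill(messageTokens)
        cachedTokens += replyTokens
    }

    private fun prefill(tokens: Int) {
        prefillLog += tokens
        cachedTokens += tokens
    }
}

// Stateless baseline: every turn re-prefills everything said so far plus the new message.
// Each pair is (messageTokens, replyTokens).
fun statelessPrefill(seedTokens: Int, turns: List<Pair<Int, Int>>): List<Int> {
    var context = seedTokens
    return turns.map { (message, reply) ->
        val prefilled = context + message
        context = prefilled + reply
        prefilled
    }
}
```

In this model the session's per-turn prefill equals the size of the new message, while the stateless baseline's per-turn prefill keeps growing with the running total of the conversation.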

What the benchmark actually said

Before committing to a backend, I benchmarked CPU vs GPU vs NPU on the device, not in theory:

Backend | Result
CPU (XNNPACK) | works · ~20 tok/s decode
GPU | Failed to create engine (can’t compile the community bundle)
NPU | NOT_FOUND: TF_LITE_AUX not found (the bundle ships no NPU artifact)

So for the litert-community Gemma bundles, CPU is the only viable backend: there is no free GPU/NPU win to capture without different model artifacts. A thread-count sweep then showed that six threads, CPU(6), gave the best decode rate while leaving the two prime cores free for the UI. The whole “should I use the GPU?” question got answered with data in an afternoon instead of being assumed.
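A sweep like this can be sketched as a small harness. Everything here is an assumption for illustration: SweepResult, sweepThreads, and bestResult are hypothetical names, and decode is a caller-supplied hook that would rebuild the engine with the given thread count, run a fixed prompt, and return the number of tokens generated.

```kotlin
// Result of one benchmark run at a given CPU thread count.
data class SweepResult(val threads: Int, val tokensPerSecond: Double)

// Runs `decode` once per thread count and measures wall-clock decode throughput.
// `decode` is a hypothetical hook supplied by the caller; it returns tokens generated.
fun sweepThreads(
    threadCounts: List<Int>,
    decode: (threads: Int) -> Int
): List<SweepResult> =
    threadCounts.map { threads ->
        val start = System.nanoTime()
        val tokens = decode(threads)
        // Guard against a zero-length interval on very fast hooks.
        val seconds = (System.nanoTime() - start).coerceAtLeast(1L) / 1e9
        SweepResult(threads, tokens / seconds)
    }

// Picks the configuration with the highest decode rate, or null for an empty sweep.
fun bestResult(results: List<SweepResult>): SweepResult? =
    results.maxByOrNull { it.tokensPerSecond }
```

A single run per thread count is noisy on a phone; repeating each configuration and discarding a warm-up run would make the comparison more trustworthy.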

The constraint that shaped the design

One thing the runtime enforces: only one conversation per engine at a time. A second createConversation() while a session is open fails with FAILED_PRECONDITION: A session already exists. One consequence is that generating a conversation title (a one-shot, separate generation) can’t run concurrently with the live chat session. The fix is a small dance: close the session, run the one-shot title generation, then reopen the session seeded with the same history. Not elegant, but honest about what the runtime allows.
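The close, title, reopen sequence can be sketched against a toy engine that enforces the same one-conversation rule. ToyEngine, ToyConversation, and generateTitle are hypothetical stand-ins written for this example; they are not the LiteRT-LM or litertlm-kmp types, and the "reply" is a placeholder string rather than model output.

```kotlin
// Hypothetical stand-in for the engine: allows only one open conversation at a time.
class ToyEngine {
    private var open = false

    fun createConversation(history: List<String>): ToyConversation {
        // Mirrors the runtime rule described above.
        check(!open) { "FAILED_PRECONDITION: A session already exists." }
        open = true
        return ToyConversation(history.toMutableList()) { open = false }
    }
}

class ToyConversation(
    val history: MutableList<String>,
    private val onClose: () -> Unit
) : AutoCloseable {
    // Records the message and a placeholder reply; a real engine would decode here.
    fun send(message: String): String {
        val reply = "reply to: $message"
        history += message
        history += reply
        return reply
    }

    override fun close() = onClose()
}

// Close the live chat, run a one-shot generation, then reopen seeded with the same history.
fun generateTitle(engine: ToyEngine, chat: ToyConversation): Pair<String, ToyConversation> {
    val history = chat.history.toList()
    chat.close()
    val oneShot = engine.createConversation(emptyList())
    val title = try {
        oneShot.send("Title this chat: ${history.firstOrNull().orEmpty()}")
    } finally {
        oneShot.close() // always release the engine, even if generation fails
    }
    return title to engine.createConversation(history)
}
```

The caller has to replace its reference to the old conversation with the reopened one, and nothing else may touch the engine while the one-shot runs.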

Result

NativeLM v0.3 now has genuine multi-turn memory, a conversation-history drawer with model-generated titles (persisted with ObjectBox), real native Stop, and a signed, R8-minified release build. I verified it end-to-end on a real phone, where LiteRT-LM’s JNI, ObjectBox’s reflection, and the Gson tool-calling path all survive minification. The engine now ships its own consumer ProGuard rules, so the same holds for anyone who depends on it.

It’s all open source: litertlm-kmp. The engine is the asset; NativeLM is the proof.