May 30, 2026
Stateful KV-cache sessions for on-device Gemma on Android
How litertlm-kmp v0.3 makes multi-turn memory lossless and free — plus what an on-device CPU/GPU/NPU benchmark actually told me.
The first version of NativeLM — my on-device chat app — faked multi-turn memory. The engine created a fresh conversation per request, so it had no memory of the previous turn. To make the assistant “remember,” the app re-sent the entire conversation history as a context prompt on every message.
It worked. It was also wrong. Re-sending history means the model re-tokenizes and re-prefills the whole conversation every single turn. Time-to-first-token climbs as the chat grows, and to stay inside the context window you start silently dropping the oldest turns — so the model genuinely forgets things while appearing to remember. v0.3 fixes this properly.
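A rough cost model makes the problem concrete. The sketch below is illustrative only: it assumes every turn adds the same number of tokens (`tokensPerTurn`, a hypothetical parameter) and counts nothing but tokens pushed through prefill.

```kotlin
// Hypothetical cost model: counts prefilled tokens only, ignores decode.

/** Tokens prefilled on each turn when the full history is re-sent every time. */
fun resendPrefillPerTurn(turns: Int, tokensPerTurn: Int): List<Int> =
    (1..turns).map { turn -> turn * tokensPerTurn }

/** Tokens prefilled on each turn when the KV cache survives between turns. */
fun cachedPrefillPerTurn(turns: Int, tokensPerTurn: Int): List<Int> =
    List(turns) { tokensPerTurn }
```

For 20 turns of 100 tokens each, re-sending prefills 2,000 tokens on the last turn and 21,000 in total; with a live cache it is 100 on the last turn and 2,000 in total. Per-turn work grows linearly with chat length in the first case and stays constant in the second.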
The fix: keep the conversation alive
LiteRT-LM’s Conversation object already holds a KV cache. The bug was that the
engine wrapped every call in createConversation().use { … } — creating a new
conversation and closing it each turn, throwing the cache away. The fix is
almost embarrassingly small: keep one Conversation alive for the life of a
chat and call it repeatedly.
I wrapped that in an additive engine API:
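```kotlin
// Open one session per chat; prior turns are prefilled once, here.
val session = engine.openChatSession(history = priorTurns)
// Each turn prefills only the new message and streams state updates.
session.sendTurn(request).collect { state -> /* stream tokens */ }
session.cancel() // real native interrupt, not just cancelling the Flow
session.close()  // releases the conversation and its KV cache
```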
Prior turns are seeded via ConversationConfig.initialMessages and re-prefilled
once when the session opens (the app shows a “Building understanding…” state
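for that moment). After that, each turn prefills only the new message. Multi-turn
memory becomes lossless (no turns are dropped to make room for re-sent history, as long as the chat fits the context window) and free (nothing is re-prefilled per turn), and TTFT stays flat as the conversation grows.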
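The difference between the bug and the fix can be shown with a toy model. Everything here is hypothetical: `ToyConversation` is a stand-in that treats whitespace-separated words as tokens, not the LiteRT-LM `Conversation` API.

```kotlin
import java.io.Closeable

/** Hypothetical stand-in for a conversation that owns a KV cache. */
class ToyConversation : Closeable {
    private val cache = mutableListOf<String>() // tokens already prefilled
    var prefilledTokens = 0
        private set
    private var closed = false

    /** Prefills only the tokens passed in; earlier tokens stay cached. */
    fun send(message: String) {
        check(!closed) { "conversation is closed" }
        val tokens = message.split(" ").filter { it.isNotEmpty() }
        cache += tokens
        prefilledTokens += tokens.size
    }

    /** Closing discards the cache, which is what the per-call wrapper did. */
    override fun close() {
        cache.clear()
        closed = true
    }
}

/** The bug: a fresh conversation per turn, so the history is prefilled again. */
fun statelessTurn(history: List<String>, message: String): Int =
    ToyConversation().use { c ->
        (history + message).forEach { c.send(it) }
        c.prefilledTokens
    }

/** The fix: one conversation for the whole chat; returns tokens prefilled per turn. */
fun statefulChat(messages: List<String>): List<Int> =
    ToyConversation().use { c ->
        messages.map { m ->
            val before = c.prefilledTokens
            c.send(m)
            c.prefilledTokens - before
        }
    }
```

With the messages "a b c", "d e", "f", `statefulChat` prefills 3, 2, and 1 tokens on successive turns, while `statelessTurn` prefills all 6 to answer the third message.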
What the benchmark actually said
Before committing to a backend, I benchmarked CPU vs GPU vs NPU on the device, not in theory:
| Backend | Result |
|---|---|
| CPU (XNNPACK) | works · ~20 tok/s decode |
| GPU | Failed to create engine — can’t compile the community bundle |
| NPU | NOT_FOUND: TF_LITE_AUX not found — the bundle ships no NPU artifact |
So for the litert-community Gemma bundles, CPU is the only viable backend —
there is no free GPU/NPU win to capture without different model artifacts. A
thread-count sweep then showed that CPU(6), i.e. six threads, gave the best decode rate while leaving
the two prime cores free for the UI. The whole “should I use the GPU?” question
was answered with data in an afternoon rather than by assumption.
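A thread-count sweep like this can be sketched as a small harness. The names are hypothetical, and `measure` is an assumed hook that would build an engine with the given number of CPU threads, decode a fixed prompt, and return tokens per second; the engine wiring itself is not shown.

```kotlin
/** One sweep result: thread count and measured decode rate. */
data class SweepPoint(val threads: Int, val tokensPerSecond: Double)

/** Runs the (hypothetical) `measure` hook once per candidate thread count. */
fun sweepThreads(candidates: List<Int>, measure: (Int) -> Double): List<SweepPoint> =
    candidates.map { SweepPoint(it, measure(it)) }

/**
 * Picks the fastest point; ties go to fewer threads so more cores stay
 * free for the UI. Throws NoSuchElementException if `points` is empty.
 */
fun pickBest(points: List<SweepPoint>): SweepPoint =
    points.sortedWith(
        compareByDescending<SweepPoint> { it.tokensPerSecond }.thenBy { it.threads }
    ).first()
```

Breaking ties toward fewer threads is a design choice for an interactive app: a marginal decode gain is rarely worth starving the UI thread.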
The constraint that shaped the design
One thing the runtime enforces: only one conversation per engine at a time.
A second createConversation() while a session is open fails with
FAILED_PRECONDITION: A session already exists. That has a real consequence —
generating a conversation title (a one-shot, separate generation) can’t run
concurrently with the live chat session. The fix is a small dance: close the
session, run the one-shot title, reopen the session seeded with the same history.
Not elegant, but honest about what the runtime allows.
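The close, one-shot, reopen sequence can be sketched against a toy engine that enforces the same one-session rule. `ToyEngine`, `ToySession`, and `titleThenResume` are hypothetical names for illustration, not the litertlm-kmp API, and the title string is a placeholder for a real generation.

```kotlin
/** Hypothetical engine that allows one open session at a time. */
class ToyEngine {
    private var sessionOpen = false
    val log = mutableListOf<String>() // records open/close order

    fun openSession(history: List<String>): ToySession {
        check(!sessionOpen) { "FAILED_PRECONDITION: A session already exists" }
        sessionOpen = true
        log += "open(${history.size})"
        return ToySession(this, history)
    }

    internal fun release() {
        sessionOpen = false
        log += "close"
    }

    /** One-shot generation; it needs the engine's single slot for itself. */
    fun generateTitle(history: List<String>): String {
        val oneShot = openSession(history)
        try {
            return "Title for ${history.size} turns" // placeholder output
        } finally {
            oneShot.close()
        }
    }
}

class ToySession(private val engine: ToyEngine, val history: List<String>) {
    fun close() = engine.release()
}

/** Close the live session, run the one-shot, reopen with the same history. */
fun titleThenResume(engine: ToyEngine, live: ToySession): Pair<String, ToySession> {
    val history = live.history
    live.close()                                // free the single slot
    val title = engine.generateTitle(history)   // one-shot runs alone
    return title to engine.openSession(history) // history is prefilled again
}
```

The cost of this pattern is the reopen: the seeded history is prefilled once more, so it is worth running the title generation at a quiet moment rather than mid-conversation.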
Result
NativeLM v0.3 now has genuine multi-turn memory, a conversation-history drawer with model-generated titles (ObjectBox-persisted), real native Stop, and a signed, R8-minified release build. The build is verified end-to-end on a real phone, where LiteRT-LM’s JNI, ObjectBox’s reflection, and the Gson tool-calling path all survive minification. The engine now ships its own consumer ProGuard rules, so the same holds for anyone who depends on it.
It’s all open source: litertlm-kmp. The engine is the asset; NativeLM is the proof.