Flagship project · open source

litertlm-kmp + NativeLM

A Kotlin Multiplatform engine for running Gemma-class LLMs fully on-device, and NativeLM — a private local-AI chat app built on top of it to prove the whole stack works end-to-end. Offline. No account. No telemetry.

On-device streaming chatConversation history drawerPrivate by designModel management

What it is

Shipping a production on-device LLM is much harder than the samples make it look: a clean abstraction over the native SDK, resumable multi-GB model downloads with integrity checks, hardware-tier gating, OEM-RAM quirks, and a function-calling layer — all shaped to run identically across platforms. The engine solves that once; NativeLM consumes it like any real app would.

v0.3 — the engineering

  • Stateful KV-cache sessions. One live conversation per chat, so multi-turn memory is lossless and free — prefill is paid per new message, not re-paid for the whole history. Time-to-first-token stays flat as the chat grows.
  • Backend chosen from on-device data. CPU / GPU / NPU were benchmarked on a real device; GPU/NPU can't load the community Gemma bundles, so the engine runs CPU(6) — measured, not assumed.
  • Real native cancellation, conversation history, model-generated titles, and a signed, R8-minified release build (the engine ships its own consumer ProGuard rules).
// the additive engine API
val session = engine.openChatSession(history = priorTurns)
session.sendTurn(request).collect { state -> /* stream */ }
session.cancel()   // real native interrupt
session.close()

Backends, measured on-device

BackendResult
CPU (XNNPACK)works · ~20 tok/s decode
GPUcan't compile the community bundle
NPUbundle ships no NPU artifact

→ CPU is the only viable backend for these bundles; CPU(6) won the thread sweep.

Stack

Kotlin MultiplatformLiteRT-LMGemma 4Jetpack ComposeObjectBoxkotlin-injectKtorR8 / ProGuard