Wrapping Google's LiteRT-LM into a Kotlin Multiplatform engine

The engine origin story: how litertlm-kmp turns Google's LiteRT-LM into a clean KMP library, built on four core abstractions, a resumable download manager with SHA-256 verification, typed-Kotlin-to-OpenAPI function calling, and the thread discipline that keeps a non-thread-safe native runtime honest.

Before NativeLM was an app, it was an engine. litertlm-kmp is a Kotlin Multiplatform wrapper around Google’s LiteRT-LM for running Gemma-family models on-device, published as com.sagar:litertlm-kmp. This post is the origin story: the abstractions the library is built on, and the gnarly details it handles so you don’t have to.

Four core abstractions

The whole library is four interfaces, each with one job.

1. LocalAiEngine — generation. generateStream returns a cold Flow<EngineState<String>>, so callers stream tokens and observe lifecycle without blocking:

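import kotlinx.coroutines.flow.Flow

interface LocalAiEngine {
    val descriptor: EngineDescriptor
    // Loads the model at modelPath and reports the outcome as an EngineState.
    suspend fun initializeEngine(modelPath: String): EngineState<Unit>
    // Streams generation progress: token deltas or a tool call, plus lifecycle states.
    fun generateStream(request: AiEngineRequest): Flow<EngineState<String>>
    // Builds the final prompt from the query, retrieved context, and optional system instruction.
    fun formatPrompt(userQuery: String, retrievedContext: String, systemInstruction: String?): String
    // Frees the resources held by the engine.
    fun releaseResources()
}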

EngineState is a sealed hierarchy — Idle, Generating, TokenGenerated, ToolCallEmitted, Error — so the free-text path emits a TokenGenerated per delta, while the structured-output path emits ToolCallEmitted instead of streaming text.
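To make the two paths concrete, here is a minimal sketch of how a consumer might fold those states into display text. The shape of EngineState below is a guess for illustration only: the field names (delta, name, arguments, message) and the render helper are hypothetical, not the library's actual declarations.

```kotlin
// Hypothetical shape of the sealed hierarchy; the library's real fields may differ.
sealed interface EngineState<out T> {
    object Idle : EngineState<Nothing>
    object Generating : EngineState<Nothing>
    data class TokenGenerated<T>(val delta: T) : EngineState<T>
    data class ToolCallEmitted(val name: String, val arguments: Map<String, Any?>) : EngineState<Nothing>
    data class Error(val message: String) : EngineState<Nothing>
}

// Folds a run of states into display text (hypothetical helper).
// Free-text path: concatenates token deltas.
// Structured path: stops at the first tool call and reports its name.
fun render(states: List<EngineState<String>>): String {
    val text = StringBuilder()
    for (state in states) {
        when (state) {
            EngineState.Idle, EngineState.Generating -> Unit
            is EngineState.TokenGenerated -> text.append(state.delta)
            is EngineState.ToolCallEmitted -> return "tool:${state.name}"
            is EngineState.Error -> return "error:${state.message}"
        }
    }
    return text.toString()
}
```

Because the hierarchy is sealed, the compiler checks that the when covers every state, so adding a new state later surfaces as a compile error rather than a silently ignored event.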

2. EmbeddingEngine — vectors. A thin wrapper over MediaPipe’s TextEmbedder, returning a FloatArray per string for cosine similarity in a RAG pipeline. Dimension follows the bundled embedder (512 for USE, 768 for EmbeddingGemma).
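For reference, the cosine-similarity step on the consumer side can be sketched as below. The function name cosineSimilarity is illustrative and not part of the library; the zero-vector handling is one reasonable choice among several.

```kotlin
import kotlin.math.sqrt

// Cosine similarity between two embedding vectors of equal dimension.
// Returns 0 when either vector has zero magnitude, where the ratio is undefined.
fun cosineSimilarity(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "Dimension mismatch: ${a.size} vs ${b.size}" }
    var dot = 0.0
    var normA = 0.0
    var normB = 0.0
    for (i in a.indices) {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    if (normA == 0.0 || normB == 0.0) return 0f
    return (dot / (sqrt(normA) * sqrt(normB))).toFloat()
}
```

The dimension check matters in practice: vectors produced by different embedders (for example 512 versus 768 dimensions) are not comparable, so an index built with one embedder has to be rebuilt if the embedder changes.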

3. EngineRegistry — RAM-tier-aware selection. At init it consults HardwareProvider.effectiveRamMb() and hands back the right engine: E2B for 6–9 GB, E4B for 10 GB+, and nothing under 6 GB (so the consumer surfaces DeviceNotSupported instead of letting the OS kill a doomed load).
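The tiering rule can be sketched as a single function. ModelTier and selectTier are hypothetical names, and the sketch makes two assumptions the post does not spell out: 1 GB is counted as 1024 MB, and anything from 6 GB up to (but not including) 10 GB maps to E2B.

```kotlin
// Hypothetical tier type for illustration; not the library's actual API.
enum class ModelTier { E2B, E4B }

private const val MB_PER_GB = 1024L

// Maps effective RAM (in MB) to a model tier, or null when the device
// is below the minimum and should be reported as unsupported.
fun selectTier(effectiveRamMb: Long): ModelTier? = when {
    effectiveRamMb < 6 * MB_PER_GB -> null
    effectiveRamMb < 10 * MB_PER_GB -> ModelTier.E2B
    else -> ModelTier.E4B
}
```

Returning null instead of throwing leaves the decision to the caller, which can then surface DeviceNotSupported in whatever way fits its UI.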

4. ModelManager — resumable download + integrity. The unglamorous one that matters most in the field (more below).

The download manager nobody thinks about until it breaks

Multi-gigabyte model downloads on mobile networks fail constantly. So ModelManager is Ktor-backed and built for hostile conditions:

Resume via HTTP Range headers when a partial file already exists — you don’t re-pull 2 GB because the user walked into an elevator.
SHA-256 validation post-download (lowercase hex); a mismatch deletes the file and emits DownloadState.Error, so a corrupt model never reaches the engine.
Atomic temp → final move, so a half-downloaded file is never visible as a usable model.
A Flow<DownloadState> for progress UI.
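The resume, verify, and publish steps can be sketched with plain JVM APIs. This is not the library's code: the function names are hypothetical, the Ktor request itself is omitted, and java.nio.file is only available on Android from API 26, so treat it as an illustration of the technique rather than the implementation.

```kotlin
import java.io.File
import java.nio.file.Files
import java.nio.file.StandardCopyOption
import java.security.MessageDigest

// Value for the HTTP Range header when a partial file exists, else null.
fun rangeHeaderValue(partial: File): String? =
    if (partial.exists() && partial.length() > 0) "bytes=${partial.length()}-" else null

// Streams the file through SHA-256 and returns lowercase hex.
fun sha256Hex(file: File): String {
    val digest = MessageDigest.getInstance("SHA-256")
    file.inputStream().use { input ->
        val buffer = ByteArray(64 * 1024)
        while (true) {
            val read = input.read(buffer)
            if (read < 0) break
            digest.update(buffer, 0, read)
        }
    }
    return digest.digest().joinToString("") { "%02x".format(it) }
}

// Verifies the temp file, then publishes it with an atomic move.
// On a mismatch the temp file is deleted and false is returned.
// temp and target must live on the same filesystem for ATOMIC_MOVE.
fun verifyAndPublish(temp: File, target: File, expectedSha256: String): Boolean {
    if (sha256Hex(temp) != expectedSha256.lowercase()) {
        temp.delete()
        return false
    }
    Files.move(temp.toPath(), target.toPath(), StandardCopyOption.ATOMIC_MOVE)
    return true
}
```

A server that honors the Range header answers 206 Partial Content, and the client appends to the partial file. A 200 response means the server ignored the range, so the client has to start the file over.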

This same machinery later carried every model the product added — embedders, the Whisper speech model — without change. Boring, resumable, verified downloads are load-bearing infrastructure.

Function calling: typed Kotlin, not hand-written JSON

LiteRT-LM consumes tool definitions as OpenAPI 3.0 JSON. Hand-writing that is error-prone, so the library exposes an engine-agnostic ToolSchema.Definition and converts internally via toOpenApiJson():

val def = ToolSchema.Definition(
    name = "extract_event_details",
    description = "Extract structured event details.",
    parameters = listOf(
        ToolParameter("title", ToolParameterType.StringT, "Event title.", required = true),
        ToolParameter("attendees", ToolParameterType.ArrayT(ToolParameterType.StringT), "Names.", required = true),
        ToolParameter("duration_minutes", ToolParameterType.IntegerT, "Length.", required = true),
    ),
)

Two quirks worth knowing when you read the tool calls back: LiteRT-LM converts camelCase Kotlin parameter names to snake_case schema keys, and integer arguments can surface as Double (JSON number ambiguity). So coerce with a safe cast rather than assuming Int: (arguments["duration_minutes"] as? Number)?.toInt().
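Both quirks are easy to isolate in small helpers on the reading side. The names camelToSnake and intArg are hypothetical, and the conversion rule shown is a common approximation; the exact rule LiteRT-LM applies is not documented here and may differ for acronyms or digits.

```kotlin
// Approximates the camelCase -> snake_case key conversion so lookups
// can be derived from Kotlin property names (assumed rule, see above).
fun String.camelToSnake(): String =
    replace(Regex("([a-z0-9])([A-Z])")) { match ->
        "${match.groupValues[1]}_${match.groupValues[2]}"
    }.lowercase()

// Reads an integer argument whether it arrived as Int, Long, or Double.
// Returns null if the key is missing or the value is not numeric.
fun Map<String, Any?>.intArg(name: String): Int? =
    (this[name] as? Number)?.toInt()
```

For example, mapOf("duration_minutes" to 45.0).intArg("durationMinutes".camelToSnake()) yields 45, while a missing key or a string value yields null instead of throwing.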

Thread discipline: keeping a non-thread-safe runtime honest

LiteRT-LM is not thread-safe, and that single fact dictates the engine’s concurrency design. All native calls are serialized behind a Mutex — and, the subtle part, the mutex is held across LiteRT-LM’s async callback, so two racing generateStream calls can’t interleave their tokens into each other’s streams. ModelManager does its file I/O on Dispatchers.IO, and every public suspend function is safe to call from Dispatchers.Main.
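One way to hold a lock across an asynchronous callback is to acquire the Mutex inside callbackFlow and wait for the channel to close before releasing it. The sketch below uses a fake runtime: FakeNativeRuntime, SerializedEngine, and demo are all hypothetical, and this is not the library's actual implementation.

```kotlin
import kotlin.concurrent.thread
import kotlinx.coroutines.async
import kotlinx.coroutines.channels.awaitClose
import kotlinx.coroutines.flow.Flow
import kotlinx.coroutines.flow.callbackFlow
import kotlinx.coroutines.flow.toList
import kotlinx.coroutines.runBlocking
import kotlinx.coroutines.sync.Mutex
import kotlinx.coroutines.sync.withLock

// Hypothetical stand-in for a callback-based, non-thread-safe native runtime.
class FakeNativeRuntime {
    fun generateAsync(prompt: String, onToken: (String) -> Unit, onDone: () -> Unit) {
        thread {
            prompt.split(" ").forEach(onToken)
            onDone()
        }
    }
}

class SerializedEngine(private val runtime: FakeNativeRuntime) {
    private val nativeLock = Mutex()

    fun generateStream(prompt: String): Flow<String> = callbackFlow {
        // The lock is taken before the native call and only released after
        // onDone closes the channel, so one generation is in flight at a time.
        nativeLock.withLock {
            runtime.generateAsync(
                prompt,
                onToken = { token -> trySend(token) }, // default buffer; fine for a short demo
                onDone = { close() },
            )
            awaitClose() // suspends until close() or collector cancellation
        }
    }
}

// Starts two generations concurrently; the second waits for the first to finish.
fun demo(): Pair<List<String>, List<String>> = runBlocking {
    val engine = SerializedEngine(FakeNativeRuntime())
    val first = async { engine.generateStream("a1 a2 a3").toList() }
    val second = async { engine.generateStream("b1 b2 b3").toList() }
    first.await() to second.await()
}
```

A real engine also has to stop the native generation when the collector cancels, typically inside the awaitClose block. Otherwise the lock is released while the runtime is still busy.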

There’s one production trap worth its own paragraph. When you collect the token flow inside a try/catch, a broad catch (e: Exception) will also swallow the CancellationException thrown when the user taps “Stop” — turning a cancelled generation into a fake engine fault and breaking structured-concurrency semantics:

import kotlinx.coroutines.CancellationException

try {
    engine.generateStream(request).collect { state -> /* ... */ }
} catch (ce: CancellationException) {
    throw ce                 // never swallow; let cancellation propagate
} catch (e: Exception) {
    showError(e)             // real faults only
}

That’s cooperative cancellation in practice: catch only what you actually mean to handle, and always let CancellationException through.

The one gotcha to read if you read nothing else

The hardest-won lesson lives in AndroidHardwareProvider: OEM “virtual RAM” features inflate the RAM the OS reports, so an 8 GB phone can claim 14 GB and happily load a model that physically can’t fit. The library reads physical RAM from /proc/meminfo and caps accordingly. It’s important enough that it got its own post.
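The capping idea can be sketched as a pure function over the contents of /proc/meminfo. Reading the MemTotal field and taking the minimum of the two figures are assumptions about the approach, and the function names are hypothetical.

```kotlin
// Extracts MemTotal from /proc/meminfo text and converts kB to MB.
// Returns null if the field is missing or malformed.
fun parseMemTotalMb(meminfo: String): Long? {
    val line = meminfo.lineSequence().firstOrNull { it.startsWith("MemTotal:") } ?: return null
    val kb = Regex("""\d+""").find(line)?.value?.toLongOrNull() ?: return null
    return kb / 1024
}

// Caps the OS-reported figure at the physical figure when one is available.
fun cappedRamMb(reportedMb: Long, meminfo: String): Long {
    val physicalMb = parseMemTotalMb(meminfo) ?: return reportedMb
    return minOf(reportedMb, physicalMb)
}
```

On a device, the meminfo argument would come from something like File("/proc/meminfo").readText(). Keeping the parser free of file access makes it easy to unit-test against captured samples from problem devices.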

litertlm-kmp shipped with the Android target production-ready and iOS bindings deferred. But the shape was right from v0.1: a clean engine contract, a paranoid download manager, typed function calling, and honest hardware tiering — the foundation everything since (RAG, OCR, Studio, sync, voice) was built on.

litertlm-kmp is open-source (AGPL-3.0).