What's new in NativeLM v0.10.0: answering from the right document

v0.10 is a retrieval release — an optional EmbeddingGemma embedder tiered to your device, hybrid dense + lexical search, a flagship reranker, and a set of grounding fixes that stop the model answering from the wrong file. Still fully local, no account, no upload, no telemetry.

NativeLM is on-device document chat: ask questions grounded in your own files, get answers with citations, all running locally on your phone. v0.9 made it richer to use — charts in chat, an adaptive multi-device UI — and pulled the AI core out into a reusable Kotlin Multiplatform library. v0.10 goes after the part that matters most for document chat: getting the answer from the right document, and making the retrieval that finds it both smarter and aware of the device it’s running on.

The bug that started it: grounding on the wrong document

Here’s a real failure from testing. Asked “what’s my car insurance premium?” with both a car policy and a life policy imported, NativeLM answered ₹41,799, the life policy’s figure, instead of the actual car premium of ₹8,504. The life PDF’s wording simply out-scored the car PDF on the shared tokens (“insurance”, “premium”, “policy”), so lexical retrieval pulled the wrong chunks and the model dutifully grounded on them.

The same class of bug showed up for “who is the insurer of my car policy?”, which was answered from a health policy whose formal “…insurer” phrasing out-scored the car document, even though the right answer (TATA AIG) was sitting in the car policy.

v0.10 fixes this with two guards in the retriever:

A document-level dominance gate — when one source genuinely dominates the candidate set, grounding stays on it instead of getting polluted by a few high-scoring stray chunks from another file.
A title-match override — when a distinctive query term names a document by its title, retrieval grounds on that document.

Wrong-document answers like the ₹41,799 case now resolve correctly.
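
To make the two guards concrete, here is a minimal sketch of how they could compose. It is not the retriever's actual code: the Candidate type, the function names, the 0.6 share threshold, the "a term is distinctive if it appears in exactly one title" rule, and the choice to check the title override before the dominance gate are all assumptions made for illustration.

```kotlin
// Hypothetical candidate shape; the real retriever's types are not described in this post.
data class Candidate(val docId: String, val docTitle: String, val text: String, val score: Double)

val WORD_SPLIT = Regex("[^\\p{L}\\p{N}]+")

// Lowercased word set, dropping very short tokens such as "s" or "my".
fun tokens(s: String): Set<String> =
    s.lowercase().split(WORD_SPLIT).filter { it.length > 2 }.toSet()

// Title-match override: a query term counts as distinctive when it appears in
// exactly one candidate document's title. Returns that document only if every
// distinctive term agrees on it.
fun titleMatchOverride(query: String, candidates: List<Candidate>): String? {
    val titleTokens = candidates.associate { it.docId to tokens(it.docTitle) }
    val named = tokens(query).mapNotNull { term ->
        titleTokens.filterValues { term in it }.keys.singleOrNull()
    }.toSet()
    return named.singleOrNull()
}

// Dominance gate: if one document holds at least minShare of the total
// candidate score, grounding stays on that document.
fun dominantDocument(candidates: List<Candidate>, minShare: Double = 0.6): String? {
    val total = candidates.sumOf { it.score }
    if (total <= 0.0) return null
    val best = candidates.groupBy { it.docId }
        .mapValues { (_, chunks) -> chunks.sumOf { it.score } }
        .maxByOrNull { it.value } ?: return null
    return if (best.value / total >= minShare) best.key else null
}

// Assumed precedence: an explicit title match wins over score dominance.
fun groundingSet(query: String, candidates: List<Candidate>): List<Candidate> {
    val pinned = titleMatchOverride(query, candidates) ?: dominantDocument(candidates)
    return if (pinned == null) candidates else candidates.filter { it.docId == pinned }
}
```

In this sketch, a query containing “car” pins grounding to a document titled “Car Policy” even when a life policy out-scores it, which is the shape of the ₹41,799 failure described above.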

Hybrid retrieval: dense + lexical, fused

Underneath those guards, retrieval itself got stronger. Document search now fuses dense vector similarity with BM25 lexical scoring using Reciprocal Rank Fusion (RRF), so a query is matched both by meaning and by exact terms. A per-document cap keeps any single large source from filling every top-k slot, the candidate pools are wider, and the grounding budget handed to the model is larger.

This ships independently of the embedder upgrade below: even on the lightweight default embedder, hybrid fusion and the per-document cap make retrieval noticeably more robust.
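
For readers who haven't met RRF before, the sketch below shows the general technique plus a per-document cap. Each ranked list contributes 1 / (k + rank) to a chunk's fused score, so only rank positions matter and the dense and BM25 scores never need to share a scale. The Hit type, the function names, and k = 60 (the constant commonly used with RRF) are assumptions here, not values taken from NativeLM.

```kotlin
// Hypothetical hit type: a chunk and the document it came from.
data class Hit(val chunkId: String, val docId: String)

// Reciprocal Rank Fusion: fused(c) = sum over rankings of 1 / (k + rank(c)),
// with rank starting at 1. Chunks missing from a ranking contribute nothing.
fun reciprocalRankFusion(rankings: List<List<Hit>>, k: Int = 60): List<Pair<Hit, Double>> {
    val scores = LinkedHashMap<Hit, Double>()
    for (ranking in rankings) {
        ranking.forEachIndexed { index, hit ->
            scores[hit] = (scores[hit] ?: 0.0) + 1.0 / (k + index + 1)
        }
    }
    return scores.entries.sortedByDescending { it.value }.map { it.key to it.value }
}

// Per-document cap: walk the fused list in order and skip chunks from any
// document that has already filled maxPerDoc of the topK slots.
fun capPerDocument(fused: List<Hit>, topK: Int, maxPerDoc: Int): List<Hit> {
    val taken = HashMap<String, Int>()
    val out = ArrayList<Hit>()
    for (hit in fused) {
        val used = taken[hit.docId] ?: 0
        if (used >= maxPerDoc) continue
        taken[hit.docId] = used + 1
        out.add(hit)
        if (out.size == topK) break
    }
    return out
}
```

A typical call would pass the dense ranking and the BM25 ranking as the two lists, then apply the cap to the fused order before handing chunks to the model.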

EmbeddingGemma, tiered to your device — and still telemetry-free

The default retrieval embedder has been the 100-dimension USE-Lite. v0.10 adds an optional upgrade to EmbeddingGemma-300M, run on ONNX Runtime — deliberately not on a stack that drags in Google/Play telemetry dependencies, so the zero-telemetry promise holds.

The neat part is that one downloaded model serves both EmbeddingGemma tiers via Matryoshka truncation: the same weights produce a 256- or 512-dimension embedding depending on how much headroom the device has. A recommendation engine picks the embedder and dimension by RAM:

< 6 GB → USE-Lite (fast, tiny)
6–9 GB → EmbeddingGemma @ 256 dims
≥ 10 GB → EmbeddingGemma @ 512 dims + reranker
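
As a rough sketch of the tiering and the truncation step: the types and function names below are hypothetical, and treating everything from 6 GB up to (but not including) 10 GB as the middle tier is an assumption, since the table doesn't say where a device between 9 and 10 GB lands. Matryoshka truncation itself is simple: keep the first N components of the full vector, then re-normalize so cosine similarity still behaves.

```kotlin
import kotlin.math.sqrt

enum class Embedder { USE_LITE, EMBEDDING_GEMMA }

// Hypothetical result type for the recommendation engine.
data class Recommendation(val embedder: Embedder, val dims: Int, val reranker: Boolean)

// Tier boundaries follow the table above; the handling of 9 to 10 GB is assumed.
fun recommend(ramGb: Double): Recommendation = when {
    ramGb < 6.0 -> Recommendation(Embedder.USE_LITE, 100, reranker = false)
    ramGb < 10.0 -> Recommendation(Embedder.EMBEDDING_GEMMA, 256, reranker = false)
    else -> Recommendation(Embedder.EMBEDDING_GEMMA, 512, reranker = true)
}

// Matryoshka truncation: take the leading dims components and L2-normalize them.
fun truncateMatryoshka(full: FloatArray, dims: Int): FloatArray {
    require(dims in 1..full.size) { "dims must be between 1 and ${full.size}" }
    val head = full.copyOf(dims)
    val norm = sqrt(head.fold(0.0) { acc, x -> acc + x * x })
    if (norm == 0.0) return head
    return FloatArray(dims) { (head[it] / norm).toFloat() }
}
```

Because truncation happens after inference, the same model output can be stored at 256 or 512 dimensions without running the model twice.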

The recommended option is surfaced as a “Recommended” badge in the in-app model catalogue. As with every model in NativeLM, EmbeddingGemma and its companion files (the weights blob and tokenizer) download on-device through the catalogue — nothing is bundled into the APK.

A reranker for the high end

On flagship-class devices, v0.10 adds an optional cross-encoder reranker (ms-marco-MiniLM-L6) as a second stage: after fusion produces a shortlist, the reranker re-scores the top candidates directly against the query. It’s gated to high-RAM devices because it’s the most expensive stage — but where it runs, it earns its keep. In testing, enabling it recovered a relevant chunk that first-stage fusion had ranked just out of reach on a health-insurer question.
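
The second stage can be pictured roughly like this. The CrossEncoder interface stands in for the ONNX-backed model and is purely hypothetical, as are the Scored type and the rerankDepth default; the point is only that the reranker sees the query and passage together and re-orders the head of the fused shortlist.

```kotlin
// Hypothetical shortlist entry carrying the first-stage (fusion) score.
data class Scored(val chunkId: String, val text: String, val score: Double)

// Stand-in for the cross-encoder model: scores one (query, passage) pair.
fun interface CrossEncoder {
    fun score(query: String, passage: String): Double
}

// Re-score only the top rerankDepth candidates, since each pair costs a full
// model pass. The tail keeps its fusion order and stays below the reranked
// head; its scores are on a different scale and are not comparable.
fun rerank(
    query: String,
    shortlist: List<Scored>,
    encoder: CrossEncoder,
    rerankDepth: Int = 20
): List<Scored> {
    val head = shortlist.take(rerankDepth)
        .map { it.copy(score = encoder.score(query, it.text)) }
        .sortedByDescending { it.score }
    return head + shortlist.drop(rerankDepth)
}
```

Any chunk that fusion placed inside the rerank window can be promoted to the top, which is how a relevant chunk ranked just out of reach can be recovered.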

Pure-Kotlin tokenizers, validated bit-for-bit

EmbeddingGemma needs BPE tokenization; the reranker needs BERT WordPiece. Rather than pull in a native tokenizer library, v0.10 ships both tokenizers in pure Kotlin, reading the standard HuggingFace tokenizer.json directly. They’re validated bit-for-bit against the reference transformers tokenizer, so retrieval quality matches the model’s training-time assumptions — and because there’s no native dependency, the tokenizers stay KMP-portable along with the rest of the engine.
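
To give a feel for what a pure-Kotlin tokenizer involves, here is a sketch of the core WordPiece step: greedy longest-match-first splitting of a single word against a vocabulary, with "##" marking continuation pieces. It is a simplified illustration, not the shipped tokenizer. It takes the vocabulary as a plain set rather than parsing tokenizer.json, and it leaves out the normalization and pre-tokenization (lowercasing, punctuation splitting) that run before this step. The function name and defaults are assumptions.

```kotlin
// Greedy longest-match-first WordPiece for one pre-tokenized word.
// Returns [unk] if any part of the word cannot be covered by the vocabulary.
fun wordPiece(
    word: String,
    vocab: Set<String>,
    unk: String = "[UNK]",
    maxChars: Int = 100
): List<String> {
    if (word.length > maxChars) return listOf(unk)
    val pieces = ArrayList<String>()
    var start = 0
    while (start < word.length) {
        var end = word.length
        var match: String? = null
        // Try the longest remaining substring first, shrinking from the right.
        while (end > start) {
            val candidate = (if (start > 0) "##" else "") + word.substring(start, end)
            if (candidate in vocab) {
                match = candidate
                break
            }
            end--
        }
        if (match == null) return listOf(unk)
        pieces.add(match)
        start = end
    }
    return pieces
}
```

Matching a reference tokenizer exactly mostly comes down to the details this sketch skips, which is why validating against the reference output matters.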

Embeddings are now also task-aware: the engine distinguishes a query from a document using instruction prefixes, which EmbeddingGemma’s asymmetric retrieval requires to perform well.
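
In code, task-aware embedding amounts to prepending a different instruction string before tokenization. The sketch below uses the retrieval prompt formats as I understand them from the EmbeddingGemma model card; treat the exact strings as an assumption and check them against the model card before relying on them. The EmbedTask enum and the function name are hypothetical.

```kotlin
// Hypothetical task marker for the two sides of asymmetric retrieval.
enum class EmbedTask { QUERY, DOCUMENT }

// Prefix strings assumed from the EmbeddingGemma model card; verify before use.
fun withTaskPrefix(task: EmbedTask, text: String, title: String? = null): String =
    when (task) {
        EmbedTask.QUERY -> "task: search result | query: $text"
        EmbedTask.DOCUMENT -> "title: ${title ?: "none"} | text: $text"
    }
```

Chunks would be embedded once with the document prefix at index time, and each incoming question with the query prefix at search time.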

Answers that stay full-length

A subtle but maddening bug: grounded replies would collapse to 1–2 tokens after a few turns of conversation. The cause was the stateful LiteRT-LM KV cache accumulating each turn’s grounding block, until the context was so saturated the model had nothing left to say.

The fix is a per-grounded-turn session reset that re-prefills only a bounded slice of visible history (MAX_PREFILL_TURNS = 16). Grounded answers stay full-length no matter how long the chat runs.
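
The idea can be sketched without touching the LiteRT-LM API at all. Assume the app keeps a history of what was actually shown in the chat (no grounding blocks), and that before each grounded turn it starts a fresh session and prefills it from that history. The Turn type and function name are hypothetical, and whether the limit counts individual messages or question/answer pairs is not stated in this post; the sketch counts messages.

```kotlin
const val MAX_PREFILL_TURNS = 16

// Hypothetical chat turn. Stored history holds only what the user saw,
// never the grounding block that was injected for that turn.
data class Turn(val role: String, val text: String)

// Build the input for a fresh session: a bounded slice of visible history,
// followed by the current question with this turn's grounding attached.
fun prefillForGroundedTurn(
    history: List<Turn>,
    groundingBlock: String,
    question: String
): List<Turn> {
    val recent = history.takeLast(MAX_PREFILL_TURNS)
    return recent + Turn("user", "$groundingBlock\n\n$question")
}
```

Because earlier grounding blocks are never replayed, context use per turn is bounded by the history slice plus one grounding block, however long the conversation gets.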

Backups and migrations that just work

Changing your embedder used to be a problem for existing data. Now it’s handled automatically:

Self-healing re-index — when the embedder or dimension changes, each source re-indexes into the active embedder’s index on next open, straight from the stored chunk text. No re-import, no re-running OCR. The old USE-Lite index is kept as a fallback.
Backups carry every dimension (schema v2) — a backup now includes chunk embeddings from every embedder dimension and tags each chunk with its dim. A cross-embedder restore that older versions would have rejected now simply re-indexes from the included text instead.
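
Both behaviours reduce to the same lookup: use the stored vector if one exists for the active dimension, otherwise re-embed from the stored text. A minimal sketch follows, with a hypothetical BackupChunk shape and function name; the actual schema v2 layout is not described here.

```kotlin
// Hypothetical chunk record: the text plus any embeddings, keyed by dimension.
data class BackupChunk(
    val id: String,
    val text: String,
    val vectorsByDim: Map<Int, List<Float>>
)

// Reuse the stored vector for the active dimension when present; otherwise
// rebuild it from the chunk text. No re-import or OCR is needed because the
// text itself travels with the chunk.
fun vectorFor(
    chunk: BackupChunk,
    activeDim: Int,
    embed: (String) -> List<Float>
): List<Float> =
    chunk.vectorsByDim[activeDim]
        ?: embed(chunk.text).also {
            require(it.size == activeDim) { "embedder returned ${it.size} dims, expected $activeDim" }
        }
```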

Try it

NativeLM v0.10.0 is live — open source, AGPL-3.0, no telemetry, no account, no upload.

Source: the litertlm-kmp repository.
Grab the signed APK and try it on your own documents — import a couple of overlapping PDFs and ask the kind of question that used to trip retrieval up.
Building something on-device? The retrieval stack — hybrid fusion, the embedders, the pure-Kotlin tokenizers, the reranker — lives in the engine library (com.sagar:litertlm-kmp), independently consumable from your own Kotlin Multiplatform app.