← Writing

Jun 2, 2026

Chatting with scanned documents: on-device OCR (no cloud)

NativeLM v0.5.0 reads scanned PDFs and photos with on-device OCR, and blends keyword + vector search so exact terms actually get retrieved — all without an image ever leaving the phone.

on-device-llmandroidocrragvector-searchprivacy

NativeLM can already chat with your documents, fully on-device. But there was a quiet gap: it only worked on documents with a real text layer. Drop in a scanned contract, a photo of a page, or a screenshot and the app extracted… nothing.

That’s a problem, because for the people NativeLM is built for — anyone whose documents are too sensitive to upload — the confidential material is very often exactly that: scans, not clean digital PDFs. So v0.5.0 adds OCR, and does it the only way consistent with the product’s promise: on the device.

Two paths, one OCR core

NativeLM answering questions grounded in OCR'd sources — a photographed memo and an image-only PDF

OCR shows up in two places, sharing the same recognizer:

  • Images (JPG/PNG/HEIC…): the picker now accepts images; we decode the bitmap (downsampled so a 12 MP photo doesn’t blow up memory) and run it through OCR.
  • Scanned PDFs: for each page we first try PDFBox’s text layer. If a page comes back empty, we render it to a bitmap with Android’s built-in PdfRenderer and OCR that. So a mixed PDF — some digital pages, some scanned — just works, and a fully-digital PDF pays nothing because the render/OCR path only opens lazily when there’s actually a scan to read.

The recognizer is ML Kit’s on-device text recognizer with the bundled model. It runs locally, downloads nothing at runtime, and the image never touches the network.

class MlKitOcrEngine : OcrEngine {
    private val recognizer =
        TextRecognition.getClient(TextRecognizerOptions.DEFAULT_OPTIONS)

    override suspend fun recognize(bitmap: Bitmap): String =
        suspendCancellableCoroutine { cont ->
            recognizer.process(InputImage.fromBitmap(bitmap, 0))
                .addOnSuccessListener { cont.resume(it.text) }
                .addOnFailureListener { cont.resumeWithException(it) }
        }
}

Once we have the text, it flows into the same pipeline as everything else — chunk, embed with USE-Lite, store in the ObjectBox HNSW index — and the scanned page becomes just another grounded, citable source.

Why on-device OCR, not a cloud API

This is the part that matters. The easy way to add OCR is to call Google Vision or AWS Textract — a few lines, excellent accuracy. But it means uploading the exact confidential documents the user came to NativeLM specifically so they wouldn’t have to upload. A scanned contract sent to a third-party OCR endpoint is the same privacy breach as sending it to a cloud LLM. Doing OCR on-device keeps the guarantee whole: scanned or not, your documents never leave the phone.

The bug OCR exposed: retrieval that buries exact matches

Testing OCR surfaced a different problem. I imported a scanned memo (codename “ZEPHYR NINE”, lead engineer “Anita Desai”) into a project that already held a thousand-odd chunks from other documents, then asked “Who is Anita Desai and what is ZEPHYR NINE?”

The answer: “The context provided does not mention Anita Desai or ZEPHYR NINE.”

The OCR was perfect — the text was indexed correctly. The failure was retrieval. USE-Lite is a tiny 100-dimensional embedder; it’s semantically fuzzy, and in a large project a single chunk containing an exact term can rank below loosely-related text. Pure vector search was losing the needle.

Hybrid retrieval: keyword + vector

The fix is to stop relying on vectors alone. v0.5.0 runs hybrid retrieval:

  • a vector arm — the existing HNSW nearest-neighbor search, gated on cosine distance, and
  • a keyword arm — BM25 over the chunks that actually contain the query terms (we only load those, via a contains query, so it scales regardless of corpus size).

The two rankings are merged with Reciprocal Rank Fusion, which blends ranks rather than raw scores — so we never have to reconcile cosine distance against BM25 magnitudes:

// each arm contributes 1/(k + rank); agreement across arms compounds
fun reciprocalRankFusion(rankings: List<List<Long>>, k: Int = 60): List<Long> {
    val fused = HashMap<Long, Double>()
    for (ranking in rankings)
        ranking.forEachIndexed { i, id -> fused.merge(id, 1.0 / (k + i + 1), Double::plus) }
    return fused.entries.sortedByDescending { it.value }.map { it.key }
}

Same question, same project, same data, after the change:

Anita Desai is the Lead engineer for the project codename ZEPHYR NINE. The approved budget for the project is 47 million rupees.

The keyword arm recovered the exact match the embedder missed, and normal semantic questions still ground correctly — both arms stay relevance-gated, so an off-topic question still falls back to ordinary chat instead of citing junk.

The honest edges

  • Latin/English in v1. Other scripts are a model swap away, not shipped yet.
  • OCR’d pages show the cited passage as a callout, not an in-page highlight — a rendered scan has no underlying glyph boxes to draw on (digital PDFs do get the in-place highlight).
  • Hybrid retrieval closes a lot of the ranking gap, but the embedder itself is still the small USE-Lite. A stronger on-device embedder (EmbeddingGemma) is the next step.

Try it out

NativeLM v0.5.0 is live — open source, AGPL-3.0, no telemetry, no account, no upload, and now no text layer required.

Comments