
Jun 1, 2026

Shipping on-device RAG: Building NativeLM for Android

How we implemented fully offline document RAG using MediaPipe's USE-Lite and ObjectBox HNSW vector search to ground Gemma's chat answers in imported PDFs.

Tags: on-device-llm, android, rag, vector-search, gemma

Running Gemma 4 on-device is great, but an LLM is only as useful as the context it has access to. The most requested feature for NativeLM since the initial launch has been document grounding: the ability to import a PDF and ask questions about it, turning the app into a fully private, on-device AI assistant.

With the latest release (v0.4.0), we’ve shipped NativeLM’s core document RAG (Retrieval-Augmented Generation) pipeline. Here is how we assembled it to run entirely on-device using LiteRT-LM, MediaPipe, and ObjectBox.

The Architecture: Import → Embed → Grounded Chat

[Figure: NativeLM Document RAG demo]

A traditional cloud RAG architecture sends documents to a server to be chunked, calls the OpenAI Embeddings API, stores vectors in Pinecone, and retrieves them for an OpenAI chat completion.

Our pipeline does all of this locally on the phone.

1. Extraction and Chunking

First, we use PDFBox to extract text from imported PDFs or raw text files. The AndroidTextExtractor parses the document and feeds it into our TextChunker, which splits the text into 500-character chunks with a 50-character overlap to preserve context across boundaries.
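To make the chunking step concrete, here is a minimal sketch of a fixed-size character chunker with overlap. The function name `chunkText` and its parameters are hypothetical and are not taken from NativeLM's actual TextChunker; only the 500/50 defaults come from the description above.

```kotlin
// Hypothetical helper (not NativeLM's TextChunker): splits text into
// fixed-size character windows where each window repeats the last
// `overlap` characters of the previous one.
fun chunkText(text: String, chunkSize: Int = 500, overlap: Int = 50): List<String> {
    require(chunkSize > 0) { "chunkSize must be positive" }
    require(overlap in 0 until chunkSize) { "overlap must be in [0, chunkSize)" }
    if (text.isEmpty()) return emptyList()

    val step = chunkSize - overlap          // how far the window advances each time
    val chunks = mutableListOf<String>()
    var start = 0
    while (start < text.length) {
        val end = minOf(start + chunkSize, text.length)
        chunks += text.substring(start, end)
        if (end == text.length) break       // avoid a trailing chunk that only repeats overlap
        start += step
    }
    return chunks
}
```

With the defaults, a 1,200-character input yields three chunks of 500, 500, and 300 characters, starting at offsets 0, 450, and 900. A production chunker would likely also prefer to break on sentence or paragraph boundaries; this sketch splits purely by character count.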

2. Embedding with USE-Lite

To perform semantic search, we need to convert those text chunks into vector embeddings. We wired up MediaPipe’s TextEmbedder running the Universal Sentence Encoder Lite (USE-Lite) model. The model is small (roughly a 6 MB download) and produces 100-dimensional embeddings, which keeps storage at 400 bytes per chunk as 32-bit floats while still capturing enough semantic meaning for retrieval.

3. Vector Search with ObjectBox HNSW

The embeddings need to be queryable with low latency during a chat. We used ObjectBox, which natively supports HNSW (Hierarchical Navigable Small World) vector search on edge devices.

Our database entity looks like this:

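import io.objectbox.annotation.Entity
import io.objectbox.annotation.HnswIndex
import io.objectbox.annotation.Id

@Entity
data class DocumentChunkEntity(
    @Id var id: Long = 0,
    // ID of the imported document this chunk came from
    var documentId: Long = 0,
    // Raw chunk text, shown to the user as a citation
    var text: String = "",
    // USE-Lite embedding; dimensions must match the model's output size
    @HnswIndex(dimensions = 100)
    var embedding: FloatArray? = null
)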

When a user asks a question, we embed the query using the same USE-Lite model, then run a kNN (k-Nearest Neighbors) search against ObjectBox to retrieve the top matching chunks across their imported documents.
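To illustrate what the retrieval step computes, here is a small sketch of exact nearest-neighbor search in plain Kotlin. It is a brute-force stand-in, not HNSW and not the ObjectBox query API: HNSW returns approximately the same neighbors without scanning every vector. The `StoredChunk` class and the function names are hypothetical, and Euclidean distance is an assumption, since the post does not say which metric the index uses.

```kotlin
// Hypothetical in-memory stand-in for a stored chunk; not the ObjectBox entity.
class StoredChunk(val text: String, val embedding: FloatArray)

// Squared Euclidean distance; smaller means more similar.
// (Assumed metric: the post does not state which one the index uses.)
fun squaredDistance(a: FloatArray, b: FloatArray): Float {
    require(a.size == b.size) { "vectors must have the same dimension" }
    var sum = 0f
    for (i in a.indices) {
        val d = a[i] - b[i]
        sum += d * d
    }
    return sum
}

// Exact kNN by scanning every chunk: O(n) per query.
// An HNSW index approximates this result with far fewer comparisons.
fun nearestChunks(
    query: FloatArray,
    chunks: List<StoredChunk>,
    k: Int
): List<Pair<StoredChunk, Float>> =
    chunks.map { it to squaredDistance(query, it.embedding) }
        .sortedBy { it.second }
        .take(k)
```

A linear scan like this is fine for a few hundred chunks, but its cost grows with every imported document, which is the motivation for using an index.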

4. Grounding and Citations

Finally, the retrieved passages are injected into the prompt via the RagContextFormatter. We impose a strict 4 KB cap on the injected context and wrap it in explicit delimiters to reduce the risk of prompt injection from document text. The engine then prompts Gemma to answer the user’s question using only the provided context, and the UI renders the retrieved chunks as citations at the bottom of the message.
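As a rough sketch of the capping and fencing idea, the function below packs passages into a delimited block and stops once a byte budget is reached. The delimiter strings, the `[removed]` placeholder, and the function name are all hypothetical; the post does not show RagContextFormatter's actual markers or truncation policy. Only the 4 KB figure comes from the text above.

```kotlin
// Hypothetical delimiters; the real formatter's markers are not shown in the post.
val CONTEXT_OPEN = "<<<CONTEXT"
val CONTEXT_CLOSE = "CONTEXT>>>"
val MAX_CONTEXT_BYTES = 4 * 1024

// Wraps passages in a fenced block, keeping the passage text within maxBytes
// (UTF-8). The delimiters themselves are not counted against the budget.
fun formatContext(passages: List<String>, maxBytes: Int = MAX_CONTEXT_BYTES): String {
    val body = StringBuilder()
    var usedBytes = 0
    for ((index, passage) in passages.withIndex()) {
        // Replace delimiter look-alikes so document text cannot close the fence early.
        val safe = passage
            .replace(CONTEXT_OPEN, "[removed]")
            .replace(CONTEXT_CLOSE, "[removed]")
        val entry = "[${index + 1}] $safe\n"
        val entryBytes = entry.toByteArray(Charsets.UTF_8).size
        // Assumed policy: drop passages that do not fit rather than cut one mid-sentence.
        if (usedBytes + entryBytes > maxBytes) break
        body.append(entry)
        usedBytes += entryBytes
    }
    return "${CONTEXT_OPEN}\n${body}${CONTEXT_CLOSE}"
}
```

The numbered `[1]`, `[2]` prefixes give the model something to refer to and map naturally onto citations in the UI. Delimiters lower the chance that document text is read as instructions, but they are not a guarantee, so the system prompt should still tell the model to treat the fenced block as untrusted data.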

Everything stays on the phone. No API keys, no cloud syncing, zero data leakage.

The R8 Minification Trap

Shipping this uncovered a gnarly issue: the pipeline worked perfectly in debug builds but crashed immediately in the minified release build.

The culprit? MediaPipe logs via FluentLogger (Flogger). Flogger’s forEnclosingClass() method attempts to walk the stack trace to find the calling class. In a minified R8 release build, class names are obfuscated (e.g., a.b.c), causing the stack-walker to fail and throw an exception deep inside MediaPipe’s initialization.

The fix required targeted ProGuard keep rules for MediaPipe, Flogger, and Protobuf. We’ve now added these to the library’s consumer-rules.pro so that any app consuming litertlm-kmp will automatically inherit them and survive minification safely.

Try it out

The document RAG feature is live in NativeLM right now.