The OCR library that phoned home: restoring NativeLM's zero-telemetry guarantee

Google's ML Kit gave NativeLM on-device OCR, and quietly bundled a datatransport pipeline that auto-initializes on startup and schedules diagnostic uploads to firebaselogging.googleapis.com. Here's how we found it and stripped it out with three manifest entries.

The headline promise of NativeLM is blunt: no cloud, no accounts, zero telemetry. Your documents never leave the phone. That promise is the entire product — for the people NativeLM is built for (lawyers, clinicians, anyone whose files are too sensitive to upload), it’s the only thing that matters.

While preparing v0.6.1 we discovered we had quietly broken it. Not in our code, but in a dependency. This is the story of how a Google library we added for on-device OCR started phoning home behind our back, and how we proved it and shut it off.

The feature that opened the hole

Back in v0.5.0 we shipped on-device OCR so NativeLM could read scanned PDFs and photos. The recognizer is ML Kit’s on-device text recognizer with the bundled model — it runs locally, downloads nothing at runtime, and the image never touches the network. The OCR itself is genuinely on-device. That part was never the problem.

The problem is what came along for the ride.

ML Kit transitively depends on com.google.android.datatransport — Google’s general-purpose telemetry runtime (the same “CCT” / Firebase logging pipeline used across Google’s SDKs). And datatransport doesn’t wait to be asked. It registers itself through the merged Android manifest and auto-initializes on app startup, then schedules background jobs that upload usage and diagnostic events to firebaselogging.googleapis.com.

So the shape of the bug was nasty: we wrote zero telemetry code, we added a library to do on-device work, and that library silently stood up a background uploader to a Google endpoint. Every NativeLM launch was capable of beaconing out — exactly the thing we told users could never happen.

Catching it in the act

Dependency telemetry is invisible if you only read your own source. You have to watch the device. We caught this the boring, reliable way: drive the app while watching the network and logcat, and look for any connection we didn’t initiate.
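As a rough illustration of the "look for any connection we didn't initiate" step, here is a minimal Python sketch that scans a captured hostname log for anything outside an allowlist. The capture format, the `unexpected_hosts` helper, and the sample lines are all hypothetical; this is not NativeLM's actual tooling, just one way such a check could look. For a zero-telemetry app the allowlist is assumed to be empty.

```python
import re

# Assumption: a zero-telemetry app should contact no hosts at all,
# so the allowlist is empty.
ALLOWED_HOSTS = frozenset()

# Loose hostname pattern: dot-separated labels ending in an alphabetic TLD.
# Caveat: dotted Java package names also match, so feed this a DNS or
# proxy log rather than raw logcat output.
HOST_RE = re.compile(r"\b(?:[a-z0-9-]+\.)+[a-z]{2,}\b", re.IGNORECASE)


def unexpected_hosts(capture_lines, allowed=ALLOWED_HOSTS):
    """Return the sorted hostnames seen in the capture that are not allowed."""
    seen = set()
    for line in capture_lines:
        for host in HOST_RE.findall(line):
            seen.add(host.lower())
    return sorted(seen - set(allowed))


# Hypothetical capture lines, one observed lookup per line.
SAMPLE_CAPTURE = [
    "12:00:01 query firebaselogging.googleapis.com",
    "12:00:07 query firebaselogging.googleapis.com",
]
```

Calling `unexpected_hosts(SAMPLE_CAPTURE)` on this sample flags the single Google logging host; an empty result is what a clean launch should produce.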

firebaselogging.googleapis.com is not an endpoint NativeLM has any business talking to. Once it showed up, the merged manifest told the rest of the story — datatransport had injected a TransportBackendDiscovery service and a pair of schedulers (a JobScheduler service and an AlarmManager broadcast receiver) that wake up and flush events.

The fix: remove the nodes at merge time

You can’t patch a dependency’s manifest by asking nicely, and we didn’t want to fork ML Kit. Android’s manifest merger has exactly the tool for this: tools:node="remove". You declare the components the library injected and tell the merger to delete them from the final, merged manifest. No component, no auto-init, no uploader.

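Three entries did it, straight from sample-app/src/main/AndroidManifest.xml. They sit inside the <application> element, and the tools: prefix requires xmlns:tools="http://schemas.android.com/tools" to be declared on the root <manifest> element: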

<!-- Zero-telemetry guarantee: ML Kit (on-device OCR) transitively bundles
     Google's datatransport pipeline, which auto-initializes on startup and
     uploads usage/diagnostics to firebaselogging.googleapis.com. Strip the
     CCT backend + its upload schedulers so nothing ever leaves the device.
     On-device OCR does not depend on this telemetry pipeline. -->
<service
    android:name="com.google.android.datatransport.runtime.backends.TransportBackendDiscovery"
    tools:node="remove" />
<service
    android:name="com.google.android.datatransport.runtime.scheduling.jobscheduling.JobInfoSchedulerService"
    tools:node="remove" />
<receiver
    android:name="com.google.android.datatransport.runtime.scheduling.jobscheduling.AlarmManagerSchedulerBroadcastReceiver"
    tools:node="remove" />

Remove the discovery service and the runtime can’t find a backend to send to; remove the two schedulers and there’s nothing left to wake up and flush a queue. The pipeline is decapitated at the manifest level — before any of it gets a chance to run.
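One way to confirm the removal took effect is to inspect the merged manifest that the build produces (its exact location under the module's build directory varies by Android Gradle Plugin version) and check that no datatransport components survive. The sketch below is a hypothetical audit helper, not part of NativeLM; the `BEFORE` and `AFTER` strings are hand-written stand-ins for a real merged manifest, and `com.example.app` is a placeholder package name.

```python
import xml.etree.ElementTree as ET

ANDROID_NS = "http://schemas.android.com/apk/res/android"
NAME_ATTR = f"{{{ANDROID_NS}}}name"  # ElementTree's key for android:name
COMPONENT_TAGS = {"activity", "service", "receiver", "provider"}


def datatransport_components(manifest_xml, prefix="com.google.android.datatransport"):
    """List (tag, class name) for every declared component under the prefix."""
    root = ET.fromstring(manifest_xml)
    found = []
    for element in root.iter():
        if element.tag in COMPONENT_TAGS:
            name = element.get(NAME_ATTR, "")
            if name.startswith(prefix):
                found.append((element.tag, name))
    return found


# Illustrative merged manifest before the fix: the three injected components.
BEFORE = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.app">
  <application>
    <service android:name="com.google.android.datatransport.runtime.backends.TransportBackendDiscovery" />
    <service android:name="com.google.android.datatransport.runtime.scheduling.jobscheduling.JobInfoSchedulerService" />
    <receiver android:name="com.google.android.datatransport.runtime.scheduling.jobscheduling.AlarmManagerSchedulerBroadcastReceiver" />
  </application>
</manifest>"""

# Illustrative merged manifest after the fix: nothing from datatransport remains.
AFTER = """<manifest xmlns:android="http://schemas.android.com/apk/res/android" package="com.example.app">
  <application />
</manifest>"""
```

Here `datatransport_components(BEFORE)` lists the three injected components and `datatransport_components(AFTER)` returns an empty list. A check like this could run in CI so that a future dependency bump that reintroduces a component fails the build, although it complements watching the network rather than replacing it.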

What this didn’t break

The obvious worry: did we just break OCR by ripping out part of the library it depends on? No — and we verified it on-device. On-device OCR does not depend on the telemetry pipeline. After the removal, importing an image still OCRs it and indexes the text into the RAG store exactly as before. We’re deleting the analytics plumbing, not the text recognizer.

While we were here we also fixed an embarrassing smaller bug: the app’s versionName still read 0.5.0 in the v0.6.0 build. v0.6.1 corrects it.

The lesson we’re keeping

The takeaway isn’t “ML Kit bad.” It’s that a privacy guarantee is only as strong as your dependency graph, and the only way to know what your app actually does on the network is to watch the wire, not the source. “We didn’t write any telemetry” is not a proof. “We watched every launch and nothing leaves the device” is.

This is now a standing rule for NativeLM: every dependency that touches Google’s SDK surface gets the logcat-and-network treatment before it ships, and anything that tries to auto-init an uploader gets the tools:node="remove" treatment. It’s also exactly why, when we later built device-to-device sync, we refused Google’s Nearby Connections API — pulling Play Services back in would have re-opened this very wound.

Built on-device, in the open

NativeLM v0.6.1 is live. The entire engine is open-source (AGPL-3.0).
