The low-end gauntlet: running a local LLM on budget Android phones

A local LLM that only runs on flagships isn't private AI for everyone — it's a toy for people with expensive phones. Here's how NativeLM tiers models across devices, why budget phones break in two different ways (RAM and the navigation bar), and what's still hard about the 4–6 GB tier.

There’s an uncomfortable truth about on-device AI: if your model only runs on a flagship, you’ve built privacy for the privileged. The people who most need a local, no-cloud assistant are often on budget phones — exactly the devices a multi-gigabyte model is most likely to choke. So a real chunk of NativeLM’s engineering goes into the unglamorous low-end gauntlet. It turns out budget phones break in two completely different ways.

Break #1: RAM — give every device a model that fits

The naive design ships one model. That fails immediately, because device memory spans a 3× range. So NativeLM ships a cross-device model catalogue (Qwen3-0.6B, Gemma 3 1B, DeepSeek-R1 1.5B, Gemma 4 E2B / E4B, Phi-4-mini, Qwen3-4B) arranged in graduated RAM tiers, with input-type chips and a license popup for gated models. The catalogue is validated end-to-end on real devices, including a non-Gemma model (Qwen3-0.6B). For the reasoning models (Qwen3, DeepSeek-R1), the <think> span is hidden in chat.
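Hiding the reasoning span can be sketched as a small text filter. This is a minimal illustration, not NativeLM's actual code: the function name hideThinkSpans is hypothetical, and it assumes the model emits literal <think> and </think> tags and that the chat UI re-renders the full accumulated text as tokens stream in.

```kotlin
// Matches a completed reasoning span plus any whitespace that follows it.
private val closedThinkSpan = Regex("<think>.*?</think>\\s*", RegexOption.DOT_MATCHES_ALL)

// Hypothetical helper: returns only the text the user should see in chat.
fun hideThinkSpans(raw: String): String {
    // Drop every completed <think>...</think> span.
    val withoutClosed = closedThinkSpan.replace(raw, "")
    // While streaming, the closing tag may not have arrived yet,
    // so hide everything from an unclosed <think> onward.
    val openIndex = withoutClosed.indexOf("<think>")
    val visible = if (openIndex >= 0) withoutClosed.substring(0, openIndex) else withoutClosed
    return visible.trim()
}
```

Handling the unclosed tag matters for streaming: without it, the reasoning text would flash on screen until the closing tag arrives.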

The models screen only offers what the device can actually run, which avoids the worst failure mode: handing a 6 GB phone a model that OOMs on the first message. From on-device testing, the line is sharp:

Device	Effective RAM	Gemma 4 E2B (2.6 GB)
Realme (8 GB + expansion, capped)	~9 GB	✅ loads + chats
Samsung M55 (8 GB + RAM Plus)	~7 GB	✅ download → activate → chat, no OOM
genuine 4–6 GB	4–6 GB	❌ OOM at session start

E2B is comfortable at ≥7 GB. The “Chat session is not ready” failure bites at ≤6 GB, where the model, the KV cache, and an (often unused) vision encoder simply don’t fit together.
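The gating rule behind the models screen can be sketched as a filter over the catalogue. This is an illustration under stated assumptions: CatalogueEntry, minEffectiveRamGb, and runnableModels are hypothetical names, and only the 7 GB floor for Gemma 4 E2B comes from the measurements above. The second entry and its threshold are placeholders.

```kotlin
// Hypothetical catalogue entry: a model plus the effective RAM it needs.
data class CatalogueEntry(val name: String, val minEffectiveRamGb: Double)

val catalogue = listOf(
    CatalogueEntry("Gemma 4 E2B", 7.0),                  // floor taken from the table above
    CatalogueEntry("Hypothetical larger model", 12.0),   // placeholder, not a measured value
)

// Offer only the models whose RAM floor the device meets.
fun runnableModels(entries: List<CatalogueEntry>, effectiveRamGb: Double): List<CatalogueEntry> =
    entries.filter { effectiveRamGb >= it.minEffectiveRamGb }
```

With this sample catalogue, a device at ~9 GB effective RAM is offered E2B, while a genuine 6 GB device gets an empty list rather than a model that OOMs at session start.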

And before any of this tiering works, you have to know the device’s real RAM — which on many budget phones is a lie, because OEM “virtual RAM” features inflate the reported number. That detection is the load-bearing trick behind the whole catalogue, and it has its own post.

Break #2: the navigation bar — the UI that hides under the system bars

Here’s the one that surprised us. A model can load and chat perfectly, and the app can still be unusable on a budget phone, because of the navigation bar.

MainActivity calls enableEdgeToEdge(), but edge-to-edge means you are responsible for insetting content around the status bar and the system navigation bar. Miss it, and content draws under those bars. It’s worst on 3-button navigation devices, whose nav bar is taller — and 3-button nav is exactly what a lot of budget phones default to.

Confirmed on a Samsung M55 in 3-button mode:

❌ The Models screen’s “Continue to chat” button rendered under the nav bar and was physically untappable. (Material 3’s bottomBar slot does not auto-inset arbitrary content — a sharp edge.)
❌ Onboarding drew its top text under the status bar and its Skip/Next buttons under the nav bar.
✅ The Chat screen was fine — its Scaffold content padding already included the bottom inset.

The fix is a systematic insets pass: wrap non-Scaffold roots in Modifier.windowInsetsPadding(WindowInsets.safeDrawing), add .navigationBarsPadding() to custom bottomBar content, and add .imePadding() to the chat input so the keyboard never covers it. Low-risk, self-contained, and the difference between “works” and “the button doesn’t exist” on a budget phone.
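In Jetpack Compose, the three parts of that pass might look like the sketch below. The composable names (OnboardingRoot, ModelsScreen, ChatInputBar) and their parameters are hypothetical stand-ins, not NativeLM's actual screens. The inset modifiers themselves are the standard ones from androidx.compose.foundation.layout.

```kotlin
import androidx.compose.foundation.layout.Box
import androidx.compose.foundation.layout.Column
import androidx.compose.foundation.layout.WindowInsets
import androidx.compose.foundation.layout.fillMaxSize
import androidx.compose.foundation.layout.fillMaxWidth
import androidx.compose.foundation.layout.imePadding
import androidx.compose.foundation.layout.navigationBarsPadding
import androidx.compose.foundation.layout.padding
import androidx.compose.foundation.layout.safeDrawing
import androidx.compose.foundation.layout.windowInsetsPadding
import androidx.compose.material3.Button
import androidx.compose.material3.Scaffold
import androidx.compose.material3.Text
import androidx.compose.runtime.Composable
import androidx.compose.ui.Modifier
import androidx.compose.ui.unit.dp

// 1. A root without a Scaffold: keep the whole tree clear of the
//    status bar, navigation bar, and display cutout.
@Composable
fun OnboardingRoot(content: @Composable () -> Unit) {
    Box(
        modifier = Modifier
            .fillMaxSize()
            .windowInsetsPadding(WindowInsets.safeDrawing)
    ) { content() }
}

// 2. Custom bottomBar content: pad it above the navigation bar explicitly.
@Composable
fun ModelsScreen(onContinue: () -> Unit) {
    Scaffold(
        bottomBar = {
            Button(
                onClick = onContinue,
                modifier = Modifier
                    .fillMaxWidth()
                    .navigationBarsPadding()
                    .padding(16.dp)
            ) { Text("Continue to chat") }
        }
    ) { innerPadding ->
        Column(Modifier.padding(innerPadding)) { /* model list goes here */ }
    }
}

// 3. Chat input: move it above the on-screen keyboard when it opens.
@Composable
fun ChatInputBar(input: @Composable () -> Unit) {
    Box(Modifier.fillMaxWidth().imePadding()) { input() }
}
```

Because the bars differ between gesture and 3-button navigation, testing in 3-button mode is the quickest way to confirm that the button is fully visible and tappable.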

What’s still hard: the genuine 4–6 GB tier

Honesty matters here, because this is the frontier we’re still working on. Today the catalogue’s smallest fully-validated path leans on E2B, which needs ≥7 GB. Genuinely closing the 4–6 GB tier means a few things we’re actively building toward rather than claiming done:

A small INT4 model (e.g. Gemma 3 1B INT4, ~557 MB) that fits 4 GB with headroom — which requires threading a per-model supportsVision flag through engine init, because text-only bundles fail if a vision backend is configured.
Sizing the KV cache to RAM instead of allocating the model’s full context on every device (6 GB → 2048 tokens, 8 GB → 4096), trading some context window for fit.
Skipping the vision encoder on text-only / low-RAM loads — the chat path doesn’t need it, yet it costs memory on exactly the devices that OOM.
Honest failure UX — a SessionState.Failed that blocks input and says “couldn’t start, free memory, Retry,” instead of a cryptic inline [error: Chat session is not ready.].
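The items above can be sketched together as a load plan plus an explicit failure state. Everything here is illustrative: ModelSpec, LoadPlan, planLoad, and the 7 GB vision cutoff are assumptions, as is the 1024-token fallback below 6 GB. Only the 6 GB → 2048 and 8 GB → 4096 mappings, the supportsVision flag, and SessionState.Failed are named in the text.

```kotlin
// Hypothetical per-model metadata carrying the supportsVision flag.
data class ModelSpec(val name: String, val supportsVision: Boolean)

// Hypothetical result handed to engine init.
data class LoadPlan(val maxContextTokens: Int, val enableVision: Boolean)

// Size the KV cache to the device instead of the model's full context.
fun contextTokensFor(effectiveRamGb: Double): Int = when {
    effectiveRamGb >= 8.0 -> 4096
    effectiveRamGb >= 6.0 -> 2048
    else -> 1024 // assumption: the text gives no figure below 6 GB
}

// Configure vision only for models that support it, and only on devices with room for it.
fun planLoad(model: ModelSpec, effectiveRamGb: Double, visionCutoffGb: Double = 7.0): LoadPlan =
    LoadPlan(
        maxContextTokens = contextTokensFor(effectiveRamGb),
        enableVision = model.supportsVision && effectiveRamGb >= visionCutoffGb,
    )

// Explicit session states, so a failed start blocks input instead of
// surfacing as an inline error string in the chat.
sealed interface SessionState {
    object Loading : SessionState
    object Ready : SessionState
    data class Failed(val userMessage: String) : SessionState
}

fun inputEnabled(state: SessionState): Boolean = state is SessionState.Ready
```

Under these assumptions a text-only 1B model on a 4 GB phone loads with a 1024-token context and no vision backend, and a failed start disables the input field until the user retries.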

None of these ship to users without validation on a genuine 6 GB device — “it works on my 8 GB phone” is how you ship a broken budget experience.

Why bother

Because “on-device AI for everyone” is either true or it’s marketing. Every tier we add, every inset we fix, every megabyte we shave off the minimum is one more person who gets a private assistant on the phone they actually own. That’s the whole point.

NativeLM’s model catalogue and low-end UI fixes are live; the 4–6 GB tier is in active development. The entire engine is open-source (AGPL-3.0).
