Google AI Edge Gallery: Running AI Models Directly on Your Phone

Most AI features you use on your phone quietly send data to a server, wait for a response, and return the result. It works, but it means every interaction depends on network latency, server availability, and trusting a third party with your data. Google AI Edge Gallery takes a different approach: the models run entirely on-device, with no network call required.

It's a showcase app — part demo, part developer playground — that lets you interact with AI models from Google directly on your Android phone. Think of it as a hands-on way to test what on-device inference actually feels like in practice.

AI Edge Gallery is an open-source Android app maintained by Google that offers a curated set of pre-built AI models you can download and run locally. It's built on top of LiteRT (formerly TensorFlow Lite) and the MediaPipe framework.

Out of the box, it includes demos for:

  • Text generation — on-device LLMs like Gemma 3 running fully offline
  • Image classification and object detection — real-time visual recognition using your camera
  • Natural language tasks — question answering, text embedding, and more

You can also sideload custom .task or .tflite model files into the app, which makes it useful for testing your own models before deploying them in production.

Why On-Device AI Matters

Running inference on-device rather than in the cloud isn't just a technical curiosity — it changes what's actually possible in an app:

  • Latency drops dramatically. There's no round-trip to a server. For real-time use cases like live camera overlays or voice commands, this is the difference between responsive and sluggish.
  • It works offline. Your app's AI features keep working on a plane, in a tunnel, or in areas with poor connectivity.
  • Data stays on the device. Nothing is transmitted to an external server, which matters for health data, private documents, or anything sensitive.
  • No inference costs. Serving LLM calls at scale is expensive. Shifting that work to the client removes that cost entirely.

The tradeoff is model size and capability. An on-device LLM isn't GPT-4 — but for a focused task like summarization, classification, or local search, a 1–3B parameter model running at 30 tokens/sec on a mid-range phone is often more than enough.
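To make that tradeoff concrete, here is a back-of-the-envelope sketch in Kotlin. The 30 tokens/sec and 1B-parameter figures come from the paragraph above; the 150-token summary length and the 4-bit weight width are illustrative assumptions, and both function names are made up for this example.

```kotlin
// Rough estimates only; real numbers vary by device, model, and runtime.

/** Seconds needed to generate [tokens] tokens at [tokensPerSecond]. */
fun generationSeconds(tokens: Int, tokensPerSecond: Double): Double =
    tokens / tokensPerSecond

/** Approximate size in megabytes (10^6 bytes) of the raw weights alone. */
fun weightMegabytes(parameters: Long, bitsPerWeight: Int): Double =
    parameters * bitsPerWeight / 8.0 / 1_000_000.0

fun main() {
    // Assumed: a 150-token summary at the 30 tokens/sec quoted above.
    println(generationSeconds(150, 30.0))        // 5.0
    // Assumed: 1 billion parameters quantized to 4 bits each.
    println(weightMegabytes(1_000_000_000L, 4))  // 500.0
}
```

The weight estimate ignores tokenizer data, metadata, and any tensors kept at higher precision, so an actual model file will typically be somewhat larger than the raw figure.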

Getting Started

AI Edge Gallery is available on GitHub at google-ai-edge/gallery. You can build it from source or sideload the APK. Models are downloaded separately: the app will prompt you to pull them from Hugging Face or a direct link.

For developers who want to integrate on-device inference into their own apps, the same underlying stack is available through the MediaPipe Tasks SDK. Here's a minimal example using the LLM Inference API:

import com.google.mediapipe.tasks.genai.llminference.LlmInference
import com.google.mediapipe.tasks.genai.llminference.LlmInference.LlmInferenceOptions

// Configure the model (the model file must already be present on-device).
// Note: option names vary between MediaPipe releases; newer versions move
// sampling settings such as topK and temperature to per-session options.
val options = LlmInferenceOptions.builder()
    .setModelPath("/data/local/tmp/gemma3-1b-it-int4.task")
    .setMaxTokens(512)
    .setTopK(40)
    .setTemperature(0.8f)
    .build()

val llmInference = LlmInference.createFromOptions(context, options)

// Run inference: no network call; executes locally on the CPU or GPU,
// depending on the backend the task is configured with
val response = llmInference.generateResponse("Explain transformers in two sentences.")
println(response)

For image classification, MediaPipe's ImageClassifier task follows the same pattern:

import com.google.mediapipe.framework.image.BitmapImageBuilder
import com.google.mediapipe.tasks.core.BaseOptions
import com.google.mediapipe.tasks.vision.imageclassifier.ImageClassifier
import com.google.mediapipe.tasks.vision.imageclassifier.ImageClassifier.ImageClassifierOptions

// Load the model from the app's assets and keep only the top 3 labels
val options = ImageClassifierOptions.builder()
    .setBaseOptions(
        BaseOptions.builder().setModelAssetPath("efficientnet_lite0.tflite").build()
    )
    .setMaxResults(3)
    .build()

val classifier = ImageClassifier.createFromOptions(context, options)

// Wrap an android.graphics.Bitmap (assumed to be in scope as `bitmap`)
val mpImage = BitmapImageBuilder(bitmap).build()

// Classify the image and print each label with its confidence score
val result = classifier.classify(mpImage)
result.classificationResult().classifications().forEach { classification ->
    classification.categories().forEach { category ->
        println("${category.categoryName()}: ${category.score()}")
    }
}

Both snippets run entirely offline. The model files ship with your app or are downloaded once and stored locally.
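The "downloaded once and stored locally" part can be sketched as a small caching helper. This is a minimal illustration rather than how the Gallery app does it: `ensureModel` and its `download` callback are hypothetical names, and the callback stands in for whatever HTTP client your app uses.

```kotlin
import java.io.File

/**
 * Returns the cached model file, invoking [download] only when it is missing.
 * [download] is a hypothetical callback that must write the model to the
 * file it is given.
 */
fun ensureModel(cacheDir: File, fileName: String, download: (File) -> Unit): File {
    val target = File(cacheDir, fileName)
    if (target.exists() && target.length() > 0) return target

    // Download to a temporary file first so an interrupted transfer never
    // leaves a half-written model at the final path.
    cacheDir.mkdirs()
    val partial = File(cacheDir, "$fileName.part")
    download(partial)
    check(partial.length() > 0) { "Download produced an empty file" }
    check(partial.renameTo(target)) { "Could not move model into place" }
    return target
}

fun main() {
    val dir = File(System.getProperty("java.io.tmpdir"), "model-cache-demo")
    dir.deleteRecursively()
    var downloads = 0
    val fakeDownload: (File) -> Unit = { it.writeText("weights"); downloads++ }

    ensureModel(dir, "demo.tflite", fakeDownload)
    ensureModel(dir, "demo.tflite", fakeDownload) // served from disk
    println(downloads) // 1
}
```

On Android, the cache directory would typically be somewhere under the app's private storage, and the returned file's absolute path is what you would hand to a model-path option.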

On-Device vs. Cloud Inference

Neither approach is universally better — the right choice depends on your use case:

|                  | On-Device (Edge)              | Cloud                           |
|------------------|-------------------------------|---------------------------------|
| Latency          | Low, no network hop           | Variable, depends on connection |
| Offline support  | Yes                           | No                              |
| Privacy          | Data stays on device          | Data leaves the device          |
| Model capability | Limited by device hardware    | Near-unlimited                  |
| Cost at scale    | Free (runs on user hardware)  | Per-token / per-request fees    |
| Update cycle     | App update required           | Instant server-side updates     |

A practical pattern many apps use: run a small on-device model for speed and privacy, and fall back to a cloud model for complex requests that need more capability.
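That hybrid pattern can be sketched as a thin router. Everything here is an assumption for illustration: `TextModel` and `HybridModel` are hypothetical types (not part of MediaPipe), and prompt length is only a crude stand-in for "complex request".

```kotlin
/** Minimal abstraction over any text model; hypothetical, not a MediaPipe type. */
fun interface TextModel {
    fun generate(prompt: String): String
}

/**
 * Tries the on-device model first and uses the cloud model only when the
 * prompt looks too large for it or local inference fails.
 */
class HybridModel(
    private val local: TextModel,
    private val cloud: TextModel,
    private val maxLocalPromptChars: Int = 2_000, // assumed heuristic threshold
) : TextModel {
    override fun generate(prompt: String): String {
        if (prompt.length > maxLocalPromptChars) return cloud.generate(prompt)
        return try {
            local.generate(prompt)
        } catch (e: Exception) {
            cloud.generate(prompt) // e.g. the local model is not downloaded yet
        }
    }
}

fun main() {
    val local = TextModel { "local: $it" }
    val cloud = TextModel { "cloud: $it" }
    val model = HybridModel(local, cloud, maxLocalPromptChars = 10)

    println(model.generate("short"))                  // local: short
    println(model.generate("a much longer request"))  // cloud: a much longer request
}
```

Because the fallback path sends the prompt off the device, an app that handles sensitive data may prefer to make that path opt-in rather than automatic.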

Real-World Use Cases

On-device models are already shipping in production across a range of domains:

  • Document scanning apps use on-device OCR and layout parsing so scans work without internet
  • Keyboard apps run next-word prediction locally to avoid sending keystrokes to a server
  • Fitness apps run pose estimation on-device for real-time form feedback during workouts
  • Translation apps like Google Translate support fully offline translation for dozens of languages
  • Code editors on mobile use small on-device models for autocompletion without leaking proprietary code

The hardware improvements in recent Snapdragon, Tensor G, and Apple A-series chips have made this increasingly practical. A phone released in 2024 can comfortably run a quantized 1–3B parameter model at interactive speeds.

Should You Use It?

If you're building an Android app and have a use case that fits — classification, detection, text generation for focused tasks, embedding, speech recognition — on-device inference is worth evaluating seriously. The MediaPipe Tasks API is well-documented, the models in AI Edge Gallery give you a realistic benchmark of what's achievable, and the privacy and latency benefits are real.

The best way to calibrate expectations is to install the Gallery app, load the models you're interested in, and test them on the actual device your users will have. What benchmarks say and what it feels like to use are often different things.
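If you want a number to go with that impression, a small timing harness is enough. The sketch below is an assumption-laden starting point: `measureThroughput` is a hypothetical helper, the `generate` parameter stands in for any blocking inference call, and counting whitespace-separated words only approximates real model tokens.

```kotlin
/** Result of one timed run. The token count is a rough whitespace-based estimate. */
data class Throughput(val approxTokens: Int, val seconds: Double) {
    val tokensPerSecond: Double get() = approxTokens / seconds
}

/**
 * Times a single call to [generate], which stands in for any blocking
 * inference call (for example, a wrapper around an on-device LLM).
 */
fun measureThroughput(prompt: String, generate: (String) -> String): Throughput {
    val start = System.nanoTime()
    val output = generate(prompt)
    val seconds = (System.nanoTime() - start) / 1e9

    // Whitespace splitting undercounts real model tokens; treat it as a proxy.
    val approxTokens = output.trim().split(Regex("\\s+")).count { it.isNotEmpty() }
    return Throughput(approxTokens, seconds)
}

fun main() {
    // Stand-in generator so the sketch runs anywhere; swap in a real model call.
    val fake: (String) -> String = { Thread.sleep(50); "four words of output" }
    val result = measureThroughput("ignored", fake)
    println("${result.approxTokens} tokens, ${"%.1f".format(result.tokensPerSecond)} tok/s")
}
```

Running it several times on the target device and discarding the first (warm-up) run usually gives a steadier picture than a single measurement.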

The source code is on GitHub at google-ai-edge/gallery, and the model download links and integration guides are at ai.google.dev/edge.

Frequently Asked Questions

What devices does Google AI Edge Gallery support? It requires Android 10 or later. GPU and NPU acceleration is available on compatible devices (Qualcomm Snapdragon, MediaTek Dimensity, Google Tensor). Devices without dedicated AI hardware fall back to CPU inference, which is slower.

Can I use my own models? Yes. The app supports sideloading .task files (MediaPipe format) and .tflite models. You can convert existing PyTorch or JAX models using the AI Edge Torch or LiteRT converter.

Is it production-ready? The Gallery app itself is a demo. The underlying stack — LiteRT and MediaPipe Tasks — is production-ready and used in Google's own apps. The LLM Inference API is newer and carries a beta label as of early 2026.

How large are the models? It varies. EfficientNet for image classification is under 5 MB. Gemma 3 1B INT4 is around 600 MB. The app handles download and caching; you don't bundle models into the APK itself.

