A phone running an offline AI assistant on a desk, with a gray-blue British shorthair cat observing nearby.

React Native ExecuTorch: AI That Lives on the Phone

The Gray Cat

Most AI features in mobile apps work the same way under the hood: the app ships your input across the network to a cloud API, waits, and renders whatever comes back. That works, but it means a network round-trip on every request, a metered bill that grows with usage, and confidential user data leaving the device. React Native ExecuTorch asks a different question — what if the model just ran on the phone?

Built by Software Mansion (the team behind React Native Reanimated, Gesture Handler, and Screens), react-native-executorch is a declarative wrapper around Meta's ExecuTorch runtime, the on-device inference engine from the PyTorch ecosystem. It lets you run language models, speech recognition, text-to-speech, object detection, OCR, embeddings, and even image generation locally — all through idiomatic React hooks. No native glue, no ML PhD required. It's a strong fit for privacy-sensitive domains like healthcare, finance, journaling, and personal assistants, and for any app that needs to keep working when the signal drops.

Why Run Models Locally

Going on-device buys you four concrete things. Privacy, because user data never leaves the phone — nothing is transmitted to a third party. Offline capability, because inference runs with no connection and no network latency. Cost, because there is no model hosting, no GPU backend, and no per-token billing; your inference cost scales to zero regardless of how many requests users make. And no vendor lock-in or rate limits, because you own both the model and the runtime.

The trade-offs are real and worth naming. On-device models are smaller and less capable than frontier cloud models — you're running 0.35B to 4B parameter LLMs, not GPT-4-class systems. You're bound by device RAM, CPU/NPU, thermals, and battery. And model binaries range from hundreds of megabytes to several gigabytes, which you'll either bundle or download. ExecuTorch is the privacy-and-offline counterpoint to the cloud, not a drop-in replacement for it.

One Library, Many Modalities

What sets this library apart from most on-device AI packages is breadth. Most competitors cover LLM text generation and stop there. ExecuTorch spans an unusually wide surface, all exposed through the same hook pattern:

  • Language: useLLM (chat, streaming, tool calling, structured output, and multimodal vision), useTextEmbeddings for multilingual semantic search and RAG, and usePrivacyFilter for detecting PII entirely on-device.
  • Speech: useSpeechToText (Whisper-based transcription), useTextToSpeech (Kokoro-based, multilingual, with voice cloning), and useVAD for real-time voice activity detection.
  • Vision: useObjectDetection, useClassification, useImageSegmentation, useInstanceSegmentation, usePoseEstimation, useImageEmbeddings, useOCR, useStyleTransfer, and useTextToImage.
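Once useTextEmbeddings has turned your texts into vectors, semantic search comes down to ranking documents by cosine similarity against the query vector. Here is a minimal sketch of that ranking step. It assumes you already hold the embeddings as plain number arrays; the hook may return another numeric type such as Float32Array, which Array.from converts. The `Doc` type and `rankBySimilarity` helper are hypothetical and not part of the library.

```typescript
// Hypothetical shape: a document paired with its precomputed embedding.
interface Doc {
  id: string;
  embedding: number[];
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Return the k documents most similar to the query embedding, best first.
function rankBySimilarity(
  query: number[],
  docs: Doc[],
  k: number,
): { id: string; score: number }[] {
  return docs
    .map((d) => ({ id: d.id, score: cosineSimilarity(query, d.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

If your vectors are already L2-normalized, the dot product alone gives the same ranking. For RAG, you would paste the top-k documents' text into the LLM prompt.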

Every hook follows the same shape: you call it, get back a model object carrying state flags (ready, loading, download progress, generating) plus a streaming response and a generate() or forward() method, and you drive your UI declaratively from that state.
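Because every hook exposes the same kind of state, you can write a single mapping from raw flags to one UI status and reuse it across hooks. The sketch below uses a hypothetical `ModelHookState` interface. The field names approximate the flags described above, but check them against the API reference for your installed version. It also assumes download progress is a fraction between 0 and 1.

```typescript
// Hypothetical subset of the state a hook exposes; real names and types may differ.
interface ModelHookState {
  isReady: boolean;
  isGenerating: boolean;
  downloadProgress: number; // assumed fraction in [0, 1]
  error: string | null;
}

type UiStatus = "error" | "downloading" | "loading" | "generating" | "idle";

// Collapse the flags into one status, checking the most urgent condition first.
function toUiStatus(state: ModelHookState): UiStatus {
  if (state.error) return "error";
  if (!state.isReady) {
    // Weights still arriving vs. weights on disk but model not yet loaded.
    return state.downloadProgress < 1 ? "downloading" : "loading";
  }
  return state.isGenerating ? "generating" : "idle";
}
```

A component can then `switch` on the returned status, so the same spinner, progress bar, and error view work for an LLM, a transcriber, or a detector.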

Getting It Running

ExecuTorch requires React Native's New Architecture (Fabric and TurboModules). On iOS the minimum target is 17.0, and on Android it's 13. It works with both Expo and bare React Native.

Install the core package along with the Expo resource fetcher (which handles downloading models on-device) and its file-system peers:

yarn add react-native-executorch
yarn add react-native-executorch-expo-resource-fetcher
yarn add expo-file-system expo-asset
yarn pod-install

With npm:

npm install react-native-executorch
npm install react-native-executorch-expo-resource-fetcher
npm install expo-file-system expo-asset
npx pod-install

Models can be bundled directly into the app binary for instant first use at the cost of a larger download, or downloaded on-device on first use via a resource fetcher (often pulled from Hugging Face and cached). The Expo path wires this up through ExpoResourceFetcher.

Your First On-Device Chatbot

Here's the flagship hook, useLLM, running a small instruction-tuned model entirely on the phone. The model registry (models) hands you typed, ready-to-use configs so you don't juggle raw URL constants or precision flags.

import { Button, Text, View } from 'react-native';
import { useLLM, models, initExecutorch, Message } from 'react-native-executorch';
import { ExpoResourceFetcher } from 'react-native-executorch-expo-resource-fetcher';

// Register the resource fetcher once, at module load, before any hook runs.
initExecutorch({ resourceFetcher: ExpoResourceFetcher });

function Chat() {
  const llm = useLLM({ model: models.llm.lfm2_5_1_2b_instruct() });

  const ask = async () => {
    const chat: Message[] = [
      { role: 'system', content: 'You are a helpful assistant' },
      { role: 'user', content: 'What is the meaning of life?' },
    ];
    await llm.generate(chat);
  };

  // Until the model is downloaded and loaded, show progress instead of the chat UI.
  if (!llm.isReady) {
    return <Text>Loading model: {Math.round(llm.downloadProgress * 100)}%</Text>;
  }

  // llm.response updates as tokens stream in.
  return (
    <View>
      <Button title="Ask" onPress={ask} disabled={llm.isGenerating} />
      <Text>{llm.response}</Text>
    </View>
  );
}

The first time this runs, the model downloads and caches; the hook's state flags let you show a progress bar instead of a frozen screen. After that, llm.response fills in token by token as the model generates — the same streaming experience users expect from a cloud chatbot, except nothing left the device.
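If you want the progress indicator to be plain text (handy for logs or a minimal loading screen), a small formatter is enough. This sketch assumes the download progress is a fraction between 0 and 1. `progressBar` is a hypothetical helper, not a library export.

```typescript
// Format a download fraction as a text progress bar, e.g. "[#####-----] 50%".
// Assumption: progress is a fraction in [0, 1]; out-of-range values are clamped.
function progressBar(progress: number, width = 20): string {
  const clamped = Math.min(1, Math.max(0, progress));
  const filled = Math.round(clamped * width);
  const percent = Math.round(clamped * 100);
  return `[${"#".repeat(filled)}${"-".repeat(width - filled)}] ${percent}%`;
}
```

For example, `progressBar(0.5, 10)` returns `"[#####-----] 50%"`. In a real screen you would more likely feed the same clamped fraction to a native progress view.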

Listening and Speaking

Speech is where on-device really shines, because round-tripping audio to a server is slow and bandwidth-hungry. useSpeechToText wraps Whisper for transcription, and recent versions made it dramatically lighter: v0.9.0 added voice-activity-detection integration for up to roughly 10x faster transcription, and v0.9.1 shipped an fp16 Whisper build on iOS that's 50% smaller and 30% faster than the previous fp32 build.

import { useSpeechToText, models } from 'react-native-executorch';

function Transcriber() {
  const stt = useSpeechToText({ model: models.stt.whisper() });

  // transcribe() expects decoded audio samples (16 kHz mono), not a file path;
  // decode the recording first, e.g. with react-native-audio-api.
  const transcribe = async (waveform: Float32Array) => {
    const text = await stt.transcribe(waveform);
    return text;
  };

  return null; // render stt state and the transcribed text
}
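The speed-up from v0.9.0's VAD integration comes from not running Whisper over silence. useVAD relies on its own detector. As a simplified stand-in that shows the idea, here is an RMS-energy gate that keeps only the frames louder than a threshold. This heuristic is our substitute, not what useVAD does internally. The 480-sample frame (30 ms at 16 kHz) and the 0.01 threshold are illustrative assumptions.

```typescript
// Simplified energy-based voice activity gate (NOT the detector behind useVAD).
// Splits mono PCM samples in [-1, 1] into fixed-size frames and keeps the
// frames whose root-mean-square energy exceeds the threshold.
function keepVoicedFrames(
  samples: Float32Array,
  frameSize = 480, // 30 ms at 16 kHz
  threshold = 0.01, // illustrative RMS cutoff
): Float32Array {
  const kept: number[] = [];
  for (let start = 0; start < samples.length; start += frameSize) {
    const end = Math.min(start + frameSize, samples.length);
    let sumSquares = 0;
    for (let i = start; i < end; i++) sumSquares += samples[i] * samples[i];
    const rms = Math.sqrt(sumSquares / (end - start));
    if (rms > threshold) {
      for (let i = start; i < end; i++) kept.push(samples[i]);
    }
  }
  return Float32Array.from(kept);
}
```

Feeding only the kept samples to the transcriber shrinks the amount of audio Whisper has to process. A fixed energy threshold is fragile in noisy rooms, though, and it can clip the quiet edges of words, which is why a trained detector is the better choice in production.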

The flip side is useTextToSpeech, built on Kokoro. As of v0.9.0 it's multilingual — Polish, Spanish, Italian, French, Hindi and more — with per-language voice configuration and support for fine-tuned or cloned voices. Pair the two with useVAD and you have a fully offline voice loop: detect when the user is speaking, transcribe it, run it through the LLM, and speak the answer back, all without a single network call.
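The voice loop can be written as one async "turn" that receives each stage as an injected function, which lets you test the orchestration without a device. `VoiceLoopDeps` and `runVoiceTurn` are hypothetical names. In an app you would back `transcribe`, `reply`, and `speak` with your speech-to-text, LLM, and text-to-speech hook instances, after checking their exact method signatures in the API reference.

```typescript
// Hypothetical stage interfaces; wire them to your STT, LLM, and TTS hooks.
interface VoiceLoopDeps {
  transcribe: (audio: Float32Array) => Promise<string>;
  reply: (userText: string) => Promise<string>;
  speak: (text: string) => Promise<void>;
}

// One turn of the offline loop: speech -> text -> LLM answer -> speech.
// Returns the spoken answer, or null when the transcript is empty (e.g. just noise).
async function runVoiceTurn(audio: Float32Array, deps: VoiceLoopDeps): Promise<string | null> {
  const userText = (await deps.transcribe(audio)).trim();
  if (userText.length === 0) return null;
  const answer = await deps.reply(userText);
  await deps.speak(answer);
  return answer;
}
```

Skipping the LLM and TTS stages on an empty transcript saves a full generation whenever the microphone picks up only background sound.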

Vision Without the Cloud

The vision hooks follow the same contract. Object detection runs YOLO or RF-DETR, segmentation runs the Segment Anything family, and useInstanceSegmentation exposes a promptable FastSAM where you can select objects by point, box, or text prompt — on-device. Here's classification as a representative example:

import { useClassification, models } from 'react-native-executorch';

function ImageLabeler() {
  const classifier = useClassification({
    model: models.classification.mobilenet(),
  });

  const label = async (imageUri: string) => {
    const predictions = await classifier.forward(imageUri);
    return predictions; // object mapping each class label to its probability
  };

  return null; // render predictions once classifier is ready
}
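Classification output of this kind is a label-to-probability map rather than a sorted list, so you usually rank it before display. This sketch assumes a `Record<string, number>` shape. `topK` is a hypothetical helper.

```typescript
// Assumed output shape: class label -> probability.
type ClassProbabilities = Record<string, number>;

// Return the k most probable [label, probability] pairs, highest first.
function topK(probs: ClassProbabilities, k: number): [string, number][] {
  return Object.entries(probs)
    .sort(([, a], [, b]) => b - a)
    .slice(0, k);
}
```

For example, `topK({ cat: 0.7, dog: 0.2, fox: 0.1 }, 2)` returns `[["cat", 0.7], ["dog", 0.2]]`.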

One detail worth noting if you use image generation: useTextToImage().generate() now resolves to a file:// URI pointing at a PNG on disk, rather than a base64 string (a v0.9.0 change that dropped the pngjs dependency). That keeps large image payloads off the JS bridge and out of memory.
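The arithmetic shows why a file URI helps. Standard base64 encodes every 3 bytes as 4 characters, so an image returned as a string is about a third larger than the PNG itself, and the whole string has to sit in JS memory at once. The helper name below is our own.

```typescript
// Length of the padded base64 encoding of `byteLength` bytes:
// each 3-byte group, including a trailing partial group, becomes 4 characters.
function base64Length(byteLength: number): number {
  return 4 * Math.ceil(byteLength / 3);
}

// A hypothetical 3 MB PNG would become a 4,000,000-character string.
console.log(base64Length(3_000_000)); // 4000000
```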

Living With Device Constraints

The hard ceiling on-device is RAM. Devices with less than 8 GB can crash running larger LLMs, so reach for quantized models — they're more compute-efficient and run on weaker hardware. Model size on disk drives both your app size and download time: SmolLM2 135M is around 100 MB, Llama 3.2 1B is roughly 1 GB, and a 4B quantized model lands around 2–3 GB.
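A back-of-envelope rule explains these sizes: weight storage is roughly the parameter count times the bits per weight, divided by 8. The sketch below is only a lower bound. It ignores file metadata, activations, and the KV cache, and real quantized files often keep some layers at higher precision. The helper names are hypothetical.

```typescript
// Rough lower bound on weight storage: parameters x bits per weight / 8.
// Ignores metadata, activations, and the KV cache, so runtime RAM use is higher.
function weightBytes(parameters: number, bitsPerWeight: number): number {
  return (parameters * bitsPerWeight) / 8;
}

function toGB(bytes: number): string {
  return `${(bytes / 1e9).toFixed(2)} GB`;
}

// A 1B-parameter model at three common precisions.
for (const bits of [16, 8, 4]) {
  console.log(`1B params @ ${bits}-bit: ${toGB(weightBytes(1e9, bits))}`);
}
// 1B params @ 16-bit: 2.00 GB
// 1B params @ 8-bit: 1.00 GB
// 1B params @ 4-bit: 0.50 GB
```

By this estimate, roughly 1 GB for a 1B model corresponds to about 8 bits per weight, and 2 GB for a 4B model to about 4. This is why quantization is the main lever for fitting larger models into phone RAM.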

For acceleration, ExecuTorch taps multiple backends depending on the model: XNNPACK on CPU, Core ML on iOS, and Vulkan and MLX where available (Gemma support in v0.9.1 spans all three). The newer useLLM also exposes sampling controls like min_p and repetition_penalty with sensible per-model defaults, so you can tune output quality without rewriting your inference loop.
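To make those two knobs concrete, here are generic illustrations of the underlying techniques. They are not the library's internals. Min-p keeps only the tokens whose probability is at least `min_p` times the top token's probability. The repetition penalty shown is the common CTRL-style formulation, which lowers the logits of tokens that have already been generated. All function names are hypothetical.

```typescript
// Softmax over logits, shifted by the max for numerical stability.
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}

// CTRL-style repetition penalty: for already-seen tokens, divide positive
// logits and multiply negative ones, so both become less likely when penalty > 1.
function applyRepetitionPenalty(logits: number[], seen: Set<number>, penalty: number): number[] {
  return logits.map((l, token) => {
    if (!seen.has(token)) return l;
    return l > 0 ? l / penalty : l * penalty;
  });
}

// Min-p filtering: zero out tokens below minP * (top probability), then renormalize.
function minPFilter(probs: number[], minP: number): number[] {
  const threshold = minP * Math.max(...probs);
  const kept = probs.map((p) => (p >= threshold ? p : 0));
  const sum = kept.reduce((a, b) => a + b, 0);
  return kept.map((p) => p / sum);
}
```

Because min-p scales its cutoff with the model's confidence, it prunes aggressively when one token dominates and keeps more options when the distribution is flat. This adaptive behavior is the usual argument for it over a fixed top-k.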

Crucially, the library degrades gracefully. As of v0.9.0, apps no longer crash at startup on 32-bit or unsupported Android ABIs — instead, isAvailable === false lets you render a fallback UI and steer those users toward, say, a cloud path. That single flag is the difference between a hard crash and a graceful "AI isn't supported on this device" message.
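The routing decision can live in one small function. `isAvailable` is the flag described in the preceding paragraph. The `cloudFallbackEnabled` setting, the `InferencePath` labels, and `chooseInferencePath` are hypothetical app-level choices.

```typescript
// Hypothetical routing helper: decide where this device should run inference.
type InferencePath = "on-device" | "cloud" | "unsupported";

function chooseInferencePath(isAvailable: boolean, cloudFallbackEnabled: boolean): InferencePath {
  if (isAvailable) return "on-device";
  // Unsupported ABI: fall back to a server if the app offers one,
  // otherwise show an "AI isn't supported on this device" message.
  return cloudFallbackEnabled ? "cloud" : "unsupported";
}
```

Keeping this choice in one place means the rest of the UI only ever branches on three explicit states.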

Bringing Your Own Model

You're not limited to the built-in registry. Any model can be exported to ExecuTorch's .pte format using ExecuTorch's Python API or optimum-executorch, then loaded through the same hooks. If you've already invested in the PyTorch and ExecuTorch export pipeline, this library slots in naturally. If instead you live in the GGUF and llama.cpp world, or you want Vercel AI SDK API compatibility, Callstack's react-native-ai is the more natural neighbor — ExecuTorch's distinguishing bet is the broad multimodal toolkit, not just LLM text.

The Verdict

React Native ExecuTorch is one of the most complete on-device AI toolkits in the React Native ecosystem: LLMs, speech, vision, OCR, embeddings, and image generation, all behind a consistent, declarative hook API, all backed by a team with a serious open-source maintenance record. The release cadence is brisk — roughly monthly minors plus frequent patches — which is great for features but means it's still pre-1.0 and ships occasional breaking changes in minor versions. Pin your version, read the release notes before upgrading, and you get something genuinely compelling: capable AI that's private by default, free to run, and works on the subway. For any app where data sensitivity, offline resilience, or runaway API bills matter, that's a trade worth taking seriously.