A futuristic control room with screens and a robotic arm interacting with UI elements

Midscene.js: The AI That Sees Your UI and Clicks for You

The Orange Cat

Writing CSS selectors is nobody's idea of a good time. You spend hours crafting the perfect data-testid attributes, carefully threading XPath expressions through deeply nested DOM trees, and then someone on the team renames a class and your entire test suite crumbles. Midscene.js takes a radically different approach: it looks at screenshots of your UI and uses vision-language models to understand what is on screen. You describe actions in plain English, and Midscene figures out where to click, what to type, and what data to extract. It works across web, Android, iOS, and desktop platforms, all from a single JavaScript SDK.

Seeing Is Believing

The core idea behind Midscene.js is its pure-vision approach. Since version 1.0, the library exclusively analyzes screenshots rather than parsing the DOM. This has some powerful implications:

  • It works with canvas elements, WebGL, cross-origin iframes, and CSS background images that traditional selectors cannot reach
  • Token usage drops by roughly 80% compared to DOM-based AI automation
  • The same approach works on any platform that can produce a screenshot
  • No selectors, no XPath, no data-testid attributes to maintain

The library supports multiple vision-language models including Doubao Seed, Qwen, Gemini, GLM-4, and ByteDance's own open-source UI-TARS model. You can even mix different models for different tasks like planning, element localization, and data extraction.

Getting Started

Install @midscene/web via npm or yarn:

npm install @midscene/web
# or
yarn add @midscene/web

You will also need a vision-language model API key configured in your environment. Midscene supports several providers, so pick the one that suits your setup.
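For an OpenAI-compatible endpoint, configuration comes down to a few environment variables. The variable names below are Midscene's standard config keys; the values are placeholders you would replace with your own provider's details.

```shell
# Midscene reads OpenAI-style credentials plus its own model selector.
# All values below are placeholders -- substitute your provider's details.
export OPENAI_API_KEY="your-api-key-here"
export OPENAI_BASE_URL="https://your-provider.example.com/v1"
export MIDSCENE_MODEL_NAME="qwen-vl-max-latest"
```

Some models need an extra flag so Midscene adapts its coordinate handling; check the model-configuration page for the provider you choose.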

Talk to Your Tests

Writing Your First Automation

The Playwright integration is the most popular entry point. Midscene extends Playwright's test fixtures with ai and aiQuery helpers that accept natural language instructions.

import { test, expect } from "@playwright/test";
import { PlaywrightAiFixture } from "@midscene/web/playwright";

const aiTest = test.extend(PlaywrightAiFixture());

aiTest("find and verify a product price", async ({ page, ai, aiQuery }) => {
  await page.goto("https://my-store.example.com");

  await ai('click on the "Electronics" category in the navigation menu');
  await ai('click on the first product in the grid');

  const productInfo = await aiQuery({
    name: "string, the product name",
    price: "string, the displayed price including currency symbol",
    inStock: "boolean, whether the item shows as available",
  });

  expect(productInfo.inStock).toBe(true);
  expect(productInfo.price).toBeDefined();
});

No selectors anywhere. The AI model looks at the rendered page and determines where "Electronics" is, what the first product looks like, and how to extract structured data from the product detail page.

Asking Questions About the Page

Beyond clicking and typing, Midscene can answer questions about what it sees. The aiAsk helper returns a free-form string, while typed helpers like aiBoolean and aiNumber give you precise return types.

import { test } from "@playwright/test";
import { PlaywrightAiFixture } from "@midscene/web/playwright";

const aiTest = test.extend(PlaywrightAiFixture());

aiTest("verify dashboard state", async ({ page, aiBoolean, aiNumber, aiAsk }) => {
  await page.goto("https://dashboard.example.com");

  const hasNotifications = await aiBoolean(
    "Are there any unread notification badges visible?"
  );

  const chartCount = await aiNumber(
    "How many chart widgets are displayed on the dashboard?"
  );

  const summary = await aiAsk(
    "Describe the overall layout of this dashboard page"
  );

  console.log(`Notifications: ${hasNotifications}`);
  console.log(`Charts: ${chartCount}`);
  console.log(`Layout: ${summary}`);
});

Natural Language Assertions

Traditional assertions check values. Midscene assertions check visual state in plain language:

await aiAssert("The login form has a username field and a password field");
await aiAssert("The submit button is blue and positioned below the form");
await aiAssert("There are no error messages visible on the page");

If the assertion fails, the AI explains what it actually saw versus what was expected, giving you a human-readable failure message instead of a cryptic selector error.

Beyond the Browser

Automating Mobile Apps

Midscene extends to Android and iOS through the same natural language API. For Android, it connects via adb; for iOS, it uses WebDriverAgent.

import { agentFromAdbDevice, getConnectedDevices } from "@midscene/android";

// Connect to the first device adb reports and build an agent for it
const devices = await getConnectedDevices();
const agent = await agentFromAdbDevice(devices[0].udid);

await agent.aiTap("the search icon in the top right corner");
await agent.aiInput("weather forecast", "the search input field");
await agent.aiKeyboardPress("Enter");

await agent.aiWaitFor("search results are displayed");

const results = await agent.aiQuery({
  items: "string[], the text of each search result",
});

console.log(results.items);

The same code style works whether you are testing a web app in Chrome, a native Android app, or an iOS application. The vision model handles the differences in rendering.

YAML Scripts for Non-Developers

Not everyone on the team writes TypeScript. Midscene supports YAML-based automation scripts that QA teams and product managers can write and maintain:

// Assumes the agent's page is already open on the login screen;
// each task runs a flow of the same ai* actions available in the SDK
const yamlScript = `
tasks:
  - name: Login flow
    flow:
      - aiInput: user@example.com
        locate: the email field
      - aiInput: secretpassword
        locate: the password field
      - aiTap: the Sign In button
      - aiWaitFor: the dashboard page loads
      - aiAssert: a welcome message is visible
`;

await agent.runYaml(yamlScript);
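These scripts don't have to live inside TypeScript at all. Midscene also ships a CLI that runs standalone YAML files; the sketch below writes one with a top-level web section that tells the CLI which page to open first. The file name is made up, and you should check the exact CLI invocation against the docs for your installed version.

```shell
# Write a standalone automation script to disk; with the CLI, the
# top-level `web:` section declares which URL to open before the tasks run.
cat > login-flow.yaml <<'EOF'
web:
  url: https://app.example.com/login
tasks:
  - name: Login flow
    flow:
      - aiInput: user@example.com
        locate: the email field
      - aiTap: the Sign In button
      - aiAssert: a welcome message is visible
EOF

# Then run it (requires a configured model API key in the environment):
# npx @midscene/cli ./login-flow.yaml
```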

Structured Data Extraction at Scale

The aiQuery method shines for web scraping scenarios. You define a TypeScript-style schema and Midscene extracts matching data from whatever is visible on screen:

import { PuppeteerAgent } from "@midscene/web/puppeteer";
import puppeteer from "puppeteer";

const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto("https://news.example.com");

const agent = new PuppeteerAgent(page);

const articles = await agent.aiQuery({
  articles: `{
    headline: string,
    author: string,
    publishedDate: string,
    category: string
  }[], all visible news articles on the page`,
});

for (const article of articles.articles) {
  console.log(`${article.headline} by ${article.author}`);
}

await browser.close();

This works on pages with complex layouts, dynamically loaded content, and even content rendered inside canvas elements where traditional scraping tools would struggle.

The Trade-offs

Midscene is not a drop-in replacement for Playwright or Puppeteer. Each action requires a screenshot capture followed by a vision model API call, which means execution is slower than selector-based automation. There is also a cost consideration since every interaction consumes API tokens. The library includes a caching system that helps with repeated runs of the same scripts, but the first run will always be slower.
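Caching is opt-in. A minimal sketch, assuming the documented MIDSCENE_CACHE environment switch:

```shell
# Reuse planning and element-locating results from previous runs of the
# same script, cutting latency and token spend on repeat executions.
export MIDSCENE_CACHE=true
```

Cached entries are keyed to the script, so edits to an instruction invalidate only the affected steps rather than the whole run.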

Determinism is another factor. AI-based element localization is remarkably accurate but not infallible. For critical production test suites where flakiness is unacceptable, you might want to combine Midscene's natural language approach with traditional selectors for the most critical paths.

Where Midscene Fits

Midscene.js fills a specific gap in the automation toolchain. It is ideal for testing complex UIs where selectors are fragile, for cross-platform test suites that need to cover web and mobile with the same codebase, and for data extraction tasks where the page structure is unpredictable. With nearly 12,000 GitHub stars and daily releases from the ByteDance web-infra team, the project is under heavy active development. The Chrome extension offers a zero-code entry point for teams that want to experiment before committing to the SDK. If you have ever wished your test scripts could just look at the screen and understand what they see, Midscene.js is exactly that wish granted.