LinYiBing's Blog

Posted 2026-06-03 | Updated 2026-06-04 | Engineering | 12 minutes read (About 1779 words)

Turning Vibecoding Debt into a Harness (00): Building a Local Harness in Public, on a Real Repo

If you also write real product code with AI, this feeling will be familiar: a feature ships fast, but a few days later you change something else and the earlier thing quietly breaks.

Every release needs a human in the loop — deciding by hand which tests to run, which logs to read, which failures to ignore, and which ones must be fixed. There’s no real confidence underneath it.

I’m building a harness now, not out of engineering perfectionism, but because these pain points have shown up too many times. So I decided to do this in public — a Build in Public series with one theme: standing up a harness inside a real monorepo.

Posted 2026-05-27 | Updated 2026-06-04 | Engineering | 8 minutes read (About 1155 words)

Why Midscene's Action Space Is a Protocol

How many places do you have to touch to add a new action to Midscene?

This is my rough-and-ready test for whether a framework is well designed. If adding a “pull to refresh” gesture means changing the locating logic, changing the planning prompt, changing the execution dispatch, and then patching in a pile of if-else along the way, then sooner or later the framework gets crushed under its own action space.

Midscene’s answer is: you add it in one place. This post is about how it does that, and about the part it does not do.
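To make "add it in one place" concrete, here is a minimal TypeScript sketch of the general idea. It is not Midscene's code: `ActionDef`, `defineAction`, `buildPrompt`, and `dispatch` are hypothetical names, and the hand-written `parse` function only stands in for what a Zod schema would do.

```typescript
// Hypothetical sketch: none of these names come from Midscene itself.
interface ActionDef<P> {
  name: string;
  description: string;          // text a planning prompt could be built from
  parse: (raw: unknown) => P;   // stand-in for a Zod schema's parse()
  run: (params: P) => string;   // execution; returns a log line in this sketch
}

const registry = new Map<string, ActionDef<any>>();

function defineAction<P>(def: ActionDef<P>): void {
  registry.set(def.name, def);
}

// Adding "pull to refresh" touches only this one declaration.
defineAction<{ distance: number }>({
  name: "PullToRefresh",
  description: "Pull the list down to refresh it",
  parse: (raw) => {
    const distance =
      typeof raw === "object" && raw !== null
        ? (raw as { distance?: unknown }).distance
        : undefined;
    if (typeof distance !== "number" || distance <= 0) {
      throw new Error("distance must be a positive number");
    }
    return { distance };
  },
  run: ({ distance }) => `swipe down ${distance}px`,
});

// The prompt text and the dispatcher are both derived from the registry,
// so neither needs an if-else branch per action.
function buildPrompt(): string {
  return Array.from(registry.values())
    .map((a) => `- ${a.name}: ${a.description}`)
    .join("\n");
}

function dispatch(name: string, rawParams: unknown): string {
  const action = registry.get(name);
  if (!action) throw new Error(`unknown action: ${name}`);
  return action.run(action.parse(rawParams));
}
```

In this sketch, the planner would see `buildPrompt()` and the executor would call `dispatch(name, params)` with whatever the model returned. A new gesture is one more `defineAction` call.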

The action space as a Zod protocol

Posted 2026-05-27 | Updated 2026-06-04 | Engineering | 6 minutes read (About 933 words)

How Does Midscene Stay Model-Agnostic?

People often ask me: when Midscene says it is model-agnostic, does that just mean swapping base_url and model_name?

If a model were only text in, text out, then yes, that is about it. Swap the endpoint and move on.

But a visual UI Agent asks the model to do something different: it asks the model to look at a screenshot and tell us where some element sits on the screen. And that is exactly where the trouble starts, because every model family reports “where” in its own way. So for us, being model-agnostic was never as light as switching an API. It means quietly absorbing all those differences inside the framework, so the ai('click login') line you wrote does not have to change by a single character.
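As a rough illustration of what "absorbing the differences" can look like, the TypeScript sketch below converts several location formats into one pixel point. The three formats and the names `ModelLocation` and `toPixelCenter` are assumptions made up for this example. They are not a list of what any particular model or Midscene actually uses.

```typescript
interface Size { width: number; height: number }
interface Point { x: number; y: number }

// Hypothetical output shapes a model family might use to say "where".
type ModelLocation =
  | { kind: "pixel-bbox"; box: [number, number, number, number] }      // x1, y1, x2, y2 in pixels
  | { kind: "normalized-bbox"; box: [number, number, number, number] } // x1, y1, x2, y2 on a 0..1000 scale
  | { kind: "pixel-point"; point: [number, number] };                  // x, y in pixels

// Convert any of the shapes above into one pixel point to click.
function toPixelCenter(loc: ModelLocation, screen: Size): Point {
  switch (loc.kind) {
    case "pixel-bbox": {
      const [x1, y1, x2, y2] = loc.box;
      return { x: Math.round((x1 + x2) / 2), y: Math.round((y1 + y2) / 2) };
    }
    case "normalized-bbox": {
      const [x1, y1, x2, y2] = loc.box;
      return {
        x: Math.round(((x1 + x2) / 2 / 1000) * screen.width),
        y: Math.round(((y1 + y2) / 2 / 1000) * screen.height),
      };
    }
    case "pixel-point":
      return { x: loc.point[0], y: loc.point[1] };
  }
}
```

The calling script only ever sees a `Point`, which is why a line like ai('click login') can stay the same when the model behind it changes.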

The same script running across different models

Posted 2026-05-26 | Updated 2026-06-04 | Engineering | 20 minutes read (About 2986 words)

Why Does Midscene Split Locate and Action into Two Steps?

The previous post, What Actually Happens Inside a Single Midscene aiAct Call?, walked through the plan-execute loop inside aiAct, but one stop was deliberately left unopened — “finding the element”.

That stop is arguably the most technically distinctive part of Midscene. Most vision Agents either trust the coordinates the AI gives them, or fire one more AI request to refine the location. Midscene takes a different path: separate the locate step out, and try four fallback layers in order from cheapest to most expensive.
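The "cheapest first" idea can be sketched in a few lines of TypeScript. This is a generic fallback chain written under my own assumptions, not Midscene's implementation: the `LocateLayer` shape, the numeric `cost` field, and `locateWithFallback` are hypothetical. It is synchronous only to keep the sketch short, whereas real layers would likely be asynchronous.

```typescript
interface Point { x: number; y: number }

// Hypothetical description of one locate strategy.
interface LocateLayer {
  name: string;
  cost: number;                              // relative cost; lower is tried first
  locate: (prompt: string) => Point | null;  // null means "this layer could not find it"
}

interface LocateResult { point: Point; layer: string; tried: string[] }

// Try layers from cheapest to most expensive and stop at the first hit.
function locateWithFallback(prompt: string, layers: LocateLayer[]): LocateResult | null {
  const ordered = layers.slice().sort((a, b) => a.cost - b.cost);
  const tried: string[] = [];
  for (const layer of ordered) {
    tried.push(layer.name);
    const point = layer.locate(prompt);
    if (point !== null) {
      return { point, layer: layer.name, tried };
    }
  }
  return null; // every layer failed
}
```

Returning `tried` alongside the point makes it easy to see in logs how far down the chain a given locate had to go.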

This post is about that.

Posted 2026-05-26 | Updated 2026-06-04 | Engineering | 15 minutes read (About 2199 words)

What Actually Happens Inside a Single Midscene aiAct Call?

The previous post, Why Does Midscene’s UI Agent Need to See the Screen?, explained why Midscene puts “look at the screenshot” at the very front of every UI action. Right after that explanation, I usually get the next question from coworkers:

“OK, but what actually runs inside aiAct? When I write a single agent.aiAct('log in and place an order'), what really happens? Is it just one model call?”

It is not one call. It is a loop with feedback.

This post takes that loop apart: how the screenshot is grabbed, what the AI returns, when the loop stops, and how context flows across rounds.
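As a mental model, a "loop with feedback" has roughly the following shape in TypeScript. Everything here is an assumption for illustration: `Plan`, `LoopDeps`, `runLoop`, and the `maxRounds` cap are hypothetical names, and the real aiAct internals may differ in every detail.

```typescript
// Hypothetical shape of what the planner returns each round.
interface Plan {
  action: string | null; // next action to execute, or null if there is nothing to do
  done: boolean;         // true when the planner considers the instruction finished
}

// Hypothetical dependencies, injected so the loop itself stays testable.
interface LoopDeps {
  screenshot: () => string;
  plan: (instruction: string, shot: string, history: string[]) => Plan;
  execute: (action: string) => string; // returns feedback for the next round
}

function runLoop(instruction: string, deps: LoopDeps, maxRounds = 10): string[] {
  const history: string[] = []; // context that flows across rounds
  for (let round = 0; round < maxRounds; round++) {
    const shot = deps.screenshot();                      // 1. look at the screen
    const next = deps.plan(instruction, shot, history);  // 2. ask for the next step
    if (next.action !== null) {
      history.push(deps.execute(next.action));           // 3. act, keep the feedback
    }
    if (next.done) return history;                       // 4. stop when the planner says so
  }
  throw new Error(`gave up after ${maxRounds} rounds`);
}
```

The point of the sketch is the ordering: a fresh screenshot is taken every round, and the feedback from earlier actions is passed back into planning instead of being discarded.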

Midscene core architecture

Posted 2026-05-26 | Updated 2026-06-04 | Engineering | 13 minutes read (About 1951 words)

Why Does Midscene's UI Agent Need to See the Screen?

While working on Midscene, I often run into the same question: why does a UI Agent need screenshots? Why not keep using DOM, selectors, XPath, accessibility trees, and the other things traditional automation has already made mature?

It is a fair question. For more than a decade, UI automation has mostly grown around structured interface data. But if we are not trying to build just a smarter Web testing framework, and instead want a UI Agent that can operate Web pages, mobile apps, desktop apps, Canvas, and custom devices, the default input has to shift a little: see the screen first, then decide what to do.

A UI Agent should see the screen first

Posted 2023-07-16 | Updated 2026-06-04 | Second Brain | 10 minutes read (About 1468 words)

I Built a Plugin for My Obsidian Practice

Building an Obsidian plugin for A Practical Approach to Building My Second Brain with Obsidian!

The content of this article is out of date. Please refer to the official website LifeOS for more information.

Posted 2023-07-08 | Updated 2026-06-04 | Second Brain | 18 minutes read (About 2717 words)

Building my second brain 🧠 with Obsidian

This article shares how I use Obsidian to build a second brain.

The content of this article is out of date. Please refer to the official website LifeOS for more information.

Posted 2023-06-17 | Updated 2026-06-04 | Engineering | 25 minutes read (About 3705 words)

Frontend Engineering Practices at ByteDance

Invited to speak at the 2023 WOT Global Technology Innovation Conference organized by 51CTO.

Posted 2022-12-31 | Updated 2026-06-04 | Engineering | 27 minutes read (About 4073 words)

Frontend Monorepo Practices at ByteDance

Invited to speak at the 11th Top100 Summit. For more details, see the article "Year-end Review: How These 100 Tech Innovation Leaders Do Retrospectives".
