Turning Vibecoding Debt into a Harness (00): Building a Local Harness in Public, on a Real Repo

If you write real product code with AI, this feeling will be familiar: a feature ships fast, but a few days later you change something else and the earlier thing quietly breaks.

Every release needs a human in the loop, deciding by hand which tests to run, which logs to read, which failures to ignore, and which ones must be fixed. There’s no real confidence underneath it.

I’m building a harness now, not out of engineering perfectionism, but because these pain points have shown up too many times. So I decided to do it in public, as a Build in Public series with one theme: standing up a harness inside a real monorepo.

Why Midscene's Action Space Is a Protocol

How many places do you have to touch to add a new action to Midscene?

This is my rough-and-ready test for whether a framework is well designed. If adding a “pull to refresh” gesture means changing the locating logic, changing the planning prompt, changing the execution dispatch, and then patching in a pile of if-else along the way, then sooner or later the framework gets crushed under its own action space.

Midscene’s answer is: you add it in one place. This post is about how it does that, and about the part it does not do.
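
To make "one place" concrete, here is a minimal sketch of the idea, not Midscene's actual code. Every name in it (`ActionDef`, `defineAction`, `describeActions`, `dispatch`) is hypothetical, and the hand-written `parse` function only stands in for a Zod schema. The point it illustrates: if both the planning prompt and the dispatch are derived from a single declaration, adding a gesture does not require touching either.

```typescript
// Hypothetical sketch: none of these names come from Midscene.
interface ActionDef<P> {
  name: string;
  description: string;
  // Stand-in for a Zod schema: returns validated params or throws.
  parse: (raw: unknown) => P;
  run: (params: P) => string;
}

const registry = new Map<string, ActionDef<any>>();

function defineAction<P>(def: ActionDef<P>): void {
  registry.set(def.name, def);
}

// Adding a gesture touches exactly one place: this declaration.
defineAction<{ distance: number }>({
  name: "pullToRefresh",
  description: "Drag down from the top of the list to reload it.",
  parse: (raw) => {
    const distance = (raw as { distance?: unknown } | null)?.distance;
    if (typeof distance !== "number" || distance <= 0) {
      throw new Error("pullToRefresh: distance must be a positive number");
    }
    return { distance };
  },
  run: ({ distance }) => `dragged down ${distance}px`,
});

// The text shown to the planner is derived from the registry...
function describeActions(): string {
  return Array.from(registry.values())
    .map((a) => `- ${a.name}: ${a.description}`)
    .join("\n");
}

// ...and so is dispatch, with no per-action if-else.
function dispatch(name: string, raw: unknown): string {
  const action = registry.get(name);
  if (!action) throw new Error(`unknown action: ${name}`);
  return action.run(action.parse(raw));
}
```

In this sketch, a malformed plan from the model fails at `parse`, before anything is executed.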

The action space as a Zod protocol

How Does Midscene Stay Model-Agnostic?

People often ask me: when Midscene says it is model-agnostic, does that just mean swapping base_url and model_name?

If a model were only text in, text out, then yes, that is about it. Swap the endpoint and move on.

But a visual UI Agent asks the model to do something different: it asks the model to look at a screenshot and tell us where some element sits on the screen. And that is exactly where the trouble starts, because every model family reports “where” in its own way. So for us, being model-agnostic was never as light as switching an API. It means quietly absorbing all those differences inside the framework, so the ai('click login') line you wrote does not have to change by a single character.
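
A rough sketch of what "absorbing the differences" can look like is below. It is an illustration, not Midscene's implementation: the three format names are made up for this example and are not a catalogue of what any particular model returns. The idea is only that each raw answer is converted into one pixel rectangle before the rest of the framework sees it.

```typescript
// Hypothetical sketch: formats and names are illustrative only.
type Rect = { left: number; top: number; width: number; height: number };
type Size = { width: number; height: number };
type Box = [number, number, number, number];

type RawBox =
  | { format: "pixel-xyxy"; box: Box }      // absolute pixels: x1, y1, x2, y2
  | { format: "ratio-xyxy"; box: Box }      // 0..1 of the screen: x1, y1, x2, y2
  | { format: "per-mille-yxyx"; box: Box }; // 0..1000 of the screen: y1, x1, y2, x2

// Convert any supported format to pixel corners [x1, y1, x2, y2].
function toCorners(raw: RawBox, screen: Size): Box {
  const [a, b, c, d] = raw.box;
  const w = screen.width;
  const h = screen.height;
  if (raw.format === "pixel-xyxy") return [a, b, c, d];
  if (raw.format === "ratio-xyxy") return [a * w, b * h, c * w, d * h];
  // per-mille-yxyx: swap the axis order and rescale from 0..1000.
  return [(b / 1000) * w, (a / 1000) * h, (d / 1000) * w, (c / 1000) * h];
}

// Everything downstream only ever sees this one shape.
function toPixelRect(raw: RawBox, screen: Size): Rect {
  const [x1, y1, x2, y2] = toCorners(raw, screen).map(Math.round);
  return { left: x1, top: y1, width: x2 - x1, height: y2 - y1 };
}
```

With a converter like this at the boundary, switching models means switching which `format` is expected, while the calling script stays untouched.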

The same script running across different models

Why Does Midscene Split Locate and Action into Two Steps?

The previous post, What Actually Happens Inside a Single Midscene aiAct Call?, walked through the plan-execute loop inside aiAct, but one stop was deliberately left unopened: “finding the element”.

That stop is arguably the most technically distinctive part of Midscene. Most vision Agents either trust the coordinates the AI gives them, or fire one more AI request to refine the location. Midscene takes a different path: it separates the locate step out and tries four fallback layers in order, from cheapest to most expensive.

This post is about that.
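
As a shape to keep in mind, a cheapest-first fallback chain can be sketched like this. It is a generic illustration under my own assumptions: the `LocateLayer` interface, the `cost` field, and the placeholder layer names are hypothetical, and the excerpt above does not say what the four layers actually are. A real implementation would also be asynchronous; this one is synchronous to keep it short.

```typescript
// Hypothetical sketch: a generic cheapest-first fallback chain.
type Point = { x: number; y: number };

interface LocateLayer {
  name: string;
  cost: number; // relative cost, lower is tried first
  tryLocate: (target: string) => Point | null; // null means "I could not find it"
}

function locate(
  target: string,
  layers: LocateLayer[],
): { point: Point; via: string } | null {
  // Sort a copy so the caller's array is left untouched.
  const ordered = layers.slice().sort((a, b) => a.cost - b.cost);
  for (const layer of ordered) {
    const point = layer.tryLocate(target);
    // Stop at the first layer that succeeds; costlier layers never run.
    if (point) return { point, via: layer.name };
  }
  return null;
}
```

The property that matters is the early return: an expensive layer is only paid for when every cheaper one has already failed.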

What Actually Happens Inside a Single Midscene aiAct Call?

The previous post, Why Does Midscene’s UI Agent Need to See the Screen?, explained why Midscene puts “look at the screenshot” at the very front of every UI action. Right after that explanation, I usually get the next question from coworkers:

“OK, but what actually runs inside aiAct? When I write a single agent.aiAct('log in and place an order'), what really happens? Is it just one model call?”

It is not one call. It is a loop with feedback.

This post takes that loop apart: how the screenshot is grabbed, what the AI returns, when the loop stops, and how context flows across rounds.
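
For orientation, the overall shape of such a loop can be sketched as below. This is a simplified model under stated assumptions, not Midscene's code: `LoopDeps`, `PlanStep`, `runLoop`, and the `maxRounds` cap are all hypothetical names, and the real loop is asynchronous and carries much richer context.

```typescript
// Hypothetical sketch of a plan-execute loop with feedback.
type PlanStep = { action: string; done: boolean };

interface LoopDeps {
  screenshot: () => string; // returns some encoding of the current screen
  plan: (goal: string, shot: string, history: string[]) => PlanStep;
  execute: (action: string) => string; // returns a short result note
}

function runLoop(goal: string, deps: LoopDeps, maxRounds = 10): string[] {
  // Context that flows across rounds: what was done and how it went.
  const history: string[] = [];
  for (let round = 0; round < maxRounds; round++) {
    // Every round starts from a fresh look at the screen.
    const shot = deps.screenshot();
    const step = deps.plan(goal, shot, history);
    // The loop stops when the planner says the goal is reached...
    if (step.done) return history;
    history.push(deps.execute(step.action));
  }
  // ...or when the round budget runs out.
  throw new Error(`gave up after ${maxRounds} rounds`);
}
```

The feedback comes from two places in this sketch: the new screenshot taken each round, and the `history` handed back to the planner.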

Midscene core architecture

Why Does Midscene's UI Agent Need to See the Screen?

While working on Midscene, I often run into the same question: why does a UI Agent need screenshots? Why not keep using DOM, selectors, XPath, accessibility trees, and the other techniques that traditional automation has already matured?

It is a fair question. For more than a decade, UI automation has mostly grown around structured interface data. But if we are not trying to build just a smarter Web testing framework, and instead want a UI Agent that can operate Web pages, mobile apps, desktop apps, Canvas, and custom devices, the default input has to shift a little: see the screen first, then decide what to do.

A UI Agent should see the screen first