Imagine a team ships a distributed database with impressive benchmark results and a warning in the README: we do not really understand the failure modes yet, but the demo is excellent.
Nobody serious would say: great, then the only responsible thing is to stop all database work forever. And nobody serious would say: excellent, let us put it under the payments system and discover the edge cases socially.
This is roughly how I feel about AI risk arguments once you remove the theatre. We are building useful systems whose behavior is not fully understood. That is not a moral panic. It is also not nothing.
The public argument often jumps straight from toy demos to destiny. One side says: it failed this puzzle, therefore no real issue. The other side says: it solved this benchmark, therefore the future is decided. Both moves are too clean. Real systems usually become important while they are still embarrassing in several ways.
A service can be flaky and still be load-bearing. A database can have ugly edge cases and still hold the company together. A model can be wrong about simple things and still be good enough to route tickets, write code, summarize documents, call tools, or persuade someone who is not checking carefully.
The boring work still matters:
- Keep evaluations from becoming ceremonial.
- Log enough context to reconstruct failures.
- Know what the system is allowed to do before giving it tools.
- Put humans in the loop where error costs are high.
- Make rollback boring (see the sketch after this list).
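
To make a couple of those bullets concrete, here is a minimal sketch of the logging and rollback pieces. The `call_model` client, the model names, and the log fields are placeholders I invented for illustration, not a claim about any particular stack.

```python
import json
import time
import uuid

# Pinned in config so rollback is a one-line change, not an incident.
# Model names are placeholders.
MODEL_VERSION = "assistant-2024-06-01"
FALLBACK_MODEL_VERSION = "assistant-2024-03-15"


def call_model(model: str, prompt: str) -> str:
    """Stand-in for whatever client the team actually uses."""
    raise NotImplementedError


def logged_call(prompt: str, user_id: str, tools_enabled: list[str]) -> str:
    """Call the model and log enough context to reconstruct a failure later."""
    record = {
        "request_id": str(uuid.uuid4()),
        "timestamp": time.time(),
        "model": MODEL_VERSION,
        "user_id": user_id,
        "tools_enabled": tools_enabled,
        "prompt": prompt,
    }
    try:
        record["output"] = call_model(MODEL_VERSION, prompt)
        return record["output"]
    except Exception as exc:
        record["error"] = repr(exc)
        raise
    finally:
        # One append-only line per call; grep-able when something goes wrong.
        with open("model_calls.jsonl", "a") as f:
            f.write(json.dumps(record) + "\n")
```

The specific fields matter less than the property: reconstructing a bad interaction should not depend on asking the user what happened, and moving back to the previous model should be a config change, not a project.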
The hard part is that model behavior is not code behavior. You can read a service and still miss a race condition, but at least the code is the thing being deployed. A model's behavior is compressed out of data, training choices, architecture, prompts, tools, and the user distribution you have not met yet.
This changes the shape of testing. With ordinary code, a good test can often pin down behavior directly: given this input, expect that output. With models, the exact output is sometimes the wrong thing to test. You end up testing distributions, tendencies, refusal behavior, tool-use boundaries, calibration, and whether the system keeps working after the prompt, model, retrieval corpus, or user population changes.
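
As a sketch of what a distributional test can look like: the `generate` function below stands in for the deployed model plus its prompt and tools, the refusal classifier is deliberately crude, and the thresholds are illustrative, not recommendations.

```python
def generate(prompt: str) -> str:
    """Stand-in for the deployed model, prompt, and tool configuration."""
    raise NotImplementedError


def is_refusal(output: str) -> bool:
    """Crude placeholder classifier; a real one is its own evaluated component."""
    return any(p in output.lower() for p in ("i can't", "i cannot", "i won't"))


def test_refusal_behavior(should_refuse: list[str], should_answer: list[str]) -> None:
    """Assert tendencies over a prompt set, not exact strings for single inputs."""
    refused = sum(is_refusal(generate(p)) for p in should_refuse) / len(should_refuse)
    over_refused = sum(is_refusal(generate(p)) for p in should_answer) / len(should_answer)
    # The thresholds are illustrative; the point is the assertion is distributional.
    assert refused >= 0.95, f"refusal rate dropped to {refused:.2%}"
    assert over_refused <= 0.05, f"over-refusal rate rose to {over_refused:.2%}"
```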
That is uncomfortable engineering. It is also engineering.
The first operational questions I care about are not the science fiction ones. I would rather know:
- Can the system take actions, or only suggest them? (See the sketch after this list.)
- Who notices when it is confidently wrong?
- What happens when a user steers it around a boundary?
- Does it know when it is outside the domain?
- How much damage can one bad output cause before a human sees it?
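
One hedged sketch of how the first and last of those questions can become configuration rather than vibes; the policy fields, tool names, and defaults here are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Mode(Enum):
    SUGGEST = "suggest"  # outputs go to a human, nothing executes
    ACT = "act"          # outputs may trigger allowlisted tools


@dataclass
class Policy:
    mode: Mode
    allowed_tools: frozenset[str]
    max_unreviewed_actions: int  # blast radius before a human must look


def dispatch(policy: Policy, tool: str, unreviewed_so_far: int) -> str:
    """Decide what a single proposed tool call is allowed to do."""
    if policy.mode is Mode.SUGGEST:
        return "queue for human"
    if tool not in policy.allowed_tools:
        return "reject"
    if unreviewed_so_far >= policy.max_unreviewed_actions:
        return "pause for human review"
    return "execute"


# Illustrative defaults, not recommendations.
support_bot = Policy(Mode.ACT, frozenset({"lookup_order", "draft_reply"}),
                     max_unreviewed_actions=20)
```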
This is why I distrust both lazy answers. "It will definitely kill everyone" is too confident. "It is just autocomplete" is also too confident, and somehow less curious.
"Just autocomplete" was a better dismissal when the system completed text in a box and stopped there. It becomes less satisfying when the text is connected to a terminal, a calendar, a trading account, a hiring workflow, a customer-support queue, or a lab robot. At that point the question is not whether the core mechanism can be described dismissively. Lots of dangerous things have simple mechanisms. The question is what the loop can do.
Suppose we spend 95% of the effort making models more capable and 5% understanding where they fail. Is that enough? I do not know. Maybe the right number is higher. Maybe it depends on domain, deployment, and whether the model can act in the world.
I am not arguing for paralysis. Good safety work should make useful deployment easier because it replaces vibes with measured limits. The team that knows where the system fails can deploy it in narrower, better-designed places. The team that only has a demo has to choose between overconfidence and fear.
But zero is not a serious number.
Zero means the plan is to learn from production incidents and call that empiricism.