Proofs, Not Prayers: Scalable Formal Oversight
A formal methods approach to AI safety that actually scales
Dear SoTA,
I’m writing to tell you about scalable formal oversight (SFO), a new research agenda which aims to avoid catastrophic fallout from accelerating AI capabilities by leveraging (mostly) formal methods, perhaps the driest and most boring subject in computer science. My wife tells me I have a PhD in Boring.
To set the stage, we first need to briefly discuss AI timelines. In San Francisco, everyone has their own AI timeline, which is, their personal (hazy, probabilistic) perspective on how quickly AI capabilities will improve. Two things unify all empirically justifiable timelines. First, everyone agrees that AI capabilities are rapidly improving and generalizing. Second, if current trends continue, AI capabilities will have world-altering effects (read: utopia, or armageddon, or some really weird third option we can’t yet envision).
Figure 1: The AI timeline of the Dallas Federal Reserve
There are, of course, AI skeptics, who point out that, e.g., the pace of silicon hardware development followed Moore’s Law for a very long time, before leveling off. Why not assume AI will do the same? The problem is that there is no good way of measuring AI capabilities which shows any leveling off yet, and banking the fate of humanity on this leveling-off thing happening on a convenient enough horizon seems foolish. Skeptics also point to various flaws in the benchmarks we use to measure AI progress -- however, most software engineers who actually use AI for their day-to-day coding work view this argument as foolish, since their own personal, anecdotal experience tends to support the claim that AI capabilities are rapidly improving.
But your personal timeline almost doesn’t really matter when we consider that current AI capabilities are absolutely “good” enough to negatively impact us. Current open-weight models like GLM-4.5 are good enough to build super-powerful, self-evolving computer viruses, automated penetration testers, superhuman prediction market agents, and more -- and can be run locally on a powerful desktop computer. These use-cases, today, are opaque, sophisticated, and uncontrollable. Their second or third-order consequences could absolutely lead to the next financial collapse, armed conflict, etc.
So what can we do about this? There are, as I see it, three major ideas. The first is a policy response. For example, we could get a multilateral “pause AI” policy, where everyone agrees to “just stop it”. This proposal has at least two flaws. First, it cannot stop the illegal use of open-weight models, some of which are small enough to run on a smartphone. I’d call this the “gun control” problem. Second, state actors will be incentivized to train and use frontier models in secret even if they say they’re no longer doing so, out of fear their rivals are doing the same. I’d call this the “nuclear nonproliferation” problem. Depending on your threat model, these are pretty significant problems.
The second idea is to “align” the models by screwing with their weights. When we talk about an AI, we’re talking about a system powered by a large language model, or LLM. The LLM is a parameterized model, and its parameters are numbers called weights, of which it’ll have up to tens of billions. A model’s “intelligence” boils down to this choice of weights. We can use fancy statistics to single out the weights that seem to correlate with bad behaviors, such as agreeing to modify a computer virus, and try to predict misbehavior or screw with the weights in order to diminish the probability of these behaviors without impacting the useful capabilities of the model. This work occupies the minds of hundreds of sharp folks at institutions like MATS, Anthropic, Google DeepMind, etc. However, it’s a flawed endeavor, because any behavior you might think of as “bad” in one context is “good” in another. For example, the ability to hack a system is “bad” when used by a criminal to hack your bank, but “good” when used by an engineer at Schwab to identify, and fix, vulnerabilities in your bank. The AI cannot tell the difference between these two situations (the criminal can just lie, and say they’re an engineer performing an audit). Since the capitalist incentive is to sell a useful model that can actually do stuff, alignment work will never yield models that reliably say “no” to all misaligned requests. (There are also some philosophical reasons to doubt alignment research, but that’s another conversation entirely.)
The third approach, which is also flawed but, in my view, less- or at least differently-so than/from the other two, is to put the AI in a box. We ask the AI to perform some task, say, “build me an iPhone calculator app”. Then, we ask it to prove that it did so in an aligned manner -- that is to say, the app doesn’t have any vulnerabilities or unintended behaviors, it’s reliable and performant, and so forth. The properties for which we require proof can be quite broad, in essence encoding the general values we want the AI to align with. (There are philosophical reasons to prefer this approach.) The proof is then submitted to a software called an interactive theorem prover, which is a symbol-wrangling machine that definitively checks the proof. Formal methods is the academic field that does this kind of work (interactive theorem proving, and other “formal” mathematical techniques for software assurance). This kind of approach -- boxing the AI using formal methods -- is scalable formal oversight (SFO).
Note, in this short letter I described one SFO technique, called secure program synthesis, which you can read more about here. SFO actually encompasses a number of other formal-methods-based techniques as well. E.g., one might build formal runtime monitors that enforce fine-grained access control policies on AI actions. For a more comprehensive explanation I would point you to my blog post on SFO, linked in the first sentence of this letter.
SFO does not solve all our safety woes. It’s limited to things we can prove or disprove, which means it won’t help with, e.g., stopping the model from teaching someone how to make a chemical weapon. Also, it’s more expensive than having no safety system, so there is the capitalism problem of getting buy-in from people who host models. This challenge is especially salient in the context of open-weight models since you need to convince the customer, not just the model company. It also doesn’t solve the second- or third-order consequences of companies using AI to do things that make money but are risky for society, such as using prediction markets to gamble on house prices or combat operations.
We are working on SFO not because it solves all the safety problems posed by AI, but because it solves some of them, and we are not aware of any better solution to the problems it does solve. We need your help! We encourage you to read the (sparse) existing literature on SFO and start working on this space. There are an abundance of low-hanging-fruit problems to work on with very serious safety implications — and we have some ideas for how to address the limitations of SFO — but we have absolutely no clue how much time we have to do it all before something catastrophic happens.
Cheerfully,
Max von Hippel
Max is the co-founder and CTO of Benchify (YC S24), an autoformalization startup. Previously, he completed a PhD at Northeastern University Khoury College of Computer Science, advised by Dr. Cristina Nita-Rotaru in the NDS2 group, as well as a BS in pure mathematics from the University of Arizona, an International IB diploma from UWCCR, and most importantly, a Blue Belt from 10th Planet Jiu Jitsu. He also co-organizes the Boston Computation Club, created the FM x AI hub, and grew up in Anchorage, Alaska.



