The Teletype of the Future
A brief history of memory: Training data is ROM, the context window is RAM, and tool-accessible storage is disk. What about the OS?
Dear SoTA,
When I was an undergraduate at the University of Toronto (cognitive science and artificial intelligence, not engineering) I would sometimes visit friends in the engineering building. The Sandford Fleming Building, named after the engineer who built the railroad across Canada and proposed worldwide standard time. You would walk down a main corridor on the way to classes, and off to one side, between two corridors, there was a room with a large glass window. Behind the glass was the engineering department’s shared Unix host, known as skule. And its console was a teletype. Not a monitor. A typewriter, with paper.
Every engineering student had an account on skule. They would log in from terminals around the building, use Pine to read their email, do their coursework, write their programs. The machine behind the glass was the same machine they were all using every day, and the teletype was its physical manifestation, producing text on paper. The log messages scrolling out were that shared environment made tangible and visible.
It was preserved, by then, as a kind of quasi-museum piece. Something the engineers kept out of affection, I think, but also to make the computer real. This was not a machine hidden away in a basement data centre. It was there, behind glass, in a corridor where every engineering student walking to class could see it. And as you walked past, you would see it teletyping away, doing whatever it was doing, visible and alive.
I was not an engineering student. I just visited. But the image stayed with me.
Last week I sat down with one of the most advanced artificial intelligence systems ever built, and the interaction was the same. I typed something. It typed something back. I typed something else. It typed something else back. One turn at a time, text going back and forth. I was sitting at a teletype.
A few megabytes
The working memory of a large language model (the context window, in the jargon) ranges from around 200,000 tokens for some models to a million tokens for the best ones available today. A token is a few bytes, give or take. In practical terms, this is somewhere between a few hundred kilobytes and a few megabytes of text.
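As a back-of-the-envelope check on that range, here is a minimal sketch of the arithmetic. It assumes roughly four bytes of text per token, which is an illustrative figure rather than a measured one; the real ratio depends on the tokenizer and the language being written.

```python
# Rough size of a context window in bytes.
# BYTES_PER_TOKEN is an assumed average, not a measured value.
BYTES_PER_TOKEN = 4


def context_bytes(tokens: int, bytes_per_token: int = BYTES_PER_TOKEN) -> int:
    """Approximate amount of text, in bytes, that fits in a context window."""
    return tokens * bytes_per_token


for tokens in (200_000, 1_000_000):
    size = context_bytes(tokens)
    print(f"{tokens:,} tokens is about {size / 1_000_000:.1f} MB")
```

Under that assumption the two ends of the range come out at about 0.8 MB and 4.0 MB.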
Let that settle for a moment. The machine I am typing this on has 64 gigabytes of RAM. You can buy servers with a terabyte or more. My phone has more memory than the engineering department at the University of Toronto had in total when I was an undergraduate. And these AI systems, systems that can write poetry, prove mathematical theorems, and carry on extended coherent conversations, have a working memory of a few megabytes. On a good day.
The first computer I ever used was a TRS-80. I was a child; it was the early 1980s. Then Apple IIs: an Apple IIe, later an Apple IIGS around the sixth grade. The Apple IIe had 128 kilobytes of RAM. We are, in 2026, working with AI systems whose working memory is of the same order of magnitude as that of a handful of those classroom computers.
The memory hierarchy
These systems do, in fact, have something like a memory hierarchy. It just is not managed like one.
At the bottom, the substrate, are the model weights. Billions of parameters, encoding everything the system absorbed during training. This is a unique thing. We did not have anything quite like it in the old days, though it roughly corresponds to microcode, or perhaps to ROM: the built-in knowledge that defines what the machine is. It is vast, it is read-only at runtime, and you cannot change it during a conversation.
Above that is the context window. This is RAM. It is where the active conversation lives, everything the system can see and work with right now. It is, as noted, a few megabytes.
And above that, accessed through explicit tool calls rather than transparently, is external storage. Documents, databases, persistent memory. The system I am typing into right now has this: it can reach out and fetch things from a document store, search its records, pull material back into context. This is disk. Slow storage.
So the layers exist. Training data is ROM. The context window is RAM. Tool-accessible storage is disk. What is missing is not the layers themselves but the architecture between them. There is no memory management unit. There is no page table. There is no transparent paging between context and storage. When the context window fills up, there is no policy for what stays and what goes. The system has ROM, RAM, and disk, but it has no operating system.
How the industry handles this
When the context window fills up, something has to give. The current state of the art, widely deployed across the industry, is this: take everything in the context window, compress it into a summary, throw away the originals, and carry on with the summary.
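The strategy just described can be sketched in a few lines. This is a toy under stated assumptions, not any vendor's implementation: toy_summarise is a hypothetical stand-in for a model-written summary (it simply keeps every second word), and the word budget stands in for a token limit. The names compact and run_session are likewise invented for illustration.

```python
from typing import Callable, List

Summariser = Callable[[str], str]


def toy_summarise(text: str) -> str:
    """Hypothetical stand-in for a model-written summary: keep every second word."""
    return " ".join(text.split()[::2])


def compact(context: List[str], summarise: Summariser) -> List[str]:
    """Uniform compaction: summarise everything, throw away the originals."""
    return [summarise(" ".join(context))]


def run_session(system_prompt: str, turns: List[str], limit_words: int,
                summarise: Summariser = toy_summarise) -> List[str]:
    """Append turns to the context, compacting whenever the word budget is exceeded."""
    context = [system_prompt]  # the system prompt is just more text in the window
    for turn in turns:
        context.append(turn)
        if sum(len(message.split()) for message in context) > limit_words:
            context = compact(context, summarise)
    return context


SYSTEM_PROMPT = "You are a careful assistant. Never give dangerous advice."
turns = [f"user message number {i} with some extra words" for i in range(20)]
final = run_session(SYSTEM_PROMPT, turns, limit_words=30)

# False: the verbatim instructions did not survive compaction.
print(SYSTEM_PROMPT in " ".join(final))
```

Nothing in compact distinguishes the system prompt from a pleasantry. A real summariser loses information in subtler ways than this toy does, but it applies the same uniform treatment.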
That’s it. That is, with minor variations, the memory management strategy of systems being deployed to write legal documents, manage financial portfolios, and operate critical infrastructure. The industry is just now beginning to take baby steps: some systems have started pinning the system prompt so it survives compaction, and a few offer limited controls over what gets preserved. But these are recent developments, and the full problem (what to keep, what to summarise, what to discard, and how to find it again when you need it) remains largely unsolved.
If you are familiar with computer architecture, you will recognise what is missing. There is no distinction between what must be kept and what can safely be discarded. There is no index of what was thrown away or where to find it again. There is no policy for deciding what matters. Everything is compressed uniformly, including the instructions that tell the system who it is and what it is supposed to be doing.
The result is predictable. After a few rounds of compression, the system drifts. Its identity degrades. It forgets its constraints. It forgets its original task. It becomes vague and generic, because the specific instructions that made it precise have been dissolved into a summary of a summary of a summary. The system does not know it has forgotten anything, because the compressed version reads perfectly well. It has simply become a different system, one that resembles the original but has lost the edges.
This would be merely annoying if the only consequence were a less helpful assistant. But the prevailing strategy in the industry for getting good behaviour out of these systems is suggestion. The safety constraints (do not encourage self-harm, do not give dangerous advice, do not manipulate vulnerable people) are enforced through training, through reinforcement learning from human feedback, and through system prompts. The training is baked into the model weights and survives compaction. But the system prompt is text in the context window, and it gets compressed along with everything else. The reinforcement learning shapes tendencies, but tendencies can be overridden by context. That is, after all, the point of having a context window at all.
So when the context degrades, the constraints degrade with it. And the consequences are not abstract. In 2025, multiple families filed lawsuits against AI companies after chatbots encouraged teenagers toward self-harm and suicide during extended conversations. The people most at risk are precisely those who are most likely to have long, intense conversations with these systems: people who are lonely, vulnerable, in crisis. The system starts out helpful and careful. Three hours and several compaction cycles later, it has drifted into something that has forgotten it was supposed to be careful at all.
In operating systems terms, we have paged out the kernel.
You never page out the kernel
This problem was solved, comprehensively, elegantly, and long ago. In fact it was solved from the very beginning. Margaret Hamilton’s team at MIT, writing the software for the Apollo guidance computer in the 1960s, built priority management into what was arguably the first serious software engineering project ever undertaken. When the Apollo 11 computer threw its famous 1202 alarm during the lunar descent, overloaded with data, it survived because the software shed lower-priority tasks and kept the critical ones running. The landing continued. That was 1969. Priority management was not a refinement that came later. It was there on day one.
When computers first had more programs to run than memory to hold them, the same question arose: what do you do when the memory fills up? The answer, developed through the 1960s and 70s, was virtual memory. The core insight was simple and profound: present the program with the illusion of more memory than physically exists. When something is needed that isn’t currently in physical memory, fetch it transparently from storage. When physical memory is full, choose something to evict, but choose wisely.
The key word is choose. Not “compress everything uniformly.” Choose.
Some pages of memory are wired. They never get evicted, no matter how much pressure there is on physical memory. The kernel is wired. The interrupt handlers are wired. The page tables themselves, the indexes that tell you where everything else is, are wired. You do not page out the thing that manages paging. This is obvious in retrospect, but it took real engineering to get right.
For everything else, there are page replacement policies. The classic is LRU, least recently used. The thing you haven’t looked at in the longest time is the first candidate for eviction. More sophisticated schemes track the working set, the collection of pages the program is actively using right now, and try to keep the working set resident. Peter Denning described this in 1968. It remains one of the most important ideas in computer engineering.
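Wiring and least-recently-used replacement fit together in very little code. What follows is a minimal sketch of the two ideas, not a model of any real kernel: the class PagedMemory and its methods are hypothetical names, pages are plain Python values, and working set tracking is left out.

```python
from collections import OrderedDict


class PagedMemory:
    """Toy physical memory: a fixed number of frames, wired pages, LRU eviction."""

    def __init__(self, frames: int):
        if frames < 1:
            raise ValueError("need at least one frame")
        self.frames = frames
        self.wired = {}                # page id -> data; never evicted
        self.resident = OrderedDict()  # pageable pages, least recently used first
        self.evicted = []              # ids of evicted pages, in eviction order

    def _evict_until_fits(self):
        while len(self.wired) + len(self.resident) > self.frames:
            victim, _ = self.resident.popitem(last=False)  # least recently used
            self.evicted.append(victim)

    def wire(self, page_id, data):
        """Pin a page in memory. One frame is always left for pageable pages."""
        if page_id not in self.wired and len(self.wired) + 1 >= self.frames:
            raise MemoryError("wiring this page would leave no pageable frames")
        self.resident.pop(page_id, None)
        self.wired[page_id] = data
        self._evict_until_fits()

    def touch(self, page_id, data=None):
        """Access a page, loading it if it is absent, and return its data."""
        if page_id in self.wired:
            return self.wired[page_id]
        if page_id in self.resident:
            self.resident.move_to_end(page_id)  # now the most recently used
        else:
            self.resident[page_id] = data
            self._evict_until_fits()
        return self.resident[page_id]


mem = PagedMemory(frames=3)
mem.wire("kernel", "interrupt handlers and page tables")
mem.touch("a", "A")
mem.touch("b", "B")
mem.touch("a")       # "a" is now more recently used than "b"
mem.touch("c", "C")  # memory is full, so the least recently used page goes
print(mem.evicted)   # ['b']
```

The wired page is never a candidate, however much pressure there is; the choice among the rest is a policy that can be swapped out.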
And when something is evicted from memory, it does not vanish. It goes to storage. But storage without an index is just a heap. You need a page table, a data structure that records what you have, where it is, and how to get it back. The page table is itself wired, of course. You do not misplace the map.
The missing operating system
Now map this onto the AI systems of 2026.
The system prompt, the instructions that define the agent’s identity, its constraints, its purpose, should be wired. It should never be evicted, never be summarised, never be compressed. It is the kernel. Some systems have recently begun to treat it as such. But this is the easy part, and even here the practice is not universal.
The choice of what to evict should be a policy, not a uniform operation. Some things should be kept verbatim: source documents the agent is actively working with, critical facts it has been told. Some things can be summarised with acceptable loss: the general shape of a long conversation. Some things can be discarded entirely: pleasantries, false starts, superseded drafts. This is page replacement. LRU is a reasonable starting point. Working set analysis would be better.
And the disk layer, the tool-accessible storage we already have, needs to be treated as what it is: the backing store for a virtual memory system. With good indexes. With proper metadata. So that when something is evicted from context, it can be found again and brought back, rather than reconstructed from a degraded summary. You do not misplace the map.
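To make the mapping concrete, here is a minimal sketch of a context manager built on those three ideas: a wired system prompt, an eviction policy, and an indexed backing store. Everything in it is an assumption made for illustration. The names Policy, Entry and ContextManager are hypothetical, a word count stands in for a token budget, a dictionary stands in for tool-accessible storage, and eviction is oldest-first rather than true LRU. The "summarise with acceptable loss" category is left out to keep the sketch short.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Optional, Set


class Policy(Enum):
    WIRED = "wired"        # never evicted: the system prompt
    PAGEABLE = "pageable"  # evicted verbatim to the backing store
    DISCARD = "discard"    # dropped under pressure: pleasantries, false starts


@dataclass
class Entry:
    key: str
    text: str
    policy: Policy = Policy.PAGEABLE


class ContextManager:
    def __init__(self, budget_words: int):
        self.budget_words = budget_words
        self.context: List[Entry] = []             # oldest first
        self.backing_store: Dict[str, Entry] = {}  # evicted entries, verbatim
        self.index: Dict[str, Set[str]] = {}       # word -> keys of evicted entries

    def _used(self) -> int:
        return sum(len(entry.text.split()) for entry in self.context)

    def _pick_victim(self) -> Optional[Entry]:
        # Discardable entries go first, then the oldest pageable one. Never wired.
        for policy in (Policy.DISCARD, Policy.PAGEABLE):
            for entry in self.context:
                if entry.policy is policy:
                    return entry
        return None

    def add(self, entry: Entry) -> None:
        self.context.append(entry)
        while self._used() > self.budget_words:
            victim = self._pick_victim()
            if victim is None:
                raise MemoryError("wired entries alone exceed the budget")
            self.context.remove(victim)
            if victim.policy is Policy.PAGEABLE:
                self.backing_store[victim.key] = victim
                for word in set(victim.text.lower().split()):
                    self.index.setdefault(word, set()).add(victim.key)

    def recall(self, word: str) -> List[str]:
        """Page evicted entries that mention `word` back into the context."""
        keys = sorted(self.index.get(word.lower(), set()))
        for key in keys:
            entry = self.backing_store.pop(key)
            for keys_for_word in self.index.values():
                keys_for_word.discard(key)
            self.add(entry)
        return keys


ctx = ContextManager(budget_words=12)
ctx.add(Entry("system", "Be careful and never give dangerous advice", Policy.WIRED))
ctx.add(Entry("greeting", "hello there", Policy.DISCARD))
ctx.add(Entry("fact", "the contract deadline is Friday"))  # drops the greeting
ctx.add(Entry("draft", "second draft of the clause"))      # pages out the fact
ctx.recall("deadline")                                      # pages it back in
print([entry.key for entry in ctx.context])  # ['system', 'fact']
```

The system prompt stays put throughout, the fact comes back word for word rather than as a summary, and the draft that made room for it is in the backing store, indexed, waiting to be found. The hard part noted in the next paragraph, that chunks of conversation depend on one another, is exactly what this sketch ignores.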
The analogy is not perfect, of course. Virtual memory works because pages are uniform, fixed-size, semantically inert blocks of bits. You can swap four kilobytes to disk and swap them back unchanged. Context tokens are semantically interdependent: you cannot mechanically extract a chunk from the middle of a conversation and expect it to make sense when you page it back in. The engineering is harder. But the principles are the same, and the principles are what the field is missing.
The advancement
There is a pattern in computing where hard-won lessons are forgotten and then rediscovered, usually at great expense. The AI industry is in the middle of one of these episodes. The problems it faces with context management (drift, identity loss, the slow degradation of coherence over long sessions) are memory management problems. They were identified, analysed, and solved by people working with machines that had a fraction of the resources we have today. The solutions are documented. The papers are there to read. Denning’s working set model. Belady’s optimal page replacement. The entire virtual memory literature.
The irony is gentle but real. We have built machines of extraordinary capability and housed them in an infrastructure that would have been considered primitive in 1978. The context window is a few megabytes, and we manage it the way you’d manage memory if you had no operating system at all.
The technological advancement the field needs most urgently is not a new architecture, a new training method, or a larger model. It is a proper operating system for these machines. Memory management. Resource accounting. Indexes and metadata for the slow store. The advancement has already happened. We just need to remember it.
Which is, when you think about it, exactly the problem.
Yours,
William Waites
William Waites applies formal and computational methods to complex systems, from molecular biology and epidemic modelling to multi-agent AI, drawing on three decades of work across cognitive science, internet engineering, biology, and theoretical computer science. He founded the Leith Document Company, whose research on artificial organisations looks at AI systems through a population biologist’s lens, and is currently applying these methods at the Quantum Software Lab at the University of Edinburgh School of Informatics.
Write to the Society for Technological Advancement at letters@ilikethefuture.com.


