The Therac-25: When Control Flow Kills

A machine that healed thousands and irradiated a handful — a COMP 1150 case study

Author

Brendan Shea, PhD

Published

May 26, 2026

Who: Atomic Energy of Canada Limited (AECL), the engineers who built the Therac-25 radiation-therapy machine, the radiotherapy technicians who ran it, and at least six cancer patients in the United States and Canada
What: A medical linear accelerator delivered radiation overdoses roughly a hundred times the intended dose; at least three patients died and others were maimed, because of how the machine’s software handled the order and timing of operator input
Where / When: Marietta GA, Hamilton ON, Yakima WA, and Tyler TX, between June 1985 and January 1987
Why it matters: Therac-25 is the canonical example of software killing people. Not through one dramatic “bug” — through a chain of ordinary control-flow decisions, reused code, and organizational confidence. Two hard questions collide here. Who is responsible when correctness fails under delivery pressure? And is writing software that can kill an act of engineering at all?
Concepts at play: conditionals, loops, shared mutable state, race conditions, integer overflow, and the order of operations in a program

The Case

In the spring of 1986, the East Texas Cancer Center in Tyler ran one of the most advanced radiation-therapy machines in the world. The Therac-25, built by the Canadian Crown corporation AECL, could treat tumors two ways from one device:

Electron mode fired a low-current beam straight at shallow tumors.
X-ray mode generated a beam roughly a hundred times more powerful. A thick metal target slid into the beam’s path. The target absorbed the raw beam and emitted the gentler X-rays that reached the patient.

The whole therapeutic idea depended on one physical fact: the high-energy beam must never reach a patient with the target out of the way.

On March 21, 1986, a technician we’ll call the operator set up a routine electron treatment for a man named Ray Cox. She was fast — she ran many patients a day and typed the prescription from memory. She entered “x” for X-ray by habit, noticed the mistake, moved the cursor up, changed it to “e” for electron, and pressed return to proceed. The screen said the prescription was set. She pressed the beam-on key.

Cox, alone in the shielded treatment room, felt what he later described as an electric burning in his back. It was far more pain than any treatment he had had. He tried to get up. In the control room the console displayed MALFUNCTION 54 and a “treatment pause.” The operator had seen this many times. It usually meant the machine had delivered too little dose. The Therac-25 produced dozens of numbered malfunction codes, and the manual explained almost none of them. A “pause,” unlike a “halt,” let the operator continue with one keystroke. The machine paused so often that proceeding through one had become muscle memory. The intercom and video monitor to the room were not working that day. She did what the workflow trained her to do with a low-dose pause. She pressed “P” to proceed. The machine fired again. Cox was struck a second time before he could reach the door and pound on it.

Ray Cox received something on the order of 100 times his prescribed dose to a small area of his body. He lost the use of an arm, then the function of organs in the path of the beam, and died about five months later. He was not the first, and he would not be the last. The pattern had already appeared at other clinics and would appear again:

A patient in Marietta, Georgia, had been burned in June 1985.
A woman in Hamilton, Ontario, was overdosed in July 1985.
Yakima, Washington, would see two accidents (1985 and 1987).

Three weeks after Cox, at the same Tyler clinic, a man named Verdon Kidd was overdosed in nearly identical circumstances. He died within weeks. He was the first death attributed to a software-driven radiation accident.

Why typing speed could kill. Several pieces of the Therac-25’s software talked to each other through shared variables. One was a flag variable: a single value that one task sets to mean “data entry is complete.” Another task reads it before moving the beam hardware. Setting up X-ray mode meant rotating the heavy target into place and adjusting magnets. That took about eight seconds.

The bug was a race condition — a defect where the outcome depends on the timing of operations the programmer assumed would happen in a fixed order. Here is how it played out:

The operator selected X-ray. The target started moving into place.
Within eight seconds, the operator edited the prescription to electron.
The data-entry task set the “complete” flag.
The treatment proceeded — while the target was still moving.

The console showed electron mode. The machine was still set to fire the X-ray-strength beam with the target retracted. A slow typist never hit it. An expert typist hit it often. In software that shares state, the order and timing of operations is itself a safety property.
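
The timeline above can be replayed as a small simulation. This is a deliberately simplified sketch, not the machine's actual logic: the `Machine` class, the function names, and the rule that an edit during setup updates only the display are all assumptions made for illustration. Time is a plain number passed in by the caller, so the result is deterministic.

```python
from dataclasses import dataclass

SETUP_SECONDS = 8.0  # approximate target/magnet setup time described above


@dataclass
class Machine:
    hardware_mode: str = "none"    # what the beam hardware was last told to do
    displayed_mode: str = "none"   # what the console shows the operator
    setup_done_at: float = 0.0     # simulated time at which setup finishes


def select_mode(machine: Machine, mode: str, now: float) -> None:
    """A mode entry the hardware acts on: it starts a fresh setup."""
    machine.hardware_mode = mode
    machine.displayed_mode = mode
    machine.setup_done_at = now + SETUP_SECONDS


def edit_mode(machine: Machine, mode: str, now: float) -> None:
    """A later edit. The console always updates, but an edit that lands
    while setup is still running never reaches the hardware (the defect)."""
    machine.displayed_mode = mode
    if now >= machine.setup_done_at:
        select_mode(machine, mode, now)


def run(edit_at: float) -> tuple[str, str]:
    """Select X-ray at t=0, edit to electron at t=edit_at, report both views."""
    machine = Machine()
    select_mode(machine, "xray", now=0.0)
    edit_mode(machine, "electron", now=edit_at)
    return machine.displayed_mode, machine.hardware_mode


for edit_at in (12.0, 3.0):   # a slow typist, then a fast one
    shown, actual = run(edit_at)
    verdict = "consistent" if shown == actual else "MISMATCH"
    print(f"edit at {edit_at:>4}s: console={shown}, hardware={actual} -> {verdict}")
```

In this model the slow edit leaves the console and the hardware in agreement, while the fast edit leaves the console reading electron and the hardware still set for X-ray. No line computes a wrong value; the only difference between the two runs is when the edit arrives.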

Figure 1: The Therac-25 edit race. Read it as one operator action that forks on a timing question (gold diamond): did the prescription edit finish before the eight-second magnet/target setup? On the slow path the steps stay in their assumed order and the patient is treated safely (green). On the fast path the ‘data entry complete’ flag is set while the target is still retracted, so the console reports electron mode while the X-ray-strength beam fires unattenuated — a ~100× overdose (red). The defect is not a wrong calculation; it is an unenforced ordering. Adapted from Leveson & Turner (1993).

What makes Therac-25 more than a horror story is what AECL did with these reports. For more than a year the company held that an overdose was effectively impossible. Its hazard analysis had assigned the probability of “computer selects wrong energy” a value like 10⁻¹¹ — a number with no basis. The analysis treated software as something that does not fail.

Tyler’s own physicist, Fritz Hager, reproduced the fault on the clinic’s machine. He called AECL. He was told that no other accidents had occurred, and that the machine could not overdose.

Then a second, unrelated defect surfaced. A one-byte counter bypassed a position check every 256th time an operator pressed a key. It caused another accident in Yakima in 1987. The Therac-25 was finally declared defective by the U.S. Food and Drug Administration — only after the body count and an independent investigation forced the issue.

The machine had treated thousands of patients successfully. The earlier Therac-20 had run the same core code for years with no such deaths. Both facts were true, and both, it turned out, were part of the trap. The story everyone could agree on ended here. The argument about what it meant did not.

How It Worked: A Race and an Overflow

Therac-25’s failures came from two technically ordinary defects that almost any introductory programmer can be made to see. Both turn on small choices about when and how often code runs — exactly the kind of choice that looks innocent on the page and is not.

The edit race is the one in the Cox case. Two pieces of code shared a single boolean flag:

One task set it to True when the operator finished entering the prescription.
Another task read it to decide whether the beam-shaping hardware was safely in position.

The flag was never locked between writers and readers. If a fast operator finished editing before the magnet/target task finished moving, the “ready” flag was set while the machine was not yet ready. The bug is not in any one line. It is in the missing rule that the two lines have to happen in order.
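
One way to supply the missing rule is a "wait until ready" signal that only the hardware task is allowed to set. The sketch below uses Python's `threading.Event` for that signal. The `Hardware` class, both `beam_on_*` functions, and the shortened setup time are hypothetical, chosen only to show the pattern, and a software signal like this is not a substitute for a physical interlock.

```python
import threading
import time

SETUP_SECONDS = 0.05  # stand-in for the ~8 s setup, shortened so the demo runs fast


class Hardware:
    """Hypothetical model of the target/magnet assembly."""

    def __init__(self) -> None:
        # Set only by the hardware task, never by the data-entry task.
        self.in_position = threading.Event()

    def move_target(self) -> None:
        time.sleep(SETUP_SECONDS)   # the slow physical move
        self.in_position.set()      # announce "ready" only once it is true


def beam_on_unguarded(entry_complete: bool) -> str:
    # The pattern described above: "entry complete" is read as "hardware ready".
    return "FIRE" if entry_complete else "HOLD"


def beam_on_guarded(entry_complete: bool, hw: Hardware, timeout: float = 1.0) -> str:
    # One possible repair: also block until the hardware reports it is in position.
    if entry_complete and hw.in_position.wait(timeout):
        return "FIRE"
    return "HOLD"  # a timeout never fires the beam


def demo() -> tuple:
    hw = Hardware()
    unguarded = beam_on_unguarded(entry_complete=True)
    ready_then = hw.in_position.is_set()   # hardware has not moved yet
    mover = threading.Thread(target=hw.move_target)
    mover.start()
    guarded = beam_on_guarded(entry_complete=True, hw=hw)  # waits for the move
    ready_now = hw.in_position.is_set()
    mover.join()
    return unguarded, ready_then, guarded, ready_now


print(demo())
```

The unguarded version fires while `in_position` is still unset. The guarded version returns "FIRE" only after the hardware task has set the event, and returns "HOLD" if the event never arrives within the timeout. The design choice that matters is ownership: the task that knows the physical state is the only one that can declare it ready.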

The counter overflow is the second defect. It caused the 1987 Yakima accident. A subroutine ran a setup test once per operator keypress. It used a one-byte variable — call it safety_counter — to track how many passes it had made. A position check was skipped whenever the counter equaled zero. The assumption was that zero meant “first time through, nothing to check yet.”

But a one-byte counter wraps. 255 + 1 rolls back to 0. Every 256th pass, the counter was zero mid-treatment. The safety check was skipped exactly when it was most needed. We can watch the failure happen:

# safety_counter models a one-byte (8-bit) variable, so it wraps at 256.
# The position check is skipped whenever the counter reads 0. That was
# meant to mark a fresh start, but the wrap recreates it every 256 passes.
safety_counter = 0

def setup_pass():
    global safety_counter
    safety_counter = (safety_counter + 1) % 256        # 8-bit rollover
    if safety_counter == 0:
        return "SKIP position check"                   # the bug
    return "verify collimator position"

for press in range(260):
    result = setup_pass()
    if result.startswith("SKIP"):
        print(f"keypress {press:>3}: {result}")

Run this and the loop prints exactly once, at keypress 255 — the 256th increment, when the counter wraps to zero. In a hospital control room, that is one keypress in 256 where the machine quietly stops checking whether the radiation-shaping hardware is in position. The bug is the conditional if safety_counter == 0: combined with a counter whose integer overflow behavior the conditional silently relies on. Each piece — the loop, the condition, the wrap-around — is something a first-week programmer can write. The defect is in how the three meet.
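
One way to close that gap is to stop counting. The question being asked is yes-or-no ("does the position still need checking?"), so it can be stored as a yes-or-no value that no arithmetic can wrap. The sketch below is illustrative only: `check_needed`, `begin_setup`, and `setup_pass_fixed` are hypothetical names, and this is not a description of the patch the manufacturer actually shipped.

```python
# A boolean has no 256th value to roll over to.
check_needed = False


def begin_setup() -> None:
    """Called once when setup starts: from here on, the check is required."""
    global check_needed
    check_needed = True


def setup_pass_fixed() -> str:
    # The flag is assigned, never incremented, so repeated passes cannot
    # recreate the "skip" state by accident.
    if not check_needed:
        return "SKIP position check"
    return "verify collimator position"


begin_setup()
skipped = [n for n in range(1000) if setup_pass_fixed().startswith("SKIP")]
print(f"checks skipped in 1000 passes: {len(skipped)}")   # prints 0
```

The conditional is almost the same as before. What changed is that its meaning no longer depends on how many times the code has run.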

Both bugs share a deeper feature worth naming. They are not arithmetic mistakes. Every individual operation produces the value its author intended. They are mistakes about control flow — which code runs, in what order, how often, and on what assumption about state it has not itself established.

The same shape recurs across most software disasters: not a wrong answer, but a right answer computed at the wrong moment — or skipped at the wrong moment — by code whose author could not see the gap their own assumptions had left.

The Argument Therac-25 Started

Therac-25 did not just enter the engineering curriculum; it created a fault line in how the field talks about failure. The positions below were built, attacked, and repaired over thirty years, mostly through the work of Nancy Leveson, whose 1993 investigation with Clark Turner remains the primary source (Leveson and Turner 1993). Read them as a single argument, each move answering the one before.

The programmer-blame argument

The first instinct — the newspapers’, the lawyers’, and many engineers’ — was to find the line of code that did it and, with it, the person at fault. There was a racy flag and an overflowing counter; one programmer, working largely alone in PDP-11 assembly, had written them.

The Programmer-Blame Argument

The overdoses were caused by specific defects in specific lines of code (the edit race and the counter overflow).
Those lines were written by an identifiable programmer who failed to handle ordering and overflow correctly.
Whoever authored the defective code that caused the harm is the party responsible for it.
Therefore, responsibility for the Therac-25 deaths rests with the programmer who wrote the faulty code.

Premise 1 is essentially true and premise 2 is largely true. The whole weight falls on premise 3 — the assumption that locating the faulty line locates the responsibility. That is exactly where Leveson attacked.

The systems reply: there was no single cause

Leveson’s central finding was not “a better programmer would have prevented this.” It was that accidents in complex systems are not caused by single components — they emerge from the interaction of many decisions, most of them not coding decisions at all.

The Systems Reply

The same core software ran for years on the Therac-20 with no deaths — because the Therac-20 kept independent hardware interlocks that physically stopped a mispositioned beam.
The Therac-25 removed those hardware interlocks and asked software alone to enforce safety, a management/engineering decision, not a coding one.
Further independent decisions — no independent safety review, a hazard analysis that excluded software, cryptic error messages, a workflow that trained operators to “proceed” through pauses, and a year of denied incident reports — were each necessary links in every accident.
When harm requires the conjunction of many independent failures, no single component (or person) is “the cause.”
Therefore, blaming the programmer is a category error: Therac-25 was a system failure, not a software bug.

This is now close to orthodoxy in safety engineering, and Leveson spent the next two decades formalizing it (Leveson 2011, 2017). The reused code is the sharpest illustration. The code was trusted because it had a track record — but its track record had been silently underwritten by hardware that the Therac-25 threw away. The history that made the software look safe was the very thing that made reusing it dangerous.
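
The point about the track record can be made concrete with a toy model. In the sketch below the same flawed software decision runs in two configurations, one with an independent interlock and one without. Every name is hypothetical and the logic is reduced to a single rule, so read it as an illustration of the argument rather than a model of either machine.

```python
def software_says_fire(target_in_place: bool) -> bool:
    """Stand-in for the shared control code. The flaw: it approves firing
    without ever consulting the target position it is given."""
    return True


def interlock_allows(beam: str, target_in_place: bool) -> bool:
    """Independent check: a high-power beam requires the target in the path."""
    return beam != "high" or target_in_place


def fires(beam: str, target_in_place: bool, has_interlock: bool) -> bool:
    if not software_says_fire(target_in_place):
        return False
    if has_interlock and not interlock_allows(beam, target_in_place):
        return False
    return True


configs = (("with interlock", True), ("software only", False))
for name, has_interlock in configs:
    outcome = fires("high", target_in_place=False, has_interlock=has_interlock)
    print(f"{name}: high-power beam, target out -> {'FIRES' if outcome else 'blocked'}")
```

The software function is identical in both runs. With the interlock, its flaw is masked and years of use would record no harm; without it, the same flaw reaches the patient. A clean history therefore says something about the whole configuration, not about the code on its own.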

What “software engineering” was in 1985. The phrase software engineering gained wide currency at a 1968 NATO conference. The conference had been called because large software projects routinely failed, the so-called “software crisis” (Naur and Randell 1969). By the mid-1980s, several things still did not exist:

licensure for software engineers
a requirement for independent safety review of medical-device code
meaningful FDA regulation of software in particular

Therac-25 is one of the cases that began to change that. The reuse fallacy, “this code is proven because it ran before,” had no professional check against it. An engineer signing off on a bridge worked inside a regime of liability and review that the Therac-25 software did not.

But the systems reply has a soft spot, and Leveson herself flagged it. If “the system did it,” it becomes dangerously easy to slide to “no one did it.” Diffuse causation can become diffused accountability. That is a comfort to every organization that would prefer no name attached to a death.

The accountability objection: diffuse cause is not no cause

The Accountability Objection

Granting the Systems Reply, the accidents still required specific people to make specific decisions: to delete the hardware interlocks, to assign 10⁻¹¹ to a failure mode no one had analyzed, and — after the first injuries — to tell clinics that overdose was impossible.
A decision can be blameworthy even when it is only one link in a causal chain.
“It was the system” describes how the harm happened; it does not discharge the duty of those who made the system what it was.
Therefore, rejecting programmer-blame does not dissolve responsibility — it relocates it to the engineering and management decisions that built and defended the system.

This is the move that keeps the Systems Reply honest. It accepts premise-for-premise that no single line of code “did it,” and still refuses the conclusion that therefore no one is answerable. The most damning material in Leveson’s account is not the race condition; it is the year in which a manufacturer, holding injury reports, continued to assure hospitals the machine could not do what it had already done. No counter overflow explains that. The objection reframes the whole debate: the question is not coder vs. system but which decisions, by whom, under what duty.

The professional-engineering reframe. If the failures were decisions, the next question is what standard those decisions should have been held to. We license and legally bind the engineers who design bridges, elevators, and aircraft. Their errors kill people who never consented to the risk. Therac-25 killed people in exactly that way. The reframe says life-critical software should sit inside the same regulated profession — with enforceable standards, independent review, and personal accountability — not be treated as an unregulated craft.

The counterargument is the “software is different” claim, and it is not frivolous:

Software has no continuous physics to reason over.
Its state space is astronomically larger than a bridge’s.
You cannot test every path.

Defenders of the reframe answer that this is an argument for more rigor and humility, not less. It is precisely the inference AECL declined to make when it modeled software as incapable of failure. Most jurisdictions still do not require licensure to write medical-device or avionics code. “Software is different” is still doing load-bearing work on both sides.
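
A rough calculation shows why "you cannot test every path" is more than a figure of speech. The sketch assumes a program whose paths are determined by a number of independent two-way conditionals, and a test rig that runs one million paths per second. Both figures are assumptions picked for illustration, and real programs add loops and timing on top.

```python
def paths(branches: int) -> int:
    """Distinct paths through `branches` independent two-way conditionals."""
    return 2 ** branches


TESTS_PER_SECOND = 1_000_000        # assumed, generous test throughput
SECONDS_PER_YEAR = 365 * 24 * 3600

for n in (10, 40, 80):
    years = paths(n) / TESTS_PER_SECOND / SECONDS_PER_YEAR
    print(f"{n:>2} branches: {paths(n):.3e} paths, about {years:.3g} years to run each once")
```

Ten branches give 1,024 paths, which is trivial to cover. Eighty give about 1.2 × 10²⁴, which at this rate would take tens of billions of years. Exhaustive testing stops being an option long before a program reaches the size of a medical-device controller, which is why both sides of the argument end up talking about design, review, and independent checks rather than testing alone.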

The argument that hasn’t ended: did we actually learn?

The tidy version of Therac-25 — don’t trust reused code, keep independent safety checks, don’t blame the operator — is taught everywhere. The uncomfortable question is whether the field absorbed the lesson or just the slogan.

The pattern recurs with unsettling fidelity.

Boeing 737 MAX. The MCAS system trusted a single sensor. It was added to avoid a costly redesign. It produced two crashes. The Congressional investigation describes removed redundancy, optimistic hazard assessment, and pilots blamed before the design (U.S. House Committee on Transportation and Infrastructure 2020).
Uber, Tempe. An automated test vehicle killed a pedestrian. The system had repeatedly reclassified her. A safety driver was treated as the backstop for a known gap (National Transportation Safety Board 2019).

Each is Therac-25 with new nouns: reused or under-reviewed logic, optimistic risk numbers, redundancy removed under cost pressure, the human operator positioned to absorb the blame.

Now the argument turns one degree further. Imagine the racy flag or the overflowing counter is not written by a programmer at all. It is suggested by a code assistant, trained on millions of repositories. A developer merges it under deadline because it looked plausible and the tests passed — the same way the Therac-20 history made the original code look proven.

Who authored it? The model has no professional duty and no license to revoke. The litigation over what such models may even learn from is unresolved (Butterick and Joseph Saveri Law Firm 2022). The Programmer-Blame Argument assumed you could at least find the author. Forty years after Tyler, the argument Therac-25 started is not closed. It has lost even that starting assumption. The duty the Professional-Engineering Reframe wants to assign now has no obvious place to land.

Discussion Questions

Explain to a friend with no programming background what a “race condition” is, using an analogy from outside computing — kitchens, traffic, sports, anything you know well. What does your analogy get right about what happened with the Therac-25? Where does the analogy break down?
The edit race happened because two pieces of code ran in the wrong order. Name one way to stop the bug — for example, a lock, a hardware interlock, or a “wait until ready” signal. Which would you choose? Why?
Write The Programmer-Blame Argument and The Systems Reply in your own words. What is the one thing they really disagree about? What evidence could settle it?
You are the operator in Tyler. The console says MALFUNCTION 54. Your training says this means underdose. The intercom to the patient is broken. Do you press “P” to continue? Why? What does your answer say about a system that puts this choice on you?
Pick one modern case: the 737 MAX, the Uber self-driving crash in Tempe, or AI-suggested code merged under deadline. Who is responsible? Is your answer easier or harder than for Therac-25?

References

Butterick, Matthew, and Joseph Saveri Law Firm. 2022. GitHub Copilot Litigation: Class-Action Complaint (Doe v. GitHub). Githubcopilotlitigation.com. https://githubcopilotlitigation.com/.

Jacky, Jonathan. 1989. “Programmed for Disaster: Software Errors That Imperil Lives.” The Sciences 29 (5): 22–27.

Leveson, Nancy G. 2011. Engineering a Safer World: Systems Thinking Applied to Safety. MIT Press.

Leveson, Nancy G. 2017. “The Therac-25: 30 Years Later.” IEEE Computer 50 (11): 8–11. https://doi.org/10.1109/MC.2017.4041349.

Leveson, Nancy G., and Clark S. Turner. 1993. “An Investigation of the Therac-25 Accidents.” IEEE Computer 26 (7): 18–41. https://doi.org/10.1109/MC.1993.274940.

National Transportation Safety Board. 2019. Collision Between Vehicle Controlled by Developmental Automated Driving System and Pedestrian, Tempe, Arizona, March 18, 2018. HAR-19/03. NTSB. https://www.ntsb.gov/investigations/AccidentReports/Reports/HAR1903.pdf.

Naur, Peter, and Brian Randell, eds. 1969. Software Engineering: Report of a Conference Sponsored by the NATO Science Committee, Garmisch, 1968. NATO Scientific Affairs Division.

U.S. House Committee on Transportation and Infrastructure. 2020. The Design, Development, and Certification of the Boeing 737 MAX. U.S. House of Representatives. https://transportation.house.gov/committee-activity/boeing-737-max-investigation.