> software quality doesn't appear because you have good developers. It's the end result of a process, and that process informs both your software development practices, but also your testing. Your management. Even your sales and servicing.
If you only take one thing away from this article, it should be this one! The Therac-25 incident is a horrifying and important part of software history. It's really easy to think type systems, unit testing, and defensive coding can solve all software problems. They can definitely help a lot, but the real failure in the story of the Therac-25, as I understand it, is that it took far too long for incidents to be reported, investigated, and fixed.
There was a great Cautionary Tales podcast about the device recently [0]. One thing mentioned was that, even aside from the catastrophic accidents, Therac-25 machines routinely showed unexplained errors to their operators, but these issues never made it to the desk of someone who might fix them.
[0] https://timharford.com/2025/07/cautionary-tales-captain-kirk...
This is true, but there also need to be good developers. It can't just be great process and low-quality developer practices. There needs to be: 1/ high-quality individual processes (development being one of them), 2/ high-quality delivery mechanisms, 3/ feedback loops to improve that quality, 4/ out-of-band mechanisms to inspect and improve the quality.
I would argue that a good process always has a good self-correction mechanism built in. This way, the work done by a "low quality" software developer (which includes almost all of us at some point in time) is always taken into account by the process.
I strongly believe that we will see an incident akin to Therac-25 in the near future. With as many people running YOLO mode on their agents as there are, Claude or Gemini is going to be hooked up to some real hardware that will end up killing someone.
Personally, I've found even the latest batch of agents fairly poor at embedded systems, and I shudder at the thought of giving them the keys to the kingdom to say... a radiation machine.
The 737 MAX MCAS debacle was one such failure, albeit involving a wider system failure and not purely software.
Agreed on the future but I think we were headed there regardless.
> Personally, I've found even the latest batch of agents fairly poor at embedded systems
I mean, even in simple CRUD web apps where the data models are more complex and the same data has multiple structures, the LLMs get confused after the second data transformation (at most).
E.g. You take in data with field created_at, store it as created_on, and send it out to another system as last_modified.
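A toy sketch of the renaming chain described above (the function names are made up for illustration); the timestamp is one and the same value throughout, it just travels under three different names, which is exactly the kind of mapping LLMs tend to lose track of:

```python
# Hypothetical pipeline: the same timestamp is called created_at at
# ingestion, created_on in storage, and last_modified on the way out.

def ingest(payload: dict) -> dict:
    """API boundary: accepts 'created_at', stores it as 'created_on'."""
    return {"id": payload["id"], "created_on": payload["created_at"]}

def export(record: dict) -> dict:
    """Outbound boundary: the stored 'created_on' leaves as 'last_modified'."""
    return {"id": record["id"], "last_modified": record["created_on"]}

record = ingest({"id": 1, "created_at": "2024-01-01T00:00:00Z"})
outbound = export(record)
print(outbound["last_modified"])  # 2024-01-01T00:00:00Z
```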
The first commenter on this site introduces himself as "a physician who did a computer science degree before medical school." He is now president of the Ray Helfer Society [1], "an honorary society of physicians seeking to provide medical leadership regarding the prevention, diagnosis, treatment and research concerning child abuse and neglect."
While the cause is noble, the medical detection of child abuse faces serious issues with undetected and unacknowledged false positives [2], since ground truth is almost never knowable. The prevailing idea is that certain medical findings are considered proof beyond reasonable doubt of violent abuse, even without witnesses or confessions (denials are extremely common). These beliefs rest on decades of medical literature regarded by many as low quality because of methodological flaws, especially circular reasoning (patients are classified as abuse victims because they show certain medical findings, and then the same findings are found in nearly all those patients—which hardly proves anything [3]).
I raise this point because, while not exactly software bugs, we are now seeing black-box AIs claiming to detect child abuse with supposedly very high accuracy, trained on decades of this flawed data [4, 5]. Flawed data can only produce flawed predictions (garbage in, garbage out). I am deeply concerned that misplaced confidence in medical software will reinforce wrongful determinations of child abuse, including both false positives (unjust allegations potentially leading to termination of parental rights, foster care placements, imprisonment of parents and caretakers) and false negatives (children who remain unprotected from ongoing abuse).
[1] https://hs.memberclicks.net/executive-committee
[2] https://news.ycombinator.com/item?id=37650402
[3] https://pubmed.ncbi.nlm.nih.gov/30146789/
[4] https://rdcu.be/eCE3l
[5] https://www.sciencedirect.com/science/article/pii/S002234682...
I'd be interested in knowing how many of y'all are being taught about this sort of thing in college ethics/safety/reliability classes.
I was taught about this in engineering school, as part of a general engineering course also covering things like bathtub reliability curves and how to calculate the number of redundant cooling pumps a nuclear power plant needs. But it's a long time since I was in college.
Is this sort of thing still taught to engineers and developers in college these days?
It was taught in a first-year software ethics class on my Computer Science programme, back in 2010. I wonder if they still do it.
This was part of our Systems Engineering class, something like this: https://web.mit.edu/6.033/2014/wwwdocs/assignments/therac25....
I was curious too, so I made a poll. I for sure wasn't taught it in computer science at uni; I only heard about it vaguely online.
https://strawpoll.com/NMnQNX9aAg6
I was taught about it in university as a computer science undergrad, and I've thought about it often since, as I ended up working in medtech.
My (tragically) favorite part is, from wikipedia:
> A commission attributed the primary cause to generally poor software design and development practices, rather than singling out specific coding errors.
Which to me reads as "this entire codebase was so awful that it was bound to fail in some or other way".
This reminds me of the Belgium 2003 election that was impossibly skewed by a supernova light years away sending charged particles which managed to get through our atmosphere (allegedly) and flip a bit. Not the only case where it's happened.
On the bright side, wow, those computers are really sturdy: takes a whole supernova to just flip a bit :)
Well the thing is, millions of stars go supernova in the observable universe every single day. Throw in the daily gamma ray burst as well, and you've got bit flips all over the place.
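For a sense of scale on the Belgian incident mentioned upthread: the suspect candidate's total was off by exactly 4096 votes, which is 2^12, i.e. consistent with a single flipped bit in the stored count. A quick illustration (the starting count here is invented):

```python
# One candidate's total jumped by exactly 4096 = 2**12 votes,
# consistent with a single bit flip at bit position 12.
true_count = 2563                    # illustrative value, not the real tally
corrupted = true_count ^ (1 << 12)   # flip bit 12
print(corrupted - true_count)        # 4096
```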
TIL TheDailyWTF is still active. I'd thought it had settled to greatest hits only some years ago.
I was taught this incident in university many years ago. It's undeniably an important lesson that shouldn't be forgotten
Well There's Your Problem podcast, Episode 121: Therac-25
https://www.youtube.com/watch?v=7EQT1gVsE6I
The most deadly bug in history. If you know any other deadly bug, please share! I love these stories!
Several people killed themselves over this: https://www.wikipedia.org/wiki/British_Post_Office_scandal
More heads should have rolled over this, in my opinion. It's absolutely despicable that they cheerfully threw innocent people in prison rather than admit their software was a heap of crap. It makes me so angry that this injustice was allowed to prevail for so long: nobody cared about the people being mistreated and tarred as thieves as long as they were 'little people' of no consequence, while senior management gleefully covered themselves in criminality to cover for their own uselessness.
It's an archetypal example of 'one law for the connected, another law for the proles'. The CEO had her CBE stripped for 'bringing the honours system into disrepute' but she brought the entire justice system and the country itself into disrepute. At the bare minimum the Post Office should have been stripped of its right to bring its own prosecutions given it's clearly breathtakingly unfit as an organisation to wield such power. In my opinion they belong in prison themselves.
Probably many bugs rather than a single one, but the botched London Ambulance dispatch software from the 90s is probably one of the most deadly software issues of all time, although I don't know of any estimates that try to quantify the number of lives lost as a result.
http://www0.cs.ucl.ac.uk/staff/a.finkelstein/papers/lascase....
The MCAS related bugs @ Boeing led to 300+ deaths, so it's probably a contender.
Was that a bug or a failure to inform pilots about a new system?
In the same vein, one could argue that Therac-25 was not actually a software bug but a hardware problem. Interlocks that could have prevented the accidents, and that were present in earlier Therac models, were missing. The software was written with those interlocks in mind. Greedy management/hardware engineers skipped them for the -25 version.
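A hedged sketch of what "written with those interlocks in mind" means in practice; the class and method names here are invented for illustration and are not from the actual Therac code. The point is that the hardware check is consulted independently of whatever the software believes:

```python
# Defense in depth: software decides the mode, but an independent
# hardware interlock (simulated here as a callable polling a microswitch)
# must also agree before a high-power shot. Earlier Therac models had
# such interlocks; the Therac-25 relied on software state alone.

class Beam:
    def __init__(self, target_in_place):
        # target_in_place: callable returning the (simulated) switch state
        self.target_in_place = target_in_place

    def fire_high_power(self, software_says_xray_mode):
        if not software_says_xray_mode:
            return "refused: software mode mismatch"
        # The hardware interlock is checked regardless of software state.
        if not self.target_in_place():
            return "refused: hardware interlock open"
        return "fired"

beam = Beam(target_in_place=lambda: False)  # target NOT in place
print(beam.fire_high_power(software_says_xray_mode=True))
# refused: hardware interlock open -- even though the software said go
```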
It's almost never just software. It's almost never just one cause.
Just to point it out even clearer - there's almost never a root cause.
Both - and really MCAS was fine, but the issue was the sensing (a single angle-of-attack sensor) and the handling of conflicting data. That part of the puzzle was definitely a bug in the logic/software.
Remember the Airbus that crashed in the middle of the Atlantic because one of the pilots kept pulling back on his sidestick, and the computer averaged his input with the normal input from the other pilot?
Conflict resolution in redundant systems seems to be one of the weakest spots in modern aircraft software.
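A minimal sketch of the voting point (the numbers are invented): with three redundant channels, taking the middle value tolerates one wildly failed channel, while a plain average gets dragged off by it:

```python
# Why mid-value selection degrades more gracefully than averaging
# when one redundant channel fails.

def average(readings):
    return sum(readings) / len(readings)

def mid_value(readings):
    # With 3 channels, the middle of the sorted values outvotes one outlier.
    return sorted(readings)[len(readings) // 2]

# True value ~5.0 degrees; the third channel has failed high.
readings = [5.1, 4.9, 74.0]

print(average(readings))    # 28.0 -- badly skewed by the failed channel
print(mid_value(readings))  # 5.1  -- the outlier is outvoted
```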
The 737 Max MCAS is arguably a bug. That killed 346 people.
Not a "bug" per se, but texting while driving kills ~400 people per year in the US. It's a bug at some level of granularity.
To be tongue in cheek a bit, buggy JIRA latency has probably wasted 10,000 human years. Those are many whole human lives if you count them up.
We're more likely to get a similar incident like this very quickly if we continue with the cult of 'vibe-coding' and throw basic software engineering principles out of the window, as I said before. [0]
Take this post-mortem [1] as a great warning; it also highlights exactly what can go horribly wrong when the LLM misreads comments.
What's even scarier is that each time I stumble across a freshly minted project on GitHub with a considerable amount of attention, not only is it 99% vibe-coded (very easy to detect), but it completely lacks any tests.
It makes me question whether the person prompting the code in the first place even understands how to write robust, battle-tested software.
[0] https://news.ycombinator.com/item?id=44764689
[1] https://sketch.dev/blog/our-first-outage-from-llm-written-co...
Wondering if that "one developer" is here on HN.