Therac-25 software development and design
We know that the software for the Therac-25 was developed by a single person,
using PDP-11 assembly language, over a period of several years. The software
"evolved" from the Therac-6 software, which was started in 1972. According
to a letter from AECL to the FDA, the "program structure and certain subroutines
were carried over to the Therac 25 around 1976."
Apparently, very little software documentation was produced during
development. In a 1986 internal FDA memo, a reviewer lamented, "Unfortunately,
the AECL response also seems to point out an apparent lack of documentation
on software specifications and a software test plan."
The manufacturer said that the hardware and software were "tested
and exercised separately or together over many years." In his deposition
for one of the lawsuits, the quality assurance manager explained that testing
was done in two parts. A "small amount" of software testing was done on
a simulator, but most testing was done as a system. It appears that unit
and software testing was minimal, with most effort directed at the integrated
system test. At a Therac-25 user group meeting, the same quality assurance
manager said that the Therac-25 software was tested for 2,700 hours. Under
questioning by the users, he clarified this as meaning "2,700 hours of
use."
The programmer left AECL in 1986. In a lawsuit connected with
one of the accidents, the lawyers were unable to obtain information about
the programmer from AECL. In the depositions connected with that case,
none of the AECL employees questioned could provide any information about
his educational background or experience. Although an attempt was made
to obtain a deposition from the programmer, the lawsuit was settled before
this was accomplished. We have been unable to learn anything about his
background.
AECL claims proprietary rights to its software design. However,
from voluminous documentation regarding the accidents, the repairs, and
the eventual design changes, we can build a rough picture of it.
The software is responsible for monitoring the machine status,
accepting input about the treatment desired, and setting the machine up
for this treatment. It turns the beam on in response to an operator command
(assuming that certain operational checks on the status of the physical
machine are satisfied) and also turns the beam off when treatment is completed,
when an operator commands it, or when a malfunction is detected. The operator
can print out hard-copy versions of the CRT display or machine setup parameters.
The treatment unit has an interlock system designed to remove
power to the unit when there is a hardware malfunction. The computer monitors
this interlock system and provides diagnostic messages. Depending on the
fault, the computer either prevents a treatment from being started or,
if the treatment is in progress, creates a pause or a suspension of the
treatment.
The manufacturer describes the Therac-25 software as having a
stand-alone, real-time treatment operating system. The system is not built
using a standard operating system or executive. Rather, the real-time executive
was written especially for the Therac-25 and runs on a 32K PDP-11/23. A
preemptive scheduler allocates cycles to the critical and noncritical tasks.
The software, written in PDP-11 assembly language, has four major
components: stored data, a scheduler, a set of critical and noncritical
tasks, and interrupt services. The stored data includes calibration parameters
for the accelerator setup as well as patient-treatment data. The interrupt
routines include
- a clock interrupt service routine,
- a scanning interrupt service routine,
- traps (for software overflow and computer-hardware-generated interrupts),
- power up (initiated at power up to initialize the system and pass control
  to the scheduler),
- treatment console screen interrupt handler,
- treatment console keyboard interrupt handler,
- service printer interrupt handler, and
- service keyboard interrupt handler.
The scheduler controls the sequences of all noninterrupt events and coordinates
all concurrent processes. Tasks are initiated every 0.1 second, with the
critical tasks executed first and the noncritical tasks executed in any
remaining cycle time. Critical tasks include the following:
- The treatment monitor (Treat) directs and monitors patient setup and treatment
  via eight operating phases. These are called as subroutines, depending
  on the value of the Tphase control variable. Following the execution of
  a particular subroutine, Treat reschedules itself. Treat interacts with
  the keyboard processing task, which handles operator console communication.
  The prescription data is cross-checked and verified by other tasks (for
  example, the keyboard processor and the parameter setup sensor) that inform
  the treatment task of the verification status via shared variables.
- The servo task controls gun emission, dose rate (pulse-repetition frequency),
  symmetry (beam steering), and machine motions. The servo task also sets
  up the machine parameters and monitors the beam-tilt-error and the flatness-error
  interlocks.
- The housekeeper task takes care of system-status interlocks and limit checks,
  and puts appropriate messages on the CRT display. It decodes some information
  and checks the setup verification.
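The way Treat selects one of its eight phases from the Tphase variable and then reschedules itself can be sketched as a dispatch table. This is only an illustration of the control structure described above, not AECL's code, which is proprietary and was written in PDP-11 assembly language. The phase handlers, the `state` dictionary, the run queue, and the rule that each phase simply advances to the next one are all hypothetical.

```python
# Hypothetical sketch: handlers and transitions are invented for illustration.
# Only "eight phases selected by the Tphase variable" comes from the text.

NUM_PHASES = 8


def make_phase(index):
    """Build a placeholder phase subroutine that records that it ran."""
    def phase(state):
        state["log"].append(index)
        # Assumption: each phase chooses the next Tphase value.
        # Here the phases simply cycle 0, 1, ..., 7, 0, ...
        state["tphase"] = (index + 1) % NUM_PHASES
    return phase


# Dispatch table indexed by the Tphase control variable.
PHASES = [make_phase(i) for i in range(NUM_PHASES)]


def treat(state, schedule):
    """Run the subroutine selected by Tphase, then reschedule Treat itself."""
    PHASES[state["tphase"]](state)
    schedule(treat)


def run_cycles(n):
    """Drive Treat for n scheduler passes using a simple run queue."""
    state = {"tphase": 0, "log": []}
    queue = [treat]
    for _ in range(n):
        task = queue.pop(0)
        task(state, queue.append)
    return state
```

Because Tphase lives in shared state, any other task that writes it between two passes changes which subroutine Treat runs next. That is the kind of coupling through shared variables that the text describes.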
Noncritical tasks include
- Checksum processor (scheduled to run periodically).
- Treatment console keyboard processor (scheduled to run only if it is called
  by other tasks or by keyboard interrupts). This task acts as the interface
  between the software and the operator.
- Treatment console screen processor (run periodically). This task lays out
  appropriate record formats for either displays or hard copies.
- Service keyboard processor (run on demand). This task arbitrates non-treatment-related
  communication between the therapy system and the operator.
- Snapshot (run periodically by the scheduler). Snapshot captures preselected
  parameter values and is called by the treatment task at the end of a treatment.
- Hand-control processor (run periodically).
- Calibration processor. This task is responsible for a package of tasks
  that let the operator examine and change system setup parameters and interlock
  limits.
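The scheduling policy described above (tasks initiated every 0.1 second, critical tasks first, noncritical tasks in whatever cycle time remains) can be modeled in a few lines. This is a simplified, non-preemptive simulation under stated assumptions, not the actual Therac-25 executive. The per-task costs in milliseconds and the rule that a noncritical task is skipped for the cycle when it does not fit are hypothetical.

```python
CYCLE_MS = 100  # tasks are initiated every 0.1 second, per the text


def run_cycle(critical, noncritical, budget_ms=CYCLE_MS):
    """Simulate one cycle. Tasks are (name, cost_ms) pairs.

    Returns the names of the tasks that ran and the unused cycle time.
    """
    ran = []
    remaining = budget_ms
    # Critical tasks always run first, whatever they cost.
    for name, cost in critical:
        ran.append(name)
        remaining -= cost
    # Noncritical tasks run only if they fit in the remaining cycle time.
    # Assumption: a task that does not fit is skipped for this cycle.
    for name, cost in noncritical:
        if cost <= remaining:
            ran.append(name)
            remaining -= cost
    return ran, remaining


# Hypothetical costs, chosen only to show a noncritical task being skipped.
critical = [("treat", 30), ("servo", 25), ("housekeeper", 15)]
noncritical = [("checksum", 20), ("screen", 20), ("snapshot", 5)]
ran, left = run_cycle(critical, noncritical)
```

With these made-up costs, the three critical tasks leave 30 ms, so the checksum and snapshot tasks run and the screen task waits for a later cycle.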
It is clear from the AECL documentation on the modifications that the software
allows concurrent access to shared memory, that there is no real synchronization
aside from data stored in shared variables, and that the "test" and "set"
for such variables are not indivisible operations. Race conditions resulting
from this implementation of multitasking played an important part in the
accidents.
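To make the hazard concrete, the following sketch shows why a "test" and a "set" that are not indivisible can let two tasks both believe they own a shared resource. It is a generic illustration, not a reconstruction of the Therac-25 code. The `busy` flag, the task names, and the use of Python generators to model preemption points are all assumptions made for the example.

```python
def task_nonatomic(shared, name):
    """Test the flag, allow a task switch, then set the flag."""
    free = shared["busy"] == 0      # test
    yield                           # the scheduler may switch tasks here
    if free:
        shared["busy"] = 1          # set, based on a possibly stale test
        shared["owners"].append(name)


def task_atomic(shared, name):
    """Test and set in one step, with no switch point in between."""
    if shared["busy"] == 0:
        shared["busy"] = 1
        shared["owners"].append(name)
    yield


def interleave(task):
    """Run two copies of a task round-robin, switching at every yield."""
    shared = {"busy": 0, "owners": []}
    pending = [task(shared, "A"), task(shared, "B")]
    while pending:
        gen = pending.pop(0)
        try:
            next(gen)
            pending.append(gen)
        except StopIteration:
            pass
    return shared["owners"]
```

Under this interleaving, `interleave(task_nonatomic)` lets both A and B claim the flag, because B tests it before A has set it. `interleave(task_atomic)` admits only A. On real hardware the indivisible version would rely on an atomic instruction or on disabling interrupts around the test and set.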