Friday, October 08, 2004


Quote With Almost No Comment Department

Courtesy of Jerry Pournelle, we find this cute article on software disasters being people problems. The money quotes (emphasis and selective editing all mine):

Too often, he said, programmers are handed a lengthy document explaining the business requirements for a software project and left to interpret it.

"Developers are least qualified to validate a business requirement. They're either nerds and don't get it, or they're people in another culture altogether," said Michelsen, referring to cases where development takes place offshore.

The lack of robust testing during and after such a project likely contributed to the September 14 radio system outage over the skies of parts of California, Nevada and Arizona. Though there were a handful of close calls, all 403 planes in the air during the incident managed to land safely, said FAA spokesman Donn Walker. A handful violated rules that dictate how close they are allowed to fly to each other -- but the FAA maintains there were no "near misses."

The genesis of the problem was the transition in 2001 by Harris Corp. of the Federal Aviation Administration's Voice Switching Control System from Unix-based servers to Microsoft Corp.'s off-the-shelf Windows Advanced Server 2000.

By most accounts, the move went well except the new system required regular maintenance to prevent data overload. When that wasn't done, it turned itself off as it was designed to do. But the backup also failed. Michelsen said the failure was in inadequate testing. "On a regular basis, the FAA should have been downing that primary system and watching that backup system come up," he said. "If it doesn't go up and stay up, they would have known they had a problem to fix long before they needed to rely on it."

To be fair about this, part of the problem did arise out of Harris using a timer function in the Windows API that resets itself to zero roughly every 49 days, but geez, handing developers requirements docs without analysts to tell them what they mean? Everyone and their mother knows that MSFT servers have to be rebooted periodically - that's what we have maintenance windows for, boys and girls. In my neck of the woods, very few business people are going to approve putting a system that requires true HA on a Wintel server platform. Everyone from the architects and analysts to the project managers here should be held accountable for this fiasco.
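
For the curious, here is where a number like 49 days would come from. This is a rough sketch, assuming the timer in question is an unsigned 32-bit count of milliseconds since boot (the article does not name the function, so that part is my guess). The names wrap_period_days and elapsed_ms are made up for illustration and are not anything from the Windows API.

```python
# Assumption: the counter is an unsigned 32-bit number of milliseconds.
WRAP = 2 ** 32  # number of distinct values before the counter rolls over to zero

MS_PER_DAY = 1000 * 60 * 60 * 24


def wrap_period_days(bits=32):
    """Days until a millisecond counter of the given width rolls over."""
    return (2 ** bits) / MS_PER_DAY


def elapsed_ms(start, now):
    """Elapsed milliseconds between two readings, tolerating one rollover.

    Modular subtraction gives the right answer as long as the true
    interval is shorter than one full wrap period.
    """
    return (now - start) % WRAP


if __name__ == "__main__":
    print(f"32-bit ms counter wraps after {wrap_period_days():.1f} days")

    # A reading taken 100 ms before rollover, then one taken 50 ms after it.
    start = WRAP - 100
    now = 50
    print("naive difference:    ", now - start)          # large negative number
    print("wrap-safe difference:", elapsed_ms(start, now))
```

The naive subtraction goes badly negative the moment the counter rolls over, which is the sort of thing that only shows up if somebody leaves the box running for seven weeks in a test lab.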

