Funny Debugging Story #1

original post from http://patrickthomson.tumblr.com

…One customer had a problem. In the middle of a print run, one particular A drive would stop working, causing the entire print run to stop. To restore the drive, the attendants had to reboot the entire system – and if this happened in the middle of a six-hour print job, there’d be a ton of expensive computer time lost and the whole operation would fall behind schedule.

So Storage Technologies sent out technicians. The technicians, despite their best efforts, could not reproduce the bug in test settings: this bug seemed only to happen in the middle of large print jobs. So, on the off chance that this was a hardware issue, they replaced everything they could – the RAM, the microcontroller, the disk drive, every conceivable part of the tape drive – but the problem kept happening.

So the technicians phoned up headquarters and called in The Expert.

The Expert got a chair and a cup of coffee and sat in the computer room – these were the days when they had rooms specifically dedicated to computers, after all – and watched as the attendants queued up a large print job. He waited until it crashed – which it did. Everybody looked to The Expert – and he didn’t have a clue what was causing it. So he ordered that the job be queued up again, and all the attendants and technicians went back to work.

The Expert sat down in his chair again, waiting for it to crash. It took something like six hours of waiting, but it crashed again. He still had no idea what was causing it, other than the fact that it happened when the room was crowded. He ordered that the job be restarted, and he sat down again and waited.

By the third crash, he had noticed something. The crash occurred when the attendants were changing the tapes on an unrelated drive. And furthermore, he realized that the crash occurred as soon as one of the attendants walked across a certain tile on the floor.

This type of floor was made of aluminum tiles propped up by posts about 6 to 8 inches tall. The many wires that these computers needed were threaded under the floor tiles so that an unwary attendant wouldn’t trip over a crucial cable. The tiles were put together very tightly so that no debris would fall into the space where the wiring went.

The Expert figured out that one of the aluminum tiles was warped. When an attendant stood on the corner of the warped tile, the edges of the tiles rubbed together. As the plastic connecting the tiles rubbed together, it produced microsparks, which in turn caused RF interference.

Nowadays, RAM is much more thoroughly shielded from RF interference. But back then, this was not the case. The Expert figured out that the RF interference was corrupting the RAM and, in turn, the operating system.

The Expert called the maintenance office, got a new tile, installed it himself, and the problem went away.
