A week after 5 million telephone customers in Maryland and neighboring states were hit by a nine-hour service breakdown, hundreds of computer sleuths from the C&P Telephone Co., AT&T, contractors and manufacturers are still struggling to pinpoint the cause of the failure.
While they search, similar failures continue to occur elsewhere.
The latest was a two-hour collapse that began at 10:15 a.m. yesterday, affecting a million Bell of Pennsylvania customers in Pittsburgh. It was a recurrence of a larger breakdown the day before affecting 1.3 million customers.
Bell Atlantic spokesman Larry Plumb said that in each case, investigators have found minor equipment problems. Each in turn has triggered an "avalanche" of automatic computer maintenance activity that floods critical call-switching equipment, crowding out telephone calls.
The troubles that trigger these events, Plumb said, "are itty-bitty. They occur all the time. They should never . . . result in this avalanche of maintenance messages. But the root cause is still not identified."
Experts are poring through reams of printouts and miles of cryptic computer language messages looking for the problem.
Although there has been much speculation about sabotage and computer "viruses," Plumb said "the prime suspect is not something like a virus. We believe it's some sort of software glitch, an inadequacy in the code somewhere. But we haven't ruled out anything."
Here's some of what investigators have learned so far about last week's Baltimore phone breakdown:
C&P has found a faulty circuit board in a computer in Baltimore that it believes touched off the breakdown. The failure should have been contained by switching to a backup board. Instead, the big call-routing computer -- Signaling System 7 -- began flooding itself with internal maintenance messages.
As designed, the SS7 computer shut itself down and asked its twin to take over. Instead of taking over, however, the twin became overloaded with similar internal messages and it, too, shut down.
"The whole machine tied itself up in a knot," Plumb said.
In the space of an hour, the trouble cascaded through four identical computers in Baltimore and Washington serving 5 million customers in Maryland, Virginia, part of West Virginia and the District of Columbia.
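The cascade Plumb describes can be illustrated with a toy model. This is only a sketch: the function name, the capacity figure and the size of the message flood are invented for illustration and are not taken from C&P's equipment. It assumes each computer that takes over inherits the same flood of maintenance messages that knocked out its predecessor.

```python
# Toy model of the cascade described in the article. All names and
# numbers are illustrative assumptions, not data from the C&P network.

def simulate_cascade(num_computers=4, capacity=100, flood=150):
    """Return the list of computers that shut down, in order.

    capacity: maintenance messages a computer can absorb before it
              shuts itself down (assumed value).
    flood:    maintenance messages set off by the initial fault and
              inherited by whichever computer takes over (assumed value).
    """
    shut_down = []
    for computer in range(num_computers):
        if flood <= capacity:
            # This computer absorbs the load, so the failure is contained.
            break
        # Overloaded: it shuts down and hands off to the next computer,
        # which faces the same flood of messages.
        shut_down.append(computer)
    return shut_down


print(simulate_cascade())          # flood exceeds capacity: [0, 1, 2, 3]
print(simulate_cascade(flood=50))  # flood is absorbed: []
```

In this simplified picture, a backup offers no protection when the overload travels with the handoff, which matches the behavior investigators reported.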
WHAT CUSTOMERS HEARD
Some calls went through, some were cut off; others produced "fast" busy signals, and some yielded "dead" lines.
Plumb said calls to phones served by the caller's own local switch point, or central office, went through because they didn't go through the SS7 computers. So a customer calling a neighbor in Annapolis got through. People making long-distance calls also got through, because those calls involve no SS7 switching. So a Baltimore customer could reach Washington or California.
But calls to more distant parts of the "Local Access Transport Area" (your toll-free dialing area) ran into a shut-down SS7 computer and disconnected. The trouble compounded as frustrated callers repeated their attempts. The backed-up calls overloaded circuits, and callers began getting "dead" lines, or "fast busy" signals indicating overloaded trunks.
Some lucky customers in the area -- a few in Washington but most in rural areas -- are not yet linked to the SS7 computers. Their service was unaffected.
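The pattern Plumb describes amounts to a short set of rules, sketched here in code. The function and parameter names are hypothetical, and the sketch ignores the secondary trunk overloads caused by repeated dialing.

```python
from enum import Enum


class Outcome(Enum):
    CONNECTED = "connected"
    FAILED = "failed"


def call_outcome(same_central_office, long_distance,
                 caller_on_ss7=True, ss7_up=False):
    """Classify a call made during the outage, per Plumb's account.

    All parameter names are illustrative assumptions.
    """
    if not caller_on_ss7:
        # Customers not yet linked to the SS7 computers were unaffected.
        return Outcome.CONNECTED
    if same_central_office:
        # Calls within the caller's own central office bypass SS7.
        return Outcome.CONNECTED
    if long_distance:
        # Long-distance calls involved no SS7 switching.
        return Outcome.CONNECTED
    # Calls elsewhere in the local toll-free area depend on SS7.
    return Outcome.CONNECTED if ss7_up else Outcome.FAILED


# A neighbor served by the same central office:
print(call_outcome(same_central_office=True, long_distance=False))
# Baltimore to California:
print(call_outcome(same_central_office=False, long_distance=True))
# A more distant part of the local dialing area:
print(call_outcome(same_central_office=False, long_distance=False))
```

Only the third call fails, which is why some customers could reach California but not a number a few towns away.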
Callers improvised. One woman trying to reach her day-care provider found she could call her grandmother long-distance in Canada, then asked her to call the day-care center in Maryland. It worked.
HOW WAS IT FIXED?
As soon as the alarms began going off in C&P's control center in Baltimore, technicians at computer terminals began trying to rescue the failing SS7s. The system's manufacturer in Texas and AT&T experts in Philadelphia pitched in via long-distance computer links. Others got on planes for Baltimore.
As the trouble spread, the rescuers decided that rather than struggle to keep the computers working, they would shut them down and restart them, much as personal computer users do when their systems "freeze."
But each time they tried to bring them back up, the SS7s would start overloading again with internal messages to each other. Finally, Plumb said, the operators shut off that function and the computers gradually restarted.
Affected phone companies have installed new instructions to control the explosion of computer messages that has followed minor failures.
Meanwhile, hundreds of analysts from C&P, AT&T, Bellcore (the research arm of Bell Atlantic) and Digital Switch Communications (SS7's manufacturer) continue to pore over reams of computer printouts looking for a programming error, a "virus" or anything else that might explain the outages. They are also in touch with teams investigating the outages since June 26 in California and Pittsburgh.
They are picking through the SS7's own reports on its actions last week, looking for "decision points" at which the computer considered options and made the wrong choices.
Government regulators, meanwhile, have begun to examine whether the industry's adoption of the SS7 technology has left consumers vulnerable to costly system-wide failures.
Phone companies nationwide began installing SS7s in 1986 in an effort to improve call-switching efficiency, and to expand line capacity to allow such services as Caller ID and Call Waiting.
C&P has apologized for any "inconvenience" due to the breakdown, and has canceled the $1.55 surcharge normally added to customer phone bills for calls made with operator assistance. The cancellation applies only to calls made during the outage on June 26.