From: anon@jpl.nasa.gov
Date: Jan 7, 1996

As many of you know by now, a network anomaly occurred on December 27, 1995. The remainder of this document is devoted to

1) Illustrating the sequence of events at JPL in detail, along with an overview of the Cisco and Grand Junction event sequence.
2) Describing the actual bug and its ultimate resolution.
3) Presenting a discussion of what can be done to prevent such failures in the future.

In the following chronology, all dates and times are local and approximate.

-------------------------------------------------------------------------------
The Chronology

1995-12-27T00:00
JPL: People working on the project flight LANs in 230-3xx noticed that some of the installed Grand Junction hubs were failing. A Ulysses station handoff was occurring at roughly the same time; the GIF routing tables were being updated concurrently. The crew attempted to reconfigure workstations on the network by connecting them to properly functioning hubs. As time progressed more hubs failed.
Grand Junction: Hubs were beginning to fail throughout the world. Engineering personnel were summoned to work and began examining the problem.

1995-12-27T00:13
Grand Junction: Howard Charney (President and CEO) was alerted that there was a serious problem with the hubs. He was at the office by 00:45.

1995-12-27T01:30
JPL: Tom Boreham was contacted by the flight project personnel. Tom came in and orchestrated further efforts. At this time Ulysses was scheduled to perform commanding; workstations were still being moved between hubs but LOS prevented successful commanding. By this time 28 of the 31 operational hubs had failed. Efforts to configure and install off-line spares were unsuccessful because the spare hubs failed as soon as power was supplied.

1995-12-27T03:30
Grand Junction: The hub problem had been identified and a workaround was developed. This workaround was sent to the Cisco TACs (Technical Assistance Centers) so that they could support customers.

1995-12-27T05:00
JPL: Tom started calling Grand Junction and Cisco. He was unable to log a case number with Cisco; he was unable to reach anyone at Grand Junction.

1995-12-27T05:30
JPL: Tom called ILAN but was not able to reach anyone. Tom called for assistance from Pam Ray.

1995-12-27T07:00
JPL: Pam Ray called Michael Rodrigues.

1995-12-27T07:20
JPL: Glenn Dollar (Cassini network support) was called by Ivy Kakamasu and arrived at 230-3xx by 07:30.

1995-12-27T07:30
JPL: Michael Rodrigues called Pat Kleinhammer.

1995-12-27T07:40
JPL: Pat Kleinhammer called David Stegman and John Dundas. David Stegman called Don Wetton and Mike Young.

1995-12-27T08:00
JPL: Pat Kleinhammer, David Stegman, Don Wetton, Mike Young, Don Gallop, and John Dundas arrived at B601; tried to perform preliminary damage assessment. The Grand Junction 2802 hubs in the HiNet lab (601-230C) and 601-1xx exhibited the same behavior as the flight hubs. The Grand Junction 10/100ES in the HiNet lab, which had been powered off for the holiday, was activated; it did not exhibit any problems.

1995-12-27T08:30
JPL: Glenn Dollar called the Cisco TAC; logged a case and was told a FAX containing a workaround would be sent immediately and that an SE would follow up with a phone call. Glenn notified Pat of status.
Grand Junction: Software changes had been identified and initial testing of new firmware started.

1995-12-27T08:45
JPL: Glenn hadn't received the FAX from Cisco yet, so he called again. The FAX was received by 09:00.
JPL: John Dundas logged a case with the Cisco TAC (C37841) and was told a FAX containing a workaround would be sent and that an SE would follow up with a phone call; TAC support indicated that only the Grand Junction 2800 series switches were affected.

1995-12-27T09:00
JPL: Glenn received his FAX from Cisco. Glenn attempted to perform the procedure described in the FAX and was unsuccessful with the first hub tried; subsequent attempts met with success.

1995-12-27T09:14
JPL: Pat, Don, and John received a FAX from Glenn that was a copy of the information from Cisco. Pat and John tried the described procedure on the hub in the HiNet lab; the procedure was successful. Pat and John proceeded to perform the same procedure on the hub on the first floor; this too was successful.

1995-12-27T09:50
JPL: Pat and John collected information on where Grand Junction hubs were located throughout the Lab and proceeded to meet the people already at work in 230-3xx.

1995-12-27T10:15
JPL: Don Gallop stayed in the HiNet area (601-230x) to coordinate phone support. Pat and John arrived at 230-3xx to coordinate mobilization of forces at the Lab. Tony Swanson (Grand Junction/Cisco sales rep) arrived in 230-3xx and helped the team fix the hubs. A cast of approximately 6 people continued to work their way through all the hubs in the flight area. Don (Wetton) and Mike proceeded to fix 3 hubs in 230-3xx, 4 hubs in 301-3xx, and 1 hub in 190; they surveyed other buildings with Grand Junction hubs to verify proper operation. Pat and John proceeded to fix 1 hub in 264-4xx, 4 hubs in 238-4xx, and 2 hubs in 238-3xx.

1995-12-27T14:30
JPL: All Grand Junction hubs repaired and back online.

1995-12-27T19:38
JPL: John received the FAX from Cisco that had been requested earlier.

-------------------------------------------------------------------------------
The BUG

Howard Charney (formerly President and CEO of Grand Junction; now Division Director at Cisco) and I had a lengthy conversation on 1996-01-02. While we covered a number of topics, perhaps the most important was a discussion of the actual bug that led to the hub failure.

Howard assured me that the bug was caused by faulty software (firmware) within the hub. The fault was purely accidental, NOT DELIBERATE, and not an intentional "trojan horse" or "time bomb" as some have portrayed it.
The engineer responsible for the faulty routine was identified and was a member of the team that developed the workaround. This engineer is one of their senior software developers and remains employed by the company today. (Some have speculated that, due to the buyout by Cisco, a disgruntled employee might have planted a bug. This simply is not the case. Grand Junction has not had any employee terminations in the past year and a half. Furthermore, most of the equipment at JPL was purchased long before the Cisco buyout was announced.)

The bug itself was the result of a faulty date conversion routine. The hub maintains an internal time that is used to compute resource utilization over various periods of time. The routine had a bug when converting internal time to human-readable time (i.e., month, day, year, hour, minute, second) for the month of December. Rather than resulting in month 12, the result was month 13; this caused other parts of the routine to fail at different points. In particular, the bug would cause a processor trap whenever the date conversion routine was called between December 27 at 12:01 and December 31 at 12:01, in any year. The trap caused the hub to reboot. Unfortunately the routine was also called as part of the boot sequence; consequently the hub would continuously reboot during the interval above.

The effect of the bug is to disable ALL communication between ports of the hub. That is, no data is passed between any of the ports while the bug (i.e., boot code loop) is operating. The console (serial, RS-232) port is also disabled during the reboot process; it is not possible to reset the date via the console once the hub has entered the interval.

The workaround developed by Grand Junction and distributed by Cisco requires two people. First the hub must be depopulated and then removed from any rack. The cover must be removed (around 12 screws).
One person is then required to cause a short between a capacitor and a resistor on the hub motherboard. Concurrently the second person applies power to the hub and waits until the initial boot sequence is completed; at this point the console should become active. The first person may remove the jumper now. The second person must change the date to something outside the range that activates the bug. The hub may then be reassembled, reinstalled, rebooted and used normally. Note that this procedure required a minimum of 20 minutes per hub; some hubs took as long as 45 minutes. JPL has approximately 50 of these hubs.

As noted above, not all hubs failed at the same time. In fact, during the day of December 27, some hubs were still functioning properly. This was because some hubs had been initialized with wildly different clock times and had not yet reached the critical interval.

-------------------------------------------------------------------------------
The Followup

(this is mostly CYA statements by JPL HiNet people. A few interesting nuggets:)

To prevent a catastrophic failure of a similar nature in the future, a multi-vendor network could be designed. However, while this lessens the probability of a failure similar to what we already experienced, the following caveats must be kept in mind.

1) Multiple vendors will increase network costs as more kinds of equipment need to be supported and purchased. We lose economies of scale on purchase when halving the size of the acquisition. We increase maintenance cost as multiple vendors must be contacted for maintenance.
2) Vendors license code from one another. When using more than one vendor of network gear we must be certain that they are not both using the same code. This can be difficult and will certainly complicate procurements.
3) Interoperability testing is complicated. While these networks are required to be "standards based" all vendors add their own value-added touches.
Interoperability is a key for the Lab's networks.

It is still possible to use single vendor networks designed for fault tolerance (through redundancy) if the designing and operating engineers can increase the probability that the network devices will not fail at the same instant in the same mode. For example, in retrospect if the hubs had their dates offset (e.g., one month) only half of the flight LAN hubs would have failed at one time. There are other operating parameters that can be altered between hubs that increase the probability that the hubs will not have exactly the same failure mode at the same instant while maintaining proper network operation and performance. These parameters need to be carefully evaluated and clearly documented.

> * Since the problem did not surface in the Cisco-labeled Grand
> Junction products (Catalyst 2000), there must have been a firmware
> update between our purchase of the Grand Junction switches in
> summer-fall and Cisco's acquisition later in the fall. If there was a
> firmware revision at least one engineer knew of the problem. Why was
> this not publicized? Why wasn't the fix made available before the
> "time bomb" went off?

This was a miscommunication between Grand Junction, Cisco, and their customers. Twenty-one Cisco-labeled switches were produced before the bug occurred; none of the switches were operating at the time of the bug. Had any of the switches been operating, they too would have experienced the same symptoms. There was no software revision between the Grand Junction and Cisco products. The bug affected Grand Junction series 2800 switches and Cisco Catalyst series 2000 switches only.

...

> * We need to identify a more convenient mechanism for bypassing
> the password and boot code, should the need ever arise again. There
> must be a way to do this without removing rack-mounted and fully
> populated boxes. Please work on this.

This has been noted by Grand Junction engineering.
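
To make the date-offset idea from this follow-up concrete, here is a minimal sketch in Python. It is an illustration only, not Grand Junction firmware or a JPL tool: the hub names and clock offsets are hypothetical, and it assumes that "12:01" in the bug description means 12:01 a.m. local time. Given the true time and each hub's clock offset, it lists the hubs whose internal clock falls inside the December 27 to December 31 interval.

```python
from datetime import datetime, timedelta

# Failure interval from the bug description, as (month, day, hour, minute).
# Assumption: "12:01" means 12:01 a.m.; the report does not say.
WINDOW_START = (12, 27, 0, 1)
WINDOW_END = (12, 31, 0, 1)


def in_failure_window(hub_clock: datetime) -> bool:
    """Return True if a hub's own clock is inside the interval, in any year."""
    stamp = (hub_clock.month, hub_clock.day, hub_clock.hour, hub_clock.minute)
    return WINDOW_START <= stamp <= WINDOW_END


def failing_hubs(true_time: datetime, offsets: dict) -> list:
    """Names of hubs whose offset clock is inside the interval at true_time."""
    return [name for name, offset in offsets.items()
            if in_failure_window(true_time + offset)]


if __name__ == "__main__":
    # Hypothetical fleet: two hubs on true time, two set 30 days ahead.
    offsets = {
        "hub-a": timedelta(0),
        "hub-b": timedelta(days=30),
        "hub-c": timedelta(0),
        "hub-d": timedelta(days=30),
    }
    print(failing_hubs(datetime(1995, 12, 27, 8, 0), offsets))
    # prints ['hub-a', 'hub-c']
```

In this hypothetical fleet only half of the hubs are down on the morning of December 27. The hubs set 30 days ahead would instead enter the interval around November 27, so staggering the clocks spreads the failure out in time rather than removing it.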

-------------------------------------------------------------------------------
Summary

1) The failure of the Grand Junction hubs, while massive, did not compromise any spacecraft data.
2) The response from the vendors, the flight project network support personnel, and HiNet personnel was rapid and massive.
3) A temporary fix has been applied to all hubs at JPL. A permanent fix should be available next week. Note that application of the flash ROM download requires that the hubs be rebooted; installation of new EPROMs requires significant downtime for each hub.

HiNet engineering proposes that all hubs have their flash ROM upgraded as soon as the software becomes available. Further, to prevent any possibility of inadvertently using faulty code in the future, we are recommending that all hubs have EPROM upgrades installed, at user convenience, when these become available.