Summary:
The core WOU network router pair failed to pass traffic beginning at 9:30am on January 14, 2015. Partial network throughput was restored at 12:40pm, and full recovery occurred at 9:00pm on January 14, 2015.
Timeline:
- Campus network outage began at approximately 9:30am on January 14, 2015
- UCS responded immediately and went into diagnostic mode
- Cisco TAC support was engaged at 10:30am
- High CPU utilization was identified as an issue on the core campus router pair at 11:00am
- Call placed to local Cisco representative for additional support at 11:30am
- Call placed to NERO (the WOU ISP) engineer at 12:30pm
- NERO diagnostics identified a server that was sending an excessive number of ARP requests to the router. The server was removed from the network at 12:40pm
- Several networks were pulled out from behind the firewall, allowing network traffic to flow again
- CPU utilization went from 99% to 86% after the server was removed from the network
- At about 12:50pm, CPU utilization had climbed back to 99% even though the server had not been reconnected to the network
- Additional Cisco support was provided at about 1:00pm. At this point we had three Cisco engineers on the phone and connected to our router pair via a Webex call.
- By late afternoon, I requested additional on-site support from Mt. States Networking.
- A Mt. States engineer was on site by 6:00pm
- At ~8:15pm, the router's NetFlow process was identified as a cause of the high CPU utilization. After NetFlow was removed, CPU utilization fell from 99% to 23%
- All networks were moved back behind the firewall, and traffic continued to flow properly.
- The suspect host that was removed in the morning was returned to service and the CPU utilization on the router immediately climbed to 99%
- The suspect host was removed
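The ARP flood noted in the timeline is the kind of condition that can be spotted by counting ARP requests per source over a short window. The following is a minimal Python sketch, assuming the ARP request observations are already available as (timestamp in seconds, source MAC) pairs. The function name, the threshold, and the sample data are all hypothetical and are not taken from the tools used during the incident.

```python
from collections import Counter


def find_arp_talkers(observations, window_seconds, max_requests_per_second):
    """Return {source MAC: requests per second} for sources over the limit.

    observations: iterable of (timestamp_seconds, source_mac) pairs, one per
    ARP request seen in the window. How they are captured is outside the
    scope of this sketch.
    """
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    # Count requests per source MAC, then convert counts to an average rate.
    counts = Counter(mac for _, mac in observations)
    rates = {mac: n / window_seconds for mac, n in counts.items()}
    return {mac: r for mac, r in rates.items() if r > max_requests_per_second}


# Hypothetical sample: one host sends 500 requests in 10 s, another sends 5.
sample = [(i * 0.02, "aa:aa:aa:aa:aa:01") for i in range(500)]
sample += [(i * 2.0, "aa:aa:aa:aa:aa:02") for i in range(5)]

flagged = find_arp_talkers(sample, window_seconds=10, max_requests_per_second=20)
print(flagged)
```

A fixed threshold is only a starting point; a sensible limit depends on the size of the subnet and on normal ARP activity for that network.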
Forensics:
- February 15, 2015
- Our Unix systems administrator has been reviewing the suspect server's logs and discovered that the server had been compromised. This server is running OpenStack.
- We know that whoever compromised the server did not gain direct access to it via SSH or Telnet
- Forensics work continues…