Summary:

The core WOU network router pair failed to pass traffic beginning at 9:30am on January 14, 2015. Partial network throughput was restored at 12:40pm and a full recovery occurred at 9:00pm on January 14, 2015.

Timeline:

Campus network outage began at approximately 9:30am on January 14, 2015
UCS responded immediately and went into diagnostic mode
Cisco TAC support was engaged at 10:30am
High CPU utilization was identified as an issue on the core campus router pair at 11:00am
Call placed to local Cisco representative for additional support at 11:30am
Call placed to NERO (the WOU ISP) engineer at 12:30pm
NERO diagnostics identified a server that was sending an excessive number of ARP requests to the router. The server was removed from the network at 12:40pm
Several networks were pulled out from behind the firewall, allowing network traffic to flow again
CPU utilization went from 99% to 86% after the server was removed from the network
At about 12:50pm, CPU utilization had climbed back to 99% even though the server had not been reconnected to the network
Additional Cisco support was provided at about 1:00pm. At this point we had three Cisco engineers on the phone and connected to our router pair via a Webex call.
By late afternoon, I requested additional on-site support from Mt. States Networking.
A Mt. States engineer was on site by 6:00pm
At ~8:15pm, the router's NetFlow process was identified as a culprit in the high CPU utilization. After the NetFlows were removed, CPU utilization fell from 99% to 23%
All networks were moved behind the firewall and traffic continued to flow properly.
The suspect host that had been removed earlier in the day was returned to service, and CPU utilization on the router immediately climbed to 99%
The suspect host was removed from the network again
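
For readers who want to see how a host sending an excessive number of ARP requests might be spotted, here is a minimal sketch in Python. It is not the tooling NERO or Cisco used during the outage. The `ArpRecord` shape, the `flag_arp_talkers` name, and the threshold value are all hypothetical, and the records are assumed to have already been parsed from a packet capture by some other tool.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class ArpRecord(NamedTuple):
    """One observed ARP packet (hypothetical record shape, for illustration only)."""
    sender_mac: str
    is_request: bool


def flag_arp_talkers(records: Iterable[ArpRecord], threshold: int) -> dict[str, int]:
    """Return each sender whose ARP request count is above the threshold."""
    # Count only ARP requests; replies are ignored.
    counts = Counter(r.sender_mac for r in records if r.is_request)
    return {mac: n for mac, n in counts.items() if n > threshold}


if __name__ == "__main__":
    # Made-up sample data, not taken from the incident.
    sample = (
        [ArpRecord("00:00:5e:00:53:01", True)] * 500
        + [ArpRecord("00:00:5e:00:53:02", True)] * 4
    )
    for mac, n in flag_arp_talkers(sample, threshold=100).items():
        print(f"{mac} sent {n} ARP requests")
```

A sensible threshold depends on the size of the network and the capture window, so it would need to be tuned against normal traffic before any host is pulled.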

Forensics:

February 15, 2015
- Our Unix systems administrator has been reviewing the suspect server's logs and discovered that the server had been compromised. This server is running OpenStack.
- We know that whoever compromised the server did not gain direct access to it via SSH or Telnet
- Forensics work continues…
