One of our upstream provider’s core routers is currently experiencing significant routing issues, preventing us from contacting our DNS servers and degrading access to, from, and within our cluster. They are working on the problem now and we will update this status as more information becomes available. Thank you for your patience.
Update 1:55 PM MST: Network service has been restored. We deeply apologize for this outage and are working with our upstream network provider to find solutions that will prevent it from recurring.
Update 4:03 PM MST: One of our upstream provider’s redundant core routers had serious routing problems from approximately 1:37 to 1:53 PM MST on Wednesday, November 16th.
While they are still investigating the cause, BGP routing sessions on one of their core routers (a Cisco 6509-E) started to flap badly at approximately 1:37 PM. As a result, CPU utilization spiked, and the router became unresponsive to management commands and began dropping and re-establishing additional BGP sessions. The repeated flapping, and the heavy load of reprocessing routing information each time a session came back up, meant the router could not recover enough to fully establish all of its sessions and keep them up.
To help reduce the load caused by this cycle, a network engineer disconnected an Internet transit connection that went directly to this router. This kept the router from repeatedly processing and dropping the full feed of Internet routes from that transit provider, and the router stabilized very quickly afterward.
Although our upstream provider’s network is fully redundant, both at the border and at the core, a partial outage like this is still possible: when a router remains partially operable, full failover to the other router might not be triggered. Among the services affected was one of their redundant recursive name servers, which caused some DNS queries to time out.
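For readers curious why a partially working router can slip past failover, here is a minimal illustrative sketch in Python. It is not our provider's actual failover logic. The names (`RouterStatus`, `liveness_only_failover`, `health_aware_failover`) and the 80% session threshold are hypothetical, chosen only to show the difference between checking whether a router is alive and checking whether it is healthy.

```python
from dataclasses import dataclass


@dataclass
class RouterStatus:
    """Hypothetical snapshot of a router's state (illustrative only)."""
    responds_to_ping: bool   # basic liveness: the router still answers
    sessions_up: int         # BGP sessions currently established
    sessions_expected: int   # BGP sessions that should be established


def liveness_only_failover(status: RouterStatus) -> bool:
    """Fail over only when the router stops answering entirely."""
    return not status.responds_to_ping


def health_aware_failover(status: RouterStatus, min_ratio: float = 0.8) -> bool:
    """Fail over when the router is dead OR too many sessions are down.

    min_ratio is an assumed threshold, not a recommended value.
    """
    if not status.responds_to_ping:
        return True
    if status.sessions_expected == 0:
        return False  # nothing to measure against
    return status.sessions_up / status.sessions_expected < min_ratio


# A router that still answers but has lost most of its sessions,
# loosely resembling the partial failure described above.
degraded = RouterStatus(responds_to_ping=True, sessions_up=3, sessions_expected=10)

print(liveness_only_failover(degraded))  # False: failover is not triggered
print(health_aware_failover(degraded))   # True: degraded state is detected
```

In this sketch, the liveness-only check treats the degraded router as fine, which is the kind of gap that lets a partial outage persist until someone intervenes manually.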
Again, we are very sorry for any problems this may have caused and we are working to find solutions that will mitigate any future outages.