One of our transit providers experienced close to 100% connectivity loss between most EU and NA locations, and we also saw some loss over this provider between the US West and East coasts. Because packet loss increased so quickly, much of the data our network automation system needs to function properly was never reported back to our collectors. Acting on the partial data it did receive, the system attempted to disable origin pulls via the affected provider and shift traffic to other providers, but with many destinations unreachable, the desired actions couldn't be taken in time to avert impact.
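To make that failure mode concrete, here is a minimal sketch, with assumed names, thresholds, and a quorum rule that are purely illustrative and not our actual implementation, of how this kind of threshold-based automation can stall when its own telemetry travels over the broken path:

```python
# Hypothetical sketch of loss-threshold automation; names and thresholds are
# illustrative, not Cloudflare's real system.

LOSS_THRESHOLD = 0.30       # packet-loss fraction at which a provider is considered bad
MIN_REPORT_FRACTION = 0.75  # fraction of colos that must report before acting


def evaluate_provider(reports: dict[str, float], total_colos: int):
    """Decide what to do about a transit provider.

    `reports` maps colo -> measured packet loss toward origins via the
    provider. If too few colos managed to reach the collectors (as in this
    incident, where the telemetry itself was carried over the failing
    provider), the decision is deferred rather than taken on partial data.
    """
    if len(reports) < MIN_REPORT_FRACTION * total_colos:
        return "defer", []

    lossy = sorted(colo for colo, loss in reports.items() if loss >= LOSS_THRESHOLD)
    return ("disable_origin_pulls", lossy) if lossy else ("no_action", [])


# Only 3 of 20 colos got their measurements back to the collectors, so the
# automation defers instead of disabling origin pulls via the provider.
print(evaluate_provider({"LHR": 0.95, "FRA": 0.90, "DME": 0.99}, total_colos=20))
```

Pushing the resulting actions out to colos that were themselves unreachable over the affected provider faced the same underlying problem.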
This incident affected all traffic routed over this provider. It manifested as 522 response codes served when we were unable to reach customer origin servers, and as general reachability issues for visitors whose traffic was routed over this network and who therefore could not reach our edge network.
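For context, a 522 is the error we serve when a connection to the customer's origin cannot be established in time. A rough illustration of that mapping, not our proxy code and with placeholder host, port, and timeout values:

```python
# Rough illustration of how a failed origin connection becomes a 522 for the
# visitor; this is not Cloudflare's proxy code.
import socket


def origin_connect_status(host: str, port: int = 443, timeout: float = 10.0) -> int:
    """Return 200 if a TCP connection to the origin succeeds, else 522."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 200  # connection established; the request would proceed normally
    except OSError:     # in this incident, connections over the broken path timed out
        return 522      # "connection timed out" error shown to the visitor
```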
Time | Points of Presence (Colo) | Services | Description |
---|---|---|---|
2017-05-02 14:41 UTC | All EU and NA | All | Poor connectivity via affected transit provider. 522s served if this transit provider is in the path between the Cloudflare colo and the customer's origin. IMPACT START |
2017-05-02 14:46 UTC | LHR, VIE, HAM, MXP, MRS, OTP, OSL, DXB, ATL, EWR | | Network automation system disabled origin pulls via the impacted transit provider. IMPACT DOWNGRADE |
2017-05-02 14:51 UTC | DME | | Manually disabled as this colo only has transit over the affected provider. IMPACT DOWNGRADE |
2017-05-02 14:51 UTC | YUL | | Network automation system disabled origin pulls via the impacted transit provider. IMPACT DOWNGRADE |
2017-05-02 14:55 UTC | BCN | | Manually dropped colo. IMPACT DOWNGRADE |
2017-05-02 14:57 UTC | DUB, BOS, MSP, STL, YYZ, DEN, ORD, DFW, SJC, SEA, LAX, FRA | | Network automation system disabled origin pulls via the impacted transit provider. IMPACT DOWNGRADE |
2017-05-02 14:57 UTC | All | | Transit issue essentially resolved; the network automation system was able to take a large number of actions at this time. IMPACT DOWNGRADE |
2017-05-02 14:59 UTC | KBP | | Manually dropped colo. |
2017-05-02 15:24 UTC | BCN, DME | | Anycast re-enabled manually. IMPACT END |
The root cause was resolved by our upstream transit provider. During the incident we disabled origin pulls routed over the affected provider, and re-enabled them once the issue had been resolved.
We are investigating localising our network automation so that, if a provider outage leaves a colo unreachable, the colo can re-route itself to avoid further impact.
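As a sketch of what that localisation could look like, purely illustrative and with an assumed function name, threshold, and fallback rule rather than a description of our system, each colo could fall back to a local decision when the central collectors are unreachable:

```python
# Illustrative only: a per-colo fallback decision with assumed names and threshold.

LOCAL_LOSS_THRESHOLD = 0.30  # hypothetical packet-loss fraction for acting locally


def local_reroute_decision(loss_by_provider: dict[str, float],
                           collectors_reachable: bool) -> set[str]:
    """Return the transit providers this colo should stop using for origin pulls.

    While the central automation is reachable it stays in charge, so the colo
    takes no local action. If the colo is cut off from the collectors, it acts
    on its own probe results instead of waiting for instructions that may
    never arrive.
    """
    if collectors_reachable:
        return set()
    return {provider for provider, loss in loss_by_provider.items()
            if loss >= LOCAL_LOSS_THRESHOLD}


# A colo isolated behind the failing provider decides on its own to stop using it.
print(local_reroute_decision({"transit-a": 0.98, "transit-b": 0.01},
                             collectors_reachable=False))   # -> {'transit-a'}
```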