One of our transit providers experienced close to 100% connectivity loss between most EU and NA locations, and we also saw some loss over this provider between the US West and East coasts. Because packet loss increased so quickly, much of the data our network automation system needs to function properly was never reported back to our collectors. Acting on the partial data it did receive, the system attempted to disable origin pulls via the affected provider and shift traffic to other providers, but with many destinations unreachable, the desired actions couldn't be taken in time to avert impact.
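To make that failure mode concrete, here is a minimal sketch, with assumed names, thresholds, and a quorum rule that are purely illustrative and not our actual implementation, of how this kind of threshold-based automation can stall when its own telemetry travels over the broken path:

```python
# Hypothetical sketch of loss-threshold automation; names and thresholds are
# illustrative, not Cloudflare's real system.

LOSS_THRESHOLD = 0.30       # packet-loss fraction at which a provider is considered bad
MIN_REPORT_FRACTION = 0.75  # fraction of colos that must report before acting


def evaluate_provider(reports: dict[str, float], total_colos: int):
    """Decide what to do about a transit provider.

    `reports` maps colo -> measured packet loss toward origins via the
    provider. If too few colos managed to reach the collectors (as in this
    incident, where the telemetry itself was carried over the failing
    provider), the decision is deferred rather than taken on partial data.
    """
    if len(reports) < MIN_REPORT_FRACTION * total_colos:
        return "defer", []

    lossy = sorted(colo for colo, loss in reports.items() if loss >= LOSS_THRESHOLD)
    return ("disable_origin_pulls", lossy) if lossy else ("no_action", [])


# Only 3 of 20 colos got their measurements back to the collectors, so the
# automation defers instead of disabling origin pulls via the provider.
print(evaluate_provider({"LHR": 0.95, "FRA": 0.90, "DME": 0.99}, total_colos=20))
```

Pushing the resulting actions out to colos that were themselves unreachable over the affected provider faced the same underlying problem.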
This incident affected all traffic routed over this provider. It manifested as 522 response codes served when we were unable to reach customer origin servers, and as general reachability issues for visitors whose traffic was routed over this network and who therefore could not reach our edge network.
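For context, a 522 is the error we serve when a connection to the customer's origin cannot be established in time. A rough illustration of that mapping, not our proxy code and with placeholder host, port, and timeout values:

```python
# Rough illustration of how a failed origin connection becomes a 522 for the
# visitor; this is not Cloudflare's proxy code.
import socket


def origin_connect_status(host: str, port: int = 443, timeout: float = 10.0) -> int:
    """Return 200 if a TCP connection to the origin succeeds, else 522."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return 200  # connection established; the request would proceed normally
    except OSError:     # in this incident, connections over the broken path timed out
        return 522      # "connection timed out" error shown to the visitor
```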
Time | Points of Presence (Colo) | Services | Description |
---|---|---|---|
2017-05-02 14:41 UTC | All EU and NA | All | Poor connectivity via affected transit provider. 522s served if this transit provider is in the path between the Cloudflare colo and the customer's origin. IMPACT START |
2017-05-02 14:46 UTC | LHR, VIE, HAM, MXP, MRS, OTP, OSL, DXB, ATL, EWR | | Network automation system disabled origin pulls via the impacted transit provider. IMPACT DOWNGRADE |
2017-05-02 14:51 UTC | DME | | Manually disabled as this colo only has transit over the affected provider. IMPACT DOWNGRADE |
2017-05-02 14:51 UTC | YUL | | Network automation system disabled origin pulls via the impacted transit provider. IMPACT DOWNGRADE |
2017-05-02 14:55 UTC | BCN | | Manually dropped colo. IMPACT DOWNGRADE |
2017-05-02 14:57 UTC | DUB, BOS, MSP, STL, YYZ, DEN, ORD, DFW, SJC, SEA, LAX, FRA | | Network automation system disabled origin pulls via the impacted transit provider. IMPACT DOWNGRADE |
2017-05-02 14:57 UTC | All | | Transit issue essentially resolved; the network automation system was able to take a large number of actions at this time. IMPACT DOWNGRADE |
2017-05-02 14:59 UTC | KBP | | Manually dropped colo. |
2017-05-02 15:24 UTC | BCN, DME | | Anycast re-enabled manually. IMPACT END |
The root cause was resolved by our upstream transit provider. During the incident we disabled origin pulls routed over the affected provider, and re-enabled them once the issue had been resolved.
We are investigating localising our network automation so that, if a provider outage leaves a colo unreachable, the colo can re-route itself to avoid further impact.
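As a sketch of what that localisation could look like, purely illustrative and with an assumed function name, threshold, and fallback rule rather than a description of our system, each colo could fall back to a local decision when the central collectors are unreachable:

```python
# Illustrative only: a per-colo fallback decision with assumed names and threshold.

LOCAL_LOSS_THRESHOLD = 0.30  # hypothetical packet-loss fraction for acting locally


def local_reroute_decision(loss_by_provider: dict[str, float],
                           collectors_reachable: bool) -> set[str]:
    """Return the transit providers this colo should stop using for origin pulls.

    While the central automation is reachable it stays in charge, so the colo
    takes no local action. If the colo is cut off from the collectors, it acts
    on its own probe results instead of waiting for instructions that may
    never arrive.
    """
    if collectors_reachable:
        return set()
    return {provider for provider, loss in loss_by_provider.items()
            if loss >= LOCAL_LOSS_THRESHOLD}


# A colo isolated behind the failing provider decides on its own to stop using it.
print(local_reroute_decision({"transit-a": 0.98, "transit-b": 0.01},
                             collectors_reachable=False))   # -> {'transit-a'}
```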