Elevated 503 error rates
Incident Report for Percy
Postmortem

We would like to share some details regarding the latency and outage event that occurred last week, which caused instability in our APIs and affected Percy’s overall availability.

This is our first public post-mortem for an event. Especially given the length and impact of this outage, we feel it’s important to acknowledge it and share more details: how it impacted our customers, how we addressed the issues, and the steps we are taking to improve our systems and processes going forward.

Summary

Starting on Monday, October 28 at 4:58 pm US/Pacific, and continuing until Tuesday, October 29 at 5:20 pm US/Pacific, we experienced a critical increase in API latency and in the rate of 503 responses, which rendered many Percy API endpoints slow or unusable. Latency returned to normal on Tuesday at 5:20 pm, and the core issues were fully remediated by 10:30 pm US/Pacific.

Details

The root cause was related to a cluster upgrade we performed on Monday to provide more CPU and memory capacity in our Kubernetes node pools for running services. As part of the cluster upgrade, some networking-heavy background workloads were moved into the cluster and co-located on the same nodes as API-related services.

Immediately after the workloads were co-located, our monitoring alerted us to increased CPU load and latency issues. We were able to quickly correct the CPU load issue, and it returned to normal levels. Unfortunately, API latency also happened to decrease at the same time as work-day traffic tapered off, which led us to assume the issue was resolved. During the evening, latency would temporarily spike even while CPU and memory load remained normal. Our monitoring was not configured to alert us to this particular kind of spiky latency issue while the system was not under normal load. In addition, that evening we were working to clear a backlog of jobs that had accumulated due to the latency issues by increasing throughput to handle it.
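To illustrate the monitoring gap, a latency-only alert along the following lines would have paged us on the evening spikes even while CPU and memory looked healthy. This is a hypothetical sketch for illustration (the thresholds, window size, and function names are made up), not our actual monitoring configuration:

```python
# Hypothetical sketch of a latency-only alert: it keys off p95 latency alone,
# so sustained spikes still page someone even when CPU and memory are normal.
from collections import deque
from statistics import quantiles

WINDOW = 300            # latency samples to keep (e.g. last 5 minutes at 1/s)
P95_THRESHOLD_MS = 500  # page if p95 stays above this
SUSTAINED_CHECKS = 3    # ...for this many consecutive evaluations

samples = deque(maxlen=WINDOW)
breaches = 0

def record_latency(ms: float) -> None:
    """Called by the API layer for every request it serves."""
    samples.append(ms)

def evaluate_alert() -> bool:
    """Run periodically (e.g. once a minute) by the monitoring loop."""
    global breaches
    if len(samples) < 30:                 # not enough data to judge
        return False
    p95 = quantiles(samples, n=20)[-1]    # 95th percentile of the window
    breaches = breaches + 1 if p95 > P95_THRESHOLD_MS else 0
    return breaches >= SUSTAINED_CHECKS   # True -> page the on-call engineer
```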

On Tuesday morning, as traffic increased to US morning levels, we were again alerted that API latency was increasing significantly and that we were occasionally serving large bursts of HTTP 503 responses. We quickly identified that SQL queries were slow across the board, and that the bursts of 503 responses were caused by API servers becoming overloaded with long-running requests, which then correctly failed health checks and were restarted.
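As a simplified illustration of that failure mode (hypothetical code, not our actual API server), a health endpoint that shares its worker pool with slow requests stops answering probes once every worker is tied up, so the orchestrator restarts the instance even though the process itself is fine:

```python
# Hypothetical sketch: when health checks share workers with slow requests,
# sustained latency makes the probe time out and the server gets restarted.
import time
from flask import Flask

app = Flask(__name__)

@app.route("/slow")
def slow():
    # Simulates a request held up by slow SQL or Redis round-trips.
    time.sleep(30)
    return "done"

@app.route("/healthz")
def healthz():
    # Fast on its own, but if all workers are stuck in /slow,
    # the probe never gets a worker and times out.
    return "ok"

if __name__ == "__main__":
    # One worker, no threading: a single in-flight /slow request
    # blocks /healthz until it finishes.
    app.run(threaded=False, processes=1)
```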

Unfortunately, a cascading failure occurred whenever a single API server failed, creating a DDoS-like effect on the remaining healthy servers. We have internal retry logic for some requests in our load balancing infrastructure, and many of our SDKs are also configured to retry requests that receive 5XX responses. When an API server became overloaded and failed, new traffic was rerouted to the remaining healthy servers, which would in turn quickly become overloaded with slow requests and also fail, cascading from one API server to the next. This caused intermittent server restarts and bursts of 503 responses until the latency issue was resolved.
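For context, a retry policy with capped attempts, exponential backoff, and jitter is the kind of configuration that limits this amplification. The sketch below is illustrative only (the client function and constants are made up); it is not our SDK or load balancer code:

```python
# Hypothetical sketch of a 5XX retry policy that avoids amplifying an outage:
# capped attempts, exponential backoff, and full jitter spread retries out
# instead of hammering the surviving servers all at once.
import random
import time
import requests

MAX_ATTEMPTS = 3
BASE_DELAY_S = 0.5
MAX_DELAY_S = 8.0

def get_with_backoff(url: str) -> requests.Response:
    resp = None
    for attempt in range(MAX_ATTEMPTS):
        try:
            resp = requests.get(url, timeout=10)
            if resp.status_code < 500:
                return resp            # success or a non-retryable 4XX
        except requests.RequestException:
            if attempt == MAX_ATTEMPTS - 1:
                raise                  # out of attempts; surface the error
        # Exponential backoff with full jitter before the next attempt.
        delay = min(MAX_DELAY_S, BASE_DELAY_S * (2 ** attempt))
        time.sleep(random.uniform(0, delay))
    return resp  # give up and return the final 5XX to the caller
```

The jitter is the important part in this scenario: it spreads retries out over time instead of sending them all at the moment a server drops out of rotation.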

Early in the incident, our team convened an ops “war room” call to triage the compounding issues and identify how to resolve them. We identified that while CPU and memory usage were well within bounds, SQL query latency and Redis latency had increased at the same time. We began to understand that the issue was likely unrelated to our database or to new queries we had introduced the previous day, and was instead likely networking related. Though we didn’t yet understand the full scope of the root cause, we identified early on that moving our API workloads back to their own isolated nodes would likely fix the networking latency issues.

While we identified this solution early on, before we fully understood the root cause, it unfortunately took us another 10 hours to implement it and get the issue fully under control. This was due to another set of networking issues that manifested each time we tried to move the API workloads to their own pool of resources (normally a routine operation for us), which prevented connectivity to our Redis infrastructure. We later determined that this was because our cluster is an older GCP cluster type that is not “VPC-native,” which requires workarounds for connectivity to Google’s Redis instances. During the outage we were forced to change our infrastructure again to work around these issues, and had to do so carefully to avoid causing extra problems for API traffic that was being served well at the time. These compounding problems significantly increased the time to remediation and impacted our ability to work around the issues efficiently.
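For reference, the kind of isolation we are describing looks roughly like the sketch below, which uses the Kubernetes Python client to pin a deployment to its own GKE node pool via a nodeSelector. The deployment, namespace, and node-pool names are hypothetical:

```python
# Hypothetical sketch (names are illustrative, not our real deployment or
# node pool): pin an API deployment to a dedicated GKE node pool so it no
# longer shares node networking with heavier background workloads.
from kubernetes import client, config

def isolate_api_workload(deployment: str, namespace: str, node_pool: str) -> None:
    config.load_kube_config()      # or load_incluster_config() inside the cluster
    apps = client.AppsV1Api()
    patch = {
        "spec": {
            "template": {
                "spec": {
                    # GKE labels every node with the name of its node pool.
                    "nodeSelector": {"cloud.google.com/gke-nodepool": node_pool}
                }
            }
        }
    }
    apps.patch_namespaced_deployment(name=deployment, namespace=namespace, body=patch)

# Example: isolate_api_workload("api-server", "production", "api-only-pool")
```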

The root cause of the latency issues on the co-located API nodes was later determined to be Docker networking saturation. The main cluster that runs our primary services was built in early 2018 and uses Docker overlay networking, which limits how much network throughput a single machine can handle. Even though the new nodes were larger and had more CPU and memory, we experienced a significant “noisy neighbor” network congestion issue while our API workloads were co-located with other networking-heavy services. This manifested as increased outbound network latency across the board, including SQL query latency and Redis latency, and was the ultimate root cause of all the other issues. Once we were able to separate the API workloads, latency returned to normal and the issue was resolved.

--

We deeply apologize for this incident, its effect on our customers, and the time it took us to resolve it. We have identified many improvements we can make to our systems and processes going forward, including improving our monitoring and alerting for latency-related issues and better configuration to prevent cascading retries. We have also begun the process of migrating to a modern VPC-native cluster that does not use Docker overlay networking. This GCP cluster type has better networking performance and removes a class of connectivity issues we encountered that slowed remediation.

This was a difficult event for us and our users, and we didn't live up to the expectations we have for serving Percy customers with speed and stability. We appreciate your patience with us during this event and we’re working hard to improve our processes and systems to prevent outages like this in the future.

Posted Nov 05, 2019 - 10:31 PST

Resolved
Operations have returned to normal. Percy has been stable and responsive for the last 12 hours.

Apologies again for the disruption. We will be issuing further details on the cause, impact, and remediation of the incident in the near future.
Posted Oct 30, 2019 - 10:31 PDT
Monitoring
Overall system health has been stable and responsive for the last 6 hours, since 4:30 pm PT. We are continuing to monitor the results of the multiple fixes that are in place and will provide further updates tomorrow.

We deeply apologize for these issues and their effect on your ability to use Percy reliably. As a result of some recent infrastructure upgrades, we are experiencing non-trivial networking load issues that have affected our APIs’ responsiveness. Permanently fixing these issues, and putting systems and processes in place to avoid recurrence, is our highest priority.

We will leave this status up until we're confident the issue is fully resolved.
Posted Oct 29, 2019 - 22:29 PDT
Update
We are continuing to investigate this issue.
Posted Oct 29, 2019 - 21:42 PDT
Update
We are continuing to investigate this issue. At times our website and API may be unresponsive.
Posted Oct 29, 2019 - 14:20 PDT
Investigating
We're looking into an issue regarding elevated 503 response codes. We'll update here when we know more.
Posted Oct 29, 2019 - 08:25 PDT
This incident affected: Web and API.