The root cause of this incident was maintenance performed by our third-party Redis provider, which created a network partition with our internal Redis proxying infrastructure. All outbound Redis connections then hung indefinitely in a way that was not caught by our existing connection timeouts. This cascaded downstream to our API servers: as connection pools filled up, servers intermittently became unresponsive, failed health checks, and were restarted.
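To illustrate the failure mode described above (this is a minimal sketch, not Percy's actual code): a timeout that bounds only the *connect* step does not protect reads. If a peer accepts the connection but never responds, as can happen when a network partition cuts a proxy mid-session, a blocking read hangs forever unless a per-operation timeout is also set. Clients such as redis-py expose these separately (e.g. `socket_connect_timeout` vs `socket_timeout`); the raw-socket version looks like:

```python
import socket
import threading

def silent_server(sock):
    # Accept the connection, then never send anything --
    # simulates a partitioned proxy that keeps the TCP session open.
    conn, _ = sock.accept()
    threading.Event().wait(5)  # hold the connection open
    conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))
server.listen(1)
threading.Thread(target=silent_server, args=(server,), daemon=True).start()

# Bound the connect step; connect succeeds quickly here.
client = socket.create_connection(server.getsockname(), timeout=1)
# ALSO bound every subsequent read; without a per-operation timeout,
# recv() on a silent peer blocks indefinitely.
client.settimeout(1)
try:
    client.recv(1)  # the server never replies
    outcome = "received"
except socket.timeout:
    outcome = "read timed out"
client.close()
print(outcome)  # → read timed out
```

With only a connect timeout, every worker that issues a read against the partitioned backend is parked forever, which is how a connection pool fills up and takes the API server's request handlers with it.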
This affected all snapshot API requests, as well as some unrelated API requests on affected servers, from 1:25 am Pacific until the incident was manually resolved at 6:06 am Pacific. Thank you to all of the customers who reported this issue to us. We will continue our internal investigation and make appropriate improvements to our systems to help prevent or minimize future recurrences.
Posted Nov 15, 2018 - 07:08 PST
A fix has been implemented and we are monitoring the results. We will post a full post-mortem of this incident after our investigation is complete and we have identified steps to prevent it from recurring.
Posted Nov 15, 2018 - 06:17 PST
We have identified that the issue is related to our third-party Redis provider's maintenance this morning, which has caused a cascading failure in our ability to connect to our Redis instances. We are narrowing down the root cause and working toward a fix. We are upgrading this to a major outage.
Posted Nov 15, 2018 - 05:43 PST
We are continuing to investigate this issue.
Posted Nov 15, 2018 - 05:14 PST
We are currently investigating reports of 504 Gateway timeout issues with the Percy API.
Posted Nov 15, 2018 - 05:13 PST
This incident affected: API and Rendering infrastructure.