Postmortem

Data Ingestion Incident Postmortem


On Tuesday, August 8, 2017, we had degraded data collection for 57 minutes, from 11:17 AM PDT to 12:14 PM PDT. Data processing remained delayed until 6:30 PM PDT.

Specifically, Amplitude was completely unavailable for data collection between 11:17 AM PDT and 11:58 AM PDT (41 minutes), and was available for partial data collection between 11:58 AM PDT and 12:14 PM PDT (16 minutes).

What happened?

Today, at 11:17 AM PDT (Tuesday, August 8), a portion of the nodes in our event ingestion cluster unexpectedly and suddenly went down. This put the cluster into a state where it was unable to handle the large amount of load it was under.

At 11:58 AM PDT, we were able to bring back partial data ingestion, and we restored full data ingestion at 12:14 PM PDT.

Scope of impact

During the period that ingestion was down (August 8, 11:17 AM PDT - 11:58 AM PDT), all events sent to Amplitude were rejected. From 11:58 AM PDT until 12:14 PM PDT, we accepted a percentage of the events we received. From 12:14 PM PDT onwards, all events were accepted.

What is the extent of data loss?

There is a possibility of data loss whether you are sending data via our SDKs or HTTP API, depending on certain factors outlined below.

SDKs - Our SDKs have default logic that stores up to 1000 events on the device if the device goes offline; the device then sends the stored events once it reconnects. Once 1000 events are stored, each additional event displaces the oldest stored event. All data has been preserved for users who performed 1000 or fewer events during the 57-minute outage time frame (August 8, 2017, 11:17 AM PDT to 12:14 PM PDT) and have since reopened the app. Any user who performed more than 1000 events during the outage time frame will experience partial data loss. Users who have not yet come back online will also be missing data; however, once those users return over the coming days or weeks, up to 1000 events per app-device_id pair will be sent to Amplitude.
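To illustrate the buffering behavior described above, here is a minimal sketch of a bounded offline event queue. It is not the actual SDK source; the class and field names are illustrative, and only the 1000-event limit and oldest-event eviction mirror the behavior described.

```typescript
// Minimal sketch of a bounded offline event buffer, mirroring the SDK
// behavior described above. Illustrative only; not Amplitude SDK code.
interface BufferedEvent {
  eventType: string;
  time: number;     // epoch milliseconds
  deviceId: string;
}

class OfflineEventBuffer {
  private readonly events: BufferedEvent[] = [];

  constructor(private readonly maxEvents: number = 1000) {}

  add(event: BufferedEvent): void {
    // Once the buffer is full, each new event displaces the oldest one,
    // which is why users with more than 1000 events in the outage window
    // lose their earliest events.
    if (this.events.length >= this.maxEvents) {
      this.events.shift();
    }
    this.events.push(event);
  }

  // Called when the device reconnects: drain and upload the stored events.
  drain(): BufferedEvent[] {
    return this.events.splice(0, this.events.length);
  }
}
```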

HTTP API - If you are using our HTTP API and have built retry logic that continued retrying until after data collection recovered (Tuesday, August 8, 12:14 PM PDT), there should be no impact on your data.

If you do NOT have retry logic built in, you can re-send the events that occurred during that time frame. PLEASE NOTE: because a subset of events was still being accepted and processed, there is a possibility of duplicated events if you were NOT sending an insert_id to Amplitude during the outage. If you were using an insert_id, we will deduplicate all events that share the same insert_id. Unfortunately, there is no way for us to fix duplicates if you choose to re-send events without an insert_id.
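As a rough illustration of the retry and deduplication guidance above, the sketch below retries a failed upload with exponential backoff and attaches a stable insert_id to each event so that resends can be deduplicated. Treat the endpoint URL, request format, and function names as assumptions made for this sketch, and consult the HTTP API documentation for the exact details.

```typescript
// Illustrative retry logic for the HTTP API. The endpoint and field names are
// assumptions for this sketch; consult the HTTP API docs for the exact format.
interface AmplitudeEvent {
  event_type: string;
  user_id: string;
  time: number;      // epoch milliseconds
  insert_id: string; // stable per-event ID so re-sent events can be deduplicated
}

async function sendWithRetry(
  apiKey: string,
  events: AmplitudeEvent[],
  maxAttempts = 5
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      const response = await fetch("https://api.amplitude.com/httpapi", {
        method: "POST",
        body: new URLSearchParams({
          api_key: apiKey,
          event: JSON.stringify(events),
        }),
      });
      if (response.ok) return; // accepted; the insert_id makes any retry safe
    } catch {
      // Network error or collection outage: fall through and retry.
    }
    // Exponential backoff before the next attempt (1s, 2s, 4s, ...).
    await new Promise((resolve) => setTimeout(resolve, 1000 * 2 ** (attempt - 1)));
  }
  throw new Error("Event upload failed after all retry attempts");
}
```

Because each event carries the same insert_id on every retry or manual resend, duplicate deliveries can be collapsed into a single event on our side.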

Segment or mParticle - For data sent via Segment or mParticle, the missing data for the 57-minute window has been re-sent to Amplitude, based on our analysis of a sample customer set. If, however, you notice any missing data for that interval, please let us know.

How can I tell if my data was impacted?

We encourage you to check for any noticeable dips in event volume during the outage time frame. Because usage patterns differ across customers, it is difficult for us to accurately measure the impact for any one customer. For most customers, there was, on average, a 2 to 3% decrease in event totals during the 57-minute outage window.

If you are on the Enterprise plan, you can also look at the Real-time event segmentation chart for August 8th to see if there are any dips in data and compare it to the previous day using the Period over Period Comparison feature. You can also compare the data to the same time period in previous weeks.
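If you prefer to run the comparison outside the charts, the sketch below shows one way to flag hours where the outage-day event count dropped noticeably against a baseline day. The hourly counts are assumed to come from your own export or chart data, and the 10% threshold is only an example value.

```typescript
// Flag hours where the outage-day event count dropped noticeably versus a
// baseline day. Counts are assumed to come from your own export or chart data.
function findSuspectHours(
  outageDayCounts: number[], // 24 hourly totals for August 8
  baselineCounts: number[],  // 24 hourly totals for August 7 (or a prior week)
  dropThreshold = 0.1        // flag drops larger than 10% (example value)
): number[] {
  const suspectHours: number[] = [];
  for (let hour = 0; hour < 24; hour++) {
    const baseline = baselineCounts[hour];
    if (baseline === 0) continue; // no baseline traffic to compare against
    const drop = (baseline - outageDayCounts[hour]) / baseline;
    if (drop > dropThreshold) suspectHours.push(hour);
  }
  return suspectHours;
}
```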

If your investigation reveals that your data was impacted by the outage, please reach out to us and include your findings so we can verify and look into possible solutions.

Why did it happen?

This incident was triggered by the unexpected unavailability of a portion of the nodes in our event ingestion cluster. This put a sudden, unexpected load on the remaining nodes, leaving the cluster unable to collect data.

We were able to apply the learnings from our previous data collection incident on July 28 and had the event ingestion mechanism partially running within 41 minutes and fully running within 57 minutes. However, we take the responsibility of being available for data collection extremely seriously, and so, on top of the changes we mentioned last week, we will be making the major changes to our event collection mechanism listed in the following section.

What are the immediate changes being made to prevent this from happening again?

We are scaling out our event ingestion cluster significantly to allow it to handle much greater load and be able to continue processing as normal should some nodes ever fail.

We are adding improved monitoring to our servers to be alerted sooner should CPU utilization become a problem on our event ingestion nodes.

We are immediately building a secondary event ingestion cluster where event data can be routed and stored should problems ever occur with our primary event ingestion cluster. This will add an additional layer of fault tolerance to the platform.
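To make the planned fault tolerance concrete, here is a minimal sketch of the kind of failover routing a secondary ingestion cluster enables: try the primary cluster first, and route the payload to the secondary if the primary is unreachable or rejects it. The endpoints and function names are hypothetical and illustrate the architecture only, not our actual implementation.

```typescript
// Conceptual sketch of primary/secondary ingestion failover. The endpoints
// are hypothetical; this illustrates the architecture, not our implementation.
const PRIMARY_INGEST_URL = "https://ingest-primary.example.com/events";
const SECONDARY_INGEST_URL = "https://ingest-secondary.example.com/events";

async function routeEvents(payload: string): Promise<void> {
  try {
    const primary = await fetch(PRIMARY_INGEST_URL, { method: "POST", body: payload });
    if (primary.ok) return;
  } catch {
    // Primary cluster unreachable; fall through to the secondary.
  }

  // Store the events on the secondary cluster so nothing is dropped while the
  // primary recovers; they can be replayed into the primary later.
  const secondary = await fetch(SECONDARY_INGEST_URL, { method: "POST", body: payload });
  if (!secondary.ok) {
    throw new Error("Both ingestion clusters rejected the payload");
  }
}
```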


Being available to collect event data reliably is a responsibility we take very seriously, and we are extremely disappointed that we had this incident today, less than two weeks after a previous data collection incident.

While we were able to rely on the learnings and changes from the previous incident to reduce event collection downtime from 7 hours then to 57 minutes this time, we realize and fully appreciate that even this is not acceptable and that we need to do better.

Consequently, we are taking the very aggressive steps highlighted above to prevent a recurrence: scaling the entire ingestion cluster significantly, enforcing extremely strict CPU utilization limits, and building out a secondary event collection system.

Thank you for being a valued Amplitude partner. If you have any questions or feedback for us, please reach out to your dedicated Success Manager or platform@amplitude.com.

Spenser Skates

CEO

Posted Aug 08, 2017 - 22:00 PDT

Resolved
Data processing is now fully recovered and up-to-date.
Posted Aug 08, 2017 - 21:48 PDT
Update
Our event servers are completely back up, and we are accepting all data.
Data processing is still delayed, and we are working on expediting its recovery.
Posted Aug 08, 2017 - 12:37 PDT
Identified
We have identified and fixed the issue with our ingestion system. We are now beginning the recovery process.
Posted Aug 08, 2017 - 12:14 PDT
Investigating
We are currently investigating this issue.
Posted Aug 08, 2017 - 11:42 PDT