Data Processing Delays
Incident Report for Amplitude

Root Cause Analysis on Amplitude Data Loss on 7/20/2021 [Work in Progress]


From 7/20/2021 11:45 AM until 7/21/2021 2:20 AM PDT, for about 15 hours, Amplitude lost the data we had received via our /batch API due to a misconfiguration of the Kafka cluster holding the accepted messages. These events arrive either through one of our partners (mParticle, Segment, Braze, and RudderStack), through direct API calls from your servers, or through the Amplitude Vacuum service.

  • We have already replayed all the Vacuum data, and it has been recovered.
  • We are working with our partners mParticle, Segment, Braze, and RudderStack to resend the events we have lost.
  • Unfortunately, we have lost the data that you sent to us through the direct API. However, with your help we can recover it if you still have the data and can replay it to us.
  • To ensure that the data is not duplicated, please resend the data to us by 8/1 at 11:59 PM PDT. Of course, you will not incur any cost for sending these events.
  • If you were using the batch API to send us data during the impacted timeframe and you are not able to replay those events, then those events are lost.

During the incident, we believed that no data had been lost and that all data would be queued and processed with some delay. However, after the incident we realized that, due to a misconfiguration of one of our queuing systems, the events that arrived on the batch API were not retained and are lost.

Root Cause Analysis

High level system setup

The Amplitude ingestion system has a number of entry points, including our high-priority streaming ingestion (aka /httpapi) and our batch data ingestion (aka /batch).

Received streaming events are fed into a streaming Kafka cluster, and batch data is fed into a separate batch Kafka cluster. The Amplitude data processing component consumes events from these Kafka clusters. Each Kafka cluster is configured with 7 days of retention and 2 replicas.

With this architecture, if any of our downstream processing modules have issues, we should not lose any data, since events are kept in the queue until the downstream processing module has recovered and can process them.
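As a rough illustration of the routing described above, here is a minimal sketch of directing ingested events to separate Kafka topics by entry point. The topic names and the routing function are hypothetical, not Amplitude's actual configuration:

```python
# Hypothetical sketch of routing ingested events to separate Kafka topics
# by entry point. Topic names are illustrative only.

STREAMING_TOPIC = "events-streaming"  # backs /httpapi (high priority)
BATCH_TOPIC = "events-batch"          # backs /batch

def topic_for_endpoint(endpoint: str) -> str:
    """Pick the Kafka topic for a given ingestion endpoint."""
    routes = {
        "/httpapi": STREAMING_TOPIC,
        "/batch": BATCH_TOPIC,
    }
    try:
        return routes[endpoint]
    except KeyError:
        raise ValueError(f"unknown ingestion endpoint: {endpoint}")

# With this separation, a slow downstream consumer only delays processing;
# events remain in the topic for the full retention window (7 days here).
```

The point of the separation is isolation: a problem in batch processing cannot back up the high-priority streaming path, and vice versa.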

Events and actions on 7/20/2021

On 7/20/2021 12:20 PM, our data ingestion system stopped processing data because a critical Postgres database stopped accepting UPDATE and DELETE SQL commands. We were able to resolve the issue and recover the database around 7 PM, and we were fully caught up by 7/21 2:20 AM.

At this time, based on the assumption described above, we were not aware of any data loss, so we did not mention it on the status page.

Events and actions on 7/21/2021

After we caught up on processing all the events in the queue, we realized with our customers’ help that some data was still missing from some charts. After debugging, we determined that data coming from the batch Kafka cluster was missing. Further inspection revealed that the batch Kafka cluster’s retention time had mistakenly been set to 20 minutes, so during the hours of the incident that day, we lost the queued batch data.

Sources of batch data

These are the different ways our customers may send us data through our /batch API:

  1. Amplitude Vacuum system: This is an internal system that processes files in an S3 bucket shared with our customers.
  2. Using a CDP or a partner service: The most common cases are mParticle, Segment, Braze, and RudderStack.
  3. Direct API call, usually from a locally deployed CDP: We do not have visibility into these specific implementations. A common example might be Looker.

Next steps based on your setup

First and foremost, please verify your charts, and if you are missing data in this time period, please let us know. Second, let’s go over each method of batch ingestion and review the actions we are taking and where we need your additional help:

  • Customers who are using Vacuum: This system is owned by Amplitude and we have re-ingested the events. No further action is needed from these customers.
  • Customers who are using CDP or related SaaS services: We are talking to mParticle, Segment, Braze, and RudderStack to replay your data to Amplitude. We may need your help to authorize them to do so, in which case we will reach out to you and ask for your assistance.
  • Customers who are calling our API directly or are using a locally deployed CDP and are not able to resend the data: we have lost that data and are not able to recover it.
  • Customers who are calling our API directly or are using a locally deployed CDP and can resend the data: we would need you to resend the data to us. Our support and engineering teams are available to you, and we will provide any help needed. Feel free to reach out to us and we will work with you.

    • For our deduplication system to work properly, the data has to be replayed before August 1st, and the insert_id field must be present in the API call.
    • If you did not use the insert_id field in the original API call, we can create a drop filter to remove the duplicate events after you’ve replayed the data. Please contact our Support team, who can partner with Engineering to complete this for you.
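For customers replaying events directly, here is a minimal sketch of building a /batch upload that carries insert_id for deduplication. The endpoint URL and field names follow Amplitude's public Batch Event Upload HTTP API; the API key and sample events are placeholders, and an actual replay should follow whatever guidance our Support team provides:

```python
import json
import urllib.request

AMPLITUDE_API_KEY = "YOUR_API_KEY"  # placeholder; use your project's key

def build_batch_payload(events):
    """Build a /batch upload body. Every event must carry insert_id so
    Amplitude's deduplication can drop events that already arrived."""
    for event in events:
        if "insert_id" not in event:
            raise ValueError("insert_id is required for deduplicated replay")
    return {"api_key": AMPLITUDE_API_KEY, "events": events}

def replay(events):
    """POST a batch of events to the /batch endpoint (network call; sketch only)."""
    body = json.dumps(build_batch_payload(events)).encode("utf-8")
    req = urllib.request.Request(
        "https://api2.amplitude.com/batch",
        data=body,
        headers={"Content-Type": "application/json"},
    )
    return urllib.request.urlopen(req)
```

Reusing the same insert_id values as the original sends is what allows already-ingested events to be dropped rather than double-counted.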

How to prevent the same type of errors in the future

  • We have added monitors and alarms for our Postgres databases to check sequence usage.
  • We have added monitors and alarms for the retention times of all our Kafka clusters.
  • We are designing a system to detect the majority of data loss scenarios.
  • We are designing a process to improve our communications with our customers.
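A retention monitor of the kind described above can be as simple as comparing each topic's effective retention.ms against the intended floor. Here is a minimal sketch; the topic names and values are illustrative (a real monitor would pull configs from the cluster via an admin client), but the arithmetic shows how a 20-minute retention would be flagged immediately against a 7-day expectation:

```python
MS_PER_DAY = 24 * 60 * 60 * 1000
EXPECTED_RETENTION_MS = 7 * MS_PER_DAY  # 7 days, the intended standard

def misconfigured_topics(topic_retention_ms: dict) -> list:
    """Return topics whose retention.ms falls below the expected floor.
    In this incident, the batch topic sat at 20 minutes instead of 7 days."""
    return sorted(
        topic
        for topic, retention_ms in topic_retention_ms.items()
        if retention_ms < EXPECTED_RETENTION_MS
    )

# Illustrative configs: the batch topic at the misconfigured 20 minutes.
configs = {
    "events-streaming": 7 * MS_PER_DAY,   # 604,800,000 ms: OK
    "events-batch": 20 * 60 * 1000,       # 1,200,000 ms: flagged
}
# misconfigured_topics(configs) -> ["events-batch"]
```

Alerting on this check would have caught the misconfiguration long before any downstream outage made it matter.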
Posted Jul 22, 2021 - 23:10 PDT

The issue is resolved and all systems are working as expected.
Posted Jul 21, 2021 - 01:32 PDT
We are continuing to monitor the fix.
Posted Jul 20, 2021 - 21:59 PDT
A fix has been implemented and we are monitoring the results.
Posted Jul 20, 2021 - 18:06 PDT
The issue has been identified and a fix is being implemented.
Posted Jul 20, 2021 - 14:17 PDT
We are continuing to investigate this issue.
Posted Jul 20, 2021 - 13:52 PDT
We are continuing to investigate this issue.
Posted Jul 20, 2021 - 13:25 PDT
We are continuing to investigate this issue.
Posted Jul 20, 2021 - 12:52 PDT
Our data processing systems are delayed. This incident started at 18:50 UTC.

Current status:
a) Our ingestion systems are working as expected and no data is lost. Our processing systems are not processing newly ingested data.
b) Impacted customers will see delayed metrics.

We are investigating the issue and will post an update in an hour or earlier if we identify the issue.
Posted Jul 20, 2021 - 12:22 PDT
This incident affected: Data Processing.