From 7/20/2021 11:45 AM until 7/21/2021 2:20 AM, for about 15 hours, Amplitude lost the data we had received via our /batch API due to a misconfiguration of the Kafka cluster holding the accepted messages. These events arrive either through one of our partners (mParticle, Segment, Braze, and RudderStack), by direct API call from your servers, or via the Amplitude Vacuum service.
During the incident, we believed that no data had been lost and that all data would be queued and processed with some delay. However, after the incident we realized that, due to a misconfiguration of one of our queuing systems, the events that arrived on the /batch API were never queued and are lost.
The Amplitude ingestion system has a number of entry points, including our high-priority streaming ingestion (aka /httpapi) and our batch data ingestion (aka /batch).
Received streaming events are fed into a streaming Kafka cluster, and batch data is fed into a separate batch Kafka cluster. The Amplitude data processing component consumes events from these Kafka clusters. Each Kafka cluster is configured with 7 days of retention and 2 replicas.
With this architecture, if any of our downstream processing modules has an issue, we should not lose any data: events are kept in the queue until the downstream processing module has recovered and can process them.
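As a concrete illustration of that retention setup, here is a minimal sketch using the confluent-kafka Python AdminClient; the broker address, topic name, and partition count are placeholders, not Amplitude's actual configuration:

```python
# Minimal sketch: create a batch ingestion topic with 7 days of retention
# and 2 replicas, matching the configuration described above. Broker
# address, topic name, and partition count are placeholders.
from confluent_kafka.admin import AdminClient, NewTopic

SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000  # 604,800,000 ms

admin = AdminClient({"bootstrap.servers": "batch-kafka:9092"})

topic = NewTopic(
    "batch-ingestion",        # hypothetical topic name
    num_partitions=32,        # illustrative
    replication_factor=2,     # 2 replicas, as described above
    config={"retention.ms": str(SEVEN_DAYS_MS)},
)

for name, future in admin.create_topics([topic]).items():
    future.result()  # raises if the topic could not be created
    print(f"created topic {name} with retention.ms={SEVEN_DAYS_MS}")
```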
On 7/20/2021 12:20 PM, our data ingestion system stopped processing data because a critical Postgres database stopped accepting UPDATE and DELETE SQL commands. We were able to resolve the issue and recover the database around 7 PM, and we were fully caught up by 7/21 2:20 AM.
At that time, still operating under the assumption that the queues were retaining data, we were not aware of any data loss, so we did not mention it on the status page.
After we caught up on processing all the events in the queue, we realized with our customers’ help that some data was still missing from some charts. After debugging, we figured out that the data coming from the batch Kafka cluster was missing. Further inspection revealed that the batch Kafka cluster’s retention time had mistakenly been set to 20 minutes instead of 7 days, so the batch data queued during the incident hours that day was deleted before it could be processed.
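A retention misconfiguration of this kind (1,200,000 ms instead of 604,800,000 ms, roughly 500x too short) can be caught automatically by comparing each topic's effective retention.ms against the intended value. A minimal guardrail sketch, again assuming the confluent-kafka client and the illustrative names from above:

```python
# Guardrail sketch: fail loudly if the batch topic's retention drifts
# from the intended 7 days. Names are illustrative placeholders.
from confluent_kafka.admin import AdminClient, ConfigResource

EXPECTED_RETENTION_MS = 7 * 24 * 60 * 60 * 1000

admin = AdminClient({"bootstrap.servers": "batch-kafka:9092"})
resource = ConfigResource(ConfigResource.Type.TOPIC, "batch-ingestion")

# describe_configs returns a future per resource; result() yields a
# dict mapping config names to ConfigEntry objects.
configs = admin.describe_configs([resource])[resource].result()
actual = int(configs["retention.ms"].value)

if actual != EXPECTED_RETENTION_MS:
    # A 20-minute retention (1,200,000 ms) would trip this immediately.
    raise RuntimeError(
        f"retention.ms is {actual}, expected {EXPECTED_RETENTION_MS}"
    )
```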
There are several different ways our customers may send us data through our /batch API: through one of our partners (mParticle, Segment, Braze, and RudderStack), by direct API call from your servers or a locally deployed CDP, or through the Amplitude Vacuum service.
First and foremost, please verify your charts, and if you are missing data in this time period, please let us know. Second, let’s go over each method of batch ingestion and review the actions we are taking and where we need your additional help:
Customers who are calling our API directly or are using a locally deployed CDP: we would need you to resend the data to us. Our support and engineering teams are available to you, and we will provide any help needed. Feel free to reach out to us and we will work with you; a sketch of one way to resend events is below.
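For reference, here is a hedged sketch of resending events to the /batch endpoint with Python's requests library. The API key and event payloads are placeholders, and Amplitude's Batch Event Upload documentation is the authority on required fields; including an insert_id lets Amplitude deduplicate any events that were in fact already ingested:

```python
# Sketch: replay events from the affected window to the /batch endpoint.
# API key and event payloads are placeholders.
import requests

BATCH_URL = "https://api2.amplitude.com/batch"
API_KEY = "YOUR_API_KEY"  # placeholder

events = [
    {
        "user_id": "user-12345",        # placeholder
        "event_type": "purchase",       # placeholder
        "time": 1626806700000,          # original event time, ms since epoch
        "insert_id": "purchase-abc-1",  # lets Amplitude deduplicate on resend
    },
]

resp = requests.post(BATCH_URL, json={"api_key": API_KEY, "events": events})
resp.raise_for_status()  # non-2xx responses raise an HTTPError
print(resp.json())
```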