From Monday, July 1, 2019 at 7:00 PM PDT to Tuesday, July 2, 2019 at 8:23 AM PDT, we experienced delays in our data processing system. During this incident, customers couldn’t see data ingested within the previous 4 hours in Amplitude. No data was lost during the incident.
- On Monday, July 1, 2019, we discovered that our data processing systems had skipped processing data between Sunday, June 30, 2019 10:20 PM PDT and 10:55 PM PDT while recovering from AWS Availability Zone (AZ) networking issues.
- On Monday, July 1, 2019, our team met to finalize the plan for reprocessing the data and scheduled an emergency maintenance window beginning Monday, July 1, 2019 at 7:00 PM PDT.
- On Monday, July 1, 2019 at 7:00 PM PDT, we stopped our data processing system and reset it to reprocess data from Sunday, June 30, 2019 10:20 PM PDT onward, skipping events that had already been processed. We initially estimated that reprocessing 24 hours of data would take 4 hours, but it took 14 hours for the data processing system to fully catch up with live data.
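The replay step above can be sketched as follows. This is a minimal illustration, not our actual implementation, and all names are hypothetical: events are replayed from a checkpoint timestamp, and events whose IDs were already handled during normal processing are skipped so the replay is idempotent.

```python
# Hypothetical sketch of checkpoint-based reprocessing with deduplication.

REPROCESS_FROM = "2019-06-30T22:20:00-07:00"  # replay checkpoint (PDT)

def reprocess(events, already_processed_ids, process):
    """Replay `events` from the checkpoint, skipping already-processed IDs."""
    replayed = 0
    for event in events:
        if event["timestamp"] < REPROCESS_FROM:
            continue  # before the checkpoint; nothing to redo
        if event["id"] in already_processed_ids:
            continue  # handled during normal processing; skip to stay idempotent
        process(event)
        already_processed_ids.add(event["id"])
        replayed += 1
    return replayed
```

Because the dedup check makes the replay idempotent, the pipeline can be reset to an earlier point without producing duplicate events in charts.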
Scope of the Impact
- Due to delayed data processing during the emergency maintenance window, between Monday, July 1, 2019 at 7:00 PM PDT and Tuesday, July 2, 2019 at 8:23 AM PDT customers couldn’t see data ingested within the previous 4 hours in Charts or User Activity in Amplitude. No data was lost during the incident.
- Data between Sunday, June 30, 2019 10:20 PM PDT and 10:55 PM PDT was missing from Amplitude charts until Monday, July 1, 2019 at 7:00 PM PDT.
- In cases where user properties were not sent explicitly on an event, events from Sunday, June 30, 2019 10:20 PM PDT to 10:55 PM PDT that were processed a day late may reflect the latest synced user properties rather than the values that were current at ingestion time.
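The user-property effect above can be illustrated with a minimal sketch. It assumes processing-time resolution of user properties (events without explicit properties are enriched from a profile store when they are processed, not when they are ingested); all names are hypothetical.

```python
# Hypothetical illustration: events without explicit user properties are
# enriched from the profile store at *processing* time. An event processed
# a day late therefore picks up whatever properties were synced most
# recently, not the values current when the event was ingested.

def enrich(event, profile_store):
    """Attach user properties to an event at processing time."""
    if "user_properties" not in event:
        # Resolved at processing time, not ingestion time.
        event["user_properties"] = dict(profile_store[event["user_id"]])
    return event
```

If a user’s properties were synced between ingestion and (delayed) processing, the enriched event carries the newer values; events that carried their properties explicitly are unaffected.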
Why did data processing delays last so long?
Our data processing system has some dependencies that don’t autoscale. We overestimated how much spare provisioned capacity we had, so recovery went slower than we expected.
What are we doing to prevent it from happening again?
There’s a lot to learn and improve on from this incident, and we’d like to share a few of the specific actions we have taken, or are taking, to avoid similar incidents in the future.
- We fixed the failure-handling code that caused events to be skipped.
- We are implementing monitors to catch missing-data issues immediately.
- We are improving our on-call procedures for handling AWS AZ networking issues.
- We created a team to benchmark and improve the maximum throughput of our data processing system, so we can shorten recovery times and estimate them more accurately.
- We are investigating better ways of reprocessing data that minimize the impact on live data processing.
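A missing-data monitor like the one described above can be sketched as a comparison of ingested versus processed event counts per time bucket; any bucket where processing lags beyond a tolerance triggers an alert. This is an illustrative sketch only, with hypothetical names and thresholds.

```python
# Hypothetical missing-data monitor: flag time buckets where processed
# event counts fall more than `tolerance` behind ingested counts.

def find_gaps(ingested_counts, processed_counts, tolerance=0.01):
    """Return time buckets where processed counts lag ingested counts."""
    gaps = []
    for bucket, ingested in ingested_counts.items():
        processed = processed_counts.get(bucket, 0)
        if ingested > 0 and (ingested - processed) / ingested > tolerance:
            gaps.append(bucket)
    return sorted(gaps)
```

Running a check like this on small, fixed buckets would have surfaced the June 30 gap within minutes instead of discovering it the next day.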
We understand that our customers rely on us to reliably collect and process data about their products, and we apologize for this incident and any complications it caused. We will do everything we can to improve our processes and ensure that you can continue to rely on Amplitude.