Emergency Maintenance

Scheduled Maintenance Report for Amplitude

Postmortem

From Monday July 1, 2019 7 PM PDT to Tuesday July 2, 2019 8:23 AM PDT, we experienced delays in our data processing system. During this incident, customers couldn’t see recently ingested data (up to the last 4 hours) in Amplitude. No data was lost during the incident.

What happened:

On Monday July 1, 2019, we discovered that our data processing systems had skipped processing data between Sunday June 30, 2019 10:20 PM PDT and 10:55 PM PDT while recovering from AWS AZ networking issues.
On Monday July 1, 2019, our team met to finalize the plan to reprocess the data and scheduled a maintenance window starting Monday July 1, 2019 at 7 PM PDT.
On Monday July 1, 2019 at 7 PM PDT, we stopped our data processing system and reset it to start processing data from Sunday June 30, 2019 10:20 PM PDT, skipping events that had already been processed. We initially estimated that reprocessing 24 hours of data would take 4 hours, but it took 14 hours for the data processing systems to catch up fully with live data.
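The reset-and-skip approach described above can be illustrated with a minimal sketch. This is not Amplitude's actual pipeline: the event shape, the `replay_unprocessed` function, and the in-memory `processed_ids` set are all hypothetical, chosen only to show how a replay from an earlier point can skip events that were already handled.

```python
from typing import Iterable, Iterator, Set, Tuple

# Hypothetical event shape: (event_id, unix_timestamp_seconds).
Event = Tuple[str, int]


def replay_unprocessed(
    log: Iterable[Event], start_ts: int, processed_ids: Set[str]
) -> Iterator[Event]:
    """Replay events from start_ts onward, skipping ones already processed.

    processed_ids is assumed to hold the IDs of events handled before the
    reset; it is updated in place so a second replay does not redo work.
    """
    for event_id, ts in log:
        if ts < start_ts:
            continue  # before the replay window
        if event_id in processed_ids:
            continue  # already processed, skip
        processed_ids.add(event_id)
        yield (event_id, ts)


# Usage: event "c" was processed before the reset, so only "b" and "d" replay.
log = [("a", 5), ("b", 10), ("c", 11), ("d", 12)]
seen = {"c"}
replayed = list(replay_unprocessed(log, start_ts=10, processed_ids=seen))
```

Note that even skipped events still have to be read and checked, which is one reason a replay over 24 hours of data can cost more than the size of the missing window suggests.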

Scope of the Impact

Due to delayed data processing during the emergency maintenance window from Monday July 1, 2019 7 PM PDT to Tuesday July 2, 2019 8:23 AM PDT, customers couldn’t see recently ingested data (up to the last 4 hours) in Charts or User Activity in Amplitude. No data was lost during the incident.
Data between Sunday June 30, 2019 10:20 PM PDT and 10:55 PM PDT was missing from Amplitude charts until Monday July 1, 2019 7 PM PDT.
In cases where a user property was not sent explicitly with the event, events between Sunday June 30, 2019 10:20 PM PDT and 10:55 PM PDT, which were processed one day later, could carry the latest synced user properties rather than the values that were current when the event occurred.

Why did data processing delays last so long?

Our data processing system has some dependencies that don’t auto scale. We overestimated the extra provisioned capacity we had, so recovery went slower than we expected.
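To see why spare capacity matters so much for recovery time, here is a simplified back-of-the-envelope model. It is an illustration only, not Amplitude's capacity model: it assumes a constant live ingest rate and a constant processing throughput expressed as a multiple (`speedup`, a hypothetical parameter) of that rate. Because live data keeps arriving during the replay, only the throughput above 1x goes toward draining the backlog.

```python
def catch_up_hours(backlog_hours: float, speedup: float) -> float:
    """Hours needed to drain a backlog while live data keeps arriving.

    backlog_hours: amount of data to reprocess, in hours of ingest.
    speedup: processing throughput as a multiple of the live ingest rate.
    Solves speedup * T = backlog_hours + T for T.
    """
    if speedup <= 1.0:
        raise ValueError("throughput must exceed the live ingest rate")
    return backlog_hours / (speedup - 1.0)


def implied_speedup(backlog_hours: float, observed_hours: float) -> float:
    """Throughput multiple implied by an observed catch-up time."""
    return 1.0 + backlog_hours / observed_hours


# Under this model, a 4-hour estimate for 24 hours of data implies 7x throughput,
# while a 14-hour recovery implies roughly 2.7x.
estimated = implied_speedup(24, 4)   # 7.0
observed = implied_speedup(24, 14)   # about 2.71
```

The model ignores the cost savings from skipped events and any throttling in downstream dependencies, so the numbers are indicative at best. It does show how a modest overestimate of headroom turns into a much longer recovery.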

What are we doing to prevent it from happening again?

There’s a lot to learn and improve on from this incident, and we’d like to share a few of the specific actions we’re taking or have already taken to avoid future incidents like it.

We fixed the problematic failure handling code that resulted in events getting skipped.
We are implementing monitors to catch the missing data issue immediately.
We are improving on-call steps to handle AWS AZ networking issues.
We created a team to improve and benchmark the maximum throughput of our data processing system, so that we can reduce recovery times and estimate them more accurately.
We are going to look into better ways of reprocessing data to minimize the impact on data processing delays.
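As an illustration of what a missing-data monitor might check, the sketch below flags any stretch of time in which no events were processed for longer than a threshold. The `find_gaps` function, its inputs, and the 5-minute threshold are hypothetical; a production monitor would more likely compare ingested and processed counts per time bucket.

```python
from typing import List, Tuple


def find_gaps(timestamps: List[int], max_gap_seconds: int) -> List[Tuple[int, int]]:
    """Return (gap_start, gap_end) pairs where consecutive processed-event
    timestamps are more than max_gap_seconds apart."""
    ordered = sorted(timestamps)
    gaps = []
    for prev, cur in zip(ordered, ordered[1:]):
        if cur - prev > max_gap_seconds:
            gaps.append((prev, cur))
    return gaps


# Usage: events every 60 seconds, with one 35-minute hole (120 s to 2220 s).
processed = [0, 60, 120, 2220, 2280]
gaps = find_gaps(processed, max_gap_seconds=300)
```

A check like this only works for streams with steady traffic. For low-volume projects, a quiet period is normal and would need a different signal.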

We understand that our customers rely on us to reliably collect and process data about their product, and we apologize for the incident and any complications it has caused. We will do everything we can to improve our processes to ensure that you can rely on Amplitude in the future.

Posted Jul 18, 2019 - 14:06 PDT

Completed

All our data processing systems have caught up and real-time data is available.

No data was lost.

Posted Jul 02, 2019 - 08:53 PDT

Update

Our data processing systems are continuing to catch up and we expect data to be caught up by 7/2/2019 15:00 UTC. No data has been lost and our ingestion systems are working as expected.

Posted Jul 02, 2019 - 05:12 PDT

Update

The updated ETA for data processing to fully recover is 7/2/2019 10 AM UTC.

Posted Jul 01, 2019 - 21:00 PDT

In progress

Scheduled maintenance is currently in progress. We will provide updates as necessary.

Posted Jul 01, 2019 - 19:00 PDT

Scheduled

We will be performing emergency maintenance to reprocess ingested data from 7/1/2019 5:20 AM UTC to 7/1/2019 5:56 AM UTC.

As a result, data processing will be delayed significantly and real-time data will be unavailable for querying until processing has recovered. Data collection is not affected.

We are currently estimating 4 hours for data processing to fully recover and will provide updates if this changes.

Posted Jul 01, 2019 - 18:35 PDT

This scheduled maintenance affected: Data (Data Processing).