Web reporting outage and data processing delays
Incident Report for Amplitude
Resolved
This incident has been resolved.
Posted Jan 12, 2016 - 10:28 PST
Update
All systems have been fully restored to normal operations. Thank you for being patient with us throughout this incident. We have tested platform features extensively and the dashboard, raw data services, and Redshift are running as expected.
Please note that saved behavioral cohorts will need to be refreshed by the end user to incorporate up-to-date information.
Posted Jan 11, 2016 - 15:40 PST
Monitoring
Dashboard:
- Our estimate for the dashboard to resume normal operations is 12 hours from now (Jan. 11 6:00 PM PST).
- Dashboards are up to date with data through Jan. 8 UTC.
- We are still validating data for Jan. 9 and 10 UTC.
- Live data is being processed in real-time as of Jan. 11 UTC.

Raw data:
- All data (up to real-time) is available via the Export API.
- Going forward, all data will be available via the Export API ~1 hour after the end of the hour.
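
As a rough illustration of the hourly schedule described above, the sketch below estimates when the hour containing a given event should become exportable. The function name `export_available_at` and the fixed one-hour delay are assumptions made for illustration (the delay is stated only as "~1 hour"); this is not part of the Export API itself.

```python
from datetime import datetime, timedelta, timezone

# Assumed delay after the end of each hour; the update states "~1 hour".
EXPORT_DELAY = timedelta(hours=1)

def export_available_at(event_time: datetime) -> datetime:
    """Estimate when the hour containing event_time becomes exportable."""
    # Truncate to the start of the hour, then step to the end of that hour.
    hour_start = event_time.replace(minute=0, second=0, microsecond=0)
    hour_end = hour_start + timedelta(hours=1)
    return hour_end + EXPORT_DELAY

event = datetime(2016, 1, 11, 14, 25, tzinfo=timezone.utc)
print(export_available_at(event).isoformat())  # 2016-01-11T16:00:00+00:00
```

Under these assumptions, an event logged at 14:25 UTC belongs to the 14:00-15:00 hour and would be exportable at about 16:00 UTC.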

Redshift:
- Redshift data is ~12 hours behind real-time.
- We estimate that Redshift will catch up to normal operations by 6 PM PST.
Posted Jan 11, 2016 - 05:58 PST
Update
Dashboard:
- Our estimate for the dashboard to resume normal operations is still 23 hours from now (Jan. 11 9:00 PM PST).
- Dashboards are up to date with data through Jan. 8 UTC.
- We are still processing data for Jan. 9 and 10 UTC.
- Live data is being processed in real-time as of Jan. 11 UTC.

Raw data:
- All data (up to real-time) is available via the Export API.
- Going forward, all data will be available via the Export API ~1 hour after the end of the hour.

Redshift:
- Redshift data is ~24 hours behind real-time.
- We estimate that Redshift will catch up to normal operations by 9 AM PST.
Posted Jan 10, 2016 - 22:17 PST
Update
We are 90% back to normal operations. You should see most events appearing in the dashboard in real time. We are continuing to process the few remaining batches of event data in the backlog. Residual cleanup is still ongoing.

Raw data up to Jan 9 23:00 UTC is now available through our export API.

Our estimate for all the data catching up and resuming normal operations is still 1-2 days (Jan 10-11 9:00 PM PST).
Posted Jan 10, 2016 - 16:55 PST
Update
We are 90% back to normal operations. You should see most events appearing in the dashboard in real time. We are continuing to process the few remaining batches of event data in the backlog. Residual cleanup is still ongoing.

Raw data up to Jan 9 12:00 UTC is now available through our export API.

Our estimate for all the data catching up and resuming normal operations is still 1-2 days (Jan 10-11 9:00 PM PST).
Posted Jan 10, 2016 - 09:27 PST
Update
We are 90% back to normal operations. You should see most events appearing in the dashboard in real time. We are continuing to process the few remaining batches of event data in the backlog. Residual cleanup is still ongoing.

Raw data up to Jan 8 20:00 UTC is now available through our export API.

Our estimate for all the data catching up and resuming normal operations is still 1-2 days (Jan 10-11 9:00 PM PST).

Thank you again for your patience and understanding.

This will be our last update for tonight. We will resume updates tomorrow around 9 AM PST.
Posted Jan 09, 2016 - 23:09 PST
Update
We are 90% back to normal operations. You should see most events appearing in the dashboard in real time. We are continuing to process the few remaining batches of event data in the backlog. Residual cleanup is still ongoing.

Raw data up to Jan 8 17:00 UTC is now available through our export API.

Our estimate for all the data catching up and resuming normal operations is still 2-3 days (Jan 10-11 9:00 PM PST).
Posted Jan 09, 2016 - 19:28 PST
Update
We are 90% back to normal operations. You should see most events appearing in the dashboard in real time. We are continuing to process the few remaining batches of event data in the backlog. Residual cleanup is still ongoing.

Raw data up to Jan 8 13:00 UTC is now available through our export API.

Our estimate for all the data catching up and resuming normal operations is still 2-3 days (Jan 10-11 9:00 PM PST).
Posted Jan 09, 2016 - 15:22 PST
Update
We are continuing to process the event data backlog. We are processing events from Jan 7 and 8, so you should start seeing events from those days appear in your dashboard and Realtime Activity tab. The processing of events from Jan 5 and 6 is now complete.

Raw data up to Jan 8 12:00 UTC is now available through our export API.

Our estimate for all the data catching up and resuming normal operations is still 2-3 days (Jan 10-11 9:00 PM PST).
Posted Jan 09, 2016 - 14:14 PST
Update
We are continuing to process the event data backlog. We are processing events from Jan 6, 7, and now 8 in parallel, so you should start seeing events from those days appear in your dashboard and Realtime Activity tab. The processing of events from Jan 5 is complete.

Raw data up to Jan 8 04:00 UTC is now available through our export API.

Our estimate for all the data catching up and resuming normal operations is still 2-3 days (Jan 10-11 9:00 PM PST).
Posted Jan 09, 2016 - 08:02 PST
Update
We are continuing to process the event data backlog. We are processing events from Jan 6, 7, and now 8 in parallel, so you should start seeing events from those days appear in your dashboard and Realtime Activity tab. The processing of events from Jan 5 is complete.

Raw data up to Jan 7 19:00 UTC is now available through our export API.

Our estimate for all the data catching up and resuming normal operations is still 2-3 days (Jan 10-11 9:00 PM PST).

Thank you again for your patience and understanding.

This will be our last update for tonight. We will resume updates tomorrow around 8 AM PST.
Posted Jan 08, 2016 - 22:24 PST
Update
We are continuing to process the event data backlog. We are processing events from Jan 5, 6, 7, and now 8 in parallel, so you should start seeing events from those days appear in your Realtime Activity tab and in your dashboard. Raw data up to Jan 7 12:00 UTC is now available through our export API. Our estimate for all the data catching up and resuming normal operations is still 3-4 days (Jan 10-11 9:00 PM PST).
Posted Jan 08, 2016 - 16:06 PST
Update
We are continuing to process the event data backlog. We are processing events from Jan 5, 6, 7 in parallel, so you should start seeing events from those days appear in your Realtime Activity tab and in your dashboard. Raw data up to Jan 7 05:00 UTC is now available through our export API. Our estimate for all the data catching up and resuming normal operations is still 3-4 days (Jan 10-11 9:00 PM PST).
Posted Jan 08, 2016 - 14:03 PST
Update
We are continuing to process the event data backlog. We are processing events from Jan 5, 6, 7 in parallel, so you should start seeing events from those days appear in your Realtime Activity tab and in your dashboard. Raw data up to Jan 7 02:00 UTC is now available through our export API. Our estimate for all the data catching up and resuming normal operations is still 3-4 days (Jan 10-11 9:00 PM PST).
Posted Jan 08, 2016 - 12:39 PST
Update
We are continuing to process the event data backlog. We are processing events from Jan 5, 6, 7 in parallel, so you should start seeing events from those days appear in your Realtime Activity tab and in your dashboard. Raw data up to Jan 6 19:00 UTC is now available through our export API. Our estimate for all the data catching up is still 3-4 days (Jan 10-11 9:00 PM PST), after which most operations will return to normal.
Posted Jan 08, 2016 - 10:26 PST
Update
We are continuing to process the event data backlog. We estimate that events from Jan 5 will appear in dashboards within 12-24 hours from now (Jan 8 9:00 PM PST). Raw data up to Jan 6 19:00 UTC is now available through our export API.
Posted Jan 08, 2016 - 10:07 PST
Update
We are continuing to process the event data backlog. We estimate that events from Jan 5 will appear in dashboards within 12-24 hours from now (Jan 8 9:00 PM PST). Raw data up to Jan 6 16:00 UTC is now available through our export API.
Posted Jan 08, 2016 - 07:48 PST
Update
Summary:
* All dashboard capabilities have been restored for data prior to Jan. 4 8:22 PM PST.
* We are continuing to process the event data backlog.
* Events in the backlog (logged after Jan. 4 8:22 PM PST) should start appearing in dashboards within 1-2 days (Jan 8-9 9:00 PM PST).
* The rest of the backlog should appear in 3-4 days (Jan 10-11 9:00 PM PST). After this point, we will process data in real time.
* We apologize that our current time estimates have been extended 1-2 days from what was communicated in the previous update.

Details:
We have resumed data processing for events in the backlog, which will transform the raw event data and make it available for our various query systems. We have a multi-tiered processing pipeline and are now working through each of the tiers, scaling up computing resources as much as we can.

We have to process events in chronological order, starting with the events immediately after the incident. Events logged from now on will be added to the end of the backlog queue. We estimate that events from Jan 5 will appear in dashboards within 1-2 days (Jan 8-9 9:00 PM PST), soon followed by events from Jan 6 and so on. We estimate that the rest of the data will be caught up within 3-4 days (Jan 10-11 9:00 PM PST). Event data will appear in Redshift a couple of hours after it appears on the dashboard.
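
The ordering described above can be sketched as a first-in, first-out queue. This is a simplified model, not our actual pipeline: the function name `drain_backlog`, the per-tick capacity, and the one-arrival-per-tick rate are all hypothetical.

```python
from collections import deque

def drain_backlog(backlog, arrivals, batches_per_tick):
    """Process a FIFO backlog oldest-first while new batches keep arriving.

    backlog: batch labels already queued, oldest first.
    arrivals: batch labels that arrive one per tick while draining.
    batches_per_tick: hypothetical processing capacity per tick.
    Returns the order in which batches were processed.
    """
    if batches_per_tick < 1:
        raise ValueError("batches_per_tick must be at least 1")
    queue = deque(backlog)
    incoming = deque(arrivals)
    processed = []
    while queue or incoming:
        # New batches join the end of the queue, never the front.
        if incoming:
            queue.append(incoming.popleft())
        for _ in range(min(batches_per_tick, len(queue))):
            processed.append(queue.popleft())
    return processed

order = drain_backlog(["Jan 5", "Jan 6", "Jan 7"], ["Jan 8"], batches_per_tick=2)
print(order)  # ['Jan 5', 'Jan 6', 'Jan 7', 'Jan 8']
```

In this model the backlog shrinks only when capacity per tick exceeds the arrival rate, which is why scaling up computing resources shortens the catch-up time.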

Thank you again for your patience and understanding.

This will be the last update for tonight. We will resume updates tomorrow around 8 AM PST.
Posted Jan 07, 2016 - 21:40 PST
Update
We have finished the recovery of the Amplitude ID tables. All dashboard functionality has been restored for data prior to Jan 4 8:22PM PST. Some residual cleanup is still needed, and new user counts for days after Jan 4 may be inflated by no more than 1% until it is completed.

We have resumed data processing for the event backlog and are closely monitoring the progress. We expect the processing to take 1-2 days to catch up (Jan 8-9 3:00PM PST). The Realtime Activity tab only shows events from the most recent hour, so you may not see events in that tab until the processing is caught up.
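
To illustrate why backlogged events do not show up in that tab, here is a minimal sketch of a most-recent-hour filter. The function name `shown_in_realtime_tab` and the example timestamps are hypothetical; the only detail taken from this update is the one-hour window.

```python
from datetime import datetime, timedelta, timezone

# "Most recent hour", as stated in this update.
REALTIME_WINDOW = timedelta(hours=1)

def shown_in_realtime_tab(event_time, now):
    """Return True if an event falls inside the most recent hour."""
    return timedelta(0) <= now - event_time <= REALTIME_WINDOW

# Illustrative timestamps only.
now = datetime(2016, 1, 7, 17, 46, tzinfo=timezone.utc)
backlogged = datetime(2016, 1, 5, 9, 0, tzinfo=timezone.utc)
fresh = now - timedelta(minutes=20)
print(shown_in_realtime_tab(backlogged, now))  # False
print(shown_in_realtime_tab(fresh, now))       # True
```

An event from Jan 5 that is processed on Jan 7 is already more than an hour old, so it falls outside the window even though it has just been processed.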
Posted Jan 07, 2016 - 17:46 PST
Update
We have finished the recovery of the Amplitude ID tables. All dashboard functionality has been restored for data prior to Jan 4 8:22PM PST. Some residual cleanup is still needed, and new user counts for days after Jan 4 may be inflated by no more than 1% until it is completed.

We have resumed data processing for the event backlog and are closely monitoring the progress. We expect the processing to take 1-2 days to catch up (Jan 8-9 3:00PM PST). The Realtime Activity tab only shows events from the most recent hour, so you may not see events in that tab until the processing is caught up.
Posted Jan 07, 2016 - 15:37 PST
Update
We are running validations on the restored Amplitude ID tables as their reconstruction approaches completion. We still expect to begin data processing on backlogged events this evening.
Posted Jan 07, 2016 - 12:42 PST
Update
We are running validations on the restored Amplitude ID tables as their reconstruction approaches completion. We expect to begin data processing on backlogged events this evening. We will be working to improve this estimate over the course of the day.
Posted Jan 07, 2016 - 10:08 PST
Update
We are running validations on the restored Amplitude ID tables as their reconstruction approaches completion. We expect to begin data processing on backlogged events this evening. We will be working to improve this estimate over the course of the day.
Posted Jan 07, 2016 - 07:54 PST
Update
We are continuing to make progress on restoring Amplitude ID tables from historical event data. Once these tables are restored we will continue data processing of all backlogged events that arrived since the incident on Monday Jan 4 at 8:22PM PST.
Posted Jan 07, 2016 - 06:02 PST
Update
We are continuing to make progress on rebuilding the Amplitude ID tables from historical event data. This will be the last update for tonight. We will provide a revised detailed time estimate tomorrow morning around 6AM PST.
Posted Jan 06, 2016 - 22:14 PST
Update
We are continuing to make progress on rebuilding the Amplitude ID tables from historical event data. We will provide a revised estimate for completion tomorrow morning around 6AM PST.
Posted Jan 06, 2016 - 20:20 PST
Update
We are continuing to make progress on rebuilding the Amplitude ID tables from historical event data. We will likely finish before our previous estimate. We will provide a revised time estimate tomorrow morning around 6AM PST.
Posted Jan 06, 2016 - 18:10 PST
Update
Summary:
* Most major dashboard capabilities have been restored for data prior to Jan. 4 8:22 PM PST. New data is not available in the dashboard, but is still being collected.
---- Restored capabilities: daily active users, segmentation, flows, funnels, revenue, retention
---- Non-restored capabilities: cohorts, microscope, user profiles
* We estimate that 48-72 hours from now (approx. Jan. 8-9 4:00 PM PST), dashboards will be fully operational with data PRIOR to Jan. 4 8:22 PM PST.
* We estimate that it will be 5-6 days (Jan 11-12 4:00 PM PST) until dashboards will be fully operational with all CURRENT data.
* We acknowledge that our initial time estimates have been extended 1-2 days from what was initially communicated. This is unacceptable and we will be more conservative in our estimates going forward.

Details:
We are currently making progress on the first of two steps of the recovery process. The first step involves reconstructing several Amplitude ID tables from historical event data. Once this step is completed, then all dashboard functionality will be restored (cohorts, microscope, etc.) with data prior to Jan. 4 8:22 PM PST. Step 1 is expected to take another 48-72 hours to complete.

Upon completion of step 1, step 2 will begin: we will resume data processing and work through all backlogged event data. All event data that was sent to our servers since the start of the incident, Monday Jan 4 at 8:22PM PST, is in the backlog. We expect there to be roughly 3 days of backlogged event data to process, as of now. We will process the backlog in chronological order. Any new events sent will be added to the end of the backlog. Once this step is completed, our systems will be back to normal status. Step 2 is expected to take 3-4 days once it’s started, depending on the size of the backlog.

Again, we cannot apologize enough for our critical mistake which has resulted in this outage. We are doing everything we can to restore services as fast as possible. Thank you again for your patience and understanding.
Posted Jan 06, 2016 - 16:20 PST
Update
We are continuing to make progress on rebuilding the Amplitude ID tables from historical event data. The rebuild job is still in progress, but taking longer than expected, and a second job is required to fully rebuild the tables. Our estimate for data processing to resume is still some time between 9AM PST Thursday and 9PM PST Thursday, tentatively. Dashboard functionality is still degraded. Data collection is still at 100%. We will post another update in 2 hours.
Posted Jan 06, 2016 - 11:54 PST
Update
We are continuing to make progress on rebuilding the Amplitude ID tables from historical event data. The rebuild job is still in progress, but taking longer than expected, and a second job is required to fully rebuild the tables. As a result, we are adjusting our estimate for data processing to resume to some time between 9AM PST Thursday and 9PM PST Thursday, tentatively. Dashboard functionality is still degraded. Data collection is still at 100%. We will post another update in 2 hours.
Posted Jan 06, 2016 - 08:42 PST
Update
We are continuing to make progress on rebuilding the Amplitude ID tables from historical event data. A MapReduce job to re-populate the tables from the recovered data was kicked off at 4:30 AM PST and is currently in progress.

We anticipate resuming data processing some time between 9AM PST Wednesday and 3PM PST Thursday, tentatively. Dashboard functionality is still degraded. Data collection is still at 100%. We will post another update in 2 hours.
Posted Jan 06, 2016 - 06:22 PST
Update
We are continuing to make progress on rebuilding the Amplitude ID tables from historical event data. We have finished recovering the data and are beginning the process of re-populating the tables. We anticipate resuming data processing some time between 6AM PST Wednesday and 3PM PST Thursday, tentatively. Dashboard functionality is still degraded. Data collection is still at 100%. This is our last update for the night. We will post another update tomorrow, Wednesday, at 6AM PST.
Posted Jan 05, 2016 - 21:12 PST
Update
We are continuing to make progress on rebuilding the Amplitude ID tables from historical event data. We anticipate resuming data processing some time between 6AM PST Wednesday and 3PM PST Thursday, tentatively. Dashboard functionality is still degraded. Data collection is still at 100%. We will post another update in 2 hours.
Posted Jan 05, 2016 - 19:16 PST
Update
We are continuing to make progress on rebuilding the Amplitude ID tables from historical event data. Dashboard functionality is still degraded. Data collection is still at 100%. We will post another update in 2 hours.
Posted Jan 05, 2016 - 17:15 PST
Update
Critical Outage as of January 4, 2016: Data processing downtime, degraded dashboards

Dear Amplitude User,

On January 4, 2016, at 8:22 PM Pacific Standard Time, we erroneously deleted four metadata tables that live on DynamoDB. These tables contained the following pieces of information:
* Internal configuration data for processing event data
* File metadata that our query engine uses for querying data
* Mappings from device IDs to their Amplitude IDs
* The first event seen for a given Amplitude ID.
This information is required for processing event data and querying data on the web reporting dashboards. As a result, our dashboards became unavailable and data processing was paused for all Amplitude customers. Data collection continued as it does not depend on those tables.

Our team immediately began work on recovering the tables. We were able to restore the internal configuration data table. We started a job to rebuild the file metadata table and put a fall back solution in place while that job completes. We are also actively working on recreating the two Amplitude ID tables from historical event data via a MapReduce job. In the meantime, we have contacted Amazon Web Services support to see what can be done on their side to revert the delete.
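
This letter does not describe the rebuild job in detail. As a minimal sketch of the general idea only, assuming each historical event carries hypothetical `device_id`, `amplitude_id`, and `time` fields, the two Amplitude ID tables could be re-derived in a single pass like this (plain Python standing in for a MapReduce job; the real schemas and job are not shown here):

```python
def rebuild_id_tables(events):
    """Re-derive two lookup tables from an iterable of historical events.

    Each event is a dict with hypothetical keys 'device_id',
    'amplitude_id', and 'time' (any comparable timestamp).
    """
    device_to_amplitude = {}   # device ID -> Amplitude ID
    first_event = {}           # Amplitude ID -> earliest event seen
    for event in events:
        amp_id = event["amplitude_id"]
        device_to_amplitude[event["device_id"]] = amp_id
        # Keep only the earliest event per Amplitude ID (the "reduce" step).
        seen = first_event.get(amp_id)
        if seen is None or event["time"] < seen["time"]:
            first_event[amp_id] = event
    return device_to_amplitude, first_event

history = [
    {"device_id": "dev-a", "amplitude_id": 1, "time": 20},
    {"device_id": "dev-b", "amplitude_id": 1, "time": 10},
    {"device_id": "dev-c", "amplitude_id": 2, "time": 15},
]
devices, firsts = rebuild_id_tables(history)
print(devices)            # {'dev-a': 1, 'dev-b': 1, 'dev-c': 2}
print(firsts[1]["time"])  # 10
```

Because both tables are derived purely from stored events, a rebuild of this kind has to read the full event history, which is consistent with recovery taking hours to days rather than minutes.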

Approximately three hours following the incident, we were able to bring the dashboards back up in a degraded state. The dashboards are currently viewable, but real-time activity, user timelines, Microscope, and cohort recomputation and downloads are still unavailable.

Data processing has been paused until the tables have been recovered. No new events will show up in dashboards until we can resume data processing. This means that dashboards will only show information as of January 4, 2016 8:22 PM PST.

Throughout the incident, Amplitude has continued to collect data. This means all event data coming into our servers has been and will continue to be saved, but it will not be processed until we’ve recreated the tables.

Amplitude web reporting will return to normal operation once the tables have been restored and data processing has caught up. We expect recovering the tables could take anywhere between 12 and 48 hours. After the tables have been restored, we can resume data processing, which we expect to take 2-4 days to catch up. We will be providing periodic status updates for the recovery process. Please feel free to subscribe to our updates on this incident as well.

We sincerely apologize for our dashboard and data processing downtime. Please be assured that we at Amplitude are always striving to bring you the highest quality service and support. We realize that we have fallen short of these expectations with this incident and are working hard to resolve it as soon as possible.

Thank you for your patience and understanding,

Spenser Skates
CEO
Posted Jan 05, 2016 - 14:44 PST
Update
We've brought the web dashboards back up (excluding some functionality such as user search), but data processing is still stopped until further notice (new data will not appear in dashboards). Data collection is still functioning properly.
Posted Jan 04, 2016 - 23:59 PST
Identified
We are working on resolving an issue with the web dashboards as well as the data processing pipeline caused by one of our AWS resources. Data collection is still functioning properly.
Posted Jan 04, 2016 - 20:59 PST