Using incorrect amplitude-ids during the data processing outage

Incident Report for Amplitude

Postmortem

Issue Overview

Amplitude had an outage on July 20. During the outage, some new users were assigned duplicate Amplitude IDs, causing them to share a single user profile within the same project. All new users created after 5:30pm PT on July 20 were assigned proper Amplitude IDs. However, the duplicate Amplitude IDs created during the outage stayed in place, and continued to corrupt data for the affected users, until August 6. We discovered the issue on August 6 while investigating a customer inquiry. That same day, we invalidated the shared profiles and assigned new Amplitude IDs to the affected users.

Customer Impact

The issue appears to have affected almost all Amplitude customers. It affected some of the new users created on July 20 between 11:45am and 5:30pm PT. On average, the corrupted data represents less than 0.07% of our customers’ project data from July 20 to August 6.

Root Cause

Our streaming data processor component uses a Postgres sequence to assign a new Amplitude ID to each new user. This sequence reached its maximum value, which caused the initial outage. A combination of Postgres behavior on the “nextval(sequence)” call and our own code caused the system to reuse the maximum value of the sequence until we shut the system down.
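One way to guard against this class of failure is to reject any ID that is not strictly greater than the last one issued. The following is a minimal sketch, not Amplitude's actual code: FakeSequence is a hypothetical in-memory stand-in that simply imitates the observed symptom (the maximum value being handed out repeatedly), and GuardedAllocator, allocate_id, and SequenceExhausted are illustrative names.

```python
class SequenceExhausted(Exception):
    """Raised when the ID source cannot produce a fresh value."""


class FakeSequence:
    """Hypothetical in-memory stand-in for a database sequence."""

    def __init__(self, start, max_value):
        self.next_value = start
        self.max_value = max_value

    def next_val(self):
        # Imitates the symptom described in the report: once the maximum
        # is reached, the same value is returned on every later call.
        value = min(self.next_value, self.max_value)
        self.next_value += 1
        return value


class GuardedAllocator:
    """Hands out IDs and refuses any value that is not strictly increasing."""

    def __init__(self, sequence):
        self.sequence = sequence
        self.last_issued = None

    def allocate_id(self):
        value = self.sequence.next_val()
        # A repeated (or smaller) value means the source is no longer
        # producing unique IDs, so fail loudly instead of assigning it.
        if self.last_issued is not None and value <= self.last_issued:
            raise SequenceExhausted(
                f"sequence returned {value} after {self.last_issued}"
            )
        self.last_issued = value
        return value
```

With a sequence started at 8 and capped at 10, the allocator issues 8, 9, and 10 and then raises SequenceExhausted on the fourth call rather than issuing 10 a second time. The check only covers IDs issued by a single allocator instance; it is a safety net, not a replacement for uniqueness guarantees in the database.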

We did not become aware of this side effect until we finished investigating two customer tickets on August 6. Initially, we thought it was a minor issue impacting just a few customers, but after reviewing all data from July 20 to August 6 we determined that nearly all customers were impacted.

Unfortunately, because our Kafka systems have a 7-day retention period, we have no way to fix the data already ingested from July 20 to August 6.

How Amplitude Will Prevent this Issue in the Future

As part of the original incident’s RCA, we’ve added monitoring for our sequences in Postgres.
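As an illustration of what such monitoring might look like, the sketch below computes how much of each sequence's range has been consumed and flags those above a threshold. It is an assumption-laden example, not a description of Amplitude's monitoring: the 0.8 threshold, the function names, and the sample rows are hypothetical, and only ascending sequences are handled. The query reads PostgreSQL's pg_sequences view (available since PostgreSQL 10), where last_value is NULL for a sequence that has never been used.

```python
# Query to run against PostgreSQL; executing it is left to the caller.
SEQUENCE_QUERY = """
SELECT schemaname, sequencename, last_value, min_value, max_value
FROM pg_sequences
"""


def sequence_usage(last_value, min_value, max_value):
    """Return the fraction (0.0 to 1.0) of an ascending sequence's range in use."""
    if last_value is None:  # sequence has never been used
        return 0.0
    return (last_value - min_value + 1) / (max_value - min_value + 1)


def sequences_over_threshold(rows, threshold=0.8):
    """Return (qualified_name, usage) for every row at or above the threshold.

    `rows` is an iterable of tuples in the column order of SEQUENCE_QUERY.
    """
    alerts = []
    for schema, name, last_value, min_value, max_value in rows:
        usage = sequence_usage(last_value, min_value, max_value)
        if usage >= threshold:
            alerts.append((f"{schema}.{name}", usage))
    return alerts
```

For example, given hypothetical rows such as ("public", "user_id_seq", 95, 1, 100) and ("public", "other_seq", 10, 1, 1000), only public.user_id_seq would be reported. Alerting well before the range is exhausted leaves time to widen or replace the sequence.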

Posted Aug 12, 2021 - 12:04 PDT

Resolved

We recently discovered that on July 20, starting at 11:45 AM PDT, during the previously posted incident in our data processing component, invalid Amplitude IDs were used during ingestion. This continued until 5:30 PM PT and affected, on average, 0.07% of events across all customers. Please read the postmortem for more detailed information.

Posted Jul 20, 2021 - 23:30 PDT