Amplitude had an outage on July 20. During the outage, some new users were assigned duplicate Amplitude identifiers, causing them to share the same user profile within the same project. All new users created after July 20 at 5:30pm PT were assigned proper Amplitude IDs, but the duplicate IDs created during the outage continued to corrupt data until August 6, when we discovered the issue while investigating a customer inquiry. That same day, we invalidated the shared profiles and assigned new Amplitude IDs to the affected users.
The issue appears to have affected almost all Amplitude customers, although only some of the new users created on July 20 between 11:45am and 5:30pm PT were affected. On average, the corrupted data represents less than 0.07% of our customers’ project data from July 20 to August 6.
Our streaming data processor uses a Postgres sequence to assign a new Amplitude ID to each new user. This sequence reached its maximum value, causing the initial outage. A combination of Postgres’s behavior on the “nextval(sequence)” call and our own code caused the system to re-use the maximum value of the sequence until we shut the system down.
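To make the failure mode concrete, here is a minimal sketch of the sequence-exhaustion behavior using psycopg2 against a scratch database. The sequence name and MAXVALUE are hypothetical, and the fallback described in the comments is one plausible reading of the bug, not our actual ingestion code.

```python
# A minimal sketch of Postgres sequence exhaustion, run against a scratch
# database. The sequence name and MAXVALUE are hypothetical stand-ins.
import psycopg2

conn = psycopg2.connect("dbname=scratch")
conn.autocommit = True
cur = conn.cursor()

# A sequence one step away from exhaustion.
cur.execute("CREATE SEQUENCE user_id_seq MAXVALUE 100")
cur.execute("SELECT setval('user_id_seq', 99)")

cur.execute("SELECT nextval('user_id_seq')")
print(cur.fetchone()[0])  # 100: the maximum value is handed out normally

# Postgres does not silently wrap or repeat; the next call raises:
#   ERROR: nextval: reached maximum value of sequence "user_id_seq" (100)
try:
    cur.execute("SELECT nextval('user_id_seq')")
except psycopg2.DataError as exc:
    # If application code swallows this error and falls back to a cached
    # "last ID" (one plausible reading of the bug, not our actual code),
    # every new user from this point on receives the same Amplitude ID.
    print(exc)
```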
We did not become aware of this side effect until we finished investigating two customer tickets on August 6. Initially, we thought it was a minor issue impacting just a few customers, but after reviewing all data from July 20 to August 6, we determined that nearly all customers were impacted.
Unfortunately, since our Kafka systems have a 7-day retention period, the raw events are no longer available for replay, and we do not have a way to fix the already ingested data from July 20 to August 6.
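For context, that retention window lives in each topic’s retention.ms setting. The sketch below shows one way to inspect it with the confluent-kafka AdminClient; the broker address and topic name are hypothetical placeholders.

```python
# A minimal sketch of checking a Kafka topic's retention window.
# Broker address and topic name are hypothetical placeholders.
from confluent_kafka.admin import AdminClient, ConfigResource

admin = AdminClient({"bootstrap.servers": "localhost:9092"})
resource = ConfigResource(ConfigResource.Type.TOPIC, "ingestion-events")

# describe_configs() returns one future per resource; its result maps
# config names to ConfigEntry objects.
config = admin.describe_configs([resource])[resource].result()
retention_ms = int(config["retention.ms"].value)
print(retention_ms / 86_400_000)  # 604800000 ms = 7.0 days; older
                                  # segments are deleted, not replayable
```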
How Amplitude Will Prevent This Issue in the Future
As part of the original incident’s root cause analysis (RCA), we’ve added monitoring for our sequences in Postgres.
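As a sketch of what such a check can look like, assuming Postgres 10+ (which exposes the pg_sequences view): the threshold and the print() alert below are placeholders for a real alerting pipeline, not our production monitoring.

```python
# A minimal sketch of a sequence-headroom check. Assumes Postgres 10+ for
# the pg_sequences view and ascending sequences; the threshold and the
# print() alert are placeholders for a real monitoring pipeline.
import psycopg2

ALERT_THRESHOLD = 0.90  # warn once a sequence is 90% consumed

conn = psycopg2.connect("dbname=prod")
cur = conn.cursor()
cur.execute("""
    SELECT schemaname, sequencename,
           last_value::numeric / max_value AS used_fraction
    FROM pg_sequences
    WHERE last_value IS NOT NULL
""")
for schema, name, used in cur.fetchall():
    if used >= ALERT_THRESHOLD:
        print(f"ALERT: {schema}.{name} is {used:.1%} consumed")
```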