Data collection loss from JS SDK file hosting bug
Incident Report for Amplitude
Postmortem

After further investigation, we found no statistically significant dip in event volume that can be attributed to the JS SDK error. It is highly likely that the CloudFront edge node caches kept serving the working versions of the files until the error was resolved. Customers and end users may still have seen the error if they had disabled caching for the JS SDK. As mentioned in the previous update, the error has been resolved, and we have added further checks to the existing monitoring for the SDKs.

Posted Oct 28, 2019 - 15:38 PDT

Resolved
On Friday we broke how our JS SDK files were being hosted, which caused data loss.
We fixed the issue on Monday.

Who was affected?
Customers using our JS SDK via the snippet (https://developers.amplitude.com/#installation).
Customers who use npm to include our SDK were not affected.

What happened?
We intended to increase the max-age value of the Cache-Control HTTP header on all SDK files.
In doing so, we accidentally dropped and overwrote other important headers (Content-Type and Content-Encoding), which broke how browsers loaded the files.
We fixed the issue by re-setting the correct headers.
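
As an illustration only, here is a minimal sketch of changing Cache-Control on an S3 object while carrying the other headers over explicitly. It is not the code we ran. It assumes boto3 rather than boto2, and the helper name build_copy_args is hypothetical. The sketch builds the arguments for boto3's copy_object from a head_object response, because a copy with MetadataDirective="REPLACE" keeps only the headers that are passed in. Note that copying an object onto itself this way does not carry over its ACL, so that would need to be handled separately.

```python
def build_copy_args(bucket, key, head, max_age):
    """Build kwargs for boto3's copy_object that change only Cache-Control.

    `head` is the dict returned by head_object for the same bucket and key.
    With MetadataDirective="REPLACE", S3 discards any header not listed here,
    so the existing values are copied over explicitly.
    """
    args = {
        "Bucket": bucket,
        "Key": key,
        "CopySource": {"Bucket": bucket, "Key": key},
        "MetadataDirective": "REPLACE",
        "CacheControl": f"max-age={max_age}",
        # User-defined metadata (x-amz-meta-*) is also replaced, so keep it.
        "Metadata": head.get("Metadata", {}),
    }
    # Carry over the system headers that browsers depend on, when present.
    for field in ("ContentType", "ContentEncoding",
                  "ContentDisposition", "ContentLanguage"):
        if field in head:
            args[field] = head[field]
    return args


# Hypothetical usage (requires boto3 and AWS credentials):
#   s3 = boto3.client("s3")
#   head = s3.head_object(Bucket=bucket, Key=key)
#   s3.copy_object(**build_copy_args(bucket, key, head, 86400))
```

Keeping the argument building separate from the S3 call means the result can be inspected or tested before any object is modified.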

When did this happen?
We broke the SDK file headers at approx 16:45 PDT (23:45 UTC) 2019-10-18.
We fixed the headers at approx 09:30 PDT (16:30 UTC) 2019-10-21.

What are we doing to prevent this ever happening again?
At a low level, the issue stemmed from our not understanding how boto2's set_remote_metadata method works (we use AWS CloudFront and S3 for our CDN).
We at least know not to use this API in the future.
We'll also spend some time thinking about how we can tighten up controls around the SDK files to prevent similar accidental changes in the future.
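
One possible control, sketched here under stated assumptions rather than as a description of our actual tooling, is a check that compares the headers served for an SDK file against the values browsers need. The function name find_header_problems and the values in EXPECTED are hypothetical; the real Content-Type and Content-Encoding values for the SDK files may differ.

```python
# Hypothetical expected values, assumed for illustration only.
EXPECTED = {
    "Content-Type": "application/javascript",
    "Content-Encoding": "gzip",
}


def find_header_problems(headers, expected=EXPECTED):
    """Return a list of human-readable problems; an empty list means OK."""
    # HTTP header names are case-insensitive, so normalise before comparing.
    actual = {name.lower(): value for name, value in headers.items()}
    problems = []
    for name, want in expected.items():
        got = actual.get(name.lower())
        if got is None:
            problems.append(f"{name} is missing")
        # Ignore parameters such as "; charset=utf-8" when comparing values.
        elif got.split(";")[0].strip().lower() != want.lower():
            problems.append(f"{name} is {got!r}, expected {want!r}")
    return problems
```

A check like this could run against the response headers of each hosted file right after any change to the bucket, and fail the change if the returned list is not empty.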
Posted Oct 18, 2019 - 16:30 PDT