Isolated Root Causes
Bug in our database cluster’s storage backend prevented garbage collection of stale secondary indexes.
Broken archival queue monitoring.
Impact Causality Chain
TrustedForm has three primary components:
These three components are bound together by our short-term certificate data cache with time-boxed expiration. Certificate data flows from the user's browser into the cache's database servers and expires a number of days later. When a customer claims a certificate, we place an archival job into a queue to be serviced by a background worker. The worker must complete the archival process before the certificate data expires from the cache.
Certificate data is indexed by ID, and the index value expires along with the certificate data. A bug in the database backend prevented the garbage collection of expired index values, so they slowly accumulated over time. This accumulation eventually resulted in a critical performance degradation, which led to three consequences:
As archival queue performance slowly decreased, the archival queue grew. Eventually, jobs waited in the queue longer than the expiration window of the short-term certificate cache. Archival jobs then began to fail because the certificate data had already expired, and certificates could no longer be archived correctly.
Normally, we monitor the size of the archival queue for issues such as this. However, our monitoring was silently broken.
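The failure condition described above comes down to comparing how long a newly queued job is likely to wait against how long the certificate data survives in the cache. The following is a minimal sketch of such a check, not our actual monitoring code: the names `QueueSnapshot`, `estimated_wait_seconds`, and `archival_at_risk` are hypothetical, as are the 3-day expiration window and the throughput figures used in the example.

```python
from dataclasses import dataclass


@dataclass
class QueueSnapshot:
    """Hypothetical point-in-time view of the archival queue."""
    depth: int               # jobs currently waiting
    jobs_per_second: float   # recent worker throughput


def estimated_wait_seconds(snapshot: QueueSnapshot) -> float:
    """Rough time a newly queued job waits before a worker picks it up."""
    if snapshot.jobs_per_second <= 0:
        # A stalled worker means the wait is effectively unbounded.
        return float("inf")
    return snapshot.depth / snapshot.jobs_per_second


def archival_at_risk(snapshot: QueueSnapshot,
                     cache_ttl_seconds: float,
                     safety_factor: float = 0.5) -> bool:
    """True when the estimated wait uses up too much of the cache window.

    safety_factor is an assumed margin: alert once the wait reaches
    half of the expiration window instead of waiting for jobs to fail.
    """
    return estimated_wait_seconds(snapshot) >= cache_ttl_seconds * safety_factor


# Assumed 3-day expiration window, for illustration only.
CACHE_TTL_SECONDS = 3 * 24 * 60 * 60

healthy = QueueSnapshot(depth=1_000, jobs_per_second=50.0)
degraded = QueueSnapshot(depth=2_000_000, jobs_per_second=5.0)

print(archival_at_risk(healthy, CACHE_TTL_SECONDS))
print(archival_at_risk(degraded, CACHE_TTL_SECONDS))
```

Alerting on estimated wait time, and not on raw queue size alone, ties the alarm directly to the expiration deadline that archival jobs have to beat.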
Immediately after becoming aware of this issue, our engineering staff took the following mitigation actions:
Once the problem was identified as an index leak, engineers did a live rolling repair of the database cluster's indexes.
We've taken the following steps to prevent this from happening in the future:
To resolve this class of issue categorically, our engineers are currently exploring effective methods of moving certificate data out of the short-term cache and into a temporary holding area while its archival job is pending.
Additionally, we are in the process of incrementally rewriting our archival process from the ground up on a new platform. We expect performance gains of orders of magnitude.
We are also investigating either replacing the database’s storage backend or moving to another database system altogether.
Our operations staff is presently discussing how to effectively "monitor the monitors", so that a silent monitoring failure is itself detected.
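One common way to "monitor the monitors" is a heartbeat check, sometimes called a dead man's switch: the monitor records a timestamp every time it runs, and an independent process raises an alarm if that timestamp gets too old. The sketch below only illustrates the general idea and does not describe an approach our operations staff has chosen. The class `HeartbeatWatchdog` and its parameters are hypothetical.

```python
import time
from typing import Optional


class HeartbeatWatchdog:
    """Flags a monitor as stale when it has not reported recently."""

    def __init__(self, max_silence_seconds: float):
        self.max_silence_seconds = max_silence_seconds
        self._last_beat: Optional[float] = None

    def beat(self, now: Optional[float] = None) -> None:
        """Called by the monitor each time it completes a check."""
        self._last_beat = time.time() if now is None else now

    def is_stale(self, now: Optional[float] = None) -> bool:
        """Called by an independent process to verify the monitor is alive."""
        now = time.time() if now is None else now
        if self._last_beat is None:
            # A monitor that has never reported is treated as broken.
            return True
        return (now - self._last_beat) > self.max_silence_seconds


# Explicit timestamps keep the example deterministic.
watchdog = HeartbeatWatchdog(max_silence_seconds=60)
watchdog.beat(now=100.0)
print(watchdog.is_stale(now=150.0))  # 50 seconds of silence
print(watchdog.is_stale(now=161.0))  # 61 seconds of silence
```

The key design point is that silence is treated as a failure: a monitor that breaks quietly stops sending heartbeats, and the staleness check turns that absence into an alert.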