TrustedForm Lost Certificate Videos

Incident Report for ActiveProspect

Postmortem

Customer Impact

Missing Video Replay Snapshots
Lost Certificates

Internal Impact

Sporadic inability to archive new certificates.
Certificate ingestion and claiming processes experienced increased latency.

Isolated Root Causes

Bug in our database cluster’s storage backend prevented garbage collection of stale secondary indexes.
Broken archival queue monitoring.

Impact Causality Chain

TrustedForm has three primary components:

Browser-facing certificate data ingestion endpoint/queue, "issuer".
Customer-facing certificate claiming endpoint, "claimer".
Background queues, including the certificate archival queue.

These three components are bound together by our short-term certificate data cache with time-boxed expiration. Certificate data flows from the user's browser into these database servers, and expires a number of days later. When a customer claims a certificate, we place an archival job into a queue to be serviced by a background worker. The worker must complete the archival process before the certificate data expires from the cache.

Certificate data is indexed by ID, the index value expires along with the certificate data. A bug in the database backend prevented the garbage collection of expired index values, leading to their slow accumulation over time. This accumulation eventually resulted in a critical performance degradation. This degradation led to three consequences:

Increased issuer ingestion queue latency.
Increased claimer response latency.
Decreased archival queue worker performance.

As archival queue performance slowly decreased, the size of the archival queue inversely grew. Eventually the size of the archival queue exceeded the expiration time window for the short-term certificate cache and archival jobs began to fail, leading to the inability to correctly archive certificates.

Normally, we monitor the size of the archival queue for issues such as this, however, our monitoring was silently broken.

Mitigation Efforts

Immediately after becoming aware of this issue, our engineering staff took the following mitigation actions:

Doubled the certificate cache expiration window so as to prevent further data expiration while the issue was investigated.
Tripled the number of archival queue workers to bring down the queue size.

Once the problem was identified as an index leak, engineers did a live rolling repair of the database cluster's indexes.

Remediation Steps

We've taken the following steps to prevent this from happening in the future:

Implemented alerting for stale index accumulation.
Repaired archival queue alerting.
Reconfigured our database cluster to reduce dependency on poorly performing members.

In order to categorically resolve this manner of issue, our engineers are currently exploring effective methods of moving certificate data out of the short-term cache and into a temporary holding area while its archival job is pending.

Additionally, we are in the process of incrementally rewriting our archival process from the ground up on a new platform, we expect orders of magnitude in performance gains.

We are also investigating either replacing the database’s storage backend or moving to another database system altogether.

Our operations staff is presently discussing how to effectively "monitor the monitors", in order to detect monitoring issues.

Posted May 15, 2019 - 16:30 CDT

Resolved

A bug in TrustedForm's database cluster resulted in lost certificate videos.

Posted Apr 22, 2019 - 16:25 CDT