LeadConduit database under heavy load
Incident Report for ActiveProspect
Postmortem

What happened?

At approximately 7:26pm CDT on March 10, 2015, our monitoring system notified us that LeadConduit was slow to process incoming leads. Our team diagnosed the problem: the database server was under heavy load and was not inserting new records in a timely fashion.

Once the diagnosis was made, we set out to relieve the pressure on the database so that it could recover.

This meant activating a feature we call "lead spooling", which captures incoming leads without touching the database and without performing any lead processing. While lead spooling is enabled, each submission still receives our "success" response, but that response does not include a lead ID. Leads captured this way must later be "replayed" in order to be processed. You can think of this as deferred processing for leads that arrive while spooling is active. Spooling started at 7:50pm CDT.
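To make the spooling behavior concrete, here is a minimal sketch of the idea, not LeadConduit's actual implementation. The class name, the in-memory queue, and the response fields are all assumptions made for illustration; a real spool would need durable storage.

```python
import json
from collections import deque


class LeadIntake:
    """Hypothetical intake front end illustrating spool-and-replay."""

    def __init__(self, process_lead):
        # process_lead is the normal path: it writes to the database
        # and returns a lead ID (assumed interface).
        self.process_lead = process_lead
        self.spooling = False
        self.spool = deque()  # in-memory stand-in for a durable spool

    def submit(self, lead):
        if self.spooling:
            # Capture only: no database access, no lead processing,
            # so the response cannot carry a lead ID.
            self.spool.append(json.dumps(lead))
            return {"outcome": "success"}
        lead_id = self.process_lead(lead)
        return {"outcome": "success", "lead_id": lead_id}

    def replay(self):
        """Process every spooled lead in arrival order; return the count."""
        replayed = 0
        while self.spool:
            self.process_lead(json.loads(self.spool.popleft()))
            replayed += 1
        return replayed
```

With `spooling` set to `True`, `submit` returns a success response without a `lead_id`. Calling `replay()` after the database recovers pushes the captured leads through the normal path.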

Shortly after we removed load from the database, it recovered, and we were able to turn spooling off gradually, starting with a portion of our total traffic. By 8:25pm CDT spooling was completely disabled and leads were being handled normally. All spooled leads had been replayed by 8:56pm CDT.

What was the root cause?

After the incident, we investigated the database problem. The root cause was database contention resulting from our lead deletion process, which removes old leads in groups defined by account, campaign, and day.

Some accounts have very high-volume campaigns, so the delete process has to work harder to remove each of their lead groups. Under normal conditions this workload is no problem. However, during this time period there were multiple accounts and campaigns with very high volume.

The large number of deletes caused database contention, which slowed down inserts of new leads.

What are we doing to prevent this from happening again?

Two things.

First, when we respond to high load on the database, we'll start by turning off less critical features. During this particular outage we could have turned off lead deletions first, which might have solved the problem quickly without requiring any lead spooling. This small change to our incident response process should produce a better outcome.
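The revised response order can be sketched as a simple escalation loop. This is an illustration only: the feature names other than lead deletion, the health check, and the function itself are hypothetical, and in practice these steps are decisions made by the on-call team rather than a script.

```python
# Least critical first. "nightly_reports" is a made-up placeholder;
# only lead deletion is named in this report.
SHED_ORDER = ["lead_deletion", "nightly_reports"]


def respond_to_high_load(active_features, db_is_healthy):
    """Disable less critical features one at a time; spool only as a last resort.

    active_features: mutable set of currently running feature names.
    db_is_healthy: zero-argument callable returning True once load is acceptable.
    Returns the list of steps taken, in order.
    """
    steps = []
    for feature in SHED_ORDER:
        if db_is_healthy():
            break
        if feature in active_features:
            active_features.discard(feature)
            steps.append("disable " + feature)
    if not db_is_healthy():
        steps.append("enable lead spooling")
    return steps
```

If disabling lead deletion alone restores database health, the loop stops there and spooling is never enabled.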

Second, we have established an always-on emergency overflow spooling mechanism. When a lead is submitted while our servers are under heavy load, our load balancers will automatically send it to the emergency spooling mechanism. This means that lead submissions will no longer be refused when load is high. It also automates a step that until now was manual: turning on lead spooling. When our systems run into trouble, this mechanism will provide an automatic and instant failsafe.
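The routing decision behind such an overflow mechanism might look like the sketch below. This report does not say how the load balancers detect heavy load, so the in-flight request threshold, the function name, and its parameters are all assumptions.

```python
def route_lead(lead, in_flight, max_in_flight, process, spool):
    """Send a lead to normal processing, or to the overflow spool under load.

    in_flight / max_in_flight: assumed load signal (current vs. allowed
    concurrent requests); the real signal is not described in this report.
    process: callable for the normal processing path.
    spool: list-like overflow store whose contents are replayed later.
    """
    if in_flight >= max_in_flight:
        # Overloaded: capture the lead instead of refusing the connection.
        spool.append(lead)
        return "spooled"
    process(lead)
    return "processed"
```

The key property is that the overloaded branch never rejects a submission: the lead is always either processed or captured for later replay.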

We're committed to maintaining a reliable platform for our customers and take these sorts of incidents extremely seriously. On behalf of the technical team at ActiveProspect, please accept our apologies for the problems this outage caused for you.

Posted Mar 11, 2015 - 13:00 CDT

Resolved
The backlog of leads has now been completely replayed. All LeadConduit services are back to normal. A postmortem is to follow.
Posted Mar 10, 2015 - 21:56 CDT
Update
Starting at 8:25pm CDT, full capacity was restored to LeadConduit and all leads are now being handled normally. The backlog of leads is now being processed.
Posted Mar 10, 2015 - 20:42 CDT
Monitoring
From approximately 7:30pm CDT through 7:50pm, LeadConduit performance was severely degraded. As a result, some incoming lead traffic was slow to be processed, and some inbound connections were outright refused. Starting at 7:50pm we initiated an emergency measure to capture all inbound leads for deferred processing. Those leads will be processed shortly.
Posted Mar 10, 2015 - 20:39 CDT
Identified
Our database is under heavy load and we are taking steps to remediate the problem.
Posted Mar 10, 2015 - 20:25 CDT