LeadConduit Processing Slowdowns
Incident Report for ActiveProspect
Postmortem

After further investigation, we have determined the root cause of this incident. Following LeadConduit performance issues on the preceding two days, we set about increasing our capacity by adding application servers. Because our current hosting provider does not offer infrastructure that would allow us to scale up quickly in response to increased demand, we had to set the servers up the old-fashioned way: by hand. In doing so, we inadvertently misconfigured firewall rules on some of our existing servers. That human error left those servers unable to handle traffic, and the reduced capacity caused the same application behavior we had seen the preceding two days.

So what are we doing about it?

  1. We have added additional servers to handle the needs of our growing customer base.
  2. We are moving our entire infrastructure to a cloud provider. This will afford us many advantages over our current host, one of which is the ability to scale up quickly in response to increased demand. This work has begun and will run from Q2 through Q4 2020.
  3. We will no longer rely on servers set up “by hand”. Instead, we will provision via infrastructure as code and automate application configuration. This marks a turning point in our ability to scale quickly, and a timely one, because our applications have never been more popular: we can provision additional capacity faster because we no longer have to fuss with the configuration of each individual server manually. Infrastructure as code also gives us control over every change made in production and reduces mistakes made in the heat of battle. We are making this change as part of our move to the cloud.
  4. We currently have a robust suite of production monitors that notify us when something looks strange in our production environment. This incident, along with the others from the preceding two days, identified the need for an additional monitor. That monitor has now been added. If we see this same sort of problem going forward, regardless of the root cause, we expect to be alerted. This allows us to attend to an emerging condition before it causes a problem for our customers.
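To illustrate the idea behind item 3, here is a minimal sketch (not our actual tooling) of what infrastructure as code buys us: each server's firewall rules are declared once as data, and a check can flag any server whose live rules drift from the declared state. The server names, rules, and `live_rules` data below are hypothetical.

```python
# Desired firewall state, declared as data rather than applied by hand.
DESIRED_RULES = {
    "app-01": {"allow 443/tcp from lb", "allow 22/tcp from bastion"},
    "app-02": {"allow 443/tcp from lb", "allow 22/tcp from bastion"},
    "app-03": {"allow 443/tcp from lb", "allow 22/tcp from bastion"},
}

def find_drift(desired, live):
    """Return, per server, the rules that are missing or unexpected."""
    drift = {}
    for server, want in desired.items():
        have = live.get(server, set())
        missing = want - have
        unexpected = have - want
        if missing or unexpected:
            drift[server] = {"missing": missing, "unexpected": unexpected}
    return drift

# app-02 was set up by hand and lost the rule that lets the load
# balancer reach it -- exactly the kind of error this check catches.
live_rules = {
    "app-01": {"allow 443/tcp from lb", "allow 22/tcp from bastion"},
    "app-02": {"allow 22/tcp from bastion"},
    "app-03": {"allow 443/tcp from lb", "allow 22/tcp from bastion"},
}

print(find_drift(DESIRED_RULES, live_rules))
```

Because the desired state lives in code, every change is reviewable before it reaches production, rather than being made live on an individual server.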
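The new monitor in item 4 can be sketched in the same spirit: count how many cluster members are actually serving traffic and alert when available capacity drops below a threshold, regardless of why members dropped out. The health-check data, threshold, and function name below are hypothetical, not our production monitoring code.

```python
def capacity_alert(health, min_healthy_fraction=0.75):
    """Return an alert message if too few members are healthy, else None.

    `health` maps each cluster member to True (serving traffic) or
    False (unreachable, e.g. behind a misconfigured firewall).
    """
    healthy = sum(1 for ok in health.values() if ok)
    fraction = healthy / len(health)
    if fraction < min_healthy_fraction:
        down = sorted(s for s, ok in health.items() if not ok)
        return f"capacity at {fraction:.0%}: members down: {', '.join(down)}"
    return None

# Two of four members unreachable -> the monitor fires before the
# remaining members are overloaded.
print(capacity_alert({"app-01": True, "app-02": False,
                      "app-03": True, "app-04": False}))
```

Alerting on available capacity, rather than on any one cause, is what lets the monitor catch this class of problem whatever triggers it next time.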

As always, our most important job in running our services is to ensure that they are highly available and perform their duties in a timely fashion. We feel confident that the above changes will more than address the issues we had with LeadConduit this week. Thank you once again for your patience.

Posted Jun 05, 2020 - 18:30 CDT

Resolved
We have resolved the firewall issue that was causing increased load on the remaining application server cluster members. LeadConduit performance has returned to normal. We will follow up with a post-mortem.
Posted Jun 04, 2020 - 08:35 CDT
Identified
We believe we have identified the root cause of this incident. Several application servers in our cluster had misconfigured firewalls. That condition was causing higher than normal load on the other servers in the cluster. We are still investigating.
Posted Jun 04, 2020 - 08:28 CDT
Investigating
We are investigating slowdowns in LeadConduit lead processing.
Posted Jun 04, 2020 - 07:39 CDT
This incident affected: LeadConduit.