After further investigation, we have determined the root cause of this incident. Following LeadConduit performance issues on the preceding two days, we set out to increase our capacity by adding application servers. Because our current hosting provider does not offer infrastructure that would let us scale up quickly in response to increased demand, we had to set the servers up the old-fashioned way: by hand. In doing so, we inadvertently misconfigured firewall rules for some of our existing servers. That human error made those servers unavailable to handle traffic, and the reduced capacity caused the same application behavior we had seen on the preceding two days.
So what are we doing about it?
- We have added servers to handle the needs of our growing customer base.
- We are moving our entire infrastructure to a cloud provider. This will afford us many advantages over our current host, one of which is the ability to scale up quickly in response to increased demand. This work has begun and is scheduled to run from Q2 through Q4 2020.
- We will no longer rely on servers set up “by hand”. Instead, we will provision servers via infrastructure as code and automate application configuration. This marks a turning point in our ability to scale quickly, and it is a timely improvement because our applications have never been more popular. We will be able to provision additional capacity more quickly because we will no longer have to fuss with the configuration of each individual server manually. Infrastructure as code also gives us control over all changes made in production and reduces mistakes made in the heat of battle. We are making this change as part of our move to the cloud.
- We currently have a robust suite of production monitors that notify us when something looks strange in our production environment. This incident, along with the others from the preceding two days, identified the need for an additional monitor. That monitor has now been added. If we see this same sort of problem going forward, regardless of the root cause, we expect to be alerted. This allows us to attend to an emerging condition before it causes a problem for our customers.
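For readers curious what the last two items can look like in practice, here is a minimal sketch in Python. It is an illustration only, not our actual tooling: the `FirewallRule` and `Server` types, the required rule (port 443 from `10.0.0.0/24`), and the function names are all hypothetical. The idea is that when server configuration is declared as data, a misconfigured firewall can be caught by a check instead of by customers, and a capacity monitor can alert regardless of why servers dropped out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class FirewallRule:
    port: int    # port that accepts inbound traffic
    source: str  # CIDR block allowed to connect


@dataclass(frozen=True)
class Server:
    name: str
    rules: tuple[FirewallRule, ...]


# Hypothetical requirement: every application server must accept
# traffic on port 443 from the load balancer subnet.
REQUIRED_RULE = FirewallRule(port=443, source="10.0.0.0/24")


def misconfigured(servers: list[Server],
                  required: FirewallRule = REQUIRED_RULE) -> list[str]:
    """Return the names of servers whose declared rules lack the required rule."""
    return [s.name for s in servers if required not in s.rules]


def capacity_alert(servers: list[Server], min_healthy: int,
                   required: FirewallRule = REQUIRED_RULE) -> bool:
    """Return True when fewer than min_healthy servers can take traffic."""
    healthy = len(servers) - len(misconfigured(servers, required))
    return healthy < min_healthy


if __name__ == "__main__":
    pool = [
        Server("app-1", (REQUIRED_RULE,)),
        Server("app-2", ()),  # rule dropped by mistake
    ]
    print("misconfigured:", misconfigured(pool))
    print("alert:", capacity_alert(pool, min_healthy=2))
```

A check like `misconfigured` would run before a change reaches production, while something like `capacity_alert` would run continuously against the live pool, so the two cover the same failure from both sides.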
As always, our most important job in running our services is to ensure that they are highly available and perform their duties in a timely fashion. We feel confident that the above changes will more than address the issues we had with LeadConduit this week. Thank you once again for your patience.