The quickest way to get us to solve a problem is to tell us it’s impossible to solve.
It was mid-2017 when I got a call from a friend who manages a development firm. He had a large client with major server issues that had gone unresolved for months. Their API servers were going down multiple times a day during peak traffic periods, causing a domino effect that brought down some web services and left clients and vendors frustrated. This was a lead exchange, and the server issues were severely affecting revenue: as the API became unresponsive, vendors would bypass them on the ping tree and even pause the client for hours, causing the client to lose opportunities to bid. The problem was compounded by the fact that the client was still running their systems on legacy hardware (bare metal), hindering their ability to quickly respond to high demand or easily migrate and build new server instances to remedy the issue. The client had a dedicated database team on call to respond to availability issues, optimize the database server and manage the data archival process.
The client had already reached out to cloud engineers and been told that due to the architecture of their database, it would be impossible to run their systems on the cloud. We not only stopped the downtime, but successfully migrated the client to the AWS cloud, saved them a lot of money, and continued to manage and monitor their environment.
The client was a prominent lead exchange and performance marketing company, processing over one million consumer leads per day across twenty-one different verticals. The system topology consisted of two load-balanced database servers, two load-balanced API servers, an Admin system web server and a couple of general web servers, along with a firewall and miscellaneous cloud instances.
Peak-time API traffic saturated the network connection between the API (edge) servers and the database servers, bringing down other dependent services in a domino effect.
Several factors contributed to the severity of the issue:
- All servers were running cPanel with incorrectly configured cPHulk and software firewalls. The default cPanel services configuration is not suited to serving an API or remote MySQL traffic, and it was causing connectivity issues between the edge and database servers during peak times.
- The default configuration of Percona (MySQL) Server had been heavily modified. Many settings were hard-coded into config files with no config management, and they were out of date.
- Apache settings, along with the server's file limits and memory management, had been incorrectly configured, causing the server to become saturated to the point of unresponsiveness during peak load times. Basically, the server was no longer preventing processes from spawning out of control.
- The main database was poorly architected, based on a legacy, relatively flat format. Working data grew at a rate of more than 18 GB per day, with no sustainable or automated processes in place to reclaim disk space, manage table fragmentation, or archive data for compliance purposes.
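To put that growth rate in perspective, here is a rough back-of-the-envelope sketch. Only the 18 GB/day figure comes from the case above; the 2,000 GB of free disk space is a made-up number for illustration, and the function name is hypothetical.

```python
import math


def days_until_full(free_gb: float, growth_gb_per_day: float) -> int:
    """Whole days of headroom left before free disk space runs out."""
    if growth_gb_per_day <= 0:
        raise ValueError("growth rate must be positive")
    return math.floor(free_gb / growth_gb_per_day)


GROWTH_GB_PER_DAY = 18   # growth rate quoted in the case study
FREE_GB = 2000           # assumed free space, for illustration only

print(days_until_full(FREE_GB, GROWTH_GB_PER_DAY))  # 111
print(GROWTH_GB_PER_DAY * 365)                      # 6570 GB added per year
```

With nothing reclaiming or archiving space, even a generous amount of free disk is consumed in a matter of months.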
The Immediate Solution
We jumped right in and quickly reconfigured Linux and the network to stop inhibiting MySQL traffic from the edge servers. We also corrected package issues, Linux configuration and Apache configuration on the API (edge) servers to stabilize them during peak load times. The API would still delay or effectively deny requests during peak times, but at least the server load would stay under the threshold at which the server descended into an unrecoverable state requiring a hard reboot.
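One common way to keep a web server under that threshold is to cap the number of concurrent Apache workers so they cannot outgrow available memory. The sketch below shows the usual rule of thumb; it is not the exact configuration we applied, and the memory figures and function name are assumptions for illustration. The result would typically inform Apache's `MaxRequestWorkers` directive (named `MaxClients` in Apache 2.2).

```python
def max_request_workers(total_ram_mb: int, reserved_mb: int, per_worker_mb: int) -> int:
    """Largest worker count whose combined memory fits in RAM left over for Apache."""
    if per_worker_mb <= 0:
        raise ValueError("per-worker memory must be positive")
    usable_mb = total_ram_mb - reserved_mb
    if usable_mb <= 0:
        raise ValueError("reserved memory leaves nothing for Apache")
    return usable_mb // per_worker_mb


# Assumed figures: 16 GB server, 4 GB kept for the OS and other services,
# roughly 64 MB resident memory per Apache worker.
print(max_request_workers(16384, 4096, 64))  # 192
```

Requests beyond the cap wait in the queue or are refused, which is the trade-off described above: slower or denied responses at peak, but a server that stays up.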
We also prepared the system to archive the database to S3 buckets and to better manage the working data lifecycle.
The Long-Term Solution
Once the system was relatively stable and we had prevented the daily server crash situation, it was time to start planning the move to the cloud. There are some fundamental differences in the way cloud compute instances are interconnected, requiring us to re-think the topology of the system.
Traditional hardware servers typically have multiple physical network interfaces. This means that when you have separate database and edge servers, requests can come in through the public interface, be processed by the web server, make a trip to the database via the separate private interface, and finally return the response via the public interface. This standard design prevents a single network interface from becoming saturated with database traffic, which would at least double the load on that interface and often increase it by much more. In the cloud, by contrast, multiple instances run on a single physical machine, usually with a single shared network interface. Not only is the machine's network interface usually shared with other instances on the same machine, but each instance's single virtual interface must carry both the web requests and the database traffic. There is still a private and a public interface, but these are routed in the network layer external to the server (think NAT).
Additionally, storage on traditional hardware servers is normally contained within the machine and accessed via an internal bus, so it has nothing to do with the network connection at all. Cloud instances use two different types of storage: local (ephemeral) storage and persistent network-attached block storage. Since data on local disks does not survive the instance being stopped or terminated, they are only suitable for temporary workloads such as caching, working directories and swap, and not for the operating system or persistent data. This means that the hard disks of a traditional server are replaced with network-attached storage on cloud instances.
These two points alone mean that the network interface is going to become saturated much more quickly on cloud instances unless we change the layout. We considered RDS, but testing revealed that Percona servers running on EC2 delivered more performance than Aurora and more performance per dollar than MySQL on RDS.
We decided to combine the edge and database servers into the same instance, saving a round trip on the network interfaces. We left the general web servers and Admin system server on separate instances, as traffic was not a problem on these. Once the design was finalized, we migrated everything to the carefully planned EC2 environment.
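The reasoning behind this layout can be sketched as a toy model of how much traffic the busiest network interface carries in each design. This is a simplification under stated assumptions, not a measurement: the traffic figures are invented, the layout names are hypothetical, and it follows the premise above that network-attached storage traffic shares the instance's interface (some instance types provide dedicated storage bandwidth, which would change the numbers).

```python
from dataclasses import dataclass


@dataclass
class Workload:
    web_mbps: float      # client requests and responses
    db_mbps: float       # query traffic between edge and database
    storage_mbps: float  # database disk I/O


def busiest_interface_mbps(w: Workload, layout: str) -> float:
    """Traffic carried by the most heavily loaded network interface."""
    if layout == "bare_metal_split":
        # Public NIC carries web traffic, private NIC carries database
        # traffic, and disks sit on an internal bus.
        return max(w.web_mbps, w.db_mbps)
    if layout == "cloud_split":
        # Edge instance: web + database on one interface.
        # Database instance: database + network-attached storage.
        return max(w.web_mbps + w.db_mbps, w.db_mbps + w.storage_mbps)
    if layout == "cloud_combined":
        # Database queries stay on the local machine, so the interface
        # carries only web and storage traffic.
        return w.web_mbps + w.storage_mbps
    raise ValueError(f"unknown layout: {layout}")


peak = Workload(web_mbps=200, db_mbps=600, storage_mbps=300)  # invented figures
for layout in ("bare_metal_split", "cloud_split", "cloud_combined"):
    print(layout, busiest_interface_mbps(peak, layout))
# bare_metal_split 600
# cloud_split 900
# cloud_combined 500
```

In this example the combined layout wins because database traffic dominates. The model ignores the CPU and memory contention that comes from co-locating the two roles, which has to be weighed separately.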
The client experienced immediate cost savings of close to 50% over their traditional hosting bill and we eliminated downtime through smart planning of their new elastic infrastructure. Further cost savings of another 15-20% were realized as we continuously monitored, tested and optimized. Stabilization of the database environment removed the need for a dedicated database team. From the time we migrated until the posting of this article, the client has not experienced a single instance of unplanned downtime due to server or load issues. Current cloud hosting cost averages roughly 27% of their old bill, and the client continues to innovate, unencumbered by legacy infrastructure.
We can help with your migration to the managed cloud. Get in touch today for a free consultation!