Electronics distributors figured out some time ago that they either had to embrace the digital world or get left behind. Catalog distributors were among the leaders in adopting digital technology to better serve their customers and better manage their back-end operations.
In the early days of e-commerce many distributors developed their own systems. As Mouser Electronics’ digital strategy has evolved, software has played a bigger role in its system-upgrade efforts. To reduce downtime and its associated costs, Mouser enlisted help from ScaleArc, a provider of database performance and load balancing software. By adding ScaleArc’s load balancer, Mouser realized several benefits, including the ability to recover quickly from outages, easier maintenance, and an unexpected bonus of increased efficiency thanks to the load balancer’s caching and buffering of queries and results.
A few years ago, Mouser was dealing with frequent downtime issues due to failures, lockouts or outages at a rate of one every two to three months. The cost to Mouser was about $200,000 to $250,000 an hour.
“Sometimes it would only be a few seconds or a 60-second outage when failing over a cluster, but sometimes it lasted as long as 30 to 60 minutes, which in this environment adds up to hundreds of thousands of dollars lost a quarter,” said Mark Price, Mouser’s director of internet operations and architecture.
“It’s not vast amounts of money for routine failover but the issue for the database layer is, if it takes two to five minutes, it’s a question of reliability and being able to do the maintenance and patching work without having to sustain these outages,” Price added. “And if anything backs up or goes wrong or is problematic on that one active mode, you need to be able to failover onto the passive mode and that is not always as automatic as it would seem.”
In an active/passive model, such as Mouser’s, two of the nodes (primary and secondary) accept the live traffic, while the other two remain “dark” – not accepting any traffic until one of the active nodes fails.
“So, yes, we were losing money and every time we had a hardware failure, which would have been an outage of anywhere from 30 to 60 minutes, it cost us from $75,000 to $250,000,” Price added.
Need for High Reliability
Mouser’s initial interest in upgrading to SQL Server 2014 was to increase uptime reliability. Specifically, Mouser wanted to take advantage of two features: scale-out (distributing data across multiple servers) and always-on (automatic failover if the primary node fails).
“We wanted to have a more redundant SQL Server cluster than what the traditional active/passive cluster allowed for,” said Price. “We had a system that drove revenues of about $2 million dollars a day and needed to be up 24 hours a day/7 days a week because we do business in multiple countries around the world and support 23 languages in 17 currencies.”
About 55 percent of Mouser’s revenue, approximately $10 million per week, runs through e-commerce, which was one of the big drivers behind the need to improve reliability.
Dealing with even planned outages was a challenge with the SQL Server 2008 model, according to Price. If the database needed to fail over to the passive node so that maintenance could be done on the active node, it was a 30- to 60-second outage, he said.
“You had to keep your fingers crossed that the passive mode was ready to go and take up the transactions. We were looking for a way to have multiple active modes so we did not have an outage every time we failed over,” he added.
“Normally it took 90 seconds to occur but there was always the risk it wouldn’t come back online,” said Price. “On more than one occasion when we failed over instead of a two- to three- minute outage it was a 10 or 15 minute outage because the passive node wouldn’t restart and take the load instantly.”
Today, if Mouser needs to service a server, it can pull it out of the load balance core, work on it, and put it back without interrupting the other servers.
Having any outage at all is a problem, said Price. “We sell to design engineers who are very specialized and exacting customers. It’s typical for an engineer to open up a shopping cart or a shopping list on Monday and keep it open for 10 days and check out with a couple of hundred items in his cart.”
An outage could cause customers to lose carts they had built up over extended sessions on the e-commerce site, which would hurt them badly, he continued.
At the same time Mouser was looking to implement SQL Server 2014, it also evaluated the ScaleArc load balancer solution to help leverage the new SQL Server capabilities. “The most attractive feature about ScaleArc is that it offered full load balancing on the SQL cluster,” said Price.
“Mouser was looking to upgrade to a new form of SQL Server that had a lot of new cool capabilities but you usually need to do a lot of application re-coding to take advantage of those capabilities,” explained Michelle McLean, vice president of marketing, ScaleArc. “ScaleArc helped them to take advantage of all the new capabilities without having to modify their own code.”
“Companies are moving from disaster recovery (DR) to a new mode called different things – active/active operations or continuous availability. The notion of DR is that you will have a failure then you want to recover from that failure as quickly as possible, but you’re going to have a failure,” explained McLean. “The new thinking is that a piece may fail but the overall system cannot fail so designing for that notion of continuous availability was in Mouser’s mindset.”
Mouser needed an infrastructure that could keep pace with its increasing online sales, said McLean. “A big part of that was being able to do this kind of ‘live DR’ – the notion where you don’t actually go down.”
But redundancy at the data tier is difficult and requires two pieces: a database with multiple servers that can serve the traffic, and the layer that ScaleArc adds, she said.
Think of ScaleArc as database load balancing software but it does a lot more than load balancing, McLean explained. It understands the different traffic types – read/write – and directs them to the right database server, she continued. “But very importantly it natively understands the incoming traffic and directs it appropriately so no application re-coding is needed.”
“SQL Server 2014 offers the always-on feature where you can run in a single cluster – in this case four SQL Server nodes – but in order to load balance traffic and have them running hot you have to alter your codes to give the read-only requests to the database.”
This translated into hundreds and even thousands of lines of code to alter. But ScaleArc’s solution eliminates that challenge because it has a built-in algorithm whereby the load balancer interrogates the incoming query and determines whether it’s read-only or an update or write query.
“If it is an update query it’s sent to the primary node, and if it is a read-only query it would balance it across all four nodes in the SQL Server cluster so all four nodes would take traffic,” said Price. “Instead of running two boxes at 70 percent capacity and two others running idle, we now had four boxes running at about 40 percent on each of the machines.”
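The read/write split Price describes can be illustrated with a minimal sketch. This is not ScaleArc’s actual algorithm, which the article does not detail; the node names, the `Router` class and the deliberately crude SELECT check are all assumptions made for illustration.

```python
import itertools
import re

# Hypothetical node names; the article only says the cluster has four nodes.
PRIMARY = "node1"
NODES = ["node1", "node2", "node3", "node4"]

# Crude classifier (an assumption for this sketch): a statement counts as
# read-only if it starts with SELECT and does not contain the word INTO.
# Anything ambiguous is treated as a write, which is the safe direction.
_READ_ONLY = re.compile(r"^\s*select\b(?!.*\binto\b)", re.IGNORECASE | re.DOTALL)


def is_read_only(query: str) -> bool:
    """Return True if the query looks like a plain read."""
    return bool(_READ_ONLY.match(query))


class Router:
    """Send writes to the primary; spread reads round-robin over all nodes."""

    def __init__(self, primary, nodes):
        self.primary = primary
        self._reads = itertools.cycle(nodes)

    def route(self, query: str) -> str:
        if is_read_only(query):
            return next(self._reads)
        return self.primary


router = Router(PRIMARY, NODES)
print(router.route("SELECT * FROM parts"))       # a read: goes to the next node in rotation
print(router.route("UPDATE cart SET qty = 2"))   # a write: always goes to the primary
```

Because the decision is made on the query text itself, the application keeps sending every statement to one address, which is the point McLean makes about avoiding re-coding.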
“If we need to take out one box for maintenance or lost one box to any kind of hardware failure or a lockout on that particular database we could remove it from the ScaleArc load balancer and the traffic would load balance across the remaining three nodes,” he continued.
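As a rough sketch of that behavior, the pool below skips any node that has been pulled for maintenance and keeps rotating over the rest. The `NodePool` class and the node names are hypothetical and do not represent ScaleArc’s interface.

```python
class NodePool:
    """Round-robin pool that skips nodes taken out for maintenance."""

    def __init__(self, nodes):
        self.nodes = list(nodes)      # fixed rotation order
        self.active = set(nodes)      # nodes currently accepting traffic
        self._cursor = 0

    def remove(self, node):
        """Pull a node out of rotation (maintenance or failure)."""
        self.active.discard(node)

    def restore(self, node):
        """Put a known node back into rotation."""
        if node in self.nodes:
            self.active.add(node)

    def pick(self):
        """Return the next active node in rotation order."""
        if not self.active:
            raise RuntimeError("no active nodes left in the pool")
        while True:
            node = self.nodes[self._cursor % len(self.nodes)]
            self._cursor += 1
            if node in self.active:
                return node


pool = NodePool(["node1", "node2", "node3", "node4"])
pool.remove("node2")                       # take one box out for patching
print([pool.pick() for _ in range(6)])     # traffic spreads over the other three
pool.restore("node2")                      # patched node rejoins the rotation
```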
That is a huge improvement over cold failovers and it’s totally transparent to the customer, Price said.
He cited an example where Mouser had a hardware failure – a socket failure on the primary SQL Server node, which is the single node that takes all the update traffic as well as a proportion of the read traffic.
“It was in the middle of the day – the busiest time of the day – thousands of users were on the system and millions of transactions running and node number one failed. Every alert in the building went off,” said Price. “Before we could log in to see what had happened the ScaleArc load balancer had detected the failure; buffered all of the update requests, and started serving all of those buffered requests within about 10 to 15 seconds to a new primary node.”
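The buffering Price describes can be sketched roughly as follows: while no primary is available, writes are queued instead of rejected, then replayed in order once a new primary is promoted. The `FailoverBuffer` class, its method names and the `send` callback are assumptions for illustration, not ScaleArc’s implementation.

```python
from collections import deque


class FailoverBuffer:
    """Queue writes while the primary is down; replay them after promotion."""

    def __init__(self, primary, send):
        self.primary = primary    # None while no primary is available
        self.send = send          # hypothetical transport: send(node, query)
        self.pending = deque()    # writes held during the failover window

    def write(self, query):
        if self.primary is None:
            self.pending.append(query)   # hold the write instead of failing it
            return "buffered"
        self.send(self.primary, query)
        return "sent"

    def primary_failed(self):
        """Health check reports the primary is gone."""
        self.primary = None

    def promote(self, new_primary):
        """Point writes at the new primary and replay held writes in order."""
        self.primary = new_primary
        flushed = 0
        while self.pending:
            self.send(new_primary, self.pending.popleft())
            flushed += 1
        return flushed


delivered = []
buf = FailoverBuffer("node1", lambda node, q: delivered.append((node, q)))
buf.write("UPDATE cart SET qty = 2")     # delivered to node1
buf.primary_failed()
buf.write("INSERT INTO orders ...")      # held while there is no primary
buf.promote("node2")                     # replayed against the new primary
```

From the client’s point of view a buffered write shows up as a short delay rather than an error, which matches the “little bit of a delay” described for shoppers during the failover.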
“All of the read requests, which were about 90 percent of our traffic, continued without interruption, which would have otherwise been a 30- to 90-minute outage costing us anywhere from $150,000 to $300,000,” Price said. “It was completely transparent to the customer and the ScaleArc load balancer did the whole failover and recovery before we could even log in. We were about three feet from our desks when the alerts went off and it basically paid for itself right there.”
“By the time everybody got back to their desks and logged into both the SQL Server and ScaleArc system the combination of the database doing failover and the ScaleArc software buffering the incoming traffic the users on the website and those with items in the shopping cart never saw an outage,” said McLean. “They saw what could have been a little bit of a delay while the database was finishing the failover but they never got page load errors or a cart flush or any of the things that could happen when an application fails. He said ‘the recovery was fully automatic – no outage; no drop sessions; and no tangible customer impact. Impressive.’”
Price estimates that the upgrade saves the company hundreds and likely thousands of hours of work and reduces the need to work after hours or on weekends to perform maintenance. It has also been a huge boost to Mouser’s ability to keep up with patches, particularly security patches, because a node can be pulled without worrying about a failover or outage, said Price.
Mouser does a huge amount of business overseas, so there is never a good time for it to be down, explained McLean. “That pain of having to take down a site or take down an app for maintenance is really challenging for businesses that don’t feel like they ever have time for downtime.”
“The nice application of ScaleArc software is for these planned outages. It no longer has to mean an outage to the users because with ScaleArc software you can isolate one server at a time so you can take it offline for patching or other maintenance and the software will route around that and keep the app running.”