Putting Magento to the test on Boxing Day

In this post I’ll write about Magento performance optimization, human solutions to technical problems, and the real-world experience of running an e-commerce business with a tiny, agile team.

On Boxing Day we had a sale event at LiveOutThere.com. We placed some great items from this season on sale for 30% to 50% off, and using an additional discount code, our customers could receive up to 55% off if they had been paying attention to our email deployments in the days before. These are great prices and it is no wonder we attracted a little more attention than we bargained for.

In order to give our customers on the west coast a chance to buy product before everyone on the east coast snapped it up, we decided to put up a holding page and close the doors until 11am MST. I placed our Google Analytics tag on the holding page and watched eagerly as hundreds of concurrent visitors piled in.

11am – the sale is on!

When 11am came, I removed the RewriteRule that forced all of our inbound visitors to the holding page, and boom – the website instantly choked. The disappointment my partners and I felt as we imagined the frustration these hundreds of eager shoppers must have experienced hung in the air. This was not supposed to happen.

We have two servers – one front-end web / cache server, and one dedicated database server. We do not run a dedicated reverse proxy like Squid or Varnish, and in this case, I don’t think it would have made a difference. We already use Amazon CloudFront to host our media assets and CSS. Our CSS and JavaScript are minified, and we try to follow all of the YUI guidelines. It was safe to say that this was not a front-end problem. In anticipation of the traffic increase, I had recently switched to Redis for our Magento cache backend. We use Redis in combination with TinyBrick’s LightSpeed extension to store the full HTML of our pages, which will theoretically reduce the database traffic to near zero if the cache is hot.

11:15am – initial diagnosis and damage control


thread_cache_size = 16
max_connections = 800
query_cache_size = 512M
sort_buffer_size = 4M
join_buffer_size = 3M
open_files_limit = 4096
table_open_cache = 2048

Solving a problem like this on the fly is kind of like replacing a valve in a fully pressurized plumbing system with no master shut-off. Within a few minutes I had established that the issue was certainly heavy load on the database server, but unfortunately I could not do much about it. We had instantly reached max_connections and sent the processor utilization to 100%. With nothing else to do, I added a RewriteRule back to our .htaccess that shunted visitors back to a holding page. Jamie Clarke, LiveOutThere.com’s president, drafted a quick apology that we published to the holding page while I worked to understand and fix the issue.

Communicate early and as often as possible. Our Facebook page allowed us to continue a conversation with our waiting customers, and ultimately serve them one-on-one.

Our dedicated MySQL server has 24GB of RAM and uses the key configuration values shown in the pull-quote above. This was based on the best practices found at magentocommerce.com. This is certainly not a cheap piece of hardware, and even under heavy load it should have performed adequately – something wasn’t adding up. I tried restarting MySQL a few times. No dice. I would get 2 seconds of operation and would immediately be back in the 100% utilization range. I knew I had two different challenges – first, how to bring the site back up in some sort of minimal viable configuration, and second, what the hell was causing the utilization in the first place? Our Redis cache was hot, and the database server should have been under almost no load.
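One way to sanity-check settings like these is to multiply the per-connection buffers out against max_connections. The sketch below is only a back-of-the-envelope ceiling: it assumes every one of the 800 connections allocates its full sort and join buffer at the same moment, which MySQL only does on demand, and it ignores every other buffer on the server.

```python
# Rough ceiling on MySQL memory implied by the settings quoted above.
# Assumption: every connection allocates its full sort and join buffer at
# the same time. MySQL allocates these on demand, so this is a worst case.

UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3}


def parse_size(value: str) -> int:
    """Convert a my.cnf size such as '512M' into bytes."""
    value = value.strip().upper()
    if value[-1] in UNITS:
        return int(value[:-1]) * UNITS[value[-1]]
    return int(value)


# Values copied from the configuration shown earlier in this post.
settings = {
    "max_connections": "800",
    "query_cache_size": "512M",
    "sort_buffer_size": "4M",
    "join_buffer_size": "3M",
}

per_connection = (
    parse_size(settings["sort_buffer_size"])
    + parse_size(settings["join_buffer_size"])
)
worst_case = (
    int(settings["max_connections"]) * per_connection
    + parse_size(settings["query_cache_size"])
)

print(f"per connection: {per_connection // 1024**2} MiB")
print(f"worst case:     {worst_case // 1024**2} MiB")
```

Under those assumptions the ceiling sits comfortably inside 24GB, which fits the feeling that the hardware itself should have coped.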

11:30am – minimum viable fix

I realized that the only option I had would be to use mod_rewrite to do traffic control. By letting some customers in, but not all of them, I could keep the load on the database server under control and allow at least some of our customers to make purchases. This posed a few problems – you can’t let someone in and then kick them back out again, so I knew I eventually had to solve the real issue, because the shoppers I had admitted would sooner or later swamp the database server anyway. My first instinct was to do this by IP subnets, and I broke the full range of IP addresses into a few separate chunks. This was an obvious technical solution, but within a few minutes I realized it was going to just make our customers even more mad. I needed a human solution.
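The subnet idea can be sketched in a few lines. This is an illustration only: the choice of four equal chunks is an assumption (the post only says "a few"), and the function names are hypothetical.

```python
import ipaddress


def subnet_chunks(prefixlen_diff: int = 2):
    """Split the whole IPv4 space into 2**prefixlen_diff equal chunks."""
    whole = ipaddress.ip_network("0.0.0.0/0")
    return list(whole.subnets(prefixlen_diff=prefixlen_diff))


def chunk_for(address: str, chunks) -> int:
    """Return the index of the chunk that contains the visitor's address."""
    ip = ipaddress.ip_address(address)
    for index, chunk in enumerate(chunks):
        if ip in chunk:
            return index
    raise ValueError(f"{address} is not covered by any chunk")


chunks = subnet_chunks()
for chunk in chunks:
    print(chunk)

# Admitting "chunk 0 only" lets in roughly a quarter of the address space,
# with no regard for who has been waiting longest.
print(chunk_for("10.1.2.3", chunks))
```

The weakness shows up right away: which chunk a shopper lands in has nothing to do with how long they have been waiting or how much they care.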

Learn from Disney. Make the waiting experience a shared, collective experience. Even if it’s frustrating, it’s memorable and it’s an opportunity to let your real strengths shine through – this is ultimately a story about great customer service, not database server performance.

Solution #2 was to engage each customer one-on-one. I instructed Alex, our customer service lead, to collect IP addresses from our most vocal Facebook customers so I could let them in personally. This was a seat-of-the-pants decision that turned out to be a wise one. Where before I had hundreds of customers, none of whom seemed to be able to get into the store (and I guess that’s statistics for you), now I had hand-picked customers who seemed to be particularly patient, particularly frustrated, or particularly interested in communicating with us while we determined the problem.

Allowing our most vocal Facebook fans in first quickly turned the tide of discussion on our Facebook page from frustration to hopeful anticipation.

As we slowly began to process sales, a vibe developed on our Facebook page that was all about the exclusivity of access, the anticipation of being allowed into a great sale, and the shared experience of waiting. Customers posted IP address after IP address, and as Alex read them to me I retyped out RewriteCond directives to give each person access. When they received access, they would almost always make a positive comment about it on the wall.
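Retyping directives by hand under pressure invites typos, so a small generator is one way this could be scripted. The sketch below is an assumption about the shape of the rules, not a copy of our .htaccess: the holding page path /holding.html and the function name are hypothetical. Each negated RewriteCond is ANDed with the others, so only visitors who match none of the listed addresses get redirected.

```python
import ipaddress
import re

HOLDING_PAGE = "/holding.html"  # hypothetical path, not from the real site


def allow_list_rules(addresses, holding_page=HOLDING_PAGE):
    """Build an .htaccess fragment that admits only the listed addresses."""
    lines = ["RewriteEngine On"]
    for raw in addresses:
        # ip_address() raises ValueError on a mistyped address, which is
        # better than shipping a broken condition to the live server.
        ip = ipaddress.ip_address(raw.strip())
        lines.append(f"RewriteCond %{{REMOTE_ADDR}} !^{re.escape(str(ip))}$")
    # Do not redirect the holding page to itself.
    lines.append(f"RewriteCond %{{REQUEST_URI}} !^{re.escape(holding_page)}$")
    lines.append(f"RewriteRule ^ {holding_page} [R=302,L]")
    return "\n".join(lines)


# Documentation-range addresses standing in for real customers.
print(allow_list_rules(["203.0.113.7", "198.51.100.23"]))
```

Adding a customer then becomes a matter of appending one address to the list and regenerating the fragment.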

1:00pm – the true nature of the problem

After an hour and a half of coordinating Facebook sales, I finally had a chance to think about what was going on with the database server. I also had no choice: our utilization was creeping higher because by that point I had about 150 people shopping. The obvious next step was to use MySQL’s SHOW PROCESSLIST command to see exactly what kind of queries were being requested in the first place. At first glance, this showed me a whole bunch of queries that were in SLEEP state. I assumed these queries were waiting for row locks, but some of the times were so high (in the 30+ second range) that I figured just killing them outright would be better than serving a page in 30+ seconds anyways.

You can use pt-kill, a utility from Percona’s advanced MySQL toolkit, to selectively kill queries that match your criteria.

Using pt-kill, I set up a process that would scan for SLEEPing queries every 30 seconds and kill them – I figured this would mean better performance for the queries that were getting served, and at the very least would serve error pages to our customers quicker than letting them grow older watching a spinning cursor. This would also reduce our connections count significantly every 30 seconds, so we weren’t throwing max_connections exceeded errors. This provided another stop-gap measure that bought me another hour while I could diagnose the real problem.
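The selection logic behind that kind of sweep is simple enough to sketch. In SHOW PROCESSLIST output, a Command value of Sleep marks a connection that is open but idle, and Time is how many seconds it has been in that state. The function below is an illustration of the filter rather than the pt-kill invocation itself; the 30 second idle threshold and the sample rows are assumptions.

```python
def kill_statements(processlist, min_idle_seconds=30):
    """Return KILL statements for connections that have slept too long.

    Each row is a dict keyed by SHOW PROCESSLIST column names.
    """
    statements = []
    for row in processlist:
        if row["Command"] == "Sleep" and row["Time"] >= min_idle_seconds:
            statements.append(f"KILL {row['Id']};")
    return statements


# Hypothetical rows, shaped like SHOW PROCESSLIST output.
sample = [
    {"Id": 101, "Command": "Sleep", "Time": 45, "Info": None},
    {"Id": 102, "Command": "Query", "Time": 60, "Info": "SELECT ..."},
    {"Id": 103, "Command": "Sleep", "Time": 3, "Info": None},
]

for statement in kill_statements(sample):
    print(statement)
```

Only connection 101 is selected here: 102 is doing real work, and 103 has not been idle long enough.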

2:00pm – partial resolution and the opening of the floodgates

After watching SHOW PROCESSLIST carefully I began to pick out particular queries that didn’t make a lot of sense to me; namely involving customer sessions. This was my first “aha!” moment all day.


Why was Magento storing customer session data in the database? That wasn’t how it should have been configured!

We have our /var directory mounted in a tmpfs partition (a virtual disk stored in RAM), and session data should have been written to /var/session on the web server. Sure enough, I looked at our /app/etc/local.xml file and our session_save location directive was set to database. This is how our development environment is configured, so at some point we must have accidentally copied a development local.xml to live and forgotten to reset this particular directive. I knew this simple change would drastically reduce the load on the database server, but I also knew that if I changed the session_save location, it would cause our in-progress sessions to be wiped out. This meant that customers who had things in their cart would lose them.
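A check like this is cheap to automate before a deploy. The sketch below assumes the usual Magento 1 layout, where session_save lives under the global node of local.xml and holds files for filesystem sessions or db for database sessions; the function names and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET


def session_save_value(local_xml: str) -> str:
    """Return the session_save setting from a local.xml string, or ''."""
    root = ET.fromstring(local_xml)  # root element is <config>
    node = root.find("global/session_save")
    if node is None or node.text is None:
        return ""
    return node.text.strip()


def lint_session_save(local_xml: str, expected: str = "files"):
    """Return a list of problems; an empty list means the file passes."""
    value = session_save_value(local_xml)
    if value != expected:
        return [f"session_save is {value!r}, expected {expected!r}"]
    return []


# Hypothetical development-style config that slipped onto the live server.
dev_xml = """
<config>
  <global>
    <session_save><![CDATA[db]]></session_save>
  </global>
</config>
"""

for problem in lint_session_save(dev_xml):
    print(problem)
```

Run against the live local.xml as part of a release checklist, a non-empty result would have flagged the stray development setting before the doors opened.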

I had no choice – it was either switch the location sessions were being saved to, or eventually choke the MySQL server again. I made the change and refreshed Magento’s configuration cache.

Within a few minutes the utilization on the database server dropped to a more reasonable level. I began removing RewriteConds cautiously, monitoring MySQL as I saved each copy of .htaccess back up to our web server. Within 10 minutes, I had removed all of the rewrite conditions and every one of our customers was able to shop. Unfortunately as a result of changing the session_save location, we did have a number of angry customers who had items in their cart or had almost completed checkout, and we made every effort to make it right for each one of these customers on a case-by-case basis. In many cases, we captured the order details using our Olark LiveChat instance and simply placed the order on the customer’s behalf.

3:00pm to 1:00am – seeing it through

Now that customers had full access to the store, we began receiving orders one after the other. For the remainder of the day, we averaged about 40 orders an hour. At any given time, we had between 100 and 200 active shoppers, and dozens of people checking out simultaneously. We use Dg_Matrix to watch shopper behaviour, and the screen was flying. Alex and I stayed on live chat until approximately 1:00am, and processed many one-on-one sales. At one point, Alex had 36 chats going at once. Somehow, she kept it together.

At LiveOutThere.com, we attempt to offer a full-service approach comparable to an in-store experience over LiveChat. That means helping you determine exactly what winter jackets you should compare, how they should fit, how much you should spend, and what other alternatives might be available to you.

We finished the sale day between midnight and 1am, ringing up $100,000 in sales and hopefully delighting more customers than we disappointed. The customer service follow-up has been steady, and we have been quick to offer extensions of our sale prices for anyone who was unable to purchase in the initial few hours of the sale. In conclusion, this experience has certainly shown the way forward for future optimizations, illustrated a clear need to lint our configuration files, and taught us valuable lessons about damage control and the power of open, transparent communication with customers.

About Drew

I do a few different things well enough to be dangerous, and since 2001, I have applied my expertise to clients in the advertising, retail, philanthropic, travel/tourism, real estate, and healthcare industries.

4 Comments

  1. Sorry to hear that Drew. I recently did an article for thewhir.com discussing how, as a dedicated Magento host, we prepare for Black Friday and Cyber Monday (ie. Boxing Day):

    http://www.thewhir.com/web-hosting-news/112511_QA_Eric_Hileman_of_Magento_Web_Host_MageMogo_on_Surviving_Black_Friday_and_Cyber_Monday

    We had a few clients do this type of “door buster” promotion and overwhelm their servers here. Fortunately for them we’re always available and monitoring all servers and were immediately available to upgrade their server to a higher level of resources within minutes. Throwing hardware at the problem, when you can quickly do it in a virtualized infrastructure, is a good way to solve an immediate problem like this.

    We do some siege testing and a quick and dirty way is using Ashley Schroder’s speed test site:
    http://www.magespeedtest.com/

    …but like you said it’s not real accurate when it comes to traffic patterns, especially the checkout process. Still it gives us a good feel if a client can survive their anticipated traffic. I bet had you done that you would have caught your xml config issue right away :)

    You sound like the perfect client to host here and make the most of our performance optimization services which we provide at no cost with all our plans. Check us out because we’d love to work with you.

    Anyway, great article. Thanks for sharing.

  2. As a Systems Admin / Magento Dev who’s gone through a similar albeit smaller ride than yourself, there’s nothing like the anxiety of watching those load averages climb, especially when that floodgate opens at sale time. At least in a physical sale, people can ‘see’ others and understand the situation if they have to wait in line.

    You may find this interesting. The ability to replay your logs for load testing
    http://www.igvita.com/2008/09/30/load-testing-with-log-replay/

    • drew.gillson

      Thanks for that link, fascinating stuff. I’m always less than enthusiastic about Siege testing because it’s just so damn hard to accurately reproduce realistic traffic load patterns. Doing log reconstruction makes a lot of sense to me.

  3. Thanks for this post Drew. We also run a few eCommerce shops with Magento at our core and boxing day/week has been a wild ride (we’ll be posting our notes next week). I love the approach you took with Facebook and one on one service. Bloody brilliant!
