Deploying Daily

Six Months Ago

“I’m not seeing the new tooltip on the grid application. Did you deploy that one yet?” “Yeah, just now.” “Did you remember to sync the submodules?” “Yes.” “Did you flush Memcached and the CDN?” “Crap. No, I forgot. Which order do we flush those in again?” “Memcached, then CDN. Then make sure to deploy to both Heroku and Engine Yard.”

These were dark times.

When I started at Rent the Runway nine months ago, our storefront was divided into seven swimlanes: independent Sinatra applications that, in theory, allowed us to handle high traffic through fine-grained control over different parts of renttherunway.com. If we suddenly got a burst of traffic from a news story or celebrity tweet, we could scale the home page independently of the rest of the application to handle it. We also gained fault isolation and tolerance, since one part of the application going down meant we both knew where the problem was and could prevent it from spreading.

We soon discovered, however, that while swimlanes were a good idea, we’d applied them at the wrong level of granularity: 99% of the time, our changes to the codebase affected multiple swimlanes. Even when we were working on just one, we ended up writing a lot of duplicate code because all the underlying data providers were identical. Fault tolerance and isolation were nice, but we were dead in the water if users could hit the home page but not transact, or vice versa. We didn’t even have full isolation: since all the data providers were the same, a service disruption to one swimlane would likely affect several. All this meant that deploying our storefront had rapidly become an enterprise of lunar mission proportions.

Deploying all seven swimlanes entailed syncing two submodules (one for views, another for assets), possibly bumping the version on two gems (one for the base application from which each swimlane inherited, another for our API clients), updating code for all seven swimlanes (each its own Git repo), then deploying each one to both Heroku and Engine Yard (in the event of problems with one, we’d fail over to the other). The dependency graph looked like a drunk John Madden telestrator instant replay diagram.

Oh yeah: and each swimlane had its own CDN and Memcached instance, each of which had to be flushed separately and in the right order.
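To give a sense of scale, here is a rough sketch that writes the checklist for a single environment out as data. The step wording and the exact point at which the flushes happen relative to the deploys are assumptions made for illustration; the counts (two submodules, two gems, seven swimlanes, two hosts) and the Memcached-before-CDN order come from the description above.

```ruby
# Hypothetical reconstruction of the manual checklist for ONE environment.
# Step names are illustrative; only the counts and the flush order
# (Memcached, then CDN) are taken from the post.
SWIMLANE_COUNT = 7
HOSTS = ["Heroku", "Engine Yard"].freeze

def old_deploy_steps(swimlane_count: SWIMLANE_COUNT, hosts: HOSTS)
  steps = [
    "sync views submodule",
    "sync assets submodule",
    "bump base application gem (if changed)",
    "bump API client gem (if changed)"
  ]
  (1..swimlane_count).each do |n|
    lane = "swimlane #{n}"
    steps << "update code for #{lane}"
    hosts.each { |host| steps << "deploy #{lane} to #{host}" }
    # Assumption: flushes follow the deploys. The order between the two
    # flushes is from the post: Memcached first, then the CDN.
    steps << "flush Memcached for #{lane}"
    steps << "flush CDN for #{lane}"
  end
  steps
end

per_environment = old_deploy_steps.length
puts "Steps per environment: #{per_environment}"                 # 39
puts "Steps for stage plus production: #{per_environment * 2}"   # 78
```

Even under these generous assumptions (every step works the first time), that is dozens of manual actions, any one of which can be forgotten or done out of order.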

Oh! And this was just to deploy to stage. Once we were on stage and confirmed everything looked good, we had to do the whole thing again for production. It took two developers six hours, start to finish, to deploy renttherunway.com. We deployed once a week on Thursdays, and if for some reason we couldn’t get a release out the door (which, as you can imagine, wasn’t uncommon), site traffic and weekend on-call concerns forced us to wait until the following Tuesday for another window of opportunity. Like I said, it was like a lunar mission.

Present Day

We knew we had to do better. Camille, our head of engineering, charged us with putting together a task force to get us from our wobbly weekly deploys to a system of smooth, daily ones. Since many of Rent the Runway’s back end services are named after animals, the task force became the TuskForce™, with our unit of progress the stampede (rather than the sprint). We determined we needed to do three things:

  • Consolidate our seven swimlanes into one application, which included eliminating the submodules;

  • Develop a feature flagging system to allow us to toggle features on and off without deploying the application;

  • Move the application to a single host (rather than Heroku and Engine Yard) and automate the deploy process.

Smashing the swimlanes together was arduous and a lot of things broke, but within a couple of weeks we had a functioning, single “swimming pool” comprising the original seven Sinatra applications. At some point, “Seven [rings] for the Dwarf-lords in their halls of stone” came up. Many Lord of the Rings jokes followed.

While we were busy combining the swimlanes, I set to work forging a feature flagging service, dubbed “Flaggregator.” (Rejected names included “The Flaggro Crag” and “Spiro ‘Feature’ Flagnew”.) Flaggregator is a Sinatra application that keeps track of feature flag data through Git and GitHub, so in keeping with our LotR theme, we named Flaggregator’s GitHub user “Frodo Flaggins.”

With the arrival of Flaggregator, we could modify features on the site without a deploy. Now, if we’d stopped here, we’d already have had some serious wins over the previous process: deploying one application weekly and changing features whenever we wanted was night and day compared to the stress of our old weekly, six-hour deploy marathons. But we knew we could do more, so Fellowship-style, we pressed on.
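For readers who haven’t used feature flags, here is a minimal sketch of the idea. This is not Flaggregator’s actual interface: the `FlagStore` class, the JSON format, and the `grid_tooltip` flag name are all hypothetical. The point is only that application code asks a flag store whether a feature is on, so changing the flag data changes behavior without shipping new code.

```ruby
require "json"

# Hypothetical flag store for illustration; NOT Flaggregator's real API.
class FlagStore
  def initialize(json_text)
    @flags = JSON.parse(json_text)
  end

  # Swap in fresh flag data at runtime, e.g. after the flag file changes.
  def reload(json_text)
    @flags = JSON.parse(json_text)
  end

  # Unknown flags default to off, so a typo never turns a feature on.
  def enabled?(name)
    @flags.fetch(name.to_s, false) == true
  end
end

# Application code branches on the flag instead of on what was deployed.
def render_grid(flags)
  flags.enabled?(:grid_tooltip) ? "grid with tooltip" : "grid"
end

flags = FlagStore.new('{"grid_tooltip": false}')
puts render_grid(flags)   # prints "grid"

# Flip the flag: same code, same deploy, different behavior.
flags.reload('{"grid_tooltip": true}')
puts render_grid(flags)   # prints "grid with tooltip"
```

Defaulting unknown flags to off is a design choice in this sketch: a half-finished feature stays hidden unless someone explicitly enables it.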

We moved the application off Heroku and Engine Yard and over to the private cloud, then automated the deploy process via Rake and Capistrano. (This is nothing groundbreaking, which is a Good Thing: you should save the experimentation for features and small, new projects, not core infrastructure.) We also spun up a Hubot instance named Fashionator to serve as our deploy concierge, telling us about pull requests, notifying us of deploys, and running integration tests.
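The core of any automated deploy is an ordered list of steps that stops at the first failure. The sketch below shows that shape in plain Ruby; it is a stand-in, not our Rake or Capistrano configuration, and the step names and the `DeployStep` and `run_deploy` helpers are hypothetical.

```ruby
# Plain-Ruby stand-in for an automated deploy pipeline.
# Each step pairs a name with a callable that returns true on success.
DeployStep = Struct.new(:name, :action)

# Runs steps in order and aborts at the first failure.
# Returns [success, log] so the caller can report what happened.
def run_deploy(steps, log: [])
  steps.each do |step|
    log << "running #{step.name}"
    unless step.action.call
      log << "FAILED at #{step.name}; aborting"
      return [false, log]
    end
  end
  log << "deploy complete"
  [true, log]
end

# Hypothetical steps; real ones would shell out to actual tools.
steps = [
  DeployStep.new("run integration tests", -> { true }),
  DeployStep.new("push release to host",  -> { true }),
  DeployStep.new("flush caches",          -> { true }),
  DeployStep.new("notify chat room",      -> { true })
]

ok, log = run_deploy(steps)
puts log
puts ok ? "ok" : "failed"
```

Once the sequence lives in code rather than in someone’s head, nobody has to remember which step comes next, and a failed step can’t be silently skipped.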

The result? It now takes one developer six minutes to deploy our storefront application, and we do it every day. And this wasn’t our only win: moving to a culture of continuous integration has resulted in more and better tests, better code quality and coverage, and improved code review processes (all of which I’ll cover in future posts).

TL;DR

TuskForce completed its work at the end of January of this year. In three months, we made the following changes to our storefront deploy process:

    • Went from deploying seven applications to just one;

    • Decreased our deploy time by 98% (from six hours to six minutes);

    • Increased our deploy velocity by 400% (from weekly to daily);

    • Added the ability to turn features on and off independently of daily deploys;

    • Made our business stakeholders, customers, and developers much, much happier.
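For the curious, the two percentages above can be checked with a few lines of arithmetic. The durations come from earlier in this post; counting “daily” as five deploys per week (weekdays only) is an assumption, and the constant and helper names are just for illustration.

```ruby
# Figures from the post: six hours before, six minutes after.
OLD_DEPLOY_MINUTES = 6 * 60
NEW_DEPLOY_MINUTES = 6

# Weekly before; "daily" is assumed to mean five weekday deploys.
OLD_DEPLOYS_PER_WEEK = 1
NEW_DEPLOYS_PER_WEEK = 5

# Percent change from one value to another (negative means a decrease).
def percent_change(from, to)
  (to - from) * 100.0 / from
end

time_change = percent_change(OLD_DEPLOY_MINUTES, NEW_DEPLOY_MINUTES)
velocity_change = percent_change(OLD_DEPLOYS_PER_WEEK, NEW_DEPLOYS_PER_WEEK)

puts "Deploy time change: #{time_change.round}%"          # prints -98%
puts "Deploy velocity change: #{velocity_change.round}%"  # prints 400%
```

The time figure is wall-clock only. Counting developer time (two developers before, one now), the reduction is larger still.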

We still have a ton of work to do: scaling, infrastructure development, configuration management, and continuous deployment (that is, multiple deploys per day) are just a few of the difficult problems we’re tackling in 2014. But with the huge cultural and technological successes we’ve seen in the last few months, I have no doubt we’ll find imaginative solutions for them. Rent the Runway’s future is very, very bright.

If you want to be part of it, come work with us.

