Rent the Runway Cycle Counting Redesign

At Rent the Runway, our business has been growing rapidly since the day we launched. For us, that means more and more physical items that we need to buy, clean, repair, store, and ship. We have to scale not only our servers but our physical space and processes as well! One of the most important jobs is making sure we know where our stuff is, and we employ several different methods to ensure everything is in the right place. Our technology is not just digital; it has physical implications that matter even more. A little over a year ago, we started the project of upgrading our cycle counting system to handle the hundreds of thousands of units of clothing we need to track. Our old cycle counting tool was showing its age and needed to be redesigned.

“Cycle counting,” also called cycle scanning, is the process of ensuring that everything is in the right place by comparing what is in inventory with what is expected to be in inventory. Our inventory is generally garments stored on hangers, but not exclusively: we also have for-sale extras like bras, accessories, etc. Counting all the items is done by a special team in our warehouse and tends to take place while other teams are putting away returned items or picking for new orders.

Any system depends first on the organization of the units. Initially, Rent the Runway organized its inventory alphabetically by designer. Anyone who has organized their books alphabetically by author will know what this means: when you add a new unit, everything after it needs to be pushed out to the right. That's no big deal when it starts with a W or a Z, but a bigger deal if it starts with a C. And it's not such a huge burden when you have 100 books, but it becomes a real problem when you have a giant warehouse full of half a million garments.

When our warehouse managers saw this was becoming an issue, they moved to organizing by style, with new stock added wherever there was room. However, that means you need to keep track of where your inventory is located in the warehouse. When we started the design process we had 250,000 units and knew that number would grow quickly; we now have many, many more than that. And all of it needs to be tracked.

Design considerations

Our tracking is twofold. First, we need to know what general station a unit is in: with the customer; in cleaning; in quality control; or on the rack, ready to be shipped to a customer. Second, when a unit is on the rack, we need to know where it is in the racks, which we determine from a mapping in the database of styles to their locations. A style-location mapping tool was one of the first things I worked on at RtR: the warehouse team needed an interface to manage those mappings, and our automated picking and putaway system depends on them.
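
Just to make the idea concrete, a style-location mapping boils down to a keyed record that picking and putaway can look up. Here is a minimal, purely hypothetical sketch (these are not our actual schema or class names):

```java
import java.util.Map;

public class StyleLocationSketch {
    // Hypothetical mapping record; the real table has more fields.
    record StyleLocation(String styleId, String rail, String section) {}

    public static void main(String[] args) {
        // In-memory stand-in for the database table mapping styles to rack locations.
        Map<String, StyleLocation> locationsByStyle = Map.of(
                "STYLE-123", new StyleLocation("STYLE-123", "RAIL-07", "001"),
                "STYLE-456", new StyleLocation("STYLE-456", "RAIL-07", "002"));

        // Picking and putaway both reduce to a lookup by style.
        System.out.println(locationsByStyle.get("STYLE-123"));
    }
}
```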

Another consideration when redesigning the tool was mobility. Many parts of our warehouse management tool were built with a desktop browser in mind, not mobile devices; laptops were as mobile as it got. So when the team did a cycle count using the old tool, it took two people, a mobile laptop station, and a barcode scanner attached to the laptop. One person would scan the units on the rail, and the other would watch the results on the screen.

Also, for the upper rails, the person counting would have to go up and down a ladder, because half the inventory in an aisle is on the upper rail!

Another big issue was the timeouts that routinely occurred with the old tool. It worked by scanning all the units in a style and then submitting them all at once to the server, which would then calculate missing units, misracked units, etc. But there was a timeout on the page: once you started counting, you had a finite window within which to scan everything, or the page timed out and all your scans to that point were lost. One of the first decisions we made was to make each scan atomic, so its content is saved instantly. That way, workers can scan along at a good pace without worrying about losing their work to a timeout.
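
As a rough sketch of what "each scan is atomic" means in practice, here is a hypothetical JAX-RS-style endpoint where every barcode scan is its own request and its own write, so a timeout or dropped session can only lose the scan in flight, never the whole count. The class and store names are illustrative, not our production code:

```java
import javax.ws.rs.Consumes;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.core.MediaType;
import javax.ws.rs.core.Response;

// Hypothetical resource: one scan per request, persisted immediately.
@Path("/cycle-counts/{countId}/scans")
public class ScanResource {

    // Minimal stand-in for whatever datastore backs the cycle counts.
    public interface ScanStore {
        void save(String countId, String barcode);
    }

    private final ScanStore scans;

    public ScanResource(ScanStore scans) {
        this.scans = scans;
    }

    @POST
    @Consumes(MediaType.TEXT_PLAIN)
    public Response recordScan(@PathParam("countId") String countId, String barcode) {
        scans.save(countId, barcode); // saved instantly; nothing is buffered until "submit"
        return Response.ok("scanned barcode " + barcode).build();
    }
}
```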

We were constrained by the fact that the smallest collection of units with a strongly delineated start and end is the "rail." A rail is just a fancy term for the long pipe that we hang our garments on for storage. It contains sections. It looks sort of like this:

[Illustration: a rail, divided into sections]

Unfortunately, the boundary between one section and the next is not strongly delineated, so units can and do slide into the next section if a style gets crowded. For example, units that are supposed to be in section “001” may slide over to “002.” Taking stock at the end of a single section could therefore make you think units are missing when they really aren't; you need to assess at the end of a discrete space. So we decided to associate a cycle count with the entire rail and not take stock until all the sections in the rail have been scanned. We were also assured that styles would not run from one rail to the next.

(Thanks to Rachael McKnight for assistance with this illustration.)

Finally, we designed with an eye toward minimizing the number of steps a count takes. The teams count hundreds of units per rail per cycle count, so, for example, a single extra "OK" click could mean hundreds of excess actions. We tried to eliminate confirmations and pop-ups wherever possible. Of the six possible message responses when scanning a unit, we were able to design all but two as only a courtesy message (e.g., “scanned barcode 12345678 to section 001”) that disappears when the next scan is read. The two that require intervention are meant to (a) draw attention to what we call "anomalies," mainly misracked items, and (b) ensure the scanner removes the unit from the rack and places it aside for reracking later. Even then, if the unit has simply "slid over" to the next section, we offer an option to add a style-location mapping for the current location, so any subsequent units that would have come up misracked will instead read as being in a correct location.

Anomalies include misracked, missing, and duplicate barcodes. True duplicates are extremely rare, but duplicate scans can occur if the scanner loses their place and scans the same unit twice. This is the other screen where we require a reaction, to let us know whether it was a true duplicate or just someone finding where they left off. Items may be misracked because the unit has "slid over" to the next section, or because it really belongs somewhere else in the warehouse.

Process

A user will start a new cycle count by scanning the barcode for any section in a rail:

[Screenshot: the starting screen]

The system will look for a cycle count already in progress. If none is found, it will initiate a new cycle count and ask the user to start scanning barcodes of garments.

For each garment barcode scanned, the system will ask itself if the garment is misracked or the barcode is a duplicate, and it will respond accordingly.

[Screenshot: a misracked unit]

[Screenshot: a duplicate scan]

If the barcode is neither misracked nor a duplicate, the application queries the database to see whether the unit is expected on the rack and, if so, whether it is in an active state (not set aside for clearance or sale). If both of these are true, the app displays a courtesy message saying the unit is expected at that location, and the user can scan the next unit’s barcode.

At the end of a section, the scanner clicks a button to end the current section and is prompted to scan the next section's identifier. They continue counting units until they reach the end of the rail, then click a button to close out the rail.

At this point the application will calculate missing items based on what inventory we expect to be in the sections scanned. Then the app will display a summary of misracked and missing units. As part of completing a cycle count the team must put away misracked units in their proper locations, and attempt to locate missing units in the rail just scanned.
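
Conceptually, that missing-item calculation is just a set difference between the barcodes the database expects in the scanned sections and the barcodes that were actually scanned. A minimal sketch (hypothetical values, not our actual code):

```java
import java.util.HashSet;
import java.util.Set;

public class MissingUnits {
    // Expected barcodes come from the style-location mappings for the sections in the rail;
    // scanned barcodes are everything recorded during this cycle count.
    static Set<String> missing(Set<String> expected, Set<String> scanned) {
        Set<String> missing = new HashSet<>(expected);
        missing.removeAll(scanned); // expected but never scanned => reported as missing
        return missing;
    }

    public static void main(String[] args) {
        Set<String> expected = Set.of("11111111", "22222222", "33333333");
        Set<String> scanned = Set.of("11111111", "33333333", "44444444"); // 44444444 was flagged as misracked earlier
        System.out.println(missing(expected, scanned)); // prints [22222222]
    }
}
```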

Once they have found all of the missing units they can find and have put away all of the misracked units, the cycle count for the rail is marked complete.

[Screenshot: a report for a completed cycle count]

Conclusion

At Rent the Runway, the number of items we manage keeps growing rapidly, and we’re already well on the way to building an even more scalable way to ensure the right items are in the right place at the right time, so we really can be our customers’ “Closet in the Cloud.”


From CRUD to CQRS with Dropwizard (Part 3)

Part 3: Eventually Consistent Denormalization

This is the third part of a multi-week series of posts describing various implementations of Command Query Responsibility Segregation (CQRS) and Event Sourcing using Dropwizard. Weeks 1 and 2 can be found here and here. As a quick refresher, CQRS is a design pattern where commands modifying data are separated from queries requesting it and the data is denormalized into structures matching the Data Transfer Objects (DTO) that these queries are requesting. If you want to learn more about CQRS, read Martin Fowler’s blog post on the subject.

Picking up from last week, here’s the CQRS application we built:

And here is this week’s application:

The major difference, as you can see, is that now we handle events in a similar manner to the way we were handling commands last week. Instead of directly processing the events created from the command, we are listening to the change log of our source of truth, writing events to a message bus, and then asynchronously reading the events out of the message bus, denormalizing the data contained in them, and writing it to the data store backing our DTO.
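
To sketch the new piece, the change-log listener, here is roughly what tailing MongoDB's change stream and publishing an event per change to Kafka could look like. This is an illustration of the pattern, not the sample app's actual code; the topic name and payload shape are assumptions:

```java
import java.util.Properties;

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;
import org.bson.Document;

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import com.mongodb.client.model.changestream.FullDocument;

public class ChangeLogToEvents {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        KafkaProducer<String, String> producer = new KafkaProducer<>(props);

        MongoCollection<Document> products = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("cqrs").getCollection("products"); // the source of truth

        // Tail the collection's change stream (Mongo's change log; requires a replica set)
        // and publish one event per change for downstream denormalizers to consume.
        for (ChangeStreamDocument<Document> change :
                products.watch().fullDocument(FullDocument.UPDATE_LOOKUP)) {
            String key = change.getDocumentKey().toJson();                        // entity id
            String payload = change.getFullDocument() == null
                    ? "{}" : change.getFullDocument().toJson();                   // entity state after the change
            producer.send(new ProducerRecord<>("product-events", key, payload));  // assumed topic name
        }
    }
}
```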

I’ve created another small Dropwizard application to demonstrate this pattern using Mongo and Kafka. You can find the code and instructions on how to run it in IntelliJ and send commands to it via Postman here.

The steps for a data update request (command) are:

  1. Http request received to change entity
  2. Request is validated
  3. Request is translated to command(s)
  4. Command(s) are written to message bus
  5. Response is sent to client
  6. Command(s) are pulled off of message bus
  7. Command(s) are handled
    1. Existing entity is retrieved and command(s) are validated
    2. Command(s) are applied to entity and delta is determined
    3. If there’s a delta:
      1. The entity is updated
      2. A service listening to the change log of the datastore registers the entity update and generates event(s)
      3. The event is written to a message bus
  8. If the Command results in Event(s):
    1. Event(s) are pulled off of the message bus and handled
    2. The event(s) are denormalized to relevant data sources (a minimal sketch of this step follows the list)
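
To make that last step concrete, here is a minimal sketch of an event consumer reading events off the bus and upserting the denormalized document that backs the DTO. Topic, collection, and field names are assumptions for illustration, not the sample app's actual code:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.bson.Document;

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;

public class EventDenormalizer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "product-denormalizer");
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        consumer.subscribe(List.of("product-events")); // assumed topic name

        MongoCollection<Document> productDtos = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("cqrs").getCollection("product_dtos"); // read-side projection

        while (true) {
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                // Reshape the event into the document the query side wants, keyed for lookup.
                Document dto = Document.parse(record.value()).append("_id", record.key());
                productDtos.replaceOne(Filters.eq("_id", record.key()), dto,
                        new ReplaceOptions().upsert(true)); // upsert keeps reprocessing idempotent
            }
        }
    }
}
```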

The steps for a data retrieval request (query) are the same as last week:

  1. Http request is made to retrieve entity
  2. Entity is retrieved via key lookup

There are now even more moving pieces, and we may have changed the user experience: we can no longer assume that the data in our data stores is strongly consistent, and we need to take that into account when displaying data to our users.

Comparing this approach to last week’s, there is one main additional drawback:

  • Denormalized data is only eventually consistent

But there are also additional benefits:

  • Commands are handled more quickly and written to the source of truth without having to wait for synchronous denormalization
  • Decoupling between the service handling commands for the source of truth and the services reacting to events that are emitted from the source of truth
  • Resiliency via the message bus, which ensures each message is handled at least once by the denormalizer(s)

Coming up in Part 4: we take a step back and start to discuss some of the topics and design patterns that are required to understand and implement systems like this.


From CRUD to CQRS with Dropwizard (Part 2)

Part 2: Asynchronous Command Handling

This is the second part of a multi-week series of posts describing various implementations of Command Query Responsibility Segregation (CQRS) and Event Sourcing using Dropwizard. Week 1 can be found here. As I mentioned last week, CQRS is a design pattern where commands modifying data are separated from queries requesting it and the data is denormalized into structures matching the Data Transfer Objects (DTO) that these queries are requesting. If you want to learn more about CQRS, read Martin Fowler’s blog post on the subject.

Picking up from last week, here’s the initial CQRS application we built:

This application was entirely synchronous. We didn’t send a response to the client until we had already denormalized and written the changes they requested to all data stores. In contrast, this week’s application looks like this:

[Diagram: this week’s asynchronous CQRS application]

The major difference, as you can see, is that instead of directly processing the commands created from the request, we are writing these commands to a message bus (after validating the request) and then asynchronously reading the commands out of the message bus, handling them, and writing our data. Handling the command consists of retrieving the existing state of the entity, applying the command, and generating an event if there is a delta. This event should encapsulate the change in state of the entity and also include some mechanism for ensuring idempotency (we’ll get into this more in a future post). So for example if a command tells our service to create a product, the event generated upon successful execution of the command would be a ProductCreated event which would contain the data for that new product’s current state.
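
Here is a rough sketch of that command-handling step: apply the command to the current (possibly absent) state and, if there is a delta, emit a ProductCreated event that carries the new state plus an identifier consumers can use for idempotent handling. The class and field names are illustrative, not the sample app's actual types:

```java
import java.util.Optional;
import java.util.UUID;

public class CommandHandlerSketch {
    // Minimal command, entity, and event shapes; real ones would carry more fields.
    record CreateProduct(String sku, String name) {}
    record Product(String sku, String name) {}
    record ProductCreated(String eventId, Product product) {}

    // Handle the command against the existing state; emit an event only if something changed.
    static Optional<ProductCreated> handle(CreateProduct command, Optional<Product> current) {
        if (current.isPresent()) {
            return Optional.empty(); // product already exists: no delta, no event
        }
        Product created = new Product(command.sku(), command.name());
        // A real system might derive this id from the command so retries produce the same event.
        return Optional.of(new ProductCreated(UUID.randomUUID().toString(), created));
    }

    public static void main(String[] args) {
        System.out.println(handle(new CreateProduct("SKU-1", "Silk Gown"), Optional.empty()));
    }
}
```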

I’ve created another small Dropwizard application to demonstrate this pattern using Mongo and Kafka. You can find the code and instructions on how to run it in IntelliJ and send commands to it via Postman here.

The steps for a data update request (command) are:

  1. Http request received to change entity
  2. Request is validated
  3. Request is translated to command(s)
  4. Command(s) are written to message bus
  5. Response is sent to client
  6. Command(s) are pulled off of message bus
  7. Command(s) are handled
    1. Existing entity is retrieved and command(s) are validated
    2. Command(s) are applied to entity and delta is determined
    3. If there’s a delta:
      1. The entity is updated
      2. Event(s) are generated
  8. If the Command results in Event(s):
    1. The event(s) are denormalized to relevant data sources

The steps for a data retrieval request (query) are the same as last week:

  1. Http request is made to retrieve entity
  2. Entity is retrieved via key lookup

There are now certainly more moving pieces and we have changed the user experience. Instead of being able to respond to the client and tell them we’ve completed their request, we can only tell them that their request has been received and looks good to the best of our current knowledge.

Comparing this approach to last week’s, there are some additional drawbacks:

  • Client doesn’t know when command is handled
  • Additional technology: a message bus
  • Additional complexity: writing to and reading from the message bus

But there are also additional benefits:

  • Write requests are validated and response sent to the client more quickly
  • The handling of commands is decoupled from the handling of requests
    • If the format of the request changes tomorrow, the command handling service doesn’t need to change
    • At some point in the future the client could directly write to the message bus or commands could be written from several sources and the command handler wouldn’t change
    • The location of either the request handler or the command handler could change at any time with no need for service discovery or reconfiguration
  • If our message bus can be partitioned (Kafka), then we can horizontally scale out our command handlers and ensure that all commands for a given partition key (e.g., a SKU ID) are handled by the same instance of our command handler.
  • If our message bus persists messages (Kafka) then we can replay requests for debugging or disaster recovery by using another consumer or moving the offset back
    • Transient errors during command handling would automatically be retried with the correct Kafka settings in place.

Coming up in part 3: CQRS with eventually consistent data denormalization.


From CRUD to CQRS with Dropwizard (Part 1)

Part 1: Synchronicity Everywhere

This is the first part of a multi-week series of posts describing various implementations of Command Query Responsibility Segregation (CQRS) and Event Sourcing using Dropwizard. CQRS is a design pattern where commands modifying data are separated from queries requesting it and the data is denormalized into structures matching the Data Transfer Objects (DTO) that these queries are requesting. I’m not going to get deep into the details of CQRS here; if you want to learn more, I highly recommend Martin Fowler’s blog post on the subject. But here is a quick comparison between CRUD (Create, Read, Update, Delete) and CQRS. A typical CRUD application looks like this:

As you can see, there’s a User Interface which writes data to and requests data from an API, which in turn persists and retrieves it from a data store.

In contrast, here’s a basic CQRS application:

[Diagram: a basic CQRS application]

The major difference, as you can see, is that we are separating the source of truth, written to by the API, from the projection, which is read by the API. A denormalizer is used to keep the two in sync. In future weeks we’ll introduce asynchronicity, message buses like Kafka, and eventual consistency. But for our initial purposes, we will assume that this denormalization is done synchronously with the update of the source of truth and prior to the response being sent to the user interface.
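
A minimal sketch of that synchronous write path, assuming Mongo collections for both the source of truth and the projection (names are illustrative, not the sample app's actual code): the source of truth is updated, the projection is denormalized, and only then does the request return.

```java
import org.bson.Document;

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.MongoDatabase;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.ReplaceOptions;

public class SynchronousDenormalization {
    public static void main(String[] args) {
        MongoDatabase db = MongoClients.create("mongodb://localhost:27017").getDatabase("cqrs");
        MongoCollection<Document> products = db.getCollection("products");         // source of truth
        MongoCollection<Document> productDtos = db.getCollection("product_dtos");  // projection read by queries

        String sku = "SKU-1";

        // 1. Update the source of truth.
        Document product = new Document("_id", sku).append("name", "Silk Gown");
        products.replaceOne(Filters.eq("_id", sku), product, new ReplaceOptions().upsert(true));

        // 2. Synchronously denormalize into the shape the query side wants.
        Document dto = new Document("_id", sku).append("displayName", "Silk Gown");
        productDtos.replaceOne(Filters.eq("_id", sku), dto, new ReplaceOptions().upsert(true));

        // 3. Only now would the API send its response back to the user interface.
        System.out.println("write and denormalization complete");
    }
}
```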

I’ve created a small Dropwizard application to demonstrate this pattern using Mongo. You can find the code and instructions on how to run it in IntelliJ and send commands to it via Postman here.

The steps for a data update request (command) are:

  1. Http request is received by API to change entity
  2. Request is translated to command(s)
  3. Command(s) are handled
  4. Existing entity is retrieved and command(s) are validated
  5. Command(s) are applied to entity and delta is determined
  6. If there’s a delta:
    1. The entity is updated
    2. Event(s) are generated
  7. If the Command results in Event(s):
    1. The event(s) are denormalized to relevant data sources
  8. Response is sent to client

And the steps for a data retrieval request (query) are:

  1. Http request is made to retrieve entity
  2. Entity is retrieved via key lookup

As you can see, data changes require a few extra steps but data retrieval is extremely simple. This would still be the case regardless of how many sources of truth must be denormalized and combined to form the document we need for the query.
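
The read side then stays a single key lookup against the projection. A sketch, continuing the assumptions above:

```java
import org.bson.Document;

import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;

public class QuerySketch {
    public static void main(String[] args) {
        MongoCollection<Document> productDtos = MongoClients.create("mongodb://localhost:27017")
                .getDatabase("cqrs").getCollection("product_dtos");

        // The entire query: fetch the pre-built document by key, with no joins or fan-out.
        Document dto = productDtos.find(Filters.eq("_id", "SKU-1")).first();
        System.out.println(dto);
    }
}
```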

However, there are still some drawbacks:

  • Duplicate data storage
  • Transactions across data stores need to be handled by application
  • Denormalization needs to be carefully managed to avoid inconsistent states
  • Writes are slower/more expensive since we are synchronously denormalizing
  • Reads can result in large payloads depending on domain design

However, in some cases these are outweighed by the benefits:

  • Reads are faster and can be optimized separately from writes
  • Since data is stored as key/value, lower level/cheaper data storage can be used
  • Fewer http calls on read since documents are integrated on writes
  • UX doesn’t need to change because consistency model is the same
  • We don’t need message buses (yet!)

Coming up in part 2: CQRS with asynchronous commands


Using Flow to Write More Confident React Apps

Writing your first React app is simple with tools such as Create React App, but as your application and your team grow, it becomes more difficult to write scalable, confident code efficiently. Think about the number of times you pass props around a React application and expect them to look a certain way or be of a certain type, or the times when you’ve passed the wrong type of argument into a function and rendered a page useless. Fortunately, there are many solutions for these preventable problems, and one of the tools we’ve found helpful here at Rent The Runway is Flow.

What Is Flow?

According to Flow’s official website, Flow is a static type checker for JavaScript. After you install Flow, start the Flow background process, and annotate your code, Flow checks for errors while you code and gently alerts you to them. There are even Flow packages available for many popular text editors, which can highlight potential problems and provide explanatory text right beside the code itself.

Many times, Flow can infer how you want your code to work, so you don’t always need to do extra work, but it helps to be certain by actually using the static type annotations. In short, Flow checks how data moves through your app, ensuring that you’re using the types of data that you expect and forcing you to think twice before you reassign a variable to a different type, or try to access a non-existent property of an object.

Basic Example

Here’s a basic example of how to add Flow type checking to a file.

This is a simple component to display and edit data for a customer’s account.

Telling Flow that it should check this file is as simple as including the `// @flow` flag at the top of the file. Almost immediately, Flow will begin checking the file, and you’ll notice red dots in the text editor (Sublime) warning about potential problems. When we hover over those lines, we see additional information about the error at the bottom of the text editor.

After tackling the highlighted lines one by one, we’re left with a nicely annotated file:

Here, we’re doing the bare minimum to silence the red dots, telling Flow that the `constructor` method accepts a `props` argument that should be an object. (Flow allows you to be more specific, which we’ll show in a bit, but this suffices for now). In addition, we’ve explicitly declared the properties and types of each property for the class, which is part of Flow’s Class type checking capabilities. Now, the shape of state and the types of its properties are guaranteed, and we only had to do a bit of annotation for Flow to get started.
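
Since the screenshots aren't reproduced here, a minimal reconstruction of what those annotations might look like follows; the field names and shapes are assumptions, not the actual component:

```js
// @flow
import React, { Component } from 'react';

class CustomerAccountProfile extends Component {
  // Declare the class properties and the shape of state up front.
  props: { customer: { firstName: string, email: string } };
  state: { isEditing: boolean, email: string };

  constructor(props: Object) {
    // The bare minimum: tell Flow that the constructor accepts an object.
    super(props);
    this.state = { isEditing: false, email: props.customer.email };
  }

  render() {
    // Flow will now flag typos like `this.props.customer.firstname`.
    return <div>{this.props.customer.firstName}</div>;
  }
}

export default CustomerAccountProfile;
```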

To be more explicit, we can also declare the types of the properties that an object must have when it is supplied as an argument to a function. Below, on line 15 of the CustomerAccountProfile component (right), we specify that the props must have a customer property.

In that component’s parent, CustomerTabAccount (left), on line 75, you can see that when we try to pass in ‘null’ as a prop, Flow gives us a warning that ‘This type is incompatible with (object type)’.

To go one step further, you could even declare the types of all the properties within props for the component. Now, if you try to reference a property that isn’t declared within the props, Flow will let you know that the property is not found.

Although this is just a sample of the many type checks Flow is capable of, the Flow documentation goes pretty deep into all the possibilities.

Advantages

What’s great about Flow is that it can catch errors early and help development teams communicate better about how components work together within an application. And although React’s PropTypes serve a similar purpose and are arguably simpler to use, they only throw errors at runtime rather than while you’re coding. Plus, since they’re declared below the component, they seem more like an afterthought.

Beyond the basics, adding Flow to your codebase is fairly painless, since it’s opt-in. You tell Flow to analyze a file by including a `// @flow` flag at the top of a file. This will mark it as a Flow file, and the Flow background process will monitor all Flow files. Files without the flag will be ignored, so you get to control which files you check and you can integrate it into a new or existing codebase over time.

It also plays nicely with ES6 and Babel, which strips out all the annotations for production.

Disadvantages

On the other hand, that same opt-in policy leaves room for neglect: less diligent programmers can opt to forgo Flow altogether. In addition, there is a learning curve for type annotations, especially for new developers who aren’t used to type checking their code. What I find to be the bigger challenge is not getting started, but using Flow well. Basic type checking is useful, but Flow offers the ability to write very use-case-specific type checks, such as “Maybe” types, which check optional values, and “Class” types, which allow you to use a class as a type. It’s easy to begin using Flow, but difficult to master all of its capabilities.

We began adopting Flow slowly in one of our retail applications, and as we continue building new features, we continue to convert older components to use Flow. Overall, it has helped us identify bugs sooner rather than later and has enabled easier refactoring because we can be more confident about the arguments functions accept and how different function invocations change a component’s state.

For more information on Flow and getting started, consult the official Flow documentation.

Resources

Official Flow Documentation

React PropTypes Versus Flow