A/B Testing

Experimenting with Alchemy


A/B testing is a tool we rely on heavily to tell us whether new features, or improvements to existing ones, add business value.  It could be something as simple as the placement of a calendar on a page, or as complex as an entirely new Checkout workflow.  In both cases, it's worth knowing whether these changes will affect the business positively or negatively.

In order to configure these experiments and to randomize which user receives which treatment (e.g. original Checkout page vs new Checkout page), we've written our own in-house tool.  In fact, it's gone through three incarnations.


First Generation

The first approach was fairly straightforward:

  1. A user logs into our website and a list of experiments is retrieved
  2. For each experiment, if the user has not yet been randomly assigned a treatment, they are assigned one, and this mapping is stored in a database
  3. Any time they visit the site in the future, this mapping is retrieved and determines which treatment the user receives
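The steps above can be sketched as follows. This is a minimal illustration, not the original implementation: the in-memory map stands in for the database table, and all class and method names are hypothetical.

```java
import java.util.Map;
import java.util.Random;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the first-generation approach: on a user's first visit, pick a
// treatment at random and persist the mapping so every later visit returns
// the same treatment. The map stands in for the database lookup/store.
public class FirstGenAssigner {
    private final Map<String, String> assignments = new ConcurrentHashMap<>();
    private final String[] treatments;
    private final Random random = new Random();

    public FirstGenAssigner(String... treatments) {
        this.treatments = treatments;
    }

    // Assign a random treatment on first visit; return the stored one after.
    public String treatmentFor(String userId) {
        return assignments.computeIfAbsent(
            userId, id -> treatments[random.nextInt(treatments.length)]);
    }
}
```

The key property is stickiness: once a user is assigned, the stored row wins, even if allocation ratios change later.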

This first attempt has both pros and cons:

  • Pros
    • Able to control treatment assignments to users at a fine-grained level
    • If the ratio of users that should receive treatment A versus B changes in the future, existing users are unaffected
  • Cons
    • Does not scale well once you have millions of users and hundreds of experiments
    • Each time a unique user visits the website, their assignments have to be loaded from the database and then cached
    • Doesn't handle the scenario where experiments should also apply to anonymous users who haven't logged in
      • Storing a treatment-assignment row for every session id isn't feasible

An argument could be made that fine-grained control of treatment allocation isn't needed for A/B testing, since it is, after all, supposed to be randomized testing.  With this argument in mind, the second incarnation was born.

Second Generation

With some new ideas of how the problem of A/B testing could be approached, the second version of our testing framework was created.  There were also additional requirements that the old system did not meet, such as supporting anonymous users.

The new approach was as follows:

  1. Experiments are configured by specifying what ratio of users should receive a given set of treatments
  2. These treatments are assigned to a series of 'bins'
  3. When a user or guest accesses the site, their information is hashed to a number, which is then assigned to a 'bin'
  4. Depending on what treatment was assigned to that 'bin', the user receives that treatment
    1. As an example, you could assign the first 50 bins to A and the second 50 bins to B; if a user lands in bin 75, they receive B
    2. A given user will always hash to the same 'bin' number, which is also randomized by a seed value configured on each experiment
    3. To ensure that the same user always receives the same treatment, the user's userId is hashed.  In the case of an anonymous user, the sessionId is hashed
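The hashing scheme above can be sketched in a few lines. This is an illustration of the technique, assuming a simple string hash mixed with the experiment's seed; the actual hash function and names in the second-generation service may differ.

```java
// Sketch of second-generation bin assignment: hash the userId (or sessionId
// for anonymous users) together with a per-experiment seed into one of N
// bins, then map contiguous bin ranges to treatments.
public class BinAssigner {
    public static int binFor(String id, int seed, int numBins) {
        // Mixing in the seed means each experiment randomizes users
        // independently: the same user lands in different bins per experiment.
        int h = (id + ":" + seed).hashCode();
        return Math.floorMod(h, numBins);
    }

    // Example mapping: the first `cutoff` bins receive A, the rest receive B.
    public static String treatmentFor(String id, int seed, int numBins, int cutoff) {
        return binFor(id, seed, numBins) < cutoff ? "A" : "B";
    }
}
```

Because the hash is deterministic, no per-user state needs to be stored: the same id and seed always produce the same bin, which is what removes the database bottleneck of the first generation.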

For the most part, this experiment service met the needs for configuring and running experiments.  Still, there was a list of things left to be desired:

  • Being able to override what treatment a user is assigned, mainly for QA testing, which the old system allowed easily
  • Ease of configuration
    • There were a lot of quirks in how experiments were configured and where they were stored
      • Experiments were stored in Redis, but in a separate instance per host running the experiment service
      • Each time an experiment was configured, it had to be configured on N machines
      • When configuring treatment allocations, each bin had to be explicitly assigned a treatment, rather than blocks of bins
      • Whenever a new experiment was added, an entry had to be hard-coded into a Java file
  • We wanted to open-source the experiment service for the community to use and improve

As a result of these items, Alchemy was written: an open-source A/B testing framework that makes experiment configuration simple and runs on a time-proven RESTful framework, Dropwizard.

Third Generation

So, how did we do with our list of things to be desired?

1. Being able to override what treatment a user is assigned

  • This is now supported in Alchemy through 'treatment overrides'
  • Uses simple filter expressions to specify predicates like "user_id=7389332" for matching who should receive which treatment override
  • Should be used sparingly, namely, for QA testing purposes, since each 'treatment override' expression has to be evaluated any time a user retrieves their treatments
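A treatment override can be illustrated with a toy implementation. The "key=value" expression handling here is a simplified stand-in for Alchemy's filter expressions; the class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Sketch of treatment overrides: before normal hashed assignment, check
// whether any configured filter expression (e.g. "user_id=7389332")
// matches the user's attributes, and if so force that treatment.
public class OverrideSketch {
    private final Map<String, String> overrides = new LinkedHashMap<>();

    // Register a filter expression that forces a treatment, e.g. for QA.
    public void addOverride(String expression, String treatment) {
        overrides.put(expression, treatment);
    }

    // Return the forced treatment for the first matching expression,
    // or null to fall back to the normal hashed assignment.
    public String overrideFor(Map<String, String> userAttributes) {
        for (Map.Entry<String, String> e : overrides.entrySet()) {
            String[] kv = e.getKey().split("=", 2);
            if (kv.length == 2 && kv[1].equals(userAttributes.get(kv[0]))) {
                return e.getValue();
            }
        }
        return null;
    }
}
```

This also shows why overrides should stay rare: every registered expression is evaluated on every treatment lookup, so a long override list adds linear cost to the hot path.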

2. Ease of configuration

  • Alchemy uses a simple allocation model
    • You no longer deal with bins; you deal with allocations of treatments
    • Treatments are allocated, deallocated, or reallocated in given amounts
      • For example, to assign all 100 bins, one could allocate 50 to A and 50 to B, and that's it
      • Reallocation allows you to reassign a portion of users receiving one treatment to another treatment without caring which 'bins' they were actually assigned to
  • Easy-to-use REST interface
    • By leveraging Dropwizard, which uses Jersey and Jetty, it's easy to spin up a REST service for configuring experiments
  • Experiments are stored in a single place
    • The current implementation supports MongoDB but can be easily extended to support other databases
    • A single store for experiments means only one endpoint to configure for an entire cluster of hosts
    • The database is read as little as needed -- all experiment configurations are cached
    • Caching is highly configurable
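The allocation model can be sketched as operations over the 100 underlying bins. This is an illustrative toy, not Alchemy's implementation: the point is that the operator only ever names treatments and amounts, never individual bins.

```java
import java.util.Arrays;

// Sketch of the allocation model: allocate/deallocate/reallocate treatments
// in given amounts over 100 bins, without exposing bin indices to the user.
public class AllocationSketch {
    private final String[] bins = new String[100]; // bin index -> treatment

    // Claim `amount` currently-unassigned bins for a treatment.
    public void allocate(String treatment, int amount) {
        for (int i = 0; i < bins.length && amount > 0; i++) {
            if (bins[i] == null) { bins[i] = treatment; amount--; }
        }
    }

    // Move `amount` bins from one treatment to another, wherever they sit.
    public void reallocate(String from, String to, int amount) {
        for (int i = 0; i < bins.length && amount > 0; i++) {
            if (from.equals(bins[i])) { bins[i] = to; amount--; }
        }
    }

    public long count(String treatment) {
        return Arrays.stream(bins).filter(treatment::equals).count();
    }
}
```

For the example in the text: allocating 50 to A and 50 to B fills all 100 bins, and a later reallocation of 10 from A to B shifts that share of users without anyone specifying which bins move.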

3. Open source