Background
A/B testing is a tool we rely on heavily to tell us whether new user-facing features, or improvements to existing ones, add business value. The change could be as simple as the placement of a calendar on a page or as complex as an entirely new Checkout workflow. In both cases, it's worth knowing whether the change will affect the business positively or negatively.
In order to configure these experiments and to randomize which user receives which treatment (e.g. original Checkout page vs new Checkout page), we've written our own in-house tool. In fact, it's gone through three incarnations.
First Generation
The first approach was fairly straightforward:
A user logs into our website and a list of experiments is retrieved
For each experiment, if the user had not yet been randomly assigned a treatment, one would be assigned, and the mapping would be stored in a database
Any time the user visited the site in the future, this mapping would be retrieved to determine which treatment they should receive
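The steps above amount to a sticky, assign-once lookup. A minimal sketch of that logic, with an in-memory map standing in for the database table (class and method names are illustrative, not our actual code):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ThreadLocalRandom;

// First-generation approach: a treatment is randomized once per user and
// persisted, so every later visit simply looks up the stored mapping.
public class StickyAssigner {
    // Stand-in for the database table of user -> treatment rows.
    private static final Map<String, String> assignments = new ConcurrentHashMap<>();
    private static final String[] treatments = {"A", "B"};

    // Return the stored treatment, or randomize and persist one on first visit.
    public static String treatmentFor(String userId) {
        return assignments.computeIfAbsent(userId,
            id -> treatments[ThreadLocalRandom.current().nextInt(treatments.length)]);
    }

    public static void main(String[] args) {
        String first = treatmentFor("user-42");
        // Every later visit sees the same assignment.
        System.out.println(first.equals(treatmentFor("user-42"))); // prints "true"
    }
}
```

Because each user's row must be written on first visit and read on every subsequent one, the storage and lookup cost grows with users times experiments, which is exactly the scaling problem described below.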
This first attempt has both pros and cons:
Pros
Able to control treatment assignments to users on a fine-grained level
If the ratio of users that should receive treatment A versus B changes in the future, it does not affect existing users
Cons
Does not scale well once you have millions of users and hundreds of experiments
Each time a unique user visits the website, their assignments have to be loaded from the database and then cached
Doesn't handle the scenario where experiments should also apply to anonymous users who haven't logged in
Storing a treatment-assignment row for every session id isn't feasible
Some arguments could be made that fine-grained control of treatment allocation isn't needed for A/B testing, since after all, it's supposed to be randomized testing. With this argument in mind, the second incarnation was born.
Second Generation
With some new ideas of how the problem of A/B testing could be approached, the second version of our testing framework was created. There were also additional requirements that the old system did not meet, such as supporting anonymous users.
The new approach was as follows:
Experiments are configured by specifying what ratio of users should receive a given set of treatments
These treatments are assigned to a series of 'bins'
When a user or guest accesses the site, their identifier is hashed to a number, which determines their 'bin'
Depending on what treatment was assigned to that 'bin', the user receives that treatment
As an example, you could have the first 50 bins assigned to A and the second 50 bins assigned to B; if a user lands in bin 75, they receive B
A given user will always hash to the same 'bin' number; the hash is also randomized by a seed value configured on each experiment
To ensure that the same user always receives the same treatment, the user's userId is hashed. In the case of an anonymous user, the sessionId is hashed
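The hashing scheme above can be sketched in a few lines. This is an illustrative simplification (using Java's built-in String hash rather than whatever hash function the real service used), but it shows the key property: no per-user state, and the same id plus seed always lands in the same bin.

```java
// Second-generation approach: hash the user's id (or session id) together
// with the experiment's seed, and take the result modulo the bin count.
// Bins 0-49 map to treatment A, bins 50-99 to treatment B.
public class BinAssigner {
    static final int NUM_BINS = 100;

    // Deterministic: the same id and seed always produce the same bin.
    public static int binFor(String id, int seed) {
        return Math.floorMod(id.hashCode() ^ seed, NUM_BINS);
    }

    public static String treatmentFor(String id, int seed) {
        return binFor(id, seed) < 50 ? "A" : "B";
    }

    public static void main(String[] args) {
        int seed = 12345; // configured per experiment
        // Same inputs always yield the same treatment, with no database lookup:
        System.out.println(treatmentFor("user-7389332", seed)
            .equals(treatmentFor("user-7389332", seed))); // prints "true"
    }
}
```

Since the bin is recomputed from the id on every request, nothing needs to be stored or cached per user, which is what lets this design scale to millions of users.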
The experiment service met most of the needs for configuring and running experiments, but it still left some things to be desired:
Being able to override what treatment a user is assigned, mainly for QA testing, which the old system allowed easily
Ease of configuration
There were a lot of quirks with how the experiments were configured and where they were stored
Experiments were stored in Redis, but in a separate instance on each host running the experiment service
Each time an experiment had to be configured, it needed to be configured on N machines
When configuring treatment allocations, each bin had to be explicitly assigned a treatment, rather than just blocks of treatments
Whenever a new experiment needed to be added, an entry had to be hard-coded into a Java file
We wanted to open source the experiment service for the community to use and improve
As a result of these items, Alchemy was written: an OpenSource A/B testing framework that makes configuration of experiments simple and that runs on a time-proven RESTful framework, Dropwizard.
Third Generation
So, how did we do with our list of things to be desired?
1. Being able to override what treatment a user is assigned
This is now supported in Alchemy through 'treatment overrides'
Uses simple filter expressions to specify predicates like "user_id=7389332" for matching which users should receive a treatment override
Should be used sparingly, namely, for QA testing purposes, since each 'treatment override' expression has to be evaluated any time a user retrieves their treatments
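A rough sketch of how such a "key=value" override filter might be evaluated against a user's attributes (illustrative only; Alchemy's real expression syntax and API may differ):

```java
import java.util.Map;
import java.util.Optional;

// Treatment overrides: a filter expression is checked against the user's
// attributes on every request, and a match forces a specific treatment.
public class OverrideFilter {
    // Parse a "key=value" expression and test it against the attributes.
    public static boolean matches(String expression, Map<String, String> attrs) {
        String[] parts = expression.split("=", 2);
        return parts.length == 2 && parts[1].equals(attrs.get(parts[0]));
    }

    public static Optional<String> overrideFor(Map<String, String> attrs) {
        // Every override expression is evaluated on every treatment lookup,
        // which is why overrides should be used sparingly.
        if (matches("user_id=7389332", attrs)) {
            return Optional.of("B"); // force this QA user into treatment B
        }
        return Optional.empty(); // fall back to normal allocation
    }

    public static void main(String[] args) {
        System.out.println(overrideFor(Map.of("user_id", "7389332"))); // prints "Optional[B]"
        System.out.println(overrideFor(Map.of("user_id", "1")));       // prints "Optional.empty"
    }
}
```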
2. Ease of configuration
Alchemy uses a simple allocation model
You no longer deal with bins; you deal with allocations of treatments
Treatments are allocated, deallocated or reallocated with given amounts
For example, to assign all 100 bins, one could allocate 50 to A and 50 to B, and that's it
Reallocation allows you to reassign a portion of users receiving one treatment to another treatment without caring which 'bins' they were actually assigned to
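The allocation model above can be sketched as bookkeeping over counts rather than individual bins (the class and method names here are illustrative, not Alchemy's actual API):

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Allocation model: each treatment owns a count of the 100 available bins.
// Reallocation moves a count from one treatment to another without ever
// referring to specific bin numbers.
public class Allocations {
    static final int TOTAL = 100;
    private final Map<String, Integer> allocated = new LinkedHashMap<>();

    public int unallocated() {
        return TOTAL - allocated.values().stream().mapToInt(Integer::intValue).sum();
    }

    public void allocate(String treatment, int amount) {
        if (amount > unallocated()) throw new IllegalArgumentException("not enough free bins");
        allocated.merge(treatment, amount, Integer::sum);
    }

    public void deallocate(String treatment, int amount) {
        int have = allocated.getOrDefault(treatment, 0);
        if (amount > have) throw new IllegalArgumentException("not enough allocated");
        allocated.put(treatment, have - amount);
    }

    public void reallocate(String from, String to, int amount) {
        deallocate(from, amount);
        allocated.merge(to, amount, Integer::sum);
    }

    public int allocatedTo(String treatment) { return allocated.getOrDefault(treatment, 0); }

    public static void main(String[] args) {
        Allocations a = new Allocations();
        a.allocate("A", 50);
        a.allocate("B", 50);
        a.reallocate("A", "B", 10); // shift a tenth of users from A to B
        System.out.println(a.allocatedTo("A") + " " + a.allocatedTo("B")); // prints "40 60"
    }
}
```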
Easy to use REST interface
By leveraging Dropwizard, which uses Jersey and Jetty, it's easy to spin up a REST service for configuring experiments
Experiments are stored in a single place
The current implementation supports MongoDB but can easily be extended to support other databases
Storing experiments in a single place means there's only one endpoint to configure experiments on for an entire cluster of hosts
Database is read as little as needed -- all experiment configurations are cached
Caching is highly configurable
3. Open source
Alchemy is now open source and available from https://github.com/RentTheRunway2/alchemy
Artifacts are available from the Maven Central Repository