A/B testing and the infinite monkey theorem.

A manager's guide to online experimentation

I know, I know - a chimpanzee is an ape, not a monkey.

If you don’t know what the infinite monkey theorem is, here’s the definition:

“a monkey hitting keys at random for an infinite amount of time will almost surely type the complete works of William Shakespeare.”

What does it have to do with A/B testing, you may ask.

A/B testing helps you find out which of two versions of a website performs better while both are running simultaneously. Half of the visitors see version A (the control group) and the other half see version B (the test group). The results should give a straightforward answer: kill the test version, push it to 100% of visitors, or at least lead us to another action.
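For illustration, here is a minimal sketch of how that 50/50 split is commonly implemented: deterministic hashing of a user ID, so a returning visitor always lands in the same group. The function and experiment names are made up, not taken from any particular tool:

```python
import hashlib

def assign_variant(user_id: str, experiment: str = "homepage-test") -> str:
    """Bucket a user into A (control) or B (test), 50/50, deterministically.

    Hashing the user ID together with the experiment name keeps each
    visitor in the same group across visits and keeps different
    experiments independent of each other.
    """
    digest = hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest()
    return "A" if int(digest, 16) % 100 < 50 else "B"
```

The same user always gets the same variant, which is what makes before/after comparisons between the groups meaningful.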

If you gave a monkey access to an A/B testing platform, it could make random changes and, test after test after test, the website would grow because only the positive changes would be preserved. Hence

a monkey A/B testing at random for an infinite amount of time will almost surely reach the conversion rate of Amazon.

But I’m not a monkey, why do I need A/B testing?

We A/B test because every day is different, unlike in the “Groundhog Day” movie. We don’t have the luxury of waking up after each failure in the same place and the same time and tweaking our actions over and over till we get satisfactory results.

The world around us is changing and our everyday actions, good or bad, will not change bigger trends overnight. If your business is shrinking, a positive action will make your business shrink less, but your business will get smaller anyway. If your business is growing, it will probably be bigger next week even after a bad decision or two.

This means that unless an action is A/B tested we won’t know its impact.

It was nicely explained by AirBnB in one of their blog posts:

“Experiments provide a clean and simple way to make causal inference. It’s often surprisingly hard to tell the impact of something you do by simply doing it and seeing what happens, as illustrated in Figure 1.

Figure 1. - It’s hard to tell the effect of this product launch.

The outside world often has a much larger effect on metrics than product changes do. Users can behave very differently depending on the day of week, the time of year, the weather (especially in the case of a travel company like Airbnb), or whether they learned about the website through an online ad or found the site organically. Controlled experiments isolate the impact of the product change while controlling for the aforementioned external factors.”

Now make a guess. What is the industry average success rate of A/B testing?

It’s 14%. This means that just 1 out of 7 A/B tests yields successful results. We might think we’re all smart, but this benchmark means we’re not much better at website optimisation than monkeys at typing the next Romeo and Juliet.

You may find different benchmarks on the internet - 1 out of 8, 1 out of 10 and similar. Some companies have much higher hit rates. Our recent case study with Play (Iliad Group) reported a 43% success rate over a full year.

How to be the greatest monkey in the biz if infinity is not an option?
1. Be a quick monkey.

Yes, a gorilla is not a monkey either.

Let’s do the math. If it takes 2 weeks per test and just 1 test out of 7 is successful, that’s one winning test every 14 weeks on average. A/B testing is slow. Unless you experiment at scale.

If you look at some of the biggest e-commerce companies, many of them are running tens of A/B tests per month. Here are some numbers reported by them at different stages of growth: Zalando - 80+ experiments, Etsy - 100+, Shop Direct - 100+, Amazon - 160+. Of course, experimentation isn’t just for online shops: Netflix - 80+, or Wix - 1000+ (!)

According to one report, “Organizations running 1-20 experiments per month are, on average, driving 1-4% increases in revenue” and those “running 21 or more [...] are most likely to drive over 14% increase in revenue”. You could interpret this data in different ways - maybe testing velocity translates directly into revenue, or perhaps companies that see higher ROI scale experimentation more quickly. Or maybe they just get better over time.

The currency in which you pay for A/B tests is traffic. The more traffic you have, the more tests you can run at the same time. So if you want to be fast, then never waste the traffic you have.

Of course, most websites don’t have enough traffic to run even 10 tests per month, which makes efficient planning even more important. Unfortunately, it’s surprisingly difficult to do that regardless of your size.

Here are the ground rules that will help you scale your experimentation:

Ground rules:
1. Test ideas are subject to prioritisation, not approval.

Too much decision-making - how not to scale experimentation.

Stop arguing about whether an individual test idea is good or bad. The beauty of A/B testing lies in the fact that you will find out anyway, and the damage in the latter case will be limited. If you and your team have to get approval each and every time, you will never get far.

Instead, keep a list of test ideas and prioritize them. The magic formula is simple:

evidence
x opportunity size
x strategy
= priority

Evidence is anything that makes you sure that the idea is not just wishful thinking. The evidence can be based on data from analytics, surveys, market reports or your own research of the problem. On a scale from 0 to 100% your boss's “I tell you so” counts for 0% and reproducible bugs count for 100%.

Ideally, try to calculate the evidence in an objective way, for example, ask yourself if an issue you are trying to solve was observed in different sources of half-truths:

analytics | surveys | usability testing | QA testing | market trends | experience | evidence

Give -1 for each NO, +1 for each YES and 0 for UNKNOWN. Of course, the list of sources should reflect your process. If you never do usability testing then there is no point in using it for prioritisation.

An example of a survey collecting feedback on test-related evidence.

You can make the evidence score more nuanced by using a scale from -2 to 2 where -2 or -1 mean that the source is negating the issue, 1 or 2 are confirming it and 0 means no observation at all:

analytics | surveys | usability testing | QA testing | market trends | experience | evidence
-1 | 2 | 0 | 0 | 0 | 2 | 25% (3/12)
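As a sketch, the scoring above is easy to automate. This assumes the -2 to 2 scale and the six example sources from the text; the names and numbers are illustrative:

```python
# Ratings per source on the -2..2 scale; values match the example row.
ratings = {
    "analytics": -1,
    "surveys": 2,
    "usability testing": 0,
    "QA testing": 0,
    "market trends": 0,
    "experience": 2,
}

def evidence_score(ratings: dict) -> float:
    """Sum of ratings as a fraction of the maximum (+2 per source)."""
    return sum(ratings.values()) / (2 * len(ratings))

print(f"{evidence_score(ratings):.0%}")  # 3/12 -> 25%
```

Swapping the dictionary for your own list of sources is all it takes to adapt the score to your process.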

After tens of A/B tests you will realise which of those sources are trustworthy and which need some extra work.

Opportunity size is the potential gain minus related costs. If you know that you lose 1 million over a year due to an issue in the checkout and it would take you 5,000 to fix it, then the opportunity size is 995,000.

Calculating the opportunity size is difficult and we tend to overestimate it. Try to be realistic - if you think that choosing a delivery option in checkout needs fixing then the opportunity size can’t be bigger than the total value of carts abandoned after reaching that step. If you want to add another sorting option on your listing, check how many people change the default sorting.

Sometimes, there is no past data that can help you calculate the opportunity size, but you can run A/B tests specifically to test demand and make an estimate.

Strategy is your company’s focus that doesn’t shift daily. Some people say that A/B testing and experimentation reduces the role of HiPPOs (Highest-Paid Person’s Opinions), enforces meritocracy and democratizes decision making. It’s all nice, but your boss and your boss's bosses will always have a major say in what everyone is doing. The role of strategy in the formula is to make sure you work on what they want and in return they don’t change their minds too often.

For example:

We are 80% sure that investing 5000 in doing X will increase revenue by a maximum of 1 million over a year and this is 50% aligned with our strategy of improving customer retention.

Priority = 80% * 1 000 000 * 50% = 400 000

It might be tricky at the beginning to estimate all the parameters, but using this formula you can balance tiny, well researched changes with big bets that can change your fortune.
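The formula itself is a one-liner. Here is a hedged sketch using the worked example above (the function name is ours, not from any framework):

```python
def priority(evidence: float, opportunity_size: float, strategy: float) -> float:
    """priority = evidence x opportunity size x strategy.

    `evidence` and `strategy` are fractions (0.0-1.0); `opportunity_size`
    is money over the chosen period.
    """
    return evidence * opportunity_size * strategy

# The worked example: 80% sure, 1,000,000 opportunity, 50% strategic fit.
print(priority(0.80, 1_000_000, 0.50))  # 400000.0
```

Running it over your whole idea backlog gives you a ranked list instead of a debate.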

If you don’t like the formula proposed above, you can write your own or check other frameworks such as PIE (Potential, Importance, Ease), PXL, or ICE (Impact, Confidence, Ease). For some reason, they all avoid using financial (or any other) assumptions that could actually be verified in the test. Without them, the feedback loop is limited and you will find it hard to improve your estimates over time.

Oh, and remember, the worst idea on the list should still get tested if resources (and traffic) are available. If you don’t want to test it, get new ideas, but use up all your resources and traffic to keep high velocity.

Obviously, you can’t run too many concurrent tests in the same part of your website since at some point they will either start interfering with each other or, if you isolate each test, your sample size will shrink too much to get statistically significant results.
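To get a feel for why traffic is the constraint, here is a rough sample-size estimate per group using the standard two-proportion z-test approximation - a sketch, not a substitute for your testing tool’s calculator:

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_base: float, mde: float,
                          alpha: float = 0.05, power: float = 0.8) -> int:
    """Rough number of visitors per group needed to detect an absolute
    lift `mde` over a baseline conversion rate `p_base` (two-sided z-test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 5%
    z_power = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    p_test = p_base + mde
    variance = p_base * (1 - p_base) + p_test * (1 - p_test)
    return math.ceil((z_alpha + z_power) ** 2 * variance / mde ** 2)

# Detecting a lift from 3.0% to 3.6% conversion takes roughly 14,000
# visitors per group - split your traffic across many isolated tests
# and no single test ever reaches that number.
print(sample_size_per_group(0.03, 0.006))
```

Halving the traffic available to a test roughly doubles how long it must run, which is exactly why isolating many concurrent tests on the same pages gets expensive.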

Since you don’t want to waste any traffic, you should work concurrently on tests in different parts of your website. This means that in practice your prioritisation needs to factor that in.

Ground rules:
2. Accept the fact that things will go wrong.

Self-scaling experimentation.

One of the goals of A/B testing is to de-risk your choices. If you deploy all your changes as tests you will know exactly which change is causing you problems and you can quickly turn it off if you need to. That's a luxury you don’t have if you release all your changes at once every Monday.

Once you adopt A/B testing you should relax, and let the team work on the backlog. Mistakes will surely happen, but A/B testing is a smart way to make stupid things.

One more thing. If you have a winner and you are using an external A/B testing tool, don’t take the test down while you wait for your team to implement it directly in the website’s code. Instead, push it to 100% and take it down only once the feature has been recreated internally. Some tools, including UseItBetter, allow you to implement unlimited changes and are very efficient at serving them.

How to be the greatest monkey in the biz if infinity is not an option?
2. Cheat.

I mentioned before that only 1 out of 7 tests is a winner, so what about the other 6? Well, 5 of them will be inconclusive. They are pure evil because you don’t learn anything from them.

Here are some frequent reasons why tests are inconclusive:

  • a) too few users were using the changed feature to get statistical significance,
  • b) the changed feature had little to do with the metrics used to evaluate the test,
  • c) there were multiple changes in the same test and they cancelled each other out,
  • d) the test hypothesis was incorrectly formulated.

In order to run a successful experimentation programme you have to accept that A/B testing is not about making more money. You A/B test to learn what doesn’t work, what works and how well. This means you can successfully run tests... that have no chance of success.

Here are a few cheats that will make a world of difference for you.

Experiment to learn the opportunity size.

There are plenty of blog posts telling you that website loading times matter, citing case studies where reducing loading time by 1 second increased conversion by X and brought billions of dollars. Ask your developers how much time they would need to reduce the loading time by one second. Then ask how much time they would need to slow it down by one second. You’ll find out that slowing down a website is much easier than making it faster - it’s basically a single line of code.

So here’s some advice. Run an A/B test that slows down your website. If slowing down your website does not yield significantly worse results, don’t waste time on speeding it up. If it does harm your revenue, you have quite a precise estimate of the opportunity size for your prioritisation formula.  
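As a sketch of that “single line of code”, assuming a Python request handler (the handler itself is hypothetical; in production the delay would live in server middleware or a small script on the page):

```python
import time

def render_page(variant: str, added_delay_s: float = 1.0) -> str:
    """Stand-in for a real request handler; serves the test group a slower page."""
    if variant == "B":               # test group
        time.sleep(added_delay_s)    # the 'single line of code' slowdown
    return "<html>...</html>"        # placeholder page content
```

If variant B’s revenue holds up, speeding the site up is unlikely to be worth the engineering time; if it drops, the size of the drop is your opportunity estimate.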

One change per test. Order matters.

Two ways to run a test - lots of work or little work?

Let’s say that you would like to add videos to all the products on your website. You have hundreds of categories in your store so you decide to test it first on shoes, to see if the idea is worth your time. Now you only need to produce videos, upload them, make sure you can handle increased bandwidth, then add links on your website. After three months you’re ready to launch the test. Two weeks later you get the results and they are… inconclusive.

Now, imagine a different approach. Put a link to a video on every product category on your website, but don’t waste time producing the videos. Instead, add a message - ‘sorry, videos are coming soon!’ - or even trigger a survey asking how useful videos would be.

You will do it in one afternoon, and two weeks later you will get the results: X people clicked links to watch videos of shoes and Y% of them converted anyway, despite the fact that they never actually saw the videos. The percentage of people who didn’t convert is your opportunity size.

If you design a test this way, you may find out that a lot more users are trying to see videos of washing machines than shoes.

Again, the test didn’t increase your revenue, but you’ve learned the opportunity size and either saved yourself unnecessary troubles, or confirmed that testing the videos actually makes sense.
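A back-of-the-envelope sketch of that estimate, with made-up numbers:

```python
def video_opportunity(clicked: int, converted: int, avg_order_value: float) -> float:
    """Upper bound on the revenue at stake behind the missing videos.

    Treats everyone who clicked the dummy link but didn't convert as
    potentially recoverable, so it overstates the true size - which is
    acceptable for prioritisation. All numbers below are illustrative.
    """
    return (clicked - converted) * avg_order_value

# 2,000 clicks on 'watch video', 600 converted anyway, 50 average order value:
print(video_opportunity(2000, 600, 50.0))  # 70000.0
```

Feeding that number into the prioritisation formula tells you whether producing the videos deserves a slot in the backlog at all.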

Use the hypothesis you can actually validate

Instead of asking yourself if adding videos increases conversions, you can form a hypothesis that is much easier to validate: “There are enough people who want to see videos of shoes/dresses/washing machines for us to produce videos”. If you add links and people don’t click them, consider another iteration of the test with a different link placement - maybe users simply haven’t noticed them. If you can’t get people to click a link to watch videos, the chances are slim that producing the videos will make them convert.

You may ask if your users will be happy that you test dummy features on them. Put yourself in their shoes - would you prefer your favourite company to waste time on features that don’t matter to you or move on quickly to fixing things that matter?

Remember, the reward you get from A/B testing is knowledge not revenue. The revenue will come as a result of applied knowledge.

How to be the greatest monkey in the biz if infinity is not an option?
3. Don’t be a monkey.

Again, not a monkey. It's Mr William Shakespeare.

If 1 test out of 7 wins and another 5 are inconclusive, then what about the 1 test that fails?

According to this survey 3 out of 4 companies (that are A/B testing) make changes based on intuition or best practices. That’s not much evidence, is it?

Let’s forget about monkeys for a moment and talk about gnomes. The South Park gnomes. They famously decided to collect underpants to make profit. They too, probably, followed intuition and best practices:

Monkeys have tails. These creatures don't.

You launch an A/B test, the result shows no sign of profit, and the idea gets killed. But one failed test doesn’t make collecting underpants a bad idea. Perhaps it was the execution that was flawed.

A/B testing workflow that doesn't work too well.

Running a test that returns a negative result, as opposed to an inconclusive one, is actually good news. At least you know you struck a chord with your audience - you broke something that was important enough to drive your revenue down. Some people celebrate failures and post motivational quotes on LinkedIn encouraging people to fail more. But there’s no value in failures if you don’t learn from them, and you don’t learn if you don’t prepare for failure.

Some years ago, a friend who worked at Spotify shared what their A/B testing flow looked like:

Example of an A/B testing workflow at Spotify, courtesy of @bendressler.

As you can see from this example, Spotify used both surveys and analytics - declarative and behavioural data - to explain why the results were negative. They even went the extra mile by contacting some of those users to make sure they had nailed the problem.

Post-test analysis is usually much easier than pre-test research since you can compare data from the test and control group and focus on differences. Survey answer X, error Y, behaviour Z are more frequent in the test group and you dig in to find out why. Still, the analyses often take more time than the test implementation. Therefore, teams - which usually have fewer analysts than developers/designers - tend to ignore that step. Don’t do this. The real price you pay for not researching why tests fail is the death of great ideas (like collecting underpants).

Integrated A/B testing workflow at UseItBetter.

Here’s how we believe the testing workflow should look, and we’ve designed UseItBetter to support it by integrating an A/B testing engine with highly detailed analytics and surveys. Having everything in one place, sharing segmentation rules and data, makes the job much faster. You really have no excuse not to find out what happened in a test, regardless of the results.

Researching successful tests is as important as researching failed tests. If something worked well, did it work because our hypothesis was correct or because of a side-effect? Also, iterating successful tests is the easiest way to grow. Ask an angler - you never leave a spot where you just caught a fish.

How to be the greatest monkey in the biz if infinity is not an option?
4. Don’t be a theorist either.

An analyst researching for an infinite amount of time will almost surely get your A/B testing to a 100% hit ratio.

You don’t want this either. Gathering enough evidence to be sure that a change will work takes ages. It may take you 1h to be 50% sure, 4h to be 90% sure, a week to get to 99% but you might never be 100% sure. And you shouldn’t need to be. 

A/B testing encourages you to move quickly, because every bad decision you make can be relatively easily spotted and reversed hours, days or at worst weeks after a release.

Let’s summarise. If you are going to A/B test:

  • 1. Never waste your traffic.
  • 2. Many small changes are better than one big change.
  • 3. Even the smallest change needs an insight.
  • 4. Prepare for failure.
  • 5. Failure is pointless if you don't know why you failed.
  • 6. Iterate.
  • 7. Be honest.

How to be an honest monkey?

Some obvious truths - don’t argue with data, follow up on results etc. But you should also look beyond individual tests and measure the efficiency of your experimentation programme. This is really a subject for another blog post, but I think there are some key metrics you should look at, and you may find them surprising.

What percentage of changes to your website are tested? This is a more meaningful metric than the number of tests run per month. Some (most) teams tend to A/B test easy things and deploy difficult changes without testing. What’s the point of testing colours of buttons ten times per month if you change the payment provider without testing?

What percentage of tests led you to an action? If a variant is a winner, was it deployed to 100%? Was a loser taken down? If not, was that result useful in any other way?

What percentage of tests were inconclusive? If you have few conclusive results, you are probably not working on important stuff. Of course, sometimes a statistically insignificant difference is a good thing - if you are forced to make a change, then all you need is for it to perform as well as the previous version.

Finally, how much money have you saved thanks to testing? A common practice is to focus on the uplift from winning variants, but the undisputed benefit of A/B testing is avoiding losses. If you deployed a winner without testing, it would still bring you the revenue - even more, actually, because it would reach 100% of users right away. However, if you deployed a loser without testing, you probably wouldn’t be able to revert your decision.

Anything else?

I intentionally haven’t mentioned any statistical pitfalls, p-value hacking, or Sample Ratio Mismatch (SRM) because they are technicalities - critical ones, but technicalities nonetheless. If you want to do A/B testing, you have to get them right.

About the author:

Łukasz Twardowski is the founder of UseItBetter, a platform for data-driven web development.

Łukasz combines engineering, design and data skills necessary to build great online experiences. His work was recognised by prestigious competitions (Cannes Lions, SXSW, The FWA) and featured in publications around the world (Taschen, Web Design Mag).

He delivered analytics and optimisation tools and services to the largest online brands in e-commerce (Shop Direct), telecom (T-Mobile, Iliad), financial (AXA), entertainment (Spotify), and gaming (Kabam) industries which gave him a chance to learn from the best.

Łukasz is a former chess player, who ranked 5th in the youth World Championship. You may try to beat him on Lichess.org.