Measuring Usability
Quantitative Usability, Statistics & Six Sigma by Jeff Sauro

10 Things to Know about A/B Testing

Jeff Sauro • November 27, 2012

 Whether you're new to A/B testing or a seasoned practitioner, here are 10 things you should know about this essential method for quantifying the user experience.
  1. A and B refer to Alternate Versions: A/B testing is often called split-testing because you split a sample of users: half use one version, arbitrarily called A, and the other half use the other version, arbitrarily called B.  You can split-test comparable designs, competing products or old applications versus new applications. When possible you should split and test simultaneously instead of sequentially (such as running treatment A for one week and then B the next week), as seasonal variation, holidays, weather and all sorts of other undesirable variables can affect your results.

  2. Anything Can Be Split-Tested:  A/B testing is associated with websites using products like Google Website Optimizer.  However, A/B testing can be conducted on desktop software or physical products. I think of A/B testing more generally and apply it along WITH, not instead of, many other user research methods. In fact, A/B testing is at the heart of the scientific method, with a rich history of researchers randomly assigning different treatments to patients, animals and almost anything else for hundreds of years.

    More recently, split testing gained traction in direct mail marketing when different versions of printed material were sent to subsets of long lists of physical addresses.  With physical marketing material you have the real costs of printing and shipping. By splitting the list you could see which image, tagline or envelope resulted in more sales or inquiries.  With electronic campaigns there's little to no direct cost, just the opportunity cost of using the less optimal campaign.  As with direct mail, by splitting the media into smaller pieces (headlines, links, buttons, pictures, pages, form fields) and testing each, you can identify which elements increase the intended Key Performance Indicators (KPIs): signups, purchases, calls or just task completion.

  3. Understanding Chance: A fundamental principle when working with any subset or sample of a larger user population is the role of random chance.  Just because you observe 5% of users purchase with treatment A and 7% purchase with treatment B doesn't necessarily mean that when all users are exposed to treatment B, more users, or exactly 7%, will purchase.  Statistical significance, now part of our lexicon, tells us whether the difference we observed is actually greater than what we'd expect from chance fluctuations in the sample we selected.

  4. Determining Statistical Significance: To determine if two conversion rates (which are proportions expressed as percentages) are statistically different, use the A/B test calculator. It uses a statistical procedure called the N-1 Chi-Square test. It's a slight modification of the more common Chi-Square test taught in introductory statistics classes, but it has been shown to work well for both large (>10,000) and small (<10) sample sizes.  For example, if 100 out of 5000 users click through on an email (2% conversion rate) and 70 out of 4800 click through on a different version (1.46% conversion rate), the probability of obtaining a difference this large or larger, if there really was no difference, is 4% (p-value = .04).  That is to say, it's statistically significant: you just don't see differences this large very often from chance alone. A small code sketch of this calculation follows the technical note below.

    Technical Note: When sample sizes get small (expected cell counts less than 1) the calculator uses the Fisher Exact Test, otherwise it uses the N-1 Chi-Square test, which is equivalent to the N-1 Two Proportion test that we teach in our courses.  Some calculators will use just the Chi-Square test or Z-test (often called the normal approximation to the binomial) which generally work fine as long as sample sizes are reasonably large (and expected cell counts are large, usually above 5). See Chapter 5 in Quantifying the User Experience for a more detailed account of the formula.
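
    If you want to verify the arithmetic behind the calculator, here is a minimal sketch of the N-1 two-proportion test in Python. The function name and the use of scipy are my own illustrative choices, not the calculator's actual code.

```python
from math import sqrt
from scipy.stats import norm

def n_minus_1_two_proportion_test(x1, n1, x2, n2):
    """N-1 two-proportion test (equivalent to the N-1 Chi-Square test).
    Returns the z statistic and the two-sided p-value."""
    p1, p2 = x1 / n1, x2 / n2
    n = n1 + n2
    pooled = (x1 + x2) / n                       # pooled conversion rate
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se * sqrt((n - 1) / n)       # the N-1 adjustment
    return z, 2 * norm.sf(abs(z))                # two-sided p-value

# Example from the text: 100/5000 (2%) vs. 70/4800 (about 1.46%)
z, p = n_minus_1_two_proportion_test(100, 5000, 70, 4800)
print(round(z, 2), round(p, 3))                  # roughly z = 2.05, p = 0.04
```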

  5. Confidence and P-Value:  Many calculators, including ours, often convey statistical significance as confidence. This is usually done by subtracting the p-value from 1. For example, the p-value from the earlier example was 0.04 which gets expressed as 96% confidence.  While the p-value and confidence level are different things, in this context, little harm comes from thinking of them in the same way (just be prepared for the more technically minded to call you out on it). The p-value is what you get after a test is run and tells you the probability of obtaining a difference that large if there really was no difference, while the confidence level is what you set before the test and affects the confidence interval around the difference [see below].

  6. Use a Two-Sided P-value: Many calculators, including ours, provide one- and two-tailed p-values, also expressed as confidence. When in doubt, use the 2-sided p-value. You should only use the 1-sided p-value when you have a very strong reason to suspect that one version is really superior to the other.  See Chapter 9 in Quantifying the User Experience for a more detailed discussion of the issue of 1- versus 2-tailed p-values.
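
    As a quick illustration of how these numbers relate (this is simple bookkeeping under the symmetric z/chi-square-style test above, not a separate statistical procedure):

```python
p_two_sided = 0.04                 # from the email example in point 4
confidence = 1 - p_two_sided       # reported as "96% confidence"
p_one_sided = p_two_sided / 2      # only valid if you predicted the direction in advance
print(confidence, p_one_sided)     # 0.96 and 0.02
```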

  7. Sample Sizes: As with every statistical procedure, one of the most common questions is "What sample size do I need?"  For A/B testing, the "right" sample size comes largely down to how large a difference you want to be able to detect, should one exist at all. The other factors are the level of confidence, power and variability (values closer to 50% have higher variability). However, we can usually hold confidence and power to typical levels of 90% and 80% respectively and pick a reasonable range for conversion rates, say around 5%. Then we can just vary the difference between A and B and see what sample size we'd need to be able to detect that difference as statistically significant.  The table below shows the sample size needed to detect differences as small as 0.1% (over half a million in each group) or as large as 50% (just 11 in each group).

    Sample Size

    Difference    Each Group    Total        A     B
    0.1%          592,905       1,185,810    5%    5.1%
    0.5%          24,604        49,208       5%    5.5%
    1.0%          6,428         12,856       5%    6.0%
    5.0%          344           688          5%    10.0%
    10.0%         112           224          5%    15.0%
    20.0%         40            80           5%    25.0%
    30.0%         23            46           5%    35.0%
    40.0%         15            30           5%    45.0%
    50.0%         11            22           5%    55.0%
    Table 1: Sample size needed to detect differences from .1% to 50%, assuming 90% confidence and 80% power and conversion rates hovering around 5%.

    One approach to sample size planning is to take the approximate "traffic" you expect on a website and split it so half receives treatment A and half receives treatment B. If you expect approximately 1000 pageviews a day, then you'd need to plan on testing for about 13 days. At that sample size, if there was a difference of 1 percentage point or larger (e.g. 5% vs 6%) then that difference would be statistically significant [see the row in Table 1 that starts with 1.0%].

    If you want to determine if your new application has at least a 20% higher completion rate than the older application, then you should plan on testing 80 people (40 in each group). 
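
    If you'd rather compute these numbers than read them off a table, the standard normal-approximation formula for comparing two proportions gets you very close to Table 1 (the small discrepancies are likely due to rounding or the particular adjustment used to build the table). Treat this as a sketch under that assumption, not the exact procedure behind Table 1.

```python
from math import ceil
from scipy.stats import norm

def sample_size_per_group(p1, p2, confidence=0.90, power=0.80):
    """Approximate sample size per group to detect the difference between
    two conversion rates p1 and p2 with a two-sided test."""
    z_alpha = norm.ppf(1 - (1 - confidence) / 2)   # 1.645 for 90% confidence
    z_beta = norm.ppf(power)                       # 0.842 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

n = sample_size_per_group(0.05, 0.06)     # about 6,424 per group (Table 1: 6,428)
print(n, 2 * n / 1000)                    # roughly 13 days at 1000 pageviews per day
print(sample_size_per_group(0.05, 0.10))  # about 341 per group (Table 1: 344)
```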

  8. Stopping Early: There is some controversy about stopping A/B tests early, rather than waiting for the predetermined sample size. The crux of the argument against peeking and stopping is that you're inflating the chance of getting a false positive: saying there's a difference between A and B when one doesn't really exist. This is related to a problem called alpha inflation, which we address in Chapter 9 of Quantifying the User Experience (a short simulation at the end of this point illustrates it).  For example, suppose you plan on testing for 13 days to achieve the total sample size of about 13,000, but after four days you check your numbers and see a statistically significant difference between conversion rates of 2% (40 out of 2000) and 3% (60 out of 2000). Do you stop or keep going?

    If you are publishing a paper you should probably wait for the full 13 days, especially if you have a grant, need to spend the funds and want to keep picky reviewers off your back. If you want to make a decision about which is the better version, you should almost certainly stop and go with treatment B. There is merit to the argument that multiple tests will inflate your significance level and lead to more false positives; however, a false positive in this case means saying there's a difference when one doesn't exist.

    In other words, A and B might just be the same, so going with either one would be fine, and it would be better to spend your efforts on another test!  In fact, given these results, it's highly improbable (less than a 3% chance) that A is BETTER than B. Even using very conservative adjustments to the p-value to account for alpha inflation (not that I recommend that), B will still be the better choice. In applied research, the goal is usually picking the best alternative, not publishing papers.

    In most cases you will do no harm, and likely some good, by going with B, and by cutting your testing short you also reduce the opportunity cost of delaying the better element in your design. Only when the stakes are high, the cost of switching from A to B is high (say it involves a lot of technical implementation), the cost of an additional sample is low, and the opportunity cost of not deploying the better treatment is low should you keep testing for the full number of days.
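
    To see what the alpha inflation argument is about, here is a rough simulation sketch: A and B are identical (no true difference), yet peeking several times raises the chance of declaring a "significant" winner at some point above the nominal 5%. The peek schedule and traffic numbers are made up for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

def false_positive_rate_with_peeking(p=0.05, n_per_day=500, peeks=(4, 8, 13),
                                     alpha=0.05, sims=2000):
    """Simulate identical A and B and count how often at least one peek
    shows a 'significant' two-sided result."""
    total_days = max(peeks)
    hits = 0
    for _ in range(sims):
        a = rng.binomial(1, p, n_per_day * total_days)
        b = rng.binomial(1, p, n_per_day * total_days)
        for day in peeks:
            n = n_per_day * day
            x1, x2 = a[:n].sum(), b[:n].sum()
            pooled = (x1 + x2) / (2 * n)
            se = np.sqrt(pooled * (1 - pooled) * 2 / n)
            z = abs(x1 / n - x2 / n) / se
            if 2 * norm.sf(z) < alpha:
                hits += 1
                break
    return hits / sims

print(false_positive_rate_with_peeking())   # noticeably above the nominal 0.05
```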

  9. One Variable at a Time is Simple but Limited: The simplicity of A/B testing is also its weakness. While you can vary things like headlines, pictures and button colors one at a time, you miss out on testing all combinations of these variables. Multivariate analysis (also referred to as Full and Partial Factorials) allows you to understand which combinations of variables, tested simultaneously, generate the highest conversion rate. This is not a reason to exclude A/B testing, but rather to understand that while you are making improvements, you could be making MORE improvements with multivariate testing.

  10. Statistical Significance Does Not Mean Practical Significance: With a lot of the focus on chance, statistical significance, optimal sample sizes and alpha inflation, it's easy to get distracted and lose sight of the real reason for A/B testing: making real and noticeable improvements in interfaces. As you increase your sample size, the chance you will find differences between treatments statistically significant increases. Table 1 shows that with tens of thousands of users in each group, differences of less than a percentage point are statistically significant. For high-transaction websites, a difference this small could translate into thousands or millions of dollars more.

    In many cases, though, small differences may go unnoticed and have little effect. So just because there is a statistical difference between treatments doesn't mean it's important. One way to qualify the impact of a statistically significant difference is to use a confidence interval around the difference, as is done in the Stats Usability Package.

    For example, with the observed difference of 1% between treatment A at 2% and treatment B at 3%, we can be 90% confident the difference, if exposed to the entire user population, would fall between 0.2% and 1.8%.  Depending on the context, even a 0.2% improvement might be meaningful to sales or leads. Or an improvement of at most 1.8% might not be worth the cost of implementing the change at all. Context dictates whether a statistical difference is of practical importance, but the confidence interval provides the boundaries on the most plausible range of that difference.
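
    Here is a minimal sketch of that interval using the counts from point 8 (40/2000 vs. 60/2000) and a simple unadjusted Wald interval; the Stats Usability Package may use a slightly different adjustment, so treat this as an approximation.

```python
from math import sqrt
from scipy.stats import norm

def wald_diff_ci(x1, n1, x2, n2, confidence=0.90):
    """Simple Wald confidence interval around the difference between
    two conversion rates (B minus A)."""
    p1, p2 = x1 / n1, x2 / n2
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z = norm.ppf(1 - (1 - confidence) / 2)
    diff = p2 - p1
    return diff - z * se, diff + z * se

# A = 40/2000 (2%), B = 60/2000 (3%)
low, high = wald_diff_ci(40, 2000, 60, 2000)
print(round(low, 3), round(high, 3))   # about 0.002 to 0.018, i.e. 0.2% to 1.8%
```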



About Jeff Sauro

Jeff Sauro is the founding principal of Measuring Usability LLC, a company providing statistics and usability consulting to Fortune 1000 companies.
He is the author of over 20 journal articles and 4 books on statistics and the user experience.




Posted Comments


October 17, 2013 | Amandine D wrote:

Great article, I should definitely work on points 4 and 7... For sure you can make more improvements with multivariate testing than with A/B testing, but for newbies like me A/B testing is a good start! Let me share an article on the 7 steps to successful testing that is complementary to yours and may interest your readers: http://www.atinsight.co.uk/actualite/test-procedure-7-steps-to-successful-testing/


September 4, 2013 | Dennis wrote:

I would use an online A/B Test Calculator http://www.convert.com/tools/ab-split-multivariate-test-duration-visitor-calculator/ to get started


December 6, 2012 | Jess Joseph wrote:

You say that "More recently, split testing gained traction in direct mail marketing." In fact a/b split testing has been going on for at least 40 years, and probably more in magazines and newspapers as well as in direct mail. There are very few concepts used in testing online that were not used in direct response marketing before the advent of the internet.  


December 3, 2012 | Thomas Maas wrote:

Thanks for the article, Jeff.

Instead of saying "You should only use the 1-sided p-value when you have a very strong reason to suspect that one version is really superior to the other" and reading the parts of the book you refer to, I would tend to phrase it as:

"You should definitely use the 1-sided p-value when you are only interested in improvements". This would be true in most business contexts. Who cares if my alternative checkout form performs worse than the current checkout form. If it's not better, it means we have work to do. 



Jeff's Books

Quantifying the User Experience: Practical Statistics for User Research
The most comprehensive statistical resource for UX Professionals

Excel & R Companion to Quantifying the User Experience
Detailed Steps to Solve over 100 Examples and Exercises in the Excel Calculator and R

A Practical Guide to the System Usability Scale
Background, Benchmarks & Best Practices for the most popular usability questionnaire

A Practical Guide to Measuring Usability
72 Answers to the Most Common Questions about Quantifying the Usability of Websites and Software