
Why you only need to test with five users (explained)

Jeff Sauro • March 8, 2010

One question I get a lot is, "Do you really only need to test with 5 users?" There are a lot of strong opinions about the magic number 5 in usability testing, and much has been written about it (e.g., see Lewis 2006). As you can imagine, there isn't a fixed number of users that will always be the right number (we quantitative folks love to say that), but testing with five users may be all you need for discovering problems in an interface, given some conditions.

The five-user number comes from the number of users you would need to detect approximately 85% of the problems in an interface, given that the probability a user will encounter a problem is about 31%. Most people either leave off that last condition or are not sure what it means. It does not apply to all testing situations, such as comparing two products or getting a precise measure of task times or completion rates; it applies to discovering problems with an interface. Where does 31% come from? It was found as an average problem frequency across several studies (more on this below).

Five users applies only to discovering problems, not to comparing interfaces or estimating a task time or completion rate.
For example, the calendar on Hertz.com has a problem with the dates. Let's imagine that this will adversely affect 31% of reservations, which is quite a lot. So the question becomes: if a problem occurs this frequently (affects this many users), how many users do you need to observe to have an 85% chance of seeing it during a usability test? You might be tempted to think you only need 3 to see it once, but chance fluctuations don't quite work that way at small sample sizes. You actually need 5, and that number comes from the binomial probability formula.
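If you want to check that arithmetic yourself, the chance of seeing a problem at least once across n users is 1 - (1 - p)^n, where p is how often the problem occurs. Here's a quick Python sketch of the 31% example (just an illustration of the formula, nothing more):

# Python: chance of seeing a problem at least once across n users
p = 0.31                      # probability any one user encounters the problem
for n in range(1, 7):
    at_least_once = 1 - (1 - p) ** n
    print(n, "users:", round(at_least_once, 3))
# 5 users -> about 0.84, i.e. roughly an 85% chance of seeing the problem at least once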

The formulas actually work quite well, but math tends to bring back bad memories for many people so I've provided some simulations below to show you how it works.

Tossing a Coin (50% Probability)

We all know that there is a 50% chance of getting tails and a 50% chance of getting heads when flipping a coin. If you want to know how many times you should plan on flipping a coin to be 85% sure of seeing tails at least once, the binomial formula says the answer is 3. You can see this for yourself: click the "Flip 1 Coin" button until you see tails. 85% of the time you'll need to click it no more than three times. You can repeat this exercise and see how many of your samples take more than 3 flips. Over time, a bit more than 85% of all your samples will be 3 or fewer.

Q: How many times do you need to toss a coin to be 85% sure you'll see tails at least once?
A: 3 or fewer
On every button click there is a 50% chance you'll see tails.

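If you don't have the interactive buttons in front of you, a short simulation shows the same pattern: flip a virtual coin until tails appears, record how many flips it took, and repeat many times. Here's a minimal Python sketch (the 10,000 repetitions are an arbitrary choice):

# Python: simulate flipping a coin until tails appears, many times over
import random

def flips_until_success(p):
    # number of tries until one "success" of probability p shows up
    n = 0
    while True:
        n += 1
        if random.random() < p:
            return n

samples = [flips_until_success(0.5) for _ in range(10000)]
share_within_three = sum(1 for s in samples if s <= 3) / len(samples)
print("Share of samples needing 3 or fewer flips:", share_within_three)
# expect a value near 0.875, i.e. a bit more than 85%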

Rolling a Die (16.7% Probability)

There is a 1/6 chance of getting any particular number from a 6-sided die, so on any toss there is a 16.667% chance of getting a one. The binomial formula predicts that you'd need to plan on about 10 tosses to be approximately 85% sure you'll see a one.

Q: How many times do you need to toss a die to be 85% sure you'll see a one at least once?
A: 10 or fewer
On every button click there is a 16.667% chance you'll see a one.

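You can also get the exact numbers from the formula rather than clicking. Here's a small Python sketch for the die; note that 10 tosses gets you to roughly 84%, which is what's being rounded to "approximately 85%" here, and 11 tosses clears 85% outright:

# Python: exact chance of seeing at least one "1" in n tosses of a fair die
p = 1.0 / 6.0
for n in (5, 8, 10, 11, 12):
    print(n, "tosses:", round(1 - (1 - p) ** n, 3))
# 10 tosses -> about 0.84, 11 tosses -> about 0.87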


UI Problems

Now I have three UI problems that occur 31%, 10%, and 1% of the time. Every time you click "Test 1 User" it's like testing a user (but without the expense!).

Detecting Problems that affect 31% of users

Q: How many users do you have to test to be 85% sure you'll see a problem that affects 31% of users at least once?
A: 5 or fewer
On every button click there is a 31% chance you'll see a UI Problem.


Detecting Problems that affect 10% of users

Q: How many users do you have to test to be 85% sure you'll see a problem that affects 10% of users at least once?
A: 18 or fewer
On every button click there is a 10% chance you'll see a UI Problem.


Detecting Problems that affect 1% of users

Q: How many users do you have to test to be 85% sure you'll see a problem that affects 1% of users at least once?
A: 189 or fewer
On every button click there is a 1% chance you'll see a UI Problem.

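The same formula reproduces the three answers above. Here's a small Python check of the detection chance at each quoted sample size (assuming, as in the rest of this article, that each problem frequency is a fixed per-user probability):

# Python: chance of seeing each UI problem at least once at the quoted sample size
for p, n in [(0.31, 5), (0.10, 18), (0.01, 189)]:
    chance = 1 - (1 - p) ** n
    print("p = %.0f%%, n = %d users -> %.1f%% chance of seeing it" % (p * 100, n, chance * 100))
# each line comes out at roughly 84-85%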


How many users do I need, then?

So if you plan on testing with five users, know that you're not likely to see most problems; you're just likely to see most problems that affect 31% to 100% of users for this population and set of tasks.
Of course, not all problems affect 31% of users. In fact, in released software or websites, the likelihood of encountering a problem might be closer to 10% or 5%. When problems are much less likely to be "detected" by users interacting with the software, you need to test more users to have a decent chance of observing them in a usability test. For example, given a problem affecting only 10% of users, you need to plan on testing 18 to have an 85% chance of seeing it once. I've graphed the differences in the figure below. The blue line shows problems affecting 10% of users and the red line shows problems affecting 31% of users.

 
Figure 1: Difference in sample sizes needed to have an 85% chance of detecting a problem that affects 10% of users vs. 31% of users. You would need to plan on testing 18 users to have an 85% chance of detecting problems that affect 10% of users.
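If you want to regenerate the kind of data behind a chart like this, the relationship is just the detection chance as the number of users grows, for the 10% and 31% cases. A rough Python sketch (a reconstruction of the underlying relationship, not the original plotting code):

# Python: detection chance by number of users, for the 10% and 31% problem frequencies
for n in range(1, 21):
    p10 = 1 - 0.90 ** n   # problem affecting 10% of users
    p31 = 1 - 0.69 ** n   # problem affecting 31% of users
    print("%2d users: 10%% problem %3.0f%%, 31%% problem %3.0f%%" % (n, p10 * 100, p31 * 100))
# the 31% curve reaches roughly 85% at 5 users; the 10% curve needs 18 users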

Why the controversy?

There's no controversy about the binomial formula (or its Poisson equivalent); the controversy is about how frequently UI problems really occur. In reality they aren't a fixed percentage like 31% or 10%; those figures represent an average problem frequency.
Problems don't uniformly affect users. 31% is an average frequency from many studies; for already-released applications the frequency is probably less than 10%.
Problems in fact do not uniformly affect users, or affect users in an easily predictable way. While it is difficult to know how frequently problems occur, as a general rule the frequency will be higher (31% or more) for early designs and likely below 10% for applications already in use by many users. Of course, you don't know ahead of time the probability that a user will encounter a problem. In fact, you often don't even know whether there is a problem; if you did, you'd fix it!

As a strategy, pick some percentage of problem occurrence, say 20%, and a likelihood of discovery, say 85%, which would mean you'd need to plan on testing 9 users. After testing 9 users, you'd know you've seen most of the problems that affect 20% or more of the users. If you need to be more sure of the findings, then increase the likelihood of discovery, for example, to 95%. Doing so would increase your required sample size to 13.
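To turn that strategy into a quick calculation, the required sample size is the smallest n for which 1 - (1 - p)^n meets your discovery goal. A small Python helper (a sketch; note that strict rounding gives 14 for the 95% case, while 13 users already gets you to about 94.5%, which is the rounding used above):

# Python: smallest sample size whose detection chance meets the discovery goal
import math

def users_needed(problem_freq, goal):
    return math.ceil(math.log(1 - goal) / math.log(1 - problem_freq))

print(users_needed(0.20, 0.85))   # 9 users for an 85% chance
print(users_needed(0.20, 0.95))   # 14 by strict rounding; 13 users already gives about 94.5%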

The best strategy is to bring in some set of users, find the problems they have, fix those problems, then bring in another set of users as part of an iterative design and test strategy. In the end, although you're never testing more than 5 users at a time, in total you might test 15 or 20 users. In fact, this is what Nielsen recommends in his article, not just testing 5 users in total.

So if you plan on testing with five users, know that you're not likely to see most problems; you are just likely to see most problems that affect 31% to 100% of users for this population and set of tasks. You will also pick up some of the problems that affect less than 31% of users, just not 85% of them. For example, a sample size of 5 should pick up about 50% of the problems with a likelihood of occurrence of 15%, about 75% of the problems with a likelihood of 25%, and so on. Change the tasks or the type of users and you'll need a new sample of users.


About Jeff Sauro

Jeff Sauro is the founding principal of Measuring Usability LLC, a company providing statistics and usability consulting to Fortune 1000 companies.
He is the author of over 20 journal articles and 4 books on statistics and the user experience.




Posted Comments

There are 52 Comments

September 12, 2011 | Kostiantyn Sokolinskyi wrote:

nice article. thanks for the explanation


November 8, 2010 | jasmyne scudder wrote:

it's ok 



April 27, 2010 | Rob Todd wrote:

Excellent explanation!


April 21, 2010 | John Sorflaten wrote:

Jeff, hi
I still wonder how Faulkner's analysis fits this article. Doesn't finding "85% of the problems" mean ON AVERAGE...over lots of testing events with 5 participants? Your web site and blogs are great. You must be getting a good following.
I'm working at www.saic.com as of Feb, this year. I'm with a large gov't client for a couple year project in Maryland, near Baltimore.
best
John


April 14, 2010 | Jennifer Romano wrote:

Wow. Tough crowd. Keep up the good work, Jeff. I sure appreciate it. 


April 9, 2010 | Jim Hall wrote:

This was exactly what I needed to help explain why my team only tests with an average of 10 users. 


April 2, 2010 | John Haugeland wrote:

Unfortunately, despite the fact that these lessons are clearly wrong, the author has chosen to leave them up anyway. Note that the author has hundreds of users on his blog, yet one user found more than a dozen bugs on the blog alone. (Indeed, several of the located bugs, including one of the ones in that video, several weeks later remain unresolved.)

What this author is measuring is the probability of finding one bug, not all bugs, with five users. If you're willing to accept finding one bug, by all means, follow this extremely poor and non-measured advice.


March 23, 2010 | jetm wrote:

Nice Articles 


March 22, 2010 | Thanassis wrote:

Jeff, if you don't use/know git, the link to the code is at github dot com slash ttsiodras slash binomialProbabilities 


March 22, 2010 | Thanassis wrote:

The code proving the error in "tossing a die" was messed up in the comments. I placed a copy in: git@github.com:ttsiodras/binomialProbabilities.git
The code has both the "theory" and the "experiment" (using Python's random module) to prove it. 


March 22, 2010 | Thanassis wrote:

Tossing a coin: "3 or fewer" => "3 or more".
Tossing a die: "10 or fewer" => "11 or more".

The "tossing a die" is wrong, it needs 11, not 10:

# Python code
def fact(n):
    return 1 if n <= 1 else n * fact(n-1)

def choose(n, k):
    return fact(n) / (fact(k) * fact(n-k))

p = 1./6.
for n in xrange(2, 15):
    print "The theory says...", sum([choose(n,i)*(p**i)*((1-p)**(n-i)) for i in xrange(1,n+1)])


March 19, 2010 | David Travis wrote:

This is a great article to give people an intuitive understanding of why you can test with 5 users. Thanks for writing it!

But this got me thinking… In most usability tests, participants carry out more than one task. And with multiple tasks, people are more likely to run into the problem (for example, let's say it's a problem with search, and every task involves reviewing the search results). Although the participant might not experience the problem on the first task, they may do if they carry out 6 tasks.

Doesn't the maths you outline above assume a one-shot scenario, rather than the multiple task scenario? 


March 18, 2010 | Jeff Sauro wrote:

Jason,

Thanks for your comment. I'm not familiar with the Pascal model. The derivations from the binomial, geometric, and Poisson all generate the same results, so I suspect this might be another way to arrive at the same answer.


March 18, 2010 | Jeff Sauro wrote:

To the anonymous post below, perhaps you could elaborate a bit on what you disagree with. 


March 18, 2010 | Take a probability course, take a stats course wrote:

You could always try to learn stats and probability before you tried to use concrete examples to convince those in your situation otherwise.

P-values are not used that way either. 


March 18, 2010 | Jason wrote:

As a stats minor you should be using the Pascal Model for the first two examples, not the Binomial... 


March 17, 2010 | John Haugeland wrote:

Hm, looks like you're fixing bugs on the fly, without scrubbing the bad data out of the database. Two of my old comments have quoting bugs presenting, though new ones don't, and you've still left an unknown number of fake ratings 10 in the database. 


March 17, 2010 | John Haugeland wrote:

Awesome, if I post a comment with a URL in it, the comment disappears without notice.

Third attempt at posting:

----------

Ok, so here are the two videos. The first shows the quoting bug, and the second shows the posting bug hitting twice in a row.

The encoder got seriously slow for 20 seconds at the beginning of the second video; that clears up. Sorry about that; I'm on a laptop.

http colon slash slash sc.tri-bit.com/outgoing/BrokenForm.wmv

http colon slash slash sc.tri-bit.com/outgoing/BrokenForm2.wmv 


March 17, 2010 | John Haugeland wrote:

Awesome, twice in a row this time. (Three really, but one wasn't on tape.) Note in the video that the name is correct, the email address is valid, and the math is right. 


March 17, 2010 | John Haugeland wrote:

Well, maybe I'll figure it out if I see source. Otherwise, meh, I'll just look stupid. 


March 17, 2010 | John Haugeland wrote:

Maybe it's about having a colon in place? Trying: I'll stop after. 


March 17, 2010 | John Haugeland wrote:

One last try, then I'll just look dumb. 


March 17, 2010 | John Haugeland wrote:

It only breaks sometimes. ' " ' " 


March 17, 2010 | John Haugeland wrote:

Testing the broken form on video, so that Jeff can see the bug. Please note that the name and email address will be the same each time, the number will be correct, and that apostrophes ( ' ) will come back incorrectly quoted. I haven't tried double quotes ( " ) yet. 


March 17, 2010 | John Haugeland wrote:

Arthur: good catch. Bet there's a lot of other stuff like that too. 


March 17, 2010 | Jeff Sauro wrote:

Arthur, yes, you're right, it's a temporary fix. Thanks for finding it though! 


March 17, 2010 | Arthur wrote:

Jeff Sauro wrote:

> I've since fixed the page width issue

Okay, Jeff, you didn't fix *that*, either. 


March 17, 2010 | Arthur wrote:

Okay, now I just have to test the "really wide comment" thing for myself. :)
 


March 17, 2010 | John Haugeland wrote:

Also, it looks like if you make a rating then post a comment, the blog is trying to re-make the rating also; it continues to protest that my rating is already in place when I post a comment.

If he's found 85% of defects with his however many more than five users, and in less than 10 minutes of site usage I've found nine defects, that suggests that if he's worked on this codebase for just a few weeks, there are tens of thousands of defects already solved.

The mystery math just doesn't hold up. 


March 17, 2010 | John Haugeland wrote:

Unfortunately, it also seems that Jeff hasn't cleared the vote-10s out of his database, meaning that the score is wildly distorted; I know because I just tried to vote 0 again, and it still thinks I already voted 10. 


March 17, 2010 | John Haugeland wrote:

Earl Franklin: it's frustrating when someone just recites things they've heard from false sources without actually checking the work. I found four more bugs in this site already, one a security defect, and I'm requesting access to the code so that I can show Jeff how well this five user principle actually works.

The fact of the matter is simple: Jeff has a lot more than five users, and there are a whole *bunch* of bugs about to be discovered, if he plays ball.

Also, the captcha adder appears to fail one time in three, give or take, in current Firefox.

The security defect is the most frustrating part of all of this. It makes clear how appropriate for this guy to be giving this kind of advice. 


March 17, 2010 | Jeff Sauro wrote:

Thank you to those who pointed out the coding bugs on the site. I've since fixed the page width issue and the 0 rating problem, so rest assured your 0 votes for this article are being recorded. 


March 17, 2010 | Logi Ragnarsson wrote:

(You want to delete or edit the post by AnonymousCoward below so it stops expanding the width of the page. Then you want to get some better code, which won't allow that to happen. I still don't know or sufficiently care what you wrote above.)


March 17, 2010 | Logi Ragnarsson wrote:

After not reading this post, since it's completely unusable, I noticed that it had an average rating of 8.48. I rated it at 0 since I couldn't read it without my eyes bleeding. However, I happened to glance at a comment saying that 0-votes weren't counted, so I re-voted at 1 instead, and I got this gem:

"You already rated this page a 10"

I really, really, really, hope that your content is better than your packaging. But I don't suppose I'll ever know. 


March 17, 2010 | Logi Ragnarsson wrote:

Usability test this. I have no idea what you wrote beyond the title, since the lines are about a meter wide and I just don't care enough.


March 17, 2010 | Ken Zutter wrote:

If usability is the topic, then why is this webpage about 50 million pixels wide? The content does not flow. I have to scroll to the right.
I cannot even see the submit button for this form.
LOL
FF 3.6 maximized on 1152 wide screen 


March 17, 2010 | AnonymousCoward wrote:

I ran the last test 1% for a while. It took 200 samples to reach 85%.
At 100 samples I had about 79% IIRC.

445, 73, 17, 107, 16, 514, 176, 54, 9, 164, 92, 447, 126, 127, 331, 11, 62, 190, 123, 229, 16, 88, 109, 58, 15, 76, 62, 116, 279, 26, 36, 120, 127, 128, 270, 110, 3, 51, 47, 46, 73, 47, 223, 136, 155, 90, 326, 135, 142, 95, 361, 18, 34, 237, 61, 31, 40, 14, 18, 181, 175, 12, 322, 245, 36, 45, 325, 12, 111, 229, 6, 3, 33, 52, 151, 11, 49, 121, 237, 199, 301, 153, 43, 35, 325, 2, 31, 42, 4, 6, 27, 158, 58, 179, 55, 119, 15, 138, 44, 261, 21, 26, 62, 49, 31, 13, 21, 54, 33, 17, 75, 115, 187, 181, 328, 26, 19, 32, 225, 390, 1, 117, 23, 216, 36, 66, 5, 138, 75, 59, 13, 29, 54, 298, 41, 32, 122, 4, 98, 26, 240, 78, 15, 26, 66, 54, 95, 77, 201, 33, 28, 78, 34, 168, 26, 64, 346, 84, 11, 10, 147, 76, 71, 434, 99, 47, 50, 120, 137, 47, 135, 39, 98, 91, 180, 280, 152, 148, 83, 82, 43, 93, 5, 55, 52, 57, 46, 8, 25, 60, 131, 78, 77, 32, 12, 31, 20, 7, 85, 50 


March 17, 2010 | Jeff Sauro wrote:

John,

Regarding your point about the 31%, that probability comes from the article by Nielsen and some of his papers (his article is linked to a couple times). Whether the probability a user will encounter a problem that frequently will depend on a number of things and you really don't know until you run the users. You would expect this high of a problem frequency early in a design and not typically on released software.

But the nice thing is, you don't need to know the probability ahead of time. When you test with only 5 users, you've most likely seen problems that affect this many users. It's both a caution and a reassurance. Many people have been using the magic number 5 as a guide and think they are discovering 85% of all problems when they run five users. Instead, my hope was to show that you're just going to see 85% of the more obvious problems that impact 31%-100% of users. And on a well-tested application, one should expect that number to be below 10% or 5%, meaning you're going to need more users.


March 17, 2010 | Tom wrote:

That comment below should have said "pointy-hairs", as in the pointy-haired boss from Dilbert, not "point hairs". 


March 17, 2010 | Tom wrote:

If it's detectable 31% of the time, it's going to be found long before user testing.

.31% (0.0031) might be a more reasonable number for code that has undergone any testing before user testing.

And, of course, the probability is a function of users/unit of time.

The sad thing is some point hairs are going to read the article, miss the qualifications, and this comment, and think they now know how much testing they need (5 users, for any period of time). 


March 17, 2010 | Earl Franklin wrote:

Probability John Haugeland woke up on the wrong side of the bed today: 100% 


March 17, 2010 | John Haugeland wrote:

Just to see how the blog would react, I tried to vote your article a 1 a second time.

"You already rated this page a 10."

And of course, it's also stripping whitespace out of comments.

Clearly, you're to whom to go for software quality advice.

Did you even bother to test your platform? 


March 17, 2010 | John Haugeland wrote:

Also, your blog is discarding 0 ratings on articles; I had to re-vote as a 1 before the vote count would go up, or the rating change (which says a lot about your readiness to talk about finding defects, and the validity of your current article ratings). 


March 17, 2010 | John Haugeland wrote:

> The five user number comes from the number of users you would need to detect approximately 85% of the problems in an interface, given that the probability a user would encounter a problem is about 31%.

Well, if you're satisfied with 85 percent, or if you actually take this unsourced 31 percent number seriously, then this is probably enough to get you to believe.


March 17, 2010 | Lou Rawls wrote:

Oh wow, thats incredible.

Lou 


March 17, 2010 | Richard Metzler wrote:

How do you usability test your website? Is there any script in the background that measures where my cursor goes and where I try to click?

I'm just wondering, because I just tried to click on your heading to jump to your home page, but that did not work until I realized you only linked your logo and not your heading to the main page. Adding this would increase usability.

Thanks for the article- found it really useful. 


March 17, 2010 | polat alemdar wrote:

testing a user is like what? rolling dice? rolling 20 number dice? toss a coin? that is not mentioned. The axiomatic condition is not explained. 


March 16, 2010 | Bill Wun wrote:

@ton bil:

Results: 88% of 300 samples found a tails in less than 3 tosses 


March 15, 2010 | Ton Bil wrote:

At the first test I'm pretty stable in the range 89 - 91 % after 50 samples (I went up to 250). 


March 9, 2010 | Doug Baker wrote:

This is a response to murph.

Based on Jeff's recommended strategy above, I'd say that you have to figure out for your projects what threshold constitutes an outlier. Is it below 20%? 15%? 10%? If you want to discover all issues that are not outliers, then you use the binomial formula as Jeff describes and test 9, 11, or however many participants to discover those issues.

If you decide that an outlier is below a 10% population, and then run a test with 5 participants, then Jeff says that all problems will affect at least a third (roughly) of your user population. That means that the issues you discover are not outliers.

Now, if you are concerned about a participant being an outlier, there is always a risk of that. The way that risk is mitigated is through a careful recruitment/screening process. This is important no matter how many participants one chooses to test, but is even more important for a small sample size. 


March 9, 2010 | murph wrote:

Discovery of an issue is one thing - establishing that it is not the result of an outlier is another thing entirely.

A client who must spend substantial development hours to correct an issue is going to want to know how likely this issue will occur across the user population.

It's certainly easier to rationalize away an issue if there is no solid basis for determining the return on investment.

