Jeff Sauro • July 21, 2010

Wondering about the origins of the sample size controversy in the usability profession? Here is an annotated timeline of the major events and papers which continue to shape this topic.

1981: Alphonse Chapanis and colleagues suggest that observing about five to six users reveals most of the problems in a usability test.

1982: Wanting a more precise sample size estimate than 5-6, Jim Lewis publishes the first paper describing how the binomial distribution can be used to model the sample size needed to find usability problems. The model gives the probability of discovering a problem that occurs with probability "p," for a given set of tasks and user population, given a sample size "n."
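As a rough illustration of the binomial model described above, here is a minimal Python sketch. The function name and the example values (p = .25, n = 6) are illustrative assumptions, not figures from Lewis's paper.

```python
def discovery_probability(p: float, n: int) -> float:
    """Chance that a problem affecting a proportion p of users
    is seen at least once when n users are observed."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    # (1 - p) ** n is the chance that all n users miss the problem.
    return 1.0 - (1.0 - p) ** n


# Illustrative values only: a problem hitting 25% of users, six users tested.
print(round(discovery_probability(0.25, 6), 3))  # 0.822
```

The whole model is the complement of "nobody ran into it," which is why the chance of discovery rises quickly for frequent problems and slowly for rare ones.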


1990: Robert Virzi details three experiments at the HFES conference replicating earlier work from Nielsen. His paper explicitly uses 1-(1-p)^n and reports three findings:

- Additional subjects are less and less likely to reveal new information
- The first 4-5 users find 80% of problems in a usability test (avg. p of .32)
- Severe problems are more likely to be detected by the first few users.

1991: Wright and Monk also show how to use 1-(1-p)^n.

1993: Jakob Nielsen and Tom Landauer, in a separate set of eleven studies, find that a single user or heuristic evaluator on average finds 31% of problems. Using the Poisson distribution, they also arrive at the formula 1-(1-p)^n.

1994: Jim Lewis was sitting in the audience of Robert Virzi's 1990 HFES talk and wondered how severity and frequency could be associated. His 1994 paper confirmed Virzi's first finding (the first few users find most of the problems) and partially confirmed the second (his p was .16, not .32). His data did not show that severity and frequency are associated. It could be that more severe problems are easier to detect, or it could be that it is very difficult to assign severity without being biased by frequency. There has been little published on this topic since then.
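The gap between Virzi's average p of .32 and Lewis's .16 matters more than it might look. A small sketch, assuming the same 1-(1-p)^n formula applied to each average (the function and variable names are hypothetical):

```python
def discovery_curve(p: float, max_users: int) -> list[float]:
    """Expected proportion of problems seen after 1..max_users users,
    using 1-(1-p)^n with a single average problem frequency p."""
    return [1.0 - (1.0 - p) ** n for n in range(1, max_users + 1)]


virzi = discovery_curve(0.32, 10)  # average p reported by Virzi
lewis = discovery_curve(0.16, 10)  # average p reported by Lewis

for n, (v, l) in enumerate(zip(virzi, lewis), start=1):
    print(f"{n:>2} users: p=.32 -> {v:.0%}   p=.16 -> {l:.0%}")
```

With p = .32, five users reach about 85% of problems. With p = .16, five users reach about 58%, and it takes ten users to pass 80%.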


2000: Nielsen publishes the widely cited web-article: "Why you only need to test with five users", which summarizes the past decade's research. Its graph comes to be known as the "parabola of optimism."

2001: Jared Spool & Will Schroeder show that serious problems were still being discovered even after dozens of users (disagreeing with Virzi's but agreeing with Lewis's findings). This was later reiterated by Perfetti and Landesman. Unlike most studies, these authors used open-ended tasks, allowing users to freely browse up to four websites looking for unique CDs.

2001: Caulton argues that different types of users will find different problems and suggests including an additional parameter for the number of sub-groups of users.

2001: Hertzum and Jacobsen caution that an average problem frequency estimated from the first few users will be inflated.

2001: Lewis provides a correction for estimating the average problem occurrence from the first 2-4 users.

2001: Woolrych and Cockton argue that problems don't uniformly affect users, so a simple estimate of problem frequency (p) is misleading. Instead, they state that a new model is needed to account for the distribution of problem frequency.

2002: Carl Turner, Jim Lewis and Jakob Nielsen respond to criticisms of 1-(1-p)^n.

2003: Laura Faulkner also shows variability in users encountering problems. While on average five users found 85% of problems in her study, some combinations found as few as 55% or as many as 99%.

2003: Dennis Wixon argued the discussion about how many users are needed to find problems is mostly irrelevant and the emphasis should be on fixing problems (RITE method).

2003: A CHI Panel with many of the usual suspects defends and debates the legitimacy of the "Magic Number 5."


2006: Jim Lewis provides a detailed history of how we find sample sizes using "mostly math, not magic." It includes an explanation of how Spool and Schroeder's results can be explained by estimating the value of p for their study and putting that value into 1-(1-p)^n.

2007: Gitte Lindgaard and Jarinee Chattratichart, using CUE-4 data, remind us that if you change the tasks you'll find different problems.

2008: In response to calls for a better statistical model, Martin Schmettow proposes the beta-binomial to account for the variability in problem frequency, but with limited success.
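To see why variability in problem frequency matters, here is a hedged sketch of the general beta-binomial idea. It is not Schmettow's actual model: it simply assumes problem frequencies follow a Beta(a, b) distribution, and the function names and parameter values are hypothetical.

```python
from math import exp, lgamma


def log_beta(a: float, b: float) -> float:
    """Natural log of the Beta function B(a, b)."""
    return lgamma(a) + lgamma(b) - lgamma(a + b)


def expected_discovery(a: float, b: float, n: int) -> float:
    """Expected proportion of problems seen by n users when each
    problem's frequency p is drawn from Beta(a, b) (an assumption)."""
    # E[(1 - p)^n] for p ~ Beta(a, b) equals B(a, b + n) / B(a, b).
    return 1.0 - exp(log_beta(a, b + n) - log_beta(a, b))


# Hypothetical case: frequencies spread uniformly between 0 and 1 (mean .5).
spread = expected_discovery(1.0, 1.0, 5)
# Same mean, but every problem has exactly p = .5 (the simple formula).
uniform_p = 1.0 - (1.0 - 0.5) ** 5

print(f"varying p: {spread:.3f}, single p: {uniform_p:.3f}")
```

With the same average frequency, five users are expected to find about 83% of problems when frequencies vary, against about 97% when every problem shares one p. The rarer problems in the mix drag discovery down, which is the concern Woolrych and Cockton raised.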

2010: I wrote an article visually showing that the math in the binomial predicts sample sizes fine; the problem is in how it's often misinterpreted. The article reiterates the important caveats made over the past decades about the magic number 5:

- You won't know if you've seen 85% of ALL problems, just 85% of the more obvious problems (the ones that affect 31% or more of users).
- The sample size formula only applies when you test users from the same population performing the same tasks on the same applications.
- As a strategy, don't try to guess the average problem frequency. Instead, choose a minimum problem frequency you want to detect (p) and the binomial will tell you how many users you need to observe to have a good chance of detecting problems with at least that probability of occurrence.
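The last caveat can be turned into a small calculation by solving 1-(1-p)^n for n. This is a minimal sketch: the function name and the default 85% goal are assumptions chosen for illustration.

```python
from math import ceil, log


def users_needed(min_p: float, goal: float = 0.85) -> int:
    """Smallest n such that a problem affecting at least min_p of users
    has at least a `goal` chance of being seen once: 1-(1-min_p)^n >= goal."""
    if not 0.0 < min_p < 1.0:
        raise ValueError("min_p must be strictly between 0 and 1")
    if not 0.0 < goal < 1.0:
        raise ValueError("goal must be strictly between 0 and 1")
    # Rearranging 1-(1-p)^n >= goal gives n >= log(1-goal) / log(1-p).
    return ceil(log(1.0 - goal) / log(1.0 - min_p))


for p in (0.50, 0.31, 0.05):
    print(f"problems affecting {p:.0%} of users: {users_needed(p)} users")
```

Note that five users give an 84.4% chance of seeing a problem that affects 31% of users, so a strict 85% goal returns 6. Problems affecting only 5% of users need 37 users for the same chance.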

- Al-Awar, J., Chapanis, A., and Ford, R. (1981). Tutorials for the first-time computer user. IEEE Transactions on Professional Communication, 24, 30-37.
- Lewis, J. R. (1982). Testing Small System Customer Setup. in Proceedings of the Human Factors Society 26th Annual Meeting (pp. 718-720). Santa Monica, CA: HFES. [pdf]

The Cambrian Explosion (1990-1994)

- Virzi, R. A. (1990). Streamlining the design process: running fewer subjects. Proceedings of the Human Factors Society 34th Annual Meeting (pp. 291-294). Santa Monica, CA: HFES.
- Wright, P. C., and Monk, A. F. (1991). A cost-effective evaluation method for use by designers. International Journal of Man-Machine Studies, 35, 891-912.
- Virzi, R. A. (1992). Refining the test phase of usability evaluation: How many subjects is enough? Human Factors, 34, 457-471.
- Nielsen, J., & Landauer, T. K. (1993). A mathematical model of the finding of usability problems. In Proceedings of the SIGCHI conference on Human factors in computing systems (pp.206-213). Amsterdam: ACM.
- Lewis, J. R. (1993). Problem discovery in usability studies: A model based on the binomial probability formula. In Proceedings of the Fifth International Conference on Human-Computer Interaction (pp. 666-671). Orlando, FL: Elsevier. [pdf]
- Lewis, J. R. (1994). Sample sizes for usability studies: Additional considerations. Human Factors, 36, 368-378.[pdf]

Rebellion (2000-2005)

- Caulton, D. A. (2001). Relaxing the homogeneity assumption in usability testing. Behaviour & Information Technology, 20, 1-7. [pdf]
- Spool J., & Schroeder W. (2001). Testing web sites: five users is nowhere near enough, CHI '01 extended abstracts on Human factors in computing systems, March 31-April 05, Seattle, Washington. [pdf]
- Perfetti, C., & Landesman, L. (2001). Eight is not enough. Retrieved July 15, 2010 from
- Turner, C. W., Lewis, J. R., & Nielsen, J. (2002). UPA Panel: How many users is enough? Determining usability test sample size
- Wixon, D. (2003). Evaluating usability methods: why the current literature fails the practitioner. interactions, 10(4), July + August.
- Lewis, J. R. (2001). Evaluation of procedures for adjusting problem-discovery rates estimated from small samples. International Journal of Human-Computer Interaction, 13, 445-479. [pdf]
- Hertzum, M. & Jacobsen, N. J. (2003 – corrected version, original published in 2001). The evaluator effect: A chilling fact about usability evaluation methods. International Journal of Human-Computer Interaction, 15, 183-204. [pdf]
- Woolrych, A. & Cockton, G., (2001), Why and when five test users aren't enough. In Vanderdonckt, J., Blandford, A. and Derycke A. (eds.) Proceedings of IHM-HCI 2001 Conference, Vol. 2 (Toulouse, France: Cépadèus Éditions), pp. 105-108. [pdf]
- Bevan, N., Barnum, C., Cockton, G., Nielsen, J., Spool, J., and Wixon, D. (2003). The "magic number 5": is it enough for web testing? In CHI '03 Extended Abstracts on Human Factors in Computing Systems (Ft. Lauderdale, Florida, USA, April 05 - 10, 2003). CHI '03. ACM, New York, NY, 698-699.

Clarifications (2006 – Present)

- Turner, C. W., Lewis, J. R., & Nielsen, J. (2006). Determining usability test sample size. In W. Karwowski (ed.), International Encyclopedia of Ergonomics and Human Factors (pp. 3084-3088). Boca Raton, FL: CRC Press. [pdf]
- Lewis, J. R. (2006). Sample sizes for usability tests: mostly math, not magic. interactions 13, 6 (Nov. 2006), 29-33.
- Lindgaard, G., & Chattratichart, J. (2007). Usability testing: what have we overlooked?. In Proceedings of the SIGCHI Conference on Human Factors in Computing Systems (San Jose, California, USA, April 28 - May 03, 2007). CHI '07. ACM, New York, NY, 1415-1424. [pdf]
- Schmettow, M. (2008), "Heterogeneity in the Usability Evaluation Process," in Proceedings of the 22nd British HCI Group Annual Conference on HCI 2008: People and Computers XXII: Culture, Creativity, Interaction - Volume 1, ACM, Liverpool, UK, pp. 89-98. [pdf]
- Sauro, J. (2010). Why you only need to test with five users (explained). Retrieved July 15, 2010.
