Jeff Sauro • September 17, 2004

One of the biggest, and usually first, concerns levied against any statistical measure of usability is that the number of users required to obtain "statistically significant" data is prohibitive. People reason that one cannot, with any reasonable level of confidence, employ quantitative methods to determine product usability. The reasoning continues something like this:

"I have a population of 2500 software users. I plug my population (2500) into a sample size calculator and it tells me I need a sample of almost 500 to obtain statistically significant results. The time spent recruiting, testing and compiling such a large sample would take months or years and cost tens or hundreds of thousands of dollars!"

The result is usually some qualitative testing done with 5-8 users, and the recipient of the report inevitably asks: "How can we make such strong conclusions with so few users?" At best the research by Nielsen and Landauer gets mentioned; at worst, both the test administrators and recipients walk away with very little confidence in the testing conclusions. Spool and Schroeder (2001) and Woolrych & Cockton (2001) both showed that relying on only five users was problematic in certain contexts.

We usability professionals report our findings with such enthusiasm and confidence, yet wish our conclusions were supported with equal confidence by widely accepted statistical methods. The solution is to transform our thinking about what's being analyzed.

Those nifty ubiquitous sample size calculators that you can find all over the Web are deriving their acceptable sample size based on your unit of analysis. If your unit of analysis is a person's binary opinion (agree, disagree) to one question then your sample size will be the number of people. Most usability literature also assumes that people are the unit of measurement. The unit of analysis in usability testing however is the

Step 1: Measure Tasks and Users--Not Just Users

The users aren't the unit of analysis--the tasks are. Don't think of the number of users as your opportunity for a defect or your unit of analysis. Yes, we're usability professionals, and we... Assuming the sample of users we're testing is somewhat representative of the total user population, we should also assume their observed errors and ability level are indicative of how users will actually behave outside the lab setting. Look at your users as a representative sample of "operators" of the product and the tasks you have them perform as representative of real-life tasks. The product's usability can then be described as the sum of all the usability defects those operators experience. To operationalize this, the opportunities for a defect are the number of users multiplied by the number of tasks.

Defect Opportunity = (# of users) * (# of tasks)

For example, think of the last time you tried a new product and then told someone about your experience. Chances are, you mentioned something about its ease of use (or lack of). When you described how easy something was to use, your comments were influenced by items such as:

- How long a task took to complete based on your expectations
- Whether you were able to complete a task in one try or at all
- How many errors you encountered and whether you were able to recover from them

In order to properly measure usability, each of these aspects should be measured for a set of core tasks, say the ten most common tasks performed, or those where the users spend 80% of their task time. Then the number of users multiplied by the number of core tasks gives you the sample behind a quantitative measure of usability for that product (insofar as those tasks represent the product well). So if you had ten core tasks and twelve users, you'd have a sample of 120 observations to analyze.

10 x 12 = 120 sample opportunities for a defect

This sample would provide more variability, and hence a better representation of your total universe of tasks, than looking only at a sample of 12. If your product had 2500 users, each performing roughly 30 total tasks (most of them infrequently), you'd have a universe of roughly

30 x 2500 = 75,000 total opportunities for a defect
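The arithmetic above can be written down as a small sketch. The function name `defect_opportunities` is an illustrative assumption, not something from a real tool; the numbers are the ones used in this article.

```python
def defect_opportunities(users: int, tasks: int) -> int:
    """Defect Opportunity = (# of users) * (# of tasks)."""
    return users * tasks


# Observations collected in the test: 12 users, 10 core tasks.
sample = defect_opportunities(users=12, tasks=10)

# Total universe: 2500 users, roughly 30 tasks each.
universe = defect_opportunities(users=2500, tasks=30)

print(f"sample opportunities:   {sample}")
print(f"universe opportunities: {universe}")
print(f"share of universe observed: {sample / universe:.2%}")
```

The last line only shows how small a slice of the universe the test observes; whether that slice is enough depends on the sample size reasoning in Step 2.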

Of course not all of those 2500 users use the software to the same extent. Many subcategories exist, such as novice, expert, administrator and other special classes of users, but for this illustration let's assume they all belong to the same category of user.

Step 2: Understanding Sample Size Calculations

Most sample size calculators, including this one, calculate the sample based on a binomial data set. When an opinion poll asks you to agree or disagree, and you're pretty certain the results won't be overwhelming in one direction (say, 95% agreeing or disagreeing with a statement), you assume the worst-case scenario for binary data: a 50/50 split in opinions. When it's such a close call, the chances of your sample not representing the total population increase. A larger sample is then required to narrow the confidence interval and therefore the chance of error.
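As a rough sketch of what such a calculator may be doing, the snippet below uses the textbook sample size formula for a proportion with a finite population correction. The 95% confidence level and the +/- 4 point margin are assumptions chosen for illustration, and the function name is hypothetical; a given online calculator may use different defaults or a different method.

```python
from math import ceil
from statistics import NormalDist


def sample_size(population: int, margin: float, p: float = 0.5,
                confidence: float = 0.95) -> int:
    """Sample size for estimating a proportion (normal approximation).

    p=0.5 is the worst case (50/50 split) described in the text.
    """
    z = NormalDist().inv_cdf(0.5 + confidence / 2)   # about 1.96 for 95%
    n0 = z ** 2 * p * (1 - p) / margin ** 2          # size for a very large population
    n = n0 / (1 + (n0 - 1) / population)             # finite population correction
    return ceil(n)


# A population of 2500 users, worst-case 50/50 split, assumed +/- 4 point margin.
print(sample_size(population=2500, margin=0.04))  # 485
```

Under these assumed settings the result lands just under 500 people, which is the kind of figure that makes a quantitative usability study look prohibitive when people are treated as the only unit of analysis.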

The other drawback of most online calculators is that they assume your sample size is in the hundreds, which is rarely the case in usability studies. That assumption means the calculations will be off (creating margins of error that are too wide) for smaller samples, and especially for smaller samples with completion rates closer to 100%. For better results use the interactive sample size calculator for completion rates, which was designed for small sample studies.
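To illustrate how a large-sample approximation can misbehave with few observations and a completion rate near 100%, here is a minimal sketch that compares the simple normal-approximation (Wald) interval with the Wilson score interval, one well-known alternative for small samples. This is an illustration only; it is not necessarily the method used by the calculator mentioned above, and the function names are assumptions.

```python
from math import sqrt
from statistics import NormalDist


def wald_interval(successes: int, n: int, confidence: float = 0.95):
    """Simple normal-approximation interval, clamped to [0, 1]."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    half = z * sqrt(p * (1 - p) / n)
    return max(0.0, p - half), min(1.0, p + half)


def wilson_interval(successes: int, n: int, confidence: float = 0.95):
    """Wilson score interval, which behaves better for small n and extreme p."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# 12 of 12 users complete a task.
print("Wald:  ", wald_interval(12, 12))
print("Wilson:", wilson_interval(12, 12))
```

With 12 of 12 completions the simple interval collapses to a single point at 100%, which claims a certainty no 12-user test can support, while the Wilson interval still allows the true completion rate to be as low as roughly 76%.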

The confidence interval is that + or - margin of error always shown next to the result of
those polls on TV. You know then that if pollsters have a larger
margin of error, say +/- 6, they polled a smaller sample size than
a similar poll with a +/- 4 margin of error. Of course when the
result of the poll is 51% and 49% a +/- 4 margin of error tells
you that it could go either way.
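The link between poll size and margin of error can be sketched with the usual normal-approximation formula. The 95% confidence level, the 50/50 split and the two poll sizes (267 and 600 respondents) are assumptions picked for illustration, not figures from any real poll.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(n: int, p: float = 0.5, confidence: float = 0.95) -> float:
    """Half-width of the confidence interval for a proportion."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return z * sqrt(p * (1 - p) / n)


# Two hypothetical polls of different sizes.
for respondents in (267, 600):
    points = margin_of_error(respondents) * 100
    print(f"{respondents} respondents: +/- {points:.1f} points")
```

Under these assumptions the smaller poll comes out near +/- 6 points and the larger one near +/- 4, so the margin quoted with a poll is a rough proxy for how many people were asked.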

Task completion is also a binary data set. If you're
testing a set of experienced users who by definition complete the
task weekly, then you should assume a 99% completion rate. For this
example, let's be more conservative and assume a 95% completion
or success rate. The proportion (p) is then .95. This lop-sided
result in binary data means you don't need a very large sample.
In fact, 12 users completing 10 tasks would be representative enough
to draw many conclusions. However, if one of those users doesn't
complete the task and it's not attributable to some testing environment
issue, then it's either a rare event, an unrepresentative user or
an unrepresentative task.
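To see why a lopsided proportion needs fewer observations, the sketch below evaluates the textbook formula n = z^2 * p(1-p) / margin^2 for a few assumed completion rates. The +/- 4 point margin and 95% confidence level are illustrative assumptions, the function name is hypothetical, and the normal approximation behind the formula is itself shaky for small samples with rates near 100%.

```python
from math import ceil
from statistics import NormalDist


def required_observations(p: float, margin: float = 0.04,
                          confidence: float = 0.95) -> int:
    """Observations needed to estimate proportion p within +/- margin."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    return ceil(z ** 2 * p * (1 - p) / margin ** 2)


# The variance term p(1-p) peaks at a 50/50 split and shrinks as p nears 100%.
for p in (0.50, 0.95, 0.99):
    print(f"p = {p:.2f}   p(1-p) = {p * (1 - p):.4f}   "
          f"n = {required_observations(p)}")
```

Under these assumptions the requirement falls from 601 observations at a 50/50 split to 115 at a 95% completion rate and 24 at 99%. The 115 figure is in the neighborhood of the 120 observations from 12 users and 10 tasks, but the formula treats every observation as independent, which task observations from the same user are not.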

Again, all this assumes your data set is binomial,
which is a discrete measurement. For a continuous measurement you
would calculate the sample size using a different
technique.

So while I just said to focus on the tasks and the users as an opportunity for a defect, there's one complication. Statistical computations don't work well when the measures are not independent. In this case there isn't independence, since a user's performance on a task is likely to tell us something about her performance on subsequent tasks. This means changing the task is like changing the application tested: it has a dramatic effect on what you observe, a point raised in the early 1980s by Jim Lewis and also recently described in a CHI paper by Lindgaard and Chattratichart.

Once you understand the importance the task plays in affecting the population, use the interactive sample size calculator for completion rates to see how the sample size affects your uncertainty about a completion rate. If you're just looking to find UI problems, use the Sample Size Calculator for Discovering Problems in a User Interface.

**Update Note: 6/10/2005**: The old defect opportunity used to say (# users)
* (# of task measurements)*(# tasks). I was a bit optimistic
in doing this. I removed the # of task measurements since those are
not independent measures.
