Calculating Sample Size for Task Completion (Discrete-Binary Method)
by Jeff Sauro | September 17, 2004
One of the biggest, and usually first, concerns levied against any statistical measure of usability is that the number of users required to obtain "statistically significant" data is prohibitive. People reason that one cannot, with any reasonable level of confidence, employ quantitative methods to determine product usability. The reasoning continues something like this:

"I have a population of 2500 software users. I plug my population (2500) into a sample size calculator and it tells me I need a sample of almost 500 to obtain statistically significant results. The time spent recruiting, testing and compiling such a large sample would take months or years and cost tens or hundreds of thousands of dollars!"
The result is usually some qualitative testing done with 5-8 users, and the recipient of the report inevitably asks: "How can we make such strong conclusions with so few users?" At best, the research by Nielsen and Landauer gets mentioned; at worst, both the test administrators and recipients walk away with very little confidence in the testing conclusions. Spool and Schroeder (2001) and Woolrych & Cockton (2001) both showed that relying on only five users was problematic in certain contexts.

We usability professionals report our findings with such enthusiasm and confidence, yet we wish our conclusions were supported with equal confidence by widely accepted statistical methods. The solution is to transform our thinking about what's being analyzed.

Those nifty, ubiquitous sample size calculators that you can find all over the Web derive their acceptable sample size from your unit of analysis. If your unit of analysis is a person's binary opinion (agree, disagree) on one question, then your sample size will be the number of people. Most usability literature also assumes that people are the unit of analysis. The unit of analysis in usability testing, however, is the observation of user performance per task.

Step 1: Measure Tasks and Users--Not Just Users

The users aren't the unit of analysis--the tasks are. Don't think of the number of users as your opportunity for a defect or your unit of analysis. Yes, we're usability professionals, and we should know our user, but that doesn't mean we want to evaluate users' software skill levels. Instead we want to measure the software.

Assuming the sample of users we're testing is somewhat representative of the total user population, we should also assume that their observed errors and ability levels are indicative of how users will actually behave outside the lab setting. Look at your users as a representative sample of "operators" of the product and the tasks you have them perform as representative of real-life tasks. The product's usability can then be described as the sum of all the usability defects those operators experience. To operationalize this, the opportunities for a defect are the number of users multiplied by the number of tasks.

Defect Opportunity = (# of users) * (# of tasks)

For example, think of the last time you tried a new product and then told someone about your experience. Chances are, you mentioned something about its ease of use (or lack of). When you described how easy something was to use, your comments were influenced by items such as:

ISO has attempted to narrow these fuzzy aspects into three major areas (effectiveness, efficiency and satisfaction) in its standard for defining and measuring usability, ISO 9241 part 11.

To properly measure usability, each of these aspects should be measured for a set of core tasks, say the ten most common tasks performed, or those where the users spend 80% of their task time. Then the number of users multiplied by the number of core tasks gives you the number of observations from which to derive a quantitative measure of usability for that product (insofar as those tasks represent the product well). So if you had ten core tasks and twelve users, you'd have a sample of 120 observations to analyze.

10 x 12 = 120 sample opportunities for a defect

This sample would provide more variability, and hence a better representation of your total universe of tasks, than looking only at a sample of 12. If your product had 2500 users, each performing roughly 30 total tasks (most of them infrequently), you'd have a universe of roughly

30 x 2500 = 75,000 total opportunities for a defect

Of course, not all of those 2500 users use the software to the same extent. Many subcategories exist, such as novice, expert, administrator and other special classes of users, but for this illustration let's assume they all belong to the same category of user.
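The defect-opportunity arithmetic from this step can be sketched in a few lines of Python. The function name and the printed ratio are illustrative choices, not something the article defines; the numbers are the ones used above.

```python
def defect_opportunities(num_users: int, num_tasks: int) -> int:
    """Defect Opportunity = (# of users) * (# of tasks)."""
    if num_users < 0 or num_tasks < 0:
        raise ValueError("counts must be non-negative")
    return num_users * num_tasks


# Lab sample: 12 users each attempting 10 core tasks.
sample = defect_opportunities(num_users=12, num_tasks=10)

# Universe: 2500 users each performing roughly 30 tasks.
universe = defect_opportunities(num_users=2500, num_tasks=30)

print(f"sample opportunities:   {sample}")
print(f"universe opportunities: {universe}")
print(f"fraction observed:      {sample / universe:.4f}")
```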

 

Step 2: Understanding Sample Size Calculations

Most sample size calculators, including this one, calculate the sample based on a binomial data set. In an opinion poll that asks you to agree or disagree, if you're pretty certain the results won't be overwhelming in one direction (say, 95% agreeing or disagreeing with a statement), you assume the worst-case scenario for binary data: a 50/50 split in opinions. When it's such a close call, the chances of your sample not representing the total population increase. A larger sample is then required to narrow the confidence interval and therefore reduce the chance of error.
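As a minimal sketch of what such a calculator typically does, the code below uses the textbook normal-approximation formula for a proportion with an optional finite population correction. The article does not say which formula or settings any particular calculator uses, so the 95% confidence level, the ±4% margin and the function name are all assumptions made for illustration. Under those assumptions a population of 2500 comes out at about 485, in line with the "almost 500" in the reasoning quoted earlier.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(margin, p=0.5, confidence=0.95, population=None):
    """Normal-approximation sample size for estimating a proportion.

    margin     : desired half-width of the interval (0.04 means +/- 4%)
    p          : expected proportion; 0.5 is the worst case
    population : if given, apply the finite population correction
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 at 95%
    n = z ** 2 * p * (1 - p) / margin ** 2
    if population is not None:
        n = n / (1 + (n - 1) / population)
    return math.ceil(n)


# Worst-case 50/50 split, +/- 4% margin (both assumed for illustration).
print(sample_size_for_proportion(0.04))                   # very large population
print(sample_size_for_proportion(0.04, population=2500))  # 2500 users
```

The product p(1 - p) peaks at p = 0.5, which is why the 50/50 split is the worst case: any other expected proportion yields a smaller required sample.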

The other drawback to most online calculators is that they assume your sample size is in the hundreds, which is rarely the case in usability studies. That assumption means the calculations will be off (creating margins of error that are too wide) for smaller samples, and especially for smaller samples with completion rates closer to 100%. For better results, use the interactive sample size calculator for completion rates, which was designed for small-sample studies.
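To illustrate the small-sample point, the sketch below compares the worst-case normal-approximation margin with a Wilson score interval for 11 of 12 users completing a task. The Wilson interval is used here only as one widely known small-sample method; the article does not state which method its calculator implements, and the 11-of-12 result is a made-up example.

```python
import math
from statistics import NormalDist


def z_value(confidence=0.95):
    """Two-sided critical value of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - (1 - confidence) / 2)


def worst_case_margin(n, confidence=0.95):
    """Half-width a generic calculator would report assuming p = 0.5."""
    return z_value(confidence) * math.sqrt(0.25 / n)


def wilson_interval(successes, n, confidence=0.95):
    """Wilson score interval for a binomial proportion."""
    z = z_value(confidence)
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


n = 12
low, high = wilson_interval(11, n)  # hypothetical: 11 of 12 users succeed
print(f"worst-case margin at n={n}: +/- {worst_case_margin(n):.3f}")
print(f"Wilson interval for 11/12:  {low:.3f} to {high:.3f}")
```

The Wilson interval is asymmetric and stays inside the 0 to 1 range, and for a completion rate this high it is noticeably narrower than twice the worst-case margin.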

The confidence interval is that + or - margin of error always shown next to the result of those polls on TV. You know then that if pollsters have a larger margin of error, say +/- 6, they polled a smaller sample size than a similar poll with a +/- 4 margin of error. Of course when the result of the poll is 51% and 49% a +/- 4 margin of error tells you that it could go either way.
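As a rough worked example of how the margin of error ties to sample size, assume a simple random sample, a 95% confidence level (z of about 1.96) and the worst-case 50/50 split. None of these specifics are given for the TV polls mentioned above, so the resulting numbers are illustrative only.

```latex
\[
e = z\sqrt{\frac{p(1-p)}{n}}
\quad\Longrightarrow\quad
n = \frac{z^{2}\,p(1-p)}{e^{2}}
\]
\[
e = 0.04:\; n = \frac{1.96^{2}(0.5)(0.5)}{0.04^{2}} \approx 600
\qquad
e = 0.06:\; n = \frac{1.96^{2}(0.5)(0.5)}{0.06^{2}} \approx 267
\]
```

Under these assumptions, tightening the margin from +/- 6 to +/- 4 takes more than twice as many respondents, because the required sample grows with the inverse square of the margin.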

Task completion is also a binary data set. If you're testing a set of experienced users who by definition complete the task weekly, then you should assume a 99% completion rate. For this example, let's be more conservative and assume a 95% completion or success rate. The proportion (p) is then .95. This lopsided result in binary data means you don't need a very large sample. In fact, 12 users completing 10 tasks would be representative enough to draw many conclusions. However, if one of those users doesn't complete the task and it's not attributable to some testing environment issue, then it's either a rare event, an unrepresentative user or an unrepresentative task.
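To see why a lopsided proportion needs fewer observations, the sketch below computes the normal-approximation margin of error for 120 observations at p = .95 and at the worst-case p = .5. It assumes a 95% confidence level and treats all 120 observations as independent, a simplification the article qualifies further on; the normal approximation is also rough this close to 100%, so read the output as an order of magnitude rather than an exact figure.

```python
import math
from statistics import NormalDist


def margin_of_error(p, n, confidence=0.95):
    """Normal-approximation half-width for a proportion p observed over n trials."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p * (1 - p) / n)


observations = 12 * 10  # 12 users x 10 tasks

lopsided = margin_of_error(0.95, observations)
worst_case = margin_of_error(0.50, observations)

print(f"p = .95: +/- {lopsided:.3f}")
print(f"p = .50: +/- {worst_case:.3f}")
```

The variance term p(1 - p) is 0.0475 at p = .95 versus 0.25 at p = .5, so the same 120 observations give a margin less than half as wide.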

Again, all this assumes your data set is binomial, which is a discrete measurement. For a continuous measurement you would calculate the sample size using a different technique.

So while I just said to focus on the tasks and the users as an opportunity for a defect, there's one complication. Statistical computations don't work well when the measures are not independent. In this case, there isn't independence, since a user's performance on a task is likely to tell us something about her performance on subsequent tasks. This also means that changing the task is like changing the application tested: it has a dramatic effect on what you observe, a point raised in the early 1980s by Jim Lewis and also recently described in a CHI paper by Lindgaard and Chattratichart.
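One way to get a feel for how much non-independence costs is the design effect used in survey sampling (Kish's formula), which is not part of this article and is offered only as a hedged sketch. It assumes each user attempts the same number of tasks and that a single correlation value, `rho`, summarizes how alike one user's results are across tasks; the `rho` values below are hypothetical, not measured.

```python
def effective_sample_size(num_users, tasks_per_user, rho):
    """Effective number of independent observations under Kish's design effect.

    deff = 1 + (m - 1) * rho, where m is tasks per user and rho is the
    within-user correlation (0 = fully independent, 1 = fully redundant).
    """
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be between 0 and 1")
    total = num_users * tasks_per_user
    deff = 1 + (tasks_per_user - 1) * rho
    return total / deff


# 12 users x 10 tasks under three hypothetical correlation levels.
for rho in (0.0, 0.2, 1.0):
    n_eff = effective_sample_size(12, 10, rho)
    print(f"rho = {rho:.1f}: effective sample size = {n_eff:.1f}")
```

At the extremes the result runs from 120 (tasks fully independent) down to 12 (each user's tasks perfectly correlated), so the real information content of a 12-user, 10-task study lies somewhere between the two.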

Once you understand the role the task plays in affecting the population, use the interactive sample size calculator for completion rates to see how the sample size affects your uncertainty about a completion rate. If you're just looking to find UI problems, then use the Sample Size Calculator for Discovering Problems in a User Interface.


Update Note: 6/10/2005: The old defect opportunity used to say (# users) * (# of task measurements)*(# tasks). I was a bit optimistic in doing this. I removed the # of task measurements since those are not independent measures.






 


 