We already saw how a manageable sample of users can provide meaningful data for discrete-binary data like task completion. With continuous data like task times, the sample size can be even smaller.
The continuous calculation is a bit more complicated and involves something of a Catch-22. Most people want to determine the sample size ahead of time, then perform the testing based on the results of the sample size calculation, as in the binary sample calculation. In this case, we need to have some data already or at least a strong hypothesis about our user population.
As with the binary calculation for task completion, we know that when testing experienced users (those who complete the task at least weekly), they should overwhelmingly complete the task successfully. With task times we should also have a rough estimate of the mean and standard deviation (there's the Catch-22). If you're performing a benchmarking study and already have some data, then you can use that data. If you have time, run a pretest with a few users, say four, to get a sense of the range in times. Of course, when all else fails you can have some internal folks complete the tasks, perhaps some sales or service employees or whoever comes close to matching the speed and accuracy of your target users. You'll need to have an idea of the standard deviation (in seconds) for each task you're testing.
For example, let's use the sample task, "Looking up a balance on an account number" (a very common task in accounting software). You write up a scenario, try the task yourself, and have three of your colleagues complete it. Chances are you're completing the task faster than your users; nevertheless, it will still provide you with a range of times. Here are the times in seconds:
| Participant | Time (in seconds) |
| You | 101 |
| Colleague 1 | 132 |
| Colleague 2 | 125 |
| Colleague 3 | 145 |
| Mean | 125.75 |
| St Deviation | 18.46 |
| Range | 44 |
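As a quick check of the summary rows in the table, here is a minimal Python sketch using only the standard library. The variable names are illustrative. `statistics.stdev` is used because it computes the sample standard deviation (n - 1 in the denominator), which is what the 18.46 figure corresponds to.

```python
import statistics

# Pilot times in seconds, taken from the table above
times = {"You": 101, "Colleague 1": 132, "Colleague 2": 125, "Colleague 3": 145}
values = list(times.values())

mean_time = statistics.mean(values)          # arithmetic mean
stdev_time = statistics.stdev(values)        # sample standard deviation (n - 1)
time_range = max(values) - min(values)       # slowest minus fastest

print(f"Mean: {mean_time:.2f}")
print(f"St Deviation: {stdev_time:.2f}")
print(f"Range: {time_range}")
```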
Suppose we want to know the mean task time to within about ten seconds. That ten-second margin will become the confidence interval. The confidence interval is that + or - fudge factor seen with the polls on TV. With this confidence interval we can work backwards to arrive at our sample size. Because we don't know the standard deviation of the whole population of users (again the Catch-22), we need to estimate it from the small sample we have. For small samples (fewer than 30) where the parent standard deviation (σ) is not known, you use what's called Student's t distribution. Student's t distribution uses values from a t table instead of the more familiar z table of normal values.
The confidence interval is calculated by multiplying this t-statistic (t*) by the Standard Error (SE). The Standard Error is just the sample standard deviation divided by the square root of the sample size. So the confidence interval formula usually looks something like this:
$$\bar{x} \pm t^{*} \frac{s}{\sqrt{n}}$$
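The calculation described above (t* multiplied by the standard error) can be sketched in a few lines of Python. The function name `margin_of_error` is a hypothetical helper, not something from the article, and the t* value is typed in by hand from a t table (2.144 for 14 degrees of freedom at 95% confidence) rather than computed.

```python
import math

def margin_of_error(t_star, s, n):
    """Half-width of the confidence interval: t* times the standard error."""
    standard_error = s / math.sqrt(n)  # sample st. deviation / sqrt(sample size)
    return t_star * standard_error

# Assumed inputs: s = 18.46 from the four pilot times, a candidate n of 15,
# and t* = 2.144 read from a t table for n - 1 = 14 degrees of freedom.
margin = margin_of_error(2.144, 18.46, 15)
print(f"+/- {margin:.2f} seconds")
```

Keeping t* as an argument makes the dependence on n explicit: changing the sample size means looking up a new t* as well as recomputing the standard error.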
To arrive at the elusive "significant" sample size, you need to try a few reasonable sample sizes and see which ones fall within the limits of the confidence interval. The value of n you choose will affect both the critical value for t and the Standard Error, since both use n in their equations. We'll use 25, 20, 15, 10 and 5, and whichever value has a confidence interval of about 10 seconds we'll use as the ideal sample. (Again, all this assumes that our internal sample did a good job of estimating the standard deviation of the larger population.)
| Sample (n) | 95% CI (+/- seconds) | SE | SQRT N | Stdev | t* |
| 25 | 7.61 | 3.692 | 5 | 18.46 | 2.063 |
| 20 | 8.63 | 4.12 | 4.47 | 18.46 | 2.093 |
| 15 | 10.22 | 4.76 | 3.87 | 18.46 | 2.144 |
| 10 | 13.20 | 5.83 | 3.16 | 18.46 | 2.262 |
| 5 | 22.92 | 8.25 | 2.23 | 18.46 | 2.776 |
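The trial-and-error over candidate sample sizes can be reproduced with a short script. This is a sketch under stated assumptions: the t* values are copied by hand from the table rather than computed, the names `T_STAR`, `TARGET` and `margin_of_error` are illustrative, and results may differ from the table in the last digit because of rounding.

```python
import math

S = 18.46  # standard deviation estimated from the four pilot times

# Two-sided 95% critical values copied from the table, keyed by sample size n
# (degrees of freedom are n - 1).
T_STAR = {25: 2.063, 20: 2.093, 15: 2.144, 10: 2.262, 5: 2.776}

def margin_of_error(n, s=S):
    """t* times the standard error s / sqrt(n) for a candidate sample size."""
    return T_STAR[n] * s / math.sqrt(n)

margins = {n: margin_of_error(n) for n in T_STAR}
for n, m in margins.items():
    print(f"n={n:>2}  SE={S / math.sqrt(n):.2f}  CI=+/-{m:.2f}")

TARGET = 10.0  # desired margin in seconds
closest_n = min(margins, key=lambda n: abs(margins[n] - TARGET))
print("Candidate closest to the target margin:", closest_n)
```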
At about 15 users, the confidence interval narrows close enough to ten seconds that it will probably be sufficient. I'd use 15 as the approximate number of users you'd need to sample, knowing that to get more precise you'd need to sample more than 15 users. This result is much better than thinking you need to test 100 or 1000 users in order to get "statistically significant" results. If +/- 10 seconds isn't precise enough you can:
- Decrease your confidence level to 90% or 85%.
- Sample more users.
- Decrease your confidence interval and increase your sample.
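To get a feel for how much the first option buys you, the sketch below recomputes the margin for n = 15 at 95%, 90% and 85% confidence. Python's standard library has no t distribution, so the critical value is approximated here by numerically integrating the density and bisecting. This is an illustrative stand-in for a t table or a spreadsheet function, and all helper names are hypothetical.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

def t_cdf(x, df, steps=1000):
    """CDF for x >= 0: Simpson's rule on [0, x] plus 0.5 (symmetry about 0)."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_critical(confidence, df):
    """Two-sided critical value t*, found by bisection on the CDF."""
    target = 1 - (1 - confidence) / 2  # e.g. 0.975 for 95% confidence
    lo, hi = 0.0, 100.0
    for _ in range(40):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def margin_of_error(confidence, s, n):
    """t* times the standard error for a given confidence level."""
    return t_critical(confidence, n - 1) * s / math.sqrt(n)

# s = 18.46 is the pilot-sample estimate; n = 15 is the candidate sample size.
margins = {c: margin_of_error(c, 18.46, 15) for c in (0.95, 0.90, 0.85)}
for c, m in margins.items():
    print(f"{c:.0%} confidence: +/- {m:.2f} seconds")
```

Lowering the confidence level shrinks t* and therefore the margin without adding users, at the cost of being less sure the interval contains the true mean.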
Sample Sizes in the Real World of Usability Testing
If you've run enough usability tests, you know that in many cases your sample size is determined ahead of time. That is, you know your budget and time frame and therefore approximately how many users you'll be sampling, usually somewhere between 10 and 30. I then approach sampling as getting as many users as I can within that range and computing the statistics later.
For example, let's say we followed our initial indication and sampled 15 users (assuming our budget and time fit nicely with this figure). We had them complete the same task of looking up an account balance as our small internal employee sample. Here are the results next to our initial internal sample:
$$\mu = \bar{x} \pm t^{*} \frac{s}{\sqrt{n}}$$
| x̄ | mean time of your sample (126.6) |
| μ | true mean time of the entire population of users |
| n | number of users in the sample (15) |
| s | the standard deviation of the sample (16.33) |
| t* | t statistic (2.144787), or use the Excel function =TINV(.05,14) [significance level (.05, for 95% confidence) and degrees of freedom n-1 (14)] |
Plugging in the numbers, for the estimated mean of the total population of users on this task we get:
μ = 126.6 +/- 2.1448 × (16.33 / √15) ≈ 126.6 +/- 9.04
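The plug-in step can be checked with a minimal sketch that recomputes the interval from the rounded summary statistics quoted in the legend (mean 126.6, s = 16.33, n = 15). The function name `confidence_interval` is a hypothetical helper, and t* is entered by hand as 2.1448, the two-sided 95% value for 14 degrees of freedom. Because the inputs are rounded, the last digit of the margin may differ slightly from a figure worked out with unrounded data.

```python
import math

def confidence_interval(mean, s, n, t_star):
    """Return (lower, upper, margin) for mean +/- t* * s / sqrt(n)."""
    margin = t_star * s / math.sqrt(n)
    return mean - margin, mean + margin, margin

# Summary figures quoted in the article; t* corresponds to what the Excel
# function =TINV(.05,14) returns, rounded to four decimals.
lower, upper, margin = confidence_interval(mean=126.6, s=16.33, n=15, t_star=2.1448)
print(f"{lower:.1f} to {upper:.1f} seconds (+/- {margin:.2f})")
```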


