Quantitative Usability, Statistics & Six Sigma by Jeff Sauro
Task Times in Formative Usability Tests
by Jeff Sauro | June 6, 2008

Time-on-task data are commonly thought of as something to gather only during summative evaluations because, during a formative evaluation, the focus is on finding and fixing problems, or at least on finding the problems and delivering a report. For a variety of reasons, time-on-task measures often get left out of the mix. In this article, I show that time-on-task can be a valuable diagnostic and comparative tool during formative evaluations.

The three most common reasons I’ve heard for not using time-on-task in formative studies are:

  1. Using quantitative measures requires larger samples (>20).
  2. Average task times are an inaccurate metric when users think out loud.
  3. Task times are only for benchmarking and not for identifying problems.

Below I discuss why these reasons should NOT prevent you from collecting time-on-task in your next formative evaluation.

Small Samples are Fine

One can collect time-on-task and use parametric statistics for the small (<10) sample sizes typical of usability tests. The major caveat is that small-sample statistical procedures should be used. For example, when calculating confidence intervals for task time, use the t-statistic instead of the normal deviate (z-statistic) because the t-statistic takes the size of the sample into account when generating the interval. The smaller the sample, the larger the t critical value (and so the wider the interval); as your sample gets larger (especially above 30), the t and z values converge. For task completion or problem occurrences, the Adjusted Wald procedure for computing confidence intervals around a proportion also performs well for small samples (Sauro & Lewis 2005). In short, a small sample alone does not preclude collecting time-on-task metrics or using statistics to describe them.
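To make the t-versus-z point concrete, here is a minimal Python sketch. The task times and completion counts are made up for illustration, the function names are my own, and the t critical values are the standard two-sided 95% table values hard-coded for up to 16 users. It shows one way to compute a small-sample interval for mean task time and an Adjusted Wald interval for a completion rate; treat it as a sketch, not a reference implementation.

```python
import math
from statistics import NormalDist, mean, stdev

# Two-sided 95% critical values of Student's t for df = 1..15
# (standard table values; extend the table for larger samples).
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571,
             6: 2.447, 7: 2.365, 8: 2.306, 9: 2.262, 10: 2.228,
             11: 2.201, 12: 2.179, 13: 2.160, 14: 2.145, 15: 2.131}


def mean_ci_95(times, use_t=True):
    """95% confidence interval for the mean, using t (default) or z."""
    n = len(times)
    m = mean(times)
    se = stdev(times) / math.sqrt(n)  # standard error of the mean
    crit = T_CRIT_95[n - 1] if use_t else NormalDist().inv_cdf(0.975)
    return m - crit * se, m + crit * se


def adjusted_wald_ci(successes, n, confidence=0.95):
    """Adjusted Wald interval: add z^2/2 successes and z^2 trials, then Wald."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_adj = n + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


# Hypothetical data: task times in seconds for 6 users, 5 of whom completed the task.
times = [62, 75, 81, 58, 120, 69]
t_low, t_high = mean_ci_95(times, use_t=True)
z_low, z_high = mean_ci_95(times, use_t=False)
aw_low, aw_high = adjusted_wald_ci(successes=5, n=6)

print(f"t-based 95% CI for mean time: {t_low:.1f} to {t_high:.1f} s")
print(f"z-based 95% CI for mean time: {z_low:.1f} to {z_high:.1f} s")
print(f"Adjusted Wald 95% CI for completion: {aw_low:.2f} to {aw_high:.2f}")
```

With only six users the t-based interval comes out noticeably wider than the z-based one, which is the extra uncertainty the small sample should carry.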

Task-Time as a Benchmark between Designs

If you are doing a formative evaluation as part of an iterative testing plan and you have used think-aloud during all iterations, the mean time-on-task becomes a benchmark to help judge the efficacy of subsequent designs. Although small samples require larger differences between iterations to reach significance, statistically significant differences are well within reach (for example, see Bailey 1993). That is, assuming you use the same tasks and have users think aloud concurrently with their task attempts, you can compare the mean completion times across iterations. For improving the usability of a system, practitioners should also strongly consider relaxing their Type I rejection criterion (Sauro 2006) from the conventional publication threshold of p < .05 to, say, p < .10 or .20. While this is always context dependent, in business applications one should look for a sufficient amount of evidence (not necessarily a preponderance of evidence) to conclude that a design improves over its predecessor (Kirakowski 2005).
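As an illustration of comparing two iterations, the sketch below runs a two-sample t-test on made-up task times and judges the result against a relaxed rejection criterion. Several things here are my assumptions rather than anything prescribed above: different users in each iteration (independent samples), Welch's unequal-variance form of the t-test, a p-value obtained by numerically integrating the t density so that only the standard library is needed, and the function names themselves.

```python
import math
from statistics import mean, variance


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_p(t_stat, df, steps=2000):
    """Two-sided p-value via Simpson's rule over [0, |t|]; steps must be even."""
    upper = abs(t_stat)
    if upper == 0:
        return 1.0
    h = upper / steps
    total = t_pdf(0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # probability mass between 0 and |t|
    return max(0.0, 1 - 2 * area)


def welch_t_test(a, b):
    """Welch's t statistic, approximate degrees of freedom, and two-sided p."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t_stat = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t_stat, df, two_sided_p(t_stat, df)


# Hypothetical task times (seconds) for the same task in two design iterations.
design_a = [95, 110, 130, 88, 142, 105, 120, 99]
design_b = [70, 85, 92, 64, 101, 78, 88, 73]

ALPHA = 0.10  # relaxed Type I criterion; pick what fits your context
t_stat, df, p_value = welch_t_test(design_a, design_b)
print(f"t = {t_stat:.2f}, df = {df:.1f}, p = {p_value:.4f}")
print("Evidence of improvement" if p_value < ALPHA else "Not enough evidence yet")
```

Because task times are usually positively skewed, you may prefer to run the same comparison on log-transformed times.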

Getting an accurate and stable measure of the actual user time-on-task is more problematic than comparing designs. One would expect task times to increase when users are asked to think aloud while completing tasks. The published data, however, are mixed, with some studies actually showing faster performance while thinking aloud, possibly due to the invocation of cognitive processes that improve rather than degrade performance (Berry and Broadbent 1990). For a good summary of the evidence, see Lewis 2006, p. 1282. More research is needed to draw a conclusion on this point. Regardless, I recommend focusing on relative task-time improvements between designs because doing so avoids the issue altogether.

Task Times as Symptoms of UI Problems

While the absolute time might not be the best measure of the true task completion time, it allows analysis of outliers and patterns as a diagnostic tool. It might not tell you exactly what the problem is, but it can help tell you where there is a problem. For example, the task data graphed in Figure 1 were taken from the publicly available CUE-4 (Molich 2004) data from Team M, which timed 15 users while they thought out loud as they completed tasks on a hotel reservation website. This task asked the users to cancel a reservation.


Figure 1: Time to cancel a reservation on a hotel website (in log-transformed seconds). One user took over 4 times the mean time to complete the task. The solid red line is the geometric mean and the dashed green lines are the upper and lower bounds of the 95% confidence interval.

Graphing the data, we quickly see that one user took over 4 times longer than the mean time to cancel the reservation (I graphed the data using the Graph and Calculator for Confidence Intervals for Task Times). This simple graph of the task times allows the investigator and the reader of a report to zero in on potential causes of such a long task time (relative to the other users). While it’s unclear from the report what was occurring during this task, an analysis of this user’s profile shows that she had never visited a hotel website or made a reservation at one prior to the test. Her comments also reinforce that she was a “novice” Internet user: “I feel that my inexperience with the web had a lot to do with difficulties.” Whether it was just the user’s inexperience or some specific interface problems, perhaps particularly damaging to a novice, it is clear this user had trouble during the task. A few pixels tell the story.
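The same kind of screening can be done numerically. The sketch below uses made-up times for 15 users (not the CUE-4 data), computes the geometric mean and a 95% confidence interval on the log scale, and flags any user whose time is several multiples of the geometric mean. The function names and the 3x cutoff are my own assumptions for illustration; choose a threshold that suits your data.

```python
import math
from statistics import mean, stdev


def log_scale_summary(times, t_crit):
    """Geometric mean and confidence bounds computed on log-transformed times."""
    logs = [math.log(t) for t in times]
    m = mean(logs)
    se = stdev(logs) / math.sqrt(len(logs))
    # Back-transform so the results are in seconds again.
    return math.exp(m), math.exp(m - t_crit * se), math.exp(m + t_crit * se)


def flag_slow_users(times, geo_mean, ratio=3.0):
    """Return (user number, time, multiple of geometric mean) for slow users."""
    return [(i, t, t / geo_mean)
            for i, t in enumerate(times, start=1)
            if t / geo_mean >= ratio]


# Hypothetical task times in seconds for 15 users (NOT the CUE-4 data).
times = [48, 55, 61, 52, 70, 66, 58, 75, 49, 63, 80, 57, 68, 54, 310]

# 2.145 is the two-sided 95% t critical value for df = 14 (15 users).
geo_mean, low, high = log_scale_summary(times, t_crit=2.145)
flags = flag_slow_users(times, geo_mean)

print(f"Geometric mean: {geo_mean:.1f} s (95% CI {low:.1f} to {high:.1f} s)")
for user, seconds, multiple in flags:
    print(f"User {user}: {seconds} s, {multiple:.1f}x the geometric mean")
```

A flagged user is a prompt to go back to the session notes or recording for that task, not a verdict about the interface.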

Time-on-task is an underutilized tool for formative evaluations. It costs nothing (just start and stop a timer), is useful with any number of users, and can be a valuable tool for diagnosing problems as well as for making objective comparisons between iterations. I encourage you to collect time-on-task during your next formative evaluation.

 

References

  1. Berry, D. C., & Broadbent, D. E. (1990). The role of instruction and verbalization in improving performance on complex search tasks. Behaviour & Information Technology, 9, 175-190.
  2. Bailey, G. (1993). Iterative methodology and designer training in human-computer interface design. In Proceedings of the INTERACT '93 and CHI '93 Conference on Human Factors in Computing Systems (Amsterdam, The Netherlands, April 24-29, 1993).
  3. Kirakowski, J. (2005). Summative usability testing: Measurement and sample size. In R. G. Bias & D. J. Mayhew (Eds.), Cost-Justifying Usability: An Update for the Internet Age. Morgan Kaufmann Publishers, CA.
  4. Lewis, J. R. (2006). Usability testing. In G. Salvendy (Ed.), Handbook of Human Factors and Ergonomics (pp. 1275-1316). New York, NY: John Wiley.
  5. Molich, R. (2004). Comparative Usability Evaluation CUE-4.
  6. Sauro, J. (2006). The user is in the numbers. ACM Interactions, 13(6), November-December.
  7. Sauro, J., & Lewis, J. R. (2005). Estimating completion rates from small samples using binomial confidence intervals: Comparisons and recommendations. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting (HFES 2005), Orlando, FL.

