Measuring Usability
Quantitative Usability, Statistics & Six Sigma by Jeff Sauro

Task Times in Formative Usability Tests

Jeff Sauro • June 6, 2008

It is common to think of time-on-task data as something gathered only during summative evaluations because, during a formative evaluation, the focus is on finding and fixing problems, or at least finding the problems and delivering a report. For a variety of reasons, time-on-task measures often get left out of the mix. In this article, I show that time-on-task can be a valuable diagnostic and comparative tool during formative evaluations.

The three most common reasons I've heard for not using time-on-task in formative studies are:

  1. Using quantitative measures requires larger samples (>20).
  2. Average task times are an inaccurate metric when users think aloud.
  3. Task times are only for benchmarking and not for identifying problems.

Below I discuss why these reasons should NOT prevent you from collecting time-on-task in your next formative evaluation.

Small Samples are Fine

One can collect time-on-task and use parametric statistics with the small (<10) sample sizes typical of usability tests. The major caveat is that small-sample statistical procedures should be used. For example, when calculating confidence intervals for task time, use the t-distribution instead of the normal deviate (z-statistic), because the t critical value takes the sample size into account when generating the interval: the smaller the sample, the larger the critical value, and as the sample grows (especially above 30) the two converge. For task completion or problem occurrences, the Adjusted Wald procedure for computing confidence intervals around a proportion also performs well for small samples (Sauro & Lewis 2005). In short, sample size alone does not preclude collecting time-on-task metrics or using statistics to describe them.
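As a minimal sketch of the two small-sample intervals just described, the following Python uses only the standard library; the task times and the 8-of-10 completion count are made-up illustration data. The task-time interval is built in log space (task times are typically right-skewed) and exponentiated back, yielding a CI around the geometric mean.

```python
import math
from statistics import mean, stdev

# Hypothetical task times (seconds) from a 10-user formative test.
times = [55, 62, 48, 90, 71, 50, 66, 84, 59, 120]

# t-based 95% CI on the log-transformed times.
logs = [math.log(t) for t in times]
n = len(logs)
m, s = mean(logs), stdev(logs)
t_crit = 2.262  # t(0.975, df=9); larger than z = 1.96, widening the interval
half = t_crit * s / math.sqrt(n)
lo, hi = math.exp(m - half), math.exp(m + half)

# Adjusted Wald 95% CI for a completion rate of 8/10 (Sauro & Lewis 2005):
# add z^2/2 successes and z^2 trials, then apply the usual Wald formula.
z = 1.96
x, N = 8, 10
p_adj = (x + z**2 / 2) / (N + z**2)
n_adj = N + z**2
moe = z * math.sqrt(p_adj * (1 - p_adj) / n_adj)

print(f"Geometric mean {math.exp(m):.1f}s, 95% CI [{lo:.1f}, {hi:.1f}]")
print(f"Completion rate 95% CI [{p_adj - moe:.2f}, {min(1, p_adj + moe):.2f}]")
```

Note how the t critical value (2.262 at 9 degrees of freedom) is noticeably larger than the z value of 1.96, exactly the small-sample widening described above.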

Task-Time as a Benchmark between Designs

If you are doing a formative evaluation as part of an iterative testing plan and you have used think-aloud during all iterations, the mean time-on-task becomes a benchmark to help judge the efficacy of subsequent designs. Although you will need larger differences between iterations, statistically significant differences are well within reach (for an example, see Bailey 1993). That is, assuming you use the same tasks and have users think aloud concurrently with their task attempts, you can compare the mean completion times across iterations. When the goal is improving the usability of a system, practitioners should also strongly consider relaxing their Type I rejection criterion (Sauro 2006) from the conventional publication threshold of p < .05 to, say, .10 or .20. While this is always context dependent, in business applications one should look for a sufficient amount of evidence--not necessarily a preponderance of evidence--to conclude a design improves over its predecessor (Kirakowski 2005).
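A comparison of mean times across two iterations can be sketched as a two-sample (Welch's) t-test on the log-transformed times; the log-seconds below are invented for illustration, and the critical values quoted in the comments come from a standard t table:

```python
import math
from statistics import mean, stdev

# Hypothetical log task times (log-seconds) from two design iterations.
v1 = [4.2, 4.6, 4.1, 4.9, 4.4, 4.7, 4.3, 5.0]  # iteration 1
v2 = [3.9, 4.1, 3.8, 4.4, 4.0, 4.2, 3.7, 4.3]  # iteration 2 (faster)

# Welch's two-sample t statistic on the log times.
m1, m2 = mean(v1), mean(v2)
s1, s2 = stdev(v1), stdev(v2)
n1, n2 = len(v1), len(v2)
se = math.sqrt(s1**2 / n1 + s2**2 / n2)
t_stat = (m1 - m2) / se

# Welch-Satterthwaite degrees of freedom.
df = (s1**2/n1 + s2**2/n2)**2 / (
    (s1**2/n1)**2 / (n1 - 1) + (s2**2/n2)**2 / (n2 - 1))

# With alpha relaxed to 0.10 (two-sided), the critical t at ~13 df is
# about 1.77, versus about 2.16 at the conventional 0.05 level -- the
# relaxed criterion asks for less evidence before declaring improvement.
print(f"t = {t_stat:.2f}, df = {df:.1f}")
```

With these numbers the observed t statistic clears either threshold, but the gap between the two critical values shows how much a relaxed criterion lowers the bar for small formative samples.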

Getting an accurate and stable measure of the actual user time-on-task is more problematic than comparing designs. One would expect task times to increase when users are asked to think aloud while completing tasks. The published data, however, are mixed, with some studies actually showing faster performance while thinking aloud, possibly because verbalization invokes cognitive processes that improve rather than degrade performance (Berry and Broadbent, 1990). For a good summary of the evidence, see Lewis (2006, p. 1282). More research is needed to draw a conclusion on this point. Regardless, I recommend focusing on relative task-time improvements between designs, because this avoids the issue altogether.

Task Times as Symptoms of UI Problems

While the absolute time might not be the best measure of the true task completion time, it allows analysis of outliers and patterns as a diagnostic tool. It might not tell you exactly what the problem is, but it can help tell you where there is a problem. For example, the task data graphed in Figure 1 were taken from the publicly available CUE-4 data (Molich 2004) for Team M, which timed 15 users as they thought aloud while completing tasks on a hotel reservation website. The task asked users to cancel a reservation.

Figure 1: Time to cancel a reservation on a hotel website (in log-transformed seconds). One user took over 4 times the mean time to complete the task. The solid red line is the geometric mean; the dashed green lines are the upper and lower bounds of the 95% confidence interval.

In graphing the data we quickly see that one user took over 4 times longer than the mean time to cancel the reservation (I graphed the data using the Graph and Calculator for Confidence Intervals for Task Times). This simple graph of the task times allows the investigator and the reader of a report to zero in on potential causes of such a long task time (relative to the other users). While it is unclear from the report what was occurring during this task, an analysis of this user's profile shows that she had never visited a hotel website or made a reservation on one prior to the test. Her comments also reinforce her being a "novice" Internet user: "I feel that my inexperience with the web had a lot to do with difficulties." Whether it was just the user's inexperience or some specific interface problem, perhaps particularly damaging to a novice, it is clear this user had trouble during the task. A few pixels tell the story.
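The same "where to look" screen can be run without a graph. The sketch below flags any user whose log time falls more than 2 standard deviations above the log mean; the 15 cancellation times are invented to mimic the Figure 1 pattern, not the actual CUE-4 data.

```python
import math
from statistics import mean, stdev

# Hypothetical cancellation times (seconds) for 15 think-aloud users,
# with one user taking far longer than the rest (as in Figure 1).
times = [70, 85, 60, 95, 80, 75, 110, 65, 90, 100, 72, 88, 78, 82, 400]

logs = [math.log(t) for t in times]
m, s = mean(logs), stdev(logs)

# Flag users more than 2 SDs above the log mean -- a simple screen for
# "where there is a problem," not "what the problem is."
for i, (t, lg) in enumerate(zip(times, logs), start=1):
    if lg > m + 2 * s:
        ratio = t / math.exp(m)  # multiple of the geometric mean
        print(f"User {i}: {t}s, about {ratio:.1f}x the geometric mean")
```

With these numbers only the last user is flagged, at roughly four times the geometric mean; that user's session notes and profile would then be the place to dig.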

Time-on-task is an under-utilized tool for formative evaluations. It costs next to nothing to collect (just start and stop a timer), it is useful with any number of users, and it can be a valuable tool for diagnosing problems as well as making objective comparisons between iterations. I encourage you to collect time-on-task during your next formative evaluation.



  1. Berry, D. C., & Broadbent, D. E. (1990). The role of instruction and verbalization in improving performance on complex search tasks. Behaviour & Information Technology, 9, 175-190.
  2. Bailey, G. (1993). Iterative methodology and designer training in human-computer interface design. In Proceedings of the INTERACT '93 and CHI '93 Conference on Human Factors in Computing Systems (Amsterdam, The Netherlands, April 24-29, 1993).
  3. Kirakowski, J. (2005). Summative usability testing: Measurement and sample size. In R. G. Bias & D. J. Mayhew (Eds.), Cost-Justifying Usability: An Update for the Internet Age. Morgan Kaufmann Publishers, CA.
  4. Lewis, J. R. (2006). Usability testing. In G. Salvendy (Ed.), Handbook of Human Factors and Ergonomics (pp. 1275-1316). New York, NY: John Wiley.
  5. Molich, R. (2004). Comparative Usability Evaluation CUE-4.
  6. Sauro, J. (2006). The user is in the numbers. ACM Interactions, 13(6), November-December.
  7. Sauro, J., & Lewis, J. R. (2005). Estimating completion rates from small samples using binomial confidence intervals: Comparisons and recommendations. In Proceedings of the Human Factors and Ergonomics Society Annual Meeting (HFES 2005), Orlando, FL.

About Jeff Sauro

Jeff Sauro is the founding principal of Measuring Usability LLC, a company providing statistics and usability consulting to Fortune 1000 companies. He is the author of over 20 journal articles and 4 books on statistics and the user experience.

Posted Comments

There is 1 comment.

February 11, 2010 | Dana Chisnell wrote:

I like your position on time-on-task during formative testing being an indicator of something gone wrong. I would argue that it's just one way of telling that something went wrong and what the issue was. In fact, I'd say that in the problem you describe -- the person spending a longer time canceling a hotel reservation -- there were many other bits of evidence indicating that this person was having a problem. For example, with the think aloud, she probably was pretty verbal about her issues and questions.

There *is* a cost to the moderator (or the team). Usually, there are a lot of things going on in a formative test. Tracking time on task adds to that overhead, and adds to the data analysis time, as well.

This also assumes that the amount of time that the think aloud slows people down is constant from participant to participant. This seems very unlikely.

Anyway, neat idea, but for most of the teams I work with who are developing new designs, time is not the paramount indicator of success, failure, or progress. 
