Measuring Usability
Quantitative Usability, Statistics & Six Sigma by Jeff Sauro

Do users fail a task and still rate it as easy?

Jeff Sauro • October 9, 2009

Have you ever watched a user perform horribly during a usability test, only to watch in amazement as they rate the task as very easy to use? I have, and for as long as I've been conducting usability tests, I've heard of this contradictory behavior from other researchers. Such occurrences have led many to discount the collection of satisfaction data altogether. In fact, I've often heard that you should watch what users do, not what they say, because attitudes are often at odds with behavior. For example:
  • Participants' answers [to likert questions] are often at odds with their behavior (UPA 2004)
  • Objective measures of performance and preference/satisfaction do not often correlate (Mayhew 1999)
  • First Rule of Usability: Don't Listen to Users…pay attention to what users do, not what they say. Self-reported claims are unreliable (Nielsen 2001)

So there appear to be some well-formed opinions about the collection of subjective usability data. I wanted to see whether I could find a systematic pattern either supporting or refuting the apparent truism that users say and do different things.

The third quote above, "Don't Listen to Users…," comes from a 2001 Nielsen article. A closer look at this article suggests the issue might be a bit more nuanced. In it he reports a relatively strong correlation (r = .44) between users' measured performance and their stated preference (in reference to Nielsen & Levy, 1994). In fact, in an earlier analysis I did with Jim Lewis (Sauro & Lewis, 2009) we found an even stronger correlation of .51 between post-task ratings of usability and completion rates. Post-task ratings of usability are the few questions asked immediately after a task is completed. While they can measure different things, they all tend to correlate highly with each other and attempt to measure something like the perceived ease of use or the overall satisfaction with the task. For brevity, I'll refer to what they measure as satisfaction.
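A correlation between a binary measure (task completion) and a graded one (a satisfaction rating) is a point-biserial correlation, which is just Pearson's r with one variable coded 0/1. Here is a minimal sketch of that calculation; the ratings and outcomes below are made up for illustration, not taken from the dataset:

```python
# Point-biserial correlation between task completion (0/1) and a
# post-task satisfaction rating. The data below is illustrative
# only; it is NOT from the usability-test database in the article.
import statistics

completed = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]   # 1 = passed the task
rating    = [5, 4, 2, 5, 3, 4, 5, 1, 3, 4]   # 1 to 5 satisfaction scale

def pearson_r(x, y):
    """Pearson's r; with one binary variable this is the point-biserial r."""
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

print(round(pearson_r(completed, rating), 2))
```

With heavier tooling, scipy.stats.pointbiserialr returns the same coefficient along with a p-value.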

When users fail a task, 14% rate the experience at the highest possible satisfaction score.
To dig into what a correlation of .51 between completion rates and post-task satisfaction means, I returned to the database from which we did that analysis. It has since grown to contain 123 usability tests representing data from over 3,000 users. They are summative usability tests from 30 institutions, a mix of mostly lab-based (attended) and remote (unmoderated) studies. In this dataset, 53 tests contain some type of post-task usability questionnaire. From these tests there are 677 tasks, generating 19,576 observations of users attempting a task and rating their perception of its usability. If users say and do systematically different things, then it's going to show up here.

Almost all of the questionnaires (94%) use either 5- or 7-point scales and are either a single question such as "Overall, this task was" (very easy to very difficult) or an average of 2 to 5 questions (see, for example, the ASQ; Lewis, 1991). The remaining questionnaires are the Subjective Mental Effort Questionnaire (SMEQ) (4%), Usability Magnitude Estimation (1%) and other 3-point scales (1%). See Sauro & Dumas (2009) for more information.

To analyze the different scale types together, I converted the raw scores into percentages of the maximum score and reversed some scales so they all pointed in the same direction, with higher scores meaning higher satisfaction. For example, on a five-point scale, raw scores of 1, 2, 3, 4 and 5 become 0, 25, 50, 75 and 100%. On a seven-point scale they become 0, 16.7, 33.3, 50, 66.7, 83.3 and 100%. Questionnaires that average more than one question generate additional points in between.
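The conversion described above amounts to 100 × (raw − 1) / (points − 1): the lowest point maps to 0% and the highest to 100%. A quick sketch (the function name is mine, not from the article):

```python
# Convert a raw rating on an n-point scale to a percent-of-maximum
# score: the lowest scale point maps to 0% and the highest to 100%.
def percent_of_max(raw, scale_points):
    return 100 * (raw - 1) / (scale_points - 1)

print(percent_of_max(4, 5))   # 75.0
print(percent_of_max(6, 7))   # 83.33...
```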

Failing a Task

Let's first look at the users who failed a task and see how they rated it in terms of satisfaction. Of the 19,576 task observations, 5,594 were failures (around 29%). Of these failed tasks, 790, or 14%, were rated at the highest possible satisfaction score (with a 95% confidence interval between 13.2 and 15.1%). That would be like rating a task a 5 out of 5 after failing it. Users are about 6 times more likely to rate satisfaction something less than the maximum score when they fail a task (86%/14%). Figure 1 below shows the distribution of responses when users fail, as a percent of the maximum score.
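Confidence intervals like the one above can be approximated with a normal-approximation (Wald) interval on a proportion. The article doesn't say which interval method was actually used (an adjusted-Wald interval is a common alternative for usability data), so treat this as a sketch:

```python
# 95% confidence interval for a proportion using the normal (Wald)
# approximation. Assumption: the article does not state which interval
# method produced its CIs; this is one common choice.
import math

def wald_ci(successes, n, z=1.96):
    p = successes / n
    margin = z * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin

# 790 of the 5,594 failed tasks received the maximum satisfaction rating.
low, high = wald_ci(790, 5594)
print(f"{low:.1%} to {high:.1%}")   # close to the reported 13.2% to 15.1%
```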

When we lower the threshold to 90% of the maximum score, 16% rated at or above this point (95% CI: 15.2 to 17.2%). Scores this high, short of the maximum, are only possible from questionnaires with at least two questions or scales (e.g., an average of a 4 and a 5) or from the single-question, 150-point SMEQ.

When we lower the threshold to 80% of the maximum score, 27.3% of users rated it at that point or higher (95% CI: 26.1 to 28.4%). A rating of 6 out of 7 falls in this range. Lowering the threshold again to 70% shows that 32.5% of users who fail a task (95% CI: 31.3 to 33.7%) rate it at that point or higher. This includes users who rate 4 out of 5 (75% of the maximum score).

If we look at just the 7-point scales (53% of all scales), given a task failure, the probability of a rating at the maximum satisfaction score of 7 is 18%. The probability of a 1, the minimum score, is 24%.

If we look at just the 5-point scales (42% of all scales), again given task failure, the probability of a rating at the maximum satisfaction score of 5 is 9%. The probability of the minimum satisfaction score of 1 is 17%.


Figure 1: Distribution of responses when users fail a task. Percent of the maximum score is used so different point scales (e.g., 5 and 7) can be combined. For example, a raw rating of 6 on a 7-point scale is a % Max Score of 83.3.

Passing a Task

As a point of comparison, Figure 2 shows scores from users who successfully completed ("passed") the task. Twenty-seven percent rated it at the maximum score (95% CI: 26.3 to 27.8%), 33% rated it at 90% of the maximum score or above (95% CI: 32.3 to 33.8%), and 51% rated it at or above 80% of the maximum score (95% CI: 50 to 51.6%). Only 2.8% of users (95% CI: 2.5 to 3.0%) who passed a task rated it at 0% of the maximum score (e.g., a 1 out of 7 or a 1 out of 5).

Figure 2: Distribution of responses when users pass a task. Percent of the maximum score is used so different point scales (e.g., 5 and 7) can be combined. A raw rating of 6 on a 7-point scale is a % Max Score of 83.3.

Rating an Extreme Score

80% of users who rate maximum satisfaction pass the task, and 80% of users who rate minimum satisfaction fail it.

We looked at what people rate when they pass or fail a task, but we can also look at satisfaction data from the opposite perspective: what percent pass or fail given a certain rating. When an item is rated at its maximum or minimum, it appears to be a better indication of the task's outcome (pass or fail). In fact, 82.7% of users who rated at 100% satisfaction also successfully completed the task (95% CI: 81.6 to 83.8%). Conversely, users who gave the lowest satisfaction rating completed only 22% of tasks (95% CI: 20 to 24.2%). In other words, it's an 80/20 rule of satisfaction and completion: approximately 80% of users who rate at the maximum satisfaction level pass the task and 20% fail it, and approximately 80% of users who rate at the minimum satisfaction level fail the task and 20% pass it. Users are 4 times as likely to complete a task when they rate it at the maximum versus the minimum satisfaction score.
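Flipping the conditional, from P(max rating | fail) to P(pass | max rating), is just a matter of regrouping the same observations. A sketch with hypothetical counts chosen only to mimic the 80/20 pattern, since the article doesn't publish the underlying cell counts:

```python
# Flip P(rating | outcome) into P(outcome | rating) with a simple
# 2x2 count table. Counts are hypothetical, chosen only to mimic
# the 80/20 pattern the article describes.
counts = {
    ("pass", "max_rating"): 800,
    ("fail", "max_rating"): 200,
    ("pass", "min_rating"): 150,
    ("fail", "min_rating"): 600,
}

def p_outcome_given_rating(outcome, rating):
    # Condition on the rating: divide by everyone who gave that rating.
    total = sum(v for (o, r), v in counts.items() if r == rating)
    return counts[(outcome, rating)] / total

print(p_outcome_given_rating("pass", "max_rating"))  # share of max raters who passed
print(p_outcome_given_rating("fail", "min_rating"))  # share of min raters who failed
```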


Figure 3: Average completion rates when users rate the minimum satisfaction score and maximum satisfaction score. Users are 4 times as likely to complete a task when they rate it at the maximum satisfaction versus the minimum satisfaction score.

Conclusion

So do users who fail a task still rate it as easy? If we restrict "easy" to mean only the maximum level of satisfaction, then about 14% of users do. If we define easy more loosely as anything at 75% of the maximum score or above (a 4 out of 5 or a 5.5 out of 7), then this happens about a third of the time.

I'm inclined to think that it's those extreme responses that garner all the attention. There are likely a number of reasons for the less-than-perfect correlation between task success and satisfaction ratings. One major factor is that users aren't always aware they "failed" a task, and so, thinking they successfully completed it, give it a high rating. The noticeable task failures lead to low satisfaction ratings most of the time, and the 80/20 rule tells us that extreme ratings are a decent indication of task outcomes. I'm also inclined to think that extreme ratings following poor task performance are such salient events that they feed an availability heuristic. We researchers are generally interested in improving ease of use, so we tend to focus on task failures, and we remember those 5's much more than all those 3's and 4's. In other words, we remember the task failures, and we especially remember the task failures that elicit favorable satisfaction scores. But in general, the extreme satisfaction responses tend to be informative: users are about 6 times more likely to rate satisfaction something less than the maximum score when they fail a task, and they are 4 times more likely to complete a task when they rate it at the maximum versus the minimum satisfaction score.

References

  1. Nielsen, J. (2001). "First Rule of Usability? Don't Listen to Users." http://www.useit.com/alertbox/20010805.html
  2. UPA Idea Market (2004).
  3. Mayhew, D. (1999). The Usability Engineering Lifecycle, p. 129.
  4. Lewis, J. R. (1991). Psychometric evaluation of an after-scenario questionnaire for computer usability studies: The ASQ. SIGCHI Bulletin, 23(1), 78-81.
  5. Nielsen, J. and Levy, J. (1994). Measuring usability: preference vs. performance. Communications of the ACM, 37(4), 66-75.
  6. Sauro, J. & Dumas, J. (2009). "Comparison of Three One-Question, Post-Task Usability Questionnaires." In Proceedings of the Conference on Human Factors in Computing Systems (CHI 2009), Boston, MA.
  7. Sauro, J. & Lewis, J. R. (2009). "Correlations among Prototypical Usability Metrics: Evidence for the Construct of Usability." In Proceedings of the Conference on Human Factors in Computing Systems (CHI 2009), Boston, MA.

About Jeff Sauro

Jeff Sauro is the founding principal of Measuring Usability LLC, a company providing statistics and usability consulting to Fortune 1000 companies.
He is the author of over 20 journal articles and 4 books on statistics and the user-experience.



Posted Comments

There are 4 Comments

October 13, 2009 | Jeff Sauro wrote:

Since the data was a combination of studies done by dozens of usability engineers, it contained all sorts of scales: some with adjectives, some without, with all varieties of anchors, pointing in different directions. But I agree, there is some percentage that are simply picking the wrong choice, and in fact, it might be the bulk of the 14% who rate paradoxically. It would certainly be a topic for future research. The good news is that it seems to happen enough to be a nuisance but not so much as to make the ratings useless.


October 13, 2009 | Mark Sheldon wrote:

Possibly the people who fail tasks but report a high level of satisfaction are in the lowest quartile of ability level and are used to failure.

Also, a possible means to measure satisfaction is the "How likely would you be to recommend this to a friend?" question identified by Reichheld, F. as "The One Number You Need to Grow," the only measure of customer satisfaction correlated with long-run financial performance. "Satisfaction" on its own may mean different things to different people.


October 13, 2009 | David Travis wrote:

Great article as usual Jeff.

In your studies, did people respond to the survey question by picking a number (e.g. 1-5) or picking a descriptive adjective (e.g. very hard, hard, easy, very easy)? The reason for asking is that another possibility is that people read the scale wrong: they rate the task as 1 (easy) when they meant to choose 5 (hard). In other words, they just make a mistake filling in the survey.

I doubt this can account for all of the paradoxical data, but if you're asking 12 participants to carry out 8 tasks (96 ratings), it wouldn't surprise me if you get 10 ratings that are simply erroneous. 


October 12, 2009 | Michael Gaigg wrote:

Hey Jeff, I found it very interesting to see that task success is not necessarily a criterion for satisfaction (reminds me a little of Las Vegas where people tend to be happy despite the fact that they are losing) which is one more reason to not trust what users are saying (telling)...

So, when users are happy, they are 4 times more likely to succeed - then heck, give them cake ;)
Any ideas what actually makes them happy? It seems that a tasty cake is not all, just the fact that they receive the cake in the first place might be enough?! 

