
How Effective are Heuristic Evaluations?

Jeff Sauro • September 6, 2012


It's a question that's been around since Nielsen and Molich introduced the discount usability method in 1990.

The idea behind discount usability methods, heuristic evaluations in particular and expert reviews more generally, is that it's better to uncover some usability issues than none, even if you don't have the time or budget to test actual users.

That's because, despite the rise of cheaper and faster unmoderated testing methods, it still takes considerable effort to conduct a usability test.

If a few experts can inspect an interface and uncover many or most of the problems users would encounter in less time and for less cost, then why not exploit this method?

But, can we trust heuristic evaluations? How do we know the problems evaluators uncover aren't just opinions? 

Do they uncover legitimate problems with an interface? How many problems are missed? How many problems are false positives?

Heuristic Evaluation and Usability Testing

To help answer these questions, we conducted a heuristic evaluation and usability test to see how the different evaluation methods compared.

We recently reported on a heuristic evaluation of the Budget and Enterprise websites. Four inspectors (2 experts and 2 novices) independently examined each website for issues users might encounter. They were asked to limit their inspections to two tasks (finding a rental location and renting a car).

In total, 22 issues were identified across the four evaluators. How many of these issues would users encounter and what was missed?

Prior to the heuristic evaluation, we conducted a usability test on the same websites but didn't share the results with the evaluators.

In total, we had 50 users attempt the same two tasks on both websites. The test was an unmoderated study conducted using UserZoom. Each participant was also recorded with UserTesting.com so we could play back every session with audio and video to identify usability issues. Two researchers viewed all 50 videos to record usability problems and identified 50 unique issues.

The graph below shows the 22 issues identified by the evaluators and the number and percentage of users who encountered each issue.


Figure 1: Problem matrix for Budget.com ("B") and Enterprise.com ("E") from four evaluators (E1-E4) and the number and percentage of 50 users who encountered the issue in a usability test.

For example, three evaluators and 24 of the 50 users (48%) on Enterprise had trouble locating the place where rental locations were listed (issue #1 "Locations Placement"). 

Two evaluators and 14 users (28%) had a problem with the way the calendar switches from the current month to the next month on Budget.com (issue #16), as shown in the figure below.


Figure 2: When selecting certain return dates, the Budget calendar will switch the placement of the month (notice how October goes from being on the right to the left).

All four evaluators found that adding a GPS to your rental after you added your personal information was confusing on Enterprise.com—an issue 62% of users also had (issue #11).

We found that the evaluators identified 16 of the 50 issues users encountered in the usability test (32% of the total). In terms of false positives, only two of the 22 issues identified by the evaluators (9%) weren't encountered by any of the 50 users.

How This Study Compares

There is a rich publication history comparing heuristic evaluations and usability testing; in fact, two of the most influential papers in usability cover these methods. In examining some of the more recent publications comparing heuristic evaluation (HE) and usability testing (UT), we looked for studies like our experiment, where the overlap in problems between the two methods is reported.

The table below shows four published studies in addition to the current one. On average, heuristic evaluations find around 36% of the problems uncovered in usability tests (ranging from 30% to 43%).

Study                   Overlap (Hits)   In HE not UT (False Alarms)   In UT not HE (Misses)   Inspectors   Users
Current study           32%              9%                            68%                     4            50
Doubleday et al. 1997   36%              40%                           39%                     5            20
Law & Hvannberg 2002    30%              38%                           32%                     2            10
Law & Hvannberg 2004    43%              46%                           48%                     18           19
Hvannberg et al. 2006   40%              37%                           60%                     10           10
Average                 36%              34%                           49%

The overlap is called a "hit," meaning the discount method hit on the same issue found in the traditional evaluation method of usability testing.

To get an idea about potential false alarms, we see that, on average, 34% of the problems identified in heuristic evaluations aren't found by users in a usability test (ranging from 9% to 46%). These have come to be known as "false alarms," suggesting these problems would not be encountered by users.

Finally, we see that on average heuristic evaluations miss around 49% of the issues uncovered from watching users in a usability test. Note: the percentages don't always add up to 100% because different problem totals are used as the denominators.
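
To make the arithmetic behind these three rates concrete, here is a minimal sketch in R using the counts reported for the current study. Hits and misses are computed against the 50 usability-test issues, while false alarms are computed against the 22 heuristic-evaluation issues, which is why the three percentages don't sum to 100%.

```r
# Counts reported for the current study
ut_issues <- 50   # unique issues observed across 50 users in the usability test
he_issues <- 22   # issues identified across the four evaluators
hits      <- 16   # usability-test issues also flagged by the evaluators
false_pos <- 2    # evaluator issues that no user encountered

hit_rate  <- hits / ut_issues                  # 0.32 -> 32% overlap ("hits")
miss_rate <- (ut_issues - hits) / ut_issues    # 0.68 -> 68% "misses"
fa_rate   <- false_pos / he_issues             # 0.09 -> 9% "false alarms"

round(c(hit = hit_rate, miss = miss_rate, false_alarm = fa_rate), 2)
```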

This study had by far the most users (2.5 times more than any other). That likely explains the much lower false alarm rate (9% vs. a 34% average) and the higher miss rate (68% vs. a 49% average): with more users, you increase the chances of seeing new issues and of "hitting" the issues identified by the evaluators.
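
As a rough illustration of why sample size matters here, the chance that at least one participant encounters an issue affecting a proportion p of users is 1 - (1 - p)^n, assuming independent participants. The sketch below uses a hypothetical issue affecting 10% of users (a value chosen only for illustration) to show how that chance climbs with more participants:

```r
# Probability that at least one of n participants encounters an issue
# that affects a proportion p of all users (assumes independent participants)
p_detect <- function(p, n) 1 - (1 - p)^n

# Hypothetical issue affecting 10% of users, with 5, 20, and 50 participants
round(p_detect(0.10, c(5, 20, 50)), 2)   # roughly 0.41, 0.88, 0.99
```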

The lower false alarm rate might also be explained by our task-based inspection approach. Inspectors often aren't confined to specific tasks when evaluating an interface and detect problems in less-used parts of the site, parts that users rarely reach in a 30-60 minute study.

This exercise also illustrates a shortcoming of this approach for judging the effectiveness of heuristic evaluations. Just because an issue wasn't detected in a usability test doesn't mean users won't encounter it. In fact, one of the advantages of an expert review is that it uncovers issues that are harder to find in usability tests, because users rarely visit enough of a website or software application outside of the assigned task(s).

What's more, even 50 users represent less than 1% of the daily traffic on these websites, so it's presumptuous to assume that no users would have the issue. If we don't see an issue with 50 users, we can be 95% confident that between 0% and 6% of all users might still encounter it. For example, the two issues found in the heuristic evaluation but not in our usability test both seem like legitimate issues; with enough users, we'd probably eventually see them.
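
The 0% to 6% figure is consistent with the "rule of three" approximation for the 95% upper bound when zero of n users encounter an issue (3/n); an exact binomial interval gives a similar answer. Here is a minimal sketch in R, offered as a reconstruction rather than necessarily the exact calculation behind the figure above:

```r
n <- 50                        # participants, none of whom encountered the issue

# Rule-of-three approximation: 95% upper bound when 0 of n users hit an issue
3 / n                          # 0.06 -> about 6%

# Exact (Clopper-Pearson) 95% interval for 0 successes out of n trials
binom.test(0, n)$conf.int      # roughly 0% to 7%
```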
 

Conclusions

The most effective approach to uncovering usability problems is to combine heuristic evaluations and usability testing.

Heuristic evaluations will miss issues: even four evaluators failed to find an issue that 28% of users encountered in the usability test (a bizarre message on Enterprise telling users to pick another country when a search for a rental location returns no results).

Heuristic evaluations will uncover many issues: the four evaluators found all ten of the most common usability issues and most (75%) of the 20 most common issues.

Heuristic evaluations will typically identify between 30% and 50% of the problems found in a concurrent usability test, a finding also echoed by Law and Hvannberg.

It's hard to conclude that issues identified in a heuristic evaluation but not in a usability test are "false positives." It could be that the issues are encountered by fewer users and simply weren't detected with the sample tested. It's probably better to characterize them as less frequent or "long-tail" issues than as false positives.


So how effective are heuristic evaluations? While the question will and should continue to be debated and researched, I like to think of heuristic evaluations like sugary cereal. They provide a quick jolt of insight but should be part of a "balanced" breakfast of usability evaluation methods.

References

  1. Doubleday, Ann; Ryan, Michele; Springett, Mark; Sutcliffe, Alistair. "A Comparison of Usability Techniques for Evaluating Design". Centre for HCI Design, School of Informatics, Northampton Square, London. 1997. Pp. 104-108.

  2. Law, Lai-Chong and Hvannberg, Ebba Thora. "Complementarity and Convergence of Heuristic Evaluation and Usability Test: A Case Study of UNIVERSAL Brokerage Platform". NordiCHI. Aarhus, Denmark. October 19-23, 2002. Pp. 71-80.

  3. Law, Lai-Chong and Hvannberg, Ebba Thora. "Analysis of Strategies for Improving and Estimating the Effectiveness of Heuristic Evaluation". NordiCHI. Tampere, Finland. October 23-27, 2004. Pp. 241-250.

  4. Hvannberg, Ebba Thora; Law, Effie Lai-Chong; Larusdottir, Marta Kristin. "Heuristic Evaluation: Comparing Ways of Finding and Reporting Usability Problems". Interacting with Computers, Volume 19 (2007). Pp. 225-240. Elsevier B.V.


About Jeff Sauro

Jeff Sauro is the founding principal of Measuring Usability LLC, a company providing statistics and usability consulting to Fortune 1000 companies.
He is the author of over 20 journal articles and 4 books on statistics and the user experience.



Posted Comments

There are 6 Comments

June 13, 2014 | eeklipzz wrote:

Thanks so much for sharing your method. I really liked the bit about heuristic evaluations capturing 30-50% of issues. Always wondered. When I conduct a heuristic evaluation, I generally keep it light. All the while, I reiterate to the business that the intention is to identify potential usability focus points (good and bad), with the caveat that it's only a first step. To really gauge how a design is doing, there needs to be some formative testing done.

I recently documented my process. Here it is if you are interested.

Creating a UX Assessment
http://uxfindings.blogspot.com/2014/06/creating-ux-assessment.html 


April 10, 2013 | Charles Lambdin wrote:

Aren't you using the usability test as the "gold standard" here? I'm curious because I recently read Rolf Molich's latest report on his famous Comparative Usability Evaluation studies, which found that heuristic evaluation, if performed by experts, actually tends to outperform usability testing at identifying usability issues. He also found that, depending on the pre-specified tasks used, who moderates the usability test, and who analyzes the results, there is often very little overlap in the usability issues uncovered in different usability tests of the same interface. I'd be curious to hear your comments on this.


September 10, 2012 | Gena wrote:

Interesting post, Jeff. I have the following observation: two evaluators, E1 and E4, were able to discover 90% of the issues (20 out of the 22 discovered by all 4 evaluators). In this particular case adding two more evaluators had minimal additional value. Both individual evaluators E1 and E4 were able to identify 15 issues each (68% of all issues discovered). Not a bad number for a good individual evaluator.

Talking about the effectiveness of heuristic evaluations: the most effective approach would be to first conduct a quick expert evaluation, fix the issues, and then move on to usability testing. That way you get deeper insight.


September 7, 2012 | David Bishop wrote:

Thanks, Jeff. You had me worried there at 32%, but the "four evaluators did find all of the top ten most common usability issues" fact, I think, is the real reason we recommend HE.
QUESTION: Did you / can you look at the cost, to determine the relative cost-effectiveness of the two methods? Was the HE roughly 10% of the cost of the U-test, or closer to 50%, maybe?


September 7, 2012 | Eric wrote:

Would application of the KJ-technique (http://www.uie.com/articles/kj_technique/) help with lifting the HE and ER results? 


September 7, 2012 | Asbjørn wrote:

Thanks for a good post. Concerning the "false positive" issue, it could also be added that some problems identified in heuristic evaluations may be of a kind that will not be encountered in a usability test but still represent a potential improvement. For example, an inefficient task flow may not cause users to have problems, or even complain, in usability testing, but may easily be detected by an experienced inspector.

