
Are Severe Problems Uncovered Earlier in Usability Tests?

Jeff Sauro • October 30, 2013

After I conducted my first usability test in the 1990s, I was struck by two things:
  1. just how many usability problems are uncovered and
  2. how some problems repeat after observing just a few users
In almost every usability test I've conducted since then, I've continued to see this pattern.

Even after running 5 to 10 users in a moderated study, there are usually too many problems for even the most dedicated and well-funded development team to address. Providing a prioritized list is an obvious and essential approach.

Problems can be prioritized by both how many users encountered the problem (frequency) and the impact the problem has on performance or other key metrics (severity).
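As one illustration, here's a minimal sketch of building such a prioritized list. The problems, counts, and the frequency-times-severity weighting are all hypothetical; it's one defensible way to combine the two dimensions, not a prescribed formula:

```python
# Hypothetical sketch: prioritize observed problems by combining
# frequency (how many users encountered the problem) and severity
# (1 = minor, 2 = moderate, 3 = critical).
problems = [
    ("Checkout button hard to find", 6, 3),
    ("Search results load slowly",   8, 2),
    ("Footer link label unclear",    2, 1),
]
total_users = 10

# Sort by users-affected x severity, highest priority first.
for desc, affected, severity in sorted(problems,
                                       key=lambda p: p[1] * p[2],
                                       reverse=True):
    print(f"{desc}: frequency = {affected / total_users:.0%}, "
          f"severity = {severity}")
```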

Some History

For about as long as the modern usability profession has, well, been a profession, an important question has been asked: If you only test with, say, 5 to 10 users, are you more likely to see the critical usability problems in those first few users? Put more directly, are problem frequency and problem severity correlated? Small sample sizes will uncover the more frequently occurring issues (this is demonstrated easily with probability), but if frequency and severity are correlated, then small sample sizes will also uncover the most severe issues.
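That probability claim is easy to verify: if a problem affects a proportion p of all users, the chance of seeing it at least once among n test users is 1 - (1 - p)^n. A quick sketch (the problem frequencies are illustrative):

```python
# Probability that a problem affecting a proportion p of users is seen
# at least once in a sample of n users: 1 - (1 - p)^n.
def p_detect(p: float, n: int) -> float:
    return 1 - (1 - p) ** n

for p in (0.10, 0.30, 0.50):  # illustrative problem frequencies
    print(f"p = {p:.0%}: 5 users -> {p_detect(p, 5):.0%}, "
          f"10 users -> {p_detect(p, 10):.0%}")
```

A problem affecting half of all users is almost certain (97%) to show up within the first 5 participants, while a problem affecting 10% of users will be missed more often than not.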
 
The relationship between problem frequency and severity has been the subject of an ongoing discussion in usability labs and, to a lesser extent, the literature. Most famously, Robert Virzi found [pdf] that more severe usability issues tended to happen more frequently across two usability studies reported in 1990 and 1992. He found a positive correlation between problem severity and frequency (r = .463). In other words, those first few users were likely to uncover the more critical issues.

A conclusion from these findings is that practitioners conducting usability tests would need fewer users to detect more severe problems.  In his studies, virtually all the problems rated high in severity were found with the first 5 participants! This is important as many lab-based usability studies today are still run with a small number of participants, typically between 4 and 10.

When attempting to replicate Virzi's findings, Jim Lewis did find support for the idea of using small sample sizes for uncovering problems, but he failed to find a similar relationship between the frequency of a problem and its severity. In 1994, Lewis examined [pdf] the usability data from 15 participants attempting tasks on productivity software (word processing, spreadsheets, calendar, and mail). The correlation he found between severity and frequency was not significant (r = .06).

Lewis recommended treating severity and frequency as independent. That is, a frequently occurring usability problem is just as likely to be of low severity as of high severity. Despite the obvious importance of this topic, the only other study we've found that has addressed this issue was one by Law and Hvannberg in 2004. Their results supported Lewis in finding no correlation. We decided to investigate this relationship with some of our datasets.

Method

We looked at usability problems from nine usability tests conducted on websites and mobile applications.  The tests included both in-person and remote moderated data on ecommerce websites, a sports merchandise website, an iPhone and iPad app from a cable provider, and an ecommerce website used on a tablet.

To help reduce the bias of knowing problem frequency before rating problem severity, we had multiple evaluators (between 2 and 4) rate the severity of the problems independently on a 3-point severity scale with defined levels (1 = minor, 2 = moderate, 3 = critical).

Problem severities were then aggregated and an average problem severity was generated. For example, a problem from one study was "Software screenshots appeared interactive," which received a moderate severity rating (2) from one evaluator and a minor (1) from another evaluator who did not observe the sessions. The average problem severity from these two evaluators was a 1.5.  The average problem severity for all problems was then correlated with the problem frequency for each of the problem sets.
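Here's a sketch of that aggregation step. The ratings and frequencies are made up, and a plain Pearson correlation is used for simplicity (the results below report polychoric correlations):

```python
# Sketch of the aggregation: average each problem's severity across
# evaluators, then correlate average severity with problem frequency.
# Data are invented; a plain Pearson r stands in for the polychoric
# correlations reported in the Results.
from statistics import correlation, mean  # requires Python 3.10+

severity_ratings = [  # one row per problem, one column per evaluator
    [2, 1],  # e.g., "Software screenshots appeared interactive" -> 1.5
    [3, 3],
    [1, 2],
    [2, 3],
]
frequency = [0.40, 0.20, 0.60, 0.30]  # proportion of users affected

avg_severity = [mean(ratings) for ratings in severity_ratings]
print(f"r = {correlation(avg_severity, frequency):.2f}")
```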

Results

The correlations for each study are shown below. For example, in Study 1, 75 issues were reported from observing 17 users. Four evaluators rated the severity of the issues, and the correlation between severity and frequency was r = .09. This correlation is both low and not statistically different from 0.
Study ID   Users   Evaluators   Issues   Correlation
    1        17        4          75         0.09
    2        12        3          37         0.25
    3         7        4          36         0.27
    4         7        2          24        -0.12
    5         5        2          25        -0.39
    6         5        2          32         0.22
    7         6        2          37        -0.38*
    8        20        2          29         0.47*
    9        20        2          36        -0.05
                                    Mean r = .056

Table 1: Polychoric correlations for the nine usability studies. * Indicates correlations statistically different from 0 at the p < .05 level. The Fisher transformation was applied to the correlations before averaging.

The correlations range from a low of r = -.39 to a high of r = .47, with an average correlation of r = .056. This average is not statistically different from 0. Of the two datasets with correlations significantly different from 0 (Studies 7 and 8), one showed a significant negative correlation! That is, for Study 7, less severe problems actually happened more frequently than more critical ones.
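For the curious, the averaging works like this: each correlation is converted to a Fisher z with arctanh, the z values are averaged, and the mean is converted back with tanh. A quick sketch using the rounded correlations from Table 1:

```python
# Average the nine study correlations via the Fisher z-transformation:
# z = arctanh(r), average the z values, then convert back with tanh.
import math

rs = [0.09, 0.25, 0.27, -0.12, -0.39, 0.22, -0.38, 0.47, -0.05]
mean_z = sum(math.atanh(r) for r in rs) / len(rs)
print(f"mean r = {math.tanh(mean_z):.3f}")
# Prints ~.042 from these rounded values; the reported .056
# presumably reflects the unrounded correlations.
```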

Study 8 had the highest positive correlation between frequency and severity, which is similar to, and not statistically different from, the correlation reported by Virzi twenty years ago (r = .463). One possible reason for the correlation is that the evaluators may have remembered some of the more frequently occurring issues when rating severity. In fact, in every study, one of the evaluators assigning severities WAS the facilitator for the test, so we should expect at least some influence and therefore some positive correlation.

What's more, all of the evaluators used in these studies work in our lab and therefore had some idea of which issues were more frequent, even if they didn't facilitate the studies.

To help mitigate the bias, we sent the 29 problems from Study 8 to an independent evaluator with decades of experience conducting usability tests. He was provided the same three-point rating scale and the same problem descriptions the two original evaluators received. His correlation between severity and frequency was positive, but smaller at r = .21, which was not statistically different from 0.
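As a rough check on claims like this, the significance of a correlation depends on the number of problems it's based on: for a Pearson-style correlation, t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom. This is a simplification here, since Table 1 reports polychoric correlations, but it shows why r = .21 on 29 problems doesn't reach significance:

```python
# Rough significance check: the t statistic for a correlation is
# t = r * sqrt(n - 2) / sqrt(1 - r^2) with n - 2 degrees of freedom.
# (A Pearson-style simplification; Table 1 uses polychoric correlations.)
import math

r, n = 0.21, 29  # the independent evaluator's r on Study 8's 29 problems
t = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)
print(f"t = {t:.2f} on {n - 2} df")
# t ~ 1.12, well short of the ~2.05 needed for p < .05 with 27 df,
# so this correlation is not statistically different from 0.
```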

Conclusions

The advantage of looking at multiple studies using different devices, facilitators, and evaluators is that we don't need to rely on a single study with its potential flaws and idiosyncrasies to draw a conclusion about the relationship between frequency and severity.  Here are some of the key takeaways:
  1. Frequency & severity aren't correlated: The analysis of these nine studies suggests there is as much evidence that more severe problems happen less frequently than trivial ones as there is that they happen more frequently.

  2. The first few users are NOT more likely to find the more critical issues: With little evidence supporting a correlation, those first five users are NOT more likely to uncover the more severe issues.

  3. Small sample size testing is still valuable: Just because the first few users won't be more likely to uncover more severe problems DOES NOT mean that testing with smaller sample sizes should be dismissed. Problem severity ratings are subjective, and the first few users will still uncover the most frequent issues (it's basically a mathematical tautology: high-frequency issues will be seen more of the time).

  4. Frequent issues often just appear more critical: When a problem affects a lot of users, even a trivial one, it just seems more critical, even if its impact on the experience is minimal. This co-mingling of the concepts affects our ability to accurately judge the correlation. While this is a problem for assessing the correlation between frequency and severity, it's probably not that harmful in practice.

  5. It's difficult to independently test severity and frequency: In actual practice it doesn't make sense NOT to have the facilitator of a usability test assign the severity ratings. Even the best-written problem descriptions are difficult to understand without context. To mitigate the bias, we have at least one additional evaluator rate the severity and then average the ratings.

  6. Don't be surprised by a correlation: Given the biases and co-mingling of frequency and severity, don't be surprised to see a positive correlation in your data.  We were surprised that most studies didn't show a positive correlation despite these biases.

  7. Problem severity ratings can be inconsistent: One of the reasons we recommend averaging ratings is that it's a difficult and subjective job to assign severity ratings. For one study we had the same evaluators re-assign severity ratings after a two-day delay. While these intra-rater ratings correlated reasonably highly (r ~ .5), the results still showed that there is inherent unreliability in this task. Future studies may examine whether more reliable rating methods change the correlation with problem frequency.



