
Nine misconceptions about statistics and usability

Jeff Sauro • March 7, 2012

There are many reasons why usability professionals don't use statistics and I've heard most of them.

Many of the reasons are based on misconceptions about what you can and can't do with statistics and the advantage they provide in reducing uncertainty and clarifying our recommendations.

Here are nine of the more common misconceptions.

Misconception 1: You need a large sample size to use statistics. 

Reality: Small sample sizes limit you to detecting large differences between designs or to discovering common usability problems. Fortunately, large differences and common problems are what matter most to users, especially in early design iterations.

For example, in an early design test of a new homepage for a large e-commerce website, I had the same 13 users attempt 11 tasks on a mockup of the existing homepage and on the new design. Four of the 11 tasks had statistically different task times (see Figure 1 below).


Figure 1: Difference in tasks times for 11 tasks on a new vs existing homepage design. Error bars are 80% confidence intervals.

They were statistically different because users were able to complete three of the tasks more quickly on the old design and one task more quickly on the new design. Even the tasks that showed no significant difference provided meaning: even with a major design change, the bulk of the tasks were completed in about the same amount of time.

Technical Note: Don't think you always need to use 95% confidence intervals or consider something significant only when the p-value is less than .05; that's a convention many publications use. For applied research, you should more evenly balance the priority you give to Type I and Type II errors. See Chapter 9 in Quantifying the User Experience.
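To make the technical note concrete, here is a minimal Python sketch of a small-sample, within-subjects comparison like the one behind Figure 1. The task times are hypothetical (they are not the study's data), and the log transform is just one common remedy for skewed time data, not necessarily the method used in the article.

```python
# A minimal sketch (not the article's analysis): paired comparison of task
# times from a small within-subjects test, using hypothetical data.
import numpy as np
from scipy import stats

# Hypothetical task times (seconds) for the same 13 users on one task,
# old design vs. new design.
old = np.array([62, 48, 95, 71, 55, 80, 66, 102, 58, 74, 69, 88, 77])
new = np.array([50, 41, 70, 66, 49, 61, 58, 85, 52, 60, 57, 72, 65])

# Task times are typically skewed, so compare the log of the times.
diff = np.log(old) - np.log(new)
n = len(diff)

t_stat, p = stats.ttest_rel(np.log(old), np.log(new))  # paired t-test

# 90% confidence interval around the mean log-difference, balancing
# Type I and Type II errors more evenly than the usual 95%.
ci = stats.t.interval(0.90, df=n - 1, loc=diff.mean(), scale=stats.sem(diff))

print(f"p = {p:.3f}")
# Exponentiating converts the log-difference back to a ratio of times.
print(f"90% CI for old/new time ratio: {np.exp(ci[0]):.2f} to {np.exp(ci[1]):.2f}")
```

Because each user attempts the task on both designs, the paired analysis removes user-to-user differences and makes the most of a sample of 13.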

Misconception 2:  Usability problems are obvious and don't need quantifying. 

If a user trips on a carpet, how many users do you really need to test, much less quantify? (Quote attributed to Jared Spool).

Reality: If usability problems were all as obvious as tripping on a carpet, then what value would trained usability professionals really add? The reality is that many usability problems aren't that obvious and don't affect all users. If one user says your carpet is uncomfortable and four others don't, do you replace it? What if, after testing 5, 10, or 15 users, no one trips on the carpet? What can you say about the probability someone will?

For example, the first of the 13 users we tested on the new homepage design had no idea what "Labs" meant (it referred to upcoming products). Should we change it to something else? Is this a carpet-trip issue or a carpet-uncomfortable issue? After testing all 13 users, only one other user had a problem with the term (2 out of 13 in total). Now should we change it? Understanding the prevalence of the issue helps make a more informed decision and is the subject of the next misconception.
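As a rough answer to the question above about the carpet no one trips on, here is a small Python sketch (not from the article) that computes an exact upper confidence bound on a problem's likelihood when zero of n users experience it.

```python
# A sketch of one way to answer "no one tripped, so what can we say?":
# an exact one-sided upper confidence bound on the problem's probability
# when 0 of n users experienced it.

def upper_bound_zero_events(n: int, confidence: float = 0.95) -> float:
    """Upper confidence bound on p when 0 of n users hit the problem.

    From the binomial: P(0 events in n trials) = (1 - p)^n.
    Setting (1 - p)^n = 1 - confidence and solving for p gives the bound.
    """
    return 1 - (1 - confidence) ** (1 / n)

for n in (5, 10, 15):
    print(f"n = {n:2d}: p is below {upper_bound_zero_events(n):.0%} with 95% confidence")
# With 15 users and no one tripping, we can still only say the problem
# affects fewer than about 18% of users (the quick "rule of three", 3/n,
# gives a similar 20%).
```

In other words, a clean run with a handful of users rules out only the most common problems; rarer ones can easily slip through.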

Misconception 3:  Statistics are only for summative (benchmark) testing, not formative (find and fix) testing.


Reality: Even if all you collect in formative evaluations is a list of usability problems, you can still estimate the prevalence of the problems you observed by providing confidence intervals.

For example, for the two out of 13 users who had a problem associating "Labs" with new and upcoming products, we can be 90% confident that between 4% and 38% of all users would also have some problem understanding the label. There might be a better term, but given the data, at most 38% of users would have difficulty making the association.
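This interval can be reproduced with a standard binomial confidence interval for a proportion. The article doesn't name its method, but the adjusted-Wald interval sketched below in Python gives the same 4% to 38% range for 2 out of 13 users at 90% confidence.

```python
# A sketch of a binomial confidence interval for problem prevalence.
# The adjusted-Wald interval below reproduces the 4%-38% figure above.
from math import sqrt
from scipy import stats

def adjusted_wald_ci(x: int, n: int, confidence: float = 0.90):
    """Adjusted-Wald (Agresti-Coull style) interval for a proportion."""
    z = stats.norm.ppf(1 - (1 - confidence) / 2)   # about 1.645 for 90%
    n_adj = n + z**2                               # adjusted sample size
    p_adj = (x + z**2 / 2) / n_adj                 # adjusted proportion
    margin = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)

low, high = adjusted_wald_ci(2, 13, 0.90)
print(f"90% CI for prevalence: {low:.0%} to {high:.0%}")  # about 4% to 38%
```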

Misconception 4: When users think aloud, the data is too variable.

Reality: Thinking aloud and measuring task time aren't mutually exclusive. You can measure time on task while users think aloud; just wait to probe on issues and ask questions between tasks. The research is mixed on whether thinking aloud actually increases or decreases task times. When possible, have the same users attempt the same tasks on comparable interfaces.

For example, in the test of the new homepage design, I had the same 13 users attempt the same 11 tasks while thinking aloud on a mockup of the existing homepage and on the new design (this is the data shown in Figure 1 above). The average task time was statistically faster on one task and statistically slower on three tasks (addressing misconceptions 1, 2, and 4). Even though the users were thinking aloud, they acted as their own control: loquacious users were talkative on both the old and new versions, and reticent users were quiet on both.

Misconception 5: You need to show all statistical calculations in your presentations.

 
Reality: It's always important to know your audience. Just because you use statistical calculations doesn't mean you need to bore or confuse your audience with detailed calculations and figures. Even though I advocate the use of statistics, that doesn't mean I start every conversation with z-scores.

Often, adding error bars to graphs or asterisks to means allows the audience to differentiate between sampling error and meaningful differences. If Consumer Reports and TV news can provide information about confidence intervals (usually called the margin of error), so can you.

For example, I presented Figure 1 in a presentation to illustrate the difference in task times. During the presentation a Vice President quipped:  "I can't believe you're showing me confidence intervals on a sample size of 13" (misconception #1). 

In response, I pointed out that even at this sample size we were seeing significant differences, some better and some worse, and that confidence intervals are actually more informative for small sample sizes. With large sample sizes the differences are often statistically significant, but the size of the difference tends to be modest and unnoticeable to users.

Misconception 6: Statistics don't tell you the "why." 

Reality: Statistics tell you the impact of the why. Showing that a usability problem leads to statistically longer task times, lower completion rates and lower satisfaction scores quantifies the severity of the problem and assists with prioritization.

The misconception is that somehow statistics replace descriptions of usability problems. It's not statistics OR qualitative problem descriptions, it's statistics AND qualitative problem descriptions.

Misconception 7: If there is a difference, it's obvious from just looking at the data.

Reality: Eye-balling bar charts without any indication of sampling error (shown using error bars) is risky. If the differences really are statistically significant, do the computation and show it.

For example, the graph below shows the difference in confidence ratings for two tasks from the homepage comparison test discussed above. The value shown is the average confidence rating for the new design minus the old design (so higher values favor the new design). Both tasks show higher average confidence on the new design than the old because the mean difference is greater than zero. But are these differences just a by-product of chance and the small sample size?


Figure 2: Difference in average confidence ratings for two tasks (New Design-Old Design). Higher numbers indicate more confidence on completing tasks for the new design.

The next graph shows the same tasks with 90% confidence intervals around the mean difference in confidence ratings. Only Task 1's error bars do not cross zero, showing that its higher confidence rating was statistically significant.


Figure 3: Same values as in Figure 2, now with 90% confidence intervals. The confidence interval for Task 1 doesn't cross zero, showing that users were statistically more confident on the new design.
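The zero-crossing check behind Figure 3 is easy to script. The ratings below are hypothetical (chosen only to mirror the pattern in the figure, with Task 1 significant and Task 2 not), but the logic is the same: compute the confidence interval for the mean paired difference and see whether it contains zero.

```python
# A sketch of the zero-crossing check, with hypothetical per-user confidence
# rating differences (new design minus old design) for 13 users.
import numpy as np
from scipy import stats

diff_task1 = np.array([1, 2, 0, 1, 1, 2, 0, 1, 1, 0, 2, 1, 1])    # hypothetical
diff_task2 = np.array([0, 1, -1, 2, 0, -1, 1, 0, 2, -2, 1, 0, 1])  # hypothetical

for name, d in (("Task 1", diff_task1), ("Task 2", diff_task2)):
    lo, hi = stats.t.interval(0.90, df=len(d) - 1, loc=d.mean(), scale=stats.sem(d))
    verdict = "significant" if lo > 0 or hi < 0 else "not significant"
    print(f"{name}: 90% CI {lo:.2f} to {hi:.2f} -> {verdict}")
```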

Misconception 8: Usability data isn't normally distributed.

Reality: Task time, rating scale data and completion rates, when graphed, don't follow a nice bell-shaped "normal" curve. However, the sampling distribution of the mean follows a normal distribution, even for small sample sizes in most situations. When there is a significant departure from normality, there are simple adjustments that make the calculations accurate at any sample size. See Chapter 3 in Quantifying the User Experience.
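A quick simulation illustrates the point. The "task times" below are simulated from a skewed lognormal distribution, yet the means of repeated 13-user samples come out far more symmetric, which is why mean-based confidence intervals hold up reasonably well even at small sample sizes.

```python
# A sketch of the sampling-distribution argument using simulated data:
# individual task times are skewed, but means of 13-user samples are
# much closer to symmetric (bell-shaped).
import numpy as np
from scipy.stats import skew

rng = np.random.default_rng(1)
population = rng.lognormal(mean=4.0, sigma=0.6, size=100_000)  # skewed "task times"

sample_means = np.array(
    [rng.choice(population, size=13).mean() for _ in range(5_000)]
)

print("Skew of raw times:     ", round(float(skew(population)), 2))
print("Skew of 13-user means: ", round(float(skew(sample_means)), 2))
# The second value is much closer to 0 (symmetric) than the first.
```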

Misconception 9: Using statistics costs more money.

Reality: The calculations don't cost money. Learning statistics, like learning how to conduct a usability test, does require a commitment to learn and apply. Ironically, this is also a common argument for not conducting usability testing: it costs too much.

There are free calculators, books and tutorials to get you started improving the rigor of your usability test.  The first lesson is that statistics and usability analysis are a natural fit for making a quantifiably better user experience.



About Jeff Sauro

Jeff Sauro is the founding principal of Measuring Usability LLC, a company providing statistics and usability consulting to Fortune 1000 companies.
He is the author of over 20 journal articles and 4 books on statistics and the user-experience.

Posted Comments

There are 8 Comments

March 21, 2012 | Daniel Ponech wrote:

Loved the article and will be passing the URI to colleagues. I will, however, take a qualified exception to "Misconception 1: You need a large sample size to use statistics -- Small sample sizes limit you to detecting large differences between designs or to discovering common usability problems. Fortunately, large differences and common problems are what matter to users most and especially in early design iterations." (And in so doing align, somewhat, with Gaël Laurans.)

I'm going to suggest the distinction here is not "You need a large sample size to use statistics", but, perhaps, "Small sample sizes can point to usability issues as well as large samples, but without the high overhead."

Statistically valid sample sizes are well established and a crucial part of scholarly research, but they typically fall outside the bounds of the business realm. Using sample sizes beneath scholarly thresholds does not mean results are invalid, only that they incorporate compromises that account for context and circumstance relevant to the discovery and application of insights gained from the research in question.

Less than 'statistically valid' doesn't mean 'less than valid' because small sample sizes still reveal viable results when it comes to interpreting user interactions with interfaces.

This being the case, seeking hard statistical validation for trends surfaced in smaller samples is unnecessary (and, I'd suggest, presenting those results in statistical terms is, well, sleazy).

I'm probably belabouring the point (okay, I admit it, I'm *totally* belabouring the point), but perhaps "Misconception 1" is about how 'statistically valid' cannot be the boundary of what is knowable and actionable.


March 21, 2012 | Carolyn Snyder wrote:

Great article, Jeff. I'll second what Jen said - you explain this stuff really well. I still don't use stats very often (misconception #2 is usually true for the projects I work on), but now I have more confidence about using confidence intervals :-). 


March 13, 2012 | Renate wrote:

A clear compilation of important points.


March 9, 2012 | Jen McGinn wrote:

Jeff, you're a brilliant presenter and clear and convincing writer. I agree with everything you say. But I'll tell you that for many of us it can be a challenge to convey the meaning and value of even traditional statistical findings, much less ones that you have to convince your audience of. I know you are sure and smart enough to defend CIs on small samples, but we're not all you. I think that much of this is good to have in your/my back pocket, but we need to be equipped to defend it, and that can be much harder than the calculations that went into it.


March 8, 2012 | John Clark wrote:

Good article. In my opinion, statistics in a usability context is all about spotting trends and trying to put some objective measurement into what might otherwise be considered as 'subjective insight'. For example, one could ask ten subjects what they thought about a particular design aspect and receive ten opinions. However, frame the question in terms of a discrete scale and you can then summarise across the sample set.

Shameless plug time (with my apologies) but this is an area I've tried to address in the Kupima system (http://kupima.com) by allowing test owners to specify questions in discrete forms (e.g. checkboxes, ratings lists, radio buttons) in addition to subjective forms (i.e. written responses). The big benefit is of course that it allows for quick summarised insight without too much interpretation.

In other words, it saves you time and therefore money.

I wrote about it here:
http://blog.kupima.com/new-feature-user-test-reports/

...which shows that it needn't be either expensive or time consuming to introduce discrete summarisation into remote user testing. 


March 7, 2012 | tim from intuitionHQ wrote:

Great article. I enjoyed the webinar. Looking forward to reading more articles from you.


March 7, 2012 | Jeff Sauro wrote:

Gael,

Good observation. I'm a big fan of dot-plots for showing the spread in data and I've included them for both times below.



You can see the higher variability for task two, but in this case it's really hard to tell if it's statistically different from 0.


March 7, 2012 | Gaël Laurans wrote:

Regarding misconception number 8, you're cheating a little by showing only the means. This type of graph is not very common in business contexts, but in many cases looking at the actual data (say with a dotplot) can be enough (of course, some representation of the confidence interval for the mean can be overlaid on it as well).

It most likely would have revealed that uncertainty is much bigger in task 2 because confidence scores would be all over the place. It would also provide more insight than a regular confidence interval on the shape of the distribution (outliers, asymmetry, etc.) 


