by Jeff Sauro | March 8, 2004
Summary
The discerning usability analyst should employ a mix of qualitative and quantitative methods when discovering usability problems. Relying heavily on a qualitative approach can lead to a severe misdiagnosis, especially when usability problems are difficult to detect. This article is a response to Nielsen’s "The Risk of Quantitative Studies" and shows how the problems voters had with the “butterfly ballot” in the Florida 2000 election would not have been detected with popular discounted qualitative methods. The problems with relying on one-size-fits-all usability guidelines such as “testing with only five users” and the inherent bias of pay-for-hire gurus are also discussed.
Introduction
In his most recent article, “The Risk of Quantitative Studies” [8], Jakob Nielsen presents some valid points on the limits of quantitative methods, yet his points are lost in a sea of bombastic exaggerations and over-generalizations. Jakob essentially warns against unnecessary “number-fetishism” via half-baked quantitative methods and against “fishing for a significant p value.” Yet risks are inherent to every method of usability engineering. There are as many risks (more, I contend) in relying solely or even heavily on qualitative methods and dismissing the importance of quantitative analysis in usability testing. The emphases in Nielsen’s article are clear:
Qualitative Studies: Even More Intrinsic Risks
Testing hundreds of users is time consuming and expensive. Doing so is not required to use quantitative methods or statistics. Instead, statistics are used to understand and manage your uncertainty about a problem. A research hypothesis should be clearly defined and the appropriate methodology should be used to test the hypothesis. The careful analyst should always be aware of the limitations of their data, the “observer effect” and other unknown factors. Take the example given by Jakob [8] about the "Butterfly-Ballot" problem:

The "butterfly ballot" in the 2000 election in Florida is a good example: a study of 100 voters would not have included a statistically significant number of people who intended to vote for Al Gore but instead punched the hole for Patrick Buchanan, because less than 1% of voters made this mistake. A qualitative study, on the other hand, would likely have revealed some voters saying something like, "Okay, I want to vote for Gore, so I'm punching the second hole ... oh, wait, it looks like Buchanan's arrow points to that hole. I have to go down one for Gore's hole." Hesitations and almost-errors are gold to the observant study facilitator, but to translate them into design recommendations requires a qualitative analysis that pairs observations with interpretive knowledge of usability principles.

Let’s hypothetically use Jakob’s recommendation of five users [7] to test the ballot (he really only recommends 3-4 for testing [1], but we’ll use 5). To test this ballot we would have watched five users attempt to cast a vote for their candidate. Also assume that we split our sample of five into two users who intended to vote for Gore, two for Bush and one for Buchanan. Because we’ve sub-grouped to be thorough, we probably need 3-4 users for each intended vote, since the two votes for Bush further reduce our chances of detecting the problem with Gore/Buchanan voters. To keep things simple, though, we’ll stay with a total of five users.
Studies show that most users don't make mistakes when confronted with bad interfaces, they just slow way down. (The Florida results bear this out. One in ten people have trouble with the ballot, but only one in one hundred end up making an error.) [14]

I’m not sure what “studies” Tog is referring to, but I agree with the implication: you need to take measurements of task time! In other words, quantitative methods in this case will provide information qualitative methods cannot.
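The arithmetic behind the five-user scenario can be sketched in a few lines of Python. This is a rough illustration, not part of either article: it assumes each participant independently runs into the problem with the same probability, takes the rates from the quote above (one in one hundred make the error, one in ten have trouble), and uses hypothetical function names. The 85% detection target is likewise an assumption chosen only for illustration.

```python
import math


def chance_of_seeing_problem(p, n):
    """Probability that at least one of n participants hits a problem
    that affects a proportion p of users (assumes independence)."""
    return 1 - (1 - p) ** n


def users_needed(p, target):
    """Smallest sample size giving at least `target` probability of
    observing the problem once (same independence assumption)."""
    return math.ceil(math.log(1 - target) / math.log(1 - p))


# Rates taken from the quote above; the sample size is the five-user guideline.
error_rate = 0.01    # voters who actually punched the wrong hole
trouble_rate = 0.10  # voters who had trouble with the ballot

print(round(chance_of_seeing_problem(error_rate, 5), 4))
print(round(chance_of_seeing_problem(trouble_rate, 5), 4))
print(users_needed(error_rate, 0.85), users_needed(trouble_rate, 0.85))
```

Under these assumptions, five users give roughly a 5% chance of watching even one actual error and roughly a 41% chance of watching one participant have trouble. The gap between those two figures is the reason measuring hesitation and task time matters.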
The Pay-for-Hire Guru Bias
Every researcher and analyst has a bias, and the prudent reader should identify and understand the implications of that bias when reading any research report (including this one). A drug study sponsored by a drug company should raise a red flag, as should a report on online advertising sponsored by an advertising agency, as Jakob rightly points out [8]. The publication bias of “highlight[ing] new and interesting stories” [8] is real and should not be dismissed. However, such a bias can be mitigated over time as other researchers attempt to replicate results or uncover flaws in the methodology. Time is the final arbiter of veracity. More research should always be encouraged, not discouraged.
One-Size Doesn’t Fit All in Usability
The Florida Ballot example, among many others, is reason to be cautious about one-size-fits-all usability testing guidelines such as “You only need to test with five users.” [7]

Today, almost everyone who does user testing has concluded that they learn most of what they'll ever learn with about five users. [8]

I’m not sure who Jakob is talking to, but since the publication of his “Curve of Optimism” [6,7] there have been several articles pointing out significant limitations of his formula for calculating the number of users you need to test. Most notable are:
Different Articles, Conflicting Views
Jakob’s valuable contribution to the field of HCI is unquestionable, but his Alertbox articles of late are sending out mixed messages. Just a few months prior, I whole-heartedly agreed with Jakob’s article on the value of using Six Sigma Quality Assurance methods in usability engineering. [11] His recent article seems to undercut that suggestion: “quantitative studies are often too narrow to be useful and are sometimes directly misleading.” [8] In advocating Six Sigma, Jakob advises: “We'd be wise to adapt some of the Six Sigma methodologies to aid our quest for improved Web quality.” [11] Six Sigma’s main tenet is that if you don’t measure something you truly don’t know it. In short, quantitative methods allow you to make usability improvements unattainable through qualitative methods. This concept is made very clear on the link Jakob provides to find more information on Six Sigma:

Six Sigma is a rigorous and disciplined methodology that uses data and statistical analysis to measure and improve a company's operational performance by identifying and eliminating "defects" in manufacturing and service-related processes.
From isixsigma
I usually advocate qualitative usability studies, because usability's main goal is to drive the design. For formal quality assurance, however, you must run quantitative studies to collect hard numbers that show how well or poorly your design scores on the usability criteria you defined above. [11]

I’ll give Jakob the benefit of the doubt and assume he’s somehow making a distinction between Quality Assurance Usability Testing and the rest of Usability Testing. To me, there is no such distinction, as his Butterfly Ballot example shows. The diligent analyst should have approached the Florida Butterfly Ballot Usability Test like any usability test and had at hand all the quantitative and qualitative methods that provide for a thorough understanding of potential problems. When done properly, a usability test should provide both qualitative problem descriptions and quantitative measures such as frequency of occurrence, impact on task completion rates and task time.
Bad Quantitative = Bad Qualitative
In a side-bar clarification to The Risk of Quantitative Studies, Jakob goes further to explain the problem of statistical analysis in “Probability Theory and Fishing for Significance” [9] by using an example most of us encountered in an undergraduate statistics class.

This is why it is not valid research to conduct a study, collect lots of data about lots of variables, and then claim significance because some of the variables seem to correlate. Doing so is exactly the same as tossing lots of quarters, then reporting on the few coins that had an unusual outcome. [9]

I’m not sure of Jakob’s motivation for the article [8], but this example presents a dangerous rationale. It implies that usability practitioners who use statistics somehow cannot tell the difference between a coin-toss fluke and relevant user behavior. One should always be aware of the limitations of research methods (confounding effects, covariates and normality assumptions, to name a few), but these limitations should not prevent the methods from being used. Was Nielsen tossing quarters when he reported that “There is a strong positive association between users’ average task performance and their average subjective satisfaction…” in his 1994 article “Measuring Usability: Preference vs. Performance” [12], one of many papers relying heavily on analyzing “a bunch of variables and looking for a correlation”? I certainly don’t think so.
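The coin-tossing concern can be put in numbers, which also shows that it is manageable. The Python sketch below assumes independent tests, each run at a 5% significance level, and uses hypothetical function names. The Bonferroni adjustment it includes is one common correction, swapped in here purely for illustration; neither article prescribes it.

```python
def chance_of_a_fluke(num_tests, alpha=0.05):
    """Probability of at least one spurious 'significant' result when
    num_tests independent tests are each run at level alpha and no
    real effect exists."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_alpha(num_tests, alpha=0.05):
    """Per-test threshold that keeps the overall chance of a fluke at or
    below alpha (Bonferroni adjustment, an illustrative choice)."""
    return alpha / num_tests


for k in (1, 5, 20):
    uncorrected = chance_of_a_fluke(k)
    corrected = chance_of_a_fluke(k, bonferroni_alpha(k))
    print(k, round(uncorrected, 3), round(corrected, 3))
```

With 20 unplanned comparisons the chance of at least one fluke is about 64%, which is the risk the quarter-tossing example describes. Stating hypotheses in advance, or tightening the per-test threshold, brings that back under 5%.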
Discount Usability’s Cost, not its Methods
As more companies understand the importance of User Centered Design methods and use them as part of their product development, the easily detected low-hanging fruit identified through “expert reviews” and other discount methods will provide less and less usability value. Usability practitioners will need to continue to refine their skills and understand the importance of quantitative assessments, something you can’t teach a product manager in a “Three-Day Usability Boot-Camp.” The only thing about usability testing that should be discounted is the cost, not the depth of analysis. Discounting your methods by relying on qualitative methods when a thorough quantitative analysis is warranted will result in discounted results and a discounting of your reputation. In time the discounted methods will clear themselves “off the HCI's shelves.” [19]
References
Copyright © 2004-2010 Measuring Usability LLC
