Measuring Usability
Quantitative Usability, Statistics & Six Sigma by Jeff Sauro

Should you care if your rating scale data is interval or ordinal?

Jeff Sauro • June 1, 2011

It's fine to compute means and statistically analyze ordinal data from rating scales.

But just because one rating is twice as high as another does not mean users are really twice as satisfied.

When we use rating scales in surveys, we're translating intangible fuzzy attitudes about a topic into specific quantities. 

Overall, how satisfied are you with your cell-phone service?

Very Unsatisfied  1  2  3  4  5  Very Satisfied

What does a response of a 4 really mean? How much faith should we put into this number? 

Can we treat it the same as, say, the temperature outside or the time it takes a user to create an invoice in accounting software? Are all numbers created equal, or should we be more critical of how some numbers were obtained?

Levels of Measurement

For as long as we've been responding to these ubiquitous scales, there have been debates about what we can and cannot do with them. One of the earliest and most influential papers on how we should classify the numbers we analyze comes from S.S. Stevens. In 1946 he wrote the seminal paper "On the Theory of Scales of Measurement," in which he put forth four levels of measurement:

Nominal: Numbers that are really just labels like a zip-code, a phone number, a checking account number or numbers on football uniforms. This one is easy to remember because nominal sounds like name. 

Ordinal: Numbers that have an order, like a runner's finishing place in a race, the rank of a sports team, and the values you get from rating scales used in surveys or questionnaires like the Single Ease Question (SEQ).

We can't say how the differences between the numbers correspond to the differences in the attribute they represent, and we certainly can't say the differences are all equal. Some teams may be separated by 1 win whereas others may have a 10-game gap. We can't say for certain whether the difference in satisfaction between a 4 and a 5 is the same as that between a 3 and a 4 on a rating scale.

Interval: If we can establish equal distances between ordinal numbers, they become interval. The most common example is temperature in degrees Fahrenheit. The difference between 29 and 30 degrees on a thermometer is the same magnitude as the difference between 78 and 79 (I prefer the latter). Rating scales can be scaled to have equal intervals. For example, the Subjective Mental Effort Questionnaire (SMEQ) places each label at a numeric value. The distance between the numbers is equal, but the positions of the labels vary, reflecting how a large sample of people interpreted their meaning (originally in Dutch).


Ratio: Interval numbers that have a true or natural zero point are called ratio and represent the "highest" point in Stevens' hierarchy. Examples include task times, reaction times and temperature in kelvins. In all cases 0 means the absence of something (time or heat in these examples). There has been some work on using ratio scales with usability data.
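To make the interval/ratio distinction concrete, here is a minimal Python sketch. The temperatures and task times are made-up values chosen for illustration, and the helper names are hypothetical. It shows that differences survive a change of units on an interval scale, while "twice as much" statements only hold up when the scale has a true zero.

```python
# Hypothetical helpers and made-up values, for illustration only.

def c_to_f(celsius):
    """Convert degrees Celsius to degrees Fahrenheit."""
    return celsius * 9 / 5 + 32

def c_to_k(celsius):
    """Convert degrees Celsius to kelvins (a scale with a true zero)."""
    return celsius + 273.15

cold_c, warm_c = 10.0, 20.0

# Interval scales: the ratio depends on the arbitrary zero point.
ratio_c = warm_c / cold_c                    # 2.0 in Celsius
ratio_f = c_to_f(warm_c) / c_to_f(cold_c)    # 68 / 50 in Fahrenheit
ratio_k = c_to_k(warm_c) / c_to_k(cold_c)    # close to 1 in kelvins

# Ratio scales: task time has a true zero, so the ratio is unit-free.
fast_s, slow_s = 30.0, 60.0
ratio_seconds = slow_s / fast_s
ratio_minutes = (slow_s / 60) / (fast_s / 60)

print(f"Celsius ratio:    {ratio_c:.2f}")
print(f"Fahrenheit ratio: {ratio_f:.2f}")
print(f"Kelvin ratio:     {ratio_k:.3f}")
print(f"Task-time ratio:  {ratio_seconds:.2f} (seconds), {ratio_minutes:.2f} (minutes)")
```

The same day is "twice as warm" in Celsius but not in Fahrenheit or kelvins, whereas the slower task takes twice as long no matter which unit of time is used.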

Levels dictate the appropriate analysis

Stevens didn't create the classification just for taxonomic joy; instead he argued that only certain calculations are permissible with each level of data. In fact, he said that you can't add or subtract, much less compute a mean or standard deviation, on anything less than interval data.

This restriction is a problem for many academics and applied researchers because rating scale data is at the heart of marketing, usability and much of social science research. If we cannot use means and standard deviations, we also cannot use most statistical tests (which use means and standard deviations in their calculations). Even most non-parametric tests convert raw values to ranks (ordinal data) and then compute the mean or median.
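As a rough sketch of what "convert raw values to ranks" means, the Python snippet below assigns average ranks to a set of made-up ratings. Tied values share the mean of the ranks they occupy. The function name and the data are hypothetical, and this is only one common tie-handling convention, not the procedure of any particular test.

```python
def average_ranks(values):
    """Return 1-based ranks for values; ties get the average of their ranks."""
    # Indices of the values, ordered from smallest to largest value.
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to cover the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        # Positions i..j (0-based) share the average 1-based rank.
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

ratings = [3, 5, 4, 4, 2, 5]   # made-up responses on a 5-point scale
ranks = average_ranks(ratings)
print(ranks)                   # [2.0, 5.5, 3.5, 3.5, 1.0, 5.5]
print(sum(ranks) / len(ranks)) # mean rank, itself an arithmetic mean
```

Note that once the ranks exist, the next step is still arithmetic on ordinal numbers, which is the point made above.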

Almost immediately after Stevens' publication, counterarguments appeared that were critical of tethering statistical procedures to a number's classification. A potent and now famous rebuttal came from the eminent statistician Frederic Lord (who went on to help create things like the SAT at the Educational Testing Service).

I can illustrate Lord's criticism of Stevens' rigid classification with a simple example.

Here are 6 task times (ratio data): 

7,6,4,2,9,10

Here are 6 high temperatures in Celsius from a Northeastern US city (interval data):

7,6,4,2,9,10

Here are 6 responses to the Likelihood to Recommend Question (ordinal data):

7,6,4,2,9,10

Now here are 6 numbers that came from the back of football jerseys (nominal data):

7,6,4,2,9,10

Can you tell the difference?  The numbers don't know where they came from and in fact they are all the same. You can compute the mean (6.33) and standard deviation (3.01) and perform statistical calculations regardless of where the numbers came from.
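The calculation can be checked with a few lines of Python using the standard library's statistics module. This is a minimal sketch: the dictionary labels are just names for the four scenarios above, and the standard deviation is the sample (n - 1) version, which is what yields 3.01.

```python
import statistics

values = [7, 6, 4, 2, 9, 10]

# The same six numbers under four different "origins".
sources = {
    "task times (ratio)": values,
    "high temperatures (interval)": values,
    "likelihood to recommend (ordinal)": values,
    "jersey numbers (nominal)": values,
}

for label, data in sources.items():
    mean = statistics.mean(data)
    sd = statistics.stdev(data)  # sample standard deviation (n - 1)
    print(f"{label:<34} mean={mean:.2f}  sd={sd:.2f}")
```

Every row prints mean=6.33 and sd=3.01. The arithmetic is indifferent to the level of measurement; only the interpretation changes.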

Lord, in his satirical rebuttal, showed how you could take means of football numbers to show that sophomores were leaving lower numbers to the freshmen (a statistical analysis on nominal data).

Despite the criticisms, Stevens' classification system caught on and over 60 years later it is still taught in many introductory statistics courses. Some statistical packages still use Stevens' language in guiding the appropriate test to use (see the figure below).


 
Figure 1: A screen shot from SPSS showing vestiges of Stevens' hierarchy still in use.

You should care a little about where the numbers came from

You're likely to encounter strong opinions on this matter, but my recommendation is to go ahead and compute means, standard deviations and statistical tests on rating scales and other ordinal data. The large number of publications in the social sciences that use rating scales suggests I'm not alone, despite the somewhat antiquated warnings. The numbers don't know where they came from; however, this doesn't mean we should ignore where they came from when we interpret them.

If the mean satisfaction score was 2 on Product A and 4 on Product B, this doesn't necessarily mean that users are twice as satisfied with Product B. Four is definitely twice two, but equating the mean response to actual levels of satisfaction is risky unless you can show that the feeling of satisfaction behind a four is really twice that behind a two.

Beyond satisfying even the most stringent measurement theorist, there are other potential advantages to interval and ratio scales. A few years ago, Joe Dumas and I compared three single questions that assessed the perceived usability of a task. One was ordinal (the SEQ), one interval (the SMEQ) and one ratio (a version of Usability Magnitude Estimation, or UME).

We found that there was indeed some added benefit to the interval-scaled SMEQ, but it was only modest and may have simply come from providing users with more response options. The SEQ has only 7 response options, whereas the SMEQ has thousands.

In applied research we are in most cases interested in determining which product or design generates higher scores, whether these be on satisfaction, usability or loyalty. The magnitude of the difference is also important: a 2-point difference is likely more noticeable to users than a ¼-point difference. But even if you were to commit the error of saying that users were twice as satisfied with one product, you've almost surely identified the better of the two products, even if the actual difference in satisfaction is more modest.

About Jeff Sauro

Jeff Sauro is the founding principal of Measuring Usability LLC, a company providing statistics and usability consulting to Fortune 1000 companies.
He is the author of over 20 journal articles and 4 books on statistics and the user-experience.

Posted Comments

There are 8 Comments

April 27, 2014 | Glenn wrote:

It is just as easy to use a 5-level ordinal verbal (or number) scale, and then assign ratio scale values to the ordinal responses. Then you get both usability and accuracy. Using AHP to calculate ratio scale values for a 5-level scale you get (from highest to lowest) 0.503, 0.260, 0.134, 0.068, and 0.035. If you look at the ratio of level 4:2 (which is 2:1 in ordinal thinking), it is nearly 4:1 in ratio scale. That is some big distortion. 


March 28, 2014 | asong edwin wrote:

Well, I'm curious about Chris' question, but more worried about calculating the mean and standard deviation of data.


edwin 


November 18, 2013 | Jeff Sauro wrote:

Chris,

Yes, the confidence interval will provide you with the most plausible range of the population mean, which is 2.96 to 3.44. I'm not sure what the 15 confidence interval refers to, though.


November 17, 2013 | Daniel Mocciolo wrote:

I also am curious about the answer to Chris' comment.  


March 14, 2013 | KDM wrote:

But why settle at all when there are systems like AHP that can process customer responses on an interval scale?

"But even if you were to commit the error and say that users were twice as satisfied on one product you've almost surely identified the better of two products even if the actual difference in satisfaction is more modest."

I don't know that this is necessarily true, especially when presented with 3+ items of similar weight. It's a very large assumption that the ordinal ranking is good enough, and it still doesn't help you when you have to do further calculations (such as involving costs) with the number, just as Richard described. Every single time you do a further calculation with an averaged ordinal number, you're compounding the problem and decreasing your accuracy. By how much? You really don't know.

The other major weakness is that my 4/5 isn't the same as your 4/5. The whole situation is doing invalid math on a subjective scale.

Just save the uncertainty and use better methods, i.e., AHP. Ratio scale numbers, accurate magnitudes and surprisingly easy.


February 6, 2013 | Richard wrote:

Just what I've been looking for! I am dealing with Project Managers who use a nominal scale (1-5) to rank the Probability and Impact of project risks, multiply P*I to create a ranking for each risk, average the P*I rankings of all risks in each of 19 categories, then multiply the average of each category by a weighted percentage for the category. These results are then added to the risk management scores of other projects to perform portfolio management decisions. Yeah!


July 29, 2012 | Katie wrote:

I'm curious about the answer to Chris' comment...I had a similar question :) 


March 2, 2012 | Chris Collingridge wrote:

By far the clearest explanation I've read on this topic - thank you! One follow-up question though: do confidence intervals apply when you calculate means of ordinal data (e.g. from a Likert scale question)? And if so, how?

Example:
I have 30 responses from a population of a couple of thousand, and the mean rating on a 5 point scale is 3.2. Does this mean I can be 90% confident that the true mean is between 2.96 and 3.44 (15 confidence interval)? Or is that the worst kind of ignorant abuse of statistics?

Chris. 

