Measuring Usability
Quantitative Usability, Statistics & Six Sigma by Jeff Sauro

Are Net Promoter Scores Normally Distributed?

Jeff Sauro • January 26, 2011

Responses to rating-scale data typically don't follow a normal distribution.

However, this is unlikely to affect the accuracy of statistical calculations because the distribution of error in the measurement is normally distributed.

Top-box scoring of rating-scale data can provide an easy way to summarize or segment your data in the absence of a benchmark or comparison test.

Another reason top-box scores are used with rating-scale data like the Net Promoter Score is that there is a concern that the data are not normally distributed and thus make statistical calculations inaccurate.

By reporting on just the frequencies for each response you avoid problems with assumptions about normality. Unfortunately, condensing 11 responses into 2 or 3 sacrifices important information about precision and variability.

There will always be value in segmenting responses into groups for concise reporting (especially to executives).  But when you want to determine whether your score has statistically improved, you'll want to use the mean and standard deviation because they provide more precision at smaller sample sizes. Doing so means that you need to consider the distribution of your data.

What It Means To Be Normal

Even if you know enough about statistics to be dangerous, you've probably heard the warning that you need to be sure your data are normally distributed.

Fortunately, you don't need to sit through a semester of statistics to understand the role of the normal distribution in analyzing rating-scale data like the question used to compute Net Promoter Scores.

A normal distribution (sometimes called Gaussian just to confuse people) refers to data that, when graphed, "distributes" in a symmetrical bell shape with the bulk of the values falling close to the middle.

Normal distributions can be found everywhere: height, weight and IQ scores form some of the more famous normal distributions. The chart below shows the distribution of the heights of 500 North American men.

You can see the characteristic bell shape. The bulk of values fall close to the average height of 5'10" (178 cm) and roughly the same proportion of men are taller or shorter than average.

 
Figure 1: Distribution of heights of 500 men from North America. The apostrophe (e.g. 5') means feet.

Net Promoter Data Don't Look Normal

The popular Net Promoter Score measures customer loyalty using the following question: "How likely are you to recommend a product to a friend?" with responses on an 11-point rating-scale.

Here is the graph of the 673 responses to the "likelihood to recommend" question for a consumer software product. The mean response is 8.4 with a standard deviation of 1.8.

 
Figure 2: Distribution of 673 responses to the "Likelihood to Recommend" question for a consumer software product.

The graph hardly looks like a bell and certainly isn't symmetric. It's no wonder researchers have concerns using common statistical techniques like confidence intervals, t-tests or even the mean and standard deviation. When they see non-normal data like this they run!

Why Normality is Important

Normality is important for two reasons:
  1. Statistical tests assume the error in our measurement is normally distributed.
  2. We can't speak accurately about the percentage of responses above and below the mean if our data are not normal.

Error in Measurement

By error in measurement I'm not talking about the kind that happens when someone misunderstands a question or miscodes the data from a survey.  I'm talking about the unsystematic kind that comes from any sample.

When we calculate the mean from a sample, it estimates the unknown population mean. It is almost surely off—over or under—by some amount.

The difference between our sample mean and population mean is called sampling error, and it forms its own distribution. We want this distribution to be normal. If our sample of data is normal, then the distribution of sample means (the sampling error) is also normal.

Unfortunately, almost all rating scale data is not normal, so we need to examine the distribution of sample means. But how can we know what this distribution of all sample means looks like if we have only one sample mean?

If we had a lot of time on our hands, we could randomly ask 30 people if they'd recommend the product to a friend. We'd find the mean, graph it, and then rinse and repeat a million times. Or we could simulate that exercise by taking a lot of smaller random samples from a larger sample of data and using a few lines of code. 

I chose the latter approach.

The Distribution of Sample Means

I took the large sample of 673 responses and wrote a short program which sampled random responses and computed the mean. I did this at sample sizes of 30, 10 and 5 and repeated it 1000 times for each sample size. The graphs of each distribution of sample means are shown below.

Figure: Distributions of 1000 sample means at sample sizes of n=30, n=10 and n=5.

The distributions of the 1000 means at sample sizes of 30 and 10 are bell-shaped, symmetrical and approximately normal.  The distribution at a sample size of 10 is a bit wider because smaller sample sizes have more variability.

At the sample size of 5, the distribution has less symmetry and a bit of a negative skew (a longer tail toward the lower scores).  We have evidence that our sampling error deviates from normality at this sample size.
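
The re-sampling exercise described above can be sketched in a few lines of Python. This is only an illustration, not the program used for the article: the 673 real responses are not reproduced here, so the sketch builds a hypothetical, similarly skewed set of 0 to 10 ratings, and it assumes sampling with replacement. The weights and the helper names sample_means and skewness are made up for the example.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

# HYPOTHETICAL stand-in for the 673 real responses (not the article's data):
# an 11-point scale (0-10) weighted heavily toward the high end.
scores = list(range(11))
weights = [1, 1, 1, 1, 2, 3, 5, 10, 20, 26, 30]
responses = random.choices(scores, weights=weights, k=673)


def sample_means(data, n, reps=1000):
    """Draw `reps` random samples of size n (with replacement); return each mean."""
    return [statistics.mean(random.choices(data, k=n)) for _ in range(reps)]


def skewness(values):
    """Moment-based skewness: 0 for a perfectly symmetric distribution."""
    m = statistics.mean(values)
    s = statistics.pstdev(values)
    return sum((v - m) ** 3 for v in values) / (len(values) * s ** 3)


results = {}
for n in (30, 10, 5):
    means = sample_means(responses, n)
    results[n] = (statistics.stdev(means), skewness(means))
    print(f"n={n:2d}  spread of means={results[n][0]:.2f}  skew={results[n][1]:+.2f}")
```

The spread of the sample means should widen as the sample size shrinks, and the skew of the means should sit much closer to zero than the skew of the raw responses. Plotting each list of means as a histogram gives charts of the kind described above.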

Technical Note: Some normality tests generate p-values. These tend to be overly sensitive to minor deviations from normality and are not recommended. Looking at the data in a normal probability plot (also called a Q-Q plot) provides the most reliable assessment of normality. I used histograms here since it is easier to recognize the famous bell-shape.
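
For readers who want to try a normal probability plot, here is a minimal sketch of how its points can be built with Python's standard library. The two data sets are hypothetical, and the helper names qq_points and straightness are invented for this example. The correlation is only a rough numeric summary of how straight the plotted line would be; it is not a formal normality test and does not replace looking at the plot.

```python
import random
import statistics

random.seed(2)
normal = statistics.NormalDist()  # standard normal: mean 0, sd 1


def qq_points(values):
    """Pair each sorted value with the normal quantile expected at its rank."""
    ordered = sorted(values)
    n = len(ordered)
    expected = [normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(expected, ordered))


def straightness(points):
    """Pearson correlation of the Q-Q points; near 1 means close to a straight line."""
    xs, ys = zip(*points)
    mx, my = statistics.mean(xs), statistics.mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# HYPOTHETICAL data: a bell-shaped sample and a strongly skewed one.
bell = [random.gauss(8.4, 0.6) for _ in range(1000)]
skewed = [10 - random.expovariate(0.6) for _ in range(1000)]

r_bell = straightness(qq_points(bell))
r_skewed = straightness(qq_points(skewed))
print(f"bell-shaped: r={r_bell:.3f}   skewed: r={r_skewed:.3f}")
```

Passing the (expected, observed) pairs to any charting tool produces the plot itself: normal data fall close to a straight line, while skewed data bend away from it.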

The Central Limit Theorem

What we're seeing in action is something called the Central Limit Theorem. It is arguably the most important concept in statistics. It basically says that the distribution of sample means approaches a normal distribution as the sample size grows, regardless of how ugly and non-normal your population data are. A sample size above 30 or so is the usual rule of thumb for when the approximation is good enough.

As we can see from my re-sampling exercise, the Central Limit Theorem often kicks in at sample sizes much smaller than 30 (the sample of 10 is basically normal).  Exactly how normal the data appear, and at what sample size, will depend on the data you have.
 
Fortunately, you don't have to code a software program to know if your sampling distribution is normal (like you needed another reason not to use statistics).

Even when sampling distributions are not normal for small sample sizes (less than 10), statistical tests like confidence intervals, t-tests and ANOVA still perform quite well. When they are inaccurate, the typical error is only a manageable 1% to 2% [See G. E. P. Box (1953), Non-normality and tests on variances. Biometrika, 40].

In other words, when you think you're computing a 95% confidence interval, it might be only a 94% confidence interval.
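
One way to see what "a 95% interval that is really a 94% interval" means is to simulate it. The sketch below is an illustration under stated assumptions, not a reproduction of Box's analysis: it uses a hypothetical skewed population of 0 to 10 ratings, builds nominal 95% t-intervals from many small samples, and counts how often they contain the population mean. How far the result falls short of 95% depends on how skewed the stand-in data are, so it need not match the 1% to 2% figure cited above.

```python
import random
import statistics

random.seed(3)

# HYPOTHETICAL skewed population on a 0-10 scale (not the article's data).
scores = list(range(11))
weights = [1, 1, 1, 1, 2, 3, 5, 10, 20, 26, 30]
population = random.choices(scores, weights=weights, k=673)
true_mean = statistics.mean(population)

# Two-sided 95% critical values of the t distribution for n - 1 degrees of freedom.
T_CRIT = {5: 2.776, 10: 2.262, 30: 2.045}


def coverage(n, reps=2000):
    """Share of nominal 95% t-intervals that actually contain the population mean."""
    hits = 0
    for _ in range(reps):
        sample = random.choices(population, k=n)
        m = statistics.mean(sample)
        margin = T_CRIT[n] * statistics.stdev(sample) / n ** 0.5
        hits += (m - margin) <= true_mean <= (m + margin)
    return hits / reps


actual = {n: coverage(n) for n in T_CRIT}
for n, share in actual.items():
    print(f"n={n:2d}  actual coverage of a nominal 95% interval: {share:.1%}")
```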

In short, for rating scale data from larger sample sizes (above 30), don't worry about normality. For smaller sample sizes (especially below 10) you will find a modest and manageable amount of error in most statistical calculations.

Population Distribution

While the shape of your sample data probably doesn't affect the accuracy of statistical tests, it does affect statements about what percent of the population scores fall above or below the average or other points. 

For example, consider a statement such as "We can be 95% sure half of all users rate their likelihood to recommend above the average score of 8.4."  Using the mean to generate statements like this assumes the data are symmetrical and roughly normal. We can see from the graph of the responses above that this is not the case. This is the same problem you run into with task-time data, which are also non-normal.

With rating scale data the solution is easy. If you want to make statements about the percent of users that score above a certain score, then just count the discrete responses.  For example, 362 of the 673 users (54%) provided scores of 9 or 10 (these users are classified as Promoters).  Using a binomial confidence interval we can be 95% confident that between 50% and 58% of all users are Promoters.
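
The arithmetic behind that interval can be sketched as follows. The counts come from the text; which binomial interval was actually used is not stated, so the adjusted-Wald (Agresti-Coull) method here is an assumption.

```python
import statistics

promoters, n = 362, 673  # counts reported in the text
z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96 for 95% confidence

# Adjusted-Wald (Agresti-Coull) interval: add z^2 / 2 successes and z^2 trials.
# ASSUMPTION: the article does not say which binomial interval was used.
n_adj = n + z ** 2
p_adj = (promoters + z ** 2 / 2) / n_adj
margin = z * (p_adj * (1 - p_adj) / n_adj) ** 0.5
low, high = p_adj - margin, p_adj + margin

print(f"observed: {promoters / n:.0%}")          # observed: 54%
print(f"95% interval: {low:.0%} to {high:.0%}")  # 95% interval: 50% to 58%
```

With a sample this large, the plain Wald interval gives the same rounded limits, so the choice of method makes little practical difference here.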

Another alternative is to transform the scores so that they follow a normal distribution. This is also the corrective procedure done when working with task-time data. With transformed data that are normally distributed even these percentage statements are accurate.

Normality Summary

In summary, normality should not be a concern for large sample sizes (above 30). For smaller sample sizes, the distribution of errors is probably normal or close to normal.  When the data do depart from normality, most statistical tests still generate reliable and accurate results.

Normality is a concern when making statements about percentages of the population that score above or below certain values. In such situations, using the response frequencies or transforming the data are appropriate alternatives.

While I've only shown examples of responses to the "likelihood to recommend" question, the concept applies to all rating-scale data (like the System Usability Scale or Single Ease Question).

My suggestion is to worry less about the normality of your data and worry more about the representativeness of it.  That is, be sure your sample is representative of the population you're making inferences about.

Whatever inaccuracies result from non-normal data are dwarfed by drawing the right conclusions about the wrong people. No statistical manipulations can account for an unrepresentative sample.

See the Crash Course in zScores if you want to learn more about the normal distribution or brush up on its critical role in statistics.


Editorial services courtesy of Marcia Riefer Johnston. See her "Word Power" blog.



Posted Comments

There are 7 Comments

December 2, 2011 | Jeff Sauro wrote:

Michael,

Great points. The issue of representativeness for companies is a common one. If a company cherry-picks their best or biggest customers and uses this as a proxy for all customer attitudes then you're totally right to be concerned.



In my experience, the problem is less blatant and obvious. It's usually an issue of response bias or when the questions are asked (after a purchase, after a support call, in-product, by a 3rd Party).



Even with the best sampling strategy, you'll see different Net Promoter Scores depending on when you ask it and in the end customers must choose to respond to your survey. Which is the real NPS? They all are wrong but informative. My recommendation to clients is to be aware of the differences but instead focus on relative improvements from the same source and sampling strategy. In fact, it's an advantage having the multiple views (especially ones taken from outside your company).



So don't compare customer support NPS (which is usually low) to post-purchase NPS (which is usually high). But do compare the NPS from Q1 for Support to Q2. Sure, it's likely a biased picture, but relative improvements (or the lack thereof) should be the focus.

 


December 2, 2011 | michael wrote:

Jeff...I loved your post. It got me thinking about this:
What kind of random sampling of customers does a company have to do to get an accurate NPS? In other words, if a company just surveyed their biggest customers, I would assume the NPS would be inaccurate because those customers would probably rate the company too highly.

What's your recommendation? Thanks!


March 7, 2011 | Rithesh M Mohan wrote:

Hi Jeff,

Any input?

Thanks,
Kevin 


February 25, 2011 | Kevin wrote:

Jeff,

Thanks for your input.

I totally agree and understand that applying the usual significance testing on NPS is not valid. So based on inputs from you, some statistical books and some experts, I have now decided to perform test for two proportions (Z test for proportions) separately on %promoters, % detractors or %passive (2009 vs 2010). If there is a significant change in either one of these we can assume (only assume) that the change in NPS is also significant.

I am keen on reporting the NPS or the %promoters, % detractors, %passive as we have been reporting NPS to the leadership team for quite some time now and I feel proportions are easy to understand and interpret.

Please let me know your thought on my approach.

Thanks,
Kevin
 


February 23, 2011 | Jeff Sauro wrote:

Kevin,
That's a good question. One of the problems with comparing Net Promoter scores is that through the process of degrading an 11-point scale into a 3-point scale you lose information. You can always compare the raw means from the responses by using a simple 2-sample t-test.

I suspect you want to compare the actual NPS %. Unfortunately the NPS isn't really a percentage bound by 0 and 1 and instead can go from -1 to 1 so its statistical properties need further investigation. This affects both the computation of confidence intervals and typical statistical tests like the ones you mentioned.

The easiest thing to do is compute a Chi-Square test of independence on the 6 categories. I computed it on your data and got a p-value of .0106. This suggests that there is a difference between the population NPS scores. However, a problem with the Chi-Square is that it only tells you whether there is a difference between the observed counts in each cell and the expected counts. It doesn't tell you what cells are unusually high or low.

An examination of the contribution to the Chi-Square suggests it is the number of detractors in each year that is higher than expected in 2008 and lower than expected in 2010.

I also conducted a test of 2 proportions on the net promoter score and proportion of promoters. This test uses the normal distribution as an approximation but provides good results for large sample sizes. The p-values are both significant (less than .01) suggesting there was an improvement in NPS year over year, beyond what you'd expect from chance fluctuations alone.

I performed the calculations using the Expanded statsPackage 
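
For readers who want to check the Chi-Square result without a statistics package, here is a minimal sketch in Python. It is not the statsPackage calculation: the counts are rebuilt from the rounded percentages and sample sizes in the table in Kevin's question, so they are approximate and some come out fractional. The variable names are made up for the example.

```python
import math

# Counts rebuilt from the rounded percentages and sample sizes in Kevin's table,
# so they are approximate (ASSUMPTION: the exact raw counts are not given).
n_by_year = {"2008": 240, "2010": 290}
shares = {
    "2008": {"detractors": 0.25, "passives": 0.25, "promoters": 0.50},
    "2010": {"detractors": 0.15, "passives": 0.25, "promoters": 0.60},
}
observed = {
    year: {group: share * n_by_year[year] for group, share in groups.items()}
    for year, groups in shares.items()
}

total = sum(n_by_year.values())
group_totals = {
    group: sum(observed[year][group] for year in observed)
    for group in shares["2008"]
}

# Chi-square statistic: sum of (observed - expected)^2 / expected over all 6 cells.
chi_square = 0.0
for year, groups in observed.items():
    for group, count in groups.items():
        expected = n_by_year[year] * group_totals[group] / total
        chi_square += (count - expected) ** 2 / expected

# With (2 - 1) * (3 - 1) = 2 degrees of freedom the p-value is exactly exp(-x / 2).
p_value = math.exp(-chi_square / 2)
print(f"chi-square = {chi_square:.2f}, p = {p_value:.4f}")
```

The p-value comes out close to the .0106 reported above. Nearly all of the statistic comes from the detractor and promoter cells, since the passive share is the same in both years.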


February 23, 2011 | Kevin wrote:

Hi Jeff,

Great job. This article is one of the best I found online on customer satisfaction data and normal distribution.

Your article certainly helped me clear up some of my doubts on non-normal distribution and customer satisfaction data.

I have a few more questions for you on significance testing for proportions (NPS).

I have 2009 and 2010 NPS (Net Promoter Score) survey results. Now I want to test if the change in 2010 NPS from 2009 is statistically significant or not.

Eg.
Year   Detractors   Passives   Promoters   NPS   n
2008   25%          25%        50%         25%   240
2010   15%          25%        60%         45%   290

Please suggest a test to prove that the change in 2009 to 2010 NPS is statistically significant or insignificant.
1>Per my understanding we cannot apply t or Z test directly on NPS as it is a computed or derived percentage. Can you please let me know other methods used to test the hypothesis on NPS?
2>Is it statistically valid to apply Z or t test on %Detractors, %passives or %promoters individually and check if the change in 2010 is statistically significant or not?

Thanks,
Kevin
 


January 27, 2011 | Steve Bernstein, Waypoint Group wrote:

Jeff,
Many thanks for the well-written educational information and background in statistics. This is especially important material for anyone that is managing a customer feedback program as it can help answer the very common question, "How many responses do I need?" This article certainly clears up some common misconceptions.

In addition, I hope your readers also don’t lose sight of the idea that a sample size of 1 (or even a full census) can be critically important. A single person’s opinion can make-or-break a major purchasing decision. I think we’d agree that depending on the follow-up activities, more responses in a customer feedback program can generally be better since feedback provides an opportunity to engage customers in new dialogs. In other words, it’s often less about the data and analysis and more about if/how improvement actions take place.

Well thought-out and executed analysis is critical. Organizations should make sure they analyze the data correctly in order to set the right priorities for major initiatives. They shouldn’t be making “big bets” on future investment if instead they could gain accurate insights that remove uncertainty. And that doesn’t prevent anyone from picking up the phone and speaking to an individual customer in pursuit of improving that relationship.

I wonder what percentage of customer survey programs actually take action on the results (the “voice of the customer”) to either improve individual or aggregate scores as a leading indicator of financial health…?

Thanks again.
/Steve 


About Jeff Sauro

Jeff Sauro is the founding principal of Measuring Usability LLC, a company providing statistics and usability consulting to Fortune 1000 companies.
He is the author of over 15 journal articles and 3 books on statistics and the user-experience.