Analysing Likert scale satisfaction scores

Introduction

Although there are many possible ways of scoring a response to a question, the Likert scale has recently become a default option. In particular, the national student satisfaction survey (NSS) is based on questions socred on the Likert scale. NSS scores are used in league tables, and are influential metrics that affect decision making within universities. The relative performance of courses based on these scores can impact strategic planning. It is therefore vital to use rigorous statistical methods when analysing NSS scores to ensure that the evidence they provide is neither misinterpreted nor over interpreted.

NSS scores are based on the following responses to questions such as “Overall, I am satisfied with the quality of the course.”

Definitely disagree
Mostly disagree
Neither agree nor disagree
Mostly agree
Definitely agree

A Google search for the terms “Likert scale”" and “statistical analysis” produces a large number of documents of varying academic reliability. Many are not particularly helpful. Some are misleading and provide unsatisfactory advice regarding appropriate analytical techniques. I will base my suggestions on consideration of very well known and uncontroversial statistical principals rather than citing any “conventional” method that has been specifically suggested for the Likert scale, due to number of inapppriate analytical teechniques that have been applied to this type of data.

Suggestions for finding the right “test” for a null hypothesis of no difference in the responses between courses are not generally helpful. The term “significant” can be misinterpreted by decision makers. It is particularly important to base findings not just on a p-value but to use reliable and interpretable measures of effect size along with statistically justifiable confidence intervals. It is also vital to provide a clear visualisation of results that can be quickly scanned and interpreted.

Before considering the statistical nature of the data it is worth considering the potential for unseen biases in the scoring. If an online form presents the scale as a series of checkboxes to choose between, the response could depend on the order in which the choices are listed. If a slider is used on an online form it may provoke different responses than woud be chosen using check boxes. The wording in which the question is framed may lead to aquiescence bias or social desirability bias. Some forms of wording may deter respondents from choosing extreme responses while others may encourage them. These are all psychological aspects that should be taken into account when the survey is designed.

Assuming that the designers of the survey are aware of these issues, the problem I address concerns the mumerical part of the data analysis.

Establishing differences in the response to a single question between courses involves a simple bivariate analysis with one numerical variabe (score) and one factor (course) which has as many levels as the number of courses. However despite the apparent simplcity, there are several complications.

While a mean can be calculated for any set of numbers, most of the advice found online regarding the Likert scale quite rightly points out that a mean Likert score is difficult to interpret. The responses are not on a simple linear scale. The nature of the scale prevents the calculation of a valid standard deviation. Classical parametric methods based on assumption of normality are very clearly not appropriate when analysing responses to any single question. If the responses to many individual questions are pooled it has been suggested that the central limit theorem kicks in and the resulting pooled scores can be treated as if they follow a Gaussian distribution. However this is not very good advice either, as the assumption is being made that there is independence in the responses. This is rarely justifiable in the case of satisfaction scores. Students who are satisfied regarding one aspect of the course will also tend to be satisfied about other aspects, violating the assumption of independence.

Other suggestions for null hypothesis testing include using non-parametric procedures such as Kruskall Wallis tests. However this does not test a very meaningful hypothesis nor does it provide a basis for comparing effect sizes.

As a result of these problems the conventional approach to analysis that is widely used involves splitting the data an looking at the proportion of responses that fall above or below some cut off point. This leads to discusions of the proportion of students who are satisfied with a course, sometimes without using any statistical analysis at all.

I will illustrate these issues by setting up some very artificial data to show the consequences of simplistic analyses, before going on to suggesting more informative and statistically robust approaches.

The “marmite effect”

One of the difficult aspects of the Likert scale is the potential for confounding a set of neutral responses with a set of strongly held likes and dislikes. I’ll refer to this as the “marmite effect”. In the case of student satisfaction scores, courses with more challenging material often provoke more extreme “marmite” like responses than courses with blander material. Some forms of analysis may not be able to distinguish between these pattern of results. Others may overemphasise the marmite effect.

To illustrate the issue I’ll simulate some data with the same mean for the response scores. Course 1 is the marmite course. The simulation has set an equal probability of a respondent strongly agreeing or strongly disagreeing, with nothing in between. Course two never provokes any strong responses at all. Course three has a flat, equal probability of choosing any of the five responses. All are centred around a score of three on the Likert scale.

set.seed(3)
library(ggplot2)
n<-50
x1<-sample(c(1,5), n, replace =TRUE)
x2<-sample(c(2,3,4), n, replace =TRUE)
x3<-sample(c(1,2,3,4,5), n, replace =TRUE)
x<-c(x1,x2,x3)
q<-rep(c("Course_1","Course_2","Course_3"),each=n)
d<-data.frame(q,x)
g0<-ggplot(d,aes(x=x))
g0+geom_bar()+facet_wrap("q")+xlab("Likert score")

d %>% group_by(q) %>%summarise(mean=mean(x),median=median(x)) ->dd
kable(dd)

q	mean	median
Course_1	2.84	1
Course_2	3.04	3
Course_3	3.00	3

While such an extreme pattern will never occur in practice, the simulation illustrates the issue. The pattern of responses are clearly quite different, but all have similar mean scores. Statistical tests of central tendency will therefore not distinguish between the courses. There is an argument that this could be a desirable result as it avoids any prejudice against “marmite” courses. However note that the median for the marmite course is one, i.e strongly disagree, even though the probability of choosing either of the extremes was set to be identical. “By chance” over 50% of the answers happened to be stongly disagree rather than strongly agree. The next time a sample is drawn the result may be reversed. So if no statistical analysis was used at all it could be stated that “a majority of the students on the course were very disatisfied”, but next time the course is run the “majority”" could be very satisfied. This sort of effect can also occur when respondents are forced into making a binary choice on a nuanced question. such as the EU referendum.

It might be argued that a fair analysis should completely ignore any marmite effect and just score the three courses equally. However an even better analysis would quantify the effect fairly and transparently while taking into account the way random variability influences any conclusions.

Anova

A naive but totally statistically invalid test for differences between the courses would be to use one way analysis of variance. It is easy to run, although it is the wrong analysis to use.

mod<-lm(x~q)
anova(mod)

## Analysis of Variance Table
## 
## Response: x
##            Df Sum Sq Mean Sq F value Pr(>F)
## q           2   1.12  0.5600  0.2475 0.7811
## Residuals 147 332.64  2.2629

The ANOVA shows no significant differences between the mean scores for the three courses. While it might be argued that this is being fair on the marmite course, it is also very uninformative and misses the obvious difference. It also absolutely unjustifiable statistically.

Although an analysis based on the mean score is rather difficult to interpret, it is still possible to find robust confidence intervals for the mean through bootstrapping rather than using an invalid anova model based on assumptions of normality. The general idea of analysing differences in means as a measure of central tendency should not be totally ruled out as a complementary option in combination with other forms of analysis. We will return to this later.

Kruskall-Wallis test

The Kuskall-Wallis test is often advised for Likert scale data. As it is based on ranks it is technically valid. It is tempting to apply it as there is no strong mathematical objection. The null being tested is that the location parameters of the distribution of the scores are the same in each group. This is sometimes assumed to be a test that the median’s are not significantly different, but this is not quite the correct interpretation. This also shows no significant difference between the three courses. However it is very hard to interpret and doesn’t lead to any clear measure of effect size.

kruskal.test(d$x~d$q)

## 
##  Kruskal-Wallis rank sum test
## 
## data:  d$x by d$q
## Kruskal-Wallis chi-squared = 0.5704, df = 2, p-value = 0.7519

The overall results for each question are indeed centred in the same position so the test correctly fails to reject the null. However it is clear that the are differences in the pattern of responses that are not being picked up by this valid, but uninformative procedure.

Chi-squared test

One way to deal with the “marmite”" issue is to simplify the data into binary classes and look at the number of responses falling into each class. This is in fact the commonest approach taken when analysing NSS scores. The measure typically used is the proportion of students who are satisfied i.e. giving a score above 3.

d$x1<-1*(d$x>3) ## Take only sores above 3 and convert to ones and zeros
tb<-table(d$x1,d$q) ## Tabulate
round(prop.table(tb,margin=2)*100,1)

##    
##     Course_1 Course_2 Course_3
##   0       54       58       60
##   1       46       42       40

The data can be quite validly analysed using a chi-squared test, although there is a much better way that I will show later

chisq.test(tb)

## 
##  Pearson's Chi-squared test
## 
## data:  tb
## X-squared = 0.38154, df = 2, p-value = 0.8263

This again, quite correctly, shows no significant difference between the reponses when scores over 3 are scored as ones and scores below are scored as zeros.

Chi-squared could be used to look at stronger feelings towards the course by changing the splitting rule.

d$x1<-1*(d$x>4) ## Score extremely satisfied as ones
tb<-table(d$x1,d$q)
round(prop.table(tb,margin=2)*100,1)

##    
##     Course_1 Course_2 Course_3
##   0       54      100       82
##   1       46        0       18

chisq.test(tb)

## 
##  Pearson's Chi-squared test
## 
## data:  tb
## X-squared = 32.018, df = 2, p-value = 1.115e-07

Now there are very clear differences between courses. The test is highly signficant and it could be re-run using other splitting criteria.

So, ANOVA is statistically invalid, Kruskal-Wallis is hard to interpret and doesn’t lead into the calculation of confidence intervals. An approach based on a binary split is capable of either clearly distinguishing between marmite courses or ignoring the marmite effect, depending on how the data are split. It does provide a fairly clear, interpretable score, measured in terms of “proportion of satisfied students” or “proportion of students holding strong feelings regarding their course”. To be most informative the binary split based approach does need to be run more than once. The Chisq-test establishes statistical signficance, but is not particularly useful when used in this form.

So, what is the general solution?

Varying sample sizes

An even more important issue that the marmite effect that must be taken into account in any analysis is the influence of sample size on the overall score. Courses with low numbers of students are much more likely to have either very high or very low scores as result of chance. Statistical methods are specifically designed to handle the sample size issue, but the correct analytical method must be chosen.

Splitting the data into just two classes leads to a situation in which classic binomial theory may be used to calculate confidence intervals which do take into account sample size.

Let’s make up some data again to show the issue more clearly. The simulation sets exactly the same probability of a student providing any one of the five responses on all three courses. However the number of students on the courses varies between 5 and 100.

n1<-5
n2<-20
n3<-100
n<-n1+n2+n3

course<-c(rep("Course_1",n1), rep("Course_2",n2), rep("Course_3",n3))
nss<-sample(c(1,2,3,4,5), n, replace =TRUE)
d<-data.frame(course,nss)
d$satisfied<-(d$nss>3)*1

d %>% group_by(course) %>% summarise(n=n(),satisfied=sum(satisfied),percent=(100*satisfied/n)) ->dd
datatable(dd,autoHideNavigation=TRUE)

Note that the data were simulated with exactly the same probabilities. However the percent satisfied on the course with only five students appears to be much higher than the on the other two courses. This is the typical effect of small sample sizes. More extreme results are more likely when the sample size is low.

tb<-table(d$course,d$satisfied)
chisq.test(tb)

## 
##  Pearson's Chi-squared test
## 
## data:  tb
## X-squared = 1.0417, df = 2, p-value = 0.594

So. the Chi-squared test confirms the lack of any significant differences. However we haven’t measured the effect size yet, nor extracted confidence intervals.

One possibility is to use logistic regression.

mod<-glm(data=d,satisfied~course,family="binomial")
anova(mod,test="Chisq")

## Analysis of Deviance Table
## 
## Model: binomial, link: logit
## 
## Response: satisfied
## 
## Terms added sequentially (first to last)
## 
## 
##        Df Deviance Resid. Df Resid. Dev Pr(>Chi)
## NULL                     124     168.25         
## course  2   1.0226       122     167.23   0.5997

The overall p-value is almost identical to the ch-squared test, as it should be. The coefficients of this model could be back transformed and then combined to form estimates with confidence intervals. However there is an even simpler way.

Using the Binomial test to extract confidence intervals

The binomial test in R by default tests the null hypothesis that the true proportion is not equal to 0.5. This can be set to the proportion over all courses in order to test whether any course is “significantly different” from the general overall satisfaction level across all the courses. The binomial test alse provides confidence intervals through the Clopper and Pearson procedure. This guarantees that the confidence level is at least the confidence level, but does not necessarily give the shortest-length confidence interval.

prop<-sum(d$satisfied)/length(d$satisfied)
binom.test(3,5,prop)

## 
##  Exact binomial test
## 
## data:  3 and 5
## number of successes = 3, number of trials = 5, p-value = 0.3952
## alternative hypothesis: true probability of success is not equal to 0.4
## 95 percent confidence interval:
##  0.1466328 0.9472550
## sample estimates:
## probability of success 
##                    0.6

We can now use the binomial test to build a function that takes the original vector of Likert scores and returns the percentage satisfied, with upper and lower bounds for a 95% confidence interval along with a p value for the significance of differences from the baseline value.

satisfied_ci<-function(x,score=3,baseline=0.4){
  satisfied<-(x>score)*1 
  n<-length(x)
  p<-sum(satisfied)
  mid<-round(p/n*100,0)
  b<-binom.test(p,n,baseline)
  c(mid,round(b$conf.int*100,0),round(b$p.value,3))
}

Tabulating confidence intervals

The results can be tabulated for each group using dplyr.

cut<-3
baseline<-sum(d$nss>cut)/length(d$nss)
baseline

## [1] 0.4

d %>% group_by(course)%>% 
  summarise(lwr=satisfied_ci(nss,cut,baseline)[2],
            med=satisfied_ci(nss,cut,baseline)[1],
            upr=satisfied_ci(nss,cut,baseline)[3],
            n=n(),
            n_sat=sum((nss>3)*1),
            p_val=satisfied_ci(nss,cut,baseline)[4]
            )->dd
datatable(dd,autoHideNavigation=TRUE,fillContainer=FALSE)

Now we have added the 95% confidence intervals for each course plus a p-value which represents the probability of obtaining the data for that course if the null hypothesis were true. The null is that there is no difference from the baseline, so in the case of course 3 which happened to throw up a proportion that was exactly the same as the baseline, the p-value is 1. None of the p-values are significant, as they should not be as the data were simulated from the null. It is more useful to look directly at the confidence intervals, as they show the range of possible scores that could be obtained by chance. Note that this is very broad for the course with only five students. Evaluation of course performance should not be based only on a simple consideration of the statistics, but the analysis clearly points out the issue surrounding the unreliability of measures based on low sample sizes and the potential for random fluctuations.

Illustrating the analysis with a larger simulated data set

To show how this can be used with realistic data for thirty courses that do have underlying differences in the pattern of responses I will simulate some data by varying both the number of respondents and the pattern of response.

sim_course<-function(i){
  course<-paste("course",i,sep="_")
  n<-sample(4:30,1)
  shape<-5
  scale<-runif(1,2,5)
  nss<-round(rbeta(n,shape,scale)*4)+1
  d<-data.frame(course=course,nss=nss)
  d
}

d<-do.call("rbind",lapply(1:30,sim_course))

Data table

An interactive data table can be set up that can be ordered by clicking on the column headings.

cut<-3
baseline<-sum(d$nss>cut)/length(d$nss)
baseline

## [1] 0.492986

d %>% group_by(course)%>% 
  summarise(lwr=satisfied_ci(nss,cut,baseline)[2],
            med=satisfied_ci(nss,cut,baseline)[1],
            upr=satisfied_ci(nss,cut,baseline)[3],
            n=n(),
            n_sat=sum((nss>cut)*1),
            p_val=satisfied_ci(nss,3,baseline)[4]
            )->dd
datatable(dd,autoHideNavigation=TRUE,fillContainer=FALSE)

Eyeballing the table suggests that courses with very high, or very low, differences from the baseline tend to have low mumbers of students. Click the headings to order the data to look at this in differnt ways. The differences also tend to be non-significant for small courses. So, in a real life situation the interpretation of the raw scores must be approached with due caution. Differences between course scores may be indicative but not definitive evidence of a problem with a course, or may be an indication to consider a course to be outstanding but provide insufficient evidence to suggest that it is radically outperforming. A 95% confidence interval may be rather conservative, and decision makers may prefer a narrower interval, but the statistical principle remains.

Plotting the confidence intervals

A quick and intuitive way of looking at the data is to plot the confidence intervals after ranking the scores.

dd<-dd[order(dd$med),]
dd$course = factor(dd$course, levels=dd$course[order(dd$med)], ordered=TRUE)
#dd<-subset(dd,dd$med>90)
g0<-ggplot(dd,aes(x=course)) 
g0<-g0+geom_point(aes(y=med),colour="red")
g0<-g0+ geom_hline(yintercept=baseline*100, col="green") +ylab("Percent satisfaction")
g1<-g0+geom_errorbar(aes(ymin=lwr,ymax=upr))+coord_flip()
library(plotly)
g1

Notice that the confidence intervals are asymetrical. A course with 100% recorded satisfaction can still have a statistical probability of a lower score.

This analysis allows courses that are significantly underperforming or overperforming with regard to student satisfaction to be quickly identified. There are typically not many courses that differ significantly from the baseline. There may still be indications that courses do need attention even if they do not differ significantly from the baseline at the confidence interval chosen, but it is more important to focus on those that do show significant differences that are not attributable to chance.

Notice that the confidence interval for the top performer (course 17) just clips the baseline. The evidence that this course is exceptional is therfore not quite as compelling as that for course 26, which is ranked at #3. The confidence intervals of most courses include the baseline, which is a rather typical result with real life data. Many of the differences are rather illusory and so can be easily reversed if a different sample of students take exactly the same course.

Changing the cut off level

The analysis can be re-run using any cut off point in order to add depth and identify marmite courses with unusual numbers of highly satisfied or highly disatisfied students. I’ll just show one figure this time without the table.

cut<-4
baseline<-sum(d$nss>cut)/length(d$nss)
baseline

## [1] 0.05811623

d %>% group_by(course)%>% 
  summarise(lwr=satisfied_ci(nss,cut,baseline)[2],
            med=satisfied_ci(nss,cut,baseline)[1],
            upr=satisfied_ci(nss,cut,baseline)[3],
            n=n(),
            n_sat=sum((nss>cut)*1),
            p_val=satisfied_ci(nss,3,baseline)[4]
            )->dd

dd<-dd[order(dd$med),]
dd$course = factor(dd$course, levels=dd$course[order(dd$med)], ordered=TRUE)
#dd<-subset(dd,dd$med>90)
g0<-ggplot(dd,aes(x=course)) 
g0<-g0+geom_point(aes(y=med),colour="red")
g0<-g0+ geom_hline(yintercept=baseline*100, col="green") + ylab("Percent highly satisfied")
g1<-g0+geom_errorbar(aes(ymin=lwr,ymax=upr))+coord_flip()
library(plotly)
g1

Note that this is now highly aymetrical. The simulation produced many courses with no extreme responses, but this is just a feature of the particular simulated data. Course 3 and course 1 stand out as producing significantly high levels of extreme satisfaction. A careful inspection of these courses with respect to other measures may be needed. The same analysis could be applied to dissatisfaction to identify courses with levels that are statistically significantly different to the baseline after accounting for chance effects and course numbers.

Bootstrapping mean scores

Although parametric analysis of differences in mean scores are not valid, and the mean itself may be difficult to interpret it is possible to produce confidence intervals for the mean through bootstrapping. This involves resampling with replaecements from the data. In other words, if a course has only five students who score the course as 1,3,3,4,5 we could take a sample from this at random that could very ocassionally be 5,5,5,5,5 or 1,1,1,1,1 but will typically include a mixture of the values. If repeat the resampling thousand of times and exclude the extreme values which occur very infrequently we can get a bootstrapped confidence interval for the mean by calculating it for all the random samples. This approach will occasionally break down for small samples (for example when all the values are identical) but in general it is quite robust and will never throw up values outside the bounds of the data. So this does provide a potential complementary tool to identify courses with unusual scores.

boot_mean<-function(x){
 
  n<-length(x)
  x<-replicate(1000,mean(sample(x,n,replace=TRUE)))
  round(quantile(x,c(0.025,0.5,0.975)),2)
}

d %>% group_by(course)%>% 
  summarise(n=n(),
            mean=boot_mean(nss)[2],
            lwr=boot_mean(nss)[1],
            upr=boot_mean(nss)[3],
            )->dd
datatable(dd,autoHideNavigation=TRUE,fillContainer=FALSE)

dd<-dd[order(-dd$mean),]
dd$course = factor(dd$course, levels=dd$course[order(dd$mean)], ordered=TRUE)
g0<-ggplot(dd,aes(x=course)) 
g0<-g0+geom_point(aes(y=mean),colour="red")
g0<-g0+ geom_hline(yintercept=mean(d$nss), col="green") +xlab("Mean Likert score with bootstrapped 95% confidence intervals")
g1<-g0+geom_errorbar(aes(ymin=lwr,ymax=upr))+coord_flip()

g1

This analysis generally picks up the same courses in the simulated data set with confidence intervals that do not overlap the baseline. This is useful, as it suggests that the conclusions do not rely too much on the assumptions, providing a statistically justifiable procedure is being used. The bootstrap does break down in this example as one course threw up a set of identical responses. However, again, in a real life situation some additional evidence would be used when considering the results in detail.