Statistical Rethinking - Likert Type Survey Inference and Bootstrap Confidence Interval


August 14, 2022


Carlito O. Daarol
Mathematics Department
Mindanao State University
General Santos City, Philippines


Abstract

In this paper, we voice concerns about the practice of ignoring the ordinal nature of Likert scale data and instead using the mean, together with make-believe intervals, as the basis for statistical inference in qualitative research.


Research in education, social science and behavioral science relies on ordinal data to measure the sentiment of respondents. We will argue in this paper that the analysis of Likert items should not involve parametric statistics but should rely on the ordinal nature of the data. A noted educator, Dr. Achilleas Kostoulas, has written a comprehensive article on the subject, “On Likert scales, ordinal data and mean values”: https://achilleaskostoulas.com/2013/02/13/on-likert-scales-ordinal-data-and-mean-values/


At present, the bootstrap confidence interval has become a cornerstone of modern statistical inference. In this paper, we will outline a proof by contradiction, by providing a counterexample, to establish that the practice of computing the mean and interpreting it with make-believe intervals is full of inconsistencies. Statistical inference about sentiment should not be based on this erroneous approach.


Bootstrap confidence intervals have earned the respect of world-class researchers. Ditching p-values and replacing them with bootstrap confidence intervals has become a rallying point for modern researchers. A respected researcher and author, Dr. Florent Buisson, published this article on the topic: https://towardsdatascience.com/ditch-p-values-use-bootstrap-confidence-intervals-instead-bba56322b522.


Introduction

For the benefit of readers who may not be familiar with Likert type surveys, I present here a discussion of the concept from a layman's point of view. This may also improve the understanding of existing researchers and of the many who will become researchers in the future. Specifically, this article is intended for everyone who is actively doing social research.


To support the arguments in this article, established theories in Mathematical Statistics will be introduced, among them the Central Limit Theorem and statistical inference using the bootstrap process.


The Likert type survey has become the de facto standard when a researcher wants to gather data for an overall measurement of sentiment on topics like opinion, agreement, experiences, consumer beliefs, attitudes, etc.

Sentiment analysis is also known as opinion mining or emotional analysis (Wikipedia). It uses ordinal qualitative data to identify, quantify or extract subjective information from target respondents. For measuring satisfaction, the familiar approach is to ask respondents whether they are Very Satisfied, Satisfied, Neutral, Dissatisfied or Very Dissatisfied. To measure degree of agreement, it is common to use Strongly Disagree, Disagree, Neutral, Agree and Strongly Agree.


The joke that goes around is that Likert type surveys are the best thing since sliced bread, because almost any topic can be structured, and subjective data collected, using this method. The structure of this type of survey is so simple that it dominates the data gathering process for qualitative data.


What are the two components of a Likert Type Survey?

The first component consists of Likert items. This refers to the individual items or statements used to ask the respondents regarding their opinion or degree of agreement.


The data collected by Likert items are ordinal responses such as Very Low, Low, Average, High and Very High, which measure the degree of occurrence, just like the degree of satisfaction and degree of agreement mentioned earlier.

The second component is called the Likert scale. It is defined as a group of Likert items. Within the group, it is possible to compute statistics like the sum, average, standard deviation, etc.


With due respect to its owner, Dr. Paul E. Spector, I am including his Job Satisfaction instrument to illustrate the difference between the terms Likert item and Likert scale. Dr. Spector is a faculty member of the Department of Psychology, University of South Florida. The first figure below consists of the 36 Likert items.



The figure below provides an example of a Likert scale survey with all the Likert scale variables and the Likert items. The Likert scale variables are represented by job satisfaction with respect to pay, promotion, supervision, benefits, rewards, operating conditions, co-workers relationship, nature of work and goals and communication.



What is the technique of managing ordinal data?

To manage ordinal data like Very Satisfied, Satisfied, Neutral, Dissatisfied, Very Dissatisfied, the technique is to use integers to represent the values. The common approach is to use the integers 1, 2, 3, 4, 5 to represent 5-point ordinal data. The choice of integers is arbitrary. Other sets of integers like {6,7,8,9,10} or {96,97,98,99,100} are also possible. A detailed comparison will follow.
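
The arbitrariness of the coding can be sketched in a few lines. The snippet below is only an illustration: it is written in Python rather than the R used for the simulations in this article, and the names LABELS, make_coding and rank_labels, as well as the sample responses, are assumptions made for this sketch. It codes the same five labels with three different integer sets and prints the responses sorted from lowest to highest under each coding.

```python
# Ordered labels of a 5-point satisfaction item, lowest to highest.
LABELS = ["Very Dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very Satisfied"]


def make_coding(start):
    """Map the labels to five consecutive integers beginning at `start`."""
    return {label: start + i for i, label in enumerate(LABELS)}


def rank_labels(responses, coding):
    """Sort responses from lowest to highest using the integer codes."""
    return sorted(responses, key=lambda label: coding[label])


# Hypothetical responses, invented for illustration only.
responses = ["Satisfied", "Neutral", "Very Satisfied", "Dissatisfied", "Satisfied"]

for start in (1, 6, 96):
    coding = make_coding(start)
    codes = [coding[r] for r in responses]
    # The codes differ, but the ordering of the labels does not.
    print(start, codes, rank_labels(responses, coding))
```

Whatever the starting integer, the sorted list of labels is the same, which is the sense in which the integers act only as labels for the ordinal values.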


Example of Likert Scale Survey Dataset

This is a portion of a sample dataset downloaded from a Google Survey.

Sample Data



Sample Data in Numbers



The variables Resource1, Resource2, Resource3, Resource4, Resource5 are classified as Likert items. These items retain the structure of the qualitative ordinal data although they are now represented by numeric labels. It is important to remember that the ordinal structure of these variables prevails over the mathematical operations on the set of integers used to represent them.


The variable Resource is called a Likert scale. Parametric methods of analysis like Analysis of Variance and Regression Analysis can be performed on this variable. The smallest possible sum is 5 and the highest possible sum is 30. The possible sums are {5, 6, 7, 8, …, 28, 29, 30}, which makes the variable an interval variable. It no longer retains the qualitative nature of the original scale. If this variable is transformed further to a percentage score then it becomes a continuous variable.
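
As a rough sketch of how the Likert scale variable is formed, the snippet below sums five items for one respondent. It is written in Python as an assumption (the article's own computations use R), it assumes the six-point agreement scale is coded 1 to 6, and the name scale_score and the respondent's answers are hypothetical.

```python
# Six-point agreement scale coded 1..6 (an assumed coding).
AGREEMENT = {
    "Strongly Disagree": 1,
    "Moderately Disagree": 2,
    "Slightly Disagree": 3,
    "Slightly Agree": 4,
    "Moderately Agree": 5,
    "Strongly Agree": 6,
}


def scale_score(item_labels):
    """Sum the numeric codes of one respondent's Likert items (row-wise)."""
    return sum(AGREEMENT[label] for label in item_labels)


# Hypothetical answers of one respondent to Resource1..Resource5.
respondent = [
    "Slightly Agree",
    "Moderately Agree",
    "Strongly Agree",
    "Slightly Disagree",
    "Moderately Agree",
]
print(scale_score(respondent))

# Extremes of the scale: every item at the lowest or the highest category.
lowest = scale_score(["Strongly Disagree"] * 5)
highest = scale_score(["Strongly Agree"] * 5)
print(lowest, highest)
```

With five items on a six-point scale, the two extremes reproduce the range of 5 to 30 stated above.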


Data is the heart and soul of any statistical analysis. It is very important for anyone conducting research to distinguish what type of data he or she is dealing with. The type of statistical analysis will depend largely on the structure of the data or variables being considered.


The Importance of Understanding Different Data Structures for Qualitative Research

In the fields of Statistics and Information Technology, data structure plays a central role in the development of algorithms for data analysis, data processing and data storage. There are two major data types: qualitative or categorical data, and quantitative or numeric data. Categorical data are split into nominal data and ordinal data, while numeric data are split into interval or discrete data and real-valued or continuous data. A summary of these different data structures is given below.

Data Structure Examples



Nominal data
Skin color is classified as nominal data, and its values can be arranged in whatever order you like. The skin color can be white, black, yellow, brown, red, orange. Arithmetic operations cannot be performed on this set and there is no assumed distance or gap between nominal values.


Ordinal data
The level of educational attainment can be described as Elementary graduate, High School graduate, College graduate, Masters graduate, PhD graduate. A person with no education is perceived to be at a lower level than a person with a college degree. The values are arranged from lowest to highest to emphasize the hierarchy of values. The following are established facts about ordinal data.

  • Basic arithmetic operations cannot be performed on the level of education;
    that is, Elementary + High School = College graduate does not make sense.

  • The gap or space between the values is not the same. In terms of content, there is a big gap between a High School and a College graduate. There is also a big gap between a Masters and a PhD graduate, while we can say that there is a small gap between a High School and an Elementary graduate. Measuring the content gap in terms of numbers is hard to do. We just know that the gap exists, but to quantify it by numbers is next to impossible.

  • Integers can be used to represent ordinal values, but the qualitative meaning prevails.
    Elementary graduate = 1, High School graduate = 2, College graduate = 3, Masters graduate = 4, PhD graduate = 5.

  • The equation 1 + 2 = 3 cannot be applied, since Elementary + High School = College graduate is wrong.

  • The equation (1+5)/2 = 3 cannot be applied, since the average (Elementary graduate + PhD graduate)/2 = College graduate is invalid.


Interval data
This is also called integer data. Number of cars = 1, 2, 3, 4 and so on. Number of siblings = 1, 2, 3, 4 and so on. These integers have a unique equi-distant property. It means that there is a fixed distance of 1 unit from one value to the next or preceding value.

  • The median of the values 1, 2, 3, 4, 5 is 3 since the middle value is 3.
  • The median of the values 1, 2, 3, 3, 3, 4, 5 is 3 since the middle value of the list is 3.
  • The median of the values 1, 2, 3, 3, 3, 3, 4, 5 is 3 since the middle value of the list is (3+3)/2 = 3.
  • The mean of the values 1, 2, 3, 4, 5 is 3 since (1 + 2 + 3 + 4 + 5)/5 = 15/5 = 3.
  • The mean of the values 1, 5 is 3 since (1 + 5)/2 = 6/2 = 3.


These median results are possible because of the equi-distant property of integer values. Unfortunately, there is no such thing as median value between two qualitative data. How could you get the midpoint between Love and Lust?
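
The arithmetic in the bullets above can be checked quickly. The snippet below is a small sketch in Python, chosen here as an assumption since the article works in R; it uses the standard statistics module on the same lists of integers.

```python
from statistics import mean, median

# The three lists used in the median examples above.
median_examples = [
    [1, 2, 3, 4, 5],
    [1, 2, 3, 3, 3, 4, 5],
    [1, 2, 3, 3, 3, 3, 4, 5],
]
for values in median_examples:
    # For an even count, median() averages the two middle values.
    print(values, "median:", median(values))

# The two lists used in the mean examples above.
for values in ([1, 2, 3, 4, 5], [1, 5]):
    print(values, "mean:", mean(values))
```

These calls work because the inputs are genuine integers with a fixed unit distance between them; nothing comparable is defined for labels such as Love and Lust.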


Here is the catch between the components of Likert Scale data

When we set the ordinal data {Elementary graduate, High School graduate, College graduate, Masters graduate, PhD graduate} = {1, 2, 3, 4, 5}, the computation of the mean and median no longer holds. It is the qualitative nature of the data that prevails over the arithmetic operations possible on the integers, and not the other way around. In other words, the integers are treated as numeric labels only.


  • Although 1 + 2 = 3, you cannot impose this result and declare that Elementary Graduate + High School graduate = College graduate.

  • Similarly, you can easily compute the average (2+5)/2 = 3.5, but you cannot impose this result to say that the average (High School graduate + PhD graduate)/2 = College graduate + 0.5.


Measures of Central Tendency

The common measures of central tendency are the mean, median and mode. For numeric and continuous data like the age, height and weight of a person, these three statistics can be computed easily using the R statistical software and other commercial software. For Likert type data, the computation of these statistics is quite different since there are two data components to deal with. For categorical data, the mode is more appropriate than the mean and the median. Our point of interest here is how to apply the different measures of central tendency to Likert scale data, which consists of ordinal data and a numeric equivalent.



In the example dataset, the Likert Items Resource1, Resource2, … Resource5, are different indicators which the respondents have to answer. The respondent can choose only from the ordinal data {Strongly Disagree, Moderately Disagree, Slightly Disagree, Slightly Agree, Moderately Agree, Strongly Agree}. The first table represents the original data as downloaded from Google Survey. The second table represents the equivalent data in numeric form.


Confusion involving Likert scale type of data


This is where the confusion started about the operations involving Likert scale type of data. Take a look at the sample dataset closely.

Computation of Likert scale variable. The variable Resource is computed by adding the Likert Items by respondent. Any other operations performed, by row (respondent), on the numeric values are valid. The underlying structure of the Likert scale variable is interval or continuous (no longer ordinal). This process is called estimation of the underlying continuous data.


Computation of Likert Items variable. Operations performed on the numeric values (by columns) are meaningless. The ordinal data prevails over whatever computations performed on the numeric values.


Not everybody knows this somewhat weird property of data collected from Likert type surveys, and this is where the problems in making statistical inference begin.


What is the Focus of this Article?


The focus of this article is on the analysis of each Likert item considering all respondents because this is where controversies and anomalies regarding Likert Type Surveys occur. We will be discussing the three (3) common measures of central tendency, the mean, median and mode and discuss which one is most appropriate to obtain an overall measure for each Likert item.


To examine the mean value, I will be using advanced results in Mathematical Statistics such as the Central Limit Theorem and the bootstrap process. I encourage the reader to look up interpretations of these advanced concepts using search engines on the net.


Bootstrapping and Estimation of 95% Confidence Interval for the Mean


  • Bootstrapping simulates random sets from the original random sample by repeated sampling with replacement. For each random set, the corresponding sample mean and standard deviation are computed. The distribution of these sample means allows us to estimate the unknown population mean. This result rests on the Central Limit Theorem in Mathematical Statistics. The 95% confidence interval follows easily once the estimate of the population mean is obtained. I will be using the R language to simulate 10,000 random sets, which is large enough to estimate the unknown population parameter.

  • How good is the bootstrap process? The bootstrap process is only as good as the original dataset, which must be truly representative of the population. In other words, when data collection was performed, we must have obtained truly random samples from the population of interest.

  • In this article I presume that the original dataset consists of a sufficient number of random samples. Jerome Friedman, in 1998, said that “the bootstrap is the most important new idea in statistics introduced in the last 20 years, and probably in the last 50 years.” The trend now is to use bootstrapping as a better alternative to the p-value. Kindly search for the article “Ditch p-values. Use Bootstrap confidence intervals instead” by Florent Buisson to get a better idea of how modern statistical analysis is applied. Search using the keyword “Abandon p-value” for more on this hot issue.

  • Bootstrapping confidence intervals is the foundation of modern statistical inference. I am going to use bootstrap confidence intervals to argue that different styles of self-made intervals that are arbitrarily constructed to interpret the mean are inefficient and full of inconsistencies.

  • The bootstrap process is a modern tool established in 1979 by the American statistician Bradley Efron. It paves the way for a researcher to repeat the survey by simulation, through resampling with replacement. The bootstrap confidence interval tells us that the sample means fall inside this interval in 95% of the simulations.
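
The resampling described in the bullets above can be sketched as follows. This is a minimal illustration, not the R code behind the results in this article: it is written in Python as an assumption, it uses the percentile method (the middle 95% of the resampled means) as one common way of forming the interval, and the function name bootstrap_mean_ci, the seed and the sample of coded responses are all hypothetical.

```python
import random


def bootstrap_mean_ci(sample, n_resamples=10_000, level=0.95, seed=2022):
    """Percentile bootstrap confidence interval for the mean of `sample`."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sample)
    means = []
    for _ in range(n_resamples):
        # Resample with replacement, same size as the original sample.
        resample = rng.choices(sample, k=n)
        means.append(sum(resample) / n)
    means.sort()
    tail = (1 - level) / 2
    lower = means[round(tail * n_resamples)]
    upper = means[round((1 - tail) * n_resamples) - 1]
    return lower, upper


# Hypothetical coded responses to one Likert item (1 = Never ... 5 = Always).
sample = [1] * 3 + [2] * 7 + [3] * 15 + [4] * 17 + [5] * 8

lower, upper = bootstrap_mean_ci(sample)
print("sample mean:", sum(sample) / len(sample))
print("95% bootstrap CI:", (lower, upper))
```

By construction, about 95% of the 10,000 resampled means lie between the two returned end points, which is the reading of the interval given in the last bullet.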


According to Richard McElreath, statistical rethinking builds your knowledge and confidence in making inferences from raw data. I think there is a real need to review the methods of finding central tendency for Likert scale surveys, because they have become a dominant approach in the analysis of ordinal qualitative data. Lastly, the argument presented here is that of a purist statistical point of view, which makes it easy to interpret results without inserting another layer of transformation.


Why is it that the ordinal data prevails over the integers?


In the first place, we want to measure the sentiment of the respondents. The qualitative data gives us that sentiment, which the integers cannot, and so the question is answered.

Suppose that it is not. Then we assume that both variables are important and are of the same level. We try different representations of the same ordinal data. We will consider the sets {1,2,3,4,5}, {6,7,8,9,10}, {96,97,98,99,100} and see what happens. The next three tables provide the result under this assumption.


Likert Scale: Never =1, Rarely = 2, Sometimes = 3, Often = 4, Always = 5



Likert Scale: Never =6, Rarely = 7, Sometimes = 8, Often = 9, Always = 10



Likert Scale: Never =96, Rarely = 97, Sometimes = 98, Often = 99, Always = 100



The three (3) tables above show what happens when the Likert scale data is represented using different sets of integers.

The Likert item percentage entries are the same in all tables. The modal value is the same in all tables. The plot of the distribution and the result of the Shapiro-Wilk normality test are also the same.

The mean value, the median value and the bootstrapped 95% confidence interval, and with them their interpretation, are affected by the choice of the integers.

The overall impression is that the choice of integers to represent a Likert scale is not important. This shows that in a Likert scale survey, the primary data is the ordinal data and not the set of integers used to represent them.
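
The pattern in the three tables can be reproduced on a small made-up example. The snippet below is a sketch in Python (an assumption, since the tables in this article were produced with R); the responses and the name summarize are hypothetical. It recodes the same responses with the three integer sets and compares the modal category, the percentages and the mean.

```python
from collections import Counter

LABELS = ["Never", "Rarely", "Sometimes", "Often", "Always"]

# Hypothetical responses to one Likert item, invented for illustration.
responses = (["Never"] * 2 + ["Rarely"] * 3 + ["Sometimes"] * 6
             + ["Often"] * 7 + ["Always"] * 2)


def summarize(start):
    """Code the labels as start, start+1, ... and summarize the item."""
    coding = {label: start + i for i, label in enumerate(LABELS)}
    decode = {code: label for label, code in coding.items()}
    codes = [coding[r] for r in responses]
    counts = Counter(codes)
    mode_label = decode[counts.most_common(1)[0][0]]
    percentages = {decode[c]: 100 * k / len(codes) for c, k in counts.items()}
    mean_code = sum(codes) / len(codes)
    return mode_label, percentages, mean_code


for start in (1, 6, 96):
    mode_label, percentages, mean_code = summarize(start)
    print(start, mode_label, percentages, mean_code)
```

The modal category and the percentages come out identical for all three codings, while the mean simply moves with the starting integer, so any verbal reading attached to the number depends on the coding.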


For Likert type surveys, which one is more appropriate as a measure of Central Tendency?


Let us answer this question by looking closely at the individual properties of these measures.


The modal value. The mode is a measure of central tendency that is applicable whether the variable is quantitative or qualitative. Likert scale data is a mixture of qualitative and quantitative data, and the mode is determined by finding the peak of the distribution, which can be done visually. The value is consistent regardless of which set of integers is used to represent the qualitative data, as shown in the three tables above.

The median value. The median is a measure of central tendency that is applicable to quantitative data only. We know that Likert scale data is a mixture of qualitative and quantitative components. Previously, we were able to show that the qualitative component prevails over the other component. Therefore, Likert scale data is basically qualitative data. We know that the median relies on the equi-distant property, which qualitative data does not possess. How in the world can you get the middle point between “Never” and “Always”? Therefore, we can conclude that the median value is not appropriate.


The mean value. The mean is the most popular measure of central tendency. It is computed by summing all the observations and dividing the result by the total count of the data. The catch is that the mean is applicable only to numeric data. Adding the fruits in the set {mango, apple, banana, guava} does not make sense. Also, to declare that the average fruit in the collection {mango, apple, banana, guava} is mango is nonsense.


So why are most researchers using the mean value for Likert item data?


“Well, the short and hard answer is I don't know why. It must have been love for the method, or these people may think that the mean is a heaven-sent solution that they just have to apply, whatever that means.”


No pun intended here, but it is surprising that a high percentage of researchers use this method of measuring sentiment, even among people of higher rank in the academe.


I am not going to tackle this question head on; instead, I am going to examine how statistical inference is made based on the computed mean.


Validating the mean using the 95% Confidence Interval


First, let us assume for the moment that the mean value is appropriate for the Likert Items. We let it fall where it falls. Later, we will point out the weakness and contradiction about this assumption using the foundation of statistical inference.

As I have noticed, people love to use make-believe tables to interpret the computed mean. Some of these tables are found on the net. I collected some of them and present them here for discussion purposes.


Guide for Interpreting the Mean. Choose Your Own Style



Weakness of these tables. As presented, there are several styles of constructing the intervals with the purpose of interpreting the mean value. Each was constructed by different people and most of these are formed without theoretical basis. The interpretation of one value may vary from one table to another table.


Benchmark for evaluation. I will be using the results of bootstrapped 95% confidence intervals to highlight the weaknesses and inconsistencies of these tables. Bootstrapping confidence intervals is at the center of modern statistical inference, which is backed up by established theories in Mathematical Statistics.


The wisdom of creating and using these tables as basis for making statistical inference will be challenged by the bootstrap results.


  1. First we compare the 95% bootstrap confidence interval and the Interval Style #1 and #2. The problem with these tables is that there are different interpretations for values that are not significantly different. Another problem with Interval Styles #1 and #2 is that there are intervals where the endpoints must be interpreted differently because they do not belong to the same bootstrap confidence interval.


Guide for Interpreting the Mean Using Interval Style #1 and #2



  • Refer to question number 1. Interval Style #1 says that the endpoints 3.45 and 3.46 must be interpreted as Sometimes and Often respectively. But if you look at the bootstrap CI, 3.45 and 3.46 fall inside the interval (3.22, 3.64). The implication is that both numbers are not significantly different and should have the same interpretation.

  • For Interval Style #2, the same thing happens: the endpoints 3.49 and 3.5 must be interpreted as Sometimes and Often respectively. This is a contradiction, since both numbers are inside the same confidence interval. If we allow Interval Style #2, then how is 3.4999 interpreted? Is it something between Sometimes and Often?

  • Refer to the interval (3.46, 4.45) from Style #1 and the interval (3.5, 4.49) from Style #2; both are interpreted as Often. The confidence interval for Question #1 is (3.22, 3.64) and it does not contain the value 4.45. Therefore 3.46 and 4.45 must have different interpretations. Similarly, the value 4.49 is outside the CI, and so the interval (3.5, 4.49) should not be interpreted as Often.
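
The comparisons in the bullets above reduce to checking whether a cut point lies inside the bootstrap interval. The snippet below is a small sketch in Python (an assumption; the article's computations use R) that uses only the numbers quoted in the text. The function name inside and the dictionary layout are hypothetical.

```python
# Bootstrap 95% confidence interval for Question #1, as quoted in the text.
CI_QUESTION_1 = (3.22, 3.64)


def inside(value, ci):
    """Return True if `value` lies within the closed interval `ci`."""
    low, high = ci
    return low <= value <= high


# Cut points of the two interval styles, as quoted in the text.
CUT_POINTS = {
    "Style #1": {"Sometimes, upper end": 3.45, "Often, lower end": 3.46, "Often, upper end": 4.45},
    "Style #2": {"Sometimes, upper end": 3.49, "Often, lower end": 3.5, "Often, upper end": 4.49},
}

for style, cuts in CUT_POINTS.items():
    for name, value in cuts.items():
        print(style, name, value, inside(value, CI_QUESTION_1))
```

In both styles, the two cut points that separate Sometimes from Often fall inside the interval, while the upper end of Often falls outside it.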


  2. Next we compare the four Interval Styles. Since these tables were constructed by different people, we expect inconsistencies in interpreting the same computed mean.

Suppose the computed mean for a particular Likert item is 3.28.

  • Using style #1, this value is interpreted as Sometimes.
  • Using style #2, this value is interpreted as Sometimes.
  • Using style #3, this value is interpreted as Often.
  • Using style #4, this value is interpreted as Sometimes.


The inconsistencies cited above show that there is something wrong with our assumption. Using the logic of set theory, I can only conclude that the assumption is empty.


Thank you for reading my article. This is my first one and I think I will be publishing more in the coming days about human behavior analysis. This covers the interesting fields of descriptive analysis, predictive analysis and causal analysis. I hope this will be my contribution in the field of research not only in my country but for others as well.


My other field of interest involves building machine learning models with application in time series analysis, regression analysis, classification analysis and cluster analysis.



About Me:
Carlito O. Daarol
Faculty/Statistician/Data Scientist
BS Mathematics - MSU Marawi
MS Statistics - UP Diliman
PhD Statistics (candidate) - UP Diliman



References

  • Paul E. Spector. Summated Rating Scale Construction: An Introduction. University of South Florida

  • Florent Buisson. Behavioral Data Analysis with R and Python Customer Driven Data for Real Business

  • Efron, B. (1992). Bootstrap Methods: Another Look at the Jackknife. In: Kotz, S., Johnson, N.L. (eds) Breakthroughs in Statistics. Springer Series in Statistics. Springer, New York, NY. https://doi.org/10.1007/978-1-4612-4380-9_41

  • Achilleas Kostoulas. On Likert scales, ordinal data and mean values. https://achilleaskostoulas.com/2013/02/13/on-likert-scales-ordinal-data-and-mean-values/

  • Richard McElreath. Statistical Rethinking (a book in Evolutionary Ecology, Bayesian Data Analysis)