This tutorial will cover calculating Cronbach’s Alpha for a unidimensional scale. We will use the 2020 American National Election Study and its scale for “Racial Resentment”. This scale is frequently used in the study of racism and is designed to measure feelings of racial animosity in a more indirect way than simply asking respondents if they are racist. It consists of 4 variables:

1. “Irish, Italian, Jewish and many other minorities overcame prejudice and worked their way up. Blacks should do the same without any special favors.”
2. “Generations of slavery and discrimination have created conditions that make it difficult for blacks to work their way out of the lower class.”
3. “Over the past few years, blacks have gotten less than they deserve.”
4. “It’s really a matter of some people not trying hard enough; if blacks would only try harder they could be just as well off as whites.”
The first thing we do is create a new dataframe that includes only the variables we want in the Cronbach’s alpha test. We find this information in the codebook and use the `select()` function from the dplyr package to save the new dataframe. The next step is to give the variables new names so that we can more easily evaluate the analysis results: we first save a vector of names and then use `colnames()` to apply them in order. There are also non-substantive responses that need to be dealt with. Per the codebook, any value less than 0 should be handled as system missing in our analysis, so we make those values `NA` in the data.
df <- anes %>%
  dplyr::select(V202300, V202301, V202302, V202303)

new_names <- c("resent_gen", "resent_fav", "resent_try", "resent_deserve") #Give variables more informative names
# Update column names
colnames(df) <- new_names #Apply names to data frame for analysis
skim(df) #Check for missing data that should be recoded
| Name | df |
|---|---|
| Number of rows | 3000 |
| Number of columns | 4 |
| Column type frequency: | |
| numeric | 4 |
| Group variables | None |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| resent_gen | 0 | 1 | 2.02 | 3.09 | -9 | 1 | 3 | 4 | 5 | ▁▂▁▆▇ |
| resent_fav | 0 | 1 | 1.89 | 3.01 | -9 | 1 | 2 | 4 | 5 | ▁▂▁▇▇ |
| resent_try | 0 | 1 | 1.97 | 3.06 | -9 | 1 | 3 | 4 | 5 | ▁▂▁▆▇ |
| resent_deserve | 0 | 1 | 2.43 | 3.20 | -9 | 2 | 3 | 5 | 5 | ▁▁▁▃▇ |
df[df <= -1] <- NA #Recode negative values to NA for analysis
skim(df) #Validate that missing data is now treated as NA
| Name | df |
|---|---|
| Number of rows | 3000 |
| Number of columns | 4 |
| Column type frequency: | |
| numeric | 4 |
| Group variables | None |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| resent_gen | 305 | 0.9 | 2.94 | 1.49 | 1 | 2.0 | 3 | 4 | 5 | ▇▇▆▆▇ |
| resent_fav | 301 | 0.9 | 2.78 | 1.47 | 1 | 1.5 | 2 | 4 | 5 | ▇▇▃▅▆ |
| resent_try | 308 | 0.9 | 2.90 | 1.40 | 1 | 2.0 | 3 | 4 | 5 | ▇▇▇▆▇ |
| resent_deserve | 310 | 0.9 | 3.42 | 1.37 | 1 | 2.0 | 4 | 5 | 5 | ▃▅▅▅▇ |
Now that we have cleaned the data, we are ready to conduct some analysis. We start by examining the correlation matrix for the individual survey items of interest. All items are correlated at roughly .6 or higher in absolute value, as we should expect if they are measuring the same latent concept. Note that two of the items correlate negatively with the others. This indicates those variables are reverse coded, so that higher values on the measure correspond to lower levels of racial resentment. We will need to remember this when creating the new combined measure of racial resentment.
# Calculate correlation matrix
cor_matrix <- cor(df, use = "pairwise.complete.obs")

# Display correlation matrix as a table
cor_table <- round(cor_matrix, 2)
print(cor_table)
## resent_gen resent_fav resent_try resent_deserve
## resent_gen 1.00 -0.64 -0.64 0.70
## resent_fav -0.64 1.00 0.75 -0.59
## resent_try -0.64 0.75 1.00 -0.61
## resent_deserve 0.70 -0.59 -0.61 1.00
# Plot correlation matrix as a heatmap
corrplot(cor_matrix, method = "color")
Next, we will calculate the Cronbach’s Alpha value using the psych package. We simply input the dataframe we want to include in the analysis (here it is `df`), have it remove missing data with `na.rm = TRUE`, and finally use the `check.keys = TRUE` argument to flip the reverse-coded items. This deals with the negative correlations we previously saw and ensures the analysis is done correctly.
#Calculate Cronbach's Alpha using 'psych' package
##Generic format: alpha(data, na.rm = TRUE, check.keys = TRUE)
##check.keys = TRUE is important as it checks the scale direction and, if necessary, flips the order of the scale prior to running the analysis.
psych::alpha(df, na.rm = TRUE, check.keys = TRUE)
## Warning in psych::alpha(df, na.rm = TRUE, check.keys = TRUE): Some items were negatively correlated with total scale and were automatically reversed.
## This is indicated by a negative sign for the variable name.
##
## Reliability analysis
## Call: psych::alpha(x = df, na.rm = TRUE, check.keys = TRUE)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.88 0.88 0.86 0.65 7.6 0.0035 2.8 1.2 0.64
##
## 95% confidence boundaries
## lower alpha upper
## Feldt 0.88 0.88 0.89
## Duhachek 0.88 0.88 0.89
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## resent_gen- 0.85 0.85 0.80 0.65 5.5 0.0048 0.0072 0.61
## resent_fav 0.85 0.85 0.79 0.65 5.6 0.0048 0.0023 0.64
## resent_try 0.84 0.84 0.79 0.64 5.4 0.0050 0.0032 0.64
## resent_deserve- 0.86 0.86 0.81 0.68 6.2 0.0044 0.0038 0.64
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## resent_gen- 2695 0.87 0.87 0.80 0.75 3.1 1.5
## resent_fav 2699 0.87 0.86 0.81 0.75 2.8 1.5
## resent_try 2692 0.87 0.87 0.82 0.76 2.9 1.4
## resent_deserve- 2690 0.84 0.84 0.77 0.72 2.6 1.4
##
## Non missing response frequency for each item
## 1 2 3 4 5 miss
## resent_gen 0.23 0.21 0.17 0.16 0.23 0.1
## resent_fav 0.25 0.27 0.12 0.16 0.19 0.1
## resent_try 0.21 0.22 0.22 0.17 0.19 0.1
## resent_deserve 0.10 0.19 0.19 0.20 0.31 0.1
Interpreting alpha is very straightforward. First, evaluate the actual alpha level, which here is a robust .88. Remember, alpha ranges from 0 to 1, with higher values meaning a more reliable scale. The standard (if arbitrary) cut-point for a reliable scale is .7 or larger, so an alpha of .88 represents a very strong, internally reliable scale.
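If you want to report the overall alpha programmatically rather than reading it off the printed output, the object returned by `psych::alpha()` stores the overall estimates in its `total` element. A minimal sketch, assuming `df` still holds the four recoded resentment items; `alpha_fit` is just an illustrative name:

# Save the alpha results to an object instead of only printing them
alpha_fit <- psych::alpha(df, na.rm = TRUE, check.keys = TRUE)

alpha_fit$total$raw_alpha #Overall raw alpha, ~.88 for this scale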
The second thing to evaluate is each individual item in the analysis, specifically how alpha would change if that item were dropped. This metric gives insight into how well each individual item fits the overall latent factor. If alpha goes up with an item’s removal, that indicates the item might not truly be part of the concept and should potentially be removed from the scale. If alpha goes down with its removal, that indicates the item is important to the overall latent factor and should be kept in the scale.
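The drop-one results can also be pulled out of the same object as a data frame, which makes it easy to scan for problem items. A short sketch, reusing the hypothetical `alpha_fit` object from above:

# 'Reliability if an item is dropped' as a data frame, one row per item
alpha_fit$alpha.drop

# Flag any item whose removal would raise alpha above the full-scale value
rownames(alpha_fit$alpha.drop)[alpha_fit$alpha.drop$raw_alpha > alpha_fit$total$raw_alpha]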
For illustration purposes, let’s add three additional variables that are not related to the racial resentment scale. If the new items are unrelated to racial resentment, we will see that removing them would result in a higher alpha level. We’ll add a series of three questions designed to measure rural resentment, or the perception that Americans who live in rural parts of the country are being overlooked and have too little influence in politics. These three questions measure:

- How much assistance rural areas get from government
- How much influence rural areas have in government
- How much respect rural people get from others
Note, this is entirely for pedagogical purposes. I do not believe these two concepts to be related.
df <- anes %>%
  dplyr::select(V202300, V202301, V202302, V202303, V202276x, V202279x, V202282x)
skim(df)
| Name | df |
|---|---|
| Number of rows | 3000 |
| Number of columns | 7 |
| Column type frequency: | |
| numeric | 7 |
| Group variables | None |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| V202300 | 0 | 1 | 2.02 | 3.09 | -9 | 1 | 3 | 4 | 5 | ▁▂▁▆▇ |
| V202301 | 0 | 1 | 1.89 | 3.01 | -9 | 1 | 2 | 4 | 5 | ▁▂▁▇▇ |
| V202302 | 0 | 1 | 1.97 | 3.06 | -9 | 1 | 3 | 4 | 5 | ▁▂▁▆▇ |
| V202303 | 0 | 1 | 2.43 | 3.20 | -9 | 2 | 3 | 5 | 5 | ▁▁▁▃▇ |
| V202276x | 0 | 1 | 3.56 | 3.49 | -7 | 4 | 4 | 6 | 7 | ▂▁▁▇▆ |
| V202279x | 0 | 1 | 3.55 | 3.55 | -7 | 4 | 4 | 6 | 7 | ▂▁▁▇▇ |
| V202282x | 0 | 1 | 3.90 | 3.55 | -7 | 4 | 4 | 6 | 7 | ▂▁▁▇▇ |
<- c("resent_gen", "resent_fav", "resent_try", "resent_deserve", "rural_assist", "rural_influence", "rural_respect")#Give variables more informative names
new_names
# Update column names
colnames(df) <- new_names #Apply names to data frame for analysis
skim(df) #Check for missing data that should be recoded
| Name | df |
|---|---|
| Number of rows | 3000 |
| Number of columns | 7 |
| Column type frequency: | |
| numeric | 7 |
| Group variables | None |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| resent_gen | 0 | 1 | 2.02 | 3.09 | -9 | 1 | 3 | 4 | 5 | ▁▂▁▆▇ |
| resent_fav | 0 | 1 | 1.89 | 3.01 | -9 | 1 | 2 | 4 | 5 | ▁▂▁▇▇ |
| resent_try | 0 | 1 | 1.97 | 3.06 | -9 | 1 | 3 | 4 | 5 | ▁▂▁▆▇ |
| resent_deserve | 0 | 1 | 2.43 | 3.20 | -9 | 2 | 3 | 5 | 5 | ▁▁▁▃▇ |
| rural_assist | 0 | 1 | 3.56 | 3.49 | -7 | 4 | 4 | 6 | 7 | ▂▁▁▇▆ |
| rural_influence | 0 | 1 | 3.55 | 3.55 | -7 | 4 | 4 | 6 | 7 | ▂▁▁▇▇ |
| rural_respect | 0 | 1 | 3.90 | 3.55 | -7 | 4 | 4 | 6 | 7 | ▂▁▁▇▇ |
df[df <= -1] <- NA #Recode negative values to NA for analysis
skim(df) #Validate that missing data is now treated as NA
| Name | df |
|---|---|
| Number of rows | 3000 |
| Number of columns | 7 |
| Column type frequency: | |
| numeric | 7 |
| Group variables | None |

Variable type: numeric

| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| resent_gen | 305 | 0.90 | 2.94 | 1.49 | 1 | 2.0 | 3 | 4 | 5 | ▇▇▆▆▇ |
| resent_fav | 301 | 0.90 | 2.78 | 1.47 | 1 | 1.5 | 2 | 4 | 5 | ▇▇▃▅▆ |
| resent_try | 308 | 0.90 | 2.90 | 1.40 | 1 | 2.0 | 3 | 4 | 5 | ▇▇▇▆▇ |
| resent_deserve | 310 | 0.90 | 3.42 | 1.37 | 1 | 2.0 | 4 | 5 | 5 | ▃▅▅▅▇ |
| rural_assist | 345 | 0.88 | 4.73 | 1.32 | 1 | 4.0 | 4 | 6 | 7 | ▁▁▇▂▆ |
| rural_influence | 327 | 0.89 | 4.66 | 1.60 | 1 | 4.0 | 4 | 6 | 7 | ▂▁▇▁▇ |
| rural_respect | 327 | 0.89 | 5.05 | 1.29 | 1 | 4.0 | 5 | 6 | 7 | ▁▁▇▂▇ |
#Run correlations between the items in the proposed scale
# Calculate correlation matrix
cor_matrix <- cor(df, use = "pairwise.complete.obs")

# Display correlation matrix as a table
cor_table <- round(cor_matrix, 2)
print(cor_table)
## resent_gen resent_fav resent_try resent_deserve rural_assist
## resent_gen 1.00 -0.64 -0.64 0.70 -0.05
## resent_fav -0.64 1.00 0.75 -0.59 0.08
## resent_try -0.64 0.75 1.00 -0.61 0.05
## resent_deserve 0.70 -0.59 -0.61 1.00 -0.03
## rural_assist -0.05 0.08 0.05 -0.03 1.00
## rural_influence -0.29 0.30 0.29 -0.25 0.45
## rural_respect -0.15 0.16 0.15 -0.12 0.38
## rural_influence rural_respect
## resent_gen -0.29 -0.15
## resent_fav 0.30 0.16
## resent_try 0.29 0.15
## resent_deserve -0.25 -0.12
## rural_assist 0.45 0.38
## rural_influence 1.00 0.46
## rural_respect 0.46 1.00
# Plot correlation matrix as a heatmap
corrplot(cor_matrix, method = "color")
Examining the correlations, we see that the three new items are not strongly related to the existing racial resentment items and even have relatively weak correlations with each other. This will help illustrate how to identify items that do not belong in a scale.
Next, we re-estimate the alpha level with the three additional variables included. Remember, the initial alpha level was .88, so anything below that would indicate a less reliable scale containing items that might not belong.
psych::alpha(df, na.rm = TRUE, check.keys = TRUE) #Run the alpha calculation
## Warning in psych::alpha(df, na.rm = TRUE, check.keys = TRUE): Some items were negatively correlated with total scale and were automatically reversed.
## This is indicated by a negative sign for the variable name.
##
## Reliability analysis
## Call: psych::alpha(x = df, na.rm = TRUE, check.keys = TRUE)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## 0.79 0.78 0.82 0.34 3.6 0.006 4.2 0.95 0.3
##
## 95% confidence boundaries
## lower alpha upper
## Feldt 0.77 0.79 0.8
## Duhachek 0.77 0.79 0.8
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## resent_gen- 0.73 0.73 0.77 0.31 2.7 0.0076 0.049 0.29
## resent_fav 0.73 0.73 0.77 0.31 2.7 0.0077 0.050 0.29
## resent_try 0.73 0.73 0.77 0.31 2.7 0.0076 0.048 0.30
## resent_deserve- 0.74 0.74 0.78 0.32 2.9 0.0073 0.051 0.30
## rural_assist 0.81 0.80 0.83 0.41 4.1 0.0055 0.052 0.30
## rural_influence 0.76 0.76 0.79 0.34 3.1 0.0066 0.078 0.16
## rural_respect 0.79 0.79 0.82 0.38 3.7 0.0060 0.067 0.30
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## resent_gen- 2695 0.76 0.75 0.73 0.64 5.1 1.5
## resent_fav 2699 0.77 0.76 0.74 0.65 2.8 1.5
## resent_try 2692 0.76 0.76 0.74 0.64 2.9 1.4
## resent_deserve- 2690 0.72 0.72 0.68 0.59 4.6 1.4
## rural_assist 2655 0.43 0.44 0.31 0.25 4.7 1.3
## rural_influence 2673 0.67 0.66 0.57 0.50 4.7 1.6
## rural_respect 2673 0.51 0.52 0.40 0.34 5.1 1.3
##
## Non missing response frequency for each item
## 1 2 3 4 5 6 7 miss
## resent_gen 0.23 0.21 0.17 0.16 0.23 0.00 0.00 0.10
## resent_fav 0.25 0.27 0.12 0.16 0.19 0.00 0.00 0.10
## resent_try 0.21 0.22 0.22 0.17 0.19 0.00 0.00 0.10
## resent_deserve 0.10 0.19 0.19 0.20 0.31 0.00 0.00 0.10
## rural_assist 0.02 0.03 0.02 0.50 0.08 0.25 0.09 0.11
## rural_influence 0.05 0.06 0.02 0.42 0.07 0.25 0.13 0.11
## rural_respect 0.01 0.02 0.00 0.45 0.08 0.27 0.16 0.11
The first thing to note is that the overall alpha of this new seven-item scale, roughly .79, is lower than the original .88, indicating a less reliable scale; the new items harmed the reliability of the scale overall. Next, we look at the alpha level if each item were removed. The four initial racial resentment items all have drop-one alpha levels lower than the overall alpha, indicating the scale would be worse if any of them were removed. That is what we expected to happen. For the other three items, the alpha level would stay the same or get larger if each one was removed, indicating that these new items probably do not fit the overall latent concept of racial resentment.
Coupling these findings with the weak correlations and the lack of theory, we conclude that the rural resentment questions do not measure the same concept as racial resentment.
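One quick way to confirm that conclusion is to drop the three rural items and verify that alpha returns to its original level. A minimal sketch, assuming `df` still contains all seven renamed items:

# Re-run the reliability analysis on the four racial resentment items only
resent_only <- dplyr::select(df, resent_gen, resent_fav, resent_try, resent_deserve)
psych::alpha(resent_only, na.rm = TRUE, check.keys = TRUE) #raw_alpha should climb back to ~.88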
The final step is knowing how to combine the individual survey questions into a scale. The easiest way, provided the items are on the exact same response scale (which should generally be the case), is to sum them and divide by the total number of items. For the racial resentment scale, we sum across the four items and then divide by 4, since there are 4 items in the scale. Before doing so, however, we must flip any reverse-coded items so that higher values mean the same thing on every item; failing to do this step when required would introduce bias into your analysis.
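For items on a 1-5 response scale, flipping the direction reduces to simple arithmetic: reversed = (minimum + maximum) - original, here 6 - x, so 1 becomes 5, 2 becomes 4, and so on. A compact sketch of that shortcut (the variable name `resent_fav_alt` is purely illustrative; the tutorial below uses the more explicit `case_when()` recode, which makes every mapping visible):

# One-line reverse code for a 1-5 item: 6 - x maps 1 -> 5, 2 -> 4, ..., 5 -> 1
# The %in% 1:5 check keeps the negative non-substantive codes out of the flip
anes <- anes %>%
  mutate(resent_fav_alt = ifelse(V202301 %in% 1:5, 6 - V202301, NA))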
We start by creating new variables for each of the four racial resentment questions. First, we examine the codebook and determine that the resentment favoritism and try harder questions are reverse coded, so that must be accounted for when creating the new measures. We simply flip the scale direction while saving a new measure in the existing `anes` data frame.
#Working out of the original data frame, anes, so we can save the new variable there for analysis purposes.
anes <- anes %>% #This creates new variable
  mutate(resent_gen = case_when(
    V202300 == 1 ~ 1,
    V202300 == 2 ~ 2,
    V202300 == 3 ~ 3,
    V202300 == 4 ~ 4,
    V202300 == 5 ~ 5
  ))

anes <- anes %>% #Note the reverse coding
  mutate(resent_fav = case_when(
    V202301 == 1 ~ 5,
    V202301 == 2 ~ 4,
    V202301 == 3 ~ 3,
    V202301 == 4 ~ 2,
    V202301 == 5 ~ 1
  ))

anes <- anes %>% #Note the reverse coding
  mutate(resent_try = case_when(
    V202302 == 1 ~ 5,
    V202302 == 2 ~ 4,
    V202302 == 3 ~ 3,
    V202302 == 4 ~ 2,
    V202302 == 5 ~ 1
  ))

anes <- anes %>%
  mutate(resent_deserve = case_when(
    V202303 == 1 ~ 1,
    V202303 == 2 ~ 2,
    V202303 == 3 ~ 3,
    V202303 == 4 ~ 4,
    V202303 == 5 ~ 5
  ))
#With the new variables coded in same direction, we create the new scale 'racial_resent'
anes <- anes %>%
  mutate(racial_resent = (resent_gen + resent_fav + resent_try + resent_deserve) / 4) #Add across individual items and divide by the total number of items. Note this uses casewise deletion so any case that did not answer each question is removed from the calculation

summary(anes$racial_resent) #Examine the new scale
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 1.000 2.250 3.000 3.172 4.250 5.000 319
anes %>%
  count(racial_resent)
## # A tibble: 18 × 2
## racial_resent n
## <dbl> <int>
## 1 1 158
## 2 1.25 76
## 3 1.5 135
## 4 1.75 116
## 5 2 149
## 6 2.25 145
## 7 2.5 152
## 8 2.75 168
## 9 3 272
## 10 3.25 144
## 11 3.5 135
## 12 3.75 135
## 13 4 149
## 14 4.25 147
## 15 4.5 136
## 16 4.75 148
## 17 5 316
## 18 NA 319
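A note on the design choice above: because the arithmetic propagates `NA`, any respondent missing even one item drops out of the combined scale, which is where the 319 `NA` values come from. If you would rather score respondents who answered at least some of the items, one alternative is `rowMeans()` with `na.rm = TRUE`; whether partial scores are defensible is a substantive judgment call, and `racial_resent_partial` is a hypothetical name for illustration:

# Average over whichever of the four items a respondent answered
items <- c("resent_gen", "resent_fav", "resent_try", "resent_deserve")
anes$racial_resent_partial <- rowMeans(anes[, items], na.rm = TRUE)
# Respondents missing all four items come back as NaN rather than NA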
df <- anes %>%
  dplyr::select(resent_gen, resent_fav, resent_try, resent_deserve, racial_resent)
# Calculate the correlation matrix
cor_matrix <- cor(df, use = "complete.obs") #Note "complete.obs" removes any case with a NA value
# View the correlation matrix
print(cor_matrix)
## resent_gen resent_fav resent_try resent_deserve racial_resent
## resent_gen 1.0000000 0.6345672 0.6440277 0.7026454 0.8688039
## resent_fav 0.6345672 1.0000000 0.7451431 0.5925321 0.8655949
## resent_try 0.6440277 0.7451431 1.0000000 0.6090468 0.8694315
## resent_deserve 0.7026454 0.5925321 0.6090468 1.0000000 0.8392165
## racial_resent 0.8688039 0.8655949 0.8694315 0.8392165 1.0000000
# Graph the results
corrplot(cor_matrix, method = "color")
Lastly, we want to examine the newly created measure to ensure that it was built appropriately. Since our recoding kept the original 1-5 scale intact, we should see 1 as the minimum value and 5 as the maximum, and that is what we see in the results. We also see values between the whole numbers, such as 1.25 and 1.5, since the denominator in our recode was 4. All of these indicators look good.
The final check is to correlate the new scale with the individual items. The new scale should be highly, but not perfectly, correlated with each of the individual items, and that is exactly what we see here: the correlation between the new scale and each item is at least .84, but none is perfect. This indicates that our new scale was created successfully and is now ready to be analyzed.
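As one last optional check, a quick histogram lets you eyeball the 1-5 range and the quarter-point spacing at a glance. A minimal base-R sketch:

# Visual check of the combined scale: values should run from 1 to 5 in steps of .25
hist(anes$racial_resent,
     breaks = seq(1, 5, by = 0.25),
     main = "Distribution of the racial resentment scale",
     xlab = "racial_resent")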