Sometimes, we collect data that we need to make into scales, such as multiple questions that get at the same construct. When this happens, we want to combine the data in some way, maybe through summation or maybe through averaging. We’ll go over both here.
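Just to make the two options concrete before we touch the real data, here is a tiny made-up example (the toy data frame below is hypothetical, not from the survey):

# Hypothetical example: three items measuring the same construct for four people
toy <- data.frame(item1 = c(1, 2, 4, 3),
                  item2 = c(2, 2, 4, 4),
                  item3 = c(1, 3, 3, 4))
rowSums(toy)   # summation: one total score per person
rowMeans(toy)  # averaging: one mean score per person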
But first, let’s bring in our data. We’re working with an .sav file, so we need the haven package loaded. You should already have it installed since we worked with an .sav file last week, so you just need to load the library.
library(haven)
prri <- read_sav("PRRI-2019-American-Values-Survey-1.sav")
And let’s look at the variables they have here: https://www.prri.org/wp-content/uploads/2020/11/2019-AVS-Topline-Full.pdf We’re going to work with the subset of questions that stem from Q2. Q2 (in its many forms) asks about opinions on political figures: some Democrats, some Republicans, and some who are more Independent. We might not be interested in participants’ opinions of individual politicians, but rather in Democrats or Republicans more broadly.
Let’s first select the columns of interest:

Q2A_A: Donald Trump, R
Q2A_B: Bill Weld, R
Q2A_C: Nancy Pelosi, D
Q2A_D: Alexandria Ocasio-Cortez, D
Q2A_E: Mitch McConnell, R
Q2A_F: Barack Obama, D
Q2B_A: Joe Biden, D
Q2B_B: Cory Booker, D
Q2B_C: Pete Buttigieg, D
Q2B_D: Kamala Harris, D
Q2B_E: Bernie Sanders, D
Q2B_F: Elizabeth Warren, D
Q2B_G: Steve Bullock, D
Q2B_H: Julian Castro, D
Q2B_I: Bill de Blasio, D
Q2B_J: Tulsi Gabbard, D
Q2B_K: Kirsten Gillibrand, D
Q2B_L: Amy Klobuchar, D
Q2B_M: Beto O’Rourke, D
Q2B_N: Tom Steyer, D
Q2B_O: Marianne Williamson, D
Q2B_P: Andrew Yang, D
Last week we went over multiple ways of selecting specific columns. For today, we’re just going to use bracket indexing with the column numbers (a grep() approach on the column names would also work; see the sketch after the next chunk). The Q2 items span columns 5 to 26, so let’s use that to subset into a politician dataset. We also want to keep column 1, since that contains the participants’ IDs.
df1 <- prri[,c(1, 5:26)]
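If you’d rather not count column positions, a grep()-style alternative is sketched below. It assumes the Q2 items are the only columns whose names start with Q2A_ or Q2B_, so double-check names(prri) before relying on it (df1_alt is just an illustration name):

# Same subset, matching column names instead of hard-coding positions
df1_alt <- prri[, c(1, grep("^Q2A_|^Q2B_", names(prri)))]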
You may remember from last week that PRRI codes “Don’t Know/Refuse” as 9s and “Have not heard of” as 5s. For example, if we look at the frequencies for Amy Klobuchar:
table(df1$Q2B_L)
##
## 1 2 3 4 5 9
## 40 209 198 165 575 45
A good number of people haven’t heard of her!
So let’s recode her 5s and 9s (and everyone else’s) as NAs. Remember that for this, we use the dplyr package.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df2 <- na_if(df1, 9)
df2 <- na_if(df2, 5)
#And let's check Klobuchar again:
table(df2$Q2B_L)
##
## 1 2 3 4
## 40 209 198 165
Yay, no 5s or 9s! And this is true for other columns too:
table(df2$Q2B_G)
##
## 1 2 3 4
## 16 121 191 154
Because we didn’t specify a particular column in the na_if() call, the recoding was applied to the whole dataset.
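If you only wanted to recode one column rather than the whole dataset, you could point na_if() at that column instead; a minimal sketch (done on a copy so it doesn’t disturb the objects we’re actually using):

df1_copy <- df1
df1_copy$Q2B_L <- na_if(df1_copy$Q2B_L, 9)  # recode just Klobuchar's 9s
df1_copy$Q2B_L <- na_if(df1_copy$Q2B_L, 5)  # and her 5s
table(df1_copy$Q2B_L)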
Let’s start by dropping the Republican columns from our data. Looking at the questions above, we can see that only three of them asked about a Republican politician, so those are the ones we’ll remove. We will use the subset() function since we’re only removing three columns.
df3 <- subset(df2, select=-c(Q2A_A,Q2A_B,Q2A_E))
And for the sake of it, let’s visualize our missingness:
library(naniar)
vis_miss(df3)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
There’s quite a bit of missingness in this data! According to PRRI, they only gave some questions to half of their participants, so that makes some sense (not to mention, people don’t always know politicians!).
Now, remember that we don’t care about ratings of individual politicians; instead, we want to make a scale of feelings toward Democrats. So, in this particular case, we don’t have to worry much about the missing data for now.
But before we create our scale, let’s talk about alpha, and not the 0.05 kind! This is Cronbach’s alpha, which tells us (simply put) whether our scale is internally consistent. Typically, alpha > 0.7 indicates a good scale, though there may be times when you use a scale that doesn’t reach that threshold. Computing it is simple as long as every column in your data frame relates to the same construct. There are a couple of ways to get Cronbach’s alpha, but thankfully, the psych package contains an alpha() function. Our df3 has all the variables we care about, so we just feed that data frame to the function. Don’t forget to install the psych package if you haven’t already.
library(psych)
alpha(df3)
## Warning in alpha(df3): Some items were negatively correlated with the total scale and probably
## should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
## Some items ( CaseId ) were negatively correlated with the total scale and
## probably should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
##
## Reliability analysis
## Call: alpha(x = df3)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## -0.00013 0.96 0.97 0.55 24 0.00023 495 419 0.62
##
## 95% confidence boundaries
## lower alpha upper
## Feldt -0.06 0 0.06
## Duhachek 0.00 0 0.00
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## CaseId 9.7e-01 0.97 0.98 0.61 30 0.00094 0.0084 0.63
## Q2A_C -1.1e-04 0.96 0.97 0.55 23 0.00021 0.0444 0.62
## Q2A_D -1.3e-04 0.96 0.97 0.55 23 0.00021 0.0447 0.62
## Q2A_F -1.2e-04 0.96 0.97 0.55 23 0.00021 0.0445 0.62
## Q2B_A -1.1e-04 0.96 0.97 0.55 23 0.00021 0.0451 0.62
## Q2B_B -1.3e-04 0.96 0.97 0.54 22 0.00021 0.0436 0.61
## Q2B_C -1.3e-04 0.96 0.97 0.54 23 0.00021 0.0440 0.61
## Q2B_D -9.9e-05 0.96 0.97 0.54 23 0.00021 0.0436 0.61
## Q2B_E -1.4e-04 0.96 0.97 0.55 23 0.00021 0.0452 0.62
## Q2B_F -1.1e-04 0.96 0.97 0.54 23 0.00021 0.0435 0.61
## Q2B_G -1.5e-04 0.96 0.97 0.55 23 0.00022 0.0443 0.62
## Q2B_H -1.2e-04 0.96 0.97 0.54 22 0.00021 0.0439 0.61
## Q2B_I -1.0e-04 0.96 0.97 0.54 23 0.00022 0.0442 0.62
## Q2B_J -1.2e-04 0.96 0.97 0.55 23 0.00022 0.0446 0.62
## Q2B_K -1.4e-04 0.96 0.97 0.54 22 0.00021 0.0442 0.61
## Q2B_L -1.2e-04 0.96 0.97 0.54 22 0.00021 0.0437 0.61
## Q2B_M -1.5e-04 0.96 0.97 0.54 22 0.00021 0.0443 0.61
## Q2B_N -1.8e-04 0.96 0.97 0.54 22 0.00021 0.0443 0.61
## Q2B_O -1.7e-04 0.96 0.97 0.56 24 0.00022 0.0447 0.63
## Q2B_P -8.9e-05 0.96 0.97 0.55 23 0.00021 0.0444 0.62
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## CaseId 2527 0.68750 0.053 -0.0093 -0.01415 4518.5 2704.59
## Q2A_C 2451 0.03422 0.783 0.7705 -0.03640 2.8 0.96
## Q2A_D 1177 0.06926 0.753 0.7429 0.00045 2.8 1.02
## Q2A_F 2502 -0.00884 0.773 0.7611 -0.01235 2.2 1.12
## Q2B_A 2396 -0.04723 0.737 0.7208 -0.02308 2.6 1.00
## Q2B_B 1703 -0.01519 0.852 0.8479 -0.00642 2.8 0.96
## Q2B_C 1603 -0.02363 0.828 0.8223 0.00151 2.7 1.01
## Q2B_D 1903 -0.04203 0.820 0.8133 -0.04306 2.8 1.02
## Q2B_E 2399 -0.02459 0.758 0.7442 0.00671 2.6 1.09
## Q2B_F 2069 -0.02513 0.833 0.8302 -0.03073 2.6 1.09
## Q2B_G 482 0.01704 0.791 0.7903 0.02212 3.0 0.84
## Q2B_H 718 0.00084 0.843 0.8383 -0.02194 2.8 0.95
## Q2B_I 756 -0.07493 0.804 0.8005 -0.04286 3.0 0.84
## Q2B_J 566 -0.04923 0.720 0.7108 -0.02171 2.9 0.92
## Q2B_K 700 -0.01722 0.838 0.8334 0.01044 2.8 0.88
## Q2B_L 612 -0.03107 0.841 0.8419 -0.01147 2.8 0.91
## Q2B_M 835 -0.01679 0.839 0.8319 0.02475 2.8 0.98
## Q2B_N 517 0.04761 0.844 0.8430 0.06166 2.9 0.89
## Q2B_O 609 0.04511 0.624 0.5994 0.05655 2.9 0.84
## Q2B_P 665 -0.06935 0.787 0.7764 -0.05853 2.7 0.95
##
## Non missing response frequency for each item
## 1 2 3 4 miss
## Q2A_C 0.08 0.33 0.28 0.31 0.03
## Q2A_D 0.11 0.30 0.25 0.34 0.53
## Q2A_F 0.36 0.29 0.16 0.19 0.01
## Q2B_A 0.15 0.38 0.25 0.23 0.05
## Q2B_B 0.08 0.36 0.26 0.29 0.33
## Q2B_C 0.14 0.33 0.27 0.26 0.37
## Q2B_D 0.11 0.34 0.23 0.32 0.25
## Q2B_E 0.18 0.34 0.19 0.29 0.05
## Q2B_F 0.17 0.32 0.20 0.30 0.18
## Q2B_G 0.03 0.25 0.40 0.32 0.81
## Q2B_H 0.08 0.36 0.28 0.28 0.72
## Q2B_I 0.03 0.25 0.37 0.35 0.70
## Q2B_J 0.06 0.31 0.33 0.30 0.78
## Q2B_K 0.03 0.38 0.30 0.28 0.72
## Q2B_L 0.07 0.34 0.32 0.27 0.76
## Q2B_M 0.09 0.33 0.27 0.30 0.67
## Q2B_N 0.05 0.30 0.34 0.31 0.80
## Q2B_O 0.04 0.27 0.40 0.29 0.76
## Q2B_P 0.09 0.35 0.29 0.26 0.74
Now THAT is a high alpha, at least by the std.alpha of 0.96! (Good for stats, maybe not good for the nuance in how people view politicians of the same political party.) Note the warning about CaseId: we left the participant ID column in the data frame, and since it isn’t a real scale item it throws off the raw_alpha, which is why that overall number looks so odd. We can also examine the “Reliability if an item is dropped” part of the output, which lists a raw_alpha/std.alpha next to each column name; this shows you what the alpha would be if you dropped that scale item.
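If that CaseId warning bothers you, one option (a sketch using the same df3 from above) is to drop the ID column before running the function:

# Cronbach's alpha on the scale items only, without the participant ID column
alpha(subset(df3, select = -CaseId))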
Now we can actually make this into a scale! There are two ways to do this: summation or averaging. Which do you think makes more sense here: summation (rowSums) or averaging (rowMeans)?
df3$demFav1 <- rowMeans(subset(df3, select = 2:20),na.rm=T)
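For comparison, the summation version is sketched below (computed as a standalone vector so it doesn’t get added to df3). Averaging is the better fit here: participants rated different numbers of these politicians, so a sum would mostly track how many items someone answered rather than how favorable they felt.

# Summation alternative: sums aren't comparable when people answered different numbers of items
demSum <- rowSums(subset(df3, select = 2:20), na.rm = TRUE)
head(demSum)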
We may also want to reverse code this scale, so that higher values are more favorable. We talked about this last week. How would we do that?
df3$demFav <- 5-df3$demFav1
And then let’s visualize the data using a histogram again! Go back to last week’s code to make this graph a little prettier (there’s also a quick sketch below to get you started).
hist(df3$demFav)
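If you want a head start on prettying it up, base R’s hist() takes the usual title, axis-label, and color arguments (the values below are just suggestions):

# A slightly dressed-up version of the same histogram
hist(df3$demFav,
     main = "Democrat Favorability",
     xlab = "Mean favorability (higher = more favorable)",
     col = "lightblue")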
Let’s say you wanted to make a scale of Republican favorability. How would you do it? Follow the steps above to make such a scale and visualize it! Don’t forget to annotate (leave yourself comments to help you remember what you’re doing at each step).
df4 <- subset(df2, select = c(CaseId, Q2A_A, Q2A_B, Q2A_E))
df4$repFav1 <- rowMeans(subset(df4, select = 2:4),na.rm=T)
df4$repFav <- 5-df4$repFav1
hist(df4$repFav)
The last thing we’ll do on this topic is merge our datasets and save them as a single file in .csv format.
df5 <- merge(df3, df4, by = "CaseId")
df <- subset(df5,select=-c(CaseId,repFav1, demFav1))
write.csv(df,"Republican and Democrat Favorability Scale.csv")
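One optional tweak: by default, write.csv() also writes the row numbers as an unnamed first column. If you don’t want that in the file, set row.names = FALSE:

write.csv(df, "Republican and Democrat Favorability Scale.csv", row.names = FALSE)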
In class, we learned about sampling distributions, and in R, we can play around with some of this by randomly selecting “samples” out of our “population.”
Now we can play with a for loop to see sampling distributions in real time. The code below is complicated, and not important for you to learn, but it is a good way to visualize what you learned in class.
df <- subset(df3, select = demFav) # keep just the scale we made; this is our "population"
df <- na.omit(df)                  # drop the missing values
s <- NA
s <- as.data.frame(s)
s <- s[FALSE, ]                    # empty object that will collect the sample means
reps <- 2000 # How many samples are drawn from the population
N <- 10      # Sample size of the samples drawn from the population
for(i in 1:reps){
  tmp <- df[sample(nrow(df), N), ] # draw a random sample of N rows from the population
  tmp2 <- mean(tmp$demFav)         # compute that sample's mean
  s <- rbind(s, tmp2)              # store the sample mean
}
hist(s,xlim=c(1,4),breaks=20)
s <- NA
s <- as.data.frame(s)
s <- s[FALSE,]
reps <- 10000 #How many samples are drawn from the population
N <- 100 #Sample size of the samples drawn from the population
for(i in 1:reps){
tmp <- df[sample(nrow(df), N), ]
tmp2 <- mean(tmp$demFav)
s <- rbind(s, tmp2)
}
hist(s,xlim=c(1,4),breaks=20)
We can also compare the mean of the sampling distribution:
mean(s)
## [1] 2.332237
To the mean of the “population”:
mean(df$demFav)
## [1] 2.332861
And the population SD:
sd(df$demFav,na.rm=T)
## [1] 0.801605
Similarly, consider the standard deviation of the sampling distribution:
sd(s,na.rm=T)
## [1] 0.07787212
To think about: How is this impacted by the sample size, N? And how is the sampling distribution mean affected by N and/or number of repetitions?
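One way to check your intuition afterwards: the standard deviation of a sampling distribution (the standard error) should be close to the population SD divided by the square root of N. A quick sketch using the objects from the last run:

# Standard error implied by the population SD for samples of size N = 100
sd(df$demFav, na.rm = TRUE) / sqrt(100)  # compare with sd(s) above (~0.078)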