Sometimes, we collect data that we need to make into scales, such as multiple questions that get at the same construct. When this happens, we want to combine the data in some way, maybe through summation or maybe through averaging. We’ll go over both here.
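Just to make the two options concrete before we touch the real data, here is a tiny made-up example (the toy data frame below is hypothetical, not from the survey):

# Hypothetical example: three items measuring the same construct for four people
toy <- data.frame(item1 = c(1, 2, 4, 3),
                  item2 = c(2, 2, 4, 4),
                  item3 = c(1, 3, 3, 4))
rowSums(toy)   # summation: one total score per person
rowMeans(toy)  # averaging: one mean score per person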
But first, let’s bring in our data. We’re working with an .sav file, so we need the haven package loaded. You should already have it installed since we worked with an .sav file last week, so you just need to load the library.
library(haven)
prri <- read_sav("PRRI-2019-American-Values-Survey-1.sav")
And let’s look at the variables they have here: https://www.prri.org/wp-content/uploads/2020/11/2019-AVS-Topline-Full.pdf We’re going to work with the subset of questions that stem from Q2. Q2 (in its many forms) asks about opinions on political figures: some Democrats, some Republicans, and some who are more Independent. We might not be interested in participants’ opinions of individual politicians, but rather in Democrats or Republicans more broadly.
Let’s first select the columns of interest:

Q2A_A: Donald Trump, R
Q2A_B: Bill Weld, R
Q2A_C: Nancy Pelosi, D
Q2A_D: Alexandria Ocasio-Cortez, D
Q2A_E: Mitch McConnell, R
Q2A_F: Barack Obama, D
Q2B_A: Joe Biden, D
Q2B_B: Cory Booker, D
Q2B_C: Pete Buttigieg, D
Q2B_D: Kamala Harris, D
Q2B_E: Bernie Sanders, D
Q2B_F: Elizabeth Warren, D
Q2B_G: Steve Bullock, D
Q2B_H: Julian Castro, D
Q2B_I: Bill de Blasio, D
Q2B_J: Tulsi Gabbard, D
Q2B_K: Kirsten Gillibrand, D
Q2B_L: Amy Klobuchar, D
Q2B_M: Beto O’Rourke, D
Q2B_N: Tom Steyer, D
Q2B_O: Marianne Williamson, D
Q2B_P: Andrew Yang, D
Last week we went over multiple ways of selecting specific columns. For today, we’re just going to use bracket indexing with the column numbers (a grep() approach on the column names would also work; see the sketch after the next chunk). The Q2 items span columns 5 to 26, so let’s use that to subset into a politician dataset. We also want to keep column 1, since that contains the participants’ IDs.
df1 <- prri[,c(1, 5:26)]
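If you’d rather not count column positions, a grep()-style alternative is sketched below. It assumes the Q2 items are the only columns whose names start with Q2A_ or Q2B_, so double-check names(prri) before relying on it (df1_alt is just an illustration name):

# Same subset, matching column names instead of hard-coding positions
df1_alt <- prri[, c(1, grep("^Q2A_|^Q2B_", names(prri)))]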
You may remember from last week that PRRI codes “Don’t Know/Refuse” as 9s and “Have not heard of” as 5s. For example, if we look at the frequencies for Amy Klobuchar:
table(df1$Q2B_L)
##
## 1 2 3 4 5 9
## 40 209 198 165 575 45
A good number of people haven’t heard of her!
So let’s recode her 5s and 9s (and everyone else’s) as NAs. Remember that for this, we use the dplyr package.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df2 <- na_if(df1, 9)
df2 <- na_if(df2, 5)
#And let's check Klobuchar again:
table(df2$Q2B_L)
##
## 1 2 3 4
## 40 209 198 165
Yay, no 5s or 9s! And this is true for other columns too:
table(df2$Q2B_G)
##
## 1 2 3 4
## 16 121 191 154
Because we didn’t specify a particular column in the na_if() call, the recoding was applied to the whole dataset.
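If you only wanted to recode one column rather than the whole dataset, you could point na_if() at that column instead; a minimal sketch (done on a copy so it doesn’t disturb the objects we’re actually using):

df1_copy <- df1
df1_copy$Q2B_L <- na_if(df1_copy$Q2B_L, 9)  # recode just Klobuchar's 9s
df1_copy$Q2B_L <- na_if(df1_copy$Q2B_L, 5)  # and her 5s
table(df1_copy$Q2B_L)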
Let’s start by dropping the Republican columns from our data. Looking at the questions above, we can see that only three of them asked about a Republican politician, so those are the ones we’ll remove. We will use the subset() function since we’re only removing three columns.
df3 <- subset(df2, select=-c(Q2A_A,Q2A_B,Q2A_E))
And for the sake of it, let’s visualize our missingness:
library(naniar)
vis_miss(df3)
## Warning: `gather_()` was deprecated in tidyr 1.2.0.
## Please use `gather()` instead.
There’s quite a bit of missingness in this data! According to PRRI, they only gave some questions to half of their participants, so that makes some sense (not to mention, people don’t always know politicians!).
Now, remember that we don’t care about ratings of individual politicians; instead, we want to make a scale of feelings toward Democrats. So, in this particular case, we don’t have to worry much about the missing data for now.
But before we create our scale, let’s talk about alpha, and not the 0.05 kind! This is Cronbach’s alpha, which tells us (simply put) whether our scale is internally consistent. Typically, alpha > 0.7 indicates a good scale, though there may be times when you use a scale that doesn’t reach that threshold. Computing it is simple as long as every column in your data frame relates to the same construct. There are a couple of ways to get Cronbach’s alpha, but thankfully, the psych package contains an alpha() function. Our df3 has all the variables we care about, so we just feed that data frame to the function. Don’t forget to install the psych package if you haven’t already.
library(psych)
alpha(df3)
## Warning in alpha(df3): Some items were negatively correlated with the total scale and probably
## should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
## Some items ( CaseId ) were negatively correlated with the total scale and
## probably should be reversed.
## To do this, run the function again with the 'check.keys=TRUE' option
##
## Reliability analysis
## Call: alpha(x = df3)
##
## raw_alpha std.alpha G6(smc) average_r S/N ase mean sd median_r
## -0.00013 0.96 0.97 0.55 24 0.00023 495 419 0.62
##
## 95% confidence boundaries
## lower alpha upper
## Feldt -0.06 0 0.06
## Duhachek 0.00 0 0.00
##
## Reliability if an item is dropped:
## raw_alpha std.alpha G6(smc) average_r S/N alpha se var.r med.r
## CaseId 9.7e-01 0.97 0.98 0.61 30 0.00094 0.0084 0.63
## Q2A_C -1.1e-04 0.96 0.97 0.55 23 0.00021 0.0444 0.62
## Q2A_D -1.3e-04 0.96 0.97 0.55 23 0.00021 0.0447 0.62
## Q2A_F -1.2e-04 0.96 0.97 0.55 23 0.00021 0.0445 0.62
## Q2B_A -1.1e-04 0.96 0.97 0.55 23 0.00021 0.0451 0.62
## Q2B_B -1.3e-04 0.96 0.97 0.54 22 0.00021 0.0436 0.61
## Q2B_C -1.3e-04 0.96 0.97 0.54 23 0.00021 0.0440 0.61
## Q2B_D -9.9e-05 0.96 0.97 0.54 23 0.00021 0.0436 0.61
## Q2B_E -1.4e-04 0.96 0.97 0.55 23 0.00021 0.0452 0.62
## Q2B_F -1.1e-04 0.96 0.97 0.54 23 0.00021 0.0435 0.61
## Q2B_G -1.5e-04 0.96 0.97 0.55 23 0.00022 0.0443 0.62
## Q2B_H -1.2e-04 0.96 0.97 0.54 22 0.00021 0.0439 0.61
## Q2B_I -1.0e-04 0.96 0.97 0.54 23 0.00022 0.0442 0.62
## Q2B_J -1.2e-04 0.96 0.97 0.55 23 0.00022 0.0446 0.62
## Q2B_K -1.4e-04 0.96 0.97 0.54 22 0.00021 0.0442 0.61
## Q2B_L -1.2e-04 0.96 0.97 0.54 22 0.00021 0.0437 0.61
## Q2B_M -1.5e-04 0.96 0.97 0.54 22 0.00021 0.0443 0.61
## Q2B_N -1.8e-04 0.96 0.97 0.54 22 0.00021 0.0443 0.61
## Q2B_O -1.7e-04 0.96 0.97 0.56 24 0.00022 0.0447 0.63
## Q2B_P -8.9e-05 0.96 0.97 0.55 23 0.00021 0.0444 0.62
##
## Item statistics
## n raw.r std.r r.cor r.drop mean sd
## CaseId 2527 0.68750 0.053 -0.0093 -0.01415 4518.5 2704.59
## Q2A_C 2451 0.03422 0.783 0.7705 -0.03640 2.8 0.96
## Q2A_D 1177 0.06926 0.753 0.7429 0.00045 2.8 1.02
## Q2A_F 2502 -0.00884 0.773 0.7611 -0.01235 2.2 1.12
## Q2B_A 2396 -0.04723 0.737 0.7208 -0.02308 2.6 1.00
## Q2B_B 1703 -0.01519 0.852 0.8479 -0.00642 2.8 0.96
## Q2B_C 1603 -0.02363 0.828 0.8223 0.00151 2.7 1.01
## Q2B_D 1903 -0.04203 0.820 0.8133 -0.04306 2.8 1.02
## Q2B_E 2399 -0.02459 0.758 0.7442 0.00671 2.6 1.09
## Q2B_F 2069 -0.02513 0.833 0.8302 -0.03073 2.6 1.09
## Q2B_G 482 0.01704 0.791 0.7903 0.02212 3.0 0.84
## Q2B_H 718 0.00084 0.843 0.8383 -0.02194 2.8 0.95
## Q2B_I 756 -0.07493 0.804 0.8005 -0.04286 3.0 0.84
## Q2B_J 566 -0.04923 0.720 0.7108 -0.02171 2.9 0.92
## Q2B_K 700 -0.01722 0.838 0.8334 0.01044 2.8 0.88
## Q2B_L 612 -0.03107 0.841 0.8419 -0.01147 2.8 0.91
## Q2B_M 835 -0.01679 0.839 0.8319 0.02475 2.8 0.98
## Q2B_N 517 0.04761 0.844 0.8430 0.06166 2.9 0.89
## Q2B_O 609 0.04511 0.624 0.5994 0.05655 2.9 0.84
## Q2B_P 665 -0.06935 0.787 0.7764 -0.05853 2.7 0.95
##
## Non missing response frequency for each item
## 1 2 3 4 miss
## Q2A_C 0.08 0.33 0.28 0.31 0.03
## Q2A_D 0.11 0.30 0.25 0.34 0.53
## Q2A_F 0.36 0.29 0.16 0.19 0.01
## Q2B_A 0.15 0.38 0.25 0.23 0.05
## Q2B_B 0.08 0.36 0.26 0.29 0.33
## Q2B_C 0.14 0.33 0.27 0.26 0.37
## Q2B_D 0.11 0.34 0.23 0.32 0.25
## Q2B_E 0.18 0.34 0.19 0.29 0.05
## Q2B_F 0.17 0.32 0.20 0.30 0.18
## Q2B_G 0.03 0.25 0.40 0.32 0.81
## Q2B_H 0.08 0.36 0.28 0.28 0.72
## Q2B_I 0.03 0.25 0.37 0.35 0.70
## Q2B_J 0.06 0.31 0.33 0.30 0.78
## Q2B_K 0.03 0.38 0.30 0.28 0.72
## Q2B_L 0.07 0.34 0.32 0.27 0.76
## Q2B_M 0.09 0.33 0.27 0.30 0.67
## Q2B_N 0.05 0.30 0.34 0.31 0.80
## Q2B_O 0.04 0.27 0.40 0.29 0.76
## Q2B_P 0.09 0.35 0.29 0.26 0.74
Now THAT is a high alpha, at least by the std.alpha of 0.96! (Good for stats, maybe not good for the nuance in how people view politicians of the same political party.) Note the warning about CaseId: we left the participant ID column in the data frame, and since it isn’t a real scale item it throws off the raw_alpha, which is why that overall number looks so odd. We can also examine the “Reliability if an item is dropped” part of the output, which lists a raw_alpha/std.alpha next to each column name; this shows you what the alpha would be if you dropped that scale item.
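If that CaseId warning bothers you, one option (a sketch using the same df3 from above) is to drop the ID column before running the function:

# Cronbach's alpha on the scale items only, without the participant ID column
alpha(subset(df3, select = -CaseId))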
Now we can actually make this into a scale! There are two ways to do this: summation or averaging. Which do you think makes more sense here: summation (rowSums) or averaging (rowMeans)?
df3$demFav1 <- rowMeans(subset(df3, select = 2:20),na.rm=T)
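For comparison, the summation version is sketched below (computed as a standalone vector so it doesn’t get added to df3). Averaging is the better fit here: participants rated different numbers of these politicians, so a sum would mostly track how many items someone answered rather than how favorable they felt.

# Summation alternative: sums aren't comparable when people answered different numbers of items
demSum <- rowSums(subset(df3, select = 2:20), na.rm = TRUE)
head(demSum)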
We may also want to reverse code this scale, so that higher values are more favorable. We talked about this last week. How would we do that?
df3$demFav <- 5-df3$demFav1
And then let’s visualize the data using a histogram again! Go back to last week’s code to make this graph a little prettier (there’s also a quick sketch below to get you started).
hist(df3$demFav)
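If you want a head start on prettying it up, base R’s hist() takes the usual title, axis-label, and color arguments (the values below are just suggestions):

# A slightly dressed-up version of the same histogram
hist(df3$demFav,
     main = "Democrat Favorability",
     xlab = "Mean favorability (higher = more favorable)",
     col = "lightblue")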
Let’s say you wanted to make a scale of Republican favorability. How would you do it? Follow the steps above to make such a scale and visualize it! Don’t forget to annotate (leave yourself comments to help you remember what you’re doing at each step).
df4 <- subset(df2, select = c(CaseId, Q2A_A, Q2A_B, Q2A_E))
df4$repFav1 <- rowMeans(subset(df4, select = 2:4),na.rm=T)
df4$repFav <- 5-df4$repFav1
hist(df4$repFav)
The last thing we’ll do on this topic is merge our datasets and save them as a single file in .csv format.
df5 <- merge(df3, df4, by = "CaseId")
df <- subset(df5,select=-c(CaseId,repFav1, demFav1))
write.csv(df,"Republican and Democrat Favorability Scale.csv")
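One optional tweak: by default, write.csv() also writes the row numbers as an unnamed first column. If you don’t want that in the file, set row.names = FALSE:

write.csv(df, "Republican and Democrat Favorability Scale.csv", row.names = FALSE)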
In class, we learned about sampling distributions, and in R, we can play around with some of this by randomly selecting “samples” out of our “population.”
Now we can play with a for loop to see sampling distributions in real time. The code below is complicated, and not important for you to learn, but it is a good way to visualize what you learned in class.
df <- subset(df3, select = demFav) # keep just the scale we made; this is our "population"
df <- na.omit(df)                  # drop the missing values
s <- NA
s <- as.data.frame(s)
s <- s[FALSE, ]                    # empty object that will collect the sample means
reps <- 2000 # How many samples are drawn from the population
N <- 10      # Sample size of the samples drawn from the population
for(i in 1:reps){
  tmp <- df[sample(nrow(df), N), ] # draw a random sample of N rows from the population
  tmp2 <- mean(tmp$demFav)         # compute that sample's mean
  s <- rbind(s, tmp2)              # store the sample mean
}
hist(s,xlim=c(1,4),breaks=20)
s <- NA
s <- as.data.frame(s)
s <- s[FALSE,]
reps <- 10000 #How many samples are drawn from the population
N <- 100 #Sample size of the samples drawn from the population
for(i in 1:reps){
tmp <- df[sample(nrow(df), N), ]
tmp2 <- mean(tmp$demFav)
s <- rbind(s, tmp2)
}
hist(s,xlim=c(1,4),breaks=20)
We can also compare the mean of the sampling distribution:
mean(s)
## [1] 2.332237
To the mean of the “population”:
mean(df$demFav)
## [1] 2.332861
And the population SD:
sd(df$demFav,na.rm=T)
## [1] 0.801605
Similarly, consider the standard deviation of the sampling distribution:
sd(s,na.rm=T)
## [1] 0.07787212
To think about: How is this impacted by the sample size, N? And how is the sampling distribution mean affected by N and/or number of repetitions?
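One way to check your intuition afterwards: the standard deviation of a sampling distribution (the standard error) should be close to the population SD divided by the square root of N. A quick sketch using the objects from the last run:

# Standard error implied by the population SD for samples of size N = 100
sd(df$demFav, na.rm = TRUE) / sqrt(100)  # compare with sd(s) above (~0.078)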