In this activity, you will analyze data from the 2020 wave of the American National Election Studies (ANES), a survey that asks ordinary Americans for their views on a wide range of political issues.
Since this is a nationally representative survey of American adults, each observation is a survey respondent (an individual person who took the survey).
The variables we have in the dataset are:
respondentid: this variable marks individual respondents. We will not use this variable in this activity; however, it is still included to differentiate between respondents.
pid7: respondents’ self-identified party affiliations (also called party ID). 1=“Strong Democrat”, 2=“Weak Democrat”, 3=“Independent leaning Democrat”, 4=“Independent”, 5=“Independent leaning Republican”, 6=“Weak Republican”, 7=“Strong Republican”, 99=“no response”.
feeling_nra: respondents’ evaluation of the National Rifle Association. The variable ranges from 0 (very negative feelings toward the association) to 100 (very positive feelings).
ban_ar: respondents’ degree of support for banning ‘assault-style’ rifles. 1=“Favor”, 2=“Oppose”, 3=“Neither favor nor oppose”.
sex: respondents’ stated gender, 1=male and 2=female.
You should begin by downloading both the dataset and the Lab 3 R Markdown (.Rmd) template to your computer, saving them in the same folder. Then double-click the .Rmd template file to start RStudio.
Load the dataset. Because it is in .RData format (R’s
native data format), you can just use the command
load("anes2020.RData"), which will put the dataset into R’s
workspace as an object named anes2020. You don’t have to
assign it to a name yourself (i.e., you don’t have to use
<- as you would with read.csv or other
commands); R will load it automatically under the name of the
object it was saved from.
You should then attach this object so that R will know to look in this dataset whenever you reference a variable name.
Finally, have R print out all the variable names in the dataset using
the names command.
load("anes2020.RData")
attach(anes2020)
names(anes2020)
## [1] "respondentid" "pid7" "sex" "feeling_nra" "ban_ar"
Make a table of the variable pid7. Note that the value
of 99 corresponds to a missing value. If we do calculations with this
variable, R will inappropriately treat 99s like a real value rather than
missing data. So we’d like to fix that.
Using the recode command in the car
library, create a new variable called pid7.new which is the
same as the original variable except that it recodes values of 99 to
NA but keeps values of 1 through 7 as they were in the
original variable.
(Hint: Remember that you will need to install the car
package if you haven’t already done this. To do this, type
install.packages("car") in the console. This is a rare
time when you should not put a command in your code but should instead
type it in the console in the bottom left after the > prompt.
This is because you only need to install the package once, and then
it will be on your computer. In each new R session where you want to use the
package, type library(car) to load it into your workspace; this command
should go in your source code as usual.) Then
make a table of the original variable against the new one, adding the
option exclude=NULL to show NA values in the
table, in order to make sure the recoding worked as planned.
Finally, calculate the mean and make a histogram of the
pid7.new variable and comment briefly on what you see,
including what values are most common and the overall shape of the
distribution. (Note: the hist command might choose odd bin
divisions but the bars should represent the frequency of values 1, 2, …,
7. You don’t need to worry about making the histogram look pretty
here.)
table(pid7)
## pid7
## 1 2 3 4 5 6 7 99
## 1961 900 975 968 879 832 1730 35
pid7[pid7 == 99] <- NA
table(pid7, useNA = "ifany")
## pid7
## 1 2 3 4 5 6 7 <NA>
## 1961 900 975 968 879 832 1730 35
library(car)
## Loading required package: carData
pid7.new <- recode(pid7, "99=NA")
table(pid7.new, useNA = "ifany")
## pid7.new
## 1 2 3 4 5 6 7 <NA>
## 1961 900 975 968 879 832 1730 35
mean_pid7_new <- mean(pid7.new, na.rm = TRUE)
print(mean_pid7_new)
## [1] 3.887811
hist(pid7.new, main = "Histogram of pid7.new", xlab = "Party ID", breaks = 7)
With 1 being “Strong Democrat” and 7 being “Strong Republican”, a mean party ID of about 3.89 sits just below the scale midpoint of 4, so on average the sample shows at most a slight Democratic lean. The histogram, however, shows that the two extremes are the most common values: there are more Strong Democrats (1) and Strong Republicans (7) than respondents in any of the middle categories, which gives the distribution a U shape.
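The recoding check requested above (a table of the original variable against the new one with exclude=NULL) can be sketched as follows. This is a minimal illustration on a made-up vector, not the ANES data: pid7_orig and pid7_new are hypothetical names, and base R's ifelse stands in for car's recode so the snippet runs without extra packages. Note that the check only works if the original variable has not already been overwritten (as pid7[pid7 == 99] <- NA does), because otherwise the 99s are gone before the comparison.

```r
# Toy stand-in for the pid7 column (hypothetical values, not the ANES data)
pid7_orig <- c(1, 2, 3, 4, 5, 6, 7, 99, 1, 7, 99)

# Base-R equivalent of recode(pid7, "99=NA"): 99 becomes NA, 1-7 stay as they are
pid7_new <- ifelse(pid7_orig == 99, NA, pid7_orig)

# Cross-tabulate original against recoded values; exclude = NULL keeps NA visible
check <- table(pid7_orig, pid7_new, exclude = NULL)
print(check)
```

If the recoding worked, every original value from 1 to 7 lines up with the same new value, and all of the 99s fall in the NA column.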
The variable feeling_nra gives respondents’ evaluations
of the National Rifle Association. It is measured as a “feeling
thermometer” where 0 denotes the most negative evaluations and 100 the
most positive evaluations. In other words, higher values represent more
positive feelings toward the association and lower values represent more
negative feelings.
Make a histogram of the variable feeling_nra and then
make a boxplot of feeling_nra (on the vertical axis)
against pid7.new. Briefly comment on what you see in both
the histogram and the boxplot, including the typical evaluations of the
NRA of respondents with different party identifications.
hist(feeling_nra, main = "Histogram of Feeling Thermometer for NRA",
xlab = "Feeling Thermometer Score", breaks = 20)
boxplot(feeling_nra ~ pid7.new, main = "Boxplot of Feeling Thermometer for NRA by Party ID",
xlab = "Party ID", ylab = "Feeling Thermometer for NRA",
names = c("Strong Dem", "Weak Dem", "Lean Dem", "Independent",
"Lean Rep", "Weak Rep", "Strong Rep"))
The histogram shows that respondents’ ratings of the NRA pile up at the extremes: there are a lot of 0s and a lot of 100s, without many scores in the middle. The boxplot shows that Democrats give the NRA much lower ratings than Republicans do, with Strong Republicans giving it the highest ratings.
Now, estimate the average evaluation of the National Rifle Association (that is, calculate the sample mean).
Then, construct a 95% confidence interval for this mean in two ways:
(1) calculating it yourself based on the mean and standard deviation of
this variable (which you can calculate using the mean and
sd functions) and (2) using the
t.test function. (Note: these two ways might give very
slightly different values, but should be identical to at least a couple
of decimal places. Also note there are no missing values in this variable,
so length(feeling_nra) will give you the sample size \(N\).)
mean_nra <- mean(feeling_nra)
print(mean_nra)
## [1] 56.42234
sd_nra <- sd(feeling_nra)
print(sd_nra)
## [1] 36.50145
n <- length(feeling_nra)
print(n)
## [1] 8280
error_margin <- qt(0.975, df=n-1) * (sd_nra / sqrt(n))
lower_bound <- mean_nra - error_margin
upper_bound <- mean_nra + error_margin
cat("95% CI (manual calculation):", lower_bound, "to", upper_bound, "\n")
## 95% CI (manual calculation): 55.63601 to 57.20868
t_test_result <- t.test(feeling_nra, conf.level = 0.95)
cat("95% CI (t.test):", t_test_result$conf.int[1], "to", t_test_result$conf.int[2], "\n")
## 95% CI (t.test): 55.63601 to 57.20868
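For reference, the manual calculation above corresponds to the usual t-based interval for a mean. The worked form below plugs in the printed mean, standard deviation, and sample size; the critical value of about 1.960 is an approximation for 8279 degrees of freedom, and the margin is rounded.

```latex
\[
\bar{x} \pm t_{0.975,\,N-1}\,\frac{s}{\sqrt{N}}
= 56.42234 \pm t_{0.975,\,8279}\cdot\frac{36.50145}{\sqrt{8280}}
\approx 56.42 \pm 1.960 \times 0.4011
\approx 56.42 \pm 0.79
\]
```

This agrees with the printed interval of 55.63601 to 57.20868 up to rounding.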
The variable ban_ar gives respondents’ views on the
following question: “Do you favor, oppose, or neither favor nor oppose
banning the sale of semi-automatic “assault-style” rifles?” where the
value of 1 indicates a response of “Favor,” a value of 2 indicates a
response of “Oppose,” and a value of 3 indicates “Neither favor nor
oppose”.
First, create a new variable called ban_ar.new that is 1
for the “Favor” response and 0 for the “Oppose” and “Neither favor nor
oppose” responses. (Hint: you can use the recode function
as above and code the values 2 and 3 as 0.)
Next, make a table of this new variable.
Finally, calculate the proportion of respondents giving the “Favor”
response and calculate a 95% confidence interval for this proportion
separately in two ways: (1) calculating it yourself using the sample
proportion (Hint: You need to calculate the mean of the new variable.
Since the variable includes NA values, you need to type in the
na.rm=TRUE option.) and (2) using the
prop.test function. (Hint: These two ways should give
nearly identical answers, but they might differ after the first couple of decimal
places. Also note that because this variable includes NA values,
length(ban_ar.new) counts the missing responses too; the sample size N is the
number of non-missing values, which sum(!is.na(ban_ar.new)) will give you.)
Briefly comment on what you learned from this, in particular what the estimated proportion tells you in this context and whether we can be confident about what the majority position of Americans is on this question.
library(car)
ban_ar.new <- recode(ban_ar, "1=1; 2=0; 3=0")
table(ban_ar.new)
## ban_ar.new
## 0 1
## 3395 3990
prop_favor <- mean(ban_ar.new, na.rm = TRUE)
print(prop_favor)
## [1] 0.5402844
n_ban_ar <- length(ban_ar.new)
print(n_ban_ar)
## [1] 8280
error_margin_prop <- qnorm(0.975) * sqrt((prop_favor * (1 - prop_favor)) / n_ban_ar)
lower_bound_prop <- prop_favor - error_margin_prop
upper_bound_prop <- prop_favor + error_margin_prop
cat("95% CI (manual calculation):", lower_bound_prop, "to", upper_bound_prop, "\n")
## 95% CI (manual calculation): 0.5295497 to 0.551019
prop_test_result <- prop.test(sum(ban_ar.new, na.rm = TRUE), n_ban_ar, conf.level = 0.95)
cat("95% CI (prop.test):", prop_test_result$conf.int[1], "to", prop_test_result$conf.int[2], "\n")
## 95% CI (prop.test): 0.471072 to 0.492713
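The sample proportion favoring a ban on the sale of semi-automatic “assault-style” rifles is about 54.0% (3990 of the 7385 respondents who answered). The manual calculation gives a 95% confidence interval of 52.9% to 55.1%, while the prop.test call gives 47.1% to 49.3%. The two disagree because n_ban_ar comes from length(ban_ar.new), which counts the 895 missing responses: the prop.test call divides the 3990 “Favor” responses by all 8280 respondents (about 48.2%), whereas the mean with na.rm=TRUE divides by the 7385 who answered. Using the non-missing count in both calculations, the interval is roughly 52.9% to 55.2%, which lies entirely above 50%. This suggests that a modest majority of American adults favor the ban, although “Oppose” and “Neither favor nor oppose” are combined in the comparison group, so it does not follow that everyone else actively opposes it.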
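The effect of the denominator can be checked with a short sketch that uses only the non-missing responses in both calculations. It rebuilds a stand-in vector from the counts printed in the table above rather than loading the data; ban_ar_demo, n_valid, and n_favor are hypothetical names introduced for illustration, and the 895 missing values are inferred from 8280 minus 7385.

```r
# Stand-in for ban_ar.new built from the printed counts:
# 3990 "Favor" (1), 3395 "Oppose"/"Neither" (0), and 8280 - 7385 = 895 NA
ban_ar_demo <- c(rep(1, 3990), rep(0, 3395), rep(NA, 895))

n_valid <- sum(!is.na(ban_ar_demo))        # non-missing responses only
n_favor <- sum(ban_ar_demo, na.rm = TRUE)  # number of "Favor" responses
p_hat   <- n_favor / n_valid               # same value as mean(..., na.rm = TRUE)

# Normal-approximation interval using the matching denominator
margin    <- qnorm(0.975) * sqrt(p_hat * (1 - p_hat) / n_valid)
ci_manual <- c(p_hat - margin, p_hat + margin)

# prop.test with the same successes and the same denominator
ci_prop <- prop.test(n_favor, n_valid, conf.level = 0.95)$conf.int

cat("N (non-missing):", n_valid, "\n")
cat("95% CI (manual):", ci_manual, "\n")
cat("95% CI (prop.test):", ci_prop, "\n")
```

With the same denominator in both places, the two intervals should agree to about two decimal places; any remaining gap comes from prop.test using a score interval with a continuity correction rather than the plain normal approximation.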
Make a table of ban_ar.new against sex by
typing table(ban_ar.new, sex).
Next calculate sample proportions and 95% confidence intervals for
the proportions of men and women who supported banning the sale of
assault rifles. You can just do this using the prop.test function
and don’t have to calculate it manually. Then comment briefly on what
you learn from this about possible differences in support for banning
assault rifle sales by gender, including both estimates and confidence
intervals.
table(ban_ar.new, sex)
## sex
## ban_ar.new 1 2
## 0 1751 1617
## 1 1598 2370
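The table labels the columns 1 and 2 rather than “male” and “female” because sex is stored as a numeric code (1=male, 2=female). Reading the columns that way, 1598 of 3349 men (about 47.7%) and 2370 of 3987 women (about 59.4%) favored the ban. With samples this large, the normal-approximation margin of error is only about 1.7 percentage points for men and 1.5 for women, so the gap of roughly 12 points is much larger than sampling error alone would explain. Women appear more supportive of banning assault rifle sales than men.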
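The proportions and 95% confidence intervals for each group can be obtained with prop.test, sketched here from the counts in the table above rather than from the data file. The names favor, total, and results are hypothetical, and the mapping of column 1 to men and column 2 to women follows the variable description (1=male, 2=female).

```r
# Counts read off the table above (sex: 1 = male, 2 = female)
favor <- c(men = 1598, women = 2370)                # ban_ar.new == 1
total <- c(men = 1751 + 1598, women = 1617 + 2370)  # all non-missing responses

# Run prop.test separately for each group and keep the estimate and interval
results <- lapply(names(favor), function(g) {
  res <- prop.test(favor[[g]], total[[g]], conf.level = 0.95)
  c(estimate = unname(res$estimate),
    lower = res$conf.int[1],
    upper = res$conf.int[2])
})
names(results) <- names(favor)

for (g in names(results)) {
  cat(g, ": proportion =", results[[g]]["estimate"],
      " 95% CI =", results[[g]]["lower"], "to", results[[g]]["upper"], "\n")
}
```

If the two intervals do not overlap, that supports the reading that the difference between men and women is not just sampling noise.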