Due by 11:00pm on 10/4, submitted through Canvas

In this activity, you will analyze data from the 2020 wave of the American National Election Studies (ANES), a survey that asks ordinary Americans for their views on a wide range of political issues.

Since this is a nationally representative survey of American adults, each observation is a survey respondent (an individual person who took the survey).

The variables we have in the dataset are:

You should begin by downloading both the dataset and the Lab 3 R Markdown (.Rmd) template to your computer, saving them in the same folder. Then double-click the .Rmd template file to start RStudio.

Question 1: Loading and Exploring the Dataset

Load the dataset. Because it is in .RData format (R’s native data format), you will just use the command load("anes2020.RData"), which will put the dataset into R’s workspace as an object named anes2020. You don’t have to assign it to a name yourself (i.e. you don’t have to use <- as you would with read.csv or other commands); R will load it automatically under the name of the object it was saved from.

You should then attach this object so that R will know to look in this dataset whenever you reference a variable name.

Finally, have R print out all the variable names in the dataset using the names command.

load("anes2020.RData")
attach(anes2020)
names(anes2020)
## [1] "respondentid" "pid7"         "sex"          "feeling_nra"  "ban_ar"
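As a minimal sketch of how load() restores objects under their saved names, the example below saves a small made-up data frame to a temporary .RData file and loads it back. The object name toy and its columns are hypothetical and are not part of the ANES data.

```r
# Hypothetical toy data frame standing in for anes2020 (not real ANES data)
toy <- data.frame(id = 1:3, pid7 = c(1, 7, 99))

# Save it in .RData format to a temporary file, then remove it from the workspace
path <- tempfile(fileext = ".RData")
save(toy, file = path)
rm(toy)

# load() restores the object under the name it was saved with;
# it also returns (invisibly) the names of the objects it restored
loaded <- load(path)
print(loaded)
print(names(toy))
```

Capturing the return value of load() is a convenient way to check which object names a .RData file contains when they are not known in advance.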

Question 2: Recoding Party ID Variable

Make a table of the variable pid7. Note that the value of 99 corresponds to a missing value. If we do calculations with this variable, R will inappropriately treat 99s like a real value rather than missing data. So we’d like to fix that.

Using the recode command in the car library, create a new variable called pid7.new which is the same as the original variable except that it recodes values of 99 to NA but keeps values of 1 through 7 as they were in the original variable.

(Hint: Remember that you will need to install the car package if you haven’t already done this. To do this, type install.packages("car") in the Console pane in the bottom left of RStudio, after the > prompt. This is a rare case where the command should not go in your code, because you only need to install the package once and then it will be on your computer. In each new R session where you want to use the package, type library(car) to load it; this command should go in your source code as usual.)

Then make a table of the original variable against the new one, adding the option exclude=NULL to show NA values in the table, to make sure the recoding worked as planned.

Finally, calculate the mean and make a histogram of the pid7.new variable and comment briefly on what you see, including what values are most common and the overall shape of the distribution. (Note: the hist command might choose odd bin divisions but the bars should represent the frequency of values 1, 2, …, 7. You don’t need to worry about making the histogram look pretty here.)

table(pid7)
## pid7
##    1    2    3    4    5    6    7   99 
## 1961  900  975  968  879  832 1730   35
pid7[pid7 == 99] <- NA
table(pid7, useNA = "ifany")
## pid7
##    1    2    3    4    5    6    7 <NA> 
## 1961  900  975  968  879  832 1730   35
library(car)
## Loading required package: carData
pid7.new <- recode(pid7, "99=NA")
table(pid7.new, useNA = "ifany")
## pid7.new
##    1    2    3    4    5    6    7 <NA> 
## 1961  900  975  968  879  832 1730   35
mean_pid7_new <- mean(pid7.new, na.rm = TRUE)
print(mean_pid7_new)
## [1] 3.887811
hist(pid7.new, main = "Histogram of pid7.new", xlab = "Party ID", breaks = 7)

With 1 being Strong Democrat and 7 being Strong Republican, the mean Party ID of about 3.89 sits near the midpoint of the scale (4), with only a slight lean toward the Democratic side. The histogram, however, shows that the distribution is U-shaped rather than centered: the most common values are 1 (1961 respondents) and 7 (1730 respondents), and each of the middle categories has fewer than 1000 respondents.
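The check the question asks for, a table of the original variable against the recoded one with exclude=NULL, can be sketched on a small made-up vector. The vector pid7.toy is hypothetical. The sketch uses car::recode when the car package is installed and otherwise falls back to base R replace(), which is an assumption made only so the example runs without car. Note that with an attached dataset, an assignment such as pid7[pid7 == 99] <- NA creates a separate copy of pid7 in the workspace rather than changing anes2020, so the cross-table is only informative if the original values are left untouched.

```r
# Hypothetical toy values on the 1-7 party ID scale, with 99 as the missing code
pid7.toy <- c(1, 2, 7, 99, 4, 99, 7, 1)

# Recode 99 to NA and leave 1 through 7 unchanged.
# Assumption: fall back to base R if the car package is not installed.
if (requireNamespace("car", quietly = TRUE)) {
  pid7.toy.new <- car::recode(pid7.toy, "99=NA")
} else {
  pid7.toy.new <- replace(pid7.toy, pid7.toy == 99, NA)
}

# Cross-tabulate original against recoded values; exclude = NULL keeps NA visible.
# Every 99 should land in the NA column and every other value on the diagonal.
check <- table(pid7.toy, pid7.toy.new, exclude = NULL)
print(check)

# The mean now ignores the missing code instead of averaging in 99s
print(mean(pid7.toy.new, na.rm = TRUE))
```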

Question 3: Examining Association Between Variables

The variable feeling_nra gives respondents’ evaluations of the National Rifle Association. It is measured as a “feeling thermometer” where 0 denotes the most negative evaluations and 100 the most positive evaluations. In other words, higher values represent more positive feelings toward the association and lower values represent more negative feelings.

Make a histogram of the variable feeling_nra and then make a boxplot of feeling_nra (on the vertical axis) against pid7.new. Briefly comment on what you see in both the histogram and the boxplot, including the typical evaluations of the NRA of respondents with different party identifications.

hist(feeling_nra, main = "Histogram of Feeling Thermometer for NRA", 
     xlab = "Feeling Thermometer Score", breaks = 20)

boxplot(feeling_nra ~ pid7.new, main = "Boxplot of Feeling Thermometer for NRA by Party ID",
        xlab = "Party ID", ylab = "Feeling Thermometer for NRA", 
        names = c("Strong Dem", "Weak Dem", "Lean Dem", "Independent", 
                  "Lean Rep", "Weak Rep", "Strong Rep"))

The histogram shows that respondents’ ratings of the NRA pile up at 0 and at 100, with relatively few scores in the middle. The boxplot shows that Democrats give the NRA much lower ratings than Republicans do, and that strong Republicans give it much higher ratings.

Question 4: Estimating a mean

Now, estimate the average evaluation of the National Rifle Association (that is, calculate the sample mean).

Then, construct a 95% confidence interval for this mean in two ways: (1) calculating it yourself based on the mean and standard deviation of this variable (which you can calculate with the mean and sd functions) and (2) using the t.test function. (Note: these two ways might give very slightly different values, but should be identical to at least a couple of decimal places. Also note there are no missing values in this variable, so length(feeling_nra) will give you the sample size \(N\).)

mean_nra <- mean(feeling_nra)
print(mean_nra)
## [1] 56.42234
sd_nra <- sd(feeling_nra)
print(sd_nra)
## [1] 36.50145
n <- length(feeling_nra)
print(n)
## [1] 8280
error_margin <- qt(0.975, df=n-1) * (sd_nra / sqrt(n))
lower_bound <- mean_nra - error_margin
upper_bound <- mean_nra + error_margin
cat("95% CI (manual calculation):", lower_bound, "to", upper_bound, "\n")
## 95% CI (manual calculation): 55.63601 to 57.20868
t_test_result <- t.test(feeling_nra, conf.level = 0.95)
cat("95% CI (t.test):", t_test_result$conf.int[1], "to", t_test_result$conf.int[2], "\n")
## 95% CI (t.test): 55.63601 to 57.20868
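The manual interval is the sample mean plus or minus a critical value times the standard error, sd / sqrt(N). As a rough sketch, the calculation can be reproduced from the three summary numbers printed above (the mean, standard deviation, and N), without the raw data. The variable names se, t_crit, ci_t, and ci_z are introduced here for illustration only.

```r
# Summary statistics copied from the output above
mean_nra <- 56.42234
sd_nra   <- 36.50145
n        <- 8280

# Standard error of the mean
se <- sd_nra / sqrt(n)

# Critical value from the t distribution with n - 1 degrees of freedom
t_crit <- qt(0.975, df = n - 1)

# 95% confidence interval: mean +/- critical value * standard error
ci_t <- mean_nra + c(-1, 1) * t_crit * se
print(ci_t)

# With N this large, the normal critical value gives almost the same interval
ci_z <- mean_nra + c(-1, 1) * qnorm(0.975) * se
print(ci_z)
```

Because the t distribution approaches the normal as the degrees of freedom grow, the choice between qt and qnorm barely matters at this sample size.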

Question 5: Estimating a proportion

The variable ban_ar gives respondents’ views on the following question: “Do you favor, oppose, or neither favor nor oppose banning the sale of semi-automatic “assault-style” rifles?” where the value of 1 indicates a response of “Favor,” a value of 2 indicates a response of “Oppose,” and a value of 3 indicates “Neither favor nor oppose”.

First, create a new variable called ban_ar.new that is 1 for the “Favor” response and 0 for the “Oppose” and “Neither favor nor oppose” responses. (Hint: you can use the recode function as above and code the values 2 and 3 as 0.)

Next, make a table of this new variable.

Finally, calculate the proportion of respondents giving the “Favor” response and calculate a 95% confidence interval for this proportion separately in two ways: (1) calculating it yourself using the sample proportion (Hint: You need to calculate the mean of the new variable. Since the variable includes NA values, you need to type in the na.rm=TRUE option.) and (2) using the prop.test function. (Hint: These two ways should give nearly identical answers, but they might differ after the first couple of decimal places. Also note that because this variable has missing values, length(ban_ar.new) counts them too; sum(!is.na(ban_ar.new)) gives the sample size N of respondents who answered.)

Briefly comment on what you learned from this, in particular what the estimated proportion tells you in this context and whether we can be confident about what the majority position of Americans is on this question.

library(car)
ban_ar.new <- recode(ban_ar, "1=1; 2=0; 3=0")
table(ban_ar.new)
## ban_ar.new
##    0    1 
## 3395 3990
prop_favor <- mean(ban_ar.new, na.rm = TRUE)
print(prop_favor)
## [1] 0.5402844
n_ban_ar <- length(ban_ar.new)
print(n_ban_ar)
## [1] 8280
error_margin_prop <- qnorm(0.975) * sqrt((prop_favor * (1 - prop_favor)) / n_ban_ar)
lower_bound_prop <- prop_favor - error_margin_prop
upper_bound_prop <- prop_favor + error_margin_prop
cat("95% CI (manual calculation):", lower_bound_prop, "to", upper_bound_prop, "\n")
## 95% CI (manual calculation): 0.5295497 to 0.551019
prop_test_result <- prop.test(sum(ban_ar.new, na.rm = TRUE), n_ban_ar, conf.level = 0.95)
cat("95% CI (prop.test):", prop_test_result$conf.int[1], "to", prop_test_result$conf.int[2], "\n")
## 95% CI (prop.test): 0.471072 to 0.492713

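The two intervals do not agree. The prop.test interval for the proportion who favor banning the sale of semi-automatic “assault-style” rifles is 47.1% to 49.3%, while the manual interval is 52.9% to 55.1%. The difference comes from the denominators: the manual estimate uses mean(ban_ar.new, na.rm = TRUE), which drops missing responses (3990 of the 7385 respondents who answered, or 54.0%), whereas prop.test was given length(ban_ar.new) = 8280, which also counts the missing responses (3990 of 8280, or 48.2%). Among those who answered, the estimate is only modestly above 50%, so this is at most a narrow majority in favor, and the prop.test result above cannot settle the question because it treats missing responses as not favoring a ban.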
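The denominators can be checked directly from the counts in the table above. The sketch below is illustrative only: it uses the printed counts (3395 and 3990) and the printed length (8280) rather than the raw data, and the names n_valid, n_all, ci_manual, and ci_prop are introduced here for the example.

```r
# Counts copied from table(ban_ar.new) and length(ban_ar.new) above
favor     <- 3990
not_favor <- 3395
n_all     <- 8280                 # includes missing responses
n_valid   <- favor + not_favor    # respondents with a non-missing answer

# The two candidate denominators give noticeably different proportions
p_valid <- favor / n_valid        # matches mean(ban_ar.new, na.rm = TRUE)
p_all   <- favor / n_all          # what prop.test sees if given length()
print(c(valid = p_valid, all = p_all))

# Manual 95% interval using the non-missing sample size
se        <- sqrt(p_valid * (1 - p_valid) / n_valid)
ci_manual <- p_valid + c(-1, 1) * qnorm(0.975) * se
print(ci_manual)

# prop.test with the same denominator; small differences from the manual
# interval are expected because prop.test uses a different interval formula
ci_prop <- prop.test(favor, n_valid, conf.level = 0.95)$conf.int
print(ci_prop)
```

With the raw variable, the same denominator would be sum(!is.na(ban_ar.new)) rather than length(ban_ar.new).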

Question 6: Estimating proportions separately for men and women

Make a table of ban_ar.new against sex by typing table(ban_ar.new, sex).

Next, calculate sample proportions and 95% confidence intervals for the proportions of men and women who supported banning the sale of assault rifles. You can just do this using the prop.test function and don’t have to calculate it manually. Then comment briefly on what you learn from this about possible differences in support for banning assault rifle sales by gender, including both estimates and confidence intervals.

table(ban_ar.new, sex)
##           sex
## ban_ar.new    1    2
##          0 1751 1617
##          1 1598 2370

The table labels sex only with the codes 1 and 2 rather than as men and women, so from this output alone I could not tell which column corresponds to which group.
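As a rough sketch of the remaining step, the group proportions and intervals can be computed from the counts in the table above by calling prop.test once per column. The counts are copied from the printed table. The source does not state which sex code denotes men and which denotes women, so the sketch keeps the codes 1 and 2 as labels. The names counts, favor, totals, and results are introduced here for illustration.

```r
# Counts copied from table(ban_ar.new, sex): rows are ban_ar.new (0, 1),
# columns are the sex codes (1, 2); the meaning of each code is not given here
counts <- matrix(c(1751, 1617,
                   1598, 2370),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(ban_ar.new = c("0", "1"), sex = c("1", "2")))

favor  <- counts["1", ]      # number favoring a ban in each group
totals <- colSums(counts)    # non-missing respondents in each group

# One prop.test per group, keeping the estimate and the 95% interval
results <- lapply(colnames(counts), function(s) {
  res <- prop.test(favor[[s]], totals[[s]], conf.level = 0.95)
  c(estimate = unname(res$estimate),
    lower = res$conf.int[1],
    upper = res$conf.int[2])
})
names(results) <- colnames(counts)

for (s in names(results)) {
  cat("sex code", s, ":", round(results[[s]], 3), "\n")
}
```

If the two intervals do not overlap, that points to a real difference in support between the two groups rather than sampling noise.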