.Rmd file and
the knitted PDF (or HTML).
This assignment focuses on a more substantive social science question: how economic perceptions or concerns are related to immigration attitudes, and whether that relationship differs by party identification.
This assignment uses a subset of data from the American
National Election Studies (ANES).
ANES is a long-running academic survey project that studies
political attitudes and behavior in the United States.
Researchers commonly use ANES data to study topics such as political
participation, public opinion, and social attitudes.
The dataset used in this assignment contains a subset of respondents and variables from the 2020 ANES Time Series Study. The original survey includes thousands of respondents and is designed to represent the U.S. voting-age population.
Some ANES survey questions use special numeric codes to represent
responses such as “Don’t know,” “Refused,” or “Not
applicable.” In this dataset, these responses are coded using
negative values (for example, -8 or -9). In Part B of
this assignment, you will convert these values into missing
values (NA) so that statistical calculations in R
treat them appropriately.
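To see why these codes matter, here is a minimal sketch on a made-up vector (the values are hypothetical and not taken from the ANES file). Leaving -8 and -9 in place pulls the mean down, while recoding them to NA lets R skip them.

```r
# Hypothetical responses on a 1-4 scale; -8 and -9 stand in for
# "Don't know" and "Refused" (made-up values for illustration only).
x <- c(1, 3, 4, -9, 2, -8, 3)

# Treating the codes as real numbers distorts the mean.
mean_raw <- mean(x)

# Recode every negative value to NA, then drop NAs in the calculation.
x_clean <- x
x_clean[x_clean < 0] <- NA
mean_clean <- mean(x_clean, na.rm = TRUE)

cat("Mean with codes left in:", round(mean_raw, 2), "\n")
cat("Mean after recoding:    ", round(mean_clean, 2), "\n")
```

The same distortion would affect histograms and regressions, since R has no way of knowing that a negative number is a code rather than a response.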
Important note: Please use the codebook provided with the assignment to confirm how each variable is coded. In some ANES variables, smaller values may indicate more restrictive attitudes, while in others larger values may indicate more supportive attitudes. Be careful when interpreting signs and differences.
In this part, you will load the dataset into R and take a first look at its structure.
First, load the dataset using read.csv(). Then use the
functions head() and dim() to inspect the
data.
head() shows the first few rows of the dataset.
dim() shows the number of observations (rows) and variables (columns) in the dataset.
# Check that the file is in the working directory, then load and inspect it
file.exists("anes2020_assignment2_cleaned.csv")
mydata <- read.csv("anes2020_assignment2_cleaned.csv")
head(mydata)
dim(mydata)
## Your R codes
Written answer:
How many observations are in this dataset? 8280
How many variables are included in the dataset? 16
ANES often codes missing responses as negative values (e.g.,
-9 Refused, -8 Don’t know). For this
assignment, recode any negative values to
NA for the variables below.
Before recoding, count how many -8 and -9
responses exist in each of the following variables:
immigration_attitude, econ_concern,
pid, income, and weight_pre.
vars <- c("immigration_attitude", "econ_concern", "pid", "income", "weight_pre")
for (v in vars) {
  cat("\nVariable:", v, "\n")
  cat("-8:", sum(mydata[[v]] == -8, na.rm = TRUE), "\n")
  cat("-9:", sum(mydata[[v]] == -9, na.rm = TRUE), "\n")
}
##
## Variable: immigration_attitude
## -8: 3
## -9: 80
##
## Variable: econ_concern
## -8: 0
## -9: 0
##
## Variable: pid
## -8: 4
## -9: 31
##
## Variable: income
## -8: 0
## -9: 584
##
## Variable: weight_pre
## -8: 0
## -9: 0
## Your R codes
# Recode every negative value in the listed variables to NA
vars <- c("immigration_attitude", "econ_concern", "pid", "income", "weight_pre")
for (v in vars) {
  mydata[[v]][mydata[[v]] < 0] <- NA
}
## Your R codes
After recoding, show results that demonstrate the negative values
were successfully changed to NA. One simple way is to
display the minimum value for each variable or count how many values are
still below zero.
sapply(mydata[vars], function(x) sum(x < 0, na.rm = TRUE))
## immigration_attitude econ_concern pid
## 0 0 0
## income weight_pre
## 0 0
## Your R codes
Written answer:
Why is it important to recode these values as NA before conducting statistical analysis? What potential
problems might occur if we leave these values (e.g., -8 or -9) in the
dataset when calculating statistics such as means, histograms, or
regressions?
Compute the percentage of missing values for income and immigration_attitude.
mean(is.na(mydata$income)) * 100
## [1] 7.439614
mean(is.na(mydata$immigration_attitude)) * 100
## [1] 1.002415
## Your R codes
Written answer:
Which variable has higher item nonresponse? The ‘income’ variable.
Give one plausible reason (in plain language) why the variable you mentioned in the previous question often has higher nonresponse. People are more reluctant to answer questions about their income. Talking about income is still largely taboo, which makes respondents more likely to skip those questions.
Using R, compute and show:
The mean of
immigration_attitude
The frequency table for
econ_concern
mean(mydata$immigration_attitude, na.rm = TRUE)
## [1] 2.767964
table(mydata$econ_concern, useNA = "ifany")
##
## 1 2 3 4 5 <NA>
## 750 800 1704 1764 3222 40
prop.table(table(mydata$econ_concern, useNA = "ifany")) * 100
##
## 1 2 3 4 5 <NA>
## 9.0579710 9.6618357 20.5797101 21.3043478 38.9130435 0.4830918
## Your R codes
Written answer:
What is the average level of
immigration_attitude in the sample?
Report the numerical value. Based on the codebook, what
does this average suggest about respondents’ immigration attitudes in
the sample? The rounded mean of 2.77 corresponds most closely to the
following attitude: ‘Allow unauthorized immigrants to remain in the U.S.
and eventually qualify for citizenship, but only if they meet
requirements’.
Looking at the frequency table of
econ_concern, which response category is most
common?
What does this tell us about how respondents perceive the
national economy? Response 5 is the most common category (about 38.9% of
respondents), which means that the largest share of respondents believes
the national economy has gotten worse.
Using R, create a histogram of
immigration_attitude. Your figure should have a clear
x-axis label, y-axis label, and title. Don’t leave the axis labels as
default variable names – make sure to use clear, descriptive labels for
both axes.
hist(
  mydata$immigration_attitude,
  breaks = 0.5:4.5,   # bins centered on 1, 2, 3, 4
  main = "Distribution of Immigration Attitudes",
  xlab = "Immigration Attitude (1 = Very Negative, 4 = Very Positive)",
  ylab = "Number of Respondents",
  col = "skyblue",
  border = "white",
  xaxt = "n"          # suppress default x-axis
)
axis(1, at = 1:4)     # label x-axis with 1, 2, 3, 4
## Your R codes
Written answer:
After creating the histogram of immigration_attitude,
answer the following questions:
Describe the overall shape of the distribution. Is it roughly symmetric, left-skewed, or right-skewed? The histogram is left-skewed, so the median is likely larger than the mean.
Which values of immigration_attitude appear
most common in the sample?
The third response: ‘Allow unauthorized immigrants to remain in the U.S. and eventually qualify for citizenship, but only if they meet requirements’
Compute the weighted mean of
immigration_attitude using the weight_pre variable
and compare it to the unweighted mean.
# Unweighted mean
unweighted_mean <- mean(mydata$immigration_attitude, na.rm = TRUE)
# Weighted mean using weight_pre
weighted_mean <- weighted.mean(mydata$immigration_attitude, w = mydata$weight_pre, na.rm = TRUE)
# Compare the two
cat("Unweighted mean: ", round(unweighted_mean, 2), "\n")
## Unweighted mean: 2.77
cat("Weighted mean: ", round(weighted_mean, 2), "\n")
## Weighted mean: 2.75
## Hint: use weighted.mean() function to calculate weighted mean
## Your R codes
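As a check on what weighted.mean() computes, the sketch below reproduces it by hand on hypothetical toy vectors (not ANES values). It also shows one caveat: na.rm = TRUE drops cases where the outcome is missing, but it does not drop missing weights.

```r
# Hypothetical toy data (not from the ANES file).
y <- c(1, 2, 4, NA, 3)            # outcome with one missing value
w <- c(0.5, 1.5, 1.0, 2.0, 1.0)   # survey weights

# weighted.mean() with na.rm = TRUE drops cases where y is missing.
wm_builtin <- weighted.mean(y, w = w, na.rm = TRUE)

# The same quantity by hand: sum(w * y) / sum(w) over complete cases.
ok <- !is.na(y) & !is.na(w)
wm_manual <- sum(w[ok] * y[ok]) / sum(w[ok])

# na.rm = TRUE does not remove missing weights, so an NA weight
# attached to an observed y makes the result NA.
w_missing <- w
w_missing[1] <- NA
wm_na <- weighted.mean(y, w = w_missing, na.rm = TRUE)

cat("Built-in:", wm_builtin, " By hand:", wm_manual, " With NA weight:", wm_na, "\n")
```

If a weight variable did contain missing values, restricting both vectors to complete cases (as in the by-hand version) would avoid an NA result.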
Written answer:
Compare the weighted mean and unweighted
mean of immigration_attitude. What is the
difference between the weighted mean and unweighted mean of
immigration_attitude? The weighted mean (2.75) is about 0.02 lower than the unweighted mean (2.77).
In plain language, what does a survey weight try to do in survey analysis? Explain how applying weights can help make survey estimates more representative of the target population. In its simplest form, the formula is: weight = population proportion / sample proportion. Underrepresented groups (for example, men relative to women) are given more weight to compensate, while overrepresented groups are given less weight. This makes the results more representative of the population.
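The weight formula in the answer above can be worked through with a small numerical sketch. All shares and group means below are hypothetical, and the actual ANES weights are built with a more involved procedure; this only illustrates the basic idea.

```r
# Hypothetical shares (made up for illustration).
pop_share    <- c(men = 0.49, women = 0.51)  # target population
sample_share <- c(men = 0.40, women = 0.60)  # achieved sample

# Simple weight per group: population proportion / sample proportion.
w_group <- pop_share / sample_share

# Hypothetical group means on some attitude scale.
group_mean <- c(men = 2.5, women = 3.0)

# Unweighted estimate follows the sample composition.
unweighted <- sum(sample_share * group_mean)

# Weighted estimate re-balances the groups to the population composition.
weighted <- sum(sample_share * w_group * group_mean) /
  sum(sample_share * w_group)

print(round(w_group, 3))
cat("Unweighted:", unweighted, " Weighted:", weighted, "\n")
```

Men are underrepresented in this toy sample, so their weight is above 1, and the weighted estimate moves toward the men's group mean.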
Research question: Do respondents with different levels of economic concern report different immigration attitudes?
Hypothesis: Respondents with higher levels of economic concern will report more negative attitudes toward immigration than those with lower levels of economic concern.
For a simple comparison, let’s calculate difference in means. Create two groups as follows:
a higher economic concern group
a lower economic concern group
Use the mean of economic concern as the
cutpoint.
Then compute the mean of immigration_attitude for each
group and calculate the difference-in-means:
\[ \bar{Y}_{higher\ concern} - \bar{Y}_{lower\ concern} \]
# --- Step 1: Compute mean of econ_concern (cutpoint) ---
econ_mean <- mean(mydata$econ_concern, na.rm = TRUE)
# --- Step 2: Create higher vs lower economic concern groups ---
mydata$econ_group <- ifelse(mydata$econ_concern > econ_mean, "Higher concern", "Lower concern")
# --- Step 3: Compute mean immigration_attitude by group ---
group_means <- tapply(mydata$immigration_attitude, mydata$econ_group, mean, na.rm = TRUE)
cat("Mean immigration_attitude by economic concern group:\n")
## Mean immigration_attitude by economic concern group:
print(round(group_means, 2))
## Higher concern Lower concern
## 2.95 2.49
# --- Step 4: Compute difference-in-means (Higher - Lower) ---
diff_means <- group_means["Higher concern"] - group_means["Lower concern"]
cat("\nDifference in means (Higher - Lower concern):", round(diff_means, 2), "\n")
##
## Difference in means (Higher - Lower concern): 0.47
## Your R codes
Written answer:
Interpret the sign of the difference. Is the difference positive or negative? The difference is positive.
Based on your coding and the codebook, does higher economic concern appear to be associated with more restrictive (e.g., send them back to their home country) or more supportive immigration (e.g., allow unauthorized immigrants to remain in the U.S. ) attitudes? Economic concern and immigration attitudes are coded in opposite directions: a higher score on economic concern is more negative (the economy is seen as worse), while a higher score on immigration attitude is more supportive. A positive difference therefore means that higher economic concern is associated with more supportive attitudes toward immigrants, which runs against the hypothesis.
Is this a causal effect? Why or why not? The relationship is not necessarily causal, because the comparison does not adjust for possible lurking variables.
Now examine whether the relationship between economic concern and immigration attitudes differs by party identification.
Create a binary party identification variable
(rep_pid):
rep_pid = 1 if respondent identifies as Republican
or leans Republican
rep_pid = 0 otherwise
Note: Use the codebook carefully to decide which codes should be included in each category. For this part, your R code should:
Create the binary variable
rep_pid
Show how many respondents fall into each
category of rep_pid
#Create binary Republican variable
mydata$rep_pid <- ifelse(mydata$pid %in% 5:7, 1, 0)
# Show counts for each category
table(mydata$rep_pid)
##
## 0 1
## 4839 3441
# Optional: nicer labels
table(Party = ifelse(mydata$rep_pid == 1, "Republican", "Other"))
## Party
## Other Republican
## 4839 3441
## Your R codes
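One detail worth checking in the coding above: `%in%` returns FALSE for NA, so respondents with a missing pid end up in the 0 category. The sketch below uses hypothetical pid values and assumes, as in the code above, that codes 5 to 7 mark Republicans and Republican leaners.

```r
# Hypothetical pid values, including two missing responses.
pid <- c(1, 4, 5, 7, NA, 6, NA)

# Approach used above: NA is not in 5:7, so it is coded 0.
rep_all <- ifelse(pid %in% 5:7, 1, 0)

# Alternative: keep respondents with missing pid as NA.
rep_na <- ifelse(is.na(pid), NA, ifelse(pid %in% 5:7, 1, 0))

print(table(rep_all, useNA = "ifany"))
print(table(rep_na, useNA = "ifany"))
```

Which version is appropriate depends on whether "otherwise" in the instructions is meant to include respondents who did not report a party identification.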
Compute the higher-concern vs. lower-concern difference within each party ID group.
# --- Step 1: Compute mean of econ_concern (cutpoint) ---
econ_mean <- mean(mydata$econ_concern, na.rm = TRUE)
# --- Step 2: Create higher vs lower economic concern groups ---
mydata$econ_group <- ifelse(mydata$econ_concern > econ_mean, "Higher concern", "Lower concern")
# --- Step 3: Compute mean immigration_attitude by econ_group within each party group ---
library(dplyr)
results <- mydata %>%
  group_by(rep_pid, econ_group) %>%
  summarise(mean_immigration = mean(immigration_attitude, na.rm = TRUE), .groups = "drop") %>%
  tidyr::pivot_wider(names_from = econ_group, values_from = mean_immigration) %>%
  mutate(diff_means = `Higher concern` - `Lower concern`)
# --- Step 4: Show results ---
results
## # A tibble: 2 × 5
## rep_pid `Higher concern` `Lower concern` `NA` diff_means
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0 3.13 2.83 3 0.301
## 2 1 2.48 2.29 2.13 0.190
## Your R codes
Written answer:
Compare the difference-in-means across the two party ID groups (Republicans and non-Republicans). The difference in means is larger among non-Republicans (0.30) than among Republicans (0.19).
In which group does the relationship between economic concern and immigration attitudes appear stronger? The relationship between economic concern and immigration attitudes appears stronger among non-Republicans.
Provide one plausible reason (non-causal explanation is fine) why the relationship might differ by party identification. It could be that Republicans with higher incomes vote Republican for economic reasons while holding a more positive stance toward immigrants. Republicans with lower incomes may tend to vote Republican for culturally conservative reasons, which are correlated with anti-immigration opinions.
To evaluate whether respondents with higher and lower economic concern are comparable, compare them on two pre-existing characteristics:
age
income
Compute group means for age and income by economic concern group.
# --- Step 1: Ensure econ_group exists ---
# Create higher vs lower economic concern groups if not already done
econ_mean <- mean(mydata$econ_concern, na.rm = TRUE)
mydata$econ_group <- ifelse(mydata$econ_concern > econ_mean, "Higher concern", "Lower concern")
# --- Step 2: Compute mean age and income by econ_group ---
library(dplyr)
group_characteristics <- mydata %>%
  group_by(econ_group) %>%
  summarise(
    mean_age = mean(age, na.rm = TRUE),
    mean_income = mean(income, na.rm = TRUE),
    n = n()
  )
# --- Step 3: Show results ---
group_characteristics
## # A tibble: 3 × 4
## econ_group mean_age mean_income n
## <chr> <dbl> <dbl> <int>
## 1 Higher concern 49.2 12.1 4986
## 2 Lower concern 49.0 11.2 3254
## 3 <NA> 40.6 9.43 40
## Your R codes
Written answer:
Are the higher-concern and lower-concern groups similar on age and income? The two groups are quite similar. The higher-concern group has a slightly higher mean age (49.2 vs. 49.0) and mean income (12.1 vs. 11.2), but the differences do not appear problematic.
Explain how group differences in age or income could bias the difference-in-means on immigration attitudes. Older people tend to be more conservative, while people with higher incomes tend to be more progressive. If the two groups differed in age or income, part of the difference in immigration attitudes could reflect those characteristics rather than economic concern itself.
In one sentence, state what additional information or design feature would be needed to make a stronger causal claim. We would need to control for lurking variables that could explain both the level of economic concern and more negative attitudes toward immigration, most likely through a multivariable analysis.
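A multivariable analysis of the kind mentioned above could be sketched with lm(). The example below runs on simulated data only: the variable names mirror the assignment, but every value, the income scale, and the coefficients used to generate the outcome are assumptions made for illustration.

```r
# Simulated data only: names mirror the assignment, but all values
# and generating coefficients below are made up.
set.seed(42)
n <- 500
sim <- data.frame(
  econ_concern = sample(1:5, n, replace = TRUE),
  age          = sample(18:90, n, replace = TRUE),
  income       = sample(1:22, n, replace = TRUE)  # hypothetical income scale
)
sim$immigration_attitude <- 2 + 0.15 * sim$econ_concern -
  0.005 * sim$age + 0.01 * sim$income + rnorm(n, sd = 0.8)

# Regress the outcome on economic concern while adjusting for age and income.
fit <- lm(immigration_attitude ~ econ_concern + age + income, data = sim)

# The econ_concern coefficient is the association net of age and income.
print(round(coef(fit), 3))
```

A regression like this adjusts only for the variables that are included, so unobserved confounders could still bias the estimate.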