Total: 50 points

Part A. Load and Inspect the dataset (2 points)

In this part, you will load the dataset into R and take a first look at its structure.

First, load the dataset using read.csv(). Then use the functions head() and dim() to inspect the data.

head() shows the first few rows of the dataset.
dim() shows the number of observations (rows) and variables (columns) in the dataset.

file.exists("anes2020_assignment2_cleaned.csv")

mydata <- read.csv("anes2020_assignment2_cleaned.csv")
head(mydata)  # first few rows
dim(mydata)   # number of rows and columns
## Your R codes

Written answer:

How many observations are in this dataset? 8280
How many variables are included in the dataset? 16

Part B. Clean missing values (recode negative codes to NA) (6 points)

ANES often codes missing responses as negative values (e.g., -9 Refused, -8 Don’t know). For this assignment, recode any negative values to NA for the variables below.

B1. Count negative missing-value codes before recoding

Before recoding, count how many -8 and -9 responses exist in each of the following variables: immigration_attitude, econ_concern, pid, income, and weight_pre.

vars <- c("immigration_attitude", "econ_concern", "pid", "income", "weight_pre")

for (v in vars) {
  cat("\nVariable:", v, "\n")
  cat("-8:", sum(mydata[[v]] == -8, na.rm = TRUE), "\n")
  cat("-9:", sum(mydata[[v]] == -9, na.rm = TRUE), "\n")
}

## 
## Variable: immigration_attitude 
## -8: 3 
## -9: 80 
## 
## Variable: econ_concern 
## -8: 0 
## -9: 0 
## 
## Variable: pid 
## -8: 4 
## -9: 31 
## 
## Variable: income 
## -8: 0 
## -9: 584 
## 
## Variable: weight_pre 
## -8: 0 
## -9: 0

## Your R codes

B2. Recode negative values to `NA`

vars <- c("immigration_attitude", "econ_concern", "pid", "income", "weight_pre")

for (v in vars) {
  mydata[[v]][mydata[[v]] < 0] <- NA
}
## Your R codes

B3. Show that your recoding worked

After recoding, show results that demonstrate the negative values were successfully changed to NA. One simple way is to display the minimum value for each variable or count how many values are still below zero.

sapply(mydata[vars], function(x) sum(x < 0, na.rm = TRUE))

## immigration_attitude         econ_concern                  pid 
##                    0                    0                    0 
##               income           weight_pre 
##                    0                    0

## Your R codes

Written answer:

Why is it important to recode these negative values as NA before conducting statistical analysis? What potential problems might occur if we leave these values (e.g., -8 or -9) in the dataset when calculating statistics such as means, histograms, or regressions?

These codes are not real numerical values; they flag responses (such as ‘Refused’ or ‘Don’t know’) that should be excluded from statistical analysis. If they are not recoded to NA, R treats them as valid low scores: they pull means downward, add spurious bars below the real scale in a histogram, and distort regression estimates.
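
A small made-up example illustrates the problem. The vector `x` below is hypothetical, not ANES data: three valid answers on a 1 to 4 scale plus two refusals coded -9.

```r
# Hypothetical responses: valid answers are 1-4, -9 means "Refused"
x <- c(3, 4, 2, -9, -9)

# Leaving the codes in treats -9 as a real (very low) answer
mean_with_codes <- mean(x)

# Recode negative codes to NA, then drop them when averaging
x_clean <- x
x_clean[x_clean < 0] <- NA
mean_clean <- mean(x_clean, na.rm = TRUE)

cat("Mean with -9 left in:", mean_with_codes, "\n")  # (3 + 4 + 2 - 9 - 9) / 5 = -1.8
cat("Mean after recoding: ", mean_clean, "\n")       # (3 + 4 + 2) / 3 = 3
```

With the codes left in, the mean falls outside the valid 1 to 4 range, which is an easy sign that something is wrong.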

Part C. Item nonresponse diagnostics (3 points)

Compute the item nonresponse rate (percent missing) for income.
Compute the item nonresponse rate (percent missing) for immigration_attitude.

mean(is.na(mydata$income)) * 100

## [1] 7.439614

mean(is.na(mydata$immigration_attitude)) * 100

## [1] 1.002415

## Your R codes

Written answer:

Which variable has higher item nonresponse? The ‘income’ variable (7.44% missing, compared with 1.00% for immigration_attitude).
Give one plausible reason (in plain language) why the variable you mentioned in the previous question often has higher nonresponse. People are more reluctant to answer questions about their income. Talking about income is still largely taboo, which makes respondents more likely to skip or refuse such questions.
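
The same percent-missing calculation can be applied to several variables at once. A minimal sketch, using a small hypothetical data frame `toy` rather than the ANES file:

```r
# Hypothetical data frame with some missing values
toy <- data.frame(
  income = c(5, NA, 12, NA, 8),
  immigration_attitude = c(3, 2, NA, 4, 1)
)

# is.na() on a data frame returns a logical matrix;
# colMeans() then gives the share of missing values per column
nonresponse_pct <- colMeans(is.na(toy)) * 100
print(nonresponse_pct)
```

Here income is missing for 2 of 5 rows (40%) and immigration_attitude for 1 of 5 rows (20%).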

Part D. Descriptive statistics and one plot (7 points)

D1. Unweighted descriptives

Using R, compute and show:

The mean of immigration_attitude
The frequency table for econ_concern
The percentage table for econ_concern

mean(mydata$immigration_attitude, na.rm = TRUE)

## [1] 2.767964

table(mydata$econ_concern, useNA = "ifany")

## 
##    1    2    3    4    5 <NA> 
##  750  800 1704 1764 3222   40

prop.table(table(mydata$econ_concern, useNA = "ifany")) * 100

## 
##          1          2          3          4          5       <NA> 
##  9.0579710  9.6618357 20.5797101 21.3043478 38.9130435  0.4830918

## Your R codes

Written answer:

What is the average level of immigration_attitude in the sample?
Report the numerical value. Based on the codebook, what does this average suggest about respondents’ immigration attitudes in the sample? The mean is 2.77 (rounded), which lies closest to response 3: ‘Allow unauthorized immigrants to remain in the U.S. and eventually qualify for citizenship, but only if they meet requirements’.
Looking at the frequency table of econ_concern, which response category is most common?
What does this tell us about how respondents perceive the national economy? Response 5 is the most common category (3222 respondents, 38.9%). This means that the most common view in the sample is that the national economy has gotten worse.

D2. One plot

Using R, create a histogram of immigration_attitude. Your figure should have a clear x-axis label, y-axis label, and title. Don’t leave the axis labels as default variable names – make sure to use clear, descriptive labels for both axes.

hist(
  mydata$immigration_attitude,
  breaks = 0.5:4.5,                     # bins centered on 1,2,3,4
  main = "Distribution of Immigration Attitudes",
  xlab = "Immigration Attitude (1 = Very Negative, 4 = Very Positive)",
  ylab = "Number of Respondents",
  col = "skyblue",
  border = "white",
  xaxt = "n"                            # suppress default x-axis
)
axis(1, at = 1:4)                        # label x-axis with 1,2,3,4

## Your R codes

Written answer:

After creating the histogram of immigration_attitude, answer the following questions:

Describe the overall shape of the distribution. Is it roughly symmetric, left-skewed, or right-skewed? The histogram is left-skewed, so the median should be larger than the mean.
Which values of immigration_attitude appear most common in the sample?

The third response: ‘Allow unauthorized immigrants to remain in the U.S. and eventually qualify for citizenship, but only if they meet requirements’

Does the histogram suggest that respondents are clustered around certain attitudes, or are opinions spread out across many categories? Briefly explain what this tells us about immigration attitudes in the sample. Opinions are spread out, with every category chosen by more than 1000 respondents, but there is a strong preference for the third answer (more than 4000 respondents). This tells us that immigration attitudes in the sample cover the entire spectrum from most negative to most positive, with a majority of respondents holding a more positive attitude.

Part E. Weighted vs. unweighted estimates (6 points)

Compute the weighted mean of immigration_attitude using weight_pre variable and compare it to the unweighted mean.

# Unweighted mean
unweighted_mean <- mean(mydata$immigration_attitude, na.rm = TRUE)

# Weighted mean using weight_pre
weighted_mean <- weighted.mean(mydata$immigration_attitude, w = mydata$weight_pre, na.rm = TRUE)

# Compare the two
cat("Unweighted mean: ", round(unweighted_mean, 2), "\n")

## Unweighted mean:  2.77

cat("Weighted mean:   ", round(weighted_mean, 2), "\n")

## Weighted mean:    2.75

## Hint: use weighted.mean() function to calculate weighted mean

## Your R codes

Written answer:

Compare the weighted mean and unweighted mean of immigration_attitude. What is the difference between the weighted mean and unweighted mean of immigration_attitude? The weighted mean (2.75) is about 0.02 lower than the unweighted mean (2.77).
In plain language, what does a survey weight try to do in survey analysis? Explain how applying weights can help make survey estimates more representative of the target population. A survey weight adjusts how much each respondent counts so that the sample better matches the target population. A simple version is: weight = population proportion / sample proportion. Underrepresented groups (for example, men relative to women) are given more weight to compensate, while overrepresented groups are given less weight. This makes the results more representative of the population.
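
The weight formula can be worked through on a toy example. Everything below is hypothetical (the data frame `toy`, the 50/50 population shares, and the attitude scores); it is only a sketch of the idea, not how the ANES weight_pre variable was actually constructed.

```r
# Hypothetical sample of 10 respondents: 7 women, 3 men
toy <- data.frame(
  gender   = c(rep("female", 7), rep("male", 3)),
  attitude = c(3, 3, 4, 3, 2, 3, 3, 2, 1, 2)
)

# Assumed population shares (made up for illustration)
pop_share <- c(female = 0.5, male = 0.5)

# Sample shares computed from the data: female = 0.7, male = 0.3
samp_share <- sapply(names(pop_share), function(g) mean(toy$gender == g))

# weight = population proportion / sample proportion
w_by_group <- pop_share / samp_share
toy$weight <- unname(w_by_group[toy$gender])

unweighted <- mean(toy$attitude)
weighted   <- weighted.mean(toy$attitude, toy$weight)

cat("Unweighted mean:", round(unweighted, 2), "\n")  # 2.6
cat("Weighted mean:  ", round(weighted, 2), "\n")    # 2.33
```

Men are underrepresented in this toy sample, so each man gets a weight above 1 and each woman a weight below 1. The weighted mean moves toward the men's lower average score.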

Part F. Difference-in-means (6 points)

Research question: Do respondents with different levels of economic concern report different immigration attitudes?

Hypothesis: Respondents with higher levels of economic concern will report more negative attitudes toward immigration than those with lower levels of economic concern.

For a simple comparison, let’s calculate difference in means. Create two groups as follows:

a higher economic concern group
a lower economic concern group

Use the mean of economic concern as the cutpoint.

Then compute the mean of immigration_attitude for each group and calculate the difference-in-means:

\[ \bar{Y}_{higher\ concern} - \bar{Y}_{lower\ concern} \]

# --- Step 1: Compute mean of econ_concern (cutpoint) ---
econ_mean <- mean(mydata$econ_concern, na.rm = TRUE)

# --- Step 2: Create higher vs lower economic concern groups ---
mydata$econ_group <- ifelse(mydata$econ_concern > econ_mean, "Higher concern", "Lower concern")

# --- Step 3: Compute mean immigration_attitude by group ---
group_means <- tapply(mydata$immigration_attitude, mydata$econ_group, mean, na.rm = TRUE)

cat("Mean immigration_attitude by economic concern group:\n")

## Mean immigration_attitude by economic concern group:

print(round(group_means, 2))

## Higher concern  Lower concern 
##           2.95           2.49

# --- Step 4: Compute difference-in-means (Higher - Lower) ---
diff_means <- group_means["Higher concern"] - group_means["Lower concern"]
cat("\nDifference in means (Higher - Lower concern):", round(diff_means, 2), "\n")

## 
## Difference in means (Higher - Lower concern): 0.47

## Your R codes

Written answer:

Interpret the sign of the difference. Is the difference positive or negative? The difference is positive (0.47).
Based on your coding and the codebook, does higher economic concern appear to be associated with more restrictive (e.g., send them back to their home country) or more supportive immigration (e.g., allow unauthorized immigrants to remain in the U.S. ) attitudes? The two variables are coded in opposite directions: a higher score on economic concern means a more negative view of the economy, while a higher score on immigration attitudes means a more supportive position. Therefore, higher economic concern is associated with more supportive attitudes toward immigrants, which is the opposite of the hypothesis.
Is this a causal effect? Why or why not? Not necessarily. This is a simple comparison of group means with no adjustment for possible lurking variables, so the difference may reflect other characteristics that differ between the groups.

Part G. Subgroup analysis (6 points)

Now examine whether the relationship between economic concern and immigration attitudes differs by party identification.

Step 1. Create a binary party identification variable

Create a binary party identification variable (rep_pid):

rep_pid = 1 if respondent identifies as Republican or leans Republican
rep_pid = 0 otherwise

Note: Use the codebook carefully to decide which codes should be included in each category. For this part, your R code should:

Create the binary variable rep_pid
Show how many respondents fall into each category of rep_pid

#Create binary Republican variable
mydata$rep_pid <- ifelse(mydata$pid %in% 5:7, 1, 0)

# Show counts for each category
table(mydata$rep_pid)

## 
##    0    1 
## 4839 3441

# Optional: nicer labels
table(Party = ifelse(mydata$rep_pid == 1, "Republican", "Other"))

## Party
##      Other Republican 
##       4839       3441

## Your R codes
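
Note that `%in%` never returns NA, so respondents whose pid was recoded to NA in Part B are counted in the 0 category above. If missing party identification should stay missing instead, one possible variant is sketched below on a hypothetical vector `pid_toy`. It assumes, as in the code above, that codes 5 to 7 denote Republicans and Republican leaners.

```r
# Hypothetical pid values on the 1-7 scale, with one missing response
pid_toy <- c(1, 4, 5, 7, NA, 6)

# %in% returns FALSE for NA, so the missing case is coded 0
rep_in <- ifelse(pid_toy %in% 5:7, 1, 0)

# A comparison with >= propagates NA, so the missing case stays NA
rep_na <- ifelse(pid_toy >= 5, 1, 0)

table(rep_in, useNA = "ifany")
table(rep_na, useNA = "ifany")
```

Which version is appropriate depends on whether "otherwise" in the task is meant to include respondents with no valid party identification.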

Step 2. Compute subgroup difference-in-means

Compute the higher-concern vs. lower-concern difference within each party ID group.

# --- Step 1: Compute mean of econ_concern (cutpoint) ---
econ_mean <- mean(mydata$econ_concern, na.rm = TRUE)

# --- Step 2: Create higher vs lower economic concern groups ---
mydata$econ_group <- ifelse(mydata$econ_concern > econ_mean, "Higher concern", "Lower concern")

# --- Step 3: Compute mean immigration_attitude by econ_group within each party group ---
library(dplyr)

results <- mydata %>%
  group_by(rep_pid, econ_group) %>%
  summarise(mean_immigration = mean(immigration_attitude, na.rm = TRUE), .groups = "drop") %>%
  tidyr::pivot_wider(names_from = econ_group, values_from = mean_immigration) %>%
  mutate(diff_means = `Higher concern` - `Lower concern`)

# --- Step 4: Show results ---
results

## # A tibble: 2 × 5
##   rep_pid `Higher concern` `Lower concern`  `NA` diff_means
##     <dbl>            <dbl>           <dbl> <dbl>      <dbl>
## 1       0             3.13            2.83  3         0.301
## 2       1             2.48            2.29  2.13      0.190

## Your R codes

Written answer:

Compare the difference-in-means across the two party ID groups (Republicans and non-Republicans). The difference in means is larger among non-Republicans (0.30) than among Republicans (0.19).
In which group does the relationship between economic concern and immigration attitudes appear stronger? The relationship between economic concern and immigration attitudes appears stronger among non-Republicans.
Provide one plausible reason (non-causal explanation is fine) why the relationship might differ by party identification. One possibility is that Republicans are a mixed group. Higher-income Republicans may vote Republican for economic reasons while holding a more positive stance toward immigrants, whereas lower-income Republicans may vote Republican for culturally conservative reasons that correlate with anti-immigration opinions. Such cross-cutting motives could weaken the link between economic concern and immigration attitudes within that party.

Part H. Comparability check and interpretation limits (4 points)

To evaluate whether respondents with higher and lower economic concern are comparable, compare them on two pre-existing characteristics:

age
income

Compute group means for age and income by economic concern group.

# --- Step 1: Ensure econ_group exists ---
# Create higher vs lower economic concern groups if not already done
econ_mean <- mean(mydata$econ_concern, na.rm = TRUE)
mydata$econ_group <- ifelse(mydata$econ_concern > econ_mean, "Higher concern", "Lower concern")

# --- Step 2: Compute mean age and income by econ_group ---
library(dplyr)

group_characteristics <- mydata %>%
  group_by(econ_group) %>%
  summarise(
    mean_age = mean(age, na.rm = TRUE),
    mean_income = mean(income, na.rm = TRUE),
    n = n()
  )

# --- Step 3: Show results ---
group_characteristics

## # A tibble: 3 × 4
##   econ_group     mean_age mean_income     n
##   <chr>             <dbl>       <dbl> <int>
## 1 Higher concern     49.2       12.1   4986
## 2 Lower concern      49.0       11.2   3254
## 3 <NA>               40.6        9.43    40

## Your R codes

Written answer:

Are the higher-concern and lower-concern groups similar on age and income? The groups are quite similar. The higher-concern group has a slightly higher mean age (49.2 vs. 49.0) and mean income (12.1 vs. 11.2), but the differences do not appear problematic.
Explain how group differences in age or income could bias the difference-in-means on immigration attitudes. Older people tend to be more conservative, while people with a higher income tend to be more progressive. If the two groups differ in age or income, part of the difference in immigration attitudes could reflect those characteristics rather than economic concern itself.
In one sentence, state what additional information or design feature would be needed to make a stronger causal claim. A stronger causal claim would require controlling for lurking variables that could explain both economic concern and immigration attitudes, most likely through a multivariable analysis.

Part I. Expressive responding and social desirability (10 points)

For this final section, use the original ANES codebook to think about survey questions that may be especially vulnerable to response bias.

I1. Identify potentially vulnerable questions

Look through the ANES variables in the codebook and identify:

One survey question that you think is especially susceptible to social desirability bias
One survey question that you think is especially susceptible to expressive responding

Name the question/variable you chose for social desirability bias and briefly explain why it may be vulnerable. So far as you and your family are concerned, how worried are you about your current financial situation? People tend to be ashamed to admit financial insecurity, so they are likely to underreport how worried they are.
Name the question/variable you chose for expressive responding and briefly explain why it may be vulnerable. What about the next 12 months? Do you expect the economy, in the country as a whole, to get better, stay about the same, or get worse? Answers to this question are very likely influenced by whether the party currently in government is the party the respondent supports or opposes.

I2. Reducing Expressive Responding

Describe one method to reduce expressive responding for the variable you identified in the previous question. Use one of the methods discussed in class (for example, incentives for accuracy, subtle pipeline, list experiments, or balanced incentive design).

In your answer, explain the design in detail:
- How the survey question or survey design would change
- What different groups of respondents would see (if applicable)
- How the method helps encourage more sincere responses

I would choose a list experiment. Respondents do not answer each question individually; instead, they report how many of the presented statements they agree with. The sample is randomly split into a control group and a treatment group. The control group is presented with 4 statements, and the treatment group is presented with the same 4 statements plus one on whether they think the economy will perform better in the next 12 months. Because respondents do not have to give a direct answer to the sensitive item, it is easier for them to think for themselves rather than automatically following party-aligned feelings.
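
A design with a control list and a treatment list containing one extra item is commonly analysed with a simple difference in mean counts. The sketch below uses made-up counts for eight respondents per group; the vectors `control` and `treatment` are hypothetical and only illustrate the calculation.

```r
# Hypothetical number of statements each respondent agrees with
control   <- c(2, 1, 3, 2, 2, 1, 3, 2)  # saw the 4 baseline statements
treatment <- c(3, 2, 3, 2, 3, 1, 4, 2)  # saw the same 4 plus the sensitive item

# With random assignment, the difference in mean counts estimates
# the share of respondents who agree with the sensitive item
est_share <- mean(treatment) - mean(control)
est_share  # 2.5 - 2 = 0.5
```

No individual answer to the sensitive item is ever observed; only the group-level share is estimated, which is what lowers the pressure on respondents.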

Now think creatively and propose one additional way to reduce expressive responding in a survey. This can be a new idea or a modification of an existing approach.

In your answer, explain:
- How the survey would be implemented
- What respondents would be asked to do
- Why this design might reduce expressive responding

I would tell respondents that positive or negative answers might require a verbal elaboration at the end of the survey. This makes respondents more likely to think critically about their answers, because they do not want to be embarrassed when asked to elaborate.

For the survey question or variable where expressive responding may occur, explain why identifying expressive responding is important. Why does it matter to distinguish expressive responding from respondents’ true beliefs, attitudes, or behaviors, and what are the potential real-world implications of this distinction? Identifying expressive responding is important for judging correctly what people actually think. Expressive responding can produce data showing positions that respondents do not actually hold on a deeper level. The distinction matters in real life: for example, it would be very relevant for a government to know whether people actually believe certain misinformation or whether their stated agreement follows from expressive responding.

SOSC 3730: Assignment 2

Sander De Beule

2026-03-19

