Instructions

For this assignment, you will use the dataset cpsnov2012.dta. This is the November 2012 Current Population Survey, conducted by the U.S. Census Bureau with a nationally representative sample of U.S. households. The dataset and its codebook cpsnov2012.pdf are both posted on Brightspace.

Note: Many of these questions will require you to analyze the CPS’s household income variable (hefaminc). It is coded at the ordinal level, but the analyses require that it be at an interval level. To do this, create a recoded version of hefaminc in which each case is assigned a household income equal to the midpoint of its interval of hefaminc. For example, a household with income in the range of $5,000-$7,500 should be assigned the value $6,250, and so on. Before proceeding, make sure to inspect hefaminc and consult the codebook to identify its categories and missing codes.
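The midpoint recode described in the note can also be written as a lookup table, which keeps the interval bounds visible next to the values assigned. The sketch below is only illustrative: the bounds are inferred from the midpoints used later in this write-up, the $200,000 ceiling for the open-ended top category is an arbitrary assumption, and `income_bounds` and `recode_midpoint()` are hypothetical names. All of it should be checked against cpsnov2012.pdf.

```r
# Interval bounds inferred from the midpoints used in this write-up.
# ASSUMPTION: verify every bound against the codebook; the upper bound of the
# open-ended top category (200000) is an arbitrary choice, not a codebook value.
income_bounds <- data.frame(
  code  = 1:16,
  lower = c(0, 5000, 7500, 10000, 12500, 15000, 20000, 25000,
            30000, 35000, 40000, 50000, 60000, 75000, 100000, 150000),
  upper = c(5000, 7500, 10000, 12500, 15000, 20000, 25000, 30000,
            35000, 40000, 50000, 60000, 75000, 100000, 150000, 200000)
)
income_bounds$midpoint <- (income_bounds$lower + income_bounds$upper) / 2

# Hypothetical helper: codes that are not in the table (e.g. -1) map to NA
recode_midpoint <- function(code) {
  income_bounds$midpoint[match(code, income_bounds$code)]
}

recode_midpoint(c(2, 16, -1))
```

With the haven-imported column, the analogous call would probably be `recode_midpoint(as.numeric(dat$hefaminc))`, so that the value labels are dropped before matching.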


Question 1

The following scatterplot was produced by an analyst interested in exploring the relationship between turnout (measured with variable pes1) and household income (hefaminc).

require(tidyverse)
require(haven)

dat <- read_dta('./cpsnov2012.dta')

dat %>%
  ggplot(aes(x = pes1,y = hefaminc)) + 
  geom_point()

  1. There are many, many things wrong with this figure with regard to both accuracy and style. Name as many as you can.

This scatterplot treats both variables as if they were continuous when neither is. pes1 is a discrete variable, and its raw values include missing-data codes (such as -9) that are plotted as if they were real responses. hefaminc records income categories, not actual dollar amounts. As a result, the points pile up on a small number of positions along vertical lines, which makes it impossible to see the real pattern. Because every dot is solid black with no transparency or jitter, we cannot really tell whether a given spot holds one case or a hundred. On top of that, the axis labels are raw variable names that are not clear to the reader, and there is no title or caption explaining what we are looking at. Since both variables are categorical, a scatterplot is not an effective way to show this relationship; it hides the data instead of revealing it.
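The overplotting problem can be made concrete by counting how many distinct positions the points can occupy. The sketch below uses simulated data with invented values (not the CPS file), and the names `toy`, `n_positions`, and `cases_per_position` are hypothetical.

```r
set.seed(1)

# Toy stand-in for the survey: two discrete variables, 1,000 cases
toy <- data.frame(
  pes1     = sample(c(1, 2, -9), 1000, replace = TRUE),
  hefaminc = sample(1:16, 1000, replace = TRUE)
)

# Number of distinct (pes1, hefaminc) positions a scatterplot can show
n_positions <- nrow(unique(toy))

# Number of cases hidden behind each position
cases_per_position <- as.data.frame(
  table(pes1 = toy$pes1, hefaminc = toy$hefaminc)
)

n_positions
head(cases_per_position[order(-cases_per_position$Freq), ])
```

However many rows the data has, the plot can show at most 3 x 16 = 48 dots here, which is why a summary of each cell is more informative than the raw points.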

  2. Construct a well-designed figure that best displays the relationship between household income and turnout. This will require recoding variables and thinking carefully about the levels at which both variables are measured. Provide the R commands you used to recode variables and construct the figure. In a brief paragraph, explain why you made the choices that you did.

I recoded pes1 into a binary turnout variable where 1 = voted and 0 = did not vote, and set its missing-data codes to NA. I also converted hefaminc from ordinal income categories into numeric midpoints so that it could be treated as an interval-level variable. This allows me to summarize the turnout rate within each income interval and display the rates as a line plot with points and 95% confidence intervals. A scatterplot of the raw values, as shown before, is misleading because both variables are discrete, so the mean turnout within each income group is a better representation of the underlying pattern. The figure shows clearly how turnout increases with income and conveys uncertainty through the error bars, which makes it much easier to interpret than thousands of stacked dots that reveal nothing.

# Recode "pes1" 
dat <- dat %>%
  mutate(turnout = case_when(
    pes1 == 1 ~ 1,                 
    pes1 == 2 ~ 0,                 
    pes1 %in% c(-9, -3, -2) ~ NA_real_,   
    TRUE ~ NA_real_))

# Recode income to the midpoints of each interval
dat <- dat %>%
  mutate(hefaminc_mid = case_when(
    hefaminc == -1 ~ NA_real_,
    hefaminc == 1  ~ 2500,
    hefaminc == 2  ~ 6250,
    hefaminc == 3  ~ 8750,
    hefaminc == 4  ~ 11250,
    hefaminc == 5  ~ 13750,
    hefaminc == 6  ~ 17500,
    hefaminc == 7  ~ 22500,
    hefaminc == 8  ~ 27500,
    hefaminc == 9  ~ 32500,
    hefaminc == 10 ~ 37500,
    hefaminc == 11 ~ 45000,
    hefaminc == 12 ~ 55000,
    hefaminc == 13 ~ 67500,
    hefaminc == 14 ~ 87500,
    hefaminc == 15 ~ 125000,
    hefaminc == 16 ~ 175000,
    TRUE ~ NA_real_))

# Summarize turnout by income  
turnout_by_income <- dat %>%
  filter(!is.na(turnout), !is.na(hefaminc_mid)) %>%
  group_by(hefaminc_mid) %>%
  summarise(
    n = n(),
    turnout_rate = mean(turnout),
    se = sqrt(turnout_rate * (1 - turnout_rate) / n),
    lo = pmax(0, turnout_rate - 1.96 * se),
    hi = pmin(1, turnout_rate + 1.96 * se),
    .groups = "drop")

# Plot turnout vs income 
ggplot(turnout_by_income, aes(x = hefaminc_mid, y = turnout_rate)) +
  geom_line(color = "steelblue", linewidth = 1) +
  geom_point(size = 2.5, color = "steelblue") +
  geom_errorbar(aes(ymin = lo, ymax = hi), width = 0, color = "steelblue") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1)) +
  scale_x_continuous(labels = scales::label_dollar(accuracy = 1)) +
  labs(
    title = "Turnout by Household Income",
    subtitle = "Recoded using income midpoints and binary turnout",
    x = "Household Income (USD midpoint)",
    y = "Proportion who voted (±95% CI)",
    caption = "Notes: Turnout recoded 1=Yes, 0=No; missing values removed; top band midpoint set to $175k."
  ) +
  theme_minimal(base_size = 12)
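The error bars in this figure come from the normal-approximation (Wald) interval for a proportion, which is what the `se`, `lo`, and `hi` columns in the summary step compute. A small worked example for a single hypothetical income group (the counts are invented for illustration):

```r
# Hypothetical group: 240 of 400 respondents report voting
n <- 400
turnout_rate <- 240 / n

# Standard error of a proportion, then a 95% interval clipped to [0, 1]
se <- sqrt(turnout_rate * (1 - turnout_rate) / n)
lo <- max(0, turnout_rate - 1.96 * se)
hi <- min(1, turnout_rate + 1.96 * se)

round(c(rate = turnout_rate, se = se, lo = lo, hi = hi), 3)
```

Because the standard error shrinks with the square root of `n`, income groups with few respondents get visibly wider bars than large ones.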

  3. In a few sentences, describe the relationship you see between income and turnout.

The graph shows a positive relationship between household income and voter turnout. People in lower income brackets are less likely to vote, and participation rises steadily as income increases. The pattern is roughly linear, meaning that each step up in income corresponds to a small, fairly consistent increase in turnout.

Question 2

Which relationship–that between income and turnout, or education and turnout–best approximates a linear relationship? The variable to use for educational attainment is peeduca. Note that it, like hefaminc, is coded at the ordinal level but you want to analyze it as an interval-level variable.

# Recode "peeduca" 
dat_educ <- dat %>%
  mutate(educ_years = case_when(
    peeduca == -1 ~ NA_real_,                 
    peeduca == 31 ~ 0,                        
    peeduca == 32 ~ 3,                        
    peeduca == 33 ~ 6,                        
    peeduca == 34 ~ 8,                        
    peeduca == 35 ~ 9,                        
    peeduca == 36 ~ 10,                       
    peeduca == 37 ~ 11,                     
    peeduca == 38 ~ 12,                       
    peeduca == 39 ~ 12,                      
    peeduca == 40 ~ 14,                     
    peeduca == 41 ~ 14,                
    peeduca == 42 ~ 15,         
    peeduca == 43 ~ 16,                 
    peeduca == 44 ~ 18,           
    peeduca == 45 ~ 19,            
    peeduca == 46 ~ 20,          
    TRUE ~ NA_real_))

# Summarize turnout by education level
turnout_by_educ <- dat_educ %>%
  filter(!is.na(turnout), !is.na(educ_years)) %>%
  group_by(educ_years) %>%
  summarise(
    n = n(),
    turnout_rate = mean(turnout),
    se = sqrt(turnout_rate * (1 - turnout_rate) / n),
    lo = pmax(0, turnout_rate - 1.96 * se),
    hi = pmin(1, turnout_rate + 1.96 * se),
    .groups = "drop")

# Plot turnout vs education
ggplot(turnout_by_educ, aes(x = educ_years, y = turnout_rate)) +
  geom_line(color = "darkred", linewidth = 1) +
  geom_point(size = 2.5, color = "darkred") +
  geom_errorbar(aes(ymin = lo, ymax = hi), width = 0, color = "darkred") +
  scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1)) +
  scale_x_continuous(breaks = seq(0, 20, 2)) +
  labs(
    title = "Turnout by Educational Attainment",
    subtitle = "Recoded using years of schooling and binary turnout",
    x = "Years of Education (approx.)",
    y = "Proportion who voted (±95% CI)",
    caption = "Notes: Turnout recoded 1=Yes, 0=No; missing values removed; education converted to years."
  ) +
  theme_minimal(base_size = 12)

I think the relationship between education and turnout is closer to a straight line than the relationship between the recoded income variable and turnout. In the education graph, turnout keeps rising as people get more schooling, climbing fairly evenly from almost no education through college and graduate school. Turnout rises with income too, but it starts to level off at higher incomes and eventually flattens out. Both variables matter, but education shows the steadier and more even rise in voting: the more years of school, the more likely people are to vote.
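A visual judgment like this one can be backed up by fitting a straight line to the group-level turnout rates and comparing how much of the variation each line explains. The sketch below is one possible approach, not the method used above: `linearity_r2()` is a hypothetical helper, and the two series are invented to mimic a steady rise and a rise that levels off.

```r
# Hypothetical helper: R-squared of a straight-line fit to group-level
# turnout rates, weighting each group by its number of respondents
linearity_r2 <- function(x, rate, n) {
  fit <- lm(rate ~ x, weights = n)
  summary(fit)$r.squared
}

# Invented group-level summaries (not CPS results)
x <- 1:6
n <- rep(100, 6)
steady_rate   <- c(0.38, 0.47, 0.53, 0.62, 0.71, 0.78)  # roughly even steps
leveling_rate <- 0.30 + 0.25 * log(x)                    # flattens at the top

r2_steady   <- linearity_r2(x, steady_rate, n)
r2_leveling <- linearity_r2(x, leveling_rate, n)

c(steady = r2_steady, leveling = r2_leveling)
```

Applied to the real summaries, the analogous calls would take the `hefaminc_mid` or `educ_years` column as `x`, with `turnout_rate` and `n` from the same table; the relationship with the higher value is the one better approximated by a line.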

Question 3

You want to investigate the relationship between country of birth and current income.

You wish to divide the sample into three groups: (1) those not born in the U.S.; (2) those born in the U.S. but who have at least one parent not born in the U.S.; and (3) those born in the U.S. with both parents born in the U.S. Using the variables hefaminc, penatvty, pemntvty, and pefntvty, construct a boxplot which displays the distribution of household income for each of these three groups.

Hint: doing this will require creating new variables from penatvty, pemntvty, and pefntvty. You will also need to make choices about how to deal with missing values. Justify any choices you make in a note accompanying the figure.

dat_nat <- dat %>%
  mutate(hefaminc_mid = case_when(
    hefaminc == -1 ~ NA_real_,
    hefaminc == 1  ~ 2500,
    hefaminc == 2  ~ 6250,
    hefaminc == 3  ~ 8750,
    hefaminc == 4  ~ 11250,
    hefaminc == 5  ~ 13750,
    hefaminc == 6  ~ 17500,
    hefaminc == 7  ~ 22500,
    hefaminc == 8  ~ 27500,
    hefaminc == 9  ~ 32500,
    hefaminc == 10 ~ 37500,
    hefaminc == 11 ~ 45000,
    hefaminc == 12 ~ 55000,
    hefaminc == 13 ~ 67500,
    hefaminc == 14 ~ 87500,
    hefaminc == 15 ~ 125000,
    hefaminc == 16 ~ 175000,
    TRUE ~ NA_real_))

# Recode nativity groups (continuing from the income recode above)
dat_nat <- dat_nat %>%
  mutate(
    nativity_group = case_when(
      penatvty == -1 ~ NA_character_,
      penatvty != 57 ~ "Not born in US",
      # A missing parental birthplace (-1) becomes NA instead of counting as foreign born
      pemntvty == -1 | pefntvty == -1 ~ NA_character_,
      pemntvty != 57 | pefntvty != 57 ~ "US born, at least one parent foreign born",
      pemntvty == 57 & pefntvty == 57 ~ "US born, parents US born",
      TRUE ~ NA_character_))

# Drop NA's and invalid income values
dat_nat <- dat_nat %>%
  filter(!is.na(hefaminc_mid), !is.na(nativity_group))

# Boxplot 
ggplot(dat_nat, aes(x = nativity_group, y = hefaminc_mid)) +
  geom_boxplot(fill = "orange", color = "black", outlier.alpha = 0.3) +
  scale_y_continuous(labels = scales::label_dollar()) +
  labs(
    title = "Household Income by Nativity Status",
    subtitle = "Respondent and parental birthplace recoded into three groups",
    x = "Nativity Group",
    y = "Household Income (USD midpoint)",
    caption = "Notes: NA's dropped; open ended top income (150k+) coded as 175k midpoint.") +
  theme_minimal(base_size = 12)

The boxplot shows that people born in the US to US-born parents tend to have higher household incomes, while those born outside the US tend to have somewhat lower ones. The gap is modest, but the pattern is visible. I dropped cases with missing birthplace values and assigned the open-ended top income category ($150k and over) a value of $175k, as noted in the caption, so that the top category would not be cut off at its lower bound.
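How missing parental birthplaces are handled depends on the order in which the conditions are checked. The sketch below shows one way to make that choice explicit in base R. It relies on the codes used in this write-up (57 for the United States, -1 for missing); `classify_nativity()` is a hypothetical helper, and 100 is an invented stand-in for any non-US country code.

```r
# Hypothetical helper: three nativity groups, with missing codes mapped to NA
classify_nativity <- function(self, mother, father,
                              us_code = 57, missing_code = -1) {
  out <- rep(NA_character_, length(self))

  self_known    <- self != missing_code
  foreign_born  <- self_known & self != us_code
  native_born   <- self_known & self == us_code
  parents_known <- mother != missing_code & father != missing_code

  out[foreign_born] <- "Not born in US"
  out[native_born & parents_known &
        (mother != us_code | father != us_code)] <-
    "US born, at least one parent foreign born"
  out[native_born & parents_known &
        mother == us_code & father == us_code] <-
    "US born, parents US born"
  out
}

# Toy cases; 100 stands in for any non-US country code
groups <- classify_nativity(
  self   = c(57, 57, 100, 57, -1),
  mother = c(57, 100, 100, -1, 57),
  father = c(57, 57, 100, 57, 57)
)
groups
```

The fourth toy case, a US-born respondent with an unknown mother's birthplace, ends up as NA here rather than in the foreign-parent group.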

Using the proper statistical tests with an \(\alpha =.05\), answer the following questions. Note that additional recoding may be necessary.

  1. Do native-born Americans have higher incomes than non-native born Americans?
# Recode native born vs foreign born
dat_a <- dat %>%
  mutate(
    native = case_when(
      penatvty == -1 ~ NA_character_,
      penatvty == 57 ~ "Native-born",
      TRUE ~ "Foreign-born")) %>%
  filter(!is.na(native), !is.na(hefaminc_mid))

# Run t-test
t_test_income <- t.test(hefaminc_mid ~ native, data = dat_a)

t_test_income
## 
##  Welch Two Sample t-test
## 
## data:  hefaminc_mid by native
## t = -19.665, df = 21245, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Foreign-born and group Native-born is not equal to 0
## 95 percent confidence interval:
##  -8760.336 -7172.259
## sample estimates:
## mean in group Foreign-born  mean in group Native-born 
##                   58407.12                   66373.42
dat_a %>%
  group_by(native) %>%
  summarise(
    mean_income = mean(hefaminc_mid),
    median_income = median(hefaminc_mid),
    n = n())
## # A tibble: 2 × 4
##   native       mean_income median_income      n
##   <chr>              <dbl>         <dbl>  <int>
## 1 Foreign-born      58407.         45000  16314
## 2 Native-born       66373.         55000 117113

Native-born Americans have higher household incomes, on average, than people born outside the US, and the difference is statistically significant (p < .001). Native-born respondents report about $8,000 more in mean household income ($66,373 vs. $58,407), so we reject the null hypothesis that the two groups have the same mean income.
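The question is directional (are native-born incomes higher?), while `t.test()` is two-sided by default. The sketch below shows how a one-sided Welch test could be requested. It runs on simulated data with invented means and spreads, not on the CPS file, and `toy`, `one_sided`, and `two_sided` are hypothetical names.

```r
set.seed(42)

# Simulated incomes; group means and spreads are invented for illustration
toy <- data.frame(
  income = c(rnorm(500, mean = 58000, sd = 20000),
             rnorm(500, mean = 66000, sd = 20000)),
  native = factor(rep(c("Foreign-born", "Native-born"), each = 500),
                  levels = c("Foreign-born", "Native-born"))
)

# With a formula, t.test() compares the first factor level to the second, so
# alternative = "less" tests: mean(Foreign-born) - mean(Native-born) < 0
one_sided <- t.test(income ~ native, data = toy, alternative = "less")
two_sided <- t.test(income ~ native, data = toy)

c(one_sided = one_sided$p.value, two_sided = two_sided$p.value)
```

The test statistic is the same in both versions; only the tail area used for the p-value changes, so setting the factor levels explicitly matters for getting the intended direction.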

  2. Do native-born Americans whose parents were born in the U.S. have higher incomes than native-born Americans with at least one parent not born in the U.S.?
# Filter to only native born respondents 
dat_b <- dat %>%
  filter(penatvty == 57) %>%
  mutate(
    parent_origin = case_when(
      pemntvty == -1 | pefntvty == -1 ~ NA_character_,  
      pemntvty == 57 & pefntvty == 57 ~ "Both parents USborn",
      pemntvty != 57 | pefntvty != 57 ~ "≥1 parent foreign born",
      TRUE ~ NA_character_)) %>%
  filter(!is.na(parent_origin), !is.na(hefaminc_mid))

# Run t-test comparing income by parental birthplace 
t_parents <- t.test(hefaminc_mid ~ parent_origin, data = dat_b)

t_parents
## 
##  Welch Two Sample t-test
## 
## data:  hefaminc_mid by parent_origin
## t = -11.206, df = 17690, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group ≥1 parent foreign born and group Both parents USborn is not equal to 0
## 95 percent confidence interval:
##  -5879.273 -4128.737
## sample estimates:
## mean in group ≥1 parent foreign born 
##                             61961.15 
##    mean in group Both parents USborn 
##                             66965.16
dat_b %>%
  group_by(parent_origin) %>%
  summarise(
    mean_income = mean(hefaminc_mid),
    median_income = median(hefaminc_mid),
    n = n())
## # A tibble: 2 × 4
##   parent_origin          mean_income median_income      n
##   <chr>                        <dbl>         <dbl>  <int>
## 1 Both parents USborn         66965.         55000 103264
## 2 ≥1 parent foreign born      61961.         45000  13849

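People born in the US whose parents were also born in the US have about $5,000 more in mean household income than those with at least one foreign-born parent ($66,965 vs. $61,961). The difference is statistically significant (p < .001), so at α = .05 we reject the null hypothesis that the two groups have the same mean income.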

  3. Do Americans with a foreign-born father and a native-born mother have lower incomes than Americans with a foreign-born mother and a native-born father?
# Filter respondents with one foreign born and one US born parent 
dat_c <- dat %>%
  filter(
    pemntvty %in% c(57, -1, 60:999),   
    pefntvty %in% c(57, -1, 60:999)) %>%
  mutate(
    mixed_parent = case_when(
      pemntvty == 57 & pefntvty != 57 ~ "Foreign born father, US born mother",
      pefntvty == 57 & pemntvty != 57 ~ "Foreign born mother, US born father",
      TRUE ~ NA_character_)) %>%
  filter(!is.na(mixed_parent), !is.na(hefaminc_mid))

# Run t-test
t_mixed <- t.test(hefaminc_mid ~ mixed_parent, data = dat_c)

t_mixed
## 
##  Welch Two Sample t-test
## 
## data:  hefaminc_mid by mixed_parent
## t = -6.0757, df = 6107.9, p-value = 1.309e-09
## alternative hypothesis: true difference in means between group Foreign born father, US born mother and group Foreign born mother, US born father is not equal to 0
## 95 percent confidence interval:
##  -10342.719  -5296.629
## sample estimates:
## mean in group Foreign born father, US born mother 
##                                          65455.90 
## mean in group Foreign born mother, US born father 
##                                          73275.57
dat_c %>%
  group_by(mixed_parent) %>%
  summarise(
    mean_income = mean(hefaminc_mid),
    median_income = median(hefaminc_mid),
    n = n())
## # A tibble: 2 × 4
##   mixed_parent               mean_income median_income     n
##   <chr>                            <dbl>         <dbl> <int>
## 1 Foreign born father, US b…      65456.         55000  3186
## 2 Foreign born mother, US b…      73276.         67500  3001

Yes. Individuals with a foreign-born father and a US-born mother have a lower mean household income ($65,455.90) than those with a foreign-born mother and a US-born father ($73,275.57). The difference is statistically significant (p < .001), so at α = .05 we reject the null hypothesis of equal means and conclude that Americans with a foreign-born father and a native-born mother tend to have lower incomes.

In a few sentences, describe your results.

The t-tests all point the same way: being born in the US and having US-born parents is associated with higher household income. Native-born respondents have about $8,000 more in mean household income than respondents born abroad, and among the native born, those with at least one foreign-born parent have about $5,000 less than those with two US-born parents. Among respondents with one foreign-born and one US-born parent, the group with a foreign-born father has about $7,800 less than the group with a foreign-born mother.
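To compare the size of the three gaps side by side, the group means printed in the t-test output above can be collected into one small table. This is only a convenience sketch: the means are copied from that output, and `results` is a hypothetical name.

```r
# Group means copied from the three t-test outputs above
results <- data.frame(
  comparison = c("Foreign-born vs. native-born",
                 "At least one foreign-born parent vs. both parents US-born",
                 "Foreign-born father vs. foreign-born mother"),
  lower_mean  = c(58407.12, 61961.15, 65455.90),
  higher_mean = c(66373.42, 66965.16, 73275.57)
)

# Difference in mean household income for each comparison
results$difference <- results$higher_mean - results$lower_mean
results
```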