For this assignment, you will use the dataset
cpsnov2012.dta. This is the November 2012 Current
Population Survey, conducted by the U.S. Census Bureau with a nationally
representative sample of U.S. households. The dataset and its codebook
cpsnov2012.pdf are both posted on Brightspace.
Note: Many of these questions will require you to analyze
the CPS’s household income variable (hefaminc). It is coded
at the ordinal level, but the analyses require that it be at an interval
level. To do this, create a recoded version of hefaminc in
which each case is assigned a household income equal to the midpoint of
its interval of hefaminc. For example, a household with
income in the range of $5,000-$7,500 should be assigned the value
$6,250, and so on. Before proceeding, make sure to inspect
hefaminc and consult the codebook to identify its
categories and missing codes.
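As a minimal sketch of the midpoint idea, a named lookup vector can map each category code to a dollar value. The codes and values here are a toy subset for illustration (the vector `codes` is made up, and `-1` stands in for a missing code); the real categories should be taken from the codebook.

```r
# Hypothetical lookup: names are hefaminc codes, values are interval midpoints (USD).
# Only three categories are shown; the codebook lists the full set.
midpoints <- c("1" = 2500, "2" = 6250, "3" = 8750)

# Toy vector of category codes; -1 stands in for a missing code
codes <- c(1, 2, 3, -1, 2)

# Codes without an entry in the lookup (such as -1) become NA
income_mid <- unname(midpoints[as.character(codes)])
income_mid
```

Any code that is not in the lookup falls through to NA, so missing codes do not need to be listed one by one.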
The following scatterplot was produced by an analyst interested in
exploring the relationship between turnout (measured with variable
pes1) and household income (hefaminc).
library(tidyverse)
library(haven)
library(scales) # percent_format() and label_dollar() for axis labels
dat <- read_dta('./cpsnov2012.dta')
dat %>%
ggplot(aes(x = pes1,y = hefaminc)) +
geom_point()
This scatterplot treats both variables as if they were continuous when they are not. pes1 takes only a handful of discrete values, including negative codes for missing responses (such as -9), and hefaminc records income categories, not actual income. As a result, all the points stack up in vertical lines, which makes it impossible to see the real pattern. Because the dots are solid black with no transparency or jitter, we cannot tell whether a given spot holds one case or a hundred. The axis labels are also unclear to the reader, and there is no title or caption explaining what we are looking at. Since both variables are categorical, a scatterplot is not an effective way to show this relationship: it hides the data instead of revealing it.
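A quick way to quantify the overplotting is to count how many cases share each plotted position. The sketch below uses a toy data frame in place of the CPS file (all values are made up) and base R's `table()`.

```r
# Toy stand-in for the CPS data: six cases, two categorical variables
toy <- data.frame(
  pes1     = c(1, 1, 1, 2, 2, -9),
  hefaminc = c(5, 5, 5, 5, 7, 5)
)

# Cross-tabulate, then keep only positions that actually occur
overlap <- as.data.frame(table(pes1 = toy$pes1, hefaminc = toy$hefaminc))
overlap <- overlap[overlap$Freq > 0, ]

# Each row is one visible dot; Freq is the number of cases hidden behind it
overlap
```

If the number of rows is far smaller than the number of cases, most observations are hidden behind other points.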
R commands you used to
recode variables and construct the figure. In a brief paragraph, explain
why you made the choices that you did.
I recoded pes1 into a binary turnout variable where 1 = voted and 0 = did not vote, with all other codes set to missing. I also converted hefaminc from ordinal income categories into numeric midpoints so that it could be treated as an interval-level variable. This lets me summarize the turnout rate within each income interval and plot it as a line with points and 95% confidence intervals. A scatterplot of the raw values, as shown before, is misleading because both variables are discrete, so the mean turnout within each income group better represents the underlying pattern. The figure shows how turnout increases with income while conveying uncertainty through the error bars, which makes it easier to interpret than thousands of stacked dots.
# Recode "pes1"
dat <- dat %>%
mutate(turnout = case_when(
pes1 == 1 ~ 1,
pes1 == 2 ~ 0,
pes1 %in% c(-9, -3, -2) ~ NA_real_,
TRUE ~ NA_real_))
# Recode income to the midpoints of each interval
dat <- dat %>%
mutate(hefaminc_mid = case_when(
hefaminc == -1 ~ NA_real_,
hefaminc == 1 ~ 2500,
hefaminc == 2 ~ 6250,
hefaminc == 3 ~ 8750,
hefaminc == 4 ~ 11250,
hefaminc == 5 ~ 13750,
hefaminc == 6 ~ 17500,
hefaminc == 7 ~ 22500,
hefaminc == 8 ~ 27500,
hefaminc == 9 ~ 32500,
hefaminc == 10 ~ 37500,
hefaminc == 11 ~ 45000,
hefaminc == 12 ~ 55000,
hefaminc == 13 ~ 67500,
hefaminc == 14 ~ 87500,
hefaminc == 15 ~ 125000,
hefaminc == 16 ~ 175000,
TRUE ~ NA_real_))
# Summarize turnout by income
turnout_by_income <- dat %>%
filter(!is.na(turnout), !is.na(hefaminc_mid)) %>%
group_by(hefaminc_mid) %>%
summarise(
n = n(),
turnout_rate = mean(turnout),
se = sqrt(turnout_rate * (1 - turnout_rate) / n),
lo = pmax(0, turnout_rate - 1.96 * se),
hi = pmin(1, turnout_rate + 1.96 * se),
.groups = "drop")
# Plot turnout vs income
ggplot(turnout_by_income, aes(x = hefaminc_mid, y = turnout_rate)) +
geom_line(color = "steelblue", linewidth = 1) +
geom_point(size = 2.5, color = "steelblue") +
geom_errorbar(aes(ymin = lo, ymax = hi), width = 0, color = "steelblue") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1)) +
scale_x_continuous(labels = scales::label_dollar(accuracy = 1)) +
labs(
title = "Turnout by Household Income",
subtitle = "Recoded using income midpoints and binary turnout",
x = "Household Income (USD midpoint)",
y = "Proportion who voted (±95% CI)",
caption = "Notes: Turnout recoded 1=Yes, 0=No; missing values removed; top band midpoint set to $175k."
) +
theme_minimal(base_size = 12)
The graph shows a positive relationship between household income and voter turnout. People in lower income brackets are less likely to vote, and participation rises steadily as income increases. The pattern is roughly linear: each step up in income corresponds to a small, fairly consistent increase in turnout.
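The error bars in the figure rest on a normal-approximation (Wald) interval for a proportion. As a minimal sketch, the same arithmetic for a single income group looks like this; the group size and number of voters are hypothetical, not taken from the CPS data.

```r
# Hypothetical counts for one income group (not from the CPS file)
n_group <- 400
n_voted <- 260

turnout_rate <- n_voted / n_group
se <- sqrt(turnout_rate * (1 - turnout_rate) / n_group)

# Clamp the interval to [0, 1], as in the summary code above
lo <- max(0, turnout_rate - 1.96 * se)
hi <- min(1, turnout_rate + 1.96 * se)

round(c(rate = turnout_rate, lo = lo, hi = hi), 3)
```

This approximation tends to be least reliable for small groups or for rates close to 0 or 1.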
Which relationship–that between income and turnout, or education and
turnout–best approximates a linear relationship? The variable to use for
educational attainment is peeduca. Note that it, like
hefaminc, is coded at the ordinal level but you want to
analyze it as an interval-level variable.
# Recode "peeduca"
dat_educ <- dat %>%
mutate(educ_years = case_when(
peeduca == -1 ~ NA_real_,
peeduca == 31 ~ 0,
peeduca == 32 ~ 3,
peeduca == 33 ~ 6,
peeduca == 34 ~ 8,
peeduca == 35 ~ 9,
peeduca == 36 ~ 10,
peeduca == 37 ~ 11,
peeduca == 38 ~ 12,
peeduca == 39 ~ 12,
peeduca == 40 ~ 14,
peeduca == 41 ~ 14,
peeduca == 42 ~ 15,
peeduca == 43 ~ 16,
peeduca == 44 ~ 18,
peeduca == 45 ~ 19,
peeduca == 46 ~ 20,
TRUE ~ NA_real_))
# Summarize turnout by education level
turnout_by_educ <- dat_educ %>%
filter(!is.na(turnout), !is.na(educ_years)) %>%
group_by(educ_years) %>%
summarise(
n = n(),
turnout_rate = mean(turnout),
se = sqrt(turnout_rate * (1 - turnout_rate) / n),
lo = pmax(0, turnout_rate - 1.96 * se),
hi = pmin(1, turnout_rate + 1.96 * se),
.groups = "drop")
# Plot turnout vs education
ggplot(turnout_by_educ, aes(x = educ_years, y = turnout_rate)) +
geom_line(color = "darkred", linewidth = 1) +
geom_point(size = 2.5, color = "darkred") +
geom_errorbar(aes(ymin = lo, ymax = hi), width = 0, color = "darkred") +
scale_y_continuous(labels = scales::percent_format(accuracy = 1), limits = c(0, 1)) +
scale_x_continuous(breaks = seq(0, 20, 2)) +
labs(
title = "Turnout by Educational Attainment",
subtitle = "Recoded using years of schooling and binary turnout",
x = "Years of Education (approx.)",
y = "Proportion who voted (±95% CI)",
caption = "Notes: Turnout recoded 1=Yes, 0=No; missing values removed; education converted to years."
) +
theme_minimal(base_size = 12)
I think the relationship between education and turnout is closer to a straight line than the relationship between the recoded income variable and turnout. In the education graph, turnout keeps rising as people get more schooling, a fairly smooth climb from almost no education through college and graduate school. The income graph rises too, but it starts to level off at higher incomes and eventually flattens out. Both variables matter, but education shows a steadier, more even rise in turnout: the more years of schooling, the more likely people are to vote.
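One rough way to put a number on "closer to a straight line" is to compare R² from a straight-line fit to the group-level turnout rates. The two data frames below are made up to mimic a steady rise and a rise that levels off; they are not the CPS results, and `r2()` is a hypothetical helper.

```r
# Made-up group-level rates: one rises steadily, one levels off
steady_toy   <- data.frame(x = 1:6, rate = c(0.31, 0.40, 0.49, 0.61, 0.70, 0.79))
leveling_toy <- data.frame(x = 1:6, rate = c(0.30, 0.55, 0.68, 0.74, 0.77, 0.78))

# Hypothetical helper: R-squared from a simple linear fit of rate on x
r2 <- function(d) summary(lm(rate ~ x, data = d))$r.squared

r2_steady   <- r2(steady_toy)
r2_leveling <- r2(leveling_toy)

# The series that levels off is fit less well by a straight line
c(steady = r2_steady, leveling = r2_leveling)
```

With the real summaries, the analogous fits would use turnout_rate against hefaminc_mid and against educ_years. Weighting by group size n may be worth considering, since the groups differ in size.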
You want to investigate the relationship between country of birth and current income.
You wish to divide the sample into three groups: (1) those not born
in the U.S.; (2) those born in the U.S. but who have at least one parent
not born in the U.S.; and (3) those born in the U.S. with both parents
born in the U.S. Using the variables hefaminc,
penatvty, pemntvty, and pefntvty,
construct a boxplot which displays the distribution of household income
for each of these three groups.
Hint: doing this will require creating new variables from
penatvty, pemntvty, and pefntvty.
You will also need to make choices about how to deal with missing
values. Justify any choices you make in a note accompanying the
figure.
dat_nat <- dat %>%
mutate(hefaminc_mid = case_when(
hefaminc == -1 ~ NA_real_,
hefaminc == 1 ~ 2500,
hefaminc == 2 ~ 6250,
hefaminc == 3 ~ 8750,
hefaminc == 4 ~ 11250,
hefaminc == 5 ~ 13750,
hefaminc == 6 ~ 17500,
hefaminc == 7 ~ 22500,
hefaminc == 8 ~ 27500,
hefaminc == 9 ~ 32500,
hefaminc == 10 ~ 37500,
hefaminc == 11 ~ 45000,
hefaminc == 12 ~ 55000,
hefaminc == 13 ~ 67500,
hefaminc == 14 ~ 87500,
hefaminc == 15 ~ 125000,
hefaminc == 16 ~ 175000,
TRUE ~ NA_real_))
# Recode nativity groups
dat_nat <- dat_nat %>%
mutate(
nativity_group = case_when(
penatvty == -1 ~ NA_character_,
penatvty != 57 ~ "Not born in US",
# Treat a missing parental birthplace as missing, not as foreign born
penatvty == 57 & (pemntvty == -1 | pefntvty == -1) ~ NA_character_,
penatvty == 57 & (pemntvty != 57 | pefntvty != 57) ~ "US born, ≥1 parent foreign born",
penatvty == 57 & pemntvty == 57 & pefntvty == 57 ~ "US born, parents US born",
TRUE ~ NA_character_))
# Drop NA's and invalid income values
dat_nat <- dat_nat %>%
filter(!is.na(hefaminc_mid), !is.na(nativity_group))
# Boxplot
ggplot(dat_nat, aes(x = nativity_group, y = hefaminc_mid)) +
geom_boxplot(fill = "orange", color = "black", outlier.alpha = 0.3) +
scale_y_continuous(labels = scales::label_dollar()) +
labs(
title = "Household Income by Nativity Status",
subtitle = "Respondent and parental birthplace recoded into three groups",
x = "Nativity Group",
y = "Household Income (USD midpoint)",
caption = "Notes: NA's dropped; open ended top income (150k+) coded as 175k midpoint.") +
theme_minimal(base_size = 12)
The boxplot shows that people born in the US to US-born parents tend to have higher household incomes, while those born outside the US tend to have somewhat lower incomes. The difference is not large, but the trend is there. I dropped cases with missing birthplace or income values and assigned $175,000 to the open-ended top income group ($150,000 and above) so that the distribution would not look artificially cut off at $150,000.
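One way to see how much the value chosen for the open-ended top band matters is to recompute summaries under different choices. The sketch below uses a small made-up income vector (not CPS data) and a hypothetical helper `recode_top()`. In this toy case the median, which the boxplot draws, does not move, while the mean does.

```r
# Toy incomes: five known midpoints plus two households in the open-ended top band
known <- c(45000, 55000, 67500, 87500, 125000)
n_top <- 2 # hypothetical number of top-band households

# Hypothetical helper: build the income vector for a given top-band value
recode_top <- function(top_value) c(known, rep(top_value, n_top))

inc_175 <- recode_top(175000)
inc_200 <- recode_top(200000)

# Compare the median and mean under the two top-band choices
c(median_175 = median(inc_175), median_200 = median(inc_200))
c(mean_175 = mean(inc_175), mean_200 = mean(inc_200))
```

This suggests the boxes themselves are fairly robust to the top-band value, whereas any comparison of means depends on it.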
Using the proper statistical tests with an \(\alpha =.05\), answer the following questions. Note that additional recoding may be necessary.
# Recode native born vs foreign born
dat_a <- dat %>%
mutate(
native = case_when(
penatvty == -1 ~ NA_character_,
penatvty == 57 ~ "Native-born",
TRUE ~ "Foreign-born")) %>%
filter(!is.na(native), !is.na(hefaminc_mid))
# Run t-test
t_test_income <- t.test(hefaminc_mid ~ native, data = dat_a)
t_test_income
##
## Welch Two Sample t-test
##
## data: hefaminc_mid by native
## t = -19.665, df = 21245, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group Foreign-born and group Native-born is not equal to 0
## 95 percent confidence interval:
## -8760.336 -7172.259
## sample estimates:
## mean in group Foreign-born mean in group Native-born
## 58407.12 66373.42
dat_a %>%
group_by(native) %>%
summarise(
mean_income = mean(hefaminc_mid),
median_income = median(hefaminc_mid),
n = n())
## # A tibble: 2 × 4
## native mean_income median_income n
## <chr> <dbl> <dbl> <int>
## 1 Foreign-born 58407. 45000 16314
## 2 Native-born 66373. 55000 117113
Native-born Americans have higher household incomes, on average, than people born outside the US, and the difference is statistically significant (t = -19.67, p < .001). The mean household income of native-born respondents is about $8,000 higher ($66,373 vs. $58,407), so we reject the null hypothesis that the two groups have the same mean household income.
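For reference, the Welch statistic that `t.test()` reports by default can be reproduced from the group means, variances, and sizes. The two vectors below are toy values (hypothetical incomes in thousands of dollars), not the CPS data.

```r
# Toy incomes in thousands of dollars (made up for illustration)
a <- c(40, 55, 60, 72, 90)
b <- c(35, 42, 50, 58)

# Squared standard errors of each group mean
se2_a <- var(a) / length(a)
se2_b <- var(b) / length(b)

# Welch t statistic and Welch-Satterthwaite degrees of freedom
welch_t  <- (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
welch_df <- (se2_a + se2_b)^2 /
  (se2_a^2 / (length(a) - 1) + se2_b^2 / (length(b) - 1))

# t.test() defaults to the Welch (unequal variances) version
fit <- t.test(a, b)
c(by_hand = welch_t, t_test = unname(fit$statistic))
c(by_hand = welch_df, t_test = unname(fit$parameter))
```

Because the test does not pool the variances, the degrees of freedom are generally not a whole number, which is why the outputs in this section show values such as 6107.9.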
# Filter to only native born respondents
dat_b <- dat %>%
filter(penatvty == 57) %>%
mutate(
parent_origin = case_when(
pemntvty == -1 | pefntvty == -1 ~ NA_character_,
pemntvty == 57 & pefntvty == 57 ~ "Both parents USborn",
pemntvty != 57 | pefntvty != 57 ~ "≥1 parent foreign born",
TRUE ~ NA_character_)) %>%
filter(!is.na(parent_origin), !is.na(hefaminc_mid))
# Run t-test comparing income by parental birthplace
t_parents <- t.test(hefaminc_mid ~ parent_origin, data = dat_b)
t_parents
##
## Welch Two Sample t-test
##
## data: hefaminc_mid by parent_origin
## t = -11.206, df = 17690, p-value < 2.2e-16
## alternative hypothesis: true difference in means between group ≥1 parent foreign born and group Both parents USborn is not equal to 0
## 95 percent confidence interval:
## -5879.273 -4128.737
## sample estimates:
## mean in group ≥1 parent foreign born
## 61961.15
## mean in group Both parents USborn
## 66965.16
dat_b %>%
group_by(parent_origin) %>%
summarise(
mean_income = mean(hefaminc_mid),
median_income = median(hefaminc_mid),
n = n())
## # A tibble: 2 × 4
## parent_origin mean_income median_income n
## <chr> <dbl> <dbl> <int>
## 1 Both parents USborn 66965. 55000 103264
## 2 ≥1 parent foreign born 61961. 45000 13849
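Among people born in the US, those whose parents were both born in the US have higher household incomes on average than those with at least one foreign-born parent ($66,965 vs. $61,961, a difference of about $5,000). The difference is statistically significant (p < .001), so it is unlikely to reflect sampling variation alone.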
# Filter respondents with one foreign born and one US born parent
dat_c <- dat %>%
filter(
pemntvty %in% c(57, -1, 60:999),
pefntvty %in% c(57, -1, 60:999)) %>%
mutate(
mixed_parent = case_when(
pemntvty == 57 & pefntvty != 57 ~ "Foreign born father, US born mother",
pefntvty == 57 & pemntvty != 57 ~ "Foreign born mother, US born father",
TRUE ~ NA_character_)) %>%
filter(!is.na(mixed_parent), !is.na(hefaminc_mid))
# Run t-test
t_mixed <- t.test(hefaminc_mid ~ mixed_parent, data = dat_c)
t_mixed
##
## Welch Two Sample t-test
##
## data: hefaminc_mid by mixed_parent
## t = -6.0757, df = 6107.9, p-value = 1.309e-09
## alternative hypothesis: true difference in means between group Foreign born father, US born mother and group Foreign born mother, US born father is not equal to 0
## 95 percent confidence interval:
## -10342.719 -5296.629
## sample estimates:
## mean in group Foreign born father, US born mother
## 65455.90
## mean in group Foreign born mother, US born father
## 73275.57
dat_c %>%
group_by(mixed_parent) %>%
summarise(
mean_income = mean(hefaminc_mid),
median_income = median(hefaminc_mid),
n = n())
## # A tibble: 2 × 4
## mixed_parent mean_income median_income n
## <chr> <dbl> <dbl> <int>
## 1 Foreign born father, US b… 65456. 55000 3186
## 2 Foreign born mother, US b… 73276. 67500 3001
Yes. According to the data, individuals with a foreign-born father and a US-born mother have a lower mean household income ($65,455.90) than those with a foreign-born mother and a US-born father ($73,275.57). At α = .05, we reject the null hypothesis of equal means (p < .001) and conclude that respondents with a foreign-born father tend to have lower household incomes than those with a foreign-born mother.
In a few sentences, describe your results.
The t-tests all point the same way: being born in the US and having US-born parents is associated with higher household income. Native-born respondents have higher incomes than those born abroad (by about $8,000 on average), and among the native born, those with at least one foreign-born parent have somewhat lower incomes than those with two US-born parents (by about $5,000). Among respondents with one US-born and one foreign-born parent, those with a foreign-born father have lower incomes than those with a foreign-born mother (by about $7,800).