Once again, we start by defining our exercise setup, loading packages
and data. We are using the include: true argument to
specify that these blocks should remain visible in our final output for
informational purposes at the start of the course, but as the semester
progresses we will start to hide these blocks in our final output.
We load the same packages as we did in our last exercise.
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(readxl)
Our dataset this time comes from the Chetty study. You were introduced to Chetty in the reading for today’s class, via the Executive Summary.
This code creates a new data frame named “chetty” by reading in the data from another Excel file.
Also note the use of skip=1 within our
read_excel() function. Can you determine why this argument
is necessary?
chetty <- read_excel("data/Lab1_Chetty_2014.xlsx", skip=1)
Once again, use glimpse() or just invoke the name of the
data frame to inspect it:
glimpse(chetty)
## Rows: 741
## Columns: 48
## $ cz <dbl> 100, 200, 301, 302, 401, 402, 500, 601, 602, 700,…
## $ cz_name <chr> "Johnson City", "Morristown", "Middlesborough", "…
## $ state <chr> "TN", "TN", "TN", "TN", "NC", "VA", "NC", "NC", "…
## $ region <chr> "South", "South", "South", "South", "South", "Sou…
## $ division <chr> "East South Central", "East South Central", "East…
## $ pop_2000 <dbl> 576081, 227816, 66708, 727600, 493180, 92753, 105…
## $ top_100 <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ urban <chr> "urban", "urban", "rural", "urban", "urban", "rur…
## $ rel_mobility <dbl> 0.37170, 0.35611, 0.37557, 0.37253, 0.38913, 0.42…
## $ abs_mobility <dbl> 38.38750, 37.77675, 39.04925, 37.84125, 36.96925,…
## $ prob_q1q5 <dbl> 0.06219881, 0.05365194, 0.07263514, 0.05628121, 0…
## $ frac_black <dbl> 0.0208408, 0.0197791, 0.0146459, 0.0563648, 0.173…
## $ racial_seg <dbl> 0.0903837, 0.0931530, 0.0642501, 0.2099943, 0.262…
## $ income_seg <dbl> 0.0348657, 0.0262809, 0.0240811, 0.0921036, 0.071…
## $ poverty_seg <dbl> 0.0301532, 0.0278582, 0.0146831, 0.0842674, 0.061…
## $ affluence_seg <dbl> 0.0382395, 0.0253348, 0.0258143, 0.1019656, 0.080…
## $ frac_commute_under15 <dbl> 0.3252578, 0.2764278, 0.3585358, 0.2685696, 0.291…
## $ hhi_percap <dbl> 31559.77, 29958.93, 22328.48, 35884.29, 38891.75,…
## $ gini <dbl> 0.46804, 0.43459, 0.44095, 0.50832, 0.46553, 0.44…
## $ top1inc_share <dbl> 13.459000, 10.631000, 10.691000, 15.080000, 11.91…
## $ gini_bottom99 <dbl> 0.33345, 0.32828, 0.33404, 0.35752, 0.34636, 0.33…
## $ frac_p25top75 <dbl> 0.54796, 0.53750, 0.46685, 0.50410, 0.49991, 0.53…
## $ taxrate <dbl> 0.0203924, 0.0234471, 0.0153799, 0.0188704, 0.017…
## $ gov_exp <dbl> 1886.148, 2004.337, 1189.820, 2356.851, 1891.450,…
## $ inctax_progressive <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1…
## $ eitc <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ sch_exp <dbl> 5.184547, 4.505886, 5.614119, 4.899846, 5.462676,…
## $ tchr_ratio <dbl> NA, NA, 15.08494, NA, 15.38528, NA, 16.67801, 16.…
## $ test_pctile <dbl> 2.7283790, -3.4002740, -9.3150620, -6.0318310, -2…
## $ hs_dropout <dbl> -0.0152656, -0.0235207, -0.0046291, -0.0110711, 0…
## $ colleges_percap <dbl> 0.0138869, 0.0087790, 0.0449721, 0.0109951, 0.014…
## $ college_tuition <dbl> 4816.820, 4762.230, 11840.400, 3480.440, 9715.330…
## $ college_gradrate <dbl> -0.0024312, -0.1011827, 0.1112985, -0.0238261, 0.…
## $ labor_participation <dbl> 0.5873474, 0.6249742, 0.4789631, 0.6148306, 0.656…
## $ manufacturing <dbl> 0.2373972, 0.2377556, 0.2335314, 0.1455054, 0.215…
## $ growth_imports <dbl> 5.2937860, 3.0304790, 2.0625960, 1.0783080, 1.016…
## $ teen_lfp <dbl> 0.0037529, 0.0047773, 0.0028932, 0.0042875, 0.003…
## $ migration_in <dbl> 0.005639832, 0.016206061, 0.008050009, 0.01630291…
## $ migration_out <dbl> 0.004697256, 0.014235172, 0.011602806, 0.01355415…
## $ foreign_born <dbl> 0.0117837, 0.0230553, 0.0070780, 0.0199675, 0.052…
## $ social_capital <dbl> -0.29785830, -0.76735477, -1.27025130, -0.2218846…
## $ frac_religious <dbl> 0.5144033, 0.5438951, 0.6678060, 0.6019530, 0.487…
## $ crime_rate <dbl> 0.0014095, 0.0018436, 0.0008545, 0.0013441, 0.002…
## $ frac_single_mothers <dbl> 0.1898032, 0.1851060, 0.2110027, 0.2056023, 0.220…
## $ frac_divorced <dbl> 0.1101729, 0.1159584, 0.1134514, 0.1142778, 0.092…
## $ frac_married <dbl> 0.6008929, 0.6133591, 0.5902804, 0.5751500, 0.585…
## $ inc_growth <dbl> -0.0022776, -0.0021528, -0.0037121, -0.0019974, -…
## $ `Urban or Rural` <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…
There are 741 observations/rows in the chetty data frame
There are 48 variables/columns in the chetty data frame
Each row/observation in the chetty data frame represents a CZ (commuting zone) in the United States
Today we will examine the following variables:
| Variable | Type | Description |
|---|---|---|
social_capital |
continuous | Social capital score |
abs_mobility |
continuous | Absolute mobility score |
rel_mobility |
continuous | Relative mobility score |
urban |
categorical | Identifies whether the CZ is urban or
rural |
gini |
continuous | Gini coefficient |
One of the conclusions of the Chetty study in 2014 was that social capital, defined as “the strength of an individual’s social network and community,” may influence upward economic mobility. We’re going to explore that assertion in this exercise using Chetty’s own data and a mixture of the variables seen above.
Question: How strong is the relationship between social capital and economic mobility?
Using geom_point(), visualize the relationship between
social capital and absolute mobility.
ggplot(data = chetty, mapping = aes(
x = social_capital,
y = abs_mobility
)) +
geom_point() +
labs(
x = "Social capital score",
y = "Absolute mobility score"
)
Social capital vs. Absolute mobility
Do you see a correlation? Is it what you expected from the Chetty study executive summary?
Yes, I see a relatively positive correlation.
Add an aesthetic to your graph to represent whether the CZ is urban or not.
ggplot(data = chetty, mapping = aes(
x = social_capital,
y = abs_mobility,
color = urban
)) +
geom_point() +
labs(
x = "Social capital score",
y = "Absolute mobility score",
color = "Urban v.s. Rural"
)
Social capital vs. Absolute mobility, segmented by urban and rural CZ’s
Do you see any indication that Simpson’s Paradox could be an issue?
Yes, since the correlation for urban is not that obvious. Instead, the rural correlation is very strong.
Simpson’s Paradox: when a trend appears in separate groups but reverses or disappears when the data is aggregated, leading to misleading conclusions about wages, mobility, policy impacts, or inequality.
Use faceting to separate urban and non-urban CZ’s into two separate plots.
ggplot(data = chetty, mapping = aes(
x = social_capital,
y = abs_mobility,
color = urban
)) +
geom_point() +
labs(
x = "Social capital score",
y = "Absolute mobility score",
color = "Urban v.s. Rural"
) +
facet_grid(~urban)
Social capital vs. Absolute mobility, faceted by urban and rural CZ’s
Does this change your answer to the previous question? Do you see any indication that Simpson’s Paradox could be an issue?
It seems like it’s no longer an issue.
Use geom_smooth() to add a line of best fit to your
faceted plot, and experiment with these different options for your
line:
method: determines what kind of regression model to use
in calculating the slope of the line of best fit
lm: linear model (straight line)loess: locally-estimated smoothing (smooth line)se: show or hide a shaded area representing the
calculated standard error:
TRUE: show the standard errorFALSE: hide the standard errorggplot(data = chetty, mapping = aes(
x = social_capital,
y = abs_mobility,
color = urban
)) +
geom_point() +
labs(
x = "Social capital score",
y = "Absolute mobility score",
color = "Urban v.s. Rural"
) +
facet_grid(~urban) +
geom_smooth(method = "lm",
se = FALSE)
Social capital vs. Absolute mobility, faceted by urban and rural CZ’s with line of best fit
What differences do you observe between the urban and rural CZ’s? Do you see any indication that Simpson’s Paradox could be an issue?
Yes, the data for urban is more condense and less range.
Let’s examine relative mobility this time. Borrowing
your code from above, create a scatterplot that replaces the
abs_mobility variable with rel_mobility, and
add a line of best fit. Also, use a color aesthetic on
`geom_point() only to draw a single line for the entire
dataset.
ggplot(data = chetty, mapping = aes(
x = social_capital,
y = rel_mobility,
)) +
geom_point(aes(color = urban)) + # Remeber!! to add aes() here
geom_smooth(method = "lm") +
labs(
x = "Social capital score",
y = "Relative mobility score",
color = "Urban v.s. Rural"
)
Social capital vs. Relative mobility, segmented by urban/rural
Now take the same chart from Step 5 above and facet it by the
urban variable.
ggplot(data = chetty, mapping = aes(
x = social_capital,
y = rel_mobility,
)) +
geom_point(aes(color = urban)) + # Remeber!! to add aes() here
geom_smooth(method = "lm",
se = FALSE) +
facet_wrap(~urban) +
labs(
x = "Social capital score",
y = "Relative mobility score",
color = "Urban v.s. Rural"
)
Social capital vs. Relative mobility, faceted by urban/rural
What differences do you observe when we use relative mobility instead of absolute?
It is relatively negative correlation. Notice that relative mobility reverses the relationship, and urban CZs in particular have probably the weakest correlation.
Question: How strong is the relationship between inequality and economic mobility?
The Gini coefficient is a popular measure of income and wealth inequality. In a perfectly unequal society where one person holds all the wealth and everyone else has none of it, the Gini coefficient would be 1. In a perfectly equal society where all members hold an equal share of the wealth, the Gini coefficient would be 0.
With this in mind, Chetty calculated the Gini coefficient for all of the CZs in the study, and this data is included in our dataset for today’s lesson.
In this last phase, let’s examine the relationship between inequality
and mobility by plotting our gini variable on the X axis
and our abs_mobility variable on the Y axis.
ggplot(data = chetty, mapping = aes(
x = gini,
y = abs_mobility,
)) +
geom_point(aes(color = urban)) + # Remeber!! to add aes() here
geom_smooth(method = "lm",
se = FALSE) +
facet_wrap(~urban) +
labs(
x = "Gini coefficient",
y = "Absolute mobility score"
)
What do you observe when we analyze the relationship between the Gini coefficient and the absolute mobility score?
With higher absolute mobility, the gini coefficient is lower, which means the economic gap between the rich and the poor is less. The lower the Gini coefficient, the higher the mobility score. In other words, the more income equality we see, the greater the economic mobility. We can perhaps conclude, then, that income inequality acts as an impediment to upward mobility in certain CZ’s.
Great work! When you’re finished, follow these steps: