true

Goals for this exercise

Setup

Once again, we start by defining our exercise setup, loading packages and data. We are using the include: true argument to specify that these blocks should remain visible in our final output for informational purposes at the start of the course, but as the semester progresses we will start to hide these blocks in our final output.

Load packages

We load the same packages as we did in our last exercise.

knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
library(readxl)

Read in Chetty data

Our dataset this time comes from the Chetty study. You were introduced to Chetty in the reading for today’s class, via the Executive Summary.

This code creates a new data frame named “chetty” by reading in the data from another Excel file.

Also note the use of skip=1 within our read_excel() function. Can you determine why this argument is necessary?

chetty <- read_excel("data/Lab1_Chetty_2014.xlsx", skip=1)

Get to know the data

Once again, use glimpse() or just invoke the name of the data frame to inspect it:

glimpse(chetty)
## Rows: 741
## Columns: 48
## $ cz                   <dbl> 100, 200, 301, 302, 401, 402, 500, 601, 602, 700,…
## $ cz_name              <chr> "Johnson City", "Morristown", "Middlesborough", "…
## $ state                <chr> "TN", "TN", "TN", "TN", "NC", "VA", "NC", "NC", "…
## $ region               <chr> "South", "South", "South", "South", "South", "Sou…
## $ division             <chr> "East South Central", "East South Central", "East…
## $ pop_2000             <dbl> 576081, 227816, 66708, 727600, 493180, 92753, 105…
## $ top_100              <dbl> 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0…
## $ urban                <chr> "urban", "urban", "rural", "urban", "urban", "rur…
## $ rel_mobility         <dbl> 0.37170, 0.35611, 0.37557, 0.37253, 0.38913, 0.42…
## $ abs_mobility         <dbl> 38.38750, 37.77675, 39.04925, 37.84125, 36.96925,…
## $ prob_q1q5            <dbl> 0.06219881, 0.05365194, 0.07263514, 0.05628121, 0…
## $ frac_black           <dbl> 0.0208408, 0.0197791, 0.0146459, 0.0563648, 0.173…
## $ racial_seg           <dbl> 0.0903837, 0.0931530, 0.0642501, 0.2099943, 0.262…
## $ income_seg           <dbl> 0.0348657, 0.0262809, 0.0240811, 0.0921036, 0.071…
## $ poverty_seg          <dbl> 0.0301532, 0.0278582, 0.0146831, 0.0842674, 0.061…
## $ affluence_seg        <dbl> 0.0382395, 0.0253348, 0.0258143, 0.1019656, 0.080…
## $ frac_commute_under15 <dbl> 0.3252578, 0.2764278, 0.3585358, 0.2685696, 0.291…
## $ hhi_percap           <dbl> 31559.77, 29958.93, 22328.48, 35884.29, 38891.75,…
## $ gini                 <dbl> 0.46804, 0.43459, 0.44095, 0.50832, 0.46553, 0.44…
## $ top1inc_share        <dbl> 13.459000, 10.631000, 10.691000, 15.080000, 11.91…
## $ gini_bottom99        <dbl> 0.33345, 0.32828, 0.33404, 0.35752, 0.34636, 0.33…
## $ frac_p25top75        <dbl> 0.54796, 0.53750, 0.46685, 0.50410, 0.49991, 0.53…
## $ taxrate              <dbl> 0.0203924, 0.0234471, 0.0153799, 0.0188704, 0.017…
## $ gov_exp              <dbl> 1886.148, 2004.337, 1189.820, 2356.851, 1891.450,…
## $ inctax_progressive   <dbl> 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1…
## $ eitc                 <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0…
## $ sch_exp              <dbl> 5.184547, 4.505886, 5.614119, 4.899846, 5.462676,…
## $ tchr_ratio           <dbl> NA, NA, 15.08494, NA, 15.38528, NA, 16.67801, 16.…
## $ test_pctile          <dbl> 2.7283790, -3.4002740, -9.3150620, -6.0318310, -2…
## $ hs_dropout           <dbl> -0.0152656, -0.0235207, -0.0046291, -0.0110711, 0…
## $ colleges_percap      <dbl> 0.0138869, 0.0087790, 0.0449721, 0.0109951, 0.014…
## $ college_tuition      <dbl> 4816.820, 4762.230, 11840.400, 3480.440, 9715.330…
## $ college_gradrate     <dbl> -0.0024312, -0.1011827, 0.1112985, -0.0238261, 0.…
## $ labor_participation  <dbl> 0.5873474, 0.6249742, 0.4789631, 0.6148306, 0.656…
## $ manufacturing        <dbl> 0.2373972, 0.2377556, 0.2335314, 0.1455054, 0.215…
## $ growth_imports       <dbl> 5.2937860, 3.0304790, 2.0625960, 1.0783080, 1.016…
## $ teen_lfp             <dbl> 0.0037529, 0.0047773, 0.0028932, 0.0042875, 0.003…
## $ migration_in         <dbl> 0.005639832, 0.016206061, 0.008050009, 0.01630291…
## $ migration_out        <dbl> 0.004697256, 0.014235172, 0.011602806, 0.01355415…
## $ foreign_born         <dbl> 0.0117837, 0.0230553, 0.0070780, 0.0199675, 0.052…
## $ social_capital       <dbl> -0.29785830, -0.76735477, -1.27025130, -0.2218846…
## $ frac_religious       <dbl> 0.5144033, 0.5438951, 0.6678060, 0.6019530, 0.487…
## $ crime_rate           <dbl> 0.0014095, 0.0018436, 0.0008545, 0.0013441, 0.002…
## $ frac_single_mothers  <dbl> 0.1898032, 0.1851060, 0.2110027, 0.2056023, 0.220…
## $ frac_divorced        <dbl> 0.1101729, 0.1159584, 0.1134514, 0.1142778, 0.092…
## $ frac_married         <dbl> 0.6008929, 0.6133591, 0.5902804, 0.5751500, 0.585…
## $ inc_growth           <dbl> -0.0022776, -0.0021528, -0.0037121, -0.0019974, -…
## $ `Urban or Rural`     <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N…

Questions

  1. How many observations (rows) are there in our chetty data frame?

There are 741 observations/rows in the chetty data frame

  1. How many variables (columns) are there?

There are 48 variables/columns in the chetty data frame

  1. What does each observation represent?

Each row/observation in the chetty data frame represents a CZ (commuting zone) in the United States

Variables of interest

Today we will examine the following variables:

Variable Type Description
social_capital continuous Social capital score
abs_mobility continuous Absolute mobility score
rel_mobility continuous Relative mobility score
urban categorical Identifies whether the CZ is urban or rural
gini continuous Gini coefficient

Exercise, Part 1

One of the conclusions of the Chetty study in 2014 was that social capital, defined as “the strength of an individual’s social network and community,” may influence upward economic mobility. We’re going to explore that assertion in this exercise using Chetty’s own data and a mixture of the variables seen above.

Question: How strong is the relationship between social capital and economic mobility?

Step 1

Using geom_point(), visualize the relationship between social capital and absolute mobility.

ggplot(data = chetty, mapping = aes(
           x = social_capital,
           y = abs_mobility
           )) +
  geom_point() +
  labs(
    x = "Social capital score",
    y = "Absolute mobility score"
  )
Social capital vs. Absolute mobility

Social capital vs. Absolute mobility

Question 1

Do you see a correlation? Is it what you expected from the Chetty study executive summary?

Yes, I see a relatively positive correlation.

Step 2

Add an aesthetic to your graph to represent whether the CZ is urban or not.

ggplot(data = chetty, mapping = aes(
           x = social_capital,
           y = abs_mobility,
           color = urban
           )) +
  geom_point() +
  labs(
    x = "Social capital score",
    y = "Absolute mobility score",
    color = "Urban v.s. Rural"
  )
Social capital vs. Absolute mobility, segmented by urban and rural CZ's

Social capital vs. Absolute mobility, segmented by urban and rural CZ’s

Question 2

Do you see any indication that Simpson’s Paradox could be an issue?

Yes, since the correlation for urban is not that obvious. Instead, the rural correlation is very strong.

Simpson’s Paradox: when a trend appears in separate groups but reverses or disappears when the data is aggregated, leading to misleading conclusions about wages, mobility, policy impacts, or inequality.

Step 3

Use faceting to separate urban and non-urban CZ’s into two separate plots.

ggplot(data = chetty, mapping = aes(
           x = social_capital,
           y = abs_mobility,
           color = urban
           )) +
  geom_point() +
  labs(
    x = "Social capital score",
    y = "Absolute mobility score",
    color = "Urban v.s. Rural"
  ) +
  facet_grid(~urban)
Social capital vs. Absolute mobility, faceted by urban and rural CZ's

Social capital vs. Absolute mobility, faceted by urban and rural CZ’s

Question 3

Does this change your answer to the previous question? Do you see any indication that Simpson’s Paradox could be an issue?

It seems like it’s no longer an issue.

Step 4

Use geom_smooth() to add a line of best fit to your faceted plot, and experiment with these different options for your line:

  • method: determines what kind of regression model to use in calculating the slope of the line of best fit
    • lm: linear model (straight line)
    • loess: locally-estimated smoothing (smooth line)
    • And several others
  • se: show or hide a shaded area representing the calculated standard error:
    • TRUE: show the standard error
    • FALSE: hide the standard error
ggplot(data = chetty, mapping = aes(
           x = social_capital,
           y = abs_mobility,
           color = urban
           )) +
  geom_point() +
  labs(
    x = "Social capital score",
    y = "Absolute mobility score",
    color = "Urban v.s. Rural"
  ) +
  facet_grid(~urban) +
  geom_smooth(method = "lm",
                  se = FALSE)
Social capital vs. Absolute mobility, faceted by urban and rural CZ's with line of best fit

Social capital vs. Absolute mobility, faceted by urban and rural CZ’s with line of best fit

Question 4

What differences do you observe between the urban and rural CZ’s? Do you see any indication that Simpson’s Paradox could be an issue?

Yes, the data for urban is more condense and less range.

Exercise, Part 2

Step 5

Let’s examine relative mobility this time. Borrowing your code from above, create a scatterplot that replaces the abs_mobility variable with rel_mobility, and add a line of best fit. Also, use a color aesthetic on `geom_point() only to draw a single line for the entire dataset.

ggplot(data = chetty, mapping = aes(
           x = social_capital,
           y = rel_mobility,
           )) +
  geom_point(aes(color = urban)) + # Remeber!! to add aes() here
  geom_smooth(method = "lm") +
  labs(
    x = "Social capital score",
    y = "Relative mobility score",
    color = "Urban v.s. Rural"
  )
Social capital vs. Relative mobility, segmented by urban/rural

Social capital vs. Relative mobility, segmented by urban/rural

Step 6

Now take the same chart from Step 5 above and facet it by the urban variable.

ggplot(data = chetty, mapping = aes(
           x = social_capital,
           y = rel_mobility,
           )) +
  geom_point(aes(color = urban)) + # Remeber!! to add aes() here
  geom_smooth(method = "lm",
              se = FALSE) +
  facet_wrap(~urban) +
  labs(
    x = "Social capital score",
    y = "Relative mobility score",
    color = "Urban v.s. Rural"
  )
Social capital vs. Relative mobility, faceted by urban/rural

Social capital vs. Relative mobility, faceted by urban/rural

Question 5

What differences do you observe when we use relative mobility instead of absolute?

It is relatively negative correlation. Notice that relative mobility reverses the relationship, and urban CZs in particular have probably the weakest correlation.

Exercise, Part 3

Question: How strong is the relationship between inequality and economic mobility?

The Gini coefficient is a popular measure of income and wealth inequality. In a perfectly unequal society where one person holds all the wealth and everyone else has none of it, the Gini coefficient would be 1. In a perfectly equal society where all members hold an equal share of the wealth, the Gini coefficient would be 0.

With this in mind, Chetty calculated the Gini coefficient for all of the CZs in the study, and this data is included in our dataset for today’s lesson.

Step 6

In this last phase, let’s examine the relationship between inequality and mobility by plotting our gini variable on the X axis and our abs_mobility variable on the Y axis.

ggplot(data = chetty, mapping = aes(
           x = gini,
           y = abs_mobility,
           )) +
  geom_point(aes(color = urban)) + # Remeber!! to add aes() here
  geom_smooth(method = "lm",
              se = FALSE) +
  facet_wrap(~urban) +
  labs(
    x = "Gini coefficient",
    y = "Absolute mobility score"
  )

Question 6

What do you observe when we analyze the relationship between the Gini coefficient and the absolute mobility score?

With higher absolute mobility, the gini coefficient is lower, which means the economic gap between the rich and the poor is less. The lower the Gini coefficient, the higher the mobility score. In other words, the more income equality we see, the greater the economic mobility. We can perhaps conclude, then, that income inequality acts as an impediment to upward mobility in certain CZ’s.

Wrap Up

Great work! When you’re finished, follow these steps: