Welcome to CSC 110 Lab 1! In this lab, you will visualize data on economic mobility in the U.S. from Opportunity Insights using the ggplot2 package.
This is a pair programming assignment: one student acts as the driver, the other as the navigator. The driver writes code while the navigator directs them on what to write based on the assignment instructions, and provides feedback and help as needed. You should rotate these roles roughly every 15 minutes so each person gets plenty of time in the “driver’s seat.”
Lab assignments can be taken open-book using the R4DS textbook and open-note using the notes you’ve taken and the materials we have covered in class. You should not rely on any other outside resources. Usage of generative AI tools like ChatGPT, Claude, or DeepSeek is not allowed on this assignment.
chetty_top100 data frame (below)
label and fig-cap options at the top of each code block
First, run the code block below to load the necessary packages.
library(tidyverse)
library(readxl)
Next, load the Chetty data from Excel into a data frame named “chetty.” Then filter chetty to keep only the 100 largest commuting zones (CZs), and store that output in a new data frame named “chetty_top100.” You will work with chetty_top100 throughout this lab.
This portion is completed ahead of time for you, but you should still
compare what I wrote above with the code below and see if you can piece
together how it functions. Note the pipe operator (written %>% in the
code below; the newer |> form behaves the same way here) that we will
learn about soon, which can be used to string together multiple
sequential data operations.
This part does not count for any points on the assignment, but it’s very important to get it correct because it will affect everything you do downstream, so be careful!
chetty <- read_excel("data/Lab1_Chetty_2014.xlsx", skip = 1)
# filter for top 100 commuting zones
chetty_top100 <- chetty %>%
  filter(top_100 == 1)
# remove the larger data set with all commuting zones from the work space
rm(chetty)
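A pipe passes the result on its left as the first argument of the function on its right. The following is a minimal sketch of that idea, assuming a small made-up data frame (`toy`, its `cz_name` column, and all of its values are hypothetical and are not the Chetty data). It compares a nested call with `%>%` and with `|>` (the latter needs R 4.1 or later):

```r
library(dplyr)

# Hypothetical stand-in for the Chetty data; all values are invented
toy <- tibble(
  cz_name = c("A", "B", "C", "D"),
  top_100 = c(1, 0, 1, 0)
)

# Three ways to keep only the rows flagged as top 100
nested         <- filter(toy, top_100 == 1)
piped_magrittr <- toy %>% filter(top_100 == 1)
piped_native   <- toy |> filter(top_100 == 1)

# Both comparisons check that the piped versions match the nested call
identical(nested, piped_magrittr)
identical(nested, piped_native)
```

The piped versions read top to bottom in the order the steps happen, which is why they are convenient for chaining several operations.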
In Part 1, you will explore the relationship between economic
mobility and a CZ characteristic not mentioned in the Executive Summary:
household income per capita (hhi_percap). The mobility
measure that you will use in this analysis captures the probability that
a child born to a parent in quintile 1 moves to income quintile 5 as an
adult (prob_q1q5).
Make a scatterplot with household income per capita on the x-axis, and mobility on the y-axis.
ggplot(data = chetty_top100, mapping = aes(
  x = hhi_percap,
  y = prob_q1q5)
) +
  geom_point() +
  labs(x = "Household income per capita",
       y = "Economic mobility")
Plot 1.1
Describe the chart: what is the range of the x-axis? What is the range of the y-axis? Do you think there is a relationship between these two variables? Why or why not?
The x-axis ranges from about 20,000 to 60,000, and the y-axis ranges from about 0.025 to 0.140. I don’t think there is an obvious relationship between these two variables, because the data points are widely dispersed and show no apparent trend.
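Ranges read off a chart are approximate, since ggplot2 pads the axes slightly beyond the data by default. One way to check a visual reading is to compute the ranges and a correlation directly. This is a minimal sketch, assuming a made-up data frame (`toy` and its values are hypothetical); in the lab the same calls would be applied to the columns of chetty_top100:

```r
# Hypothetical values standing in for hhi_percap and prob_q1q5
toy <- data.frame(
  hhi_percap = c(25000, 32000, 41000, 38000, 55000),
  prob_q1q5  = c(0.05, 0.11, 0.04, 0.09, 0.07)
)

# Exact minimum and maximum of each axis variable
x_range <- range(toy$hhi_percap)
y_range <- range(toy$prob_q1q5)
x_range
y_range

# Pearson correlation: values near 0 suggest little linear relationship
r <- cor(toy$hhi_percap, toy$prob_q1q5)
r
```

A correlation coefficient only measures linear association, so it complements the scatterplot rather than replacing it.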
Use color to represent the geographic region (region) in your scatterplot.
ggplot(data = chetty_top100, mapping = aes(
  x = hhi_percap,
  y = prob_q1q5,
  color = region)
) +
  geom_point() +
  labs(x = "Household income per capita",
       y = "Economic mobility",
       color = "Geographic region")
Plot 2.1
What patterns does this reveal? Describe the distribution of the data, by region.
There is a weak positive correlation between household income per capita and mobility in the West and the Northeast, but no obvious correlation in the Midwest and the South. In general, mobility is high in the West and relatively low in the South.
Which region(s) appear to have the most CZs with mobility > 10%?
The West region appears to have the most CZs with mobility > 10%.
Which region(s) appear to have the most CZs with
mobility < 5%?
The South region appears to have the most CZs with mobility < 5%.
Represent geographic region on your scatterplot using shape instead of color.
ggplot(data = chetty_top100, mapping = aes(
  x = hhi_percap,
  y = prob_q1q5,
  shape = region)
) +
  geom_point() +
  labs(x = "Household income per capita",
       y = "Economic mobility",
       shape = "Geographic region")
Plot 3.1
Compare the use of color vs shape to represent the region: what are
the benefits and drawbacks of each?
When we use color to represent the region, we can clearly distinguish each region’s data. When we use shape, we can more easily see the overall correlation between the variables across all four regions combined. However, both shape and color make overlapping points difficult to interpret. Overall, we think color works better for comparing the four regions.
Going back to the chart you made in 2.1, which uses color to
represent the region, add another aesthetic to represent the size of the
population using the pop_2000 variable (population from the
2000 Census).
ggplot(data = chetty_top100, mapping = aes(
  x = hhi_percap,
  y = prob_q1q5,
  color = region,
  size = pop_2000)
) +
  geom_point() +
  labs(x = "Household income per capita",
       y = "Economic mobility",
       color = "Geographic region",
       size = "Size of the population")
Plot 4.1
Describe any relationships between size and region.
We notice that in the West and Northeast regions, large-population CZs tend to have high mobility (> 7.5%) and high household income per capita (> 35,000). In the Midwest region, large-population CZs have low mobility (< 7.5%) but high household income per capita (> 40,000). In the South region, there is a group of outlier CZs with very high mobility (> 10%) and high household income per capita (> 50,000).
Split your plot into facets to display scatterplots of your data by census region.
ggplot(data = chetty_top100, mapping = aes(
  x = hhi_percap,
  y = prob_q1q5,
  color = region,
  size = pop_2000)
) +
  geom_point() +
  labs(x = "Household income per capita",
       y = "Economic mobility",
       color = "Geographic region",
       size = "Size of the population"
  ) +
  facet_wrap(~region)
Plot 5.1
Compare this split plot to the combined plot from #3. Are there
aspects of the relationship between hhi_percap and mobility
that are easier to detect in the faceted plot than in the combined
plot?
Yes, we do think the faceted plot makes it easier for us to detect
correlations between hhi_percap and mobility in each region
because the data are separated and recognized more clearly.
Which regions appear to have a relatively stronger relationship
between hhi_percap and mobility?
The Northeast and West regions appear to have a relatively stronger
relationship between hhi_percap and mobility.
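A visual judgment about which regions show a stronger relationship can be cross-checked with a per-region correlation. The sketch below assumes a made-up data frame (`toy` and all its values are hypothetical, chosen so that one region has a clear trend and the other has none); the column names mirror those in chetty_top100:

```r
library(dplyr)

# Hypothetical data: "West" is built with a clear upward trend,
# "South" with no linear trend at all
toy <- tibble(
  region     = rep(c("West", "South"), each = 4),
  hhi_percap = c(30000, 35000, 40000, 45000, 30000, 35000, 40000, 45000),
  prob_q1q5  = c(0.08, 0.09, 0.11, 0.12, 0.05, 0.07, 0.04, 0.06)
)

# One correlation coefficient (r) and one row count (n) per region
by_region <- toy |>
  group_by(region) |>
  summarise(r = cor(hhi_percap, prob_q1q5), n = n())

by_region
```

With only about 100 CZs split across four regions, each coefficient rests on few points, so it is best read alongside the faceted plot.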
Add information on the census division to your chart using the color aesthetic.
ggplot(data = chetty_top100, mapping = aes(
  x = hhi_percap,
  y = prob_q1q5,
  color = division,
  size = pop_2000)
) +
  geom_point() +
  labs(x = "Household income per capita",
       y = "Economic mobility",
       color = "Geographic division",
       size = "Size of the population"
  ) +
  facet_wrap(~region)
Plot 6.1
What does this reveal about divisional differences in the West?
The plot reveals that, in the West region, the Pacific division has large-population CZs with higher mobility and higher household income per capita than the Mountain division.
Create a plot of the relationship between hhi_percap and
mobility with two layers: (1) A scatterplot colored by region, and (2) a
smooth fit chart with no standard error also colored by region.
ggplot(data = chetty_top100, mapping = aes(
  x = hhi_percap,
  y = prob_q1q5,
  color = region)
) +
  geom_point() +
  geom_smooth(
    se = FALSE
  ) +
  labs(x = "Household income per capita",
       y = "Economic mobility",
       color = "Geographic region"
  )
Plot 7.1
What patterns does this illustrate in the data? Be specific!
This plot shows a positive correlation between household income per capita and mobility in the West and the Northeast; both regions also have high household income per capita (generally > 30,000). In the South, the correlation is negative where household income per capita is below 30,000 and weakly positive above that. In the Midwest, there is no obvious correlation between household income per capita and mobility.
Create a bar chart that displays the count (not proportion) of CZs by census region and fill each bar using information on census division.
ggplot(chetty_top100) +
  geom_bar(mapping = aes(
    x = region,
    fill = division
  )) +
  labs(x = "Geographic region",
       fill = "Geographic division")
Plot 8.1
What do you learn from this chart?
From this chart, we learned that the South has the most CZs of the four regions, while the other three regions have roughly equal numbers of CZs. Within the South, the South Atlantic division has an especially large number of CZs.
Recreate the same bar chart, but this time use position “dodge.”
ggplot(chetty_top100) +
  geom_bar(mapping = aes(
    x = region,
    fill = division
  ),
  position = "dodge") +
  labs(x = "Geographic region",
       fill = "Geographic division")
Plot 8.3.1 - dodge position
Recreate the same bar chart, but this time use position “fill.”
ggplot(chetty_top100) +
  geom_bar(mapping = aes(
    x = region,
    fill = division
  ),
  position = "fill") +
  labs(x = "Geographic region",
       fill = "Geographic division")
Plot 8.3.2 - fill position
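The three bar charts differ mainly in what bar height encodes: the default stacked bars and the dodged bars show raw counts, while the filled bars show each division’s share within its region. A minimal sketch of those two quantities, assuming a small made-up data frame (`toy` and its rows are hypothetical, not the Chetty data):

```r
library(dplyr)

# Hypothetical CZ-level rows; each row stands for one commuting zone
toy <- tibble(
  region   = c("South", "South", "South", "West", "West", "Midwest"),
  division = c("South Atlantic", "South Atlantic", "West South Central",
               "Pacific", "Mountain", "East North Central")
)

shares <- toy |>
  count(region, division) |>       # n: height of a stacked or dodged segment
  group_by(region) |>
  mutate(share = n / sum(n)) |>    # share: height of a filled segment
  ungroup()

shares
```

Because every filled bar sums to 1, the fill chart hides how many CZs each region has, which is the information the count-based charts keep.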
What are the relative advantages and disadvantages of each of the three bar charts when trying to visualize this information?
In this final part, you will use the Chetty dataset to devise your
own research question and analyze it through the use of scatterplots.
You may reuse the prob_q1q5 variable for economic mobility,
but your second axis variable should be a CZ characteristic you have
not yet explored (something other than
hhi_percap, social_capital and
gini).
Guidelines:
Number your figures appropriately as well using the label and fig-cap
evaluation options at the top of each code chunk.
What Chetty variable(s) did you choose to research? Explain the rationale for your choices and what question(s) you are trying to answer.
We chose the high school dropout rate (hs_dropout) and
economic mobility (prob_q1q5) as our variables to research.
We want to explore whether there is a negative correlation between
economic mobility and the high school dropout rate, that is, how
education levels in a CZ relate to its economic mobility.
Make a scatterplot of your variables with appropriate aesthetics and labeling.
ggplot(data = chetty_top100, mapping = aes(
  x = hs_dropout,
  y = prob_q1q5)
) +
  geom_point() +
  # straight-line (linear model) fit, drawn without the standard-error band
  geom_smooth(
    method = "lm",
    se = FALSE
  ) +
  # no color label here: this plot maps nothing to color
  labs(x = "High school dropout rate",
       y = "Economic mobility",
       title = "Relationship between high school dropout rate\nand economic mobility")
Plot 9.2 - High school dropout rate vs. economic mobility
Describe the chart: what is the range of the x-axis? What is the range of the y-axis? Do you think there is a relationship between these two variables? Why or why not?
The x-axis ranges from -0.02 to 0.06, and the y-axis ranges from 0.025 to 0.125. There is a weak negative relationship between these two variables: the slope of the trend line is negative, so CZs with higher high school dropout rates (that is, lower education levels) tend to have lower economic mobility. This correlation makes sense to us because education costs a large amount of money.
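The trend line drawn by geom_smooth(method = "lm") is an ordinary least-squares fit of y on x, so the sign of its slope can be confirmed with lm(). A minimal sketch, assuming a made-up data frame (`toy` and its values are hypothetical and only mimic a downward trend):

```r
# Hypothetical values standing in for hs_dropout and prob_q1q5
toy <- data.frame(
  hs_dropout = c(-0.02, 0.00, 0.01, 0.03, 0.05),
  prob_q1q5  = c(0.11, 0.09, 0.10, 0.06, 0.05)
)

# Least-squares line: prob_q1q5 = intercept + slope * hs_dropout
fit <- lm(prob_q1q5 ~ hs_dropout, data = toy)
coef(fit)

# A negative slope means mobility falls as the dropout rate rises
slope <- unname(coef(fit)["hs_dropout"])
slope < 0
```

Calling summary(fit) additionally reports R-squared, which gives a sense of how weak or strong the linear relationship is.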
On my honor I have neither given nor received unauthorized information regarding this work, I have followed and will continue to observe all regulations regarding it, and I am unaware of any violation of the Honor Code by others.
Charlotte Li Tania Umurerwa Kalisa
Before submitting your assignment in Gradescope, review this checklist carefully to ensure you have done everything appropriately.
Point penalties will be assessed if your assignment does not meet these guidelines.