Instructions

Welcome to CSC 110 Lab 1! In this lab, you will visualize data on economic mobility in the U.S. from Opportunity Insights using the ggplot2 package.

This is a pair programming assignment: one student acts as the driver, the other as the navigator. The driver writes code while the navigator directs them on what to write based on the assignment instructions, and provides feedback and help as needed. You should rotate these roles roughly every 15 minutes so each person gets plenty of time in the “driver’s seat.”

Lab assignments can be taken open-book using the R4DS textbook and open-note using the notes you’ve taken and the materials we have covered in class. You should not rely on any other outside resources. Usage of generative AI tools like ChatGPT, Claude, or DeepSeek is not allowed on this assignment.

Setup

First, run the code block below to load the necessary packages.

library(tidyverse)
library(readxl)

Next, load the Chetty data from Excel into a data frame named “chetty.” Then filter chetty down to the 100 largest commuting zones (CZs), and store that output in a new data frame named “chetty_top100.” You will work with chetty_top100 throughout this lab.

This portion is completed ahead of time for you, but you should still compare what I wrote above with the code below and see if you can piece together how it functions. Note the pipe operator (written %>% in the code below; the native R equivalent is |>) that we will learn about soon, which can be used to string together multiple sequential data operations.

This part does not count for any points on the assignment, but it’s very important to get it correct because it will affect everything you do downstream, so be careful!

chetty <- read_excel("data/Lab1_Chetty_2014.xlsx", skip=1)

# filter for top 100 commuting zones
chetty_top100 <- chetty %>%
  filter(top_100==1)

# remove the larger data set with all commuting zones from the work space
rm(chetty)
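For readers new to the pipe, here is a minimal sketch showing that a piped call and an ordinary nested call give the same result. It uses a small made-up data frame (toy_cz, with hypothetical values) rather than the Chetty data; only the column name top_100 is borrowed from the code above.

```r
library(dplyr)

# Hypothetical toy data standing in for the real Chetty data frame
toy_cz <- tibble(
  cz_name = c("A", "B", "C", "D"),
  top_100 = c(1, 0, 1, 0)
)

# magrittr pipe: "take toy_cz, then filter it"
piped <- toy_cz %>% filter(top_100 == 1)

# Native pipe (requires R 4.1 or later) does the same thing
native <- toy_cz |> filter(top_100 == 1)

# Equivalent call written without any pipe
nested <- filter(toy_cz, top_100 == 1)

# Both comparisons check that the three forms agree
identical(piped, nested)
identical(native, nested)
```

The pipe passes the object on its left as the first argument of the function on its right, which is why all three forms keep the same rows.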

Part 1: Economic Mobility & Household Income Across Commuting Zones

In Part 1, you will explore the relationship between economic mobility and a CZ characteristic not mentioned in the Executive Summary: household income per capita (hhi_percap). The mobility measure that you will use in this analysis captures the probability that a child born to a parent in quintile 1 moves to income quintile 5 as an adult (prob_q1q5).

Exercise 1

Exercise 1.1

Make a scatterplot with household income per capita on the x-axis, and mobility on the y-axis.

ggplot(data = chetty_top100, mapping = aes(
  x = hhi_percap,
  y = prob_q1q5)
) +
  geom_point() +
  labs(x = "Household income per capita",
       y = "Economic mobility")
Plot 1.1

Exercise 1.2

Describe the chart: what is the range of the x-axis? What is the range of the y-axis? Do you think there is a relationship between these two variables? Why or why not?

The x-axis ranges from about 20000 to 60000, and the y-axis ranges from about 0.025 to 0.140. I don’t think there is an obvious relationship between these two variables, because the data points are widely dispersed and show no apparent trend.
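A reading taken from the plot can be cross-checked numerically. The sketch below is illustrative only: the data frame toy holds hypothetical values, not the Chetty data, and reuses the column names hhi_percap and prob_q1q5 purely for readability.

```r
# Hypothetical values, for illustration only (not the Chetty data)
toy <- data.frame(
  hhi_percap = c(25000, 32000, 41000, 38000, 52000),
  prob_q1q5  = c(0.06, 0.04, 0.11, 0.07, 0.09)
)

# Smallest and largest value of each variable
x_range <- range(toy$hhi_percap)
y_range <- range(toy$prob_q1q5)

# Pearson correlation: values near 0 suggest little linear relationship
r <- cor(toy$hhi_percap, toy$prob_q1q5)

x_range
y_range
r
```

Note that the axis limits ggplot2 draws are usually slightly wider than the data range, because the default scales add a little padding.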

Exercise 2

Exercise 2.1

Add color to your scatterplot to represent the geographic region (region).

ggplot(data = chetty_top100, mapping = aes(
  x     = hhi_percap,
  y     = prob_q1q5,
  color = region)
) +
  geom_point() +
  labs(x = "Household income per capita",
       y = "Economic mobility",
       color = "Geographic region")
Plot 2.1

Exercise 2.2

What patterns does this reveal? Describe the distribution of the data, by region.

There is a weak positive correlation between household income per capita and mobility in the West and the Northeast, but no obvious correlation in the Midwest and the South. In general, mobility is high in the West and relatively low in the South.

Exercise 2.3

Which region(s) appear to have the most CZs with mobility > 10%?

The West region appears to have the most CZs with mobility > 10%.

Exercise 2.4

Which region(s) appear to have the most CZs with mobility < 5%?

The South region appears to have the most CZs with mobility < 5%.
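Visual counts like these can be confirmed with a quick tally. The following is a sketch under stated assumptions: toy is a made-up data frame with hypothetical values, and the thresholds 0.10 and 0.05 correspond to the 10% and 5% cutoffs in the questions.

```r
library(dplyr)

# Hypothetical values, for illustration only (not the Chetty data)
toy <- tibble(
  region    = c("West", "West", "South", "South", "South", "Midwest"),
  prob_q1q5 = c(0.12, 0.11, 0.04, 0.03, 0.06, 0.08)
)

# Number of CZs per region with mobility above 10%
high_counts <- toy %>%
  filter(prob_q1q5 > 0.10) %>%
  count(region, sort = TRUE)

# Number of CZs per region with mobility below 5%
low_counts <- toy %>%
  filter(prob_q1q5 < 0.05) %>%
  count(region, sort = TRUE)

high_counts
low_counts
```

With sort = TRUE, the region with the most qualifying CZs appears in the first row of each result.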

Exercise 3

Exercise 3.1

Represent geographic region on your scatterplot using shape instead of color.

ggplot(data = chetty_top100, mapping = aes(
  x     = hhi_percap,
  y     = prob_q1q5,
  shape = region)
) +
  geom_point() +
  labs(x = "Household income per capita",
       y = "Economic mobility",
       shape = "Geographic region")
Plot 3.1

Exercise 3.2

Compare the use of color vs shape to represent the region: what are the benefits and drawbacks of each?

Using color to represent the region makes each region’s data easy to distinguish. Using shape makes it easier to see the overall correlation between the variables across all 4 regions combined. However, both shape and color make overlapping points difficult to interpret. On balance, we think color works better for comparing the 4 regions.

Exercise 4

Exercise 4.1

Going back to the chart you made in 2.1, which uses color to represent the region, add another aesthetic to represent the size of the population using the pop_2000 variable (population from the 2000 Census).

ggplot(data = chetty_top100, mapping = aes(
  x     = hhi_percap,
  y     = prob_q1q5,
  color = region,
  size  = pop_2000)
) +
  geom_point() +
  labs(x = "Household income per capita",
       y = "Economic mobility",
       color = "Geographic region",
       size  = "Size of the population")
Plot 4.1

Exercise 4.2

Describe any relationships between size and region.

We notice that in the West and Northeast regions, many of the most populous CZs have high mobility (>7.5%) and high household income per capita (>35000). In the Midwest region, many of the most populous CZs have low mobility (<7.5%) but high household income per capita (>40000). In the South region, there is an outlier group of CZs with very high mobility (>10%) and high household income per capita (>50000).

Exercise 5

Exercise 5.1

Split your plot into facets to display scatterplots of your data by census region.

ggplot(data = chetty_top100, mapping = aes(
  x     = hhi_percap,
  y     = prob_q1q5,
  color = region,
  size  = pop_2000)
) +
  geom_point() +
  labs(x = "Household income per capita",
       y = "Economic mobility",
       color = "Geographic region",
       size  = "Size of the population"
       ) +
  facet_wrap(~region)
Plot 5.1

Exercise 5.2

Compare this split plot to the combined plot from #3. Are there aspects of the relationship between hhi_percap and mobility that are easier to detect in the faceted plot than in the combined plot?

Yes. The faceted plot makes it easier to detect the correlation between hhi_percap and mobility within each region, because each region’s data are shown in a separate panel and are easier to distinguish.

Exercise 5.3

Which regions appear to have a relatively stronger relationship between hhi_percap and mobility?

The Northeast and West regions appear to have a relatively stronger relationship between hhi_percap and mobility.
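One way to back up a visual judgment about which regions show a stronger relationship is to compute the correlation separately for each region. This is only a sketch: toy contains hypothetical values (not the Chetty data) for two regions, and cor_by_region is a name introduced here for illustration.

```r
library(dplyr)

# Hypothetical values, for illustration only (not the Chetty data)
toy <- tibble(
  region     = rep(c("West", "South"), each = 4),
  hhi_percap = c(30000, 35000, 40000, 45000, 30000, 35000, 40000, 45000),
  prob_q1q5  = c(0.08, 0.09, 0.11, 0.12, 0.05, 0.04, 0.06, 0.04)
)

# Pearson correlation between income and mobility within each region
cor_by_region <- toy %>%
  group_by(region) %>%
  summarise(r = cor(hhi_percap, prob_q1q5), .groups = "drop")

cor_by_region
```

A value of r closer to 1 or -1 indicates a stronger linear relationship within that region, which mirrors what the faceted panels show visually.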

Exercise 6

Exercise 6.1

Add information on the census division to your chart using the color aesthetic.

ggplot(data = chetty_top100, mapping = aes(
  x     = hhi_percap,
  y     = prob_q1q5,
  color = division,
  size  = pop_2000)
) +
  geom_point() +
  labs(x = "Household income per capita",
       y = "Economic mobility",
       color = "Geographic division",
       size  = "Size of the population"
       ) +
  facet_wrap(~region)
Plot 6.1

Exercise 6.2

What does this reveal about divisional differences in the West?

The plot reveals that, within the West region, the Pacific division contains large-population CZs with higher mobility and higher household income per capita than those in the Mountain division.

Exercise 7

Exercise 7.1

Create a plot of the relationship between hhi_percap and mobility with two layers: (1) A scatterplot colored by region, and (2) a smooth fit chart with no standard error also colored by region.

ggplot(data = chetty_top100, mapping = aes(
  x     = hhi_percap,
  y     = prob_q1q5,
  color = region)
) +
  geom_point() +
  geom_smooth(se = FALSE) +
  labs(x = "Household income per capita",
       y = "Economic mobility",
       color = "Geographic region"
       ) 
Plot 7.1

Exercise 7.2

What patterns does this illustrate in the data? Be specific!

This plot illustrates that in the West and Northeast regions, household income per capita and mobility are positively correlated; both regions also have high household income per capita (generally >30000). In the South region, the correlation is negative where household income per capita is below 30000 and weakly positive above that. In the Midwest region, there is no obvious correlation between household income per capita and mobility.

Part 2: Working with Bar Charts

Exercise 8

Exercise 8.1

Create a bar chart that displays the count (not proportion) of CZs by census region and fill each bar using information on census division.

ggplot(chetty_top100
) +
  geom_bar(mapping = aes(
    x    = region,
    fill = division
    )) +
    labs(x = "Geographic region",
         fill = "Geographic division") 
Plot 8.1

Exercise 8.2

What do you learn from this chart?

From this chart, we learned that the South region has the most CZs of the four regions, while the other 3 regions have roughly equal numbers of CZs. Within the South region, the South Atlantic division has an especially high number of CZs.

Exercise 8.3

Recreate the same bar chart, but this time use position “dodge”.

ggplot(chetty_top100
) +
  geom_bar(mapping = aes(
    x    = region,
    fill = division
    ),
    position = "dodge") +
    labs(x = "Geographic region",
         fill = "Geographic division") 
Plot 8.3.1 - dodge position

Exercise 8.4

Recreate the same bar chart, but this time use position “fill”.

ggplot(chetty_top100
) +
  geom_bar(mapping = aes(
    x    = region,
    fill = division
    ),
    position = "fill") +
    labs(x = "Geographic region",
         fill = "Geographic division") 
Plot 8.3.2 - fill position

Exercise 8.5

What are the relative advantages and disadvantages of each of the three bar charts when trying to visualize this information?

  1. The default (stacked) chart, without “fill” or “dodge”: The advantage is that we can compare the total CZ count of each of the four regions and tell which division has the most CZs. The disadvantage is that we cannot easily compare the CZ counts of the individual divisions.
  2. The “dodge” method: The advantage is that we can clearly compare the CZ count of each division across regions; the disadvantage is that we cannot easily compare the total CZ count across regions.
  3. The “fill” method: The advantage is that we can identify each division’s proportion of the CZ count within its region; the disadvantage is that we cannot tell the actual counts, so we cannot compare CZ counts across regions.
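The three positions differ only in how the same underlying numbers are displayed. As a rough sketch, the table below computes those numbers directly: the counts that the stacked and dodged bars draw, and the within-region proportions that position “fill” draws. The data frame toy is hypothetical and is not the Chetty data.

```r
library(dplyr)

# Hypothetical values, for illustration only (not the Chetty data)
toy <- tibble(
  region   = c("South", "South", "South", "West", "West"),
  division = c("South Atlantic", "South Atlantic", "West South Central",
               "Pacific", "Mountain")
)

division_counts <- toy %>%
  count(region, division) %>%      # n: bar heights for stacked and dodged bars
  group_by(region) %>%
  mutate(prop = n / sum(n)) %>%    # prop: segment heights under position = "fill"
  ungroup()

division_counts
```

Because prop always sums to 1 within a region, the “fill” chart hides the regional totals, which is the trade-off described in item 3.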

Part 3: Answering your own research question

In this final part, you will use the Chetty dataset to devise your own research question and analyze it through the use of scatterplots. You may reuse the prob_q1q5 variable for economic mobility, but your second axis variable should be a CZ characteristic you have not yet explored (something other than hhi_percap, social_capital and gini).

Guidelines:

Exercise 9

Exercise 9.1

What Chetty variable(s) did you choose to research? Explain the rationale for your choices and what question(s) you are trying to answer.

We chose the high school dropout rate (hs_dropout) and economic mobility (prob_q1q5) as our variables. We want to explore whether there is a negative correlation between the high school dropout rate and economic mobility, that is, how a CZ’s education level relates to its economic mobility.

Exercise 9.2

Make a scatterplot of your variables with appropriate aesthetics and labeling.

ggplot(data = chetty_top100, mapping = aes(
  x = hs_dropout,
  y = prob_q1q5)
) +
  geom_point() +
  # straight-line (linear model) fit, drawn without the standard-error band
  geom_smooth(
    method = "lm",
    se = FALSE
  ) +
  labs(x = "High school dropout rate",
       y = "Economic mobility",
       title = "Relationship between high school dropout rate\nand economic mobility")
Plot 9.2 - High school dropout rate vs. economic mobility

Exercise 9.3

Describe the chart: what is the range of the x-axis? What is the range of the y-axis? Do you think there is a relationship between these two variables? Why or why not?

The x-axis ranges from about -0.02 to 0.06, and the y-axis ranges from about 0.025 to 0.125. There is a weak negative relationship between these two variables: the slope of the trend line is negative, so as the high school dropout rate increases (indicating a lower education level), economic mobility decreases. This correlation makes sense to us because education costs a large amount of money.
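The sign of the trend line can also be read from a fitted linear model, which is the same kind of straight-line fit that geom_smooth(method = "lm") draws. The sketch below uses hypothetical values in a data frame named toy, not the Chetty data, so the slope it reports says nothing about the real result.

```r
# Hypothetical values, for illustration only (not the Chetty data)
toy <- data.frame(
  hs_dropout = c(-0.01, 0.00, 0.01, 0.02, 0.04),
  prob_q1q5  = c(0.10, 0.09, 0.08, 0.08, 0.05)
)

# Least-squares line: mobility as a function of dropout rate
fit <- lm(prob_q1q5 ~ hs_dropout, data = toy)

# A negative slope means mobility falls as the dropout rate rises
slope <- coef(fit)[["hs_dropout"]]
slope
```

Calling summary(fit) would additionally report R-squared, one way to judge whether a relationship is weak or strong.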

Honor Code Pledge

On my honor I have neither given nor received unauthorized information regarding this work, I have followed and will continue to observe all regulations regarding it, and I am unaware of any violation of the Honor Code by others.

Charlotte Li Tania Umurerwa Kalisa

Submission checklist

Before submitting your assignment in Gradescope, review this checklist carefully to ensure you have done everything appropriately.

Point penalties will be assessed if your assignment does not meet these guidelines.

Grading

Data viz questions

  • 1 point = Code and output are both clearly and completely visible
  • 1 point = Correct geom is used
  • 1 point = Correct aesthetics are applied
  • 1 point = Descriptive labels are applied

Short answer questions

  • 1 point = Answers the question thoughtfully
  • 1 point = Provides clear expository writing