You are an analyst at the Environmental Protection Agency, a government agency dedicated to protecting humans and the environment. As part of EPA’s ongoing efforts to ensure environmental justice, you have been tasked with assessing the extent to which people in the US are differentially exposed to fine particulate matter (PM2.5, a very nasty pollutant).

To this end, you have assembled a county-level dataset containing variables describing exposure to PM2.5 concentrations (from county air pollution monitors) and income (from Census records). A county is said to have bad air quality if the AQI (“Air Quality Index”, a composite measure of different kinds of air pollution) exceeds a threshold. Each pollutant contained in the AQI has its own threshold.

How big is the difference in PM2.5 exposure by income?

From the course website, download and open the Week 3 dataset, week3_data_extract.csv.¹ The two variables we’ll focus on today are medianHouseholdIncome and pm2.5_pct_average.

You may find the Week 3 notes Rmd file helpful in writing code for some of the questions below.
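As a starting point, here is a minimal sketch of how the extract might be loaded in R. The helper name `load_week3` is hypothetical, and the sketch assumes the file sits in your working directory and that `read.csv` keeps the two column names unchanged. Dropping counties with missing values is also an assumption; check whether that suits your analysis.

```r
# Hypothetical helper (not part of the course materials): read the extract
# and keep only counties with both variables observed.
load_week3 <- function(path = "week3_data_extract.csv") {
  dat <- read.csv(path, stringsAsFactors = FALSE)

  # The two variables this lab focuses on
  needed <- c("medianHouseholdIncome", "pm2.5_pct_average")
  missing_cols <- setdiff(needed, names(dat))
  if (length(missing_cols) > 0) {
    stop("Missing expected columns: ", paste(missing_cols, collapse = ", "))
  }

  # Assumption: drop rows with a missing value in either variable
  dat[complete.cases(dat[, needed]), , drop = FALSE]
}
```

Once the file is downloaded, something like `dat <- load_week3()` followed by `summary(dat$medianHouseholdIncome)` gives a quick first look.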

1. Understanding the income distribution

  1. Make a histogram and overlay a Normal density function for medianHouseholdIncome.

  2. Calculate the empirical and theoretical probabilities of having income in the upper 10th percentile versus the lower 10th percentile. Use pnorm for the theoretical (Normal) probabilities.

  3. How closely do they match?
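For the histogram step, one possible approach is sketched below. The helper `hist_with_normal` is hypothetical, the number of breaks is arbitrary, and the example call uses simulated values as a stand-in for the real column, so the picture it draws says nothing about the actual data. The dashed line is a kernel density estimate of the data; the solid line is a Normal density with the sample mean and standard deviation.

```r
# Hypothetical helper: histogram on the density scale with a Normal curve
# fitted by the sample mean and standard deviation.
hist_with_normal <- function(x, label) {
  x <- x[!is.na(x)]
  mu <- mean(x)
  sigma <- sd(x)

  # freq = FALSE puts the bars on the density scale, so the curves are comparable
  hist(x, breaks = 30, freq = FALSE,
       main = paste("Distribution of", label), xlab = label)

  # Kernel density estimate of the data (dashed)
  lines(density(x), lwd = 2, lty = 2)

  # Normal density with the sample mean and sd (solid)
  curve(dnorm(x, mean = mu, sd = sigma), add = TRUE, lwd = 2)

  invisible(c(mean = mu, sd = sigma))
}

# Simulated stand-in values, NOT the real dataset
set.seed(1)
fake_income <- rlnorm(500, meanlog = log(50000), sdlog = 0.25)
hist_with_normal(fake_income, "medianHouseholdIncome (simulated)")
```

With the real data loaded into a data frame `dat`, the call would be `hist_with_normal(dat$medianHouseholdIncome, "medianHouseholdIncome")`. The same helper works for pm2.5_pct_average.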

2. Understanding PM2.5 concentrations

  1. Make a histogram and overlay a Normal density function for pm2.5_pct_average.

  2. Calculate the empirical and theoretical probabilities of having a PM2.5 concentration in the upper 10th percentile versus the lower 10th percentile. Use pnorm for the theoretical (Normal) probabilities.

  3. How closely do they match?
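The empirical-versus-theoretical comparison can be read in more than one way. The sketch below takes one reading: find the empirical 10th and 90th percentile cutoffs with `quantile`, then ask what probability a Normal distribution with the sample mean and standard deviation assigns to the region beyond each cutoff. The helper `tail_probs` is hypothetical and the example data are simulated, not the real PM2.5 values.

```r
# Hypothetical helper: compare empirical tail shares with the tail
# probabilities implied by a Normal fitted to the sample mean and sd.
tail_probs <- function(x, p = 0.10) {
  x <- x[!is.na(x)]
  mu <- mean(x)
  sigma <- sd(x)

  # Empirical cutoffs for the lower and upper tails
  cutoffs <- quantile(x, probs = c(p, 1 - p), names = FALSE)

  data.frame(
    tail = c("lower", "upper"),
    cutoff = cutoffs,
    # Share of observations at or beyond each cutoff
    empirical = c(mean(x <= cutoffs[1]), mean(x >= cutoffs[2])),
    # Normal probability of falling at or beyond each cutoff
    theoretical = c(
      pnorm(cutoffs[1], mean = mu, sd = sigma),
      pnorm(cutoffs[2], mean = mu, sd = sigma, lower.tail = FALSE)
    )
  )
}

# Simulated stand-in values, NOT the real dataset
set.seed(2)
fake_pm25 <- rgamma(500, shape = 9, rate = 1)
tail_probs(fake_pm25)
```

If the variable were exactly Normal, both theoretical values would sit close to 0.10. A gap between the two columns, or between the two tails, is a hint about skewness. The same helper applies to medianHouseholdIncome.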

3. The relationship between income and PM2.5 concentrations

  1. Make a scatterplot with medianHouseholdIncome on the x axis and pm2.5_pct_average on the y axis.

  2. What is the relationship between the two?

  3. What is one explanation for this relationship?

  4. Calculate the empirical probability that a household in the upper 10th income percentile nationally is in a county with above-national-median PM2.5 levels. Compare this to the empirical probability that a household in the lower 10th income percentile nationally is in a county with above-national-median PM2.5 levels.
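The scatterplot and the conditional probabilities in this section could be sketched as follows. The helper `pm25_exposure_by_income` is hypothetical. It treats each county as one observation, so it does not weight counties by their number of households; that is a simplifying assumption. The example data are simulated, with PM2.5 drawn independently of income, so the output implies nothing about the real relationship.

```r
# Hypothetical helper: among counties in each income tail, the share whose
# PM2.5 level is above the overall median. Counties are unweighted.
pm25_exposure_by_income <- function(income, pm25, p = 0.10) {
  keep <- !is.na(income) & !is.na(pm25)
  income <- income[keep]
  pm25 <- pm25[keep]

  # TRUE for counties above the overall median PM2.5
  high_pm25 <- pm25 > median(pm25)

  # Income cutoffs for the lower and upper tails
  cutoffs <- quantile(income, probs = c(p, 1 - p), names = FALSE)

  c(
    lower_income = mean(high_pm25[income <= cutoffs[1]]),
    upper_income = mean(high_pm25[income >= cutoffs[2]])
  )
}

# Simulated stand-in values, NOT the real dataset
set.seed(3)
fake_income <- rlnorm(500, meanlog = log(50000), sdlog = 0.25)
fake_pm25 <- 8 + rnorm(500)  # drawn independently of income on purpose

plot(fake_income, fake_pm25,
     xlab = "medianHouseholdIncome (simulated)",
     ylab = "pm2.5_pct_average (simulated)")
abline(lm(fake_pm25 ~ fake_income), lty = 2)  # least-squares line as a visual guide

pm25_exposure_by_income(fake_income, fake_pm25)
```

With the real data in a data frame `dat`, the call would be `pm25_exposure_by_income(dat$medianHouseholdIncome, dat$pm2.5_pct_average)`. Comparing the two returned shares answers the last question under the unweighted-county assumption.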


  1. This dataset comes from an IS conducted by Chujun (Christina) Chen in J-term 2022. She assembled the dataset this extract comes from to study the relationship between socioeconomic status and race, exposure to environmental hazards, and COVID-19 mortality.