You are an analyst at the Environmental Protection Agency, a government agency dedicated to protecting humans and the environment. As part of EPA’s ongoing efforts to ensure environmental justice, you have been tasked with assessing the extent to which people in the US are differentially exposed to fine particulate matter (PM2.5, a very nasty pollutant).
To this end you have assembled a county-level dataset containing
variables describing exposure to PM2.5 concentrations (from county air
pollution monitors) and income (from Census records). A county is said
to have bad air quality if the AQI (“Air Quality Index”, a composite
measure of different kinds of air pollution) exceeds a threshold. There
are separate thresholds for each of the pollutants contained in the
AQI.
From the course website, download
and open the Week 3 dataset, week3_data_extract.csv.[1] The two
variables we’ll be focusing on understanding today are
medianHouseholdIncome and
pm2.5_pct_average.
medianHouseholdIncome describes the county-level
median household annual income (i.e. calculate the annual incomes of all
the households in a county and select the median value) in 2021.
pm2.5_pct_average describes the proportion of days
in 2011-2021 with bad air quality which were due to PM2.5 exceeding the
threshold (i.e. the probability that a given bad air quality day in 2011-2021
was due to PM2.5).
You may find the Week 3 notes Rmd file helpful in writing code for
some of the questions below.
Make a histogram and overlay a Normal density function for
medianHouseholdIncome.
Calculate the empirical and theoretical probabilities of having
income in the upper 10th percentile vs. the lower 10th percentile, using
pnorm for the theoretical probabilities.
How closely do they match?
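One possible way to set up this comparison is sketched below. It runs on simulated, right-skewed values rather than the real dataset, so the numbers it prints say nothing about the actual answer. The vector name `income`, the simulation parameters, and the reading of "upper/lower 10th percentile" as the empirical 90% and 10% quantile cutoffs are all assumptions made for illustration. For the real data you would presumably replace `income` with the `medianHouseholdIncome` column read in with `read.csv`.

```r
# Simulated stand-in for medianHouseholdIncome (NOT the real data).
set.seed(1)
income <- rlnorm(3000, meanlog = log(55000), sdlog = 0.25)

# Parameters of the Normal distribution fitted by moments.
mu <- mean(income)
sigma <- sd(income)

# Histogram on the density scale so the Normal curve is comparable.
hist(income, breaks = 40, freq = FALSE,
     main = "Simulated income", xlab = "Income (USD)")
curve(dnorm(x, mean = mu, sd = sigma), add = TRUE, lwd = 2)

# Empirical cutoffs for the lower and upper 10% of the sample.
q10 <- quantile(income, 0.10)
q90 <- quantile(income, 0.90)

# Empirical probabilities: share of observations beyond each cutoff.
emp_lower <- mean(income <= q10)
emp_upper <- mean(income >= q90)

# Theoretical probabilities under the fitted Normal, at the same cutoffs.
theo_lower <- unname(pnorm(q10, mean = mu, sd = sigma))
theo_upper <- unname(pnorm(q90, mean = mu, sd = sigma, lower.tail = FALSE))

print(round(c(emp_lower = emp_lower, theo_lower = theo_lower,
              emp_upper = emp_upper, theo_upper = theo_upper), 3))
```

By construction the empirical tail shares are about 0.10 each, so any gap between them and the `pnorm` values reflects how far the data depart from a Normal shape.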
Make a histogram and overlay a Normal density function for
pm2.5_pct_average.
Calculate the empirical and theoretical probabilities of having
pm2.5_pct_average in the upper 10th percentile vs. the lower 10th
percentile, using pnorm for the theoretical probabilities.
How closely do they match?
Make a scatterplot with medianHouseholdIncome on the
x axis and pm2.5_pct_average on the y axis.
What is the relationship between the two?
What is one explanation for this relationship?
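A scatterplot like the one described above could be drawn roughly as follows. This is a minimal sketch on simulated data: the data frame name `county`, the sample size, and especially the slope linking the two simulated variables are arbitrary choices for illustration and are not meant to suggest what the real relationship looks like. Only the two column names come from the dataset description.

```r
# Simulated county-level data (NOT the real dataset).
set.seed(2)
n <- 500
county <- data.frame(medianHouseholdIncome = rnorm(n, mean = 60000, sd = 12000))

# Arbitrary illustrative relationship plus noise, clipped to [0, 1]
# because the variable is described as a proportion.
raw <- 0.3 + 2e-6 * (county$medianHouseholdIncome - 60000) + rnorm(n, sd = 0.15)
county$pm2.5_pct_average <- pmin(pmax(raw, 0), 1)

# Scatterplot with income on the x axis and the PM2.5 share on the y axis.
plot(pm2.5_pct_average ~ medianHouseholdIncome, data = county,
     xlab = "Median household income (USD)",
     ylab = "Share of bad air days due to PM2.5",
     pch = 16, col = "grey40")

# A least-squares line and the correlation summarize direction and strength.
fit <- lm(pm2.5_pct_average ~ medianHouseholdIncome, data = county)
abline(fit, lwd = 2)
r <- cor(county$medianHouseholdIncome, county$pm2.5_pct_average)
print(round(r, 3))
```

The fitted line and correlation coefficient are optional additions; they give a quick numerical summary to set beside the visual impression from the plot.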
Calculate the empirical probability that a household in the upper 10th income percentile nationally is in a county with above-national-median PM2.5 levels. Compare this to the empirical probability that a household in the lower 10th income percentile nationally is in a county with above-national-median PM2.5 levels.
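The conditional probabilities in this question can be estimated as the share of "high PM2.5" counties within each income group. The sketch below again uses simulated data, and it makes two simplifying assumptions that you may want to revisit: it treats each county as one observation (it does not weight by the number of households), and it uses `pm2.5_pct_average` as the PM2.5 measure. The data frame name `county` and the logical vectors are hypothetical names chosen for the example.

```r
# Simulated county-level data (NOT the real dataset); the two columns
# are drawn independently here, purely to make the code runnable.
set.seed(3)
n <- 500
county <- data.frame(
  medianHouseholdIncome = rnorm(n, mean = 60000, sd = 12000),
  pm2.5_pct_average = runif(n)
)

# Income groups: counties at or beyond the national 90% and 10% cutoffs.
top <- county$medianHouseholdIncome >= quantile(county$medianHouseholdIncome, 0.90)
bottom <- county$medianHouseholdIncome <= quantile(county$medianHouseholdIncome, 0.10)

# Indicator for above-national-median PM2.5.
high_pm <- county$pm2.5_pct_average > median(county$pm2.5_pct_average)

# Empirical conditional probabilities P(high PM2.5 | income group).
p_high_given_top <- mean(high_pm[top])
p_high_given_bottom <- mean(high_pm[bottom])

print(round(c(top10 = p_high_given_top, bottom10 = p_high_given_bottom), 3))
```

With real data, missing values would need handling first (for example, `na.rm = TRUE` in `quantile` and `median`, or dropping incomplete rows).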
[1] This dataset comes from an IS conducted by Chujun (Christina) Chen in J-term 2022. She assembled the dataset this extract comes from to study the relationship between socioeconomic status and race, exposure to environmental hazards, and COVID-19 mortality.