ECON 0210 Week 3: Probabilities and distributions

Let’s continue looking at pollution concentrations (PM2.5) and income.

You have a county-level dataset containing variables describing exposure to PM2.5 concentrations (from county air pollution monitors) and income (from Census records). A county is said to have bad air quality if the AQI (“Air Quality Index”, a composite measure of different kinds of air pollution) exceeds a threshold. There are separate thresholds for each of the pollutants contained in the AQI.

From the course website, download and open the Week 3 dataset, week3_data_extract.csv. The dataset is a sample from the American Community Survey and the EPA’s archive of pollution measurements. We’ll focus on three variables:

- medianHouseholdIncome describes the county-level median annual household income in 2021 (i.e. rank the annual incomes of all the households in a county and select the middle value).
- pm2.5_pct_average describes the percentage of days with bad air quality in 2011-2021 which were due to PM2.5 exceeding its threshold (i.e. the probability, on a 0-100 scale, that a given bad air quality day in 2011-2021 was due to PM2.5).
- noHealthInsurance_pct describes the percentage of households which reported not having health insurance in 2021.

You may find the Week 3.1 and 3.2 notes Rmd files helpful in writing code for some of the questions below.

1. Create income groups

The median is a useful measure of central tendency: it cuts the sample into two equally-sized pieces. Quartiles are a finer measure: they cut the sample into four equally-sized pieces. Percentiles are finer still: one hundred equally-sized pieces. You could imagine a family of similar measures, each cutting the sample into sets of differently (not necessarily equally) sized pieces.

Quantiles generalize this idea in a simple way. The \(p\) “quantile” cuts the sample into two pieces of \(100p\%\) and \(100(1-p)\%\). A combination of quantiles can represent any segmentation of the sample based on segment size. The median is the \(0.5\) quantile. Quartiles are the set of seq(0.25, 1, by=0.25) quantiles. Percentiles are the seq(0.01, 1, by=0.01) quantiles.
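To make the link between quantiles and their named special cases concrete, here is a minimal sketch using R's quantile() function. The vector x is toy data made up for illustration (it is not from the Week 3 dataset), and the object names are arbitrary.

```r
# Toy data: the integers 1 to 100, so the cut points are easy to read.
x <- 1:100

# The median is the 0.5 quantile.
q_median <- quantile(x, probs = 0.5)

# Quartile cut points: the 0.25, 0.5, 0.75 and 1 quantiles.
q_quartiles <- quantile(x, probs = seq(0.25, 1, by = 0.25))

# Decile cut points: the 0.1, 0.2, ..., 1 quantiles.
q_deciles <- quantile(x, probs = seq(0.1, 1, by = 0.1))

# Share of observations at or below the 0.25 quantile.
share_below_q25 <- mean(x <= q_quartiles[1])

q_quartiles
share_below_q25
```

The last line checks the definition directly: about a quarter of the sample should sit at or below the \(0.25\) quantile.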

The following chunk creates an indicator variable called above_median which is \(1\) if a county has medianHouseholdIncome above the median of medianHouseholdIncome across the counties in the sample and \(0\) otherwise.

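library(tidyverse)
# Assumes week3_data_extract.csv is in your working directory; skip this line if data is already loaded.
data <- read_csv("week3_data_extract.csv")
# The ifelse() function takes three arguments: a condition which can be TRUE/FALSE; an output if TRUE; an output if FALSE. Run ?ifelse to learn more.
data_grpd <- data %>%
  mutate(above_median = ifelse(medianHouseholdIncome > quantile(medianHouseholdIncome, probs=0.5), 1, 0))
data_grpd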

## # A tibble: 1,114 × 7
##    County    State           FIPS pm2.5_pct_average noHealthInsuranc… medianHouseholdI… above_median
##    <chr>     <chr>          <dbl>             <dbl>             <dbl>             <dbl>        <dbl>
##  1 Abbeville South Carolina 45001             0                 11.1              38741            0
##  2 Ada       Idaho          16001            22.2                8.27             66293            1
##  3 Adair     Oklahoma       40001            37.9               27.0              34695            0
##  4 Adams     Colorado        8001            13.3               10.1              71202            1
##  5 Adams     Illinois       17001             0.732              4.74             52993            0
##  6 Adams     Mississippi    28001            15.2               15.1              29936            0
##  7 Adams     Ohio           39001            89.9                7.60             39079            0
##  8 Adams     Pennsylvania   42001            35.2                5.58             67253            1
##  9 Adams     Washington     53001           100                 16.5              48294            0
## 10 Aiken     South Carolina 45003             0                 10.2              51399            0
## # … with 1,104 more rows

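Exercise: Make a variable called deciles which records the income decile of each county. deciles should be \(1\) for counties with medianHouseholdIncome at or below the \(0.1\) quantile, \(2\) for counties above the \(0.1\) quantile and at or below the \(0.2\) quantile, and so on until \(10\) for counties above the \(0.9\) quantile. (You can do this using ifelse to progressively update the same variable in a mutate.)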
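As a hint, the "progressively update the same variable" pattern can be sketched on a smaller problem: three groups (terciles) instead of ten. The tibble toy and its income values are made up for illustration, and the variable name tercile is an assumption; extending the idea to deciles is left as the exercise.

```r
library(dplyr)

# Hypothetical toy data: nine made-up incomes (not from the Week 3 dataset).
toy <- tibble(income = c(31, 35, 42, 48, 53, 59, 64, 70, 88))

toy_grpd <- toy %>%
  mutate(
    # Start every row in the top group, then overwrite from the highest cut point down.
    tercile = 3,
    tercile = ifelse(income <= quantile(income, probs = 2/3), 2, tercile),
    tercile = ifelse(income <= quantile(income, probs = 1/3), 1, tercile)
  )

# Check that the groups are (roughly) equally sized.
count(toy_grpd, tercile)
```

Each ifelse keeps the existing value of tercile when the condition is FALSE, so the order of the updates matters: the lowest cut point has to be applied last.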

2. Plot PM2.5 concentrations by income group

Let’s get a sense of the distribution of PM2.5 concentrations across different income groups.

Exercise: Make dot-and-whiskers plots of PM2.5 concentrations by income group using above_median and deciles.
- Use the mean PM2.5 level by group as the dots and \(1.96\) times the standard deviation by group as the whiskers.
- If you do this with summarise, save the output in a new object rather than overwriting data_grpd. The next question will be easier if you can reuse data_grpd.
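One possible way to build this kind of plot is sketched below on toy data, using summarise for the group statistics and ggplot2's geom_pointrange() for the dots and whiskers. The tibble toy, its columns group and outcome, and the other object names are made up for illustration; other geoms (for example geom_point() with geom_errorbar()) would work as well.

```r
library(dplyr)
library(ggplot2)

# Hypothetical toy data: two groups with made-up outcome values.
toy <- tibble(
  group = c(0, 0, 0, 0, 1, 1, 1, 1),
  outcome = c(10, 20, 30, 40, 5, 10, 15, 20)
)

# One row per group: the mean (dot) and mean +/- 1.96 standard deviations (whiskers).
# The result is saved in a new object so that toy is left untouched.
toy_summary <- toy %>%
  group_by(group) %>%
  summarise(
    mean_outcome = mean(outcome),
    sd_outcome = sd(outcome)
  ) %>%
  mutate(
    lower = mean_outcome - 1.96 * sd_outcome,
    upper = mean_outcome + 1.96 * sd_outcome
  )

# geom_pointrange() draws the dot at y and the whisker from ymin to ymax.
toy_plot <- ggplot(toy_summary,
                   aes(x = factor(group), y = mean_outcome, ymin = lower, ymax = upper)) +
  geom_pointrange() +
  labs(x = "Group", y = "Mean outcome")

toy_plot
```

Wrapping the grouping variable in factor() makes ggplot2 treat it as a set of categories rather than a continuous axis, which is usually what you want for group indicators like above_median or deciles.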

Discuss as a group: How would you describe the distribution of PM2.5 concentrations? Do you see evidence of environmental inequality?

3. Plot insurance coverage by income group

Let’s repeat these steps to describe the relationship between health insurance coverage and income.

Exercise: Using the same income groupings, make dot-and-whiskers plots of health insurance coverage by income group using above_median and deciles.
- Use the mean percentage of households without health insurance by income group as the dots and \(1.96\) times the standard deviation by group as the whiskers.

Discuss as a group: How would you describe the distribution of health insurance by income? How does it compare to the distribution of PM2.5 concentrations by income?

What might be driving these patterns? How might they interact? What data would you need to test your ideas?