Let’s continue looking at pollution concentrations (PM2.5) and income.
You have a county-level dataset containing variables describing exposure to PM2.5 concentrations (from county air pollution monitors) and income (from Census records). A county is said to have bad air quality if the AQI (“Air Quality Index”, a composite measure of different kinds of air pollution) exceeds a threshold. There are separate thresholds for each of the pollutants contained in the AQI.
From the course website, download and open the Week 3 dataset, week3_data_extract.csv. The dataset is a sample from the American Community Survey and the EPA’s archive of pollution measurements. We’ll focus on three variables:
medianHouseholdIncome describes the county-level median household annual income in 2021 (i.e. rank the annual incomes of all the households in a county and select the middle value).
pm2.5_pct_average describes the percentage (on a 0 to 100 scale) of days in 2011-2021 with bad air quality which were due to PM2.5 exceeding the threshold (i.e. the probability, expressed as a percentage, that a given bad air quality day in 2011-2021 was due to PM2.5).
noHealthInsurance_pct describes the percentage (on a 0 to 100 scale) of households which report not having health insurance in 2021.
You may find the Week 3.1 and 3.2 notes Rmd files helpful in writing code for some of the questions below.
The median is a useful measure of central tendency: it cuts the sample into two equally-sized pieces. Quartiles are a finer measure: they cut the sample into four equally-sized pieces. Percentiles are finer still: one hundred equally-sized pieces. You could imagine a family of similar measures, each cutting the sample into sets of differently (not necessarily equally) sized pieces.
Quantiles generalize this idea in a simple way. The \(p\) “quantile” cuts the sample into two pieces of \(100p\%\) and \(100(1-p)\%\). A combination of quantiles can represent any segmentation of the sample based on segment size. The median is the \(0.5\) quantile. Quartiles are the set of seq(0.25, 1, by=0.25) quantiles. Percentiles are the seq(0.01, 1, by=0.01) quantiles.
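As a quick illustration of how quantile() maps onto these definitions, here is a minimal sketch on a toy sample. The vector x and the object names are made up for this example and are not part of the course dataset.

```r
# Toy sample: the integers 1 through 100 (illustrative, not the course data)
x <- 1:100

# The 0.5 quantile is the median
median_x <- quantile(x, probs = 0.5)

# Quartiles: the 0.25, 0.5, 0.75, and 1 quantiles
quartiles_x <- quantile(x, probs = seq(0.25, 1, by = 0.25))

# Deciles: the 0.1, 0.2, ..., 1 quantiles
deciles_x <- quantile(x, probs = seq(0.1, 1, by = 0.1))

# Share of the sample at or below the 0.25 quantile (should be about a quarter)
share_below_q1 <- mean(x <= quartiles_x[1])
share_below_q1
```

Printing quartiles_x or deciles_x shows the cut points labelled by their percentages, which is a handy check that probs was specified as intended.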
The following chunk creates an indicator variable called above_median which is \(1\) if a county has medianHouseholdIncome above the median of medianHouseholdIncome across the counties in the sample and \(0\) otherwise.
# Requires dplyr (for %>% and mutate) and assumes the Week 3 dataset is already loaded as `data`.
# The ifelse() function takes three arguments: a condition which can be TRUE/FALSE; an output if TRUE; an output if FALSE. Run ?ifelse to learn more.
library(dplyr)
data_grpd <- data %>%
  mutate(above_median = ifelse(medianHouseholdIncome > quantile(medianHouseholdIncome, probs = 0.5), 1, 0))
data_grpd
## # A tibble: 1,114 × 7
## County State FIPS pm2.5_pct_average noHealthInsuranc… medianHouseholdI… above_median
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Abbeville South Carolina 45001 0 11.1 38741 0
## 2 Ada Idaho 16001 22.2 8.27 66293 1
## 3 Adair Oklahoma 40001 37.9 27.0 34695 0
## 4 Adams Colorado 8001 13.3 10.1 71202 1
## 5 Adams Illinois 17001 0.732 4.74 52993 0
## 6 Adams Mississippi 28001 15.2 15.1 29936 0
## 7 Adams Ohio 39001 89.9 7.60 39079 0
## 8 Adams Pennsylvania 42001 35.2 5.58 67253 1
## 9 Adams Washington 53001 100 16.5 48294 0
## 10 Aiken South Carolina 45003 0 10.2 51399 0
## # … with 1,104 more rows
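Exercise: Make a variable called deciles that records which tenth of the medianHouseholdIncome distribution each county falls in. deciles should be \(1\) for counties at or below the \(0.1\) quantile, \(2\) for counties above the \(0.1\) quantile and at or below the \(0.2\) quantile, and so on up to \(10\) for counties above the \(0.9\) quantile and at or below the \(1\) quantile. (You can do this using ifelse to progressively update the same variable in a mutate.)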
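If the idea of progressively updating a variable with ifelse is unfamiliar, the following sketch shows the pattern on a smaller problem: splitting a toy variable into thirds rather than tenths. The tibble toy and the column names value and tercile are hypothetical and chosen only for illustration; this is one possible approach, not the only one.

```r
library(dplyr)

# Toy data: ten made-up values (not the course dataset)
toy <- tibble(value = c(3, 8, 1, 9, 5, 7, 2, 10, 4, 6))

toy_grpd <- toy %>%
  mutate(
    # Start every row in the top group, then overwrite from the highest cutoff down
    tercile = 3,
    tercile = ifelse(value <= quantile(value, probs = 2/3), 2, tercile),
    tercile = ifelse(value <= quantile(value, probs = 1/3), 1, tercile)
  )

toy_grpd
```

Each ifelse only overwrites rows at or below its cutoff and leaves the rest at their current value, so working from the highest cutoff down to the lowest leaves every row with the label of the lowest group it belongs to.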
Let’s get a sense of the distribution of PM2.5 concentrations across different income groups.
Exercise: Make dot-and-whiskers plots of PM2.5 concentrations by income group using above_median and deciles.
- Use the mean PM2.5 level by group as the dots and \(1.96\) times the standard deviation by group as the whiskers.
- (If you do this with summarise, then save the output in a new object (don’t overwrite data_grpd). The next question will be easier if you can reuse data_grpd.)
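To show the general shape of a dot-and-whiskers plot built from a summarise, here is a rough sketch using R's built-in mtcars data as a stand-in. The grouping variable cyl, the outcome mpg, and the object names mtcars_summary and whisker_plot are placeholders for this example only; the course data will need its own variable names.

```r
library(dplyr)
library(ggplot2)

# Illustrative only: mtcars stands in for the course data,
# with cyl as the grouping variable and mpg as the outcome.
mtcars_summary <- mtcars %>%
  group_by(cyl) %>%
  summarise(
    mean_mpg = mean(mpg),  # the dot
    sd_mpg = sd(mpg)       # used to build the whiskers
  )

# Dots at the group means, whiskers at mean +/- 1.96 standard deviations
whisker_plot <- ggplot(mtcars_summary, aes(x = factor(cyl), y = mean_mpg)) +
  geom_point() +
  geom_errorbar(
    aes(ymin = mean_mpg - 1.96 * sd_mpg, ymax = mean_mpg + 1.96 * sd_mpg),
    width = 0.2
  ) +
  labs(x = "Group (cyl)", y = "Mean mpg, with +/- 1.96 SD whiskers")

whisker_plot
```

Saving the summary in its own object (here mtcars_summary) keeps the original data frame intact for later questions.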
Discuss as a group: How would you describe the distribution of PM2.5 concentrations? Do you see evidence of environmental inequality?
Let’s repeat these steps to describe the relationship between health insurance coverage and income.
Exercise: Using the same income groupings (above_median and deciles), make dot-and-whiskers plots of health insurance coverage by income group.
- Use the mean proportion of people without health insurance by income group as the dots and \(1.96\) times the standard deviation by group as the whiskers.
Discuss as a group: How would you describe the distribution of health insurance by income? How does it compare to the distribution of PM2.5 concentrations by income?
What might be driving these patterns? How might they interact? What data would you need to test your ideas?