The data

Dataset 1: EPA Outdoor AirQuality Data for PM 2.5

The Environmental Protection Agency (EPA) is an independent agency within the US government responsible for protecting human and environmental health. We’ll be working with the EPA’s AirQuality data on PM 2.5- fine inhalable particles (particulate matter) 2.5 micrometers in diameter. PMs are often a byproduct of pollution, and are thought to have negative health outcomes. Click here

According to the EPA website, the AirQuality data “assists a wide range of people, from the concerned citizen who wants to know how many unhealthy air quality days there were in his county last year to air quality analysts in the regulatory, academic, and health research communities who need raw data.”

This is a big dataset to work with, but it contains a ton of interesting information. Focus on the variable Arithmetic mean, which is the average (arithmetic mean) value for the day.

Protip: many of the variables here have spaces in the names, which are a pain to work with. It’s probably a good idea to rename them to not have spaces.

Dataset 2: County Health Rankings & Roadmaps (CHRR)

Accessed via Social Explorer. This is the same data we used in Lab 6! Make sure you have a copy living in your PS_6 folder.

1. Generate a county level dataset for PM 2.5 for 2016. This dataset will have the average PM 2.5 levels for each county. In order to do this, you should note the following:

• Counties may have multiple air monitors. You will want to average these for each county for each month. So you will have a monthly average for each county.

- To do this, you will need to do some data cleaning. First, you'll need to generate a new `month` variable from `Date Local`. Below is one way to do this, using the `lubridate` functions `year`, `month`, and `day` (`lubridate` is one of the packages inside `tidyverse`---it isn't loaded by default, but you should be able to just load it. If you run into issues, try installing it separately with `install.packages("lubridate")`).
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:dplyr':
## 
##     intersect, setdiff, union
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
EPA <- EPA %>% rename(date = "Date Local") # rename to get rid of spaces, nobody likes spaces
EPA <- EPA %>% mutate(year = year(date),
                      month = month(date),
                      day = day(date))
- Next, you'll need to be able to merge on FIPS codes. But this dataset doesn't have FIPS codes! What to do? [Some quick internet sleuthing reveals that FIPS codes are a combination of two-digit state codes and three-digit county codes](https://transition.fcc.gov/oet/info/maps/census/fips/fips.txt). Looking in the `EPA` dataset, we have codes like that. Can we join those together in a new column called `FIPS`? Checking both `EPA` and `CHRR` datasets for a couple counties shows this might do the trick---Baldwin county in Alabama, for example, has FIPS code 01003 in the `CHRR` dataset, and in the `EPA` dataset it has state code 01 and county code 003. There are *many* ways to do this in R, so we'll stay in the tidyverse and use `unite`.
EPA <- EPA %>% rename(state_code = "State Code",
                      county_code = "County Code") # rename to get rid of spaces, nobody likes spaces
EPA <- EPA %>% unite(FIPS, c(state_code, county_code), sep="") # run ?unite to learn the syntax!

• Each county will then have 12 months of data. You will want to average this for each county in order to generate an annual average for each county.

- You can now do this with a `group_by(FIPS,month) %>% summarise()`
3. Report the number of matches, and the observations that don’t match.
4. Create two groups of counties: the first group will have a high level of PM 2.5 and the 2nd group a low level of PM 2.5. Report the mean PM 2.5 with CI for both groups.
5. There are three mortality measures in the CHRR data:

• infant_mortality: measures the number of deaths among children less than one year of age per 1,000 live births

• child_mortality: the number of deaths among children under age 18 per 100,000 population.

• ageadj_mortality: measures the number of deaths among residents under the age of 75 per 100,000 population

Is there evidence that PM 2.5 is leading to changes in mortality rates? Present a visual comparison of mortality rates between the two groups of counties.

6. To formalize your analysis, do a two sample proportion test (report difference in proportions, test-statistic, and p-value). Interpret these results.
7. What are the potential confounders/omitted variables in the analysis?