Part 1 - Introduction

According to the World Health Organization “Ambient air pollution accounts for an estimated 4.2 million deaths per year due to stroke, heart disease, lung cancer and chronic respiratory diseases.” I wanted to determine if I could find a relationship between pollution levels and deaths. Hence my research question would be :

Does the pollution level have an impact on the number of deaths caused by circulatory and respiratory diseases.

Where my null hypothesis is:

Ho - Change in pollution levels have no impact on the number of deaths caused by circulatory and respiratory diseases

Ha - Change in pollution levels have an impacts on the number of deaths caused by circulatory and respiratory diseases

This study will be conducted over a period of 5 years (2013-2017) and will focus solely on the LA county area which has consistently been classified as one of the most polluted cities in the US

In order to be as effective and efficient as possible I followed the OSEMN data science workflow. The following sections are arranged to coincide with this workflow

Part 2 - Data

2a. Gathering the Data

Pollution data

Pollution data was sourced from the EPA data repository which contains historical daily pollution data collected from various observation posts across the US. The data repository can be accessed here. Note that we are limited to extracting the data one year at a time. As such we extracted pollution data in five separate files.

For the purposes of this project we will be focusing on two indicators of pollution namely:

The Air Quality Index (AQI) reason being that the AQI is an aggregated index calculated based on the air concentration of various pollutants
pm2.5 which is the concentration of particulate matter 2.5 micrometers or less (in diameter) in the air. According to the EPA “When inhaled, particle pollution can travel deep into the lungs and cause or aggravate heart and lung diseases. Some”primary sources" of this pollutant are – Incomplete combustion – Automobile emissions – Dust – Cooking Hence pm2.5 is widely considered as a effective measure of overall pollution.

Time period: 2017-2013

Frequency - Daily

Causes of death Data

Data on the number of deaths caused by circulatory or respiratory was obtained from the Centers for Disease Control and Prevention’s “Underlying Causes of Death” data repository.

The following criteria were used to filter the data:

Grouping- By year, by month
Geographical Area - LA County
Underlying Cause of Death - Circulatory or Respiratory illness
Time period - 2013 through 2017

The resulting dataset contained the number of monthly deaths caused along with the percentage these monthly deaths denote against the total number of deaths caused by circulatory or respiratory illness that year.

Time period: 2017-2013

Frequency - Monthly

2b. Preparing the Data

Since the pollution data was extracted in 5 different files the first step of our data preparation was to combine these files into a single dataset.

Note that the column names and formatting was not very conducive for efficient coding, as such I was able to find a great function which helps standardize data upon import. This function can be found here. The credit for this function belongs to William Doane.

library(tidyverse)

# Importing the pollution dataset 
#2013
pol13 <- read_csv('https://raw.githubusercontent.com/DevMeh/Data606finalproj/master/la2013.csv',cols(
  Date = col_date("%m/%d/%y"),
  Source = col_character(),
  `Site ID` = col_double(),
  POC = col_double(),
  `Daily Mean PM2.5 Concentration` = col_double(),
  UNITS = col_character(),
  DAILY_AQI_VALUE = col_double(),
  `Site Name` = col_character()
), col_names = TRUE)


# 2014
pol14 <- read_csv('https://raw.githubusercontent.com/DevMeh/Data606finalproj/master/la2014.csv',cols(
  Date = col_date("%m/%d/%y"),
  Source = col_character(),
  `Site ID` = col_double(),
  POC = col_double(),
  `Daily Mean PM2.5 Concentration` = col_double(),
  UNITS = col_character(),
  DAILY_AQI_VALUE = col_double(),
  `Site Name` = col_character()
), col_names = TRUE)



# 2015
pol15 <- read_csv('https://raw.githubusercontent.com/DevMeh/Data606finalproj/master/la2015.csv',cols(
  Date = col_date("%m/%d/%y"),
  Source = col_character(),
  `Site ID` = col_double(),
  POC = col_double(),
  `Daily Mean PM2.5 Concentration` = col_double(),
  UNITS = col_character(),
  DAILY_AQI_VALUE = col_double(),
  `Site Name` = col_character()
), col_names = TRUE)



# 2016
pol16 <- read_csv('https://raw.githubusercontent.com/DevMeh/Data606finalproj/master/la2016.csv',cols(
  Date = col_date("%m/%d/%y"),
  Source = col_character(),
  `Site ID` = col_double(),
  POC = col_double(),
  `Daily Mean PM2.5 Concentration` = col_double(),
  UNITS = col_character(),
  DAILY_AQI_VALUE = col_double(),
  `Site Name` = col_character()
), col_names = TRUE)


# 2017
pol17 <- read_csv('https://raw.githubusercontent.com/DevMeh/Data606finalproj/master/la2017.csv',cols(
  Date = col_date("%m/%d/%y"),
  Source = col_character(),
  `Site ID` = col_double(),
  POC = col_double(),
  `Daily Mean PM2.5 Concentration` = col_double(),
  UNITS = col_character(),
  DAILY_AQI_VALUE = col_double(),
  `Site Name` = col_character()
), col_names = TRUE)


# Clean column names function 
clean_names <- function(.data, unique = FALSE) {
  n <- if (is.data.frame(.data)) colnames(.data) else .data
  
  n <- gsub("%+", "_pct_", n)
  n <- gsub("\\$+", "_dollars_", n)
  n <- gsub("\\++", "_plus_", n)
  n <- gsub("-+", "_minus_", n)
  n <- gsub("\\*+", "_star_", n)
  n <- gsub("#+", "_cnt_", n)
  n <- gsub("&+", "_and_", n)
  n <- gsub("@+", "_at_", n)
  
  n <- gsub("[^a-zA-Z0-9_]+", "_", n)
  n <- gsub("([A-Z][a-z])", "_\\1", n)
  n <- tolower(trimws(n))
  
  n <- gsub("(^_+|_+$)", "", n)
  
  n <- gsub("_+", "_", n)
  
  if (unique) n <- make.unique(n, sep = "_")
  
  if (is.data.frame(.data)) {
    colnames(.data) <- n
    .data
  } else {
    n
  }
}
#Binding the datasets together to create a single dataset
pol_temp1 <- rbind(pol13,pol14,pol15,pol16,pol17) %>% clean_names()

#Renaming and limiting columns for convenience
names(pol_temp1)[5]= 'pm25'
names(pol_temp1)[7]= 'aqi'
vars=c("date","pm25","aqi")
pol=pol_temp1[vars]

#Review resulting pollution dataframe
head(pol)

## # A tibble: 6 x 3
##   date        pm25   aqi
##   <date>     <dbl> <dbl>
## 1 2013-01-01   5.9    25
## 2 2013-01-04   3.4    14
## 3 2013-01-07   5.2    22
## 4 2013-01-10   5.9    25
## 5 2013-01-13   4.2    18
## 6 2013-01-16   4.9    20

range(pol$date)

## [1] "2013-01-01" "2017-12-31"

To match the frequency of the ‘Causes of death’ dataset this pollution dataset will have to be re sampled to a monthly cadence. We will do so by averaging the pollutant levels.

# Converting the pollution dataset from daily to monthly
library(tidyquant)
polmonth <- pol %>%
  tq_transmute(
    mutate_fun = apply.monthly, 
    FUN        = mean,
    na.rm      = TRUE
  )

head(polmonth)

## # A tibble: 6 x 3
##   date        pm25   aqi
##   <date>     <dbl> <dbl>
## 1 2013-01-31  10.8  40.1
## 2 2013-02-28  11.2  41.6
## 3 2013-03-31  14.1  50.3
## 4 2013-04-30  11.0  42.2
## 5 2013-05-31  12.5  48.2
## 6 2013-06-30  12.0  46.6

We will now create a column called ‘dt’ which will then be used to join the two datasets together

polmonth$year <- year(polmonth$date)
polmonth$month <- month(polmonth$date)
polmonth$dt <- paste(polmonth$year,polmonth$month, sep='-')

head(polmonth)

## # A tibble: 6 x 6
##   date        pm25   aqi  year month dt    
##   <date>     <dbl> <dbl> <dbl> <dbl> <chr> 
## 1 2013-01-31  10.8  40.1  2013     1 2013-1
## 2 2013-02-28  11.2  41.6  2013     2 2013-2
## 3 2013-03-31  14.1  50.3  2013     3 2013-3
## 4 2013-04-30  11.0  42.2  2013     4 2013-4
## 5 2013-05-31  12.5  48.2  2013     5 2013-5
## 6 2013-06-30  12.0  46.6  2013     6 2013-6

Our pollution dataset is now ready!!!

#Importing the causes of death dataset
cod <- read_csv('https://raw.githubusercontent.com/DevMeh/Data606finalproj/master/ucodtotals.csv') %>% clean_names()

#Creating the dt column which combines month and year
cod$dt=paste(cod$year,cod$month, sep='-')

head(cod)

## # A tibble: 6 x 5
##    year month deaths pct_of_total_deaths dt    
##   <dbl> <dbl>  <dbl>               <dbl> <chr> 
## 1  2013     1   3034              0.111  2013-1
## 2  2013     2   2696              0.0987 2013-2
## 3  2013     3   2567              0.0939 2013-3
## 4  2013     4   2179              0.0797 2013-4
## 5  2013     5   2069              0.0757 2013-5
## 6  2013     6   2077              0.0760 2013-6

Our Causes of death dataset is now ready

Joining the two datasets together for the final dataset

final<- left_join(cod, polmonth, by='dt')

head(final)

## # A tibble: 6 x 10
##   year.x month.x deaths pct_of_total_de… dt    date        pm25   aqi year.y
##    <dbl>   <dbl>  <dbl>            <dbl> <chr> <date>     <dbl> <dbl>  <dbl>
## 1   2013       1   3034           0.111  2013… 2013-01-31  10.8  40.1   2013
## 2   2013       2   2696           0.0987 2013… 2013-02-28  11.2  41.6   2013
## 3   2013       3   2567           0.0939 2013… 2013-03-31  14.1  50.3   2013
## 4   2013       4   2179           0.0797 2013… 2013-04-30  11.0  42.2   2013
## 5   2013       5   2069           0.0757 2013… 2013-05-31  12.5  48.2   2013
## 6   2013       6   2077           0.0760 2013… 2013-06-30  12.0  46.6   2013
## # … with 1 more variable: month.y <dbl>

Part 3 - Exploratory data analysis

Visualizing pollution trends

#pm2.5

ggplot(final, aes(x=date,y=pm25, group=1))+geom_line(aes(color=year.x))+theme(axis.text.x = element_text(angle = 90, hjust = 1, size=10))

#aqi

ggplot(final, aes(x=date,y=aqi, group=1))+geom_line(aes(color=year.x))+theme(axis.text.x = element_text(angle = 90, hjust = 1, size=10))

Comparison of year over year to confirm a slight downward trend in pollution

#aqi

ggplot(final, aes(x=month.x,y=aqi, group=year.x))+geom_line(aes(color=factor(year.x)))

Let’s see if we can discern if there is a correlation between pollution and deaths:

# Scatter between pm2.5 and deaths
ggplot(final, aes(x=pm25,y=deaths,group=year.x))+geom_point(aes(color=factor(year.x)))

# Scatter between pm2.5 and deaths
ggplot(final, aes(x=aqi,y=pct_of_total_deaths,group=year.x))+geom_point(aes(color=factor(year.x)))

We can clearly see that there is some positive correlation between pollutants and no of deaths. Let us confirm by buildind a correlation matrix

vars=c("pm25","pct_of_total_deaths","month.x", "aqi")

stats <- final[vars]

library(corrplot)
corrplot(cor(stats), method = "color",addCoef.col = "dark grey",tl.col = "black" )

Part 4 - Inference

Based on the matrix above, we will build three models

# Model 1 With pct_of_total_deaths and pm25
model1 = lm(pct_of_total_deaths~pm25,data=final)
summary(model1)

## 
## Call:
## lm(formula = pct_of_total_deaths ~ pm25, data = final)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.012773 -0.007301 -0.003364  0.006825  0.028474 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.0771929  0.0069999  11.028  7.3e-16 ***
## pm25        0.0004977  0.0005577   0.892    0.376    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.009924 on 58 degrees of freedom
## Multiple R-squared:  0.01354,    Adjusted R-squared:  -0.003466 
## F-statistic: 0.7962 on 1 and 58 DF,  p-value: 0.3759

The model with just pollutants as a predictor has

very high p values
low F statistic
low R2 s

Thus this model cannot be considered statistically and practically significant

# Model 2 With pct_of_total_deaths and month
model2 = lm(pct_of_total_deaths~month.x,data=final)
summary(model2)

## 
## Call:
## lm(formula = pct_of_total_deaths ~ month.x, data = final)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.010324 -0.006934 -0.002351  0.005451  0.023437 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.0919586  0.0024330  37.796  < 2e-16 ***
## month.x     -0.0013270  0.0003306  -4.014 0.000173 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.00884 on 58 degrees of freedom
## Multiple R-squared:  0.2174, Adjusted R-squared:  0.2039 
## F-statistic: 16.11 on 1 and 58 DF,  p-value: 0.0001735

The model with just months as a predictor has 1. low p values 2. moderately low R2 3. high F statistic

Thus this model can be considered statistically significant but not very practically useful

# Model 3 With pm25, month and aqi
model3 = lm(pct_of_total_deaths~pm25+aqi+month.x,data=final)
summary(model3)

## 
## Call:
## lm(formula = pct_of_total_deaths ~ pm25 + aqi + month.x, data = final)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -0.0160113 -0.0029024  0.0000407  0.0033174  0.0140835 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.1628240  0.0102075  15.951  < 2e-16 ***
## pm25         0.0161400  0.0019014   8.489 1.22e-11 ***
## aqi         -0.0059599  0.0007125  -8.364 1.95e-11 ***
## month.x     -0.0006820  0.0002354  -2.897  0.00536 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.005949 on 56 degrees of freedom
## Multiple R-squared:  0.6578, Adjusted R-squared:  0.6394 
## F-statistic: 35.88 on 3 and 56 DF,  p-value: 4.524e-13

The model with months, aqi and pm25 as predictors 1. low p values 2. high R2 3. very high F statistic

Thus this model can be considered statistically significant and practically useful. However we do note that aqi and pm25 are highly multicollinear, although multicollinear variables should not effect the fit of the model (only its interpretibility)

In order to determine if the conditions for linear regression hold let us review the residuals

# Checking for Homoscedasiticity
plot(model3$residuals)

# Checking for residual distribution
hist(model3$residuals)

# Checking for residual distribution
qqnorm(model3$residuals)
qqline(model3$residuals)

Part 5 - Conclusion

Reviewing the above diagnostic plots above, we can conclude that conditions for linear regression are being met and as such we can reject our null hypothesis that pollutants have no effect on the number of deaths caused by respiratory and circulatory illnesses.

However, more work needs to be done since the geographical area was limited to LA. Additionally, on an average the air quality in the US is above average. As such, picking a geographical area which is more polluted might result in a more robust model.

References

All relevant references have been linked within the text

Mortality due to Pollution

Devanshu Mehrotra

12/19/2019