According to the World Health Organization “Ambient air pollution accounts for an estimated 4.2 million deaths per year due to stroke, heart disease, lung cancer and chronic respiratory diseases.” I wanted to determine if I could find a relationship between pollution levels and deaths. Hence my research question would be :
Does the pollution level have an impact on the number of deaths caused by circulatory and respiratory diseases.
Where my null hypothesis is:
Ho - Change in pollution levels have no impact on the number of deaths caused by circulatory and respiratory diseases
Ha - Change in pollution levels have an impacts on the number of deaths caused by circulatory and respiratory diseases
This study will be conducted over a period of 5 years (2013-2017) and will focus solely on the LA county area which has consistently been classified as one of the most polluted cities in the US
In order to be as effective and efficient as possible I followed the OSEMN data science workflow. The following sections are arranged to coincide with this workflow
Pollution data was sourced from the EPA data repository which contains historical daily pollution data collected from various observation posts across the US. The data repository can be accessed here. Note that we are limited to extracting the data one year at a time. As such we extracted pollution data in five separate files.
For the purposes of this project we will be focusing on two indicators of pollution namely:
The Air Quality Index (AQI) reason being that the AQI is an aggregated index calculated based on the air concentration of various pollutants
pm2.5 which is the concentration of particulate matter 2.5 micrometers or less (in diameter) in the air. According to the EPA “When inhaled, particle pollution can travel deep into the lungs and cause or aggravate heart and lung diseases. Some”primary sources" of this pollutant are – Incomplete combustion – Automobile emissions – Dust – Cooking Hence pm2.5 is widely considered as a effective measure of overall pollution.
Time period: 2017-2013
Frequency - Daily
Data on the number of deaths caused by circulatory or respiratory was obtained from the Centers for Disease Control and Prevention’s “Underlying Causes of Death” data repository.
The following criteria were used to filter the data:
The resulting dataset contained the number of monthly deaths caused along with the percentage these monthly deaths denote against the total number of deaths caused by circulatory or respiratory illness that year.
Time period: 2017-2013
Frequency - Monthly
Since the pollution data was extracted in 5 different files the first step of our data preparation was to combine these files into a single dataset.
Note that the column names and formatting was not very conducive for efficient coding, as such I was able to find a great function which helps standardize data upon import. This function can be found here. The credit for this function belongs to William Doane.
library(tidyverse)
# Importing the pollution dataset
#2013
pol13 <- read_csv('https://raw.githubusercontent.com/DevMeh/Data606finalproj/master/la2013.csv',cols(
Date = col_date("%m/%d/%y"),
Source = col_character(),
`Site ID` = col_double(),
POC = col_double(),
`Daily Mean PM2.5 Concentration` = col_double(),
UNITS = col_character(),
DAILY_AQI_VALUE = col_double(),
`Site Name` = col_character()
), col_names = TRUE)
# 2014
pol14 <- read_csv('https://raw.githubusercontent.com/DevMeh/Data606finalproj/master/la2014.csv',cols(
Date = col_date("%m/%d/%y"),
Source = col_character(),
`Site ID` = col_double(),
POC = col_double(),
`Daily Mean PM2.5 Concentration` = col_double(),
UNITS = col_character(),
DAILY_AQI_VALUE = col_double(),
`Site Name` = col_character()
), col_names = TRUE)
# 2015
pol15 <- read_csv('https://raw.githubusercontent.com/DevMeh/Data606finalproj/master/la2015.csv',cols(
Date = col_date("%m/%d/%y"),
Source = col_character(),
`Site ID` = col_double(),
POC = col_double(),
`Daily Mean PM2.5 Concentration` = col_double(),
UNITS = col_character(),
DAILY_AQI_VALUE = col_double(),
`Site Name` = col_character()
), col_names = TRUE)
# 2016
pol16 <- read_csv('https://raw.githubusercontent.com/DevMeh/Data606finalproj/master/la2016.csv',cols(
Date = col_date("%m/%d/%y"),
Source = col_character(),
`Site ID` = col_double(),
POC = col_double(),
`Daily Mean PM2.5 Concentration` = col_double(),
UNITS = col_character(),
DAILY_AQI_VALUE = col_double(),
`Site Name` = col_character()
), col_names = TRUE)
# 2017
pol17 <- read_csv('https://raw.githubusercontent.com/DevMeh/Data606finalproj/master/la2017.csv',cols(
Date = col_date("%m/%d/%y"),
Source = col_character(),
`Site ID` = col_double(),
POC = col_double(),
`Daily Mean PM2.5 Concentration` = col_double(),
UNITS = col_character(),
DAILY_AQI_VALUE = col_double(),
`Site Name` = col_character()
), col_names = TRUE)
# Clean column names function
clean_names <- function(.data, unique = FALSE) {
n <- if (is.data.frame(.data)) colnames(.data) else .data
n <- gsub("%+", "_pct_", n)
n <- gsub("\\$+", "_dollars_", n)
n <- gsub("\\++", "_plus_", n)
n <- gsub("-+", "_minus_", n)
n <- gsub("\\*+", "_star_", n)
n <- gsub("#+", "_cnt_", n)
n <- gsub("&+", "_and_", n)
n <- gsub("@+", "_at_", n)
n <- gsub("[^a-zA-Z0-9_]+", "_", n)
n <- gsub("([A-Z][a-z])", "_\\1", n)
n <- tolower(trimws(n))
n <- gsub("(^_+|_+$)", "", n)
n <- gsub("_+", "_", n)
if (unique) n <- make.unique(n, sep = "_")
if (is.data.frame(.data)) {
colnames(.data) <- n
.data
} else {
n
}
}
#Binding the datasets together to create a single dataset
pol_temp1 <- rbind(pol13,pol14,pol15,pol16,pol17) %>% clean_names()
#Renaming and limiting columns for convenience
names(pol_temp1)[5]= 'pm25'
names(pol_temp1)[7]= 'aqi'
vars=c("date","pm25","aqi")
pol=pol_temp1[vars]
#Review resulting pollution dataframe
head(pol)## # A tibble: 6 x 3
## date pm25 aqi
## <date> <dbl> <dbl>
## 1 2013-01-01 5.9 25
## 2 2013-01-04 3.4 14
## 3 2013-01-07 5.2 22
## 4 2013-01-10 5.9 25
## 5 2013-01-13 4.2 18
## 6 2013-01-16 4.9 20
## [1] "2013-01-01" "2017-12-31"
To match the frequency of the ‘Causes of death’ dataset this pollution dataset will have to be re sampled to a monthly cadence. We will do so by averaging the pollutant levels.
# Converting the pollution dataset from daily to monthly
library(tidyquant)
polmonth <- pol %>%
tq_transmute(
mutate_fun = apply.monthly,
FUN = mean,
na.rm = TRUE
)
head(polmonth)## # A tibble: 6 x 3
## date pm25 aqi
## <date> <dbl> <dbl>
## 1 2013-01-31 10.8 40.1
## 2 2013-02-28 11.2 41.6
## 3 2013-03-31 14.1 50.3
## 4 2013-04-30 11.0 42.2
## 5 2013-05-31 12.5 48.2
## 6 2013-06-30 12.0 46.6
We will now create a column called ‘dt’ which will then be used to join the two datasets together
polmonth$year <- year(polmonth$date)
polmonth$month <- month(polmonth$date)
polmonth$dt <- paste(polmonth$year,polmonth$month, sep='-')
head(polmonth)## # A tibble: 6 x 6
## date pm25 aqi year month dt
## <date> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2013-01-31 10.8 40.1 2013 1 2013-1
## 2 2013-02-28 11.2 41.6 2013 2 2013-2
## 3 2013-03-31 14.1 50.3 2013 3 2013-3
## 4 2013-04-30 11.0 42.2 2013 4 2013-4
## 5 2013-05-31 12.5 48.2 2013 5 2013-5
## 6 2013-06-30 12.0 46.6 2013 6 2013-6
Our pollution dataset is now ready!!!
#Importing the causes of death dataset
cod <- read_csv('https://raw.githubusercontent.com/DevMeh/Data606finalproj/master/ucodtotals.csv') %>% clean_names()
#Creating the dt column which combines month and year
cod$dt=paste(cod$year,cod$month, sep='-')
head(cod)## # A tibble: 6 x 5
## year month deaths pct_of_total_deaths dt
## <dbl> <dbl> <dbl> <dbl> <chr>
## 1 2013 1 3034 0.111 2013-1
## 2 2013 2 2696 0.0987 2013-2
## 3 2013 3 2567 0.0939 2013-3
## 4 2013 4 2179 0.0797 2013-4
## 5 2013 5 2069 0.0757 2013-5
## 6 2013 6 2077 0.0760 2013-6
Our Causes of death dataset is now ready
Joining the two datasets together for the final dataset
## # A tibble: 6 x 10
## year.x month.x deaths pct_of_total_de… dt date pm25 aqi year.y
## <dbl> <dbl> <dbl> <dbl> <chr> <date> <dbl> <dbl> <dbl>
## 1 2013 1 3034 0.111 2013… 2013-01-31 10.8 40.1 2013
## 2 2013 2 2696 0.0987 2013… 2013-02-28 11.2 41.6 2013
## 3 2013 3 2567 0.0939 2013… 2013-03-31 14.1 50.3 2013
## 4 2013 4 2179 0.0797 2013… 2013-04-30 11.0 42.2 2013
## 5 2013 5 2069 0.0757 2013… 2013-05-31 12.5 48.2 2013
## 6 2013 6 2077 0.0760 2013… 2013-06-30 12.0 46.6 2013
## # … with 1 more variable: month.y <dbl>
Visualizing pollution trends
#pm2.5
ggplot(final, aes(x=date,y=pm25, group=1))+geom_line(aes(color=year.x))+theme(axis.text.x = element_text(angle = 90, hjust = 1, size=10))#aqi
ggplot(final, aes(x=date,y=aqi, group=1))+geom_line(aes(color=year.x))+theme(axis.text.x = element_text(angle = 90, hjust = 1, size=10))Comparison of year over year to confirm a slight downward trend in pollution
Let’s see if we can discern if there is a correlation between pollution and deaths:
# Scatter between pm2.5 and deaths
ggplot(final, aes(x=pm25,y=deaths,group=year.x))+geom_point(aes(color=factor(year.x)))# Scatter between pm2.5 and deaths
ggplot(final, aes(x=aqi,y=pct_of_total_deaths,group=year.x))+geom_point(aes(color=factor(year.x)))We can clearly see that there is some positive correlation between pollutants and no of deaths. Let us confirm by buildind a correlation matrix
vars=c("pm25","pct_of_total_deaths","month.x", "aqi")
stats <- final[vars]
library(corrplot)
corrplot(cor(stats), method = "color",addCoef.col = "dark grey",tl.col = "black" )Based on the matrix above, we will build three models
# Model 1 With pct_of_total_deaths and pm25
model1 = lm(pct_of_total_deaths~pm25,data=final)
summary(model1)##
## Call:
## lm(formula = pct_of_total_deaths ~ pm25, data = final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.012773 -0.007301 -0.003364 0.006825 0.028474
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0771929 0.0069999 11.028 7.3e-16 ***
## pm25 0.0004977 0.0005577 0.892 0.376
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.009924 on 58 degrees of freedom
## Multiple R-squared: 0.01354, Adjusted R-squared: -0.003466
## F-statistic: 0.7962 on 1 and 58 DF, p-value: 0.3759
The model with just pollutants as a predictor has
Thus this model cannot be considered statistically and practically significant
# Model 2 With pct_of_total_deaths and month
model2 = lm(pct_of_total_deaths~month.x,data=final)
summary(model2)##
## Call:
## lm(formula = pct_of_total_deaths ~ month.x, data = final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.010324 -0.006934 -0.002351 0.005451 0.023437
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.0919586 0.0024330 37.796 < 2e-16 ***
## month.x -0.0013270 0.0003306 -4.014 0.000173 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.00884 on 58 degrees of freedom
## Multiple R-squared: 0.2174, Adjusted R-squared: 0.2039
## F-statistic: 16.11 on 1 and 58 DF, p-value: 0.0001735
The model with just months as a predictor has 1. low p values 2. moderately low R2 3. high F statistic
Thus this model can be considered statistically significant but not very practically useful
# Model 3 With pm25, month and aqi
model3 = lm(pct_of_total_deaths~pm25+aqi+month.x,data=final)
summary(model3)##
## Call:
## lm(formula = pct_of_total_deaths ~ pm25 + aqi + month.x, data = final)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.0160113 -0.0029024 0.0000407 0.0033174 0.0140835
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1628240 0.0102075 15.951 < 2e-16 ***
## pm25 0.0161400 0.0019014 8.489 1.22e-11 ***
## aqi -0.0059599 0.0007125 -8.364 1.95e-11 ***
## month.x -0.0006820 0.0002354 -2.897 0.00536 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.005949 on 56 degrees of freedom
## Multiple R-squared: 0.6578, Adjusted R-squared: 0.6394
## F-statistic: 35.88 on 3 and 56 DF, p-value: 4.524e-13
The model with months, aqi and pm25 as predictors 1. low p values 2. high R2 3. very high F statistic
Thus this model can be considered statistically significant and practically useful. However we do note that aqi and pm25 are highly multicollinear, although multicollinear variables should not effect the fit of the model (only its interpretibility)
In order to determine if the conditions for linear regression hold let us review the residuals
Reviewing the above diagnostic plots above, we can conclude that conditions for linear regression are being met and as such we can reject our null hypothesis that pollutants have no effect on the number of deaths caused by respiratory and circulatory illnesses.
However, more work needs to be done since the geographical area was limited to LA. Additionally, on an average the air quality in the US is above average. As such, picking a geographical area which is more polluted might result in a more robust model.
All relevant references have been linked within the text