# Load data
You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.
According to the World Health Organization, “Ambient air pollution accounts for an estimated 4.2 million deaths per year due to stroke, heart disease, lung cancer and chronic respiratory diseases.” I wanted to determine whether I could find a relationship between pollution levels and deaths. Hence my research question is:
Does the pollution level have an impact on the number of deaths caused by circulatory and respiratory diseases?
My null and alternative hypotheses are:
H0: Changes in pollution levels have no impact on deaths caused by circulatory and respiratory diseases.
Ha: Changes in pollution levels do have an impact on deaths caused by circulatory and respiratory diseases.
The major outdoor pollution sources include vehicles, power generation, building heating systems, agriculture/waste incineration and industry. Policies and investments supporting cleaner transport, energy-efficient housing, power generation, industry and better municipal waste management can effectively reduce key sources of ambient air pollution.
What are the cases, and how many are there?
To determine the number of cases, I will import two sets of data: the daily pollution readings and the monthly death counts.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────── tidyverse 1.3.0.9000 ──
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0.9000 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
# Importing the PM2.5 pollution data for 2013 through 2017
# All five files share the same layout, so the column spec is defined once
pol_spec <- cols(
  Date = col_date("%m/%d/%y"),
  Source = col_character(),
  `Site ID` = col_double(),
  POC = col_double(),
  `Daily Mean PM2.5 Concentration` = col_double(),
  UNITS = col_character(),
  DAILY_AQI_VALUE = col_double(),
  `Site Name` = col_character()
)
# Read each year and bind the rows into a single dataset
pol_files <- c('la2013.csv', 'la2014.csv', 'la2015.csv', 'la2016.csv', 'la2017.csv')
pol <- map_dfr(pol_files, read_csv, col_types = pol_spec)
#Renaming and limiting columns for convenience
pol <- pol %>%
  rename(pm25 = `Daily Mean PM2.5 Concentration`,
         aqi = DAILY_AQI_VALUE) %>%
  select(Date, pm25, aqi)
#Review resulting pollution dataframe
head(pol)
## # A tibble: 6 x 3
## Date pm25 aqi
## <date> <dbl> <dbl>
## 1 2013-01-01 5.9 25
## 2 2013-01-04 3.4 14
## 3 2013-01-07 5.2 22
## 4 2013-01-10 5.9 25
## 5 2013-01-13 4.2 18
## 6 2013-01-16 4.9 20
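As a quick optional check, one can tabulate the number of daily observations per year to confirm that all five files were combined:
# Optional check: daily observations per year in the combined dataset
table(format(pol$Date, "%Y"))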
#Importing the dataset containing deaths
cod <- read_csv('ucodtotals.csv')
## Parsed with column specification:
## cols(
## Year = col_double(),
## Month = col_double(),
## Deaths = col_double(),
## `% of Total Deaths` = col_character()
## )
head(cod)
## # A tibble: 6 x 4
## Year Month Deaths `% of Total Deaths`
## <dbl> <dbl> <dbl> <chr>
## 1 2013 1 3034 11.10%
## 2 2013 2 2696 9.87%
## 3 2013 3 2567 9.39%
## 4 2013 4 2179 7.97%
## 5 2013 5 2069 7.57%
## 6 2013 6 2077 7.60%
Describe the method of data collection.
For this project I source pollution data from the EPA daily air quality data repository and compare it against the monthly number of deaths obtained from the CDC Underlying Cause of Death database (links below).
Note that to limit the scope of the project I have focused on Los Angeles County, since this county has consistently been rated as one of the most polluted in the country.
What type of study is this (observational/experiment)?
This will be an observational study.
If you collected the data, state self-collected. If not, provide a citation/link.
Pollution data: https://www.epa.gov/outdoor-air-quality-data/download-daily-data
Cause of death data: https://wonder.cdc.gov/controller/datarequest/D76
Note that the deaths dataset has already been filtered for deaths caused by circulatory and respiratory diseases.
The temporal scope of the project is five years (2013 through 2017).
What is the response variable? Is it quantitative or qualitative?
The response variable is the monthly number of deaths caused by circulatory and respiratory diseases (or, equivalently, each month's percentage of that year's total such deaths). It is quantitative.
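If the percentage is used as the response, it must first be converted from its character form (e.g. “11.10%”) to a number. A minimal sketch, where `pct_total` is a column name I am introducing:
# Parse the "% of Total Deaths" column into a numeric percentage
cod$pct_total <- readr::parse_number(cod$`% of Total Deaths`)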
You should have two independent variables, one quantitative and one qualitative.
The independent variables are the concentration of pollutants in the air (either the AQI or the PM2.5 concentration), which is quantitative, and, since there is bound to be some seasonality in the data, the month, treated as a categorical (ordinal) variable.
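Because month is ordinal, it can be encoded as an ordered factor before modeling. A short sketch on the deaths data; `month_f` is a name I am introducing:
# Encode month as an ordered categorical variable (Jan < Feb < ... < Dec)
cod$month_f <- factor(cod$Month, levels = 1:12, labels = month.abb, ordered = TRUE)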
Provide summary statistics for each of the variables. Also include appropriate visualizations related to your research question (e.g. scatter plots, boxplots, etc.). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.
# Summary stats for deaths by year
psych::describeBy(cod$Deaths,cod$Year)
##
## Descriptive statistics by group
## group: 2013
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 12 2277.25 335.62 2141.5 2238.4 175.69 1909 3034 1125 0.95 -0.39
## se
## X1 96.89
## ------------------------------------------------------------
## group: 2014
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 12 2159.83 202.62 2096.5 2135.2 169.02 1951 2615 664 0.94 -0.35
## se
## X1 58.49
## ------------------------------------------------------------
## group: 2015
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 12 2298.17 265.19 2251.5 2263.6 217.94 1991 2951 960 1.06 0.39
## se
## X1 76.55
## ------------------------------------------------------------
## group: 2016
## vars n mean sd median trimmed mad min max range skew kurtosis
## X1 1 12 2323.58 292.64 2187 2308.9 188.29 2008 2786 778 0.49 -1.66
## se
## X1 84.48
## ------------------------------------------------------------
## group: 2017
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 12 2334 296.24 2195.5 2309 192 2040 2878 838 0.65 -1.29 85.52
# Converting the pollution dataset from daily to monthly
library(tidyquant)
## Loading required package: lubridate
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
## Loading required package: PerformanceAnalytics
## Loading required package: xts
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
## Registered S3 method overwritten by 'xts':
## method from
## as.zoo.xts zoo
##
## Attaching package: 'xts'
## The following objects are masked from 'package:dplyr':
##
## first, last
##
## Attaching package: 'PerformanceAnalytics'
## The following object is masked from 'package:graphics':
##
## legend
## Loading required package: quantmod
## Loading required package: TTR
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
## Version 0.4-0 included new data defaults. See ?getSymbols.
polmonth <- pol %>%
  tq_transmute(
    mutate_fun = apply.monthly,
    FUN = mean,
    na.rm = TRUE
  )
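For comparison, the same monthly averaging can be done with dplyr alone. A sketch (the `polmonth_alt` name is mine); note that it yields year/month columns rather than the month-end dates that `apply.monthly` keeps:
# Monthly means via group_by/summarise instead of tq_transmute
polmonth_alt <- pol %>%
  group_by(year = year(Date), month = month(Date)) %>%
  summarise(pm25 = mean(pm25, na.rm = TRUE),
            aqi = mean(aqi, na.rm = TRUE))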
#Creating month and year columns for grouping
polmonth$month <- month(polmonth$Date)
polmonth$year <- year(polmonth$Date)
#Summary stats for aqi
psych::describeBy(polmonth$aqi,polmonth$year)
##
## Descriptive statistics by group
## group: 2013
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 12 45.89 3.51 47.15 46.03 3.88 40.1 50.33 10.23 -0.3 -1.66 1.01
## ------------------------------------------------------------
## group: 2014
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 12 49.35 7.1 47.99 48.86 6.54 40.48 63.08 22.6 0.61 -0.82 2.05
## ------------------------------------------------------------
## group: 2015
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 12 45.61 6.95 43.69 45.44 5.81 35.01 57.85 22.84 0.41 -1.11 2.01
## ------------------------------------------------------------
## group: 2016
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 12 43.67 4.65 43.85 43.49 6.07 37.78 51.38 13.6 0.12 -1.52 1.34
## ------------------------------------------------------------
## group: 2017
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 12 45.53 7.37 43.8 45.05 6.67 33.81 61.98 28.17 0.57 -0.23 2.13
#Summary stats for pm25
psych::describeBy(polmonth$pm25,polmonth$year)
##
## Descriptive statistics by group
## group: 2013
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 12 12.19 1.13 12.24 12.14 1.64 10.77 14.07 3.3 0.19 -1.56 0.33
## ------------------------------------------------------------
## group: 2014
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 12 13.68 2.72 13.02 13.39 2.22 10.66 19.58 8.92 0.86 -0.39 0.79
## ------------------------------------------------------------
## group: 2015
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 12 12.26 2.73 11.46 12.08 1.92 8.66 17.66 9 0.79 -0.65 0.79
## ------------------------------------------------------------
## group: 2016
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 12 11.45 1.53 11.45 11.4 1.93 9.43 14.01 4.58 0.11 -1.38 0.44
## ------------------------------------------------------------
## group: 2017
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 12 12.11 2.72 11.48 11.79 2.31 8.38 19.04 10.66 1.1 0.89 0.79
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
#Create a column that combines month and year
cod$dt <- paste(cod$Year, cod$Month, sep = '-')
#Visualizing trends in deaths over the years
xform <- list(categoryorder = "array", categoryarray = cod$dt)
plot_ly(cod, x = ~factor(dt), y = ~Deaths, type = 'scattergl', mode = 'lines') %>%
  layout(title = "Circulatory and Respiratory Deaths", xaxis = xform)
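An alternative sketch (the `month_start` column is mine) builds a real date for each month, which gives a chronological x-axis without the manual category ordering:
# Plot against a true date axis instead of an ordered character axis
cod$month_start <- make_date(cod$Year, cod$Month, 1)
plot_ly(cod, x = ~month_start, y = ~Deaths, type = 'scattergl', mode = 'lines') %>%
  layout(title = "Circulatory and Respiratory Deaths")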
# Yearly ranges for aqi and pm25
par(mfrow = c(1, 2))
boxplot(pm25 ~ year,
        data = polmonth,
        main = "Monthly pm2.5",
        xlab = "Year",
        col = "light blue")
boxplot(aqi ~ year,
        data = polmonth,
        main = "Monthly AQI",
        xlab = "Year",
        col = "light green")
# AQI and pm25 ranges grouped by month
par(mfrow = c(1, 2))
boxplot(pm25 ~ month,
        data = polmonth,
        main = "Monthly pm2.5",
        xlab = "Month",
        col = "light blue")
boxplot(aqi ~ month,
        data = polmonth,
        main = "Monthly AQI",
        xlab = "Month",
        col = "light green")