DATA 606 Data Project Proposal

Data Preparation

# load data

Research question

You should phrase your research question in a way that matches up with the scope of inference your dataset allows for.

According to the World Health Organization “Ambient air pollution accounts for an estimated 4.2 million deaths per year due to stroke, heart disease, lung cancer and chronic respiratory diseases.” I wanted to determine if I could find a relationship between pollution levels and deaths. Hence my research question would be :

Does the pollution level have an impact on the number of deaths caused by circulatory and respiratory diseases.

Where my null hypothesis is:

Ho- Change in pollution levels have no impact on deaths caused by circulatory and respiratory diseases

The major outdoor pollution sources include vehicles, power generation, building heating systems, agriculture/waste incineration and industry. Policies and investments supporting cleaner transport, energy-efficient housing, power generation, industry and better municipal waste management can effectively reduce key sources of ambient air pollution.

Cases

What are the cases, and how many are there?

In order to determine the number of cases I will import two sets of data, namely:

Pollution levels in the LA County area from 2013-2017
Monthly deaths caused by circulatory and respiratory diseases in the LA County

library(tidyverse)

## ── Attaching packages ─────────────────────────────────── tidyverse 1.3.0.9000 ──

## ✔ ggplot2 3.2.1          ✔ purrr   0.3.3     
## ✔ tibble  2.1.3          ✔ dplyr   0.8.3     
## ✔ tidyr   1.0.0.9000     ✔ stringr 1.4.0     
## ✔ readr   1.3.1          ✔ forcats 0.4.0

## ── Conflicts ─────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

# Importing the pollution dataset 
#2013
pol13 <- read_csv('la2013.csv',cols(
  Date = col_date("%m/%d/%y"),
  Source = col_character(),
  `Site ID` = col_double(),
  POC = col_double(),
  `Daily Mean PM2.5 Concentration` = col_double(),
  UNITS = col_character(),
  DAILY_AQI_VALUE = col_double(),
  `Site Name` = col_character()
), col_names = TRUE)


# 2014
pol14 <- read_csv('la2014.csv',cols(
  Date = col_date("%m/%d/%y"),
  Source = col_character(),
  `Site ID` = col_double(),
  POC = col_double(),
  `Daily Mean PM2.5 Concentration` = col_double(),
  UNITS = col_character(),
  DAILY_AQI_VALUE = col_double(),
  `Site Name` = col_character()
), col_names = TRUE)



# 2015
pol15 <- read_csv('la2015.csv',cols(
  Date = col_date("%m/%d/%y"),
  Source = col_character(),
  `Site ID` = col_double(),
  POC = col_double(),
  `Daily Mean PM2.5 Concentration` = col_double(),
  UNITS = col_character(),
  DAILY_AQI_VALUE = col_double(),
  `Site Name` = col_character()
), col_names = TRUE)



# 2016
pol16 <- read_csv('la2016.csv',cols(
  Date = col_date("%m/%d/%y"),
  Source = col_character(),
  `Site ID` = col_double(),
  POC = col_double(),
  `Daily Mean PM2.5 Concentration` = col_double(),
  UNITS = col_character(),
  DAILY_AQI_VALUE = col_double(),
  `Site Name` = col_character()
), col_names = TRUE)


# 2017
pol17 <- read_csv('la2017.csv',cols(
  Date = col_date("%m/%d/%y"),
  Source = col_character(),
  `Site ID` = col_double(),
  POC = col_double(),
  `Daily Mean PM2.5 Concentration` = col_double(),
  UNITS = col_character(),
  DAILY_AQI_VALUE = col_double(),
  `Site Name` = col_character()
), col_names = TRUE)

#Binding the datasets together to create a single dataset
pol <- rbind(pol13,pol14,pol15,pol16,pol17)

#Remaining and limiting columns for convenience
names(pol)[5]= 'pm25'
names(pol)[7]= 'aqi'
vars=c("Date","pm25","aqi")
pol=pol[vars]

#Review resulting pollution dataframe
head(pol)

## # A tibble: 6 x 3
##   Date        pm25   aqi
##   <date>     <dbl> <dbl>
## 1 2013-01-01   5.9    25
## 2 2013-01-04   3.4    14
## 3 2013-01-07   5.2    22
## 4 2013-01-10   5.9    25
## 5 2013-01-13   4.2    18
## 6 2013-01-16   4.9    20

#Importing the dataset containing deaths
cod <- read_csv('ucodtotals.csv')

## Parsed with column specification:
## cols(
##   Year = col_double(),
##   Month = col_double(),
##   Deaths = col_double(),
##   `% of Total Deaths` = col_character()
## )

head(cod)

## # A tibble: 6 x 4
##    Year Month Deaths `% of Total Deaths`
##   <dbl> <dbl>  <dbl> <chr>              
## 1  2013     1   3034 11.10%             
## 2  2013     2   2696 9.87%              
## 3  2013     3   2567 9.39%              
## 4  2013     4   2179 7.97%              
## 5  2013     5   2069 7.57%              
## 6  2013     6   2077 7.60%

Data collection

Describe the method of data collection. For my project I will source pollution data from the EPA data repository here and compare these against the monthly number of deaths obtained from the CDC Underlying Causes of Death database here

Note that to limit the scope of the project I have focused on the Los Angeles county since this county has consistently been rated as one of the most polluted in the country.

Type of study

What type of study is this (observational/experiment)?

This will be an observational study

Data Source

If you collected the data, state self-collected. If not, provide a citation/link. Pollution data - https://www.epa.gov/outdoor-air-quality-data/download-daily-data Cause of Deaths data- https://wonder.cdc.gov/controller/datarequest/D76 Note that the dataset has already been filtered for deaths caused by circulatory and respiratory diseases.

The temporal scope of the project is 5 years (2013 through 2017)

Dependent Variable

What is the response variable? Is it quantitative or qualitative?

The response variable will be the number of deaths or the percentage of deaths as a function of the total deaths caused by circulatory and respiratory diseases for the year.

Independent Variable

You should have two independent variables, one quantitative and one qualitative.

Independent variables will be the concentration of pollutants in the air either (AQI or pm25) and since there is bound to be some seasonality within the data, I will be using the month as a categorical(ordinal) variable.

Relevant summary statistics

Provide summary statistics for each the variables. Also include appropriate visualizations related to your research question (e.g. scatter plot, boxplots, etc). This step requires the use of R, hence a code chunk is provided below. Insert more code chunks as needed.

# Summary stats for deaths by year
psych::describeBy(cod$Deaths,cod$Year)

## 
##  Descriptive statistics by group 
## group: 2013
##    vars  n    mean     sd median trimmed    mad  min  max range skew kurtosis
## X1    1 12 2277.25 335.62 2141.5  2238.4 175.69 1909 3034  1125 0.95    -0.39
##       se
## X1 96.89
## ------------------------------------------------------------ 
## group: 2014
##    vars  n    mean     sd median trimmed    mad  min  max range skew kurtosis
## X1    1 12 2159.83 202.62 2096.5  2135.2 169.02 1951 2615   664 0.94    -0.35
##       se
## X1 58.49
## ------------------------------------------------------------ 
## group: 2015
##    vars  n    mean     sd median trimmed    mad  min  max range skew kurtosis
## X1    1 12 2298.17 265.19 2251.5  2263.6 217.94 1991 2951   960 1.06     0.39
##       se
## X1 76.55
## ------------------------------------------------------------ 
## group: 2016
##    vars  n    mean     sd median trimmed    mad  min  max range skew kurtosis
## X1    1 12 2323.58 292.64   2187  2308.9 188.29 2008 2786   778 0.49    -1.66
##       se
## X1 84.48
## ------------------------------------------------------------ 
## group: 2017
##    vars  n mean     sd median trimmed mad  min  max range skew kurtosis    se
## X1    1 12 2334 296.24 2195.5    2309 192 2040 2878   838 0.65    -1.29 85.52

# Converting the pollution dataset from daily to monthly
library(tidyquant)

## Loading required package: lubridate

## 
## Attaching package: 'lubridate'

## The following object is masked from 'package:base':
## 
##     date

## Loading required package: PerformanceAnalytics

## Loading required package: xts

## Loading required package: zoo

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Registered S3 method overwritten by 'xts':
##   method     from
##   as.zoo.xts zoo

## 
## Attaching package: 'xts'

## The following objects are masked from 'package:dplyr':
## 
##     first, last

## 
## Attaching package: 'PerformanceAnalytics'

## The following object is masked from 'package:graphics':
## 
##     legend

## Loading required package: quantmod

## Loading required package: TTR

## Registered S3 method overwritten by 'quantmod':
##   method            from
##   as.zoo.data.frame zoo

## Version 0.4-0 included new data defaults. See ?getSymbols.

## ══ Need to Learn tidyquant? ═════════════════════════════════════════════════════
## Business Science offers a 1-hour course - Learning Lab #9: Performance Analysis & Portfolio Optimization with tidyquant!
## </> Learn more at: https://university.business-science.io/p/learning-labs-pro </>

polmonth <- pol %>%
  tq_transmute(
    mutate_fun = apply.monthly, 
    FUN        = mean,
    na.rm      = TRUE
  )

#Creating a month and date columns for grouping
polmonth$month <- month(polmonth$Date)
polmonth$year <- year(polmonth$Date)

#Summary stats for aqi
psych::describeBy(polmonth$aqi,polmonth$year)

## 
##  Descriptive statistics by group 
## group: 2013
##    vars  n  mean   sd median trimmed  mad  min   max range skew kurtosis   se
## X1    1 12 45.89 3.51  47.15   46.03 3.88 40.1 50.33 10.23 -0.3    -1.66 1.01
## ------------------------------------------------------------ 
## group: 2014
##    vars  n  mean  sd median trimmed  mad   min   max range skew kurtosis   se
## X1    1 12 49.35 7.1  47.99   48.86 6.54 40.48 63.08  22.6 0.61    -0.82 2.05
## ------------------------------------------------------------ 
## group: 2015
##    vars  n  mean   sd median trimmed  mad   min   max range skew kurtosis   se
## X1    1 12 45.61 6.95  43.69   45.44 5.81 35.01 57.85 22.84 0.41    -1.11 2.01
## ------------------------------------------------------------ 
## group: 2016
##    vars  n  mean   sd median trimmed  mad   min   max range skew kurtosis   se
## X1    1 12 43.67 4.65  43.85   43.49 6.07 37.78 51.38  13.6 0.12    -1.52 1.34
## ------------------------------------------------------------ 
## group: 2017
##    vars  n  mean   sd median trimmed  mad   min   max range skew kurtosis   se
## X1    1 12 45.53 7.37   43.8   45.05 6.67 33.81 61.98 28.17 0.57    -0.23 2.13

#Summary stats for pm25
psych::describeBy(polmonth$pm25,polmonth$year)

## 
##  Descriptive statistics by group 
## group: 2013
##    vars  n  mean   sd median trimmed  mad   min   max range skew kurtosis   se
## X1    1 12 12.19 1.13  12.24   12.14 1.64 10.77 14.07   3.3 0.19    -1.56 0.33
## ------------------------------------------------------------ 
## group: 2014
##    vars  n  mean   sd median trimmed  mad   min   max range skew kurtosis   se
## X1    1 12 13.68 2.72  13.02   13.39 2.22 10.66 19.58  8.92 0.86    -0.39 0.79
## ------------------------------------------------------------ 
## group: 2015
##    vars  n  mean   sd median trimmed  mad  min   max range skew kurtosis   se
## X1    1 12 12.26 2.73  11.46   12.08 1.92 8.66 17.66     9 0.79    -0.65 0.79
## ------------------------------------------------------------ 
## group: 2016
##    vars  n  mean   sd median trimmed  mad  min   max range skew kurtosis   se
## X1    1 12 11.45 1.53  11.45    11.4 1.93 9.43 14.01  4.58 0.11    -1.38 0.44
## ------------------------------------------------------------ 
## group: 2017
##    vars  n  mean   sd median trimmed  mad  min   max range skew kurtosis   se
## X1    1 12 12.11 2.72  11.48   11.79 2.31 8.38 19.04 10.66  1.1     0.89 0.79

library(plotly)

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

#Create a column that combines month and year
cod$dt=paste(cod$Year,cod$Month, sep='-')

#Visualizing trends in deaths over the years 
xform=list(categoryorder = "array",categoryarray = cod$dt)

plot_ly(cod,x = ~ factor(dt), y = ~Deaths , type='scattergl', mode ='lines') %>%layout(title = "Circulatory and Respiratory Deaths",xaxis = xform)

# Yearly ranges for aqi and pm25
par(mfrow=c(1,2))
boxplot(pm25~year,
        data=polmonth,
        main="Monthly pm2.5",
        xlab="Year",
        col="light blue")

boxplot(aqi~year(Date),
        data=polmonth,
         main="Monthly AQI",
        xlab="Year",
        col="light green")

# AQI and pm25 ranges grouped by month
par(mfrow=c(1,2))
boxplot(pm25~month,
        data=polmonth,
        main="Monthly pm2.5",
        xlab="Month",
        col="light blue")

boxplot(aqi~month,
        data=polmonth,
         main="Monthly AQI",
        xlab="Year",
        col="light green")