Final Project CUNY 607

Analyze air pollution by US States on health and mortality rates:-

I am going to try and analyse a relationship between air pollution and respiratory problems and mortality rates associated with it.

The datasets i am targetting are for US geography.

Pollution Dataset:- https://data.world/data-society/us-air-pollution-data
Respiratory diseases Dataset:- http://ghdx.healthdata.org/record/united-states-chronic-respiratory-disease-mortality-rates-county-1980-2014

Use the neonatal mortality rate data to establish child deaths with pollution for that state. This rate defines how many child younger than 28 days die per 100 births in that state. https://data.world/health/infant-neonatal-mortality-rate https://www.kff.org/other/state-indicator/child-death-rate/?currentTimeframe=0&sortModel=%7B%22colId%22:%22Location%22,%22sort%22:%22asc%22%7D

Use Asthma data to show the mortality rates because of Asthma and highlight which states have most asthma rates and how that relates to the pollution rates in that state. https://www.cdc.gov/asthma/most_recent_data_states.htm

My goal to is establish the impact of air pollution on general respiratory health and mortality rates of population in US.

require(RCurl)

## Loading required package: RCurl

## Loading required package: bitops

library(ggplot2) # Data visualization
library(readr) # CSV file I/O, e.g. the read_csv function
library(data.table)
library(lubridate)

## Warning: package 'lubridate' was built under R version 3.4.4

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:data.table':
## 
##     hour, isoweek, mday, minute, month, quarter, second, wday,
##     week, yday, year

## The following object is masked from 'package:base':
## 
##     date

# Input data files are available in the Local directory.

pollution_data<- fread('/Users/ashishsm1986/git/Cuny-Assignments/Final-project/pollution_us_2000_2016.csv',sep = ',')

## 
Read 5.2% of 1746661 rows
Read 31.5% of 1746661 rows
Read 57.8% of 1746661 rows
Read 83.6% of 1746661 rows
Read 1746661 rows and 29 (of 29) columns from 0.373 GB file in 00:00:06

print(dim(pollution_data))

## [1] 1746661      29

##Fields(n=28)

#State Code : The code allocated by US EPA to each state - {OMIT}
#County code : The code of counties in a specific state allocated by US EPA
#Site Num : The site number in a specific county allocated by US EPA
#Address: Address of the monitoring site
#State : State of monitoring site
#County : County of monitoring site
#City : City of the monitoring site
#Date Local : Date of monitoring 

#The four pollutants (NO2, O3, SO2 and CO) each has 5 specific columns. For instance, for NO2:
#NO2 Units : The units measured for NO2
#NO2 Mean : The arithmetic mean of concentration of NO2 within a given day
#NO2 AQI : The calculated air quality index of NO2 within a given day
#NO2 1st Max Value : The maximum value obtained for NO2 concentration in a given day
#NO2 1st Max Hour : The hour when the maximum NO2 concentration was recorded in a given day

#Notes
# - Prediction Fields Designate Measurement Location & Date

#summary(source_data)
names(pollution_data)<-sapply(names(pollution_data),function(x)gsub(' ','',x))
pollution_data[,YEAR:=year(DateLocal)]
pollution_data[,MONTH:=month(DateLocal)]

Question1: How has Pollution Changed Over time?

#ggplots

g_NO2_month <- ggplot(data = pollution_data,aes(x = as.factor(MONTH), y = NO2Mean))
g_NO2_month <- g_NO2_month + geom_violin(draw_quantiles=c(.25,.5,.75))+ ggtitle('Mean Daily NO2 per Month')
g_NO2_month

g_NO2_year <- ggplot(data = pollution_data,aes(x = as.factor(YEAR), y = NO2Mean))
g_NO2_year <- g_NO2_year + geom_violin(draw_quantiles=c(.25,.5,.75))+ ggtitle('Mean Monthly NO2 per Year')
g_NO2_year

#Year 2000 seems to have the highest level of NO2 Mean and Year 2016 has the lowest.

g_SO2_month <- ggplot(data = pollution_data,aes(x = as.factor(MONTH), y = SO2Mean))
g_SO2_month <- g_SO2_month + geom_violin(draw_quantiles=c(.25,.5,.75))+ ggtitle('Mean Daily SO2 per Month')
g_SO2_month

g_SO2_year <- ggplot(data = pollution_data,aes(x = as.factor(YEAR), y = SO2Mean))
g_SO2_year <- g_SO2_year + geom_violin(draw_quantiles=c(.25,.5,.75))+ ggtitle('Mean Monthly SO2 per Year')
g_SO2_year

# Year 2006 seem to have highest levels of SO2 Mean


g_CO_month <- ggplot(data = pollution_data,aes(x = as.factor(MONTH), y = COMean))
g_CO_month <- g_CO_month + geom_violin(draw_quantiles=c(.25,.5,.75))+ ggtitle('Mean Daily CO per Month')
g_CO_month

g_CO_year <- ggplot(data = pollution_data,aes(x = as.factor(YEAR), y = COMean))
g_CO_year <- g_CO_year + geom_violin(draw_quantiles=c(.25,.5,.75))+ ggtitle('Mean Daily CO per Year')
g_CO_year

# Year 2000 seem to have highest levels of CO Mean


g_O3_month <- ggplot(data = pollution_data,aes(x = as.factor(MONTH), y = O3Mean))
g_O3_month <- g_O3_month + geom_violin(draw_quantiles=c(.25,.5,.75))+ ggtitle('Mean Daily O3 per Month')
g_O3_month

g_O3_year <- ggplot(data = pollution_data,aes(x = as.factor(YEAR), y = O3Mean))
g_O3_year <- g_O3_year + geom_violin(draw_quantiles=c(.25,.5,.75))+ ggtitle('Mean Daily O3 per Year')
g_O3_year

# Year 2000 seem to have highest levels of O3 Mean. Probably mean Ozone is on the decline as we see the Year 2016 has lowest O3 Mean.



# Lets do some Density Plots

g_NO2_dens <- ggplot(data = pollution_data,aes( x=NO2Mean,fill = as.factor(YEAR)))
g_NO2_dens <- g_NO2_dens + geom_density(alpha = 0.5) + ggtitle('Yearly Distributions of Daily NO2 Measurements')
g_NO2_dens

g_SO2_dens <- ggplot(data = pollution_data[SO2Mean<30,],aes( x=SO2Mean,fill = as.factor(YEAR)))
g_SO2_dens <- g_SO2_dens + geom_density(alpha = 0.5) + ggtitle('Yearly Distributions of Daily SO2 Measurements')
g_SO2_dens

g_CO_dens <- ggplot(data = pollution_data,aes( x=COMean,fill = as.factor(YEAR)))
g_CO_dens <- g_CO_dens + geom_density(alpha = 0.5) + ggtitle('Yearly Distributions of Daily CO Measurements')
g_CO_dens

g_O3_dens <- ggplot(data = pollution_data,aes( x=O3Mean,fill = as.factor(YEAR)))
g_O3_dens <- g_O3_dens + geom_density(alpha = 0.5) + ggtitle('Yearly Distributions of Daily O3 Measurements')
g_O3_dens

Question2: What is the State-wise Distribution of each Pollutant for each year?

Will try to map the pollutants with the state on the US Map.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:lubridate':
## 
##     intersect, setdiff, union

## The following objects are masked from 'package:data.table':
## 
##     between, first, last

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(maps)

## Warning: package 'maps' was built under R version 3.4.4

library(RColorBrewer)
data = pollution_data

stateCoordinates = fread("/Users/ashishsm1986/git/Cuny-Assignments/Final-project/stateCoordinates.csv") %>%
  mutate(region = tolower(State)) %>%
  select(-State) %>%
  data.table()

## Warning: package 'bindrcpp' was built under R version 3.4.4

#Fields=4
#State
#StateCode
#Latitude
#Longitude

pollutantsByState = data %>% 
  group_by(State) %>%
  summarise(O3Mean = mean(O3Mean),
            NO2Mean = mean(NO2Mean),
            COMean = mean(COMean),
            SO2Mean = mean(SO2Mean)) %>%
  ungroup() %>%
  mutate(State = tolower(State)) %>%
  rename(region = State) %>%
  merge(stateCoordinates, by = 'region', all.x = TRUE) %>%
  data.table()

setkey(pollutantsByState, region)


all_states <- data.table(map_data('state'))
setkey(all_states,region)

dataPlot <- all_states[pollutantsByState] %>% select(-subregion)
dim(dataPlot)

## [1] 14210    12

p1 <- ggplot() + 
  geom_polygon(data=dataPlot, 
               aes(x=long, 
                   y=lat, 
                   group = group, 
                   fill=O3Mean)) + 
  scale_fill_gradientn("O3 mean",
                       colours=brewer.pal(9,"YlOrRd"))+ 
  theme_bw()  + 
  labs(fill = "O3 mean" ,
       title = "Pollution in different states of USA (O3 level)", 
       x="", 
       y="") + 
  scale_y_continuous(breaks=c()) + 
  scale_x_continuous(breaks=c()) + 
  theme(panel.border =  element_blank()) + 
  geom_text(data = stateCoordinates %>% 
              filter(tolower(region) %in% unique(tolower(dataPlot$region)) &
                       !region %in% c("hawaii","alaska")), 
            aes(y = Latitude, x = Longitude, label = StateCode), color = 'Black', check_overlap = TRUE,size = 3)

p1

p2 <- ggplot() + 
  geom_polygon(data=dataPlot, 
               aes(x=long, 
                   y=lat, 
                   group = group, 
                   fill=NO2Mean)) + 
  scale_fill_gradientn("NO2 mean",
                       colours=brewer.pal(9,"Blues"))+ 
  theme_bw()  + 
  labs(fill = "Nitrogen dioxide (NO2) mean" ,
       title = "Pollution in different states of USA (NO2 level)", 
       x="", 
       y="") + 
  scale_y_continuous(breaks=c()) + 
  scale_x_continuous(breaks=c()) + 
  theme(panel.border =  element_blank()) + 
  geom_text(data = stateCoordinates %>% 
              filter(tolower(region) %in% unique(tolower(dataPlot$region)) &
                       !region %in% c("hawaii","alaska")), 
            aes(y = Latitude, x = Longitude, label = StateCode), color = 'Black', check_overlap = TRUE,size = 3)
p2

p3 <- ggplot() + 
  geom_polygon(data=dataPlot, 
               aes(x=long, 
                   y=lat, 
                   group = group, 
                   fill=SO2Mean)) + 
  scale_fill_gradientn("SO2 mean",
                       colours=brewer.pal(9,"Oranges"))+ 
  theme_bw()  + 
  labs(fill = "Nitrogen dioxide (SO2) mean" ,
       title = "Pollution in different states of USA (SO2 level)", 
       x="", 
       y="") + 
  scale_y_continuous(breaks=c()) + 
  scale_x_continuous(breaks=c()) + 
  theme(panel.border =  element_blank())+ 
  geom_text(data = stateCoordinates %>% 
              filter(tolower(region) %in% unique(tolower(dataPlot$region)) &
                       !region %in% c("hawaii","alaska")), 
            aes(y = Latitude, x = Longitude, label = StateCode), color = 'Black', check_overlap = TRUE,size = 3)
p3

p4 <- ggplot() + 
  geom_polygon(data=dataPlot, 
               aes(x=long, 
                   y=lat, 
                   group = group, 
                   fill=COMean)) + 
  scale_fill_gradientn("CO mean",
                       colours=brewer.pal(9,"Purples"))+ 
  theme_bw()  + 
  labs(fill = "Nitrogen dioxide (CO) mean" ,
       title = "Pollution in different states of USA (CO level)", 
       x="", 
       y="") + 
  scale_y_continuous(breaks=c()) + 
  scale_x_continuous(breaks=c()) + 
  theme(panel.border =  element_blank())+ 
  geom_text(data = stateCoordinates %>% 
              filter(tolower(region) %in% unique(tolower(dataPlot$region)) &
                       !region %in% c("hawaii","alaska")), 
            aes(y = Latitude, x = Longitude, label = StateCode), color = 'Black', check_overlap = TRUE,size = 3)
p4

Looking at the Maps and understanding NO2,SO2 and CO to be the primary air pollutants it seems the Eastern states have more concentration of Pollutants.

Q3:- What is the realtive Pollution to Asthma death in the US States.

I have considered Asthma to be the primary respiratory disease and will try to map it over the states. I will be considering NO2 as the primary most adverse pollutant as to my knowledge it is the gas which is most poisonous.

Now that we have got some analysis done on Pollution. I will try to relate the pollution with Mortality.

deaths = fread("/Users/ashishsm1986/git/Cuny-Assignments/Final-project/AsthmaMortality.csv")

#FIelds=3
#State,Number of Deaths, Percent Of Deaths

names(deaths) = c("State","Deaths","PercentageDeaths")
deaths = deaths %>% filter(!PercentageDeaths %in% c("Suppressed","Unreliable")) %>% mutate(region = tolower(State),PercentageDeaths = as.numeric(PercentageDeaths), State = tolower(State)) %>% data.table() %>% setkey(region) 

deathCoordinates = merge(stateCoordinates, deaths, by = "region")

deathsPlot = dataPlot[deaths]


p2 + geom_point(data = deathCoordinates%>% 
                  filter(tolower(State) %in% unique(tolower(dataPlot$region)) &
                           !State %in% c("Hawaii","Alaska")), aes(x = Longitude, y = Latitude, size = PercentageDeaths),alpha = 0.5, color = 'lightpink')+ 
  geom_point(data = deathCoordinates%>% 
               filter(tolower(State) %in% unique(tolower(dataPlot$region)) &
                        !State %in% c("Hawaii","Alaska")), aes(x = Longitude, y = Latitude, size = PercentageDeaths),shape = 1,alpha = 0.5, color = 'black') +
  scale_size_continuous(range = c(.1,20)) + 
  labs(title = "NO2 level vs Asthma deaths")

Arizona, Colorado and New York seem to have higest levels of NO2 but only New York (Eastern State) shows a higher percentage of Asthma Deaths. There might be other factors for Asthma in NY however but i couldnt locate them in the dataset unfortunately.

Q4:- Does the pollution show any relations to Per capita Healthcare expenses of States with Infant deaths and Asthma Deaths?

library(readr) # CSV file I/O, e.g. the read_csv function
library(reshape2)

## 
## Attaching package: 'reshape2'

## The following objects are masked from 'package:data.table':
## 
##     dcast, melt

library(repr)

## Warning: package 'repr' was built under R version 3.4.4

d <- read.csv("/Users/ashishsm1986/git/Cuny-Assignments/Final-project/mort.csv", stringsAsFactors=F)
d$Category <- as.factor(d$Category)
d <- d[d$Category=='Neonatal disorders',]
d <- d[,-2:-3]
state_d <- d[!(grepl(',', d$Location) | is.na(d$Location)),]
state_d <- state_d[!state_d$Location=='United States',]
# get all 'Mortality.Rate..<YEAR>.' labels

#Taking 5 year intervals
years <- c('1980', '1985', '1990', '1995', '2000', '2005', '2010', '2014')
mort_labels <- paste('Mortality.Rate..', years, sep='')
mort_labels <- paste(mort_labels, '.', sep='')

# reshape data
m.state_d <- melt(state_d,
                  id='Location',
                  measure.vars=mort_labels)

colnames(m.state_d) <- c('State', 'Year', 'Rate')

# replace 'Mortality.Rate..<YEAR>.' with '<YEAR>'
m.state_d$Year <- gsub('(^.*\\.\\.)|(\\.)', '', m.state_d$Year)
neoNatal = m.state_d %>% mutate(region = tolower(State)) %>% select(-State) %>% filter(Year == "2010") %>% rename(neoNatalDeathRate = Rate) %>% select(-Year)

# neoNatal %>% head()
# pollutantsByState %>% head()
asthmaDeaths = deaths %>% rename(asthmaMortality = PercentageDeaths) %>% select(-State,-Deaths)
infantDeaths = fread("/Users/ashishsm1986/git/Cuny-Assignments/Final-project/Linked_Birth_Infant_Death_Records.txt") %>%  select(State,`Death Rate`) %>% rename(region = State, infantDeathRate = `Death Rate`) %>% mutate(region = tolower(region))

## Warning in fread("/Users/ashishsm1986/git/Cuny-Assignments/Final-project/
## Linked_Birth_Infant_Death_Records.txt"): Stopped reading at empty line 54
## but text exists afterwards (discarded): "---"

#Considering only 2014 in order to avoid Neutralizing the Trend

healthCareExpense = fread("/Users/ashishsm1986/git/Cuny-Assignments/Final-project/State-healthcare-expenditure.csv", nrows = 52) %>% select(Location, `2014__Health Spending per Capita`) %>% rename(region = Location, healthCareSpending = `2014__Health Spending per Capita`) %>% mutate(region = tolower(region), healthCareSpending = as.numeric(healthCareSpending), healthCareNorm = healthCareSpending/mean(healthCareSpending, na.rm = T)) %>% select(-healthCareSpending)
# healthCareExpense %>% head()

allData = stateCoordinates %>% 
  merge(healthCareExpense, by = 'region', all.x = T) %>%
  merge(pollutantsByState %>% select(-StateCode, -starts_with("L")) , by = 'region', all.x = T) %>%
  merge(asthmaDeaths, by = 'region', all.x = T) %>%
  merge(infantDeaths, by = 'region', all.x = T) %>%
  merge(neoNatal, by = 'region', all.x = T)

# dim(allData)
#The bar plot plots the normalized mean pollutant values for each pollutant in each state.
#To normalize i divide the state's mean with the mean for all the states combined.
#this brings the values for different variables to similar scale and it becomes easier to see trends
#first two plots are same. they are different representations of the same values.
allData_m = melt(allData, id.vars = c("region", "StateCode")) %>% filter(!variable %in% c("Latitude","Longitude"))
# allData_m %>% head()
allData_m = allData_m %>% group_by(variable) %>% mutate(value = (value)/mean(value, na.rm = T))%>%rename(valueNorm = value)

# plot pollution
ggplot(allData_m %>% filter(variable %in% c("SO2Mean","COMean","O3Mean","NO2Mean")), aes(x = StateCode, y = valueNorm, fill = variable)) + geom_bar(stat = "identity", position = "dodge") + labs(y = "Normalized value") + theme(axis.text.x = element_text(angle = 90, size=20))

## Warning: Removed 20 rows containing missing values (geom_bar).

# plot mortality rates and healthcare expense

#This plot plots the healthcare expense, and different mortality rates. 

ggplot(allData_m %>% filter(!variable %in% c("SO2Mean","COMean","O3Mean","NO2Mean")), aes(x = StateCode, y = valueNorm, fill = variable)) + geom_bar(stat = "identity", position = "dodge") + labs(y = "Normalized value") + theme(axis.text.x = element_text(angle = 90, size=20))

## Warning: Removed 12 rows containing missing values (geom_bar).

# plot pollution

# Plot mortality rates and healthcare expense

ggplot(allData_m %>% filter(!variable %in% c("SO2Mean","COMean","O3Mean","NO2Mean")), aes(x = StateCode, y = variable, fill = valueNorm)) + geom_tile(stat = "identity", color = 'black') + labs(y = "Normalized value") + scale_fill_gradient(low='white', high='blue') +coord_flip()

Conclusions:-

Its interesting DC seems to have the highest Neonatal Deaths in the country.

So, In my analysis We realised eastern states to have comparatively more Air poollution.
From the plot above it seems like for most of the states the Percapita healthcare expenses is usually less than Infant death, Neonatal deaths and Asthma related death. Somewhat we get a feeling this proportion could be better.

Pollutants like SO2 and NO2 have recently Come down from earlier years like 2000 and 2006. Which may be an indicator of states taking steps to bring the pollution down over the years.
Although levels of particle pollution and ground-level ozone pollution are substantially lower than in the past, levels are unhealthy in numerous areas of the country. Both pollutants are the result of emissions from diverse sources, and travel long distances and across state lines.
Ozone can increase the frequency of asthma attacks, cause shortness of breath, aggravate lung diseases, and cause permanent damage to lungs through long-term exposure. Elevated ozone levels are linked to increases in hospitalizations, emergency room visits and premature death.
For unhealthy peak levels of sulfur dioxide and nitrogen dioxide, EPA is working with states and others on ways to determine where and how often unhealthy peaks occur.
Both pollutants cause multiple adverse respiratory effects including increased asthma symptoms, and are associated with increased emergency department visits and hospital admissions for respiratory illness.
Both pollutants cause environmental damage, and are byproducts of fossil fuel combustion.

Some of the current and future challenges with Air pollution were described here:- https://www.epa.gov/clean-air-act-overview/air-pollution-current-and-future-challenges

Final Project CUNY 607

Ashish Kumar

5/13/2018