Reproducible Research Assignment 2

To identify the weather events that contribute most to public health and economic impact in the USA, storm data from the U.S. National Oceanic and Atmospheric Administration (NOAA) covering 1950 to 2011 was analysed. The data was transformed using RStudio (details at the end of this document). The findings of the analysis are:

Tornadoes account for the most fatalities and injuries
This is followed by excessive heat
Floods cause the most property damage
Hurricanes: economic impact (property loss)

In conclusion, tornadoes have the most adverse public health impact of all weather events in the USA, and floods cause the most property damage.

Download and read dataset

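url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
# Download the archive only if it is not already present; binary mode keeps the bz2 file intact on Windows
if (!file.exists("archiveRaw.csv.bz2")) download.file(url, "archiveRaw.csv.bz2", mode = "wb")
# read.csv reads the bz2-compressed file directly
storm <- read.csv("archiveRaw.csv.bz2", header = TRUE, sep = ",", quote = "\"", stringsAsFactors = FALSE)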

# Convert the exponent codes to upper case so that lower-case codes match the keys below
storm$PROPDMGEXP <- toupper(storm$PROPDMGEXP)
storm$CROPDMGEXP <- toupper(storm$CROPDMGEXP)

# Map property damage alphanumeric exponents to numeric values
propDmgKey <-  c("\"\"" = 10^0,
             "-" = 10^0, 
             "+" = 10^0,
             "0" = 10^0,
             "1" = 10^1,
             "2" = 10^2,
             "3" = 10^3,
             "4" = 10^4,
             "5" = 10^5,
             "6" = 10^6,
             "7" = 10^7,
             "8" = 10^8,
             "9" = 10^9,
             "H" = 10^2,
             "K" = 10^3,
             "M" = 10^6,
             "B" = 10^9)

storm$PROPDMGEXP <- propDmgKey[as.character(storm$PROPDMGEXP)]
storm$PROPDMGEXP[is.na(storm$PROPDMGEXP)] <- 10^0

# Map crop damage alphanumeric exponents to numeric values
cropDmgKey <-  c("\"\"" = 10^0,
             "?" = 10^0, 
             "0" = 10^0,
             "K" = 10^3,
             "M" = 10^6,
             "B" = 10^9)
storm$CROPDMGEXP <- cropDmgKey[as.character(storm$CROPDMGEXP)]
storm$CROPDMGEXP[is.na(storm$CROPDMGEXP)] <- 10^0
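
The lookup above relies on indexing a named numeric vector by character codes. As a minimal sketch on made-up input (the vector `codes` and the shortened key `key` are illustrative and not taken from the dataset), the following shows how codes without a matching name, including the empty string, come back as NA and are then set to a multiplier of 1:

```r
# Shortened, illustrative key: exponent code -> multiplier
key <- c("H" = 10^2, "K" = 10^3, "M" = 10^6, "B" = 10^9)

# Hypothetical exponent codes, including an empty string and an unknown symbol
codes <- c("K", "m", "", "B", "?")

# Upper-case first, then look each code up by name
mult <- unname(key[toupper(codes)])

# Codes without a matching name return NA; treat them as a multiplier of 1
mult[is.na(mult)] <- 10^0

mult
```

Because the NA fallback catches every unmatched code, the key only needs entries for codes whose multiplier differs from 1.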

Analysis

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

For this question, we need to define a response variable that aggregates the health effects. There are two candidate variables: INJURIES and FATALITIES.

I assume that there is a relationship between the number of fatalities and the number of injuries:

summary(lm(FATALITIES ~ INJURIES, data = storm))

## 
## Call:
## lm(formula = FATALITIES ~ INJURIES, data = storm)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -70.07  -0.01  -0.01  -0.01 582.99 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.0097265  0.0007631   12.74   <2e-16 ***
## INJURIES    0.0453207  0.0001404  322.71   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7246 on 902295 degrees of freedom
## Multiple R-squared:  0.1035, Adjusted R-squared:  0.1035 
## F-statistic: 1.041e+05 on 1 and 902295 DF,  p-value: < 2.2e-16

There is a statistically significant, though weak, relationship between INJURIES and FATALITIES (R-squared of about 0.10). The slope estimate of 0.0453 fatalities per injury corresponds to about 22 injuries per fatality on average (1/0.0453 ≈ 22). We will use this regression parameter estimate to create a summary variable of damages on health later on. Here is a plot of the discussed relationship:

library(ggplot2)

ggplot(storm, aes(x = INJURIES, y = FATALITIES)) +
    geom_point(alpha = 0.1) +
    geom_smooth(method = "lm") +
    ggtitle(label = "relationship of INJURIES vs FATALITIES in weather related accidents") +
    coord_cartesian(xlim = c(0, 1000), ylim = c(0, 1000))

Next, we will create a new variable that describes the total health damage caused by a weather event. We will use the regression estimate to assign a weight to FATALITIES: each fatality counts as roughly 22 injuries. The new variable combining the health-related effects is called PEOPLEDAMAGE.

In the following commands, we compute PEOPLEDAMAGE and aggregate the storm dataset by EVTYPE to get the sum of this new variable for each event type.

We find that the most harmful type of weather event in the United States is the tornado.

storm$PEOPLEDAMAGE <- storm$INJURIES + 1/0.0453207 * storm$FATALITIES

aggregated <- aggregate(PEOPLEDAMAGE ~ EVTYPE, storm, FUN = sum)

aggregated[which.max(aggregated$PEOPLEDAMAGE),]

##      EVTYPE PEOPLEDAMAGE
## 834 TORNADO       215638

aggregated <- aggregated[with(aggregated,order(PEOPLEDAMAGE, decreasing = TRUE)),]
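
To make the weighting concrete, the following sketch applies the same formula to a small made-up data frame. The data frame `toy_health` and its counts are hypothetical; only the slope 0.0453207 comes from the regression output above.

```r
# Slope from the regression of FATALITIES on INJURIES reported above
slope <- 0.0453207
weight <- 1 / slope  # injuries counted per fatality, roughly 22

# Hypothetical event records, for illustration only
toy_health <- data.frame(
  EVTYPE     = c("A", "A", "B", "C"),
  INJURIES   = c(10, 0, 30, 5),
  FATALITIES = c(1, 2, 0, 1),
  stringsAsFactors = FALSE
)

# Injury-equivalent health damage per record
toy_health$PEOPLEDAMAGE <- toy_health$INJURIES + weight * toy_health$FATALITIES

# Total per event type, sorted from most to least harmful
toy_health_agg <- aggregate(PEOPLEDAMAGE ~ EVTYPE, toy_health, FUN = sum)
toy_health_agg <- toy_health_agg[order(toy_health_agg$PEOPLEDAMAGE, decreasing = TRUE), ]
toy_health_agg
```

In this toy data, event type "A" ranks first even though "B" has more injuries, because its three fatalities are each weighted as about 22 injuries.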

Across the United States, which types of events have the greatest economic consequences?

To analyze economic damage, we construct a variable combining all damage to assets. We obtain this variable by summing the raw PROPDMG and CROPDMG values. The new variable is called THINGDMG.

storm$THINGDMG <- as.numeric(storm$PROPDMG) + as.numeric(storm$CROPDMG) 
aggregated2 <- aggregate(THINGDMG ~ EVTYPE, storm, FUN = sum)
as.character(aggregated2[which.max(aggregated2$THINGDMG),1])

## [1] "TORNADO"

aggregated2 <- aggregated2[with(aggregated2,order(THINGDMG, decreasing = TRUE)),]
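
The sum above adds the raw PROPDMG and CROPDMG magnitudes. If dollar amounts are wanted, one option is to scale each magnitude by the numeric multiplier stored in PROPDMGEXP and CROPDMGEXP during preprocessing. The sketch below shows this on a small made-up data frame. The names `toy_econ`, `PROPDOLLARS`, `CROPDOLLARS` and `TOTALDOLLARS` are hypothetical, and the scaling is an assumption about how the multipliers could be used, not the calculation behind the result reported above.

```r
# Hypothetical records; the *EXP columns already hold numeric multipliers,
# as they do in storm after the mapping step
toy_econ <- data.frame(
  EVTYPE     = c("A", "B", "B"),
  PROPDMG    = c(500, 1.5, 2),
  PROPDMGEXP = c(10^3, 10^9, 10^6),
  CROPDMG    = c(0, 10, 0),
  CROPDMGEXP = c(10^0, 10^6, 10^0),
  stringsAsFactors = FALSE
)

# Scale each magnitude by its multiplier to get dollar amounts
toy_econ$PROPDOLLARS  <- toy_econ$PROPDMG * toy_econ$PROPDMGEXP
toy_econ$CROPDOLLARS  <- toy_econ$CROPDMG * toy_econ$CROPDMGEXP
toy_econ$TOTALDOLLARS <- toy_econ$PROPDOLLARS + toy_econ$CROPDOLLARS

# Total per event type, sorted from largest to smallest
toy_econ_agg <- aggregate(TOTALDOLLARS ~ EVTYPE, toy_econ, FUN = sum)
toy_econ_agg <- toy_econ_agg[order(toy_econ_agg$TOTALDOLLARS, decreasing = TRUE), ]
toy_econ_agg
```

In this toy data the raw magnitudes would rank "A" first (500 against 13.5), while the scaled totals rank "B" first, so the choice between the two measures can change the ordering.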

Summary of results

library(reshape2)
agg <- melt(head(aggregated,10))

## Warning in melt.data.table(head(aggregated, 10)): To be consistent with
## reshape2's melt, id.vars and measure.vars are internally guessed when both
## are 'NULL'. All non-numeric/integer/logical type columns are conisdered
## id.vars, which in this case are columns [EVTYPE, variable]. Consider
## providing at least one of 'id' or 'measure' vars in future.

## Duplicate column names found in molten data.table. Setting unique names using 'make.names'

ggplot(agg, aes(x = reorder(EVTYPE, value), y = value)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    ggtitle(label = "Health effects of weather events") +
    xlab("Event type") +
    ylab("Injuries and equivalents") +
    coord_flip()

agg <- melt(head(aggregated2,15))

## Using EVTYPE as id variables

ggplot(agg, aes(x = reorder(EVTYPE, value), y = value)) +
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    ggtitle(label = "Economic effects of weather events") +
    xlab("Event type") +
    ylab("Economic damage in USD") +
    scale_y_continuous(labels = scales::dollar) +
    coord_flip()

Measured by the summed THINGDMG variable, the most economically harmful type of meteorological event in the United States is the tornado.

Session Info

Analysis performed in R; session info follows.

sessionInfo()

## R version 3.3.2 (2016-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: macOS Sierra 10.12.1
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] data.table_1.9.6 reshape2_1.4.2   ggplot2_2.1.0   
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.7      digest_0.6.10    assertthat_0.1   chron_2.3-47    
##  [5] plyr_1.8.4       grid_3.3.2       gtable_0.2.0     formatR_1.4     
##  [9] magrittr_1.5     evaluate_0.10    scales_0.4.0     stringi_1.1.2   
## [13] rmarkdown_1.1    labeling_0.3     tools_3.3.2      stringr_1.1.0   
## [17] munsell_0.4.3    yaml_2.1.13      colorspace_1.2-7 htmltools_0.3.5 
## [21] knitr_1.14       tibble_1.2

Michal Hron

11/15/2016
