Synopsis (Executive Summary)

This report analyzes the US NOAA storm data to identify and assess the main hazards to human life, human health and the economy. After extensive data cleaning, the data is summarized graphically for the main event types that pose hazards in these three categories.

The results show that storms (tornado, hurricane, etc.), heat and floods (river, ocean, etc.) are the main hazards in all three categories, which intuitively makes sense. More granular results can be found in the Results section of this report.

After identifying the main hazard types and their impact, additional analysis of the spatial and temporal distribution of the fatalities and damages would be required to make better-informed decisions about the allocation of emergency funds across the United States. A preliminary analysis was conducted by the author and can be found here.

Data Processing

First, the working directory is set and the required packages are loaded.

setwd("~/Google Drive/DataScienceClasses/Reproducible Research/Assignment2")
library(data.table)
library(dplyr)
library(ggplot2)
library(lubridate)
library(R.utils)

To pre-process the data, the following steps were performed:

  1. The data is downloaded and unzipped.
  2. The data is read into a data table for processing in R.
  3. Only the relevant columns are selected.
  4. Only the datapoints with fatality, injury or damage (crop and property) are selected.
  5. A decoding table for the exponents is generated and merged into the data for both crop damage and property damage.
  6. The final dollar-values for crop and property damage are calculated.
  7. The data is reduced by getting rid of the now obsolete exponent columns.

# Download file and unzip
if(!file.exists("StormData.csv")){
  download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "StormData.csv.bz2")
  bunzip2("StormData.csv.bz2")
}

# Read data (datatable package for faster processing)
dat <- fread("StormData.csv")
## Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:11
# Selecting relevant columns and only rows with harm > 0 in at least one column to reduce data strain
dat <- dat %>% select(BGN_DATE, STATE, EVTYPE, FATALITIES:CROPDMGEXP) %>% filter(FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0)

# Parsing dates (only needed for appendix)
dat$BGN_DATE <- mdy_hms(dat$BGN_DATE)

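# Extracting the unique exponent symbols and building the decoding table
# NOTE: the multipliers are matched by position, so they depend on the order returned by unique()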
decode <- data.frame(symb = unique(append(unique(dat$CROPDMGEXP), unique(dat$PROPDMGEXP))),
           decode = 10^c(0,6,3,6,9,0,0,3,0,5,6,4,2,2,7,3,3,0))

# The following merges in the decode table, renames the merged columns, computes the value 
# (multiply exponent with value) and reduces the data by removing unwanted columns
# Pipeline operator for the win :-)
dat <- dat %>% 
  merge(y = decode, by.x = "PROPDMGEXP", by.y = "symb") %>% 
  rename(PropExp = decode) %>%
  merge(y = decode, by.x = "CROPDMGEXP", by.y = "symb") %>% 
  rename(CropExp = decode) %>%
  mutate(PropDam = PROPDMG*PropExp) %>% 
  mutate(CropDam = CROPDMG*CropExp) %>%
  select(BGN_DATE:INJURIES,PropDam,CropDam)

After all these steps, the data can then be used for the analysis as outlined in the next section.
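
The decoding table above assigns multipliers by position in the output of unique(), which can silently break if the input rows or their order change. As a minimal sketch of an alternative, the multiplier can be looked up by symbol instead. The helper name `decode_exponent` and the toy data are hypothetical, and the convention used here (H = 10^2, K = 10^3, M = 10^6, B = 10^9, a digit d = 10^d, anything else = 1) is an assumption inferred from the values in the table above, not something documented in this report.

```r
# Hypothetical helper (not part of the original analysis): map a damage-exponent
# symbol to its multiplier by name rather than by position.
decode_exponent <- function(symb) {
  symb <- toupper(trimws(as.character(symb)))
  # Assumed convention: H = hundreds, K = thousands, M = millions, B = billions
  letters_map <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  # Default multiplier of 1 for "", "+", "-", "?" and any unknown symbol
  mult <- rep(1, length(symb))
  is_letter <- symb %in% names(letters_map)
  mult[is_letter] <- unname(letters_map[symb[is_letter]])
  # Assumed convention: a single digit d stands for 10^d
  is_digit <- grepl("^[0-9]$", symb)
  mult[is_digit] <- 10^as.numeric(symb[is_digit])
  mult
}

# Toy data standing in for the PROPDMG / PROPDMGEXP columns
toy <- data.frame(PROPDMG = c(2.5, 10, 1, 3, 7),
                  PROPDMGEXP = c("K", "m", "B", "", "5"),
                  stringsAsFactors = FALSE)
toy$PropDam <- toy$PROPDMG * decode_exponent(toy$PROPDMGEXP)
print(toy)
```

Because the lookup is keyed on the symbol itself, the same function could be applied to both exponent columns without building or merging a separate table.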

Analysis

In this section, the cleaned data from the previous step is summarized by event type for fatalities, injuries and total economic damage (crop plus property). After the summary, the top 10 event types for each category are plotted.

REMARK: According to a Coursera Forum discussion, the flood data for California in 2006 does not have the correct multiplier (i.e. it should be millions instead of billions). For the sake of consistency and simplicity in this homework, and because the data may contain other errors as well (see link above), I have not corrected for this in the analysis.

# Summarizing Fatalities
fat <- dat %>% 
  group_by(EVTYPE) %>% 
  summarise(Fatalities = sum(FATALITIES)) %>% 
  arrange(desc(Fatalities)) %>% 
  top_n(10, Fatalities)

# Plot Fatalities
ggplot(data = fat, aes(x = reorder(EVTYPE, Fatalities), y = Fatalities)) + 
  geom_bar(stat = "identity") + 
  coord_flip() + 
  xlab("Event Type") + 
  ggtitle("Fatalities")


# Summarizing Injuries
inj <- dat %>% 
  group_by(EVTYPE) %>% 
  summarise(Injuries = sum(INJURIES)) %>% 
  arrange(desc(Injuries)) %>% 
  top_n(10, Injuries)

# Plot Injuries
ggplot(data = inj, aes(x = reorder(EVTYPE,Injuries), y = Injuries)) + 
  geom_bar(stat = "identity") + 
  coord_flip() + 
  xlab("Event Type") + 
  ggtitle("Injuries")


# Summarizing total damage (crop + property)
dmg <- dat %>% 
  mutate(totdmg = CropDam+PropDam) %>% 
  group_by(EVTYPE) %>% 
  summarise(TotalDamage = sum(totdmg)) %>% 
  arrange(desc(TotalDamage)) %>% 
  top_n(10, TotalDamage)

# Plot Damages
ggplot(data = dmg, aes(x = reorder(EVTYPE,TotalDamage), y = TotalDamage)) + 
  geom_bar(stat = "identity") +
  coord_flip() + 
  xlab("Event Type") + ylab("Total Damage in $") + ggtitle("Economic Damage in USD")
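
The three summaries above repeat the same pattern: group by event type, sum one column, and keep the largest totals. As a rough sketch, that pattern could be wrapped in a single function. The name `top_events` and the toy data are hypothetical, and the sketch uses base R rather than dplyr so that it runs without extra packages. Unlike `top_n()`, `head()` does not keep ties at the cut-off.

```r
# Hypothetical helper: total a numeric column per event type and keep the n largest.
top_events <- function(data, value_col, n = 10) {
  # Sum the chosen column within each event type (named numeric vector)
  totals <- vapply(split(data[[value_col]], data$EVTYPE), sum, numeric(1))
  # Rank from largest to smallest and keep the first n
  totals <- head(sort(totals, decreasing = TRUE), n)
  data.frame(EVTYPE = names(totals), Total = as.numeric(totals),
             stringsAsFactors = FALSE)
}

# Toy data standing in for the cleaned storm data
toy <- data.frame(
  EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(5, 2, 3, 4, 1, 0),
  stringsAsFactors = FALSE
)
top3 <- top_events(toy, "FATALITIES", n = 3)
print(top3)
```

The same call with a different `value_col` would cover injuries or total damage, provided the column already exists in the data.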


# Assessing composition of damages
perc <- dat %>% summarize(CD = sum(CropDam), PD = sum(PropDam))
# Share of property damage in total damage (as a fraction)
perc$PD / (perc$CD + perc$PD)
## [1] 0.8971272
# Share of crop damage in total damage (as a fraction)
perc$CD / (perc$CD + perc$PD)
## [1] 0.1028728

Results

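It can be seen that storms (tornado, hurricane, etc.), heat and floods (river, ocean, etc.) are the main hazards with respect to fatalities, injuries and economic damage.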

Based on these conclusions, research and emergency funds should be allocated to improving the resilience of the US population as well as the economy against these hazard types.

In addition, it was found that crop damages account for only around 10% of all damages, whereas property damages account for 90%.

To further inform allocation of research and emergency funding, additional analysis on the temporal and spatial distribution of the fatalities and damages was performed and can be found HERE.


Appendix: Session info for full reproducibility

sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.3 (El Capitan)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] R.utils_2.2.0     R.oo_1.19.0       R.methodsS3_1.7.0 lubridate_1.5.0  
## [5] ggplot2_1.0.1     dplyr_0.4.3       data.table_1.9.6 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.12.1      knitr_1.12       magrittr_1.5     MASS_7.3-43     
##  [5] munsell_0.4.2    colorspace_1.2-6 R6_2.1.1         stringr_1.0.0   
##  [9] plyr_1.8.3       tools_3.2.2      parallel_3.2.2   grid_3.2.2      
## [13] gtable_0.1.2     DBI_0.3.1        htmltools_0.2.6  lazyeval_0.1.10 
## [17] yaml_2.1.13      assertthat_0.1   digest_0.6.8     reshape2_1.4.1  
## [21] formatR_1.2.1    evaluate_0.8     rmarkdown_0.9.2  labeling_0.3    
## [25] stringi_0.5-5    scales_0.3.0     chron_2.3-47     proto_0.3-10
