Synopsis:

The goal of this analysis was to determine the storm events that contributed the most to human and monetary costs. The Storms Database was very large, and had numerous issues with respect to storm type categorization, missing data, and unlikely outliers that needed to be managed prior to analysis. This report includes every step taken to process and analyze the data to arrive at a final conclusion. Tornadoes accounted for the greatest monetary costs and the greatest numbers of fatalities and injuries as well. The second greatest cost along all fronts came from Flood events, when Floods and Flash Floods were combined.

Data Processing:

The Storms Database is quite large, so I initially just read in one line to determine which columns I wanted to keep, then wrote a code bit to capture the column classes, and changed the ones I didn’t want to keep to ‘NULL’. I chose to use ‘read.csv()’ for the data import, even though it is slower. This is because my work Systems Administrator doesn’t allow different unzip utilities. I did cache the data to improve the functionality. I also did some initial cleaning of the data, excluding rows that didn’t report any damage, injuries, or fatalities.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# read in the data
        Data <- read.csv("repdata_data_StormData.csv.bz2",nrows=1)
        Class <- t(as.data.frame(lapply(Data,class)))
        Class <- as.data.frame(Class)
        Class$V1 <- as.character(Class$V1)
        Class[c(1,3,9:22,29,30,34:35,37),1] <- "NULL"
        Class[which(Class$V1=="logical"|Class$V1=="factor"),1] <- "character"
        Data <- read.csv("repdata_data_StormData.csv.bz2",stringsAsFactors=F,colClasses=Class[,1])
#Get rid of data without injuries, fatalities, or costs
        DataInj <- Data[which(Data$FATALITIES!=0|Data$INJURIES!=0),]
        DataCost <- Data[which(Data$PROPDMG!=0|Data$CROPDMG!=0),]
# recombine
        Data <- left_join(DataCost,DataInj)
## Joining by: c("BGN_DATE", "TIME_ZONE", "COUNTY", "COUNTYNAME", "STATE", "EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP", "ZONENAMES", "LATITUDE", "LONGITUDE", "REMARKS")
# cleanup
        rm(DataInj,DataCost,Class)

After import, and initial cleaning, the data structure is as follows:

str(Data)
## 'data.frame':    245117 obs. of  16 variables:
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ REMARKS   : chr  "" "" "" "" ...

There’s a lot more cleaning to do, so I coerced the dates to date format, and made all of the text fields upper case. Then, I imported table 2.1.1 “Storm Data Event Table” from the NATIONAL WEATHER SERVICE INSTRUCTION 10-1605, Storm Data Preparation, and used the ‘stringdist’ package to match the ‘EVTYPE’ field to the “official” Storm Events. I chose ‘Optimal String Alignment distance’ as a matching method, with a maximum distance of ‘1’. This reduces the number of ‘NA’ values introduced considerably. Increasing the distance only has a marginal effect on the number of ‘NA’ values introduced. This method decreases the number of unique event types to 46, and only leaves 1.9% of the dataset uncategorized. (Increasing the maximum distance to ‘2’ leaves 1.88% of the dataset uncategorized.)

Additionally, I did some data manipulations to calculate the total cost for each event. (Total Cost = Property Damage + Crop Damage)

library(tm)
## Loading required package: NLP
library(stringr)
library(stringdist)
Data$BGN_DATE <- as.Date(Data$BGN_DATE,format = "%m/%d/%Y")  # Dates to dates
Data$TIME_ZONE <- toupper(Data$TIME_ZONE)
Data$COUNTYNAME  <- toupper(Data$COUNTYNAME)
Data$STATE <- toupper(Data$STATE)
Data$EVTYPE  <- toupper(Data$EVTYPE)
Data$EVTYPE <- gsub("TSTM","THUNDERSTORM",Data$EVTYPE)
Data$EVTYPE <- gsub("  "," ",Data$EVTYPE)
Data$EVTYPE <- gsub("[(]","",Data$EVTYPE,perl = T)
Data$EVTYPE <- gsub("[)]","",Data$EVTYPE,perl=T)
pdf <- readPDF(control = list(c(text = "-layout")))
pdf <- pdf(elem=list(uri="pd01016005curr.pdf"),language="en")
List <- c(pdf$content[seq(397, 420)], pdf$content[seq(425, 448)])
List <- toupper(List)
# Categorize the Event Types...
Data$newEV <- List[amatch(Data$EVTYPE,List,method='osa',maxDist = 1)]
# Calculate the damage costs 
Data$PROPDMGEXP <- toupper(Data$PROPDMGEXP)
Data$CROPDMGEXP <- toupper(Data$CROPDMGEXP)
Data[which(Data$PROPDMGEXP=="K"),10] <- as.numeric(10^3)
Data[which(Data$PROPDMGEXP=="M"),10] <- as.numeric(10^6)
Data[which(Data$PROPDMGEXP=="B"),10] <- as.numeric(10^9)
Data[which(Data$CROPDMGEXP=="K"),12] <- as.numeric(10^3)
Data[which(Data$CROPDMGEXP=="M"),12] <- as.numeric(10^6)
Data[which(Data$CROPDMGEXP=="B"),12] <- as.numeric(10^9)
Data[which(Data$CROPDMGEXP<1000),12] <- 1
Data[which(Data$PROPDMGEXP<1000),10] <- 1
Data$PROPDMGEXP <- as.numeric(Data$PROPDMGEXP)
Data$CROPDMGEXP <- as.numeric(Data$CROPDMGEXP)
# Multiply out Damage Exponents
Data$TotProp <- Data$PROPDMG*Data$PROPDMGEXP
Data$TotCrop <- Data$CROPDMG*Data$CROPDMGEXP
Data$TotAll <- Data$TotProp+Data$TotCrop
library(knitr)
knit_hooks$set(htmlcap = function(before, options, envir) {
  if(!before) {
    paste('<p class="caption">',options$htmlcap,"</p>",sep="")
    }
    })
opts_chunk$set(warning=FALSE,
             message=FALSE,
                 echo=TRUE,
                 dpi=96,
                 fig.width=8, fig.height=5, # Default figure widths
                 dev="png", dev.args=list(type="cairo"), # The png device
                 # Change to dev="postscript" if you want the EPS-files
                 # for submitting. Also remove the dev.args() as the postscript
                 # doesn't accept the type="cairo" argument.
                 error=FALSE)

Results:

We are asked to answer two major questions with this data set, the most harmful storm events with respect to monetary and human costs.

Across the United States, which types of events are most harmful with respect to population health?

library(dplyr)
library(ggplot2)
HumFatal <- Data%>%
  group_by(newEV)%>%
  summarise(TotFatal=sum(FATALITIES))%>%
  arrange(-TotFatal)
# Remove NA row and grab top 5 causes
HumFatal <- HumFatal[-3,]
HumFatal <- HumFatal[1:5,]
HumFatal$newEV <- gsub("THUNDERSTORM","TSTM",HumFatal$newEV)
barplot(height =HumFatal$TotFatal,names.arg = HumFatal$newEV,main="Top Five Storm Events by Fatalities",xlab="Event",ylab="Number of Injuries")

Figure 1: This figure shows that Tornadoes caused the highest number of fatalities.

HumInj <- Data%>%
  group_by(newEV)%>%
  summarise(TotInj=sum(INJURIES))%>%
  arrange(-TotInj)
# Remove NA row and grab top 5 causes
HumInj <- HumInj[-4,]
HumInj <- HumInj[1:5,]
HumInj$newEV <- gsub("THUNDERSTORM","TSTM",HumInj$newEV)
barplot(height =HumInj$TotInj,names.arg = HumInj$newEV,main="Top Five Storm Events by Injuries",xlab="Event",ylab="Number of Injuries")

Figure 2: This figure shows that Tornadoes caused the highest number of injuries.

Tornadoes caused the highest number of Fatalities and Injuries during the period studied. Flooding, if you combine ‘Flash Floods’ and ‘Floods’ have the second highest Fatalities and Injuries.

Across the United States, which types of events have the greatest economic consequences?

On initial inspection of the cost data, it’s clear that there is an outlier somewhere. There is a flood event that reports around a trillion dollars in property damage. The associated remarks indicate that the flood damage to non-homes (e.g., businesses) was around $76 million. For the purposes of this assessment, I will reduce that reported value by 1000-fold (From ‘Billions’ to ‘Millions’).

library(dplyr)
library(knitr)
# Data[which(Data$TotAll==max(Data$TotAll,na.rm=T)),c(16)] --Not run in the report...
Data[which(Data$TotAll==max(Data$TotAll,na.rm=T)),20] <- Data[which(Data$TotAll==max(Data$TotAll,na.rm=T)),20]/1000 
Cost <- Data%>%
  group_by(newEV)%>%
  summarise(Total=sum(TotAll))%>%
  arrange(-Total)
Cost <- Cost[1:5,]
Cost[,2] <- Cost[,2]/10^9 #Make reasonable numbers
labels <- c("Storm Event","Cost in Billions of Dollars")
kable(Cost,format = "markdown",col.names = labels,align = 'c',digits = 2)
Storm Event Cost in Billions of Dollars
TORNADO 58.12
FLOOD 36.29
FLASH FLOOD 17.57
DROUGHT 15.02
ICE STORM 8.97
barplot(height =Cost$Total,names.arg = Cost$newEV,main="Top Five Storm Events by Cost",xlab="Event",ylab="Cost in Billions of Dollars")

Figure 3: This shows that Tornadoes caused the greatest amount of damage.

Once again, tornadoes are the most costly event accounting for $58 Billion in property and crop damage. Combined Flood related costs (Flash Floods and Floods combined) account for $53.9 Billion in costs.