Mortality and Economic Analysis of the NOAA Storm Database

Synopsis

The NOAA Storm Data contains 902,297 distinct storm events from 1950 to November 2011. Storm events are classified into 48 unique types (although due to inconsistent data input methods, there were almost 1000 different storm event types). The goal of this analysis was to determine: (1) the storm events with the most impact on population and (2) the storm events with the most impact on property/crop damage.

The following steps were undertaken: 1. Clean up the storm event type (EVTYPE) to reduce them from almost 1,000 different storm event types. Using regular expressions, I was able to narrow down the event types to just over 300 unique storm event types.
2. Transform the property and crop damage amounts into dollars. 3. Group fatalities and injuries by storm event type and plot histograms of the top-5 most impactful storm events (as measured by total fatalies and injuries). 4. Group property and crop damage by storm event typ and plot histograms of the top-5 most impactful storm events (as measured by total property and crop damage).

Set Global Options

opts_chunk$set(echo = TRUE)

Set working directory

setwd("/Users/GaryLo/Desktop/Data Science Specialization/Reproducible Research/Peer Assessment 2")

Load Packages

install.packages("ggplot2")
## Error: trying to use CRAN without setting a mirror
library(ggplot2)
library(ggplot2)
install.packages("knitr")
## Error: trying to use CRAN without setting a mirror
library(knitr)

Defining Multiplot function. Credit to: http://www.cookbook-r.com/Graphs/Multiple_graphs_on_one_page_(ggplot2)/

# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols:   Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  require(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                    ncol = cols, nrow = ceiling(numPlots/cols))
  }

 if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}

Data Processing

  1. If data doesn't exist in the working directory, the data was downloaded from the NOAA website using download.file()
  2. The data is read into R using read.csv()
  3. Formatting data (see below)
  4. Creating histograms of fatalaties/injuries/property damage/crop damage by EVTYPE. ## Download/Unzip Data
url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (file.exists("StormData.csv.bz2")){
    #skip download step
}else{
    download.file(url,"StormData.csv.bz2",method="curl")
}
## NULL

Read data into R

data = read.csv("StormData.csv.bz2")

Format and data cleanup.

First step was to modify date's to proper date variables. Leading/trailing blanks were removed from EVTYPE. All EVTYPE's were converted to lower case. Next, I matched several regular expressions to clean up the EVTYPE data. I then converted the PROPDMG and CROPDMG columns to proper dollar amounts.

data$BGN_DATE = as.Date(data$BGN_DATE, "%m/%d/%Y")
data$END_DATE = as.Date(data$END_DATE, "%m/%d/%Y")

trim <- function (x) gsub("^\\s+|\\s+$", "", x)
data$EVTYPE = trim(data$EVTYPE) #977
data$EVTYPE = tolower(data$EVTYPE) #890

#fixing thunderstorm
thunderMatch = grep("thunder",x=unique(data$EVTYPE),fixed=TRUE)
stormMatch = grep("storm",x=unique(data$EVTYPE),fixed=TRUE)
tstmMatch = grep("tstm",x=unique(data$EVTYPE),fixed=TRUE)
tstmMatch = tstmMatch[1:24] #removing non-tstm and marine tstm
thunderAndStormMatch = unique(c(stormMatch[stormMatch %in% thunderMatch],tstmMatch))
thunderstormMatchTerm = unique(data$EVTYPE)[thunderAndStormMatch]
data$EVTYPE[data$EVTYPE %in% thunderstormMatchTerm] = "thunderstorm wind" #785

#fixing hurricane/typhoon
hurricaneMatch = grep("hurricane",x=unique(data$EVTYPE),fixed=TRUE)
typhoonMatch = grep("typhoon",x=unique(data$EVTYPE),fixed=TRUE)
hurricaneOrTyphoonMatch = unique(c(hurricaneMatch,typhoonMatch))
hurricaneTyphoonMatchTerm = unique(data$EVTYPE)[hurricaneOrTyphoonMatch]
data$EVTYPE[data$EVTYPE %in% hurricaneTyphoonMatchTerm] = "hurricane/typhoon" #775

#fixing surf
surfMatch = grep("surf",x=unique(data$EVTYPE),fixed=TRUE)
swellMatch = grep("swell",x=unique(data$EVTYPE),fixed=TRUE)
surfOrSwellMatch = unique(c(surfMatch,swellMatch))
surfSwellMatchTerm = unique(data$EVTYPE)[surfOrSwellMatch]
data$EVTYPE[data$EVTYPE %in% surfSwellMatchTerm] = "high surf" #761

#grouping summary
summaryMatch = grep("summary",x=unique(data$EVTYPE),fixed=TRUE)
summaryMatchTerm = unique(data$EVTYPE)[summaryMatch]
data$EVTYPE[data$EVTYPE %in% summaryMatchTerm] = "summary of date" #695

#tornado match
tornadoMatch = grep("tornado",x=unique(data$EVTYPE),fixed=TRUE)
tornadoMatchTerm = unique(data$EVTYPE)[tornadoMatch]
data$EVTYPE[data$EVTYPE %in% tornadoMatchTerm] = "tornado" #682

#cold match
coldMatch = grep("cold",x=unique(data$EVTYPE),fixed=TRUE)
chillMatch = grep("chill",x=unique(data$EVTYPE),fixed=TRUE)
coldOrChillMatch = unique(c(coldMatch,chillMatch))
coldChillMatchTerm = unique(data$EVTYPE)[coldOrChillMatch]
data$EVTYPE[data$EVTYPE %in% coldChillMatchTerm] = "cold/wind chill" #633

#fixing flash flood
flashMatch = grep("flash",x=unique(data$EVTYPE),fixed=TRUE)
floodMatch = grep("flood",x=unique(data$EVTYPE),fixed=TRUE)
flashAndFloodMatch = floodMatch[floodMatch %in% flashMatch]
flashFloodMatchTerm = unique(data$EVTYPE)[flashAndFloodMatch]
data$EVTYPE[data$EVTYPE %in% flashFloodMatchTerm] = "flash flood" #611

#fixing flood
floodMatch = grep("flood",x=unique(data$EVTYPE),fixed=TRUE)
floodMatchTerm = unique(data$EVTYPE)[floodMatch]
floodMatchTerm = floodMatchTerm[floodMatchTerm != "flash flood"]
data$EVTYPE[data$EVTYPE %in% floodMatchTerm] = "flood" #542

#fixing tropical storm
tropicalMatch = grep("tropical",x=unique(data$EVTYPE),fixed=TRUE)
tropicalMatchTerm = unique(data$EVTYPE)[tropicalMatch]
tropicalMatchTerm = tropicalMatchTerm[tropicalMatchTerm != "tropical depression"]
data$EVTYPE[data$EVTYPE %in% tropicalMatchTerm] = "tropical storm" #538

# fixing hail
hailMatch = grep("hail",x=unique(data$EVTYPE),fixed=TRUE)
hailMatchTerm = unique(data$EVTYPE)[hailMatch]
data$EVTYPE[data$EVTYPE %in% hailMatchTerm] = "hail" #503

#fixing dry
dryMatch = grep("dry",x=unique(data$EVTYPE),fixed=TRUE)
droughtMatch = grep("drought",x=unique(data$EVTYPE),fixed=TRUE)
dryDroughtMatch = unique(c(dryMatch,droughtMatch))
dryDroughtMatchTerm = unique(data$EVTYPE)[dryDroughtMatch]
data$EVTYPE[data$EVTYPE %in% dryDroughtMatchTerm] = "drought" #470

#fix blizzard
blizzardMatch = grep("blizzard",x=unique(data$EVTYPE),fixed=TRUE)
blizzardMatchTerm = unique(data$EVTYPE)[blizzardMatch]
data$EVTYPE[data$EVTYPE %in% blizzardMatchTerm] = "blizzard" #456

#fix snow
snowMatch = grep("snow",x=unique(data$EVTYPE),fixed=TRUE)
snowMatchTerm = unique(data$EVTYPE)[snowMatch]
data$EVTYPE[data$EVTYPE %in% snowMatchTerm] = "snow" #354

#fix frost/freeze
frostMatch = grep("frost",x=unique(data$EVTYPE),fixed=TRUE)
freezeMatch = grep("freez",x=unique(data$EVTYPE),fixed=TRUE)
frostFreezeMatch = unique(c(frostMatch,freezeMatch))
frostFreezeMatchTerm = unique(data$EVTYPE)[frostFreezeMatch]
data$EVTYPE[data$EVTYPE %in% frostFreezeMatchTerm] = "frost/freeze" #332

#fix heat
heatMatch = grep("heat",x=unique(data$EVTYPE),fixed=TRUE)
heatMatchTerm = unique(data$EVTYPE)[heatMatch]
data$EVTYPE[data$EVTYPE %in% heatMatchTerm] = "heat" #324

#lightning match
lightningMatch = grep("lightn",x=unique(data$EVTYPE),fixed=TRUE)
lightningMatchTerm = unique(data$EVTYPE)[lightningMatch]
data$EVTYPE[data$EVTYPE %in% lightningMatchTerm] = "lightning" #315

#fixing property and crop damage values
data$PROPDMG[data$PROPDMGEXP == ""] <- data$PROPDMG[data$PROPDMGEXP == ""] * 1
data$PROPDMG[data$PROPDMGEXP == "h"] <- data$PROPDMG[data$PROPDMGEXP == "h"] * 100
data$PROPDMG[data$PROPDMGEXP == "k"] <- data$PROPDMG[data$PROPDMGEXP == "k"] * 1000
data$PROPDMG[data$PROPDMGEXP == "m"] <- data$PROPDMG[data$PROPDMGEXP == "m"] * 1e+06
data$PROPDMG[data$PROPDMGEXP == "B"] <- data$PROPDMG[data$PROPDMGEXP == "B"] * 1e+09
data$CROPDMG[data$CROPDMGEXP == ""] <- data$CROPDMG[data$CROPDMGEXP == ""] * 1
data$CROPDMG[data$CROPDMGEXP == "h"] <- data$CROPDMG[data$CROPDMGEXP == "h"] * 100
data$CROPDMG[data$CROPDMGEXP == "k"] <- data$CROPDMG[data$CROPDMGEXP == "k"] * 1000
data$CROPDMG[data$CROPDMGEXP == "m"] <- data$CROPDMG[data$CROPDMGEXP == "m"] * 1e+06
data$CROPDMG[data$CROPDMGEXP == "B"] <- data$CROPDMG[data$CROPDMGEXP == "B"] * 1e+09

EVTYPE Frequency

freqDF = data.frame(table(data$EVTYPE))
names(freqDF) = c("EVTYPE","FREQ")
freqDF = freqDF[order(freqDF$FREQ,decreasing=TRUE),]

Analysing FATALITIES and INJURIES as a function of EVTYPE

fatalitiesDF <- aggregate(data$FATALITIES, by = list(data$EVTYPE), FUN = sum)
names(fatalitiesDF) = c("EVTYPE","FATALITIES")
injuriesDF <- aggregate(data$INJURIES, by = list(data$EVTYPE), FUN = sum)
names(injuriesDF) = c("EVTYPE","INJURIES")

Fatalities and Injuries by Event Type

fatalitiesDF = fatalitiesDF[order(fatalitiesDF$FATALITIES,decreasing=TRUE),]
gFAT = ggplot(fatalitiesDF[1:5,],aes(EVTYPE,FATALITIES))
gFATplot = gFAT + geom_bar(stat="identity") + labs(x="Event Type",y="Number of Fatalities",title="Fatalities by Event Type (Top 5 - 75% of all Fatalities)")

injuriesDF = injuriesDF[order(injuriesDF$INJURIES,decreasing=TRUE),]
gINJ = ggplot(injuriesDF[1:5,],aes(EVTYPE,INJURIES))
gINJplot = gINJ + geom_bar(stat="identity") + labs(x="Event Type",y="Number of Injuries",title="Injuries by Event Type (Top 5 - 83% of all Injuries)")

multiplot(gFATplot,gINJplot,cols=2)
## Loading required package: grid

plot of chunk unnamed-chunk-9

Analysing PROPDMG and CROPDMG as a function of EVTYPE

propdmgDF <- aggregate(data$PROPDMG, by = list(data$EVTYPE), FUN = sum)
names(propdmgDF) = c("EVTYPE","PROPDMG")
cropdmgDF <- aggregate(data$CROPDMG, by = list(data$EVTYPE), FUN = sum)
names(cropdmgDF) = c("EVTYPE","CROPDMG")

Property and Crop Damage by Event Type

propdmgDF = propdmgDF[order(propdmgDF$PROPDMG,decreasing=TRUE),]
gPROPDMG = ggplot(propdmgDF[1:5,],aes(EVTYPE,PROPDMG))
gPROPDMGplot = gPROPDMG + geom_bar(stat="identity") + labs(x="Event Type",y="Property Damage $",title="Prop. Damage by Event (Top 5 - 92% of all Prop. Damage)")

cropdmgDF = cropdmgDF[order(cropdmgDF$CROPDMG,decreasing=TRUE),]
gCROPDMG = ggplot(cropdmgDF[1:5,],aes(EVTYPE,CROPDMG))
gCROPDMGplot = gCROPDMG + geom_bar(stat="identity") + labs(x="Event Type",y="Crop Damage $",title="Crop Damage by Event (Top 5 - 99% of all Crop Damage)")

multiplot(gPROPDMGplot,gCROPDMGplot,cols=2)

plot of chunk unnamed-chunk-11

Results

Once the data was cleaned up, I proceeded to first aggregate the data by fatalities and injuries by event type. Based on those two plots, it is clear that tornadoes are the most harmful storm event, despite the fact that tornado events only made up 7% of all of the event types. Doing the same for property and crop damage, it's clear that floods are the most costly event type. Again, this despite the fact that floods only made up 3% of all of the storm events. Interestingly, while flash floods were more than two times more common than “floods”, floods resulted in property damage that was more than 100x higher than flash floods (and more than 25,000x higher with respect to crop damage. The ice storm event came in a very close second to floods with respect to crop damage.