With this analysis we seek to inform interested parties of the historically costliest storm events with regard to health and economic factors. To perform this analysis we use the NOAA US Storm Data, which encompasses county-reported data on storm type and outcomes across the U.S. from 1950 to 2011. From this we summarize and rank the historic total number of injuries and fatalities, and the dollar value of crop damage and property damage, by storm type across the entire U.S. We find that most damage and costs are associated with just a few storm event categories, with tornadoes being the historically costliest overall. Full documentation regarding the data is available at Storm Data Documentation. Further clarifications can be found in Storm Events FAQ.
We use the NOAA US Storm Data to assess historic health and economic costs of U.S. Storms.
To perform this analysis we make use of the reshape2, ggplot2, and plyr libraries written by Hadley Wickham.
require(reshape2)
require(ggplot2)
require(plyr)
We load the full dataset into the stormdatafull data frame. Note: we will need to order and rank elements of the dataset later, so we choose not to have R assign factor levels at this time.
stormdatafull <- read.csv(bzfile("repdata%2Fdata%2FStormData.csv.bz2"),
stringsAsFactors=FALSE)
From the Storm Data Documentation we know there is much more data than we need. We see that the data consists of multiple numeric values and classifiers reported for individual storm events at specific county locations. We are interested in the storm event type classifier (“EVTYPE”), the health cost data (“FATALITIES”, “INJURIES”), the economic cost data (“PROPDMG”, “CROPDMG”), and the multipliers for the economic cost data (“PROPDMGEXP”, “CROPDMGEXP”). We are only interested in the historic aggregate effect across the whole U.S.; date and location data aren’t necessary for our analysis. We subset the stormdatafull data frame into stormdata, keeping the columns listed above and discarding the rest.
ofinterest <- c("EVTYPE", "FATALITIES", "INJURIES",
"PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
stormdata <- stormdatafull[, ofinterest]
A quick look at our data subset shows us:
str(stormdata)
## 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
There are 902297 datapoints.
The data is not in a tidy format.
First we clean up the column names:
names(stormdata) <- tolower(names(stormdata))
Then we ensure our classifier column evtype has tidy labels as values:
stormdata$evtype <- tolower(stormdata$evtype)
stormdata$evtype <- gsub("[[:punct:]]", "", stormdata$evtype)
stormdata$evtype <- gsub("[[:space:]]", "", stormdata$evtype)
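As a small illustration of what these three steps do, the sketch below applies them to a handful of made-up labels. The helper name normalize_evtype and the sample strings are assumptions for illustration only; they are not part of the analysis itself.

```r
# Hypothetical helper wrapping the three cleaning steps used above.
normalize_evtype <- function(x) {
  x <- tolower(x)                    # fold case
  x <- gsub("[[:punct:]]", "", x)    # drop punctuation such as "/" and "."
  gsub("[[:space:]]", "", x)         # drop all whitespace
}

# Made-up labels in the style of the raw EVTYPE column.
raw <- c("TSTM WIND", "Hurricane/Typhoon", " Flash Flood ", "THUNDERSTORM WINDS.")
cleaned <- normalize_evtype(raw)
print(cleaned)
```

Note that this only merges labels that differ in case, spacing, or punctuation. Synonymous labels such as tstmwind and thunderstormwind remain separate categories.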
According to the Storm Data Documentation, there should only be four types of propdmgexp and cropdmgexp values (“B”, “M”, “K”, “”). Looking at the actual data we see more entries than just the official classifiers.
unique(stormdata$propdmgexp)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
unique(stormdata$cropdmgexp)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
Since these provide the dollar scale for the numeric economic values, any labels outside of the four allowed render that data point potentially meaningless. We wouldn’t properly know how to account for those values. So we first make sure these classifiers are uniformly uppercase.
stormdata$propdmgexp <- toupper(stormdata$propdmgexp)
stormdata$cropdmgexp <- toupper(stormdata$cropdmgexp)
Then we remove all rows with garbage propdmgexp and cropdmgexp values.
stormdata <- stormdata[stormdata$propdmgexp %in% c("B", "M", "K", ""),]
stormdata <- stormdata[stormdata$cropdmgexp %in% c("B", "M", "K", ""),]
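Because this filter discards whole rows, it can be worth counting how many rows it drops. The sketch below shows one way to do that on a toy data frame; the data and the names toy, valid, keep, kept, and dropped are assumptions made up for illustration.

```r
# Toy data: rows 2 and 3 carry an exponent code outside the allowed set.
toy <- data.frame(
  propdmg    = c(25, 2.5, 10, 1),
  propdmgexp = c("K", "+", "M", ""),
  cropdmg    = c(0, 0, 5, 0),
  cropdmgexp = c("", "", "?", "K"),
  stringsAsFactors = FALSE
)

valid <- c("B", "M", "K", "")

# A row is kept only if both exponent columns hold an allowed code.
keep <- toy$propdmgexp %in% valid & toy$cropdmgexp %in% valid
kept <- toy[keep, ]
dropped <- sum(!keep)

cat("rows dropped:", dropped, "of", nrow(toy), "\n")
```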
The propdmgexp and cropdmgexp represent encoded multipliers of the economic data. To give property damage and crop damage data proper scale we need to multiply each monetary data point by the proper factor.
To do this we replace the encoded multiplier values with their numeric values (e.g. “B” = 1000000000) for each dmgexp column.
stormdata$propdmgexp <- sapply(stormdata$propdmgexp,
function(x) switch(x, "B" = 1000000000, "M" = 1000000, "K" = 1000, 1))
stormdata$cropdmgexp <- sapply(stormdata$cropdmgexp,
function(x) switch(x, "B" = 1000000000, "M" = 1000000, "K" = 1000, 1))
Then we multiply each dmg column by the corresponding dmgexp column and store the result, scaled to units of millions of dollars, back in the dmg column.
stormdata$propdmg <- stormdata$propdmg*stormdata$propdmgexp/1000000
stormdata$cropdmg <- stormdata$cropdmg*stormdata$cropdmgexp/1000000
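The decoding and scaling steps can also be sketched with a vectorized lookup instead of sapply and switch. This is an alternative illustration, not the code used in this report; the names valid_mult, dmg, dmgexp, and dmg_millions are hypothetical, and the sketch assumes rows with unofficial codes have already been removed (otherwise they would silently receive a multiplier of 1).

```r
# Named lookup table for the three non-empty official codes.
valid_mult <- c(B = 1e9, M = 1e6, K = 1e3)

# Toy damage values and their exponent codes.
dmg    <- c(25, 2.5, 1.2, 7)
dmgexp <- c("K", "M", "B", "")

# Indexing by name is vectorized; "" matches no name and yields NA.
mult <- unname(valid_mult[dmgexp])
mult[is.na(mult)] <- 1   # treat the empty code as "no multiplier"

# Scale to millions of dollars, as in the analysis above.
dmg_millions <- dmg * mult / 1e6
print(dmg_millions)
```

For example, the first toy entry is 25 with code “K”, i.e. 25 * 1000 / 1000000 = 0.025 million dollars.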
We no longer need the propdmgexp and cropdmgexp columns, so we remove them.
stormdata <- stormdata[,-c(5,7)]
A brief summary of the numeric data reveals that there are no (labelled) missing data points and gives us a sense of the magnitude of each cost measure:
summary(stormdata[2:5])
## fatalities injuries propdmg cropdmg
## Min. : 0 Min. : 0.0 Min. : 0 Min. : 0
## 1st Qu.: 0 1st Qu.: 0.0 1st Qu.: 0 1st Qu.: 0
## Median : 0 Median : 0.0 Median : 0 Median : 0
## Mean : 0 Mean : 0.2 Mean : 0 Mean : 0
## 3rd Qu.: 0 3rd Qu.: 0.0 3rd Qu.: 0 3rd Qu.: 0
## Max. :583 Max. :1700.0 Max. :115000 Max. :5000
Seeing that a single event cost $115 billion, we investigate further by looking at the top six events in terms of property damage costs.
stormdata <- arrange(stormdata, desc(propdmg))
head(stormdata)
## evtype fatalities injuries propdmg cropdmg
## 1 flood 0 0 115000 32.5
## 2 stormsurge 0 0 31300 0.0
## 3 hurricanetyphoon 0 0 16930 0.0
## 4 stormsurge 0 0 11260 0.0
## 5 hurricanetyphoon 5 0 10000 0.0
## 6 hurricanetyphoon 0 0 7350 0.0
The largest storm surge in U.S. history was Hurricane Katrina. It was also the most expensive weather event in U.S. history. While we prefer not to remove data, we believe it is clear that a flooding event roughly four times more expensive than Katrina is misreported, so we remove it from the dataset.
stormdata <- stormdata[-c(1),]
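Removing row 1 works here only because the data frame was just sorted by property damage. A slightly more defensive variant, sketched below on toy data, locates the suspect row by its value and checks its event type before dropping it. The three rows reuse figures from the listing above; the names toy, suspect, and cleaned are assumptions for illustration.

```r
# Toy data in arbitrary order, using three values from the listing above.
toy <- data.frame(
  evtype  = c("stormsurge", "flood", "hurricanetyphoon"),
  propdmg = c(31300, 115000, 16930),
  stringsAsFactors = FALSE
)

# Locate the single largest property-damage entry regardless of row order.
suspect <- which.max(toy$propdmg)

# Guard: only proceed if it is the flood record we intend to drop.
stopifnot(toy$evtype[suspect] == "flood")

cleaned <- toy[-suspect, ]
print(cleaned)
```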
The Storm Data Documentation provides reporting guidelines. Officially there are only 48 storm event categories, though a quick check tells us that reporting officials do not always comply with these guidelines.
length(unique(stormdata$evtype))
## [1] 814
Rather than spend computational resources attempting to reconcile all these differences, our analysis will focus on the most dangerous or damaging storm event types. Our results will show the storm events with the highest human and economic costs were encompassed and reported within the 48 official categories.
We are primarily interested in the overall cost of storms by category across the U.S. Below we generate the total sum of each cost by storm event type for later use.
injuries <- dcast(stormdata, evtype~"injuries", sum, value.var="injuries")
fatalities <- dcast(stormdata, evtype~"fatalities", sum, value.var="fatalities")
propdmg <- dcast(stormdata, evtype~"propdmg", sum, value.var="propdmg")
cropdmg <- dcast(stormdata, evtype~"cropdmg", sum, value.var="cropdmg")
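For readers less familiar with dcast, each call above amounts to a grouped sum: one row per event type with the total of one cost column. The sketch below shows the same idea with base R's aggregate on toy data; it is an equivalent illustration under that assumption, not the code used in this report, and the names toy and totals are made up.

```r
# Toy event-level data: several rows per event type.
toy <- data.frame(
  evtype   = c("tornado", "flood", "tornado", "hail", "flood"),
  injuries = c(15, 0, 2, 1, 4),
  stringsAsFactors = FALSE
)

# Sum injuries within each event type; one output row per evtype.
totals <- aggregate(injuries ~ evtype, data = toy, FUN = sum)
print(totals)
```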
These will be the core data frames that we will build our results on.
A summary of the injuries and fatality numeric data shows us most storm event types had no health cost, while some had very high health costs.
summary(injuries$injuries)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 173 0 91300
summary(fatalities$fatalities)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 19 0 5630
Rather than simply report the results of the effects of each type of storm, we feel it is more elucidating to identify the most costly storm categories. To do this, we first rank the event types for each health cost.
injuries <- arrange(injuries, desc(injuries))
fatalities <- arrange(fatalities, desc(fatalities))
By analyzing the proportion of injuries caused by the top ten event types to the number of injuries caused by all events, we see that almost 90% of all injuries were caused by the top ten types of storms.
sum(injuries$injuries[1:10])/sum(injuries$injuries)
## [1] 0.8936
Doing a similar analysis of fatalities shows that almost 80% of all deaths were caused by ten types of storm events.
sum(fatalities$fatalities[1:10])/sum(fatalities$fatalities)
## [1] 0.798
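The two ratios above follow the same pattern: the sum of the n largest totals divided by the grand total. A minimal sketch of that calculation as a reusable function is given below; the function name top_share and the toy totals are hypothetical.

```r
# Share of the grand total contributed by the n largest entries.
top_share <- function(x, n) {
  ranked <- sort(x, decreasing = TRUE)
  sum(head(ranked, n)) / sum(x)   # head() copes with n > length(x)
}

# Toy per-category totals, deliberately dominated by one category.
totals <- c(tornado = 90, heat = 5, flood = 3, hail = 1, fog = 1)
share <- top_share(totals, 2)
print(share)
```

Unlike indexing with [1:10], this version does not require the vector to be sorted beforehand.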
While focusing on just the top ten event types, we still wish to maintain perspective on the health cost of other storm events.
To reduce the evtype categories, we store the names of the 50 costliest event types for each health cost category, hoping to at least encompass the 48 official categories.
t5injuries <- as.character(injuries$evtype[1:50])
t5fatalities <- as.character(fatalities$evtype[1:50])
We replace the event type of all events not in the top 50 of either cost category by the type “allothers” for each health cost category.
t5evtype <- c(t5injuries, t5fatalities)
injuries$evtype[!(injuries$evtype %in% t5evtype)] <- "allothers"
fatalities$evtype[!(fatalities$evtype %in% t5evtype)] <- "allothers"
We recast the data to get the sum total of the new “allothers” categories.
injuries <- dcast(injuries, evtype~"injuries", sum, value.var="injuries")
fatalities <- dcast(fatalities, evtype~"fatalities", sum, value.var="fatalities")
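The lump-and-recast step can be seen end to end on a toy example. The sketch below keeps the top two categories and folds the rest into “allothers”; it uses base R's aggregate in place of dcast, and the data and the names totals, keep_n, ranked, top_types, and lumped are assumptions for illustration.

```r
# Toy per-category totals.
totals <- data.frame(
  evtype   = c("tornado", "heat", "flood", "fog", "dust"),
  injuries = c(90, 5, 3, 1, 1),
  stringsAsFactors = FALSE
)

keep_n <- 2   # the report keeps 50; 2 keeps the toy example readable

# Rank by cost and remember the names of the top categories.
ranked <- totals[order(-totals$injuries), ]
top_types <- ranked$evtype[seq_len(keep_n)]

# Relabel everything else, then re-aggregate so "allothers" is one row.
ranked$evtype[!(ranked$evtype %in% top_types)] <- "allothers"
lumped <- aggregate(injuries ~ evtype, data = ranked, FUN = sum)
print(lumped)
```

The grand total is unchanged by lumping, which is what lets “allothers” be compared against the named categories.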
The ten most dangerous storm events by health cost type are ranked, then summarized below.
injuries <- arrange(injuries, desc(injuries))
head(injuries[1:10,],10)
## evtype injuries
## 1 tornado 91321
## 2 tstmwind 6957
## 3 flood 6789
## 4 excessiveheat 6525
## 5 lightning 5230
## 6 heat 2100
## 7 icestorm 1975
## 8 flashflood 1777
## 9 thunderstormwind 1488
## 10 hail 1358
fatalities <- arrange(fatalities, desc(fatalities))
head(fatalities[1:10,],10)
## evtype fatalities
## 1 tornado 5630
## 2 excessiveheat 1903
## 3 flashflood 978
## 4 heat 937
## 5 lightning 817
## 6 tstmwind 504
## 7 flood 470
## 8 ripcurrent 368
## 9 allothers 307
## 10 highwind 246
This summary of the ranked list shows that most storms with a significant cost probably were reported in compliance with NOAA guidelines. The “allothers” designation does not remove a significant portion of the most damaging events. The 700+ combined event types that “allothers” covers do not enter the top five ranking in either injuries or fatalities.
To get an overall sense of the health costs of storms in the U.S. we need to combine the cost categories and look at them ranked by total cost.
We merge our two cost categories.
healthdata <- merge(injuries, fatalities)
We add a new column for the total health cost of each storm event type.
healthdata$tot <- rowSums(healthdata[,c(2,3)])
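The merge relies on both data frames sharing the evtype column and the same set of event types; with merge's defaults, a type present in only one of them would be dropped. A toy sketch of the merge and total is given below, with made-up data and the hypothetical names inj, fat, and health.

```r
# Two toy per-category tables listing the same event types in different order.
inj <- data.frame(evtype = c("tornado", "heat", "allothers"),
                  injuries = c(90, 5, 5), stringsAsFactors = FALSE)
fat <- data.frame(evtype = c("heat", "tornado", "allothers"),
                  fatalities = c(9, 6, 1), stringsAsFactors = FALSE)

# merge() joins on the shared column name (evtype) by default.
health <- merge(inj, fat)

# Total health cost per event type; columns are selected by name here.
health$tot <- rowSums(health[, c("injuries", "fatalities")])
print(health)
```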
Then we add levelled factors and rerank our healthdata data frame by total cost.
healthdata$evtype <- factor(healthdata$evtype,
levels=healthdata[
order(healthdata$tot, decreasing=TRUE), "evtype"])
healthdata <- arrange(healthdata, desc(tot), injuries)
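The reason for setting factor levels explicitly is that ggplot2 orders a discrete axis by factor levels, not by row order. The toy sketch below shows the levels ending up in descending order of total cost; the data and the name health are assumptions for illustration.

```r
# Toy totals, stored in alphabetical (not cost) order.
health <- data.frame(
  evtype = c("flood", "heat", "tornado"),
  tot    = c(14, 6, 96),
  stringsAsFactors = FALSE
)

# Use the cost-ranked event names as the factor levels.
health$evtype <- factor(
  health$evtype,
  levels = health$evtype[order(health$tot, decreasing = TRUE)]
)

ordered_levels <- levels(health$evtype)
print(ordered_levels)
```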
From the resulting data table, we see that tornadoes are far and away the most historically dangerous storm event across the U.S. Notice that the “allothers” category collectively does not rank in the top ten.
healthdata[1:10,]
## evtype injuries fatalities tot
## 1 tornado 91321 5630 96951
## 2 excessiveheat 6525 1903 8428
## 3 tstmwind 6957 504 7461
## 4 flood 6789 470 7259
## 5 lightning 5230 817 6047
## 6 heat 2100 937 3037
## 7 flashflood 1777 978 2755
## 8 icestorm 1975 89 2064
## 9 thunderstormwind 1488 133 1621
## 10 winterstorm 1321 206 1527
To represent our results graphically we need to melt the overall health data:
healthmelt <- melt(healthdata[1:10,1:3], id.vars="evtype", variable.name="cost")
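Melting turns the wide table (one column per cost type) into a long table with one row per event type and cost type, which is the shape a stacked bar chart needs. The sketch below builds the same long layout by hand in base R on toy data, purely to show the target shape; it is not a replacement for melt, and the names wide and long are made up.

```r
# Toy wide table: one row per event type, one column per cost type.
wide <- data.frame(
  evtype     = c("tornado", "heat"),
  injuries   = c(90, 5),
  fatalities = c(6, 9),
  stringsAsFactors = FALSE
)

# Long layout: evtype repeated once per cost column,
# a "cost" label column, and a single "value" column.
long <- data.frame(
  evtype = rep(wide$evtype, times = 2),
  cost   = rep(c("injuries", "fatalities"), each = nrow(wide)),
  value  = c(wide$injuries, wide$fatalities),
  stringsAsFactors = FALSE
)
print(long)
```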
Finally, our results on the historic health costs of storms across the U.S. are summed up in the stacked bargraph below.
ggplot(healthmelt, aes(x=evtype, y=value, fill = cost)) +
geom_bar(stat="identity") +
labs(title="Historical Health Costs of the Most Dangerous Storm Categories") +
xlab("Event Type") +
ylab("Human Health Cost") +
theme(axis.text.x=element_text(angle = 90))
Overall, tornadoes were responsible for roughly 62% of all health costs.
healthdata$tot[1]/sum(healthdata$tot)
## [1] 0.6231
The analysis of the economic cost of storms is similar to the analysis of health costs. The verbiage is almost identical, but the results differ.
A summary of the property damage and crop damage cost data shows us that most storm event types had no economic cost, while some had very high economic costs.
summary(propdmg$propdmg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 384 0 69300
summary(cropdmg$cropdmg)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0 0 0 60 0 14000
Rather than simply report the results of the effects of each type of storm, we feel it is more elucidating to identify the most costly storm categories. To do this, we first rank the event types for each type of economic cost.
propdmg <- arrange(propdmg, desc(propdmg))
cropdmg <- arrange(cropdmg, desc(cropdmg))
The economic damage is more evenly spread across the storm categories, so it is more appropriate to look into the top fifteen types of events, rather than the top ten. By analyzing the proportion of property damage caused by the top fifteen event types to the property damage caused by all events, we see that over 90% of all property damage was caused by just fifteen types of storms.
sum(propdmg$propdmg[1:15])/sum(propdmg$propdmg)
## [1] 0.9144
Doing a similar analysis of crop damage shows that over 90% of all damage was also caused by fifteen types of storm events.
sum(cropdmg$cropdmg[1:15])/sum(cropdmg$cropdmg)
## [1] 0.917
While focusing on just the top fifteen event types, we still wish to maintain perspective on the economic costs of other storm events.
To reduce the evtype categories, we store the names of the 50 costliest event types for each economic category, hoping to at least encompass the 48 official categories.
t5propdmg <- as.character(propdmg$evtype[1:50])
t5cropdmg <- as.character(cropdmg$evtype[1:50])
We replace the event type of all events not in the top 50 of either cost category by the type “allothers” for each economic cost category.
t5evtype <- c(t5propdmg, t5cropdmg)
propdmg$evtype[!(propdmg$evtype %in% t5evtype)] <- "allothers"
cropdmg$evtype[!(cropdmg$evtype %in% t5evtype)] <- "allothers"
We recast the data to get the sum total of the new, economic “allothers” categories.
propdmg <- dcast(propdmg, evtype~"propdmg", sum, value.var="propdmg")
cropdmg <- dcast(cropdmg, evtype~"cropdmg", sum, value.var="cropdmg")
The fifteen most damaging storm events by economic cost type are ranked, then summarized below.
propdmg <- arrange(propdmg, desc(propdmg))
head(propdmg[1:15,], 15)
## evtype propdmg
## 1 hurricanetyphoon 69306
## 2 tornado 56937
## 3 stormsurge 43324
## 4 flood 29658
## 5 flashflood 16141
## 6 hail 15732
## 7 hurricane 11868
## 8 tropicalstorm 7704
## 9 winterstorm 6688
## 10 highwind 5270
## 11 riverflood 5119
## 12 wildfire 4765
## 13 stormsurgetide 4641
## 14 tstmwind 4493
## 15 icestorm 3945
cropdmg <- arrange(cropdmg, desc(cropdmg))
head(cropdmg[1:15,], 15)
## evtype cropdmg
## 1 drought 13972.6
## 2 flood 5629.5
## 3 riverflood 5029.5
## 4 icestorm 5022.1
## 5 hail 3001.0
## 6 hurricane 2741.9
## 7 hurricanetyphoon 2607.9
## 8 flashflood 1420.7
## 9 extremecold 1313.0
## 10 frostfreeze 1094.2
## 11 heavyrain 733.4
## 12 tropicalstorm 678.3
## 13 highwind 638.6
## 14 tstmwind 554.0
## 15 excessiveheat 492.4
This summary of the ranked list shows that most storms with a significant cost probably were reported in compliance with NOAA guidelines. The “allothers” designation does not remove a significant portion of damaging events. The 700+ combined event types that “allothers” covers do not break into the top fifteen ranking in either property damage or crop damage.
Unlike the health cost data, we see that tornadoes do not dominate both categories. Strong wind events (which include tornadoes) seem to cause the most property damage, while flooding and drought have clearly caused the most crop damage historically.
To get an overall sense of the economic costs of storms in the U.S. we need to combine the cost categories and look at the ranked total.
We merge our two cost categories.
econdata <- merge(propdmg, cropdmg)
We add a new column for the total economic cost of each storm event type.
econdata$tot <- rowSums(econdata[,c(2,3)])
Then we add levelled factors and rerank our econdata data frame by total cost.
econdata$evtype <- factor(econdata$evtype,
levels=econdata[
order(econdata$tot, decreasing=TRUE), "evtype"])
econdata <- arrange(econdata, desc(tot), propdmg)
From the resulting data table, we see that economic damage is caused by a wide range of storm types.
econdata[1:15,]
## evtype propdmg cropdmg tot
## 1 hurricanetyphoon 69306 2607.873 71914
## 2 tornado 56937 364.950 57302
## 3 stormsurge 43324 0.005 43324
## 4 flood 29658 5629.468 35287
## 5 hail 15732 3000.954 18733
## 6 flashflood 16141 1420.727 17562
## 7 drought 1046 13972.566 15019
## 8 hurricane 11868 2741.910 14610
## 9 riverflood 5119 5029.459 10148
## 10 icestorm 3945 5022.110 8967
## 11 tropicalstorm 7704 678.346 8382
## 12 winterstorm 6688 26.944 6715
## 13 highwind 5270 638.571 5909
## 14 wildfire 4765 295.473 5061
## 15 tstmwind 4493 554.007 5047
To represent our results graphically we need to melt the overall economic data:
econmelt <- melt(econdata[1:10,1:3], id.vars="evtype", variable.name="cost")
Finally, our results on the historic economic costs of storms across the U.S. are summed up in the stacked bargraph below. We only graph the top ten for readability.
ggplot(econmelt, aes(x=evtype, y=value, fill = cost)) +
geom_bar(stat="identity") +
labs(title="Historical Economic Costs of the Most Damaging Storm Categories") +
xlab("Event Type") +
ylab("Economic Cost (millions of dollars)") +
theme(axis.text.x=element_text(angle = -90))
Overall, hurricanes/typhoons are the most economically costly weather event, causing ~20% of property and crop storm losses.
econdata$tot[1]/sum(econdata$tot)
## [1] 0.199
This represents an initial analysis of the U.S. storm data. As such there is room for improvement and deeper analysis.
Notably, the event categories do not seem to have standardized reporting labels. Both ‘heat’ and ‘excessive heat’ are dangerous in terms of human cost individually, but might actually represent weather conditions similar enough, for planning purposes, to be combined. An effort to clarify and combine synonymous event types might lead to more relevant ranking results.
Stratifying the data by time period and analyzing the change in health and economic cost over time might also be more relevant to modern planners, as might analyzing the data by cost per event occurrence.
Finally, having worked on this report over multiple machines, there are slight differences in how different systems handle file names. So, for completeness, we include the session info for the final compilation of this report.
sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] plyr_1.8.1 ggplot2_1.0.0 reshape2_1.4
##
## loaded via a namespace (and not attached):
## [1] colorspace_1.2-4 digest_0.6.4 evaluate_0.5.5 formatR_0.10
## [5] grid_3.1.2 gtable_0.1.2 htmltools_0.2.4 knitr_1.6
## [9] labeling_0.3 MASS_7.3-35 munsell_0.4.2 proto_0.3-10
## [13] Rcpp_0.11.2 rmarkdown_0.2.64 scales_0.2.4 stringr_0.6.2
## [17] tools_3.1.2 yaml_2.1.13