by epigenus

Nov 2014

Synopsis / Introduction

With this analysis we seek to inform interested parties of the historically costliest storm events in terms of health and economic impact. To perform this analysis we use the NOAA US Storm Data, which comprises county-reported data on storm types and outcomes across the U.S. from 1950 to 2011. From this we summarize and rank the historic total number of injuries and fatalities, and the dollar value of crop damage and property damage, by storm type across the entire U.S. We find that most damage and costs are associated with just a few storm event categories, with tornadoes being the historically costliest overall. Full documentation regarding the data is available at Storm Data Documentation. Further clarifications can be found in Storm Events FAQ.

Data Processing

We use the NOAA US Storm Data to assess historic health and economic costs of U.S. Storms.

Libraries

To perform this analysis we make use of the reshape2, ggplot2, and plyr libraries written by Hadley Wickham.

    require(reshape2)
    require(ggplot2)
    require(plyr)

Loading and Subsetting

We load the full dataset into the stormdatafull data frame. Note: we will need to order and rank elements of the dataset later, so we choose not to have R assign factor levels at this time.

    stormdatafull <- read.csv(bzfile("repdata%2Fdata%2FStormData.csv.bz2"),
                          stringsAsFactors=FALSE)

From the Storm Data Documentation we know there is much more data than we need. The data consists of multiple numeric values and classifiers reported for individual storm events at specific county locations. We are interested in the storm event type classifier (“EVTYPE”), the health cost data (“FATALITIES”, “INJURIES”), the economic cost data (“PROPDMG”, “CROPDMG”), and the multipliers for the economic cost data (“PROPDMGEXP”, “CROPDMGEXP”). We are only interested in the historic aggregate effect across the whole U.S.; date and location data aren’t necessary for our analysis. We subset the full stormdatafull data frame, keeping the columns listed above and discarding the rest.

    ofinterest <- c("EVTYPE", "FATALITIES", "INJURIES",
                    "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
    stormdata <- stormdatafull[, ofinterest]

A quick look at our data subset shows us:

    str(stormdata)
## 'data.frame':    902297 obs. of  7 variables:
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...

The subset contains 902,297 observations of 7 variables.

Tidying the Data

The data is not in a tidy format.

First we clean up the column names:

    names(stormdata) <- tolower(names(stormdata))

Then we ensure our classifier column evtype has tidy labels as values:

    stormdata$evtype <- tolower(stormdata$evtype)
    stormdata$evtype <- gsub("[[:punct:]]", "", stormdata$evtype)
    stormdata$evtype <- gsub("[[:space:]]", "", stormdata$evtype)
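
As a quick illustration of what these three steps do, the sketch below applies them to a few made-up labels. The helper name normalize_evtype and the sample strings are ours, chosen for illustration only, and are not part of the analysis itself.

```r
# Hypothetical helper wrapping the three tidying steps used above.
normalize_evtype <- function(x) {
  x <- tolower(x)                      # uniform case
  x <- gsub("[[:punct:]]", "", x)      # drop punctuation such as "/" or "."
  gsub("[[:space:]]", "", x)           # drop all whitespace
}

# Illustrative labels, not taken from the dataset
tidy_labels <- normalize_evtype(c("TSTM WIND", "Hurricane/Typhoon", " Flash Flood."))
tidy_labels
```

Note that this only removes differences in case, punctuation, and spacing. Labels that are spelled differently for the same kind of event stay distinct.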

Economic Data Processing

According to the Storm Data Documentation, there should be only four propdmgexp and cropdmgexp values (“B”, “M”, “K”, “”). Looking at the actual data, we see more entries than just the official classifiers.

    unique(stormdata$propdmgexp)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-"
## [18] "1" "8"
    unique(stormdata$cropdmgexp)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

Since these provide the dollar scale for the numeric economic values, any labels outside of the four allowed render that data point potentially meaningless. We wouldn’t properly know how to account for those values. So we first make sure these classifiers are uniformly uppercase.

    stormdata$propdmgexp <- toupper(stormdata$propdmgexp)
    stormdata$cropdmgexp <- toupper(stormdata$cropdmgexp)

Then we remove all rows with garbage propdmgexp and cropdmgexp values.

    stormdata <- stormdata[stormdata$propdmgexp %in% c("B", "M", "K", ""),]
    stormdata <- stormdata[stormdata$cropdmgexp %in% c("B", "M", "K", ""),]
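
Before discarding rows it can be worth checking how much data the filter removes. The sketch below does this on a small made-up data frame; the object names toy, allowed, and keep are assumptions for illustration, and the values are not from the NOAA dataset.

```r
# Toy stand-in for the two multiplier columns; values are illustrative only.
toy <- data.frame(propdmgexp = c("K", "M", "+", "", "5"),
                  cropdmgexp = c("", "K", "", "?", ""),
                  stringsAsFactors = FALSE)
allowed <- c("B", "M", "K", "")

# TRUE only where both multipliers are among the documented codes
keep <- toy$propdmgexp %in% allowed & toy$cropdmgexp %in% allowed

table(keep)    # how many rows survive the filter
mean(!keep)    # fraction of rows that would be discarded
```

Running the same two summary lines against the real columns would show whether the discarded share is small enough to ignore.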

The propdmgexp and cropdmgexp represent encoded multipliers of the economic data. To give property damage and crop damage data proper scale we need to multiply each monetary data point by the proper factor.

To do this we replace the encoded multiplier values with their numeric values (e.g. “B” = 1000000000) for each dmgexp column. An empty multiplier is treated as a factor of 1.

    stormdata$propdmgexp <- sapply(stormdata$propdmgexp, 
                                   function(x) switch(x, "B" = 1000000000, "M" = 1000000, "K" = 1000, 1))
    stormdata$cropdmgexp <- sapply(stormdata$cropdmgexp, 
                                   function(x) switch(x, "B" = 1000000000, "M" = 1000000, "K" = 1000, 1))

Then we multiply each dmg column by the corresponding dmgexp column and divide by one million, so that each dmg column is replaced by its value in millions of dollars.

    stormdata$propdmg <- stormdata$propdmg*stormdata$propdmgexp/1000000
    stormdata$cropdmg <- stormdata$cropdmg*stormdata$cropdmgexp/1000000

We no longer need the propdmgexp and cropdmgexp columns, so we remove them.

    stormdata <- stormdata[,-c(5,7)]

A brief summary of the numeric data reveals that there are no (labelled) missing data points and gives us a sense of the magnitude of each cost measure:

    summary(stormdata[2:5])
##    fatalities     injuries         propdmg          cropdmg    
##  Min.   :  0   Min.   :   0.0   Min.   :     0   Min.   :   0  
##  1st Qu.:  0   1st Qu.:   0.0   1st Qu.:     0   1st Qu.:   0  
##  Median :  0   Median :   0.0   Median :     0   Median :   0  
##  Mean   :  0   Mean   :   0.2   Mean   :     0   Mean   :   0  
##  3rd Qu.:  0   3rd Qu.:   0.0   3rd Qu.:     0   3rd Qu.:   0  
##  Max.   :583   Max.   :1700.0   Max.   :115000   Max.   :5000

Seeing that a single event cost $115 billion, we investigate further by looking at the top six events in terms of property damage costs (values are in millions of dollars).

    stormdata <- arrange(stormdata, desc(propdmg))
    head(stormdata)
##             evtype fatalities injuries propdmg cropdmg
## 1            flood          0        0  115000    32.5
## 2       stormsurge          0        0   31300     0.0
## 3 hurricanetyphoon          0        0   16930     0.0
## 4       stormsurge          0        0   11260     0.0
## 5 hurricanetyphoon          5        0   10000     0.0
## 6 hurricanetyphoon          0        0    7350     0.0

The largest storm surge in U.S. history was Hurricane Katrina, which was also the most expensive weather event in U.S. history. While we prefer not to remove data, we believe it is clear that a flooding event nearly four times as expensive as Katrina is misreported, so we remove it from the dataset. Because the data frame is now sorted by property damage, this record is the first row.

    stormdata <- stormdata[-c(1),]
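
Removing a record by row position only works while the data frame stays sorted. A minimal sketch of a value-based alternative is shown below on toy rows that mimic the head() output above; the object names toy and suspect are ours for illustration.

```r
# Toy rows mimicking the head() output above (damage in millions of dollars)
toy <- data.frame(evtype  = c("flood", "stormsurge", "hurricanetyphoon"),
                  propdmg = c(115000, 31300, 16930),
                  stringsAsFactors = FALSE)

# Flag the suspect record by its values instead of by its row position
suspect <- toy$evtype == "flood" & toy$propdmg == 115000
toy <- toy[!suspect, ]
toy
```

This form gives the same result regardless of row order, at the cost of assuming that no legitimate record shares exactly these values.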

Event Type Data Processing

The Storm Data Documentation provides reporting guidelines. Officially there are only 48 storm event categories, though a quick check tells us that reporting officials do not comply with these guidelines.

    length(unique(stormdata$evtype))
## [1] 814

Rather than spend computational resources attempting to reconcile all these differences, our analysis will focus on the most dangerous or damaging storm event types. Our results will show the storm events with the highest human and economic costs were encompassed and reported within the 48 official categories.

We are primarily interested in the overall cost of storms by category across the U.S. Below we generate the total sum of each cost by storm event type for later use.

    injuries <- dcast(stormdata, evtype~"injuries", sum, value.var="injuries")
    fatalities <- dcast(stormdata, evtype~"fatalities", sum, value.var="fatalities")
    propdmg <- dcast(stormdata, evtype~"propdmg", sum, value.var="propdmg")
    cropdmg <- dcast(stormdata, evtype~"cropdmg", sum, value.var="cropdmg")

These will be the core data frames that we will build our results on.
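
For readers without reshape2, the same kind of per-type total can be sketched in base R with aggregate. This is an equivalent formulation on toy data, not the code used to produce the results in this report.

```r
# Toy events; values are illustrative only
toy <- data.frame(evtype   = c("tornado", "flood", "tornado", "hail"),
                  injuries = c(15, 2, 6, 0),
                  stringsAsFactors = FALSE)

# One row per event type, with injuries summed within each type
totals <- aggregate(injuries ~ evtype, data = toy, FUN = sum)
totals
```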

Results

Historic Health Costs of Storm Events Across the U.S.

Individual Health Category Costs by Storm Type

A summary of the injuries and fatality numeric data shows us most storm event types had no health cost, while some had very high health costs.

    summary(injuries$injuries)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0     173       0   91300
    summary(fatalities$fatalities)    
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0      19       0    5630

Rather than simply report the results of the effects of each type of storm, we feel it is more elucidating to identify the most costly storm categories. To do this, we first rank the event types for each health cost.

    injuries <- arrange(injuries, desc(injuries))
    fatalities <- arrange(fatalities, desc(fatalities))

By comparing the number of injuries caused by the top ten event types to the number of injuries caused by all events, we see that almost 90% of all injuries were caused by the top ten types of storms.

    sum(injuries$injuries[1:10])/sum(injuries$injuries)
## [1] 0.8936

Doing a similar analysis of fatalities shows that almost 80% of all deaths were caused by ten types of storm events.

    sum(fatalities$fatalities[1:10])/sum(fatalities$fatalities)
## [1] 0.798
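
The share calculation used for both injuries and fatalities can be wrapped in a small function. The name top_share is hypothetical; unlike the code above, this sketch sorts its input itself, so it does not rely on the data frame having been arranged first.

```r
# Hypothetical helper: fraction of the total contributed by the n largest values
top_share <- function(x, n = 10) {
  x <- sort(x, decreasing = TRUE)
  sum(head(x, n)) / sum(x)
}

# Illustrative totals: the two largest values account for 80 of 100
share <- top_share(c(5, 50, 10, 30, 5), n = 2)
share
```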

While focusing on just the top ten event types, we still wish to maintain perspective on the health cost of other storm events.

To reduce the number of evtype categories, we store the names of the 50 costliest event types for each health cost category, hoping to at least encompass the 48 official categories. (The t5 prefix in the variable names below refers to this top-50 set.)

    t5injuries <- as.character(injuries$evtype[1:50])
    t5fatalities <- as.character(fatalities$evtype[1:50])

We replace the event type of all events not in the top 50 of either cost category by the type “allothers” for each health cost category.

    t5evtype <- c(t5injuries, t5fatalities)
    injuries$evtype[!(injuries$evtype %in% t5evtype)] <- "allothers"
    fatalities$evtype[!(fatalities$evtype %in% t5evtype)] <- "allothers"

We recast the data to get the sum total of the new “allothers” categories.

    injuries <- dcast(injuries, evtype~"injuries", sum, value.var="injuries")
    fatalities <- dcast(fatalities, evtype~"fatalities", sum, value.var="fatalities")
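
The relabel-then-sum pattern can be seen end to end on a toy example. The sketch below uses base R's aggregate in place of dcast, and a two-label keep set standing in for the top-50 list; all names and values are illustrative assumptions.

```r
# Toy per-type totals; values are illustrative only
toy <- data.frame(evtype   = c("tornado", "flood", "heat", "fog", "dust"),
                  injuries = c(90, 7, 2, 1, 1),
                  stringsAsFactors = FALSE)
keep <- c("tornado", "flood")   # stand-in for the top-50 label set

# Relabel everything outside the keep set, then sum within each label
toy$evtype[!(toy$evtype %in% keep)] <- "allothers"
lumped <- aggregate(injuries ~ evtype, data = toy, FUN = sum)
lumped[order(-lumped$injuries), ]
```

The grand total is unchanged by the lumping, which is what lets "allothers" keep the minor categories in perspective.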

The ten most dangerous storm events by health cost type are ranked, then summarized below.

    injuries <- arrange(injuries, desc(injuries))
    head(injuries[1:10,],10)
##              evtype injuries
## 1           tornado    91321
## 2          tstmwind     6957
## 3             flood     6789
## 4     excessiveheat     6525
## 5         lightning     5230
## 6              heat     2100
## 7          icestorm     1975
## 8        flashflood     1777
## 9  thunderstormwind     1488
## 10             hail     1358
    fatalities <- arrange(fatalities, desc(fatalities))
    head(fatalities[1:10,],10)
##           evtype fatalities
## 1        tornado       5630
## 2  excessiveheat       1903
## 3     flashflood        978
## 4           heat        937
## 5      lightning        817
## 6       tstmwind        504
## 7          flood        470
## 8     ripcurrent        368
## 9      allothers        307
## 10      highwind        246

This summary of the ranked lists suggests that most storms with a significant cost were probably reported in compliance with NOAA guidelines. The “allothers” designation does not remove a significant portion of the most damaging events: the 700+ combined event types that “allothers” covers do not enter the top five in either injuries or fatalities.

Overall Health Cost by Storm Type

To get an overall sense of the health costs of storms in the U.S., we need to combine the cost categories and look at them ranked by total cost.

We merge our two cost categories.

    healthdata <- merge(injuries, fatalities)

We add a new column for the total health cost of each storm event type.

    healthdata$tot <- rowSums(healthdata[,c(2,3)])

Then we add levelled factors and rerank our healthdata data frame by total cost.

    healthdata$evtype <- factor(healthdata$evtype,
                        levels=healthdata[
                            order(healthdata$tot, decreasing=TRUE), "evtype"])
    healthdata <- arrange(healthdata, desc(tot), injuries)
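
The reason for setting factor levels explicitly is that plotting functions such as ggplot2 order a discrete axis by level order, not by row order. A minimal base R sketch on toy totals (the data frame toy is an assumption for illustration) shows the effect on the levels:

```r
# Toy totals, deliberately not in ranked order
toy <- data.frame(evtype = c("flood", "heat", "tornado"),
                  tot    = c(7259, 3037, 96951),
                  stringsAsFactors = FALSE)

# Order the factor levels by descending total so bars appear in rank order
toy$evtype <- factor(toy$evtype,
                     levels = toy$evtype[order(toy$tot, decreasing = TRUE)])
levels(toy$evtype)
```

Without the explicit levels argument, factor() would sort the labels alphabetically.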

From the resulting data table, we see that tornadoes are far and away the most historically dangerous storm event across the U.S. Notice that the “allothers” category collectively does not rank in the top ten.

    healthdata[1:10,]
##              evtype injuries fatalities   tot
## 1           tornado    91321       5630 96951
## 2     excessiveheat     6525       1903  8428
## 3          tstmwind     6957        504  7461
## 4             flood     6789        470  7259
## 5         lightning     5230        817  6047
## 6              heat     2100        937  3037
## 7        flashflood     1777        978  2755
## 8          icestorm     1975         89  2064
## 9  thunderstormwind     1488        133  1621
## 10      winterstorm     1321        206  1527

To represent our results graphically, we need to melt the overall health data into long format, with one row per event type and cost category.

    healthmelt <- melt(healthdata[1:10,1:3], id.vars="evtype", variable.name="cost")
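
To make the long format concrete, the sketch below builds a similar structure by hand in base R from two rows of the table above. It approximates what melt returns (melt stores the cost column as a factor, for example) and is meant only to show the shape of the data handed to the plot.

```r
# Two rows taken from the health table above, in wide format
wide <- data.frame(evtype     = c("tornado", "excessiveheat"),
                   injuries   = c(91321, 6525),
                   fatalities = c(5630, 1903),
                   stringsAsFactors = FALSE)

# Long format: one row per (event type, cost category) pair
long <- data.frame(evtype = rep(wide$evtype, times = 2),
                   cost   = rep(c("injuries", "fatalities"), each = nrow(wide)),
                   value  = c(wide$injuries, wide$fatalities),
                   stringsAsFactors = FALSE)
long
```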

Finally, our results on the historic health costs of storms across the U.S. are summed up in the stacked bargraph below.

    ggplot(healthmelt, aes(x=evtype, y=value, fill = cost)) + 
    geom_bar(stat="identity") +
    labs(title="Historical Health Costs of the Most Dangerous Storm Categories") + 
    xlab("Event Type") +
    ylab("Human Health Cost") +
    theme(axis.text.x=element_text(angle = 90))

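[Figure (chunk healthgraph): stacked bar graph of injuries and fatalities by event type for the ten most dangerous storm categories]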

Overall, tornadoes were responsible for roughly 62% of all health costs (injuries plus fatalities).

    healthdata$tot[1]/sum(healthdata$tot)
## [1] 0.6231

Historic Economic Costs of Storm Events Across the U.S.

Individual Economic Category Costs by Storm Type

The analysis of the economic cost of storms is similar to the analysis of health costs. The verbiage is almost identical, but the results differ.

A summary of the property damage and crop damage cost data shows us that most storm event types had no economic cost, while some had very high economic costs.

    summary(propdmg$propdmg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0     384       0   69300
    summary(cropdmg$cropdmg)    
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##       0       0       0      60       0   14000

Rather than simply report the results of the effects of each type of storm, we feel it is more elucidating to identify the most costly storm categories. To do this, we first rank the event types for each type of economic cost.

    propdmg <- arrange(propdmg, desc(propdmg))
    cropdmg <- arrange(cropdmg, desc(cropdmg))

The economic damage is more evenly spread across the storm categories, so it is more appropriate to look at the top fifteen types of events rather than the top ten. By comparing the property damage caused by the top fifteen event types to the property damage caused by all events, we see that over 90% of all property damage was caused by just fifteen types of storms.

    sum(propdmg$propdmg[1:15])/sum(propdmg$propdmg)
## [1] 0.9144

Doing a similar analysis of crop damage shows that over 90% of all damage was also caused by fifteen types of storm events.

    sum(cropdmg$cropdmg[1:15])/sum(cropdmg$cropdmg)
## [1] 0.917

While focusing on just the top fifteen event types, we still wish to maintain perspective on the economic costs of other storm events.

To reduce the number of evtype categories, we store the names of the 50 costliest event types for each economic category, hoping to at least encompass the 48 official categories.

    t5propdmg <- as.character(propdmg$evtype[1:50])
    t5cropdmg <- as.character(cropdmg$evtype[1:50])

We replace the event type of all events not in the top 50 of either cost category by the type “allothers” for each economic cost category.

    t5evtype <- c(t5propdmg, t5cropdmg)
    propdmg$evtype[!(propdmg$evtype %in% t5evtype)] <- "allothers"
    cropdmg$evtype[!(cropdmg$evtype %in% t5evtype)] <- "allothers"

We recast the data to get the sum total of the new, economic “allothers” categories.

    propdmg <- dcast(propdmg, evtype~"propdmg", sum, value.var="propdmg")
    cropdmg <- dcast(cropdmg, evtype~"cropdmg", sum, value.var="cropdmg")

The fifteen most damaging storm events by economic cost type are ranked, then summarized below (values are in millions of dollars).

    propdmg <- arrange(propdmg, desc(propdmg))
    head(propdmg[1:15,], 15)
##              evtype propdmg
## 1  hurricanetyphoon   69306
## 2           tornado   56937
## 3        stormsurge   43324
## 4             flood   29658
## 5        flashflood   16141
## 6              hail   15732
## 7         hurricane   11868
## 8     tropicalstorm    7704
## 9       winterstorm    6688
## 10         highwind    5270
## 11       riverflood    5119
## 12         wildfire    4765
## 13   stormsurgetide    4641
## 14         tstmwind    4493
## 15         icestorm    3945
    cropdmg <- arrange(cropdmg, desc(cropdmg))
    head(cropdmg[1:15,], 15)
##              evtype cropdmg
## 1           drought 13972.6
## 2             flood  5629.5
## 3        riverflood  5029.5
## 4          icestorm  5022.1
## 5              hail  3001.0
## 6         hurricane  2741.9
## 7  hurricanetyphoon  2607.9
## 8        flashflood  1420.7
## 9       extremecold  1313.0
## 10      frostfreeze  1094.2
## 11        heavyrain   733.4
## 12    tropicalstorm   678.3
## 13         highwind   638.6
## 14         tstmwind   554.0
## 15    excessiveheat   492.4

This summary of the ranked lists suggests that most storms with a significant cost were probably reported in compliance with NOAA guidelines. The “allothers” designation does not remove a significant portion of damaging events: the 700+ combined event types that “allothers” covers do not break the top fifteen in either property damage or crop damage.

Unlike in the health cost data, tornadoes do not dominate both categories. Strong-wind events, tornadoes included, seem to cause the most property damage, while flooding and drought have clearly caused the most crop damage historically.

Overall Economic Cost by Storm Type

To get an overall sense of the economic costs of storms in the U.S., we need to combine the cost categories and look at the ranked total.

We merge our two cost categories.

    econdata <- merge(propdmg, cropdmg)

We add a new column for the total economic cost of each storm event type.

    econdata$tot <- rowSums(econdata[,c(2,3)])

Then we add levelled factors and rerank our econdata data frame by total cost.

    econdata$evtype <- factor(econdata$evtype,
                        levels=econdata[
                            order(econdata$tot, decreasing=TRUE), "evtype"])
    econdata <- arrange(econdata, desc(tot), propdmg)

From the resulting data table, we see that economic damage is caused by a wide range of storm types.

    econdata[1:15,]
##              evtype propdmg   cropdmg   tot
## 1  hurricanetyphoon   69306  2607.873 71914
## 2           tornado   56937   364.950 57302
## 3        stormsurge   43324     0.005 43324
## 4             flood   29658  5629.468 35287
## 5              hail   15732  3000.954 18733
## 6        flashflood   16141  1420.727 17562
## 7           drought    1046 13972.566 15019
## 8         hurricane   11868  2741.910 14610
## 9        riverflood    5119  5029.459 10148
## 10         icestorm    3945  5022.110  8967
## 11    tropicalstorm    7704   678.346  8382
## 12      winterstorm    6688    26.944  6715
## 13         highwind    5270   638.571  5909
## 14         wildfire    4765   295.473  5061
## 15         tstmwind    4493   554.007  5047

To represent our results graphically, we need to melt the overall economic data into long format.

    econmelt <- melt(econdata[1:10,1:3], id.vars="evtype", variable.name="cost")

Finally, our results on the historic economic costs of storms across the U.S. are summed up in the stacked bargraph below. We only graph the top ten for readability.

    ggplot(econmelt, aes(x=evtype, y=value, fill = cost)) + 
    geom_bar(stat="identity") +
    labs(title="Historical Economic Costs of the Most Damaging Storm Categories") + 
    xlab("Event Type") +
    ylab("Economic Cost (millions of dollars)") +
    theme(axis.text.x=element_text(angle = -90))

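[Figure (chunk econgraph): stacked bar graph of property and crop damage, in millions of dollars, by event type for the ten most damaging storm categories]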

Overall, hurricanes/typhoons are the most economically costly type of weather event, causing ~20% of property and crop storm losses.

    econdata$tot[1]/sum(econdata$tot)
## [1] 0.199

Concluding Remarks

This represents an initial analysis of the U.S. storm data. As such, there is room for improvement and deeper analysis.

Notably, the event categories do not seem to have standardized reporting labels. Both ‘heat’ and ‘excessive heat’ are dangerous in terms of human cost individually, but they might represent weather conditions similar enough, for planning purposes, to be combined. An effort to clarify and combine synonymous event types might lead to more relevant ranking results.
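
One way such a consolidation might be sketched is with a named lookup vector applied to the tidied labels. The synonym pairs below are hypothetical examples drawn from labels seen in the tables above; deciding which labels are truly equivalent is a judgment call that this sketch does not settle.

```r
# Hypothetical synonym map: names are labels as they appear after tidying,
# values are the label each would be folded into.
synonyms <- c(heat = "excessiveheat", tstmwind = "thunderstormwind")

# Hypothetical helper: replace any label found in the map, leave the rest alone
merge_synonyms <- function(evtype, map) {
  hit <- evtype %in% names(map)
  evtype[hit] <- unname(map[evtype[hit]])
  evtype
}

merged <- merge_synonyms(c("heat", "tornado", "tstmwind", "excessiveheat"), synonyms)
merged
```

Applying such a step before the totals are computed would let the combined categories be ranked as single entries.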

Stratifying the data by time period and analyzing the change in health and economic cost over time might also be more relevant to modern planners, as might analyzing the data by cost per event occurrence.
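
As a rough sketch of the per-occurrence idea, the total cost for each event type can be divided by the number of reported events of that type. The toy data below is made up for illustration and is not drawn from the NOAA dataset.

```r
# Toy event-level records; values are illustrative only
toy <- data.frame(evtype   = c("tornado", "tornado", "flood", "flood", "flood"),
                  injuries = c(15, 5, 0, 3, 0),
                  stringsAsFactors = FALSE)

total <- tapply(toy$injuries, toy$evtype, sum)      # injuries per event type
count <- tapply(toy$injuries, toy$evtype, length)   # number of reported events

# Average injuries per reported event, by type
per_event <- total / count
per_event
```

A ranking by this measure could differ substantially from a ranking by totals, since frequently reported event types accumulate large totals even when individual events are mild.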

Finally, because we worked on this report across multiple machines, and different systems handle file names slightly differently, we include the session info for the final compilation of this report for completeness.

    sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-pc-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
##  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
##  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] plyr_1.8.1    ggplot2_1.0.0 reshape2_1.4 
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5   formatR_0.10    
##  [5] grid_3.1.2       gtable_0.1.2     htmltools_0.2.4  knitr_1.6       
##  [9] labeling_0.3     MASS_7.3-35      munsell_0.4.2    proto_0.3-10    
## [13] Rcpp_0.11.2      rmarkdown_0.2.64 scales_0.2.4     stringr_0.6.2   
## [17] tools_3.1.2      yaml_2.1.13