Synopsis

In this peer assessed project, I analyze a data set of storm events in the United States published by the National Weather Service. The data set is available here and there is documentation available here and here. The storm events included in the data set are diverse, including droughts, tornadoes, winter weather events, and others. Events are recorded from 1950 to 2011, but events from the early years in the data set are sparse, indicating changes over the years in record keeping. Observations recorded for the storm events include event type, location, time/date, deaths, injuries, property/crop damage, and others. In this analysis, I will examine the types of storm events in this data set and the consequences for these events for human health and the economy.

Data Processing

First, I load the libraries used in this analysis and read the storm event data from the CSV file.

library(dplyr, warn.conflicts = FALSE)
library(reshape2)
library(ggplot2)
storms <- read.csv("./repdata-data-StormData.csv.bz2", stringsAsFactors = FALSE)

To examine the economic consequences of storm events, I need the estimated property and crop damage as recorded in this data set. These damage estimates are coded in two columns each, one with 3 significant digits and one with an exponent: “K” for thousands, “M” for millions, and “B” for billions. Not all events have properly coded damage exponents, however; some have values such as “?”, “2”, or so on. Let’s examine how many storm events have incorrectly coded exponents, to see if it would be reasonable to remove these for my analysis.

meanprop <- mean(storms$PROPDMGEXP != "K" & 
        storms$PROPDMGEXP != "M" & 
        storms$PROPDMGEXP != "B" & 
        storms$PROPDMGEXP != "")
        
meancrop <- mean(storms$CROPDMGEXP != "K" & 
        storms$CROPDMGEXP != "M" & 
        storms$CROPDMGEXP != "B" & 
        storms$CROPDMGEXP != "")

meanprop
## [1] 0.0003635167
meancrop
## [1] 5.430584e-05

So 0.0363517% of the values have incorrectly coded exponents for the property damage and 0.0054306% of the values have incorrectly coded exponents for the crop damage. It does not seem unreasonable to remove these events from my analysis so that we have complete information on the economic impact of the storm events.

As I keep only the storm events with properly coded exponents (first for property), I will also only keep the observations I will use for the analysis: event type, fatalities, injuries, property damage, and crop damage.

myStorms <- storms[storms$PROPDMGEXP == "K" | 
                           storms$PROPDMGEXP == "M" | 
                           storms$PROPDMGEXP == "B" | 
                           storms$PROPDMGEXP == "",c(8, 23:28)]

Now, I keep only the observations with properly coded exponents for the crop damage.

myStorms <- myStorms[myStorms$CROPDMGEXP == "K" | 
                           myStorms$CROPDMGEXP == "M" | 
                           myStorms$CROPDMGEXP == "B" | 
                           myStorms$CROPDMGEXP == "",]

For my analysis, I want the actual dollar amounts for the damage estimates, so I make new columns for the property and crop damage estimates based on the 3 significant figures and the exponent. I use mutate and nested ifelse statements to make the new column that contains the value in dollars.

myStorms <- myStorms %>% 
        mutate(propdollars = ifelse(PROPDMGEXP == "K", PROPDMG*1e3, 
                                    ifelse(PROPDMGEXP == "M", PROPDMG*1e6, 
                                           ifelse(PROPDMGEXP == "B", PROPDMG*1e9, PROPDMG))))
myStorms <- myStorms %>% 
        mutate(cropdollars = ifelse(CROPDMGEXP == "K", CROPDMG*1e3, 
                                    ifelse(CROPDMGEXP == "M", CROPDMG*1e6, 
                                           ifelse(CROPDMGEXP == "B", CROPDMG*1e9, CROPDMG))))

The documentation for this data set has 48 officially recognized event types, but there are actually 985 unique entries in the event type column. For example, the documentation includes the event types “WINTER WEATHER” and “WINTER STORM”, but actual entries in the event type column also include “WINTER STORMS”, “WINTERY MIX”, “WINTRY MIX”, “WINTER WEATHER/MIX”, “WINTER WEATHER MIX”, and so forth. To get a handle on the wide variety of event types recorded in this data set, I chose to make a new set of 28 event types (including “Other”), based roughly on the event types from documentation but combining several together. My scheme for assigning event type is to first assign everything to type “Other”, then to use regex to step through the other 27 event types and see if the original event type from the data set matches any of my new event types. In this type of scheme, the order of the regex searches matters, as later searches will overwrite previous ones. Thus, I have put the search for “Tornado” after the search for “Thunderstorm Wind” since they often were coded together but it seems more appropriate to me to put those events in the “Tornado” category. As another example, the category for “Marine Thunderstorm Wind” should be searched after the category for “Thunderstorm Wind”.

myStorms$stormtype <- "Other"
myStorms$stormtype[grepl("^.*?avalanche.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Avalanche"
myStorms$stormtype[grepl("^.*?drought.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Drought"
myStorms$stormtype[grepl("^.*?dust.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Dust Devil/Storm"
myStorms$stormtype[grepl("^.*?fire.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Wildfire"
myStorms$stormtype[grepl("^.*?volcanic.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Volcanic Ash/Eruption"
myStorms$stormtype[grepl("^.*?tsunami.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Tsunami"
myStorms$stormtype[grepl("^.*?landslide.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Landslide"
myStorms$stormtype[grepl("^.*?tropical storm|hurricane.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Hurricane/Tropical Storm"
myStorms$stormtype[grepl("^.*?tropical depression.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Tropical Depression"
myStorms$stormtype[grepl("^.*?seiche.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Seiche"
myStorms$stormtype[grepl("^.*?surf.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "High Surf"
myStorms$stormtype[grepl("^.*?wind.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Wind"
myStorms$stormtype[grepl("^.*?astronomical.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Astronomical Low Tide"
myStorms$stormtype[grepl("^.*?current.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Rip Current"
myStorms$stormtype[grepl("^.*?storm surge|high seas.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Storm Surge/Tide"
myStorms$stormtype[grepl("^.*?rain.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Heavy Rain"
myStorms$stormtype[grepl("^.*?fog.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Dense Fog"
myStorms$stormtype[grepl("^.*?smoke.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Dense Smoke"
myStorms$stormtype[grepl("^.*?tstm|thunderstorm.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Thunderstorm Wind"
myStorms$stormtype[grepl("^.*?lightning.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Lightning"
myStorms$stormtype[grepl("^.*?tornado|waterspout|funnel.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Tornado/Funnel"
myStorms$stormtype[grepl("^.*?flood|fld.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Flooding"
myStorms$stormtype[grepl("^.*?wint|cold|snow|sleet|ic[ey]|freez|frost|chill|blizzard.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Winter/Cold"
myStorms$stormtype[grepl("^.*?hail.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Hail"
myStorms$stormtype[grepl("^.*?heat|warm.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Summer/Heat"
myStorms$stormtype[grepl("^.*?marine.*?wind.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Marine Wind"
myStorms$stormtype[grepl("^.*?marine.*?tstm|thunderstorm.*?$", myStorms$EVTYPE, 
                         ignore.case = TRUE)] <- "Marine Thunderstorm Wind"
myStorms$stormtype <- as.factor(myStorms$stormtype)

This approach is flexible and could be adjusted for different needs. For example, this approach could be used to exactly match the 48 official event types from the documentation, or to look at the differences in consequences between wind chill and snow events, for example. My scheme here is intended to get a general idea of all the types of events included in the storm data set.

Results

Let’s look at how the storm events in the data set are distributed among the 28 event types that I assigned.

ggplot(myStorms, aes(x=stormtype, fill = stormtype)) + 
        geom_bar(state = "identity") +
        theme(legend.position = "none", axis.title.x = element_blank(), 
              axis.text.x= element_text(angle=45, hjust = 1)) +
        ggtitle("Storm Events in the Data Set") +
        ylab("Number of storm events")

This figure illustrates how many storm events of each type there are in the data set. Hail events are most numerous in this data set, followed by thunderstorm wind events and then marine thunderstorm wind events.

Storm Events and Human Health

To analyze the impact of storm events on human health, I use dplyr to group the storm events by storm type and then sum up the total deaths and injuries for each storm type. By summing (rather than taking another estimator such as the mean or median), I am looking at the total effect of all events of a certain type.

health <- myStorms %>% group_by(stormtype) %>% 
        summarize(deaths = sum(FATALITIES), injuries = sum(INJURIES))

To find the storm types that have caused the most deaths in this data set, I order the storm types by number of deaths and then look at the top 10.

deaths <- health[order(health$deaths, decreasing = TRUE), c(1:2)]
deaths <- deaths[c(1:10),]
deaths
## Source: local data frame [10 x 2]
## 
##                   stormtype deaths
##                      (fctr)  (dbl)
## 1            Tornado/Funnel   5636
## 2               Summer/Heat   3178
## 3                  Flooding   1553
## 4               Winter/Cold   1116
## 5                 Lightning    817
## 6               Rip Current    577
## 7         Thunderstorm Wind    505
## 8                      Wind    433
## 9                 Avalanche    224
## 10 Marine Thunderstorm Wind    219

Tornadoes have caused the most deaths, about as many as the next three storm types (intense summer heat, flooding, and the various winter weather storms) put together.

I can repeat the same process for injuries.

injuries <- health[order(health$injuries, decreasing = TRUE), c(1,3)]
injuries <- injuries[c(1:10),]
injuries
## Source: local data frame [10 x 2]
## 
##                   stormtype injuries
##                      (fctr)    (dbl)
## 1            Tornado/Funnel    91378
## 2               Summer/Heat     9243
## 3                  Flooding     8681
## 4         Thunderstorm Wind     6962
## 5               Winter/Cold     6485
## 6                 Lightning     5231
## 7  Marine Thunderstorm Wind     2486
## 8                      Wind     1846
## 9  Hurricane/Tropical Storm     1711
## 10                 Wildfire     1608

Again, the most people have been injured by tornadoes, about 10 times more than the next most common cause of injuries among these storm events, intense summer heat.

These numbers are measures of the total health impact over time of certain storm types, but storm types also vary in their health impact per event. Some storm types may be rare but have a high number of deaths/injuries each time they occur. Let’s look at the distribution of death and injuries per event for the most harmful storm types for human health.

I use melt to reshape the dataframe for plotting. I only include the storm types that were in the top 10 for either total deaths or injuries.

meltedDF <- melt(myStorms[,c(10,2:3)])
## Using stormtype as id variables
colnames(meltedDF) <- c("stormtype", "type", "number")
meltedDF$type <- factor(c("Deaths","Injuries"))

ggplot(meltedDF[meltedDF$stormtype %in% deaths$stormtype | 
                        meltedDF$stormtype %in% injuries$stormtype,], 
       aes(x = stormtype, y = number, fill = type)) +
        facet_wrap(~type, ncol = 1) +
        theme(legend.position = "none", axis.title.x = element_blank(), 
              axis.text.x= element_text(angle=45, hjust = 1)) +
        ggtitle("Storm Events and Human Health") +
        ylab("Number of harmful health events per storm event") +
        geom_boxplot() + scale_y_log10()

This box plot shows the distribution of deaths and injuries per event for each type of storm event; notice that these are log plots. Deaths and injuries are quite rare in storm events. The median is zero or close to zero for most of these types, although the outliers can reach quite high for many of these event types. The overall pattern for deaths and injuries per event look quite similar to each other, even in the outliers.

Storm Events and the Economy

To examine the economic impact of storm events, I take a similar approach as above. I group the storm events by storm type and then sum up the total property and crop damage for each storm type. By summing, I am examining the total effect of all events of a certain type. One detail to keep in mind is that the damage estimates were recorded at the time of each storm and have not been adjusted in any way for inflation, etc. This should not have a significant effect on comparing between storm types unless certain types were recorded more or less often in the past than today.

economy <- myStorms %>% group_by(stormtype) %>% 
        summarize(property = sum(propdollars), cropdamage = sum(cropdollars))

propertydamage <- economy[order(economy$property, decreasing = TRUE), c(1:2)]
propertydamage <- propertydamage[c(1:10),]
propertydamage
## Source: local data frame [10 x 2]
## 
##                   stormtype     property
##                      (fctr)        (dbl)
## 1                  Flooding 167574306194
## 2  Hurricane/Tropical Storm  92350570560
## 3            Tornado/Funnel  56991181983
## 4          Storm Surge/Tide  47964739500
## 5                      Hail  17613798277
## 6               Winter/Cold  12713642910
## 7                  Wildfire   8496628500
## 8  Marine Thunderstorm Wind   6435243317
## 9                      Wind   6171580653
## 10        Thunderstorm Wind   4493691940
cropdamage <- economy[order(economy$cropdamage, decreasing = TRUE), c(1,3)]
cropdamage <- cropdamage[c(1:10),]
cropdamage
## Source: local data frame [10 x 2]
## 
##                   stormtype  cropdamage
##                      (fctr)       (dbl)
## 1                   Drought 13972566000
## 2                  Flooding 12382947200
## 3               Winter/Cold  8752124450
## 4  Hurricane/Tropical Storm  6190188800
## 5                      Hail  3088666853
## 6               Summer/Heat   904479280
## 7                Heavy Rain   806162800
## 8                      Wind   760815400
## 9  Marine Thunderstorm Wind   652998808
## 10        Thunderstorm Wind   554007350

Flooding has caused the most property damage over time by far while drought and flooding have caused the most crop damage, in close to equal measure.

Again, let’s look at how individual storm events cause property and crop damage. I’ll include here the top 10 most harmful storm types for either property or crop damage.

meltedDF <- melt(myStorms[,c(8:10)])
## Using stormtype as id variables
colnames(meltedDF) <- c("stormtype", "damage", "dollars")
meltedDF$damage <- factor(c("Property Damage","Crop Damage"))

ggplot(meltedDF[meltedDF$stormtype %in% propertydamage$stormtype | 
                        meltedDF$stormtype %in% cropdamage$stormtype,], 
       aes(x = stormtype, y = dollars, fill = damage)) +
        facet_wrap(~damage, ncol = 1) +
        theme(legend.position = "none", axis.title.x = element_blank(), 
              axis.text.x= element_text(angle=45, hjust = 1)) +
        ggtitle("Storm Events and the Economy") +
        ylab("Damage estimate per storm event (dollars)") +
        geom_boxplot() + scale_y_log10()

This box plot shows the distribution of crop and property damage per event for each type of storm event; notice that this is a log plot again. The pattern of crop and property damage per event is very similar across event type, with droughts and hurricanes causing a high level of damage per event. The difference in total crop and property damage over time must be because some events, such as drought, do not always cause property damage, or for such damage to be reported in the same way as crop damage.