1. Synopsis

Storms across the U.S. cause billions of dollars in damage every year. Even worse, storms can kill. Between 1990 and 2011 there were over 230,000 storms that caused damage and/or casualties. The deadliest types of weather events were tornadoes, heat waves, and thunderstorm wind. These three types of events injured or killed almost 49,000 people during that period. The most economically devastating types of weather-related events were floods, hurricanes/typhoons, and high wind. Altogether, these three types of events cost over 300 billion dollars in the 22 years from 1990 through 2011. Given the amount of damage caused, it is interesting that most storms take place over less than half the area of the continental U.S., mostly in the east and eastern Midwest. Storms are particularly concentrated in what we can call the “storm center of America,” namely Iowa.

2. Data Processing

2.1. Loading Packages

library(dplyr)
library(reshape2)
library(ggplot2)
library(ggmap)
library(lubridate)
library(jsonlite)

2.2. Loading Data

The first thing to do after loading the required packages is to load the data set. The data was originally obtained from the National Oceanic and Atmospheric Administration (NOAA). Documentation on the data is available here, here, here, and also here.

There are 902,297 observations of 37 columns total in our data (downloaded from the course project site at the URL in the code block below). We will only need 10 of the columns:

  • BGN_DATE: this is the date on which the storm began. Once loaded, we convert this column to R's Date class (lubridate is used only to extract the year).
  • EVTYPE: this is the type of storm event. This column requires significant cleaning (see below).
  • FATALITIES, INJURIES: together, these give a measure of how bad the storm was in terms of population health.
  • PROPDMG, PROPDMGEXP and CROPDMG, CROPDMGEXP: the columns PROPDMG and CROPDMG give the mantissa and the columns PROPDMGEXP and CROPDMGEXP give the exponent for the values of property damage and crop damage, respectively. These columns give a measure of the economic impact of the storm.
  • LATITUDE, LONGITUDE: the latitude and longitude where the storm began. These values enable us to plot the storms.

Since very few types of events other than tornadoes are available in the data before 1990, we choose to restrict our analysis to the 1990-2011 period.

thefileurl <- 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
if (!file.exists('repdata_data_StormData.csv.bz2')) {
    download.file(thefileurl, 'repdata_data_StormData.csv.bz2')
}
data <- read.csv('repdata_data_StormData.csv.bz2', 
                  header = TRUE, 
                  sep = ',', 
                  nrows = 902298,
                  na.strings = '',
                  colClasses = c('NULL', 
                                 'character',
                                 rep('NULL', 5),
                                 'character',
                                 rep('NULL', 14),
                                 rep('numeric', 3),
                                 'character',
                                 'numeric', 
                                 'character',
                                 rep('NULL', 3),
                                 rep('numeric',2),
                                 rep('NULL', 4))) %>%
    mutate(DATE = as.Date(BGN_DATE, format = '%m/%d/%Y %H:%M:%S')) %>%
    filter(year(DATE) > 1989)
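
The two techniques in the call above, dropping columns at read time with 'NULL' entries in colClasses and parsing the timestamp string into a Date, can be tried on a tiny inline table. This is only a sketch: the three-row CSV below is made up for illustration, and it has far fewer columns than the real file.

```r
# Made-up miniature of the storm file (illustrative values only)
csv_text <- "STATE__,BGN_DATE,EVTYPE,FATALITIES
1,4/18/1950 0:00:00,TORNADO,0
1,6/12/1990 0:00:00,HAIL,0
2,8/29/2005 0:00:00,HURRICANE/TYPHOON,3"

# "NULL" in colClasses skips a column entirely, so it is never stored
demo <- read.csv(text = csv_text,
                 colClasses = c("NULL", "character", "character", "numeric"))

# Parse the timestamp; as.Date keeps only the calendar date
demo$DATE <- as.Date(demo$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")

# Keep 1990 onward, using base R to extract the year
demo_recent <- demo[as.integer(format(demo$DATE, "%Y")) > 1989, ]
print(demo_recent[, c("EVTYPE", "DATE")])
```

Skipping columns this way keeps memory use down, which matters for a file with roughly 900,000 rows and 37 columns.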

2.3. Pre-Processing

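According to the FAQ here, the first two digits of the latitude and longitude are degrees, the next two digits are minutes, and the last two digits (after the decimal point) are seconds. Also, latitude is degrees north (positive) whereas longitude is degrees west (negative). Dividing by 100 and negating the longitude gives approximate decimal coordinates: it treats the minutes as hundredths of a degree, so a value can be off by up to about 0.4 degrees, which is acceptable for the coarse density map in section 3.3.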

data <- mutate(data, LAT = LATITUDE/100, LNG = -LONGITUDE/100)
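
If exact decimal degrees were needed, the minutes would have to be divided by 60 rather than read as a decimal fraction. The following is a minimal sketch, assuming each coordinate is stored as degrees * 100 + minutes (my reading of the FAQ description); the function name ddmm_to_decimal and the sample values are hypothetical.

```r
# Assumption: coordinates are stored as degrees * 100 + minutes,
# e.g. 4130 means 41 degrees 30 minutes
ddmm_to_decimal <- function(x) {
  degrees <- x %/% 100   # integer part: whole degrees
  minutes <- x %% 100    # remainder: minutes of arc
  degrees + minutes / 60
}

lat_raw <- c(4130, 3015)   # hypothetical example values
lng_raw <- c(9345, 8815)

lat <- ddmm_to_decimal(lat_raw)
lng <- -ddmm_to_decimal(lng_raw)   # west longitudes are negative
print(data.frame(lat, lng))
```

For 4130 this gives 41.5 degrees, whereas dividing by 100 gives 41.30.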

Now for the event types. There are only 48 official event types according to NOAA (see here, for example). However, when we count the distinct values in the EVTYPE column, there are almost 900. Most of these are simply different ways people have of entering the same 48 event types.

data$EVTYPE <- toupper(data$EVTYPE)
nrow(distinct(data, EVTYPE))
## [1] 898

In order to ensure the most consistent translation of the extra EVTYPE values, I manually provided the map between official event types and actual event types. To facilitate this, I created a program in JavaScript that captured my input for each actual type and matched it to the official type that I indicated. The result was a JSON file I could easily load in R using the jsonlite package. (The file ‘event_matches.json’ is available here.)

matches <- fromJSON('event_matches.json')
head(matches, 2)
## $HAIL
##  [1] "HAIL"                    "THUNDERSTORM WINDS/HAIL"
##  [3] "THUNDERSTORM WINDS HAIL" "HAIL 1.75)"             
##  [5] "HAIL STORM"              "HAIL 75"                
##  [7] "SMALL HAIL"              "HAIL 80"                
##  [9] "FUNNEL CLOUD/HAIL"       "HAIL 0.75"              
## [11] "HAIL 1.00"               "HAIL/WINDS"             
## [13] "HAIL/WIND"               "HAIL 1.75"              
## [15] "WIND/HAIL"               "HAIL 225"               
## [17] "HAIL 0.88"               "DEEP HAIL"              
## [19] "HAIL 88"                 "HAIL 175"               
## [21] "HAIL 100"                "HAIL 150"               
## [23] "HAIL 075"                "HAIL 125"               
## [25] "HAIL 200"                "HAIL DAMAGE"            
## [27] "THUNDERSTORM HAIL"       "HAIL 088"               
## [29] "THUNDERSTORM WINDSHAIL"  "HAIL/ICY ROADS"         
## [31] "HAIL ALOFT"              "THUNDERSTORM WIND/HAIL" 
## [33] "HAIL 275"                "HAIL 450"               
## [35] "HAILSTORM"               "HAILSTORMS"             
## [37] "TSTM WIND/HAIL"          "HAIL(0.75)"             
## [39] "GUSTY WIND/HAIL"         "LATE SEASON HAIL"       
## [41] "NON SEVERE HAIL"        
## 
## $`THUNDERSTORM WIND`
##   [1] "TSTM WIND"                      "THUNDERSTORM WINDS"            
##   [3] "THUNDERSTORM WIND"              "THUNDERSTORM WINS"             
##   [5] "THUNDERSTORM WINDS LIGHTNING"   "THUNDERSTORM"                  
##   [7] "THUNDERSTORM WINDS/FUNNEL CLOU" "SEVERE THUNDERSTORM"           
##   [9] "SEVERE THUNDERSTORMS"           "SEVERE THUNDERSTORM WINDS"     
##  [11] "THUNDERSTORMS WINDS"            "THUNDERSTORMS"                 
##  [13] "LIGHTNING THUNDERSTORM WINDSS"  "THUNDERSTORM WINDS 60"         
##  [15] "THUNDERSTORM WINDSS"            "LIGHTNING THUNDERSTORM WINDS"  
##  [17] "LIGHTNING AND THUNDERSTORM WIN" "THUNDERSTORM WINDS53"          
##  [19] "THUNDERSTORM WINDS 13"          "THUNDERSTORM WINDS SMALL STREA"
##  [21] "THUNDERSTORM WINDS 2"           "TSTM WIND 51"                  
##  [23] "TSTM WIND 50"                   "TSTM WIND 52"                  
##  [25] "TSTM WIND 55"                   "THUNDERSTORM WINDS 61"         
##  [27] "THUNDERSTORM DAMAGE"            "THUNDERTORM WINDS"             
##  [29] "THUNDERSTORMW 50"               "THUNDERSTORMS WIND"            
##  [31] "THUNDERSTORM  WINDS"            "TUNDERSTORM WIND"              
##  [33] "THUNDERTSORM WIND"              "THUNDERSTORM WINDS/ HAIL"      
##  [35] "THUNDERSTORM WIND/LIGHTNING"    "THUNDESTORM WINDS"             
##  [37] "THUNDERSTORM WIND G50"          "THUNDERSTORM WINDS/HEAVY RAIN" 
##  [39] "THUNDERSTROM WINDS"             "THUNDERSTORM WINDS      LE CEN"
##  [41] "THUNDERSTORM WINDS G"           "THUNDERSTORM WIND G60"         
##  [43] "THUNDERSTORM WINDS."            "THUNDERSTORM WIND G55"         
##  [45] "THUNDERSTORM WINDS G60"         "THUNDERSTORM WINDS FUNNEL CLOU"
##  [47] "THUNDERSTORM WINDS 62"          "THUNDERSTORM WINDS 53"         
##  [49] "THUNDERSTORM WIND 59"           "THUNDERSTORM WIND 52"          
##  [51] "THUNDERSTORM WIND 69"           "TSTM WIND G58"                 
##  [53] "THUNDERSTORMW WINDS"            "THUNDERSTORM WIND 60 MPH"      
##  [55] "THUNDERSTORM WIND 65MPH"        "THUNDERSTORM WIND/ TREES"      
##  [57] "THUNDERSTORM WIND/AWNING"       "THUNDERSTORM WIND 98 MPH"      
##  [59] "THUNDERSTORM WIND TREES"        "THUNDERSTORM WIND 59 MPH"      
##  [61] "THUNDERSTORM WINDS 63 MPH"      "THUNDERSTORM WIND/ TREE"       
##  [63] "THUNDERSTORM DAMAGE TO"         "THUNDERSTORM WIND 65 MPH"      
##  [65] "THUNDERSTORM WIND."             "THUNDERSTORM WIND 59 MPH."     
##  [67] "THUDERSTORM WINDS"              "THUNDERSTORM WINDS AND"        
##  [69] "TSTM WIND DAMAGE"               "THUNDERSTORM WINDS 50"         
##  [71] "THUNDERSTORM WIND G52"          "THUNDERSTORM WINDS 52"         
##  [73] "THUNDERSTORM WIND G51"          "THUNDERSTORM WIND G61"         
##  [75] "THUNDERESTORM WINDS"            "THUNDEERSTORM WINDS"           
##  [77] "THUNDERSTORM W INDS"            "THUNDERSTORM WIND 50"          
##  [79] "THUNERSTORM WINDS"              "THUNDERSTORM WIND 56"          
##  [81] "THUNDERSTORMW"                  "TSTM WINDS"                    
##  [83] "TSTMW"                          "TSTM WIND 65)"                 
##  [85] "THUNDERSTORM WINDS/ FLOOD"      "THUNDERSTORMWINDS"             
##  [87] "THUNDERSTORM WINDS HEAVY RAIN"  "THUNDERSTROM WIND"             
##  [89] "METRO STORM, MAY 26"            "TSTM WIND (G45)"               
##  [91] "TSTM WIND 40"                   "TSTM WIND 45"                  
##  [93] "TSTM WIND (41)"                 "TSTM WIND (G40)"               
##  [95] "TSTM WND"                       " TSTM WIND"                    
##  [97] "TSTM WIND AND LIGHTNING"        " TSTM WIND (G45)"              
##  [99] "TSTM WIND  (G45)"               "TSTM WIND (G35)"               
## [101] "TSTM"                           "TSTM WIND G45"                 
## [103] "THUNDERSTORM WIND (G40)"        "GUSTY THUNDERSTORM WINDS"      
## [105] "GUSTY THUNDERSTORM WIND"

The variable matches is a named list of character vectors. Each name (for example ‘HAIL’) is an official event type, and the corresponding values (e.g. matches[[‘HAIL’]]) are the actual entries in the data, which I matched to that type to the best of my knowledge. To get final results that conform to the 48 official event types (plus one event type “none” that I used for incomprehensible entries), we just set each element of the data that is in a particular list to the name of that list, as in the following code block.

# Replace every raw EVTYPE value with the official type it was matched to
for (official in names(matches)) {
    data$EVTYPE[data$EVTYPE %in% matches[[official]]] <- official
}
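
The same mapping can also be expressed as a single lookup instead of one pass per official type. The sketch below inverts a small, made-up stand-in for matches (the names matches_demo, lookup, and ev are illustrative, not part of the analysis) into a named vector keyed by the raw value; entries with no match are left unchanged.

```r
# Made-up, two-entry stand-in for the real `matches` list
matches_demo <- list(
  HAIL                = c("HAIL", "SMALL HAIL"),
  `THUNDERSTORM WIND` = c("TSTM WIND", "THUNDERSTORM WINDS")
)

# Invert the list: names are raw values, elements are official types
lookup <- setNames(rep(names(matches_demo), lengths(matches_demo)),
                   unlist(matches_demo, use.names = FALSE))

ev <- c("SMALL HAIL", "TSTM WIND", "HAIL", "VOLCANIC ASH")

# Look up every raw value at once; unmatched values come back as NA
mapped  <- unname(lookup[ev])
cleaned <- ifelse(is.na(mapped), ev, mapped)
print(cleaned)
```

Unlike the loop, this version maps each raw value exactly once, so the result does not depend on the order of the list entries.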

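Now that EVTYPE is taken care of, it remains to deal with the property damage and crop damage columns. First, the exponents. People had several ways of entering exponents, which we can encode as numeric multipliers as follows. Note that any value not listed explicitly, including a missing exponent, falls through to the final default (10^8 for property damage, 10^2 for crop damage).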

data <- mutate(data, PROPDMGEXP=ifelse(PROPDMGEXP %in% c("M", "m", "6"), 1000000,
                                  ifelse(PROPDMGEXP %in% c("K", "k", "3"), 1000,
                                    ifelse(PROPDMGEXP %in% c("B", "b", "9"), 1000000000,
                                      ifelse(PROPDMGEXP %in% c("+", "0", "?", "-"), 1,
                                        ifelse(PROPDMGEXP %in% c("5"), 100000,
                                           ifelse(PROPDMGEXP %in% c("4"), 10000,
                                              ifelse(PROPDMGEXP %in% c("2", "h", "H"), 100,
                                                 ifelse(PROPDMGEXP %in% c("7"), 10000000, 100000000)))))))))

data <- mutate(data, CROPDMGEXP=ifelse(CROPDMGEXP %in% c("M", "m"), 1000000,
                                  ifelse(CROPDMGEXP %in% c("K", "k"), 1000,
                                     ifelse(CROPDMGEXP %in% c("B", "b"), 1000000000,
                                        ifelse(CROPDMGEXP %in% c("?", "0"), 1, 100)))))
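
The nested ifelse chains can be hard to audit. One alternative, shown here only as a sketch, is a small helper that handles letter codes through a lookup vector and digit codes as powers of ten. The helper name exp_multiplier is hypothetical, and it deliberately treats unknown or missing codes as a multiplier of 1, which differs from the defaults used in the code above.

```r
# Hypothetical helper: convert an exponent code to a numeric multiplier.
# Assumption: unknown or missing codes mean "no scaling" (multiplier 1).
exp_multiplier <- function(code, default = 1) {
  code    <- toupper(code)
  letters <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

  out   <- unname(letters[code])               # NA for anything not H/K/M/B
  digit <- suppressWarnings(as.numeric(code))  # NA for non-numeric codes

  is_digit      <- !is.na(digit)
  out[is_digit] <- 10^digit[is_digit]          # "5" becomes 1e5
  out[is.na(out)] <- default                   # "+", "?", "-", NA
  out
}

print(exp_multiplier(c("K", "m", "5", "?", NA)))
```

Keeping the default as an explicit argument makes the treatment of unrecognized codes visible to the reader.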

Once that’s done, we can compute the total damage as well as “casualties,” a term which I am using to encompass both injuries and fatalities.

data <- mutate(data, TOTALDMG = (PROPDMG * PROPDMGEXP) + (CROPDMG * CROPDMGEXP))
data <- mutate(data, CASUALTIES = INJURIES + FATALITIES)

Finally, we are only interested in storms that had either property/crop damage or casualties (or both), so we keep only those observations, obtaining 230,112 cleaned observations for the analysis.

data <- filter(data, (TOTALDMG != 0) | (CASUALTIES != 0))
nrow(data)
## [1] 230112

3. Results

3.1. Types of Storms Most Harmful to Population Health

First, let’s look at what types of storms are most harmful to population health, measured as the overall number of casualties for that storm type. From the summary of casualties, we can see that the top three types of events are tornadoes, heat waves, and thunderstorm wind. Together, these three account for almost 49,000 casualties (including over 5,000 deaths).

casualty_summary <- group_by(data, EVTYPE) %>% 
  summarize(CASUALTIES = sum(CASUALTIES), INJURIES = sum(INJURIES), FATALITIES = sum(FATALITIES))
casualty_summary <- casualty_summary[order(-casualty_summary$CASUALTIES),][1:10,]
head(casualty_summary, 3)
## # A tibble: 3 × 4
##              EVTYPE CASUALTIES INJURIES FATALITIES
##               <chr>      <dbl>    <dbl>      <dbl>
## 1           TORNADO      28469    26692       1777
## 2              HEAT      12372     9229       3143
## 3 THUNDERSTORM WIND       8007     7478        529
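
The totals quoted above can be checked directly from the three rows of output. The snippet copies those numbers into a small data frame (the names top3, total_casualties, and total_fatalities are introduced here for illustration).

```r
# Top three rows, copied from the summary output above
top3 <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "THUNDERSTORM WIND"),
  CASUALTIES = c(28469, 12372, 8007),
  INJURIES   = c(26692, 9229, 7478),
  FATALITIES = c(1777, 3143, 529)
)

# Each row should satisfy CASUALTIES = INJURIES + FATALITIES
stopifnot(all(top3$CASUALTIES == top3$INJURIES + top3$FATALITIES))

total_casualties <- sum(top3$CASUALTIES)   # 48848
total_fatalities <- sum(top3$FATALITIES)   # 5449

# Share of casualties that were fatal, per event type (percent)
top3$FATAL_PCT <- round(100 * top3$FATALITIES / top3$CASUALTIES, 1)
print(top3)
```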

Next, let’s graph the top ten event types by casualty. To do this, we need to specify the factor levels of EVTYPE (so the graph will be in order of most to least). To get a stacked bar plot, we also need to translate casualty_summary from its “wide” format to the corresponding “narrow” format; we use the function melt for this. Then we can see, graphically, the impact on population health of the top ten event types.

casualty_summary$EVTYPE<-factor(casualty_summary$EVTYPE,levels=unique(casualty_summary$EVTYPE))
m_casualty_summary <- melt(casualty_summary, id.vars='EVTYPE', measure.vars=c('INJURIES', 'FATALITIES'))
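
For readers unfamiliar with the wide-to-narrow reshaping that melt performs, here is a base R sketch of the same idea on the top three rows of the summary. It is an illustration of the data layout, not a replacement for the melt call; the names wide, measures, and long are introduced here.

```r
# Top three rows of the casualty summary, in "wide" format
wide <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT", "THUNDERSTORM WIND"),
  INJURIES   = c(26692, 9229, 7478),
  FATALITIES = c(1777, 3143, 529)
)

# "Narrow" format: one row per (event type, measure) pair
measures <- c("INJURIES", "FATALITIES")
long <- data.frame(
  EVTYPE   = rep(wide$EVTYPE, times = length(measures)),
  variable = rep(measures, each = nrow(wide)),
  value    = unlist(wide[measures], use.names = FALSE)
)
print(long)
```

In the narrow layout, ggplot can map variable to the fill colour and value to the bar height, which is what produces the stacked bars.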

ggplot(m_casualty_summary, aes(x=EVTYPE, y=value, fill=variable)) +
  geom_bar(stat='identity') +
  scale_fill_brewer(palette='Paired') +
  labs(x="Event Type", y="Casualties", 
       title="Casualties by Storm Type\nTop Ten Types", fill="Casualties") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

Figure 1. Casualties by storm type, both fatalities and injuries included.

Tornadoes are far and away the biggest source of injuries. However, heat waves are the biggest source of fatalities, by a fair margin. After heat waves and tornadoes, flash floods are the worst for deaths.

3.2. Types of Storms With Greatest Economic Consequences

Now let’s look at what types of weather-related events have the biggest economic consequences, measured as the total property and crop damage caused. From the following summary, we can see that the top three types of economically damaging events are floods, hurricanes/typhoons, and high winds. Together, these types of events are responsible for over 300 billion dollars worth of damage over the period from 1990 through 2011.

fiscal_summary <- group_by(data, EVTYPE) %>% 
  summarize(TOTALDMG = sum(TOTALDMG), PROPDMG = sum(PROPDMG*PROPDMGEXP), CROPDMG = sum(CROPDMG*CROPDMGEXP))
fiscal_summary <- fiscal_summary[order(-fiscal_summary$TOTALDMG),][1:10,]
options(scipen = -5, digits = 4)
head(fiscal_summary, 3)
## # A tibble: 3 × 4
##              EVTYPE  TOTALDMG   PROPDMG   CROPDMG
##               <chr>     <dbl>     <dbl>     <dbl>
## 1             FLOOD 1.633e+11 1.524e+11 1.095e+10
## 2 HURRICANE/TYPHOON 9.087e+10 8.536e+10 5.516e+09
## 3         HIGH WIND 5.046e+10 4.971e+10 7.500e+08

Next, let’s graph the top ten event types by economic loss. As for casualties, we need to specify the factor levels of EVTYPE (so the graph will be in order of most to least). To get a stacked bar plot, we also need to translate fiscal_summary from its “wide” format to the corresponding “narrow” format; we use the function melt for this.

fiscal_summary$EVTYPE<-factor(fiscal_summary$EVTYPE,levels=unique(fiscal_summary$EVTYPE))
m_fiscal_summary <- melt(fiscal_summary, id.vars='EVTYPE', measure.vars=c('PROPDMG', 'CROPDMG'))

Then we can see, graphically, the economic consequences of the top ten event types.

ggplot(m_fiscal_summary, aes(x=EVTYPE, y=value, fill=variable)) +
  geom_bar(stat="identity") +
  scale_fill_brewer(palette='Paired') +
  labs(x="Event Type", y="Economic Losses", 
       title="Economic Losses by Storm Type\nTop Ten Types", fill="Loss Type") +
  theme(axis.text.x=element_text(angle=45, hjust=1))

Figure 2. Total economic losses for the ten most costly storm types, broken out into property and crop damage.

By far, most of the overall economic loss comes from floods. In terms of crop damage, however, drought is the clear leader, followed by floods, ice, and hurricanes/typhoons.

3.3. The Storm Center of the U.S.

Out of curiosity, I plotted the storm events on a map of the continental United States to see where the majority of storms occurred. It turns out most storms take place in the eastern half of the country. Note that the map does not show storms from Hawaii or Alaska. I was surprised at the complete lack of significant clusters of events in the west. The data appear to be weighted toward the states east of the Rocky Mountains, which endure more storms.

qmap("United States", zoom=4) +
  stat_density2d(aes(x=LNG, y=LAT, fill=..level.., alpha=0.5), data=data, geom="polygon") +
  scale_fill_gradient(low='#fdbe85', high='#a63603') +
  theme(legend.position='none')

Figure 3. Storm density by location across continental U.S. Darker color represents higher number of storms in that area.

There is clearly a concentration of storms over the state immediately west of Illinois, which is Iowa. Perhaps Iowans don’t know it, but Iowa seems to be the storm capital of America, a dubious honor.
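
A cruder way to locate such a concentration without a map is to count points on a coarse grid and pick the fullest cell. The sketch below does this on synthetic points (not the storm data): a made-up cluster plus uniform background noise. It is a simple binning substitute for the kernel density estimate that stat_density2d draws, and all names and values in it are illustrative.

```r
set.seed(1)

# Synthetic coordinates: one dense cluster plus scattered background points
lng <- c(rnorm(500, mean = -93, sd = 1), runif(200, -120, -70))
lat <- c(rnorm(500, mean = 43, sd = 1), runif(200, 25, 50))

# Assign each point to a 2-degree cell, labelled by its south-west corner
cell_size <- 2
lng_bin <- floor(lng / cell_size) * cell_size
lat_bin <- floor(lat / cell_size) * cell_size

# Count points per cell and find the fullest one
counts <- as.data.frame(table(lng = lng_bin, lat = lat_bin),
                        stringsAsFactors = FALSE)
peak <- counts[which.max(counts$Freq), ]
peak_lng <- as.numeric(as.character(peak$lng))
peak_lat <- as.numeric(as.character(peak$lat))
cat("Densest cell starts at", peak_lng, "longitude,", peak_lat, "latitude\n")
```

Applied to the LNG and LAT columns, the same counting would give a numeric check on what the density map suggests visually.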

3.4. The Economic Burden of Storms (2002-2011)

Finally, I looked at the annual cost of storm events in the U.S. over the last ten years of the data, 2002 through 2011. As the table below shows, the American economy loses billions annually to weather-related events.

burden <- data %>% filter(year(DATE) > 2001) %>% group_by(year(DATE)) %>% summarize(COST = sum(TOTALDMG))
options(scipen = -5, digits = 4)
knitr::kable(burden)
year(DATE)   COST
2002         5.511e+09
2003         1.140e+10
2004         2.680e+10
2005         1.008e+11
2006         1.255e+11
2007         7.480e+09
2008         1.778e+10
2009         5.749e+09
2010         1.103e+10
2011         2.156e+10
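
A few summary figures make the table easier to read. The snippet below re-enters the ten annual values from the table (the column name YEAR and the variable names are introduced here) and computes the worst year, the ten-year total, and the median year.

```r
# Annual costs copied from the table above
burden_tbl <- data.frame(
  YEAR = 2002:2011,
  COST = c(5.511e09, 1.140e10, 2.680e10, 1.008e11, 1.255e11,
           7.480e09, 1.778e10, 5.749e09, 1.103e10, 2.156e10)
)

worst_year     <- burden_tbl$YEAR[which.max(burden_tbl$COST)]
total_billion  <- sum(burden_tbl$COST) / 1e9
median_billion <- median(burden_tbl$COST) / 1e9

cat("Worst year:", worst_year, "\n")
cat("Ten-year total (billions):", round(total_billion, 1), "\n")
cat("Median year (billions):", round(median_billion, 1), "\n")
```

Two years, 2005 and 2006, account for about two thirds of the ten-year total, so the median is a better guide to a typical year than the mean.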

Clearly, it is worth targeting these events with earlier warning systems (so people can prepare better and secure property more effectively) and with protective structures, such as flood walls or artificially redirected waterways.

4. Summary

Impactful weather-related events range from the thunderous and violent (tornadoes) to the silent and deadly (heat waves). Billions of dollars are lost to storms each year, as well as many lives. It is worth addressing in particular tornadoes (with earlier warning systems), heat waves (with community cooling stations), wind (with resistant structures), and floods (with flood walls or water diversion). Efforts such as these are ongoing, I know, but in spite of our best attempts, nature continues to maim and destroy. Data sets such as the NOAA storm database help us know which event types to target specifically.