Synopsis

We clean up the storm data from the NOAA and look at damages both in terms of economic damages as well as harm to human health. We find that, since 1996, the major storm causes of damage to human health are tornadoes and excessive heat, while the causers of the greatest economic consequences are floods and hurricanes, but that drought is the leading cause of crop damages in the United States.

Requirements

I’ll be using a series of R packages in the course of the analysis. magrittr, dplyr, stringr, lubridate, and tidyr will all come in handy during the various manipulations we need to perform to clean up the data and add new columns. I use the approximate matching capabilities in stringdist in order to create a cleaner verion of the event codings, and quantmod to get historical inflation numbers in order to index all dollar damage values to 1996 dollars. Finally, I use ggplot2 to plot results.

library(magrittr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringr)
library(lubridate)
library(tidyr)
library(stringdist)
## Loading required package: parallel
library(quantmod)
## Loading required package: Defaults
## Loading required package: xts
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## 
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
## 
## 
## Attaching package: 'xts'
## 
## The following objects are masked from 'package:dplyr':
## 
##     first, last
## 
## Loading required package: TTR
## Version 0.4-0 included new data defaults. See ?getSymbols.
library(ggplot2)
library(scales)

Data Processing

I start by getting the data and reading it into a data.frame, and taking a quick look at the structure:

storm_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destfile <- "data/storm_data.csv.bz2"
if (!file.exists(destfile)) download.file(storm_url, destfile, mode="wb")

storm <- read.csv("data/storm_data.csv.bz2", 
                  header=TRUE, 
                  stringsAsFactors=FALSE) %>% 
    tbl_df
names(storm) <- tolower(names(storm))
storm
## Source: local data frame [902,297 x 37]
## 
##    state__           bgn_date bgn_time time_zone county countyname state
## 1        1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2        1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3        1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
## 4        1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL
## 5        1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL
## 6        1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL
## 7        1 11/16/1951 0:00:00     0100       CST      9     BLOUNT    AL
## 8        1  1/22/1952 0:00:00     0900       CST    123 TALLAPOOSA    AL
## 9        1  2/13/1952 0:00:00     2000       CST    125 TUSCALOOSA    AL
## 10       1  2/13/1952 0:00:00     2000       CST     57    FAYETTE    AL
## ..     ...                ...      ...       ...    ...        ...   ...
## Variables not shown: evtype (chr), bgn_range (dbl), bgn_azi (chr),
##   bgn_locati (chr), end_date (chr), end_time (chr), county_end (dbl),
##   countyendn (lgl), end_range (dbl), end_azi (chr), end_locati (chr),
##   length (dbl), width (dbl), f (int), mag (dbl), fatalities (dbl),
##   injuries (dbl), propdmg (dbl), propdmgexp (chr), cropdmg (dbl),
##   cropdmgexp (chr), wfo (chr), stateoffic (chr), zonenames (chr), latitude
##   (dbl), longitude (dbl), latitude_e (dbl), longitude_ (dbl), remarks
##   (chr), refnum (dbl)
str(storm)
## Classes 'tbl_df', 'tbl' and 'data.frame':    902297 obs. of  37 variables:
##  $ state__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ bgn_date  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ bgn_time  : chr  "0130" "0145" "1600" "0900" ...
##  $ time_zone : chr  "CST" "CST" "CST" "CST" ...
##  $ county    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ countyname: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ state     : chr  "AL" "AL" "AL" "AL" ...
##  $ evtype    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ bgn_range : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ bgn_azi   : chr  "" "" "" "" ...
##  $ bgn_locati: chr  "" "" "" "" ...
##  $ end_date  : chr  "" "" "" "" ...
##  $ end_time  : chr  "" "" "" "" ...
##  $ county_end: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ countyendn: logi  NA NA NA NA NA NA ...
##  $ end_range : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ end_azi   : chr  "" "" "" "" ...
##  $ end_locati: chr  "" "" "" "" ...
##  $ length    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ width     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ f         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ mag       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ fatalities: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ injuries  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ propdmg   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ propdmgexp: chr  "K" "K" "K" "K" ...
##  $ cropdmg   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ cropdmgexp: chr  "" "" "" "" ...
##  $ wfo       : chr  "" "" "" "" ...
##  $ stateoffic: chr  "" "" "" "" ...
##  $ zonenames : chr  "" "" "" "" ...
##  $ latitude  : num  3040 3042 3340 3458 3412 ...
##  $ longitude : num  8812 8755 8742 8626 8642 ...
##  $ latitude_e: num  3051 0 0 0 0 ...
##  $ longitude_: num  8806 0 0 0 0 ...
##  $ remarks   : chr  "" "" "" "" ...
##  $ refnum    : num  1 2 3 4 5 6 7 8 9 10 ...
summary(storm)
##     state__       bgn_date           bgn_time          time_zone        
##  Min.   : 1.0   Length:902297      Length:902297      Length:902297     
##  1st Qu.:19.0   Class :character   Class :character   Class :character  
##  Median :30.0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :31.2                                                           
##  3rd Qu.:45.0                                                           
##  Max.   :95.0                                                           
##                                                                         
##      county     countyname           state              evtype         
##  Min.   :  0   Length:902297      Length:902297      Length:902297     
##  1st Qu.: 31   Class :character   Class :character   Class :character  
##  Median : 75   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :101                                                           
##  3rd Qu.:131                                                           
##  Max.   :873                                                           
##                                                                        
##    bgn_range      bgn_azi           bgn_locati          end_date        
##  Min.   :   0   Length:902297      Length:902297      Length:902297     
##  1st Qu.:   0   Class :character   Class :character   Class :character  
##  Median :   0   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :   1                                                           
##  3rd Qu.:   1                                                           
##  Max.   :3749                                                           
##                                                                         
##    end_time           county_end countyendn       end_range  
##  Length:902297      Min.   :0    Mode:logical   Min.   :  0  
##  Class :character   1st Qu.:0    NA's:902297    1st Qu.:  0  
##  Mode  :character   Median :0                   Median :  0  
##                     Mean   :0                   Mean   :  1  
##                     3rd Qu.:0                   3rd Qu.:  0  
##                     Max.   :0                   Max.   :925  
##                                                              
##    end_azi           end_locati            length           width     
##  Length:902297      Length:902297      Min.   :   0.0   Min.   :   0  
##  Class :character   Class :character   1st Qu.:   0.0   1st Qu.:   0  
##  Mode  :character   Mode  :character   Median :   0.0   Median :   0  
##                                        Mean   :   0.2   Mean   :   8  
##                                        3rd Qu.:   0.0   3rd Qu.:   0  
##                                        Max.   :2315.0   Max.   :4400  
##                                                                       
##        f               mag          fatalities     injuries     
##  Min.   :0        Min.   :    0   Min.   :  0   Min.   :   0.0  
##  1st Qu.:0        1st Qu.:    0   1st Qu.:  0   1st Qu.:   0.0  
##  Median :1        Median :   50   Median :  0   Median :   0.0  
##  Mean   :1        Mean   :   47   Mean   :  0   Mean   :   0.2  
##  3rd Qu.:1        3rd Qu.:   75   3rd Qu.:  0   3rd Qu.:   0.0  
##  Max.   :5        Max.   :22000   Max.   :583   Max.   :1700.0  
##  NA's   :843563                                                 
##     propdmg      propdmgexp           cropdmg       cropdmgexp       
##  Min.   :   0   Length:902297      Min.   :  0.0   Length:902297     
##  1st Qu.:   0   Class :character   1st Qu.:  0.0   Class :character  
##  Median :   0   Mode  :character   Median :  0.0   Mode  :character  
##  Mean   :  12                      Mean   :  1.5                     
##  3rd Qu.:   0                      3rd Qu.:  0.0                     
##  Max.   :5000                      Max.   :990.0                     
##                                                                      
##      wfo             stateoffic         zonenames            latitude   
##  Length:902297      Length:902297      Length:902297      Min.   :   0  
##  Class :character   Class :character   Class :character   1st Qu.:2802  
##  Mode  :character   Mode  :character   Mode  :character   Median :3540  
##                                                           Mean   :2875  
##                                                           3rd Qu.:4019  
##                                                           Max.   :9706  
##                                                           NA's   :47    
##    longitude        latitude_e     longitude_       remarks         
##  Min.   :-14451   Min.   :   0   Min.   :-14455   Length:902297     
##  1st Qu.:  7247   1st Qu.:   0   1st Qu.:     0   Class :character  
##  Median :  8707   Median :   0   Median :     0   Mode  :character  
##  Mean   :  6940   Mean   :1452   Mean   :  3509                     
##  3rd Qu.:  9605   3rd Qu.:3549   3rd Qu.:  8735                     
##  Max.   : 17124   Max.   :9706   Max.   :106220                     
##                   NA's   :40                                        
##      refnum      
##  Min.   :     1  
##  1st Qu.:225575  
##  Median :451149  
##  Mean   :451149  
##  3rd Qu.:676723  
##  Max.   :902297  
## 

The data is a bit messy. The questions we are looking to answer are about which event types are the most harmful and the most expensive. So I’ll focus on cleaning up the columns in the data related to event type coding, cost estimates, and the injury/fatality counts. I do this in several steps:

  1. Remove event data from before 1996. This decision is based on http://www.ncdc.noaa.gov/stormevents/details.jsp, where we see that the event type coding changed significantly in 1996.
  2. Create meaningful cost numbers out of the propdmg and cropdmg fields, where each field is accompanied by a multiplier. We use a lookup vector to convert the single letter codes to the correct multiplier number (for instance, “M” gets mapped to 1,000,000).
  3. Use the CPI to inflation-adjust the cost numbers
  4. As well as possible, match the event encodings found in the storm data to the list of 48 official event types listed in NWS Directive 10-1605

Filtering and creating the cost variables

I remove event types listed as “summary,” as the aggregated data they represent should be included unaggregated in other rows. I also replace “tstm” in the event names with “thunderstorm,” which matches the coding in NWS Directive 10-1605. Finally, I create more meaningful date fields, and I convert the propdmg and cropdmg variables to actual numbers that can be used in calculations.

# lookup vector:
expref <- c("0" = 1, "K" = 1000, "M" = 1E6, "B" = 1E9)

# filter and add cost numbers:
storm_clean <- storm %>%
    mutate(event = str_trim(tolower(evtype)),
           event = str_replace(event, "tstm", "thunderstorm"),
           bgn_date = mdy_hms(bgn_date),
           year = year(bgn_date),
           propdmgexp = ifelse(propdmgexp == "", 0, propdmgexp),
           property_damage = propdmg * expref[propdmgexp],
           cropdmgexp = ifelse(cropdmgexp == "", 0, cropdmgexp),
           crop_damage = cropdmg * expref[cropdmgexp],
           total_damage = property_damage+crop_damage) %>%
    filter(year >= 1996, 
           !str_detect(event, "summary"))

Adding inflation data

I use the package quantmod to get the monthly inflation numbers from FRED, and use them to create real dollar values for the damages:

cpi <- data.frame(getSymbols("CPIAUCSL", src='FRED', auto.assign=FALSE, warnings=FALSE))
##     As of 0.4-0, 'getSymbols' uses env=parent.frame() and
##  auto.assign=TRUE by default.
## 
##  This  behavior  will be  phased out in 0.5-0  when the call  will
##  default to use auto.assign=FALSE. getOption("getSymbols.env") and 
##  getOptions("getSymbols.auto.assign") are now checked for alternate defaults
## 
##  This message is shown once per session and may be disabled by setting 
##  options("getSymbols.warning4.0"=FALSE). See ?getSymbol for more details
cpi$date <- ymd(row.names(cpi))
cpi$index <- cpi$CPIAUCSL/cpi["1996-01-01","CPIAUCSL"]

storm_clean <- storm_clean %>%
    mutate(date = floor_date(bgn_date, unit = "month")) %>%
    inner_join(cpi, by="date") %>%
    mutate(property_damage_real = property_damage / index,
           crop_damage_real = crop_damage / index,
           total_damage_real = total_damage / index)

Cleaning up event type data

The event types are all over the place. A complete cleanup would require domain knowledge that I don’t have access to, but we can use approximate string matching to map a good chunk of the event types to the 48 official event types. I start by creating a vector of the official event types, from the NWS Directive, and then mapping each event name that appears in the storm data to one of the official event types. Those event names that can not be mapped are left as is, but we still are able to reduce the number of obvious duplicates. The official event names are only available in a PDF, so I copied and pasted the values from the PDF directly into my R code, assuring we’ll be able to re-create them at any time. Because of their origin, the event names data need to be tidied up (split the single string into a vector of event names, remove the C/Z/M designator at the end of the name and convert everything to lower case as I did with the event data in the storm dataset):

storm_events <- unique(storm_clean$event)

event_names <- "Astronomical Low Tide Z 
Avalanche Z 
Blizzard Z 
Coastal Flood Z 
Cold/Wind Chill Z 
Debris Flow C 
Dense Fog Z 
Dense Smoke Z 
Drought Z 
Dust Devil C 
Dust Storm Z 
Excessive Heat Z 
Extreme Cold/Wind Chill Z 
Flash Flood C 
Flood C 
Frost/Freeze Z 
Funnel Cloud C 
Freezing Fog Z 
Hail C 
Heat Z 
Heavy Rain C 
Heavy Snow Z 
High Surf Z 
High Wind Z 
Hurricane (Typhoon) Z 
Ice Storm Z 
Lake-Effect Snow Z 
Lakeshore Flood Z 
Lightning C 
Marine Hail M 
Marine High Wind M 
Marine Strong Wind M 
Marine Thunderstorm Wind M 
Rip Current Z 
Seiche Z 
Sleet Z 
Storm Surge/Tide Z 
Strong Wind Z 
Thunderstorm Wind C 
Tornado C 
Tropical Depression Z 
Tropical Storm Z 
Tsunami Z 
Volcanic Ash Z 
Waterspout M 
Wildfire Z 
Winter Storm Z 
Winter Weather Z "

event_names <- event_names %>% 
    str_split("\n") %>%
    `[[`(1) %>%
    str_match("(.*?) [ZCM] $") %>%
    `[`(,2) %>%
    tolower
names(event_names) <- event_names

try_match <- sapply(storm_events, amatch, event_names, method="cosine", maxDist=.1)
clean_events <- structure(event_names[try_match], names = storm_events)

I made one more change. The only hurricane/typhoon entry in the official event list is hurricane (typhoon), but visual inspection showed that the storm data has numerous instances of hurricane as well as typhoon, both of which should be mapped to hurricane (typhoon):

clean_events["hurricane"] <- "hurricane (typhoon)"
clean_events["typhoon"] <- "hurricane (typhoon)"

Now clean_events is a vector that maps event names found in the storm data to their closest counterparts in the official event names. I modify the storm_clean data frame to add a cleaned up event column, prioritizing exact matches, then approximate matches, and finally leaving the field as is if no better option is found:

storm_clean <- storm_clean %>% 
    mutate(event_clean = event_names[event],
           event_clean = ifelse(is.na(event_clean), clean_events[event], event_clean),
           event_clean = ifelse(is.na(event_clean), event, event_clean))

Here’s a quick example of the sorts of changes we made:

storm_clean %>% 
    filter(event_clean != event) %>%
    group_by(event, event_clean) %>%
    summarise(total_records = n()) %>% 
    arrange(desc(total_records)) %>%
    head
## Source: local data frame [6 x 3]
## Groups: event
## 
##                    event              event_clean total_records
## 1     winter weather/mix           winter weather          1104
## 2 thunderstorm wind/hail marine thunderstorm wind          1028
## 3           rip currents              rip current           302
## 4   heavy surf/high surf                high surf           228
## 5      extreme windchill  extreme cold/wind chill           204
## 6           strong winds              strong wind           192

It appears that we can take the injury and fatality numbers as they are from the storm data, so now we’re ready to answer the original questions.

Results

Which types of events are the most harmful with respect to population health?

The chart below shows the top causers of injuries and fatalities, sorted by number of events, since 1996. Overall, tornadoes take the highest human toll, but excessive heat is disproportionately fatal when it does strike.

storm_clean %>%
    group_by(event_clean) %>%
    summarise(events = n(), 
              fatalities = sum(fatalities), 
              injuries=sum(injuries),
              human_damages = sum(fatalities+injuries)) %>%
    filter(events > 25 & human_damages > 100) %>%
    select(event_type = event_clean, 
           events, 
           Fatalities = fatalities, 
           Injuries = injuries, 
           human_damages) %>%
    gather(damage_type, count, Fatalities, Injuries, -human_damages, -events) %>%
    ggplot(aes(y = reorder(event_type, events), x = count)) +
    geom_point() +
    geom_segment(aes(xend = 0, yend=reorder(event_type, events))) +
    scale_x_continuous(labels=comma, expand=c(0.01,0)) +
    facet_wrap(~damage_type, ncol=2, scale="free_x") +
    labs(title="Injuries and fatalities by event type\nsorted by number of events", x = "Total", y = "Event Type") +
    theme(rect=element_blank(),
          plot.title = element_text(hjust=0),
          axis.ticks = element_blank(),
          strip.text = element_text(hjust=0),
          axis.title.x = element_text(hjust=0),
          panel.grid.major.x = element_line(size = .1, colour="gray50"))
Events types are sorted by number of event ocurrences. We can see that tornadoes and excessive heat have the highest human cost.

Events types are sorted by number of event ocurrences. We can see that tornadoes and excessive heat have the highest human cost.

Which types of events have the greatest economic consequences?

From the chart below, showing the top 10 event types in terms of overall dollars of damage, we can see that floods and hurricanes and typhoons have the greatest economic consequences overall, but that, as expected, droughts are by far the leading cause of crop damages. The dollar amounts are in constant 1996 dollars.

storm_clean %>%
    group_by(event_clean) %>%
    summarise(events = n(), 
              crop_damages = sum(crop_damage_real), 
              property_damages=sum(property_damage_real),
              total_damages = sum(total_damage_real)) %>%
    select(event_type = event_clean, 
           events, 
           crop_damages, 
           property_damages, 
           total_damages) %>%
    top_n(10) %>%
    gather(damage_type, dollars, crop_damages, property_damages, -total_damages, -events) %>%    
    ggplot(aes(y = reorder(event_type, total_damages), x = dollars)) +
    geom_point() +
    geom_segment(aes(xend=0, yend=reorder(event_type, total_damages))) +
    scale_x_continuous(labels=comma, expand=c(0.01,0)) +
    facet_wrap(~damage_type, ncol=2, scale="free_x") +
    labs(title="Monetary damages by event type\nsorted by total cost of damages", x = "Cost in 1996 dollars", y = "Event Type") +
    theme(rect=element_blank(),
          plot.title = element_text(hjust=0),
          axis.ticks = element_blank(),
          strip.text = element_text(hjust=0),
          axis.title.x = element_text(hjust=0),
          panel.grid.major.x = element_line(size = .1, colour="gray50"))
## Selecting by total_damages
Top 10 economic damage-causers. Floods and hurricanes/typhoons have the greatest economic consequences overall, but droughts cause by far the most crop damage.

Top 10 economic damage-causers. Floods and hurricanes/typhoons have the greatest economic consequences overall, but droughts cause by far the most crop damage.