Synopsis

What types of weather events have the greatest monetary impact on communities? Which ones lead to the greatest loss of life?

This report will attempt to answer these questions using weather event data collected by the U.S. National Oceanic and Atmospheric Administration’s (NOAA). The NOAA storm database tracks details about major storms and other weather events including the location, duration and economic impact as well as the number of injuries, direct and indirect, as well as loss of life. The data analyzed was collected between 1950 and November 2011.

Although the database is structured in lacks consistent conventions for the key attributes we need, so a significant part of the analysis will be to try to normalize the event types and damage estimates so they can be aggregated and compared.

For the purposes of this report event types will be aggregated into broader categories of weather events, such as hurricane, thunderstorm, hail, flood, and freezing rain.

Data Processing

I’m using dplyr and ggplot2 libraries to process the data.

library(dplyr)
library(ggplot2)
library(pander)
library(knitr)

The data file comes in the format of a csv in bzip2 format. There are over 900,000 events recorded. Since we’re only looking at the event type and damages we’re going to ignore the remaining columns to save memory.

data <- read.csv(bzfile('repdata-data-StormData.csv.bz2'),
                 colClasses=c(STATE__ = "NULL",
                              BGN_DATE = 'character',
                              BGN_TIME = "NULL",
                              TIME_ZONE = "NULL",
                              COUNTY = "NULL",
                              COUNTYNAME = "NULL",
                              STATE = "NULL",
                              EVTYPE = 'character',
                              BGN_RANGE = "NULL",
                              BGN_AZI = "NULL",
                              BGN_LOCATI = "NULL",
                              END_DATE = "NULL",
                              END_TIME = "NULL",
                              COUNTY_END = "NULL",
                              COUNTYENDN = "NULL",
                              END_RANGE = "NULL",
                              END_AZI = "NULL",
                              END_LOCATI = "NULL",
                              LENGTH = "NULL",
                              WIDTH = "NULL",
                              F = "NULL",
                              MAG = NA, 
                              FATALITIES = NA,
                              INJURIES = NA,
                              PROPDMG = NA,
                              PROPDMGEXP = 'character',
                              CROPDMG = NA,
                              CROPDMGEXP = 'character',
                              WFO = "NULL",
                              STATEOFFIC = "NULL",
                              ZONENAMES = "NULL",
                              LATITUDE = "NULL",
                              LONGITUDE = "NULL",
                              LATITUDE_E = "NULL",
                              LONGITUDE_ = "NULL",
                              REMARKS = "NULL",
                              REFNUM = "NULL"))
dim(data)
## [1] 902297      9

Normalizing Costs

The monetary damages are all represented with a magnitude and exponent. The exponent values for property damage and crop damage include a variety of characters all of which translate to a power of 10, which is multiplied by the base magnitude value.

I am assuming +, -, ? are artifacts and in those cases there was no exponent multiplier, so the exponent of 10 is 0. I created a mapping of EXP value to an actual power of 10 for both property and crop damage.

exp_lookup <- data.frame(stringsAsFactors = F,
                         levels=sort(unique(c(data$PROPDMGEXP, data$CROPDMGEXP))))
exp_lookup$values <- c(0,0,0,0,0,1,2,3,4,5,6,7,8,9,2,2,3,3,6,6)
print(exp_lookup, row.names = F)
##  levels values
##              0
##       -      0
##       ?      0
##       +      0
##       0      0
##       1      1
##       2      2
##       3      3
##       4      4
##       5      5
##       6      6
##       7      7
##       8      8
##       B      9
##       h      2
##       H      2
##       k      3
##       K      3
##       m      6
##       M      6

To quantify the total cost of any given event I’m simply adding the property and crop losses.

data$total_damage <-
    data$PROPDMG * 10 ^ left_join(data, exp_lookup, by=c('PROPDMGEXP'='levels'))$values +
    data$CROPDMG * 10 ^ left_join(data, exp_lookup, by=c('CROPDMGEXP'='levels'))$values

We can spot check a few samples:

##       PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP total_damage
## 933        50          K      50          K       100000
## 10533      10          K      20          K        30000
## 15074      50          K     100          K       150000
## 1506        5          K      50          K        55000
## 3500       25          K      25          K        50000
## 11833       2          K       5          K         7000
## 15294       1          K       1          K         2000
## 9951        1          K       5          K         6000
## 15638       3          K      10          K        13000
## 9968       10          K      10          K        20000

To make the damages estimates more meaningful I considered adjusting values for inflation by multiplying them by the consumer price index based on the year of the event. But I was starting to feel like I was spending way too much time on this project so I decided instead to use the time to go play with my kids.

Normalizing Event Types

The main problem with the raw data is that event types are described using different words, such as TSTRM and THUNDERSTORM. Another problem is that events types consist of multiple different events in many cases, like Sleet/Snow. The events have mixed case, spurious or inconsistent punctuating characters, mis-spellings, and in some cases just defer to descriptions.

Here’s an example showing 20 of the event types for events related to thunderstorms:

grep(data$EVTYPE, pattern='(TSTM|\\bTH[UNDERSTO]*RM)', ignore.case=T, value = T) %>% table %>% as.data.frame %>% head(20) %>% print(row.names=F)
##                               . Freq
##                       TSTM WIND    4
##                 TSTM WIND (G45)    1
##  FLASH FLOODING/THUNDERSTORM WI    1
##         GUSTY THUNDERSTORM WIND    3
##        GUSTY THUNDERSTORM WINDS    5
##  LIGHTNING AND THUNDERSTORM WIN    1
##    LIGHTNING THUNDERSTORM WINDS    1
##   LIGHTNING THUNDERSTORM WINDSS    1
##        MARINE THUNDERSTORM WIND 5812
##                MARINE TSTM WIND 6175
##                   NON TSTM WIND    2
##                   NON-TSTM WIND    1
##             SEVERE THUNDERSTORM   13
##       SEVERE THUNDERSTORM WINDS    5
##            SEVERE THUNDERSTORMS   23
##               THUDERSTORM WINDS    2
##             THUNDEERSTORM WINDS    2
##             THUNDERESTORM WINDS    1
##                    THUNDERSTORM   45
##             THUNDERSTORM  WINDS    7

My approach to solving this was to aggregate the event types into single consistent categories, grouping them according to certain keywords. Also, rather than fix every single record in the data I focused on the data that seemed like it could impact the top 15 or 20 categories. I did this by iterating over the summary data, looking at categories near the top of the list of damage, and adding rules to ensure those event types were meaningful.

For the thunderstorm example I had to consider whether to group these all into thunderstorm events, wind events, or thunderstorm-wind events. I decided that the dominant attribute of all of these was “thunderstorm.” Events described as “thunderstorm wind” are still essentially thunderstorms for the purpose of quantifying their impact.

Here are the rules used to aggregate the EVTYPE into a normalized event type:

normalized_event <- toupper(data$EVTYPE)

normalized_event  <- 
    gsub('(RECORD|EXTREME|PATCHY|DENSE|HIGH|STRONG|HEAVY|VERY|UNUSUAL|MODERATE|UNSEASONAL|EXCESSIVE|ABNORMAL|ABNORMALLY) *',
         '', normalized_event)

normalized_event[grep('THUNDERSTORM', normalized_event, fixed=T)] <- 'THUNDERSTORM'
normalized_event[grep('TSTM', normalized_event, fixed=T)] <- 'THUNDERSTORM'
normalized_event[grep('THU[NDERSTO]*M\\b', normalized_event)] <- 'THUNDERSTORM'
normalized_event[grep('FLOOD', normalized_event, fixed=T)] <- 'FLOOD'
normalized_event[grep('\\bFLD\\b', normalized_event)] <- 'FLOOD'
normalized_event[grep('TORNADO', normalized_event, fixed=T)] <- 'TORNADO'
normalized_event[grep('HURRICANE', normalized_event, fixed=T)] <- 'HURRICANE'
normalized_event[grep('TYPHOON', normalized_event, fixed=T)] <- 'TYPHOON'
normalized_event[grep('BLIZZARD', normalized_event, fixed=T)] <- 'BLIZZARD'
normalized_event[grep('TROPICAL ST[OR]*M', normalized_event)] <- 'TROPICAL STORM'
normalized_event[grep('HAIL', normalized_event, fixed=T)] <- 'HAIL'
normalized_event[grep('FIRE', normalized_event, fixed=T)] <- 'FIRE'
normalized_event[grep('^VOLCAN', normalized_event)] <- 'VOLCANO'
normalized_event[grep('^LIGHT[NI]*G\\b', normalized_event)] <- 'LIGHTNING'
normalized_event[grep('SNOW', normalized_event, fixed=T)] <- 'SNOW'
normalized_event[grep('ICE', normalized_event, fixed=T)] <- 'ICE'
normalized_event[grep('^WIND', normalized_event)] <- 'WINDS'
normalized_event[grep('^STORM SURGE', normalized_event)] <- 'STORM SURGE'
normalized_event[grep('^RIP CURRENT', normalized_event)] <- 'RIP CURRENT'
normalized_event[grep('^SURF', normalized_event)] <- 'SURF'
normalized_event[grep('HEAT', normalized_event)] <- 'HEAT'
normalized_event[grep('^COLD', normalized_event)] <- 'COLD'
normalized_event[grep('FREEZING RAIN', normalized_event)] <- 'FREEZING RAIN'
normalized_event[grep('^FROST', normalized_event)] <- 'FROST'
normalized_event[grep('^LANDSLIDE', normalized_event)] <- 'LANDSLIDE'
normalized_event[grep('^FOG', normalized_event)] <- 'FOG'

normalized_event[grep('^SUMMARY', normalized_event)] <- 'OTHER'

data$category <- normalized_event

You can see the effect of aggregation looking at a plot of event types before and after aggregation.

ev_grouped <- group_by(data, EVTYPE) %>% 
    summarize(fatalities=sum(FATALITIES), damage=sum(total_damage), count=n())

cat_grouped <- group_by(data, category) %>% 
    summarize(fatalities=sum(FATALITIES), damage=sum(total_damage), count=n())

ev_grouped$processed <- F
cat_grouped$processed <-T 

ggplot(bind_rows(ev_grouped, cat_grouped)) +
    facet_grid(processed ~ .) +
    aes(x=damage, y=fatalities, size=count) +
    geom_point() + scale_x_log10() + scale_y_log10() +
    ggtitle('Weather Event Types Before and After Aggregation') 

The plot above shows the event types plotted against the total damages and fatalities. In this report we’re only interested in what shows up in the upper right side of the chart. After aggregation you can see it’s cleaned up quite a bit with a smaller number of categories with a higher count of events.

(Note: the charts axes are log scaled)

Summarizing the Data

The final step in processing was to group the data by event type and summarize by damages in dollars and loss of life.

event_data <- group_by(data, category) %>%
    summarize(count=n(), 
              total_damage=sum(total_damage), 
              average_damage=mean(total_damage),
              total_fatalities=sum(FATALITIES),
              average_fatalities=mean(FATALITIES),
              total_injuries=sum(INJURIES),
              average_injuries=mean(INJURIES)) %>%
    arrange(desc(count))

Results

combined_results <- 
    bind_rows(event_data %>% arrange(desc(total_damage)) %>% head(5) %>%
                  mutate(impact = total_damage, rel=total_damage/sum(total_damage), by='Total Damage'),
              event_data %>% arrange(desc(total_fatalities)) %>% head(5) %>% 
                  mutate(impact=total_fatalities, rel=total_fatalities/sum(total_fatalities), by='Fatalities'),
              event_data %>% arrange(desc(total_injuries)) %>% head(5) %>% 
                  mutate(impact=total_injuries, rel=total_injuries/sum(total_injuries), by='Injuries'))

ggplot(combined_results) +
         geom_bar(stat='identity') +
         aes(x=by, y=rel, fill=category) + 
         stat_identity(geom="text", position='stack', size=5,aes(label=category,y=rel, vjust=1.5)) +
         scale_fill_brewer(palette='Set3', guide=F) +
         ggtitle('Relative Impact of Top 5 Weather Events by Measured Cost') 

This figure shows the relative breakdown of cost among the top five event in terms of the number of fatalities, the number of injuries, and the total damage to crops and property. This shows tornadoes as the most dangerous weather event and floods as the most expensive!

Here are the top 20 events in order of total cost over all events in that category:

event_data %>%
    arrange(desc(total_damage)) %>%
    select(category, count, total_damage, total_fatalities, total_injuries) %>%
    mutate(total_damage=paste(prettyNum(round(signif(total_damage/10e6, 3)), big.mark=','), 'M')) %>%
    head(20) %>% kable(caption='Top Events by Total Crop and Property Damage', align=c('l','r','r','r','r'))
Top Events by Total Crop and Property Damage
category count total_damage total_fatalities total_injuries
FLOOD 86119 18,100 M 1553 8683
HURRICANE 288 9,030 M 135 1328
TORNADO 60699 5,740 M 5636 91407
STORM SURGE 409 4,800 M 24 43
HAIL 289282 1,900 M 15 1371
DROUGHT 2488 1,500 M 0 4
THUNDERSTORM 336823 1,410 M 755 9544
ICE 2101 898 M 97 2152
FIRE 4240 890 M 90 1608
TROPICAL STORM 697 841 M 66 383
WINDS 26239 696 M 443 1857
WINTER STORM 11433 672 M 206 1321
RAIN/SEVERE WEATHER 2 250 M 0 0
COLD 2387 151 M 432 315
RAIN 11765 144 M 98 251
FROST 1401 117 M 1 3
SNOW 17692 116 M 169 1165
LIGHTNING 15764 94 M 817 5231
HEAT 2648 92 M 3138 9224
BLIZZARD 2745 78 M 101 806

Here are the top 20 events in order of the total number of deaths for events in that category:

event_data %>%
    arrange(desc(total_fatalities)) %>%
    select(category, count, total_damage, total_fatalities, total_injuries) %>%
    mutate(total_damage=paste(prettyNum(round(signif(total_damage/10e6, 3)), big.mark=','), 'M')) %>%
    head(20) %>% kable(caption='Top Events by Total Number of Fatalities', align=c('l','r','r','r','r'))
Top Events by Total Number of Fatalities
category count total_damage total_fatalities total_injuries
TORNADO 60699 5,740 M 5636 91407
HEAT 2648 92 M 3138 9224
FLOOD 86119 18,100 M 1553 8683
LIGHTNING 15764 94 M 817 5231
THUNDERSTORM 336823 1,410 M 755 9544
RIP CURRENT 777 0 M 577 529
WINDS 26239 696 M 443 1857
COLD 2387 151 M 432 315
AVALANCHE 386 0 M 224 170
WINTER STORM 11433 672 M 206 1321
SNOW 17692 116 M 169 1165
SURF 1055 10 M 157 244
HURRICANE 288 9,030 M 135 1328
BLIZZARD 2745 78 M 101 806
RAIN 11765 144 M 98 251
ICE 2101 898 M 97 2152
FIRE 4240 890 M 90 1608
FOG 1835 2 M 81 1077
TROPICAL STORM 697 841 M 66 383
LANDSLIDE 608 34 M 39 53

Finally, here are the top 20 events in order of the total number of injuries for events in that category:

event_data %>%
    arrange(desc(total_injuries)) %>%
    select(category, count, total_damage, total_fatalities, total_injuries) %>%
    mutate(total_damage=paste(prettyNum(round(signif(total_damage/10e6, 3)), big.mark=','), 'M')) %>%
    head(20) %>% kable(caption='Top Events by Total Number of Injuries', align=c('l','r','r','r','r'))
Top Events by Total Number of Injuries
category count total_damage total_fatalities total_injuries
TORNADO 60699 5,740 M 5636 91407
THUNDERSTORM 336823 1,410 M 755 9544
HEAT 2648 92 M 3138 9224
FLOOD 86119 18,100 M 1553 8683
LIGHTNING 15764 94 M 817 5231
ICE 2101 898 M 97 2152
WINDS 26239 696 M 443 1857
FIRE 4240 890 M 90 1608
HAIL 289282 1,900 M 15 1371
HURRICANE 288 9,030 M 135 1328
WINTER STORM 11433 672 M 206 1321
SNOW 17692 116 M 169 1165
FOG 1835 2 M 81 1077
BLIZZARD 2745 78 M 101 806
RIP CURRENT 777 0 M 577 529
DUST STORM 427 1 M 22 440
WINTER WEATHER 7045 4 M 33 398
TROPICAL STORM 697 841 M 66 383
COLD 2387 151 M 432 315
RAIN 11765 144 M 98 251