Synopsis

What types of weather events have the greatest monetary impact on communities? Which ones lead to the greatest loss of life?

This report will attempt to answer these questions using weather event data collected by the U.S. National Oceanic and Atmospheric Administration’s (NOAA). The NOAA storm database tracks details about major storms and other weather events including the location, duration and economic impact as well as the number of injuries, direct and indirect, as well as loss of life. The data analyzed was collected between 1950 and November 2011.

Although the database is structured in lacks consistent conventions for the key attributes we need, so a significant part of the analysis will be to try to normalize the event types and damage estimates so they can be aggregated and compared.

For the purposes of this report event types will be aggregated into broader categories of weather events, such as hurricane, thunderstorm, hail, flood, and freezing rain.

Data Processing

I’m using dplyr and ggplot2 libraries to process the data.

library(dplyr)
library(ggplot2)
library(pander)
library(knitr)

The data file comes in the format of a csv in bzip2 format. There are over 900,000 events recorded. Since we’re only looking at the event type and damages we’re going to ignore the remaining columns to save memory.

data <- read.csv(bzfile('repdata-data-StormData.csv.bz2'),
                 colClasses=c(STATE__ = "NULL",
                              BGN_DATE = 'character',
                              BGN_TIME = "NULL",
                              TIME_ZONE = "NULL",
                              COUNTY = "NULL",
                              COUNTYNAME = "NULL",
                              STATE = "NULL",
                              EVTYPE = 'character',
                              BGN_RANGE = "NULL",
                              BGN_AZI = "NULL",
                              BGN_LOCATI = "NULL",
                              END_DATE = "NULL",
                              END_TIME = "NULL",
                              COUNTY_END = "NULL",
                              COUNTYENDN = "NULL",
                              END_RANGE = "NULL",
                              END_AZI = "NULL",
                              END_LOCATI = "NULL",
                              LENGTH = "NULL",
                              WIDTH = "NULL",
                              F = "NULL",
                              MAG = NA, 
                              FATALITIES = NA,
                              INJURIES = NA,
                              PROPDMG = NA,
                              PROPDMGEXP = 'character',
                              CROPDMG = NA,
                              CROPDMGEXP = 'character',
                              WFO = "NULL",
                              STATEOFFIC = "NULL",
                              ZONENAMES = "NULL",
                              LATITUDE = "NULL",
                              LONGITUDE = "NULL",
                              LATITUDE_E = "NULL",
                              LONGITUDE_ = "NULL",
                              REMARKS = "NULL",
                              REFNUM = "NULL"))
dim(data)

## [1] 902297      9

Normalizing Costs

The monetary damages are all represented with a magnitude and exponent. The exponent values for property damage and crop damage include a variety of characters all of which translate to a power of 10, which is multiplied by the base magnitude value.

I am assuming +, -, ? are artifacts and in those cases there was no exponent multiplier, so the exponent of 10 is 0. I created a mapping of EXP value to an actual power of 10 for both property and crop damage.

exp_lookup <- data.frame(stringsAsFactors = F,
                         levels=sort(unique(c(data$PROPDMGEXP, data$CROPDMGEXP))))
exp_lookup$values <- c(0,0,0,0,0,1,2,3,4,5,6,7,8,9,2,2,3,3,6,6)
print(exp_lookup, row.names = F)

##  levels values
##              0
##       -      0
##       ?      0
##       +      0
##       0      0
##       1      1
##       2      2
##       3      3
##       4      4
##       5      5
##       6      6
##       7      7
##       8      8
##       B      9
##       h      2
##       H      2
##       k      3
##       K      3
##       m      6
##       M      6

To quantify the total cost of any given event I’m simply adding the property and crop losses.

data$total_damage <-
    data$PROPDMG * 10 ^ left_join(data, exp_lookup, by=c('PROPDMGEXP'='levels'))$values +
    data$CROPDMG * 10 ^ left_join(data, exp_lookup, by=c('CROPDMGEXP'='levels'))$values

We can spot check a few samples:

##       PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP total_damage
## 933        50          K      50          K       100000
## 10533      10          K      20          K        30000
## 15074      50          K     100          K       150000
## 1506        5          K      50          K        55000
## 3500       25          K      25          K        50000
## 11833       2          K       5          K         7000
## 15294       1          K       1          K         2000
## 9951        1          K       5          K         6000
## 15638       3          K      10          K        13000
## 9968       10          K      10          K        20000

To make the damages estimates more meaningful I considered adjusting values for inflation by multiplying them by the consumer price index based on the year of the event. But I was starting to feel like I was spending way too much time on this project so I decided instead to use the time to go play with my kids.

Normalizing Event Types

The main problem with the raw data is that event types are described using different words, such as TSTRM and THUNDERSTORM. Another problem is that events types consist of multiple different events in many cases, like Sleet/Snow. The events have mixed case, spurious or inconsistent punctuating characters, mis-spellings, and in some cases just defer to descriptions.

Here’s an example showing 20 of the event types for events related to thunderstorms:

grep(data$EVTYPE, pattern='(TSTM|\\bTH[UNDERSTO]*RM)', ignore.case=T, value = T) %>% table %>% as.data.frame %>% head(20) %>% print(row.names=F)

##                               . Freq
##                       TSTM WIND    4
##                 TSTM WIND (G45)    1
##  FLASH FLOODING/THUNDERSTORM WI    1
##         GUSTY THUNDERSTORM WIND    3
##        GUSTY THUNDERSTORM WINDS    5
##  LIGHTNING AND THUNDERSTORM WIN    1
##    LIGHTNING THUNDERSTORM WINDS    1
##   LIGHTNING THUNDERSTORM WINDSS    1
##        MARINE THUNDERSTORM WIND 5812
##                MARINE TSTM WIND 6175
##                   NON TSTM WIND    2
##                   NON-TSTM WIND    1
##             SEVERE THUNDERSTORM   13
##       SEVERE THUNDERSTORM WINDS    5
##            SEVERE THUNDERSTORMS   23
##               THUDERSTORM WINDS    2
##             THUNDEERSTORM WINDS    2
##             THUNDERESTORM WINDS    1
##                    THUNDERSTORM   45
##             THUNDERSTORM  WINDS    7

My approach to solving this was to aggregate the event types into single consistent categories, grouping them according to certain keywords. Also, rather than fix every single record in the data I focused on the data that seemed like it could impact the top 15 or 20 categories. I did this by iterating over the summary data, looking at categories near the top of the list of damage, and adding rules to ensure those event types were meaningful.

For the thunderstorm example I had to consider whether to group these all into thunderstorm events, wind events, or thunderstorm-wind events. I decided that the dominant attribute of all of these was “thunderstorm.” Events described as “thunderstorm wind” are still essentially thunderstorms for the purpose of quantifying their impact.

Here are the rules used to aggregate the EVTYPE into a normalized event type:

normalized_event <- toupper(data$EVTYPE)

normalized_event  <- 
    gsub('(RECORD|EXTREME|PATCHY|DENSE|HIGH|STRONG|HEAVY|VERY|UNUSUAL|MODERATE|UNSEASONAL|EXCESSIVE|ABNORMAL|ABNORMALLY) *',
         '', normalized_event)

normalized_event[grep('THUNDERSTORM', normalized_event, fixed=T)] <- 'THUNDERSTORM'
normalized_event[grep('TSTM', normalized_event, fixed=T)] <- 'THUNDERSTORM'
normalized_event[grep('THU[NDERSTO]*M\\b', normalized_event)] <- 'THUNDERSTORM'
normalized_event[grep('FLOOD', normalized_event, fixed=T)] <- 'FLOOD'
normalized_event[grep('\\bFLD\\b', normalized_event)] <- 'FLOOD'
normalized_event[grep('TORNADO', normalized_event, fixed=T)] <- 'TORNADO'
normalized_event[grep('HURRICANE', normalized_event, fixed=T)] <- 'HURRICANE'
normalized_event[grep('TYPHOON', normalized_event, fixed=T)] <- 'TYPHOON'
normalized_event[grep('BLIZZARD', normalized_event, fixed=T)] <- 'BLIZZARD'
normalized_event[grep('TROPICAL ST[OR]*M', normalized_event)] <- 'TROPICAL STORM'
normalized_event[grep('HAIL', normalized_event, fixed=T)] <- 'HAIL'
normalized_event[grep('FIRE', normalized_event, fixed=T)] <- 'FIRE'
normalized_event[grep('^VOLCAN', normalized_event)] <- 'VOLCANO'
normalized_event[grep('^LIGHT[NI]*G\\b', normalized_event)] <- 'LIGHTNING'
normalized_event[grep('SNOW', normalized_event, fixed=T)] <- 'SNOW'
normalized_event[grep('ICE', normalized_event, fixed=T)] <- 'ICE'
normalized_event[grep('^WIND', normalized_event)] <- 'WINDS'
normalized_event[grep('^STORM SURGE', normalized_event)] <- 'STORM SURGE'
normalized_event[grep('^RIP CURRENT', normalized_event)] <- 'RIP CURRENT'
normalized_event[grep('^SURF', normalized_event)] <- 'SURF'
normalized_event[grep('HEAT', normalized_event)] <- 'HEAT'
normalized_event[grep('^COLD', normalized_event)] <- 'COLD'
normalized_event[grep('FREEZING RAIN', normalized_event)] <- 'FREEZING RAIN'
normalized_event[grep('^FROST', normalized_event)] <- 'FROST'
normalized_event[grep('^LANDSLIDE', normalized_event)] <- 'LANDSLIDE'
normalized_event[grep('^FOG', normalized_event)] <- 'FOG'

normalized_event[grep('^SUMMARY', normalized_event)] <- 'OTHER'

data$category <- normalized_event

You can see the effect of aggregation looking at a plot of event types before and after aggregation.

ev_grouped <- group_by(data, EVTYPE) %>% 
    summarize(fatalities=sum(FATALITIES), damage=sum(total_damage), count=n())

cat_grouped <- group_by(data, category) %>% 
    summarize(fatalities=sum(FATALITIES), damage=sum(total_damage), count=n())

ev_grouped$processed <- F
cat_grouped$processed <-T 

ggplot(bind_rows(ev_grouped, cat_grouped)) +
    facet_grid(processed ~ .) +
    aes(x=damage, y=fatalities, size=count) +
    geom_point() + scale_x_log10() + scale_y_log10() +
    ggtitle('Weather Event Types Before and After Aggregation')

The plot above shows the event types plotted against the total damages and fatalities. In this report we’re only interested in what shows up in the upper right side of the chart. After aggregation you can see it’s cleaned up quite a bit with a smaller number of categories with a higher count of events.

(Note: the charts axes are log scaled)

Summarizing the Data

The final step in processing was to group the data by event type and summarize by damages in dollars and loss of life.

event_data <- group_by(data, category) %>%
    summarize(count=n(), 
              total_damage=sum(total_damage), 
              average_damage=mean(total_damage),
              total_fatalities=sum(FATALITIES),
              average_fatalities=mean(FATALITIES),
              total_injuries=sum(INJURIES),
              average_injuries=mean(INJURIES)) %>%
    arrange(desc(count))

Results

combined_results <- 
    bind_rows(event_data %>% arrange(desc(total_damage)) %>% head(5) %>%
                  mutate(impact = total_damage, rel=total_damage/sum(total_damage), by='Total Damage'),
              event_data %>% arrange(desc(total_fatalities)) %>% head(5) %>% 
                  mutate(impact=total_fatalities, rel=total_fatalities/sum(total_fatalities), by='Fatalities'),
              event_data %>% arrange(desc(total_injuries)) %>% head(5) %>% 
                  mutate(impact=total_injuries, rel=total_injuries/sum(total_injuries), by='Injuries'))

ggplot(combined_results) +
         geom_bar(stat='identity') +
         aes(x=by, y=rel, fill=category) + 
         stat_identity(geom="text", position='stack', size=5,aes(label=category,y=rel, vjust=1.5)) +
         scale_fill_brewer(palette='Set3', guide=F) +
         ggtitle('Relative Impact of Top 5 Weather Events by Measured Cost')

This figure shows the relative breakdown of cost among the top five event in terms of the number of fatalities, the number of injuries, and the total damage to crops and property. This shows tornadoes as the most dangerous weather event and floods as the most expensive!

Here are the top 20 events in order of total cost over all events in that category:

event_data %>%
    arrange(desc(total_damage)) %>%
    select(category, count, total_damage, total_fatalities, total_injuries) %>%
    mutate(total_damage=paste(prettyNum(round(signif(total_damage/10e6, 3)), big.mark=','), 'M')) %>%
    head(20) %>% kable(caption='Top Events by Total Crop and Property Damage', align=c('l','r','r','r','r'))

Top Events by Total Crop and Property Damage
category	count	total_damage	total_fatalities	total_injuries
FLOOD	86119	18,100 M	1553	8683
HURRICANE	288	9,030 M	135	1328
TORNADO	60699	5,740 M	5636	91407
STORM SURGE	409	4,800 M	24	43
HAIL	289282	1,900 M	15	1371
DROUGHT	2488	1,500 M	0	4
THUNDERSTORM	336823	1,410 M	755	9544
ICE	2101	898 M	97	2152
FIRE	4240	890 M	90	1608
TROPICAL STORM	697	841 M	66	383
WINDS	26239	696 M	443	1857
WINTER STORM	11433	672 M	206	1321
RAIN/SEVERE WEATHER	2	250 M	0	0
COLD	2387	151 M	432	315
RAIN	11765	144 M	98	251
FROST	1401	117 M	1	3
SNOW	17692	116 M	169	1165
LIGHTNING	15764	94 M	817	5231
HEAT	2648	92 M	3138	9224
BLIZZARD	2745	78 M	101	806

Here are the top 20 events in order of the total number of deaths for events in that category:

event_data %>%
    arrange(desc(total_fatalities)) %>%
    select(category, count, total_damage, total_fatalities, total_injuries) %>%
    mutate(total_damage=paste(prettyNum(round(signif(total_damage/10e6, 3)), big.mark=','), 'M')) %>%
    head(20) %>% kable(caption='Top Events by Total Number of Fatalities', align=c('l','r','r','r','r'))

Top Events by Total Number of Fatalities
category	count	total_damage	total_fatalities	total_injuries
TORNADO	60699	5,740 M	5636	91407
HEAT	2648	92 M	3138	9224
FLOOD	86119	18,100 M	1553	8683
LIGHTNING	15764	94 M	817	5231
THUNDERSTORM	336823	1,410 M	755	9544
RIP CURRENT	777	0 M	577	529
WINDS	26239	696 M	443	1857
COLD	2387	151 M	432	315
AVALANCHE	386	0 M	224	170
WINTER STORM	11433	672 M	206	1321
SNOW	17692	116 M	169	1165
SURF	1055	10 M	157	244
HURRICANE	288	9,030 M	135	1328
BLIZZARD	2745	78 M	101	806
RAIN	11765	144 M	98	251
ICE	2101	898 M	97	2152
FIRE	4240	890 M	90	1608
FOG	1835	2 M	81	1077
TROPICAL STORM	697	841 M	66	383
LANDSLIDE	608	34 M	39	53

Finally, here are the top 20 events in order of the total number of injuries for events in that category:

event_data %>%
    arrange(desc(total_injuries)) %>%
    select(category, count, total_damage, total_fatalities, total_injuries) %>%
    mutate(total_damage=paste(prettyNum(round(signif(total_damage/10e6, 3)), big.mark=','), 'M')) %>%
    head(20) %>% kable(caption='Top Events by Total Number of Injuries', align=c('l','r','r','r','r'))

Top Events by Total Number of Injuries
category	count	total_damage	total_fatalities	total_injuries
TORNADO	60699	5,740 M	5636	91407
THUNDERSTORM	336823	1,410 M	755	9544
HEAT	2648	92 M	3138	9224
FLOOD	86119	18,100 M	1553	8683
LIGHTNING	15764	94 M	817	5231
ICE	2101	898 M	97	2152
WINDS	26239	696 M	443	1857
FIRE	4240	890 M	90	1608
HAIL	289282	1,900 M	15	1371
HURRICANE	288	9,030 M	135	1328
WINTER STORM	11433	672 M	206	1321
SNOW	17692	116 M	169	1165
FOG	1835	2 M	81	1077
BLIZZARD	2745	78 M	101	806
RIP CURRENT	777	0 M	577	529
DUST STORM	427	1 M	22	440
WINTER WEATHER	7045	4 M	33	398
TROPICAL STORM	697	841 M	66	383
COLD	2387	151 M	432	315
RAIN	11765	144 M	98	251

NOAA Weather Event Analysis

Bill Kayser

January 11, 2016