What types of weather events have the greatest monetary impact on communities? Which ones lead to the greatest loss of life?
This report will attempt to answer these questions using weather event data collected by the U.S. National Oceanic and Atmospheric Administration’s (NOAA). The NOAA storm database tracks details about major storms and other weather events including the location, duration and economic impact as well as the number of injuries, direct and indirect, as well as loss of life. The data analyzed was collected between 1950 and November 2011.
Although the database is structured in lacks consistent conventions for the key attributes we need, so a significant part of the analysis will be to try to normalize the event types and damage estimates so they can be aggregated and compared.
For the purposes of this report event types will be aggregated into broader categories of weather events, such as hurricane, thunderstorm, hail, flood, and freezing rain.
I’m using dplyr and ggplot2 libraries to process the data.
library(dplyr)
library(ggplot2)
library(pander)
library(knitr)
The data file comes in the format of a csv in bzip2 format. There are over 900,000 events recorded. Since we’re only looking at the event type and damages we’re going to ignore the remaining columns to save memory.
data <- read.csv(bzfile('repdata-data-StormData.csv.bz2'),
colClasses=c(STATE__ = "NULL",
BGN_DATE = 'character',
BGN_TIME = "NULL",
TIME_ZONE = "NULL",
COUNTY = "NULL",
COUNTYNAME = "NULL",
STATE = "NULL",
EVTYPE = 'character',
BGN_RANGE = "NULL",
BGN_AZI = "NULL",
BGN_LOCATI = "NULL",
END_DATE = "NULL",
END_TIME = "NULL",
COUNTY_END = "NULL",
COUNTYENDN = "NULL",
END_RANGE = "NULL",
END_AZI = "NULL",
END_LOCATI = "NULL",
LENGTH = "NULL",
WIDTH = "NULL",
F = "NULL",
MAG = NA,
FATALITIES = NA,
INJURIES = NA,
PROPDMG = NA,
PROPDMGEXP = 'character',
CROPDMG = NA,
CROPDMGEXP = 'character',
WFO = "NULL",
STATEOFFIC = "NULL",
ZONENAMES = "NULL",
LATITUDE = "NULL",
LONGITUDE = "NULL",
LATITUDE_E = "NULL",
LONGITUDE_ = "NULL",
REMARKS = "NULL",
REFNUM = "NULL"))
dim(data)
## [1] 902297 9
The monetary damages are all represented with a magnitude and exponent. The exponent values for property damage and crop damage include a variety of characters all of which translate to a power of 10, which is multiplied by the base magnitude value.
I am assuming +
, -
, ?
are artifacts and in those cases there was no exponent multiplier, so the exponent of 10 is 0. I created a mapping of EXP value to an actual power of 10 for both property and crop damage.
exp_lookup <- data.frame(stringsAsFactors = F,
levels=sort(unique(c(data$PROPDMGEXP, data$CROPDMGEXP))))
exp_lookup$values <- c(0,0,0,0,0,1,2,3,4,5,6,7,8,9,2,2,3,3,6,6)
print(exp_lookup, row.names = F)
## levels values
## 0
## - 0
## ? 0
## + 0
## 0 0
## 1 1
## 2 2
## 3 3
## 4 4
## 5 5
## 6 6
## 7 7
## 8 8
## B 9
## h 2
## H 2
## k 3
## K 3
## m 6
## M 6
To quantify the total cost of any given event I’m simply adding the property and crop losses.
data$total_damage <-
data$PROPDMG * 10 ^ left_join(data, exp_lookup, by=c('PROPDMGEXP'='levels'))$values +
data$CROPDMG * 10 ^ left_join(data, exp_lookup, by=c('CROPDMGEXP'='levels'))$values
We can spot check a few samples:
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP total_damage
## 933 50 K 50 K 100000
## 10533 10 K 20 K 30000
## 15074 50 K 100 K 150000
## 1506 5 K 50 K 55000
## 3500 25 K 25 K 50000
## 11833 2 K 5 K 7000
## 15294 1 K 1 K 2000
## 9951 1 K 5 K 6000
## 15638 3 K 10 K 13000
## 9968 10 K 10 K 20000
To make the damages estimates more meaningful I considered adjusting values for inflation by multiplying them by the consumer price index based on the year of the event. But I was starting to feel like I was spending way too much time on this project so I decided instead to use the time to go play with my kids.
The main problem with the raw data is that event types are described using different words, such as TSTRM
and THUNDERSTORM
. Another problem is that events types consist of multiple different events in many cases, like Sleet/Snow
. The events have mixed case, spurious or inconsistent punctuating characters, mis-spellings, and in some cases just defer to descriptions.
Here’s an example showing 20 of the event types for events related to thunderstorms:
grep(data$EVTYPE, pattern='(TSTM|\\bTH[UNDERSTO]*RM)', ignore.case=T, value = T) %>% table %>% as.data.frame %>% head(20) %>% print(row.names=F)
## . Freq
## TSTM WIND 4
## TSTM WIND (G45) 1
## FLASH FLOODING/THUNDERSTORM WI 1
## GUSTY THUNDERSTORM WIND 3
## GUSTY THUNDERSTORM WINDS 5
## LIGHTNING AND THUNDERSTORM WIN 1
## LIGHTNING THUNDERSTORM WINDS 1
## LIGHTNING THUNDERSTORM WINDSS 1
## MARINE THUNDERSTORM WIND 5812
## MARINE TSTM WIND 6175
## NON TSTM WIND 2
## NON-TSTM WIND 1
## SEVERE THUNDERSTORM 13
## SEVERE THUNDERSTORM WINDS 5
## SEVERE THUNDERSTORMS 23
## THUDERSTORM WINDS 2
## THUNDEERSTORM WINDS 2
## THUNDERESTORM WINDS 1
## THUNDERSTORM 45
## THUNDERSTORM WINDS 7
My approach to solving this was to aggregate the event types into single consistent categories, grouping them according to certain keywords. Also, rather than fix every single record in the data I focused on the data that seemed like it could impact the top 15 or 20 categories. I did this by iterating over the summary data, looking at categories near the top of the list of damage, and adding rules to ensure those event types were meaningful.
For the thunderstorm example I had to consider whether to group these all into thunderstorm events, wind events, or thunderstorm-wind events. I decided that the dominant attribute of all of these was “thunderstorm.” Events described as “thunderstorm wind” are still essentially thunderstorms for the purpose of quantifying their impact.
Here are the rules used to aggregate the EVTYPE
into a normalized event type:
normalized_event <- toupper(data$EVTYPE)
normalized_event <-
gsub('(RECORD|EXTREME|PATCHY|DENSE|HIGH|STRONG|HEAVY|VERY|UNUSUAL|MODERATE|UNSEASONAL|EXCESSIVE|ABNORMAL|ABNORMALLY) *',
'', normalized_event)
normalized_event[grep('THUNDERSTORM', normalized_event, fixed=T)] <- 'THUNDERSTORM'
normalized_event[grep('TSTM', normalized_event, fixed=T)] <- 'THUNDERSTORM'
normalized_event[grep('THU[NDERSTO]*M\\b', normalized_event)] <- 'THUNDERSTORM'
normalized_event[grep('FLOOD', normalized_event, fixed=T)] <- 'FLOOD'
normalized_event[grep('\\bFLD\\b', normalized_event)] <- 'FLOOD'
normalized_event[grep('TORNADO', normalized_event, fixed=T)] <- 'TORNADO'
normalized_event[grep('HURRICANE', normalized_event, fixed=T)] <- 'HURRICANE'
normalized_event[grep('TYPHOON', normalized_event, fixed=T)] <- 'TYPHOON'
normalized_event[grep('BLIZZARD', normalized_event, fixed=T)] <- 'BLIZZARD'
normalized_event[grep('TROPICAL ST[OR]*M', normalized_event)] <- 'TROPICAL STORM'
normalized_event[grep('HAIL', normalized_event, fixed=T)] <- 'HAIL'
normalized_event[grep('FIRE', normalized_event, fixed=T)] <- 'FIRE'
normalized_event[grep('^VOLCAN', normalized_event)] <- 'VOLCANO'
normalized_event[grep('^LIGHT[NI]*G\\b', normalized_event)] <- 'LIGHTNING'
normalized_event[grep('SNOW', normalized_event, fixed=T)] <- 'SNOW'
normalized_event[grep('ICE', normalized_event, fixed=T)] <- 'ICE'
normalized_event[grep('^WIND', normalized_event)] <- 'WINDS'
normalized_event[grep('^STORM SURGE', normalized_event)] <- 'STORM SURGE'
normalized_event[grep('^RIP CURRENT', normalized_event)] <- 'RIP CURRENT'
normalized_event[grep('^SURF', normalized_event)] <- 'SURF'
normalized_event[grep('HEAT', normalized_event)] <- 'HEAT'
normalized_event[grep('^COLD', normalized_event)] <- 'COLD'
normalized_event[grep('FREEZING RAIN', normalized_event)] <- 'FREEZING RAIN'
normalized_event[grep('^FROST', normalized_event)] <- 'FROST'
normalized_event[grep('^LANDSLIDE', normalized_event)] <- 'LANDSLIDE'
normalized_event[grep('^FOG', normalized_event)] <- 'FOG'
normalized_event[grep('^SUMMARY', normalized_event)] <- 'OTHER'
data$category <- normalized_event
You can see the effect of aggregation looking at a plot of event types before and after aggregation.
ev_grouped <- group_by(data, EVTYPE) %>%
summarize(fatalities=sum(FATALITIES), damage=sum(total_damage), count=n())
cat_grouped <- group_by(data, category) %>%
summarize(fatalities=sum(FATALITIES), damage=sum(total_damage), count=n())
ev_grouped$processed <- F
cat_grouped$processed <-T
ggplot(bind_rows(ev_grouped, cat_grouped)) +
facet_grid(processed ~ .) +
aes(x=damage, y=fatalities, size=count) +
geom_point() + scale_x_log10() + scale_y_log10() +
ggtitle('Weather Event Types Before and After Aggregation')
The plot above shows the event types plotted against the total damages and fatalities. In this report we’re only interested in what shows up in the upper right side of the chart. After aggregation you can see it’s cleaned up quite a bit with a smaller number of categories with a higher count of events.
(Note: the charts axes are log scaled)
The final step in processing was to group the data by event type and summarize by damages in dollars and loss of life.
event_data <- group_by(data, category) %>%
summarize(count=n(),
total_damage=sum(total_damage),
average_damage=mean(total_damage),
total_fatalities=sum(FATALITIES),
average_fatalities=mean(FATALITIES),
total_injuries=sum(INJURIES),
average_injuries=mean(INJURIES)) %>%
arrange(desc(count))
combined_results <-
bind_rows(event_data %>% arrange(desc(total_damage)) %>% head(5) %>%
mutate(impact = total_damage, rel=total_damage/sum(total_damage), by='Total Damage'),
event_data %>% arrange(desc(total_fatalities)) %>% head(5) %>%
mutate(impact=total_fatalities, rel=total_fatalities/sum(total_fatalities), by='Fatalities'),
event_data %>% arrange(desc(total_injuries)) %>% head(5) %>%
mutate(impact=total_injuries, rel=total_injuries/sum(total_injuries), by='Injuries'))
ggplot(combined_results) +
geom_bar(stat='identity') +
aes(x=by, y=rel, fill=category) +
stat_identity(geom="text", position='stack', size=5,aes(label=category,y=rel, vjust=1.5)) +
scale_fill_brewer(palette='Set3', guide=F) +
ggtitle('Relative Impact of Top 5 Weather Events by Measured Cost')
This figure shows the relative breakdown of cost among the top five event in terms of the number of fatalities, the number of injuries, and the total damage to crops and property. This shows tornadoes as the most dangerous weather event and floods as the most expensive!
Here are the top 20 events in order of total cost over all events in that category:
event_data %>%
arrange(desc(total_damage)) %>%
select(category, count, total_damage, total_fatalities, total_injuries) %>%
mutate(total_damage=paste(prettyNum(round(signif(total_damage/10e6, 3)), big.mark=','), 'M')) %>%
head(20) %>% kable(caption='Top Events by Total Crop and Property Damage', align=c('l','r','r','r','r'))
category | count | total_damage | total_fatalities | total_injuries |
---|---|---|---|---|
FLOOD | 86119 | 18,100 M | 1553 | 8683 |
HURRICANE | 288 | 9,030 M | 135 | 1328 |
TORNADO | 60699 | 5,740 M | 5636 | 91407 |
STORM SURGE | 409 | 4,800 M | 24 | 43 |
HAIL | 289282 | 1,900 M | 15 | 1371 |
DROUGHT | 2488 | 1,500 M | 0 | 4 |
THUNDERSTORM | 336823 | 1,410 M | 755 | 9544 |
ICE | 2101 | 898 M | 97 | 2152 |
FIRE | 4240 | 890 M | 90 | 1608 |
TROPICAL STORM | 697 | 841 M | 66 | 383 |
WINDS | 26239 | 696 M | 443 | 1857 |
WINTER STORM | 11433 | 672 M | 206 | 1321 |
RAIN/SEVERE WEATHER | 2 | 250 M | 0 | 0 |
COLD | 2387 | 151 M | 432 | 315 |
RAIN | 11765 | 144 M | 98 | 251 |
FROST | 1401 | 117 M | 1 | 3 |
SNOW | 17692 | 116 M | 169 | 1165 |
LIGHTNING | 15764 | 94 M | 817 | 5231 |
HEAT | 2648 | 92 M | 3138 | 9224 |
BLIZZARD | 2745 | 78 M | 101 | 806 |
Here are the top 20 events in order of the total number of deaths for events in that category:
event_data %>%
arrange(desc(total_fatalities)) %>%
select(category, count, total_damage, total_fatalities, total_injuries) %>%
mutate(total_damage=paste(prettyNum(round(signif(total_damage/10e6, 3)), big.mark=','), 'M')) %>%
head(20) %>% kable(caption='Top Events by Total Number of Fatalities', align=c('l','r','r','r','r'))
category | count | total_damage | total_fatalities | total_injuries |
---|---|---|---|---|
TORNADO | 60699 | 5,740 M | 5636 | 91407 |
HEAT | 2648 | 92 M | 3138 | 9224 |
FLOOD | 86119 | 18,100 M | 1553 | 8683 |
LIGHTNING | 15764 | 94 M | 817 | 5231 |
THUNDERSTORM | 336823 | 1,410 M | 755 | 9544 |
RIP CURRENT | 777 | 0 M | 577 | 529 |
WINDS | 26239 | 696 M | 443 | 1857 |
COLD | 2387 | 151 M | 432 | 315 |
AVALANCHE | 386 | 0 M | 224 | 170 |
WINTER STORM | 11433 | 672 M | 206 | 1321 |
SNOW | 17692 | 116 M | 169 | 1165 |
SURF | 1055 | 10 M | 157 | 244 |
HURRICANE | 288 | 9,030 M | 135 | 1328 |
BLIZZARD | 2745 | 78 M | 101 | 806 |
RAIN | 11765 | 144 M | 98 | 251 |
ICE | 2101 | 898 M | 97 | 2152 |
FIRE | 4240 | 890 M | 90 | 1608 |
FOG | 1835 | 2 M | 81 | 1077 |
TROPICAL STORM | 697 | 841 M | 66 | 383 |
LANDSLIDE | 608 | 34 M | 39 | 53 |
Finally, here are the top 20 events in order of the total number of injuries for events in that category:
event_data %>%
arrange(desc(total_injuries)) %>%
select(category, count, total_damage, total_fatalities, total_injuries) %>%
mutate(total_damage=paste(prettyNum(round(signif(total_damage/10e6, 3)), big.mark=','), 'M')) %>%
head(20) %>% kable(caption='Top Events by Total Number of Injuries', align=c('l','r','r','r','r'))
category | count | total_damage | total_fatalities | total_injuries |
---|---|---|---|---|
TORNADO | 60699 | 5,740 M | 5636 | 91407 |
THUNDERSTORM | 336823 | 1,410 M | 755 | 9544 |
HEAT | 2648 | 92 M | 3138 | 9224 |
FLOOD | 86119 | 18,100 M | 1553 | 8683 |
LIGHTNING | 15764 | 94 M | 817 | 5231 |
ICE | 2101 | 898 M | 97 | 2152 |
WINDS | 26239 | 696 M | 443 | 1857 |
FIRE | 4240 | 890 M | 90 | 1608 |
HAIL | 289282 | 1,900 M | 15 | 1371 |
HURRICANE | 288 | 9,030 M | 135 | 1328 |
WINTER STORM | 11433 | 672 M | 206 | 1321 |
SNOW | 17692 | 116 M | 169 | 1165 |
FOG | 1835 | 2 M | 81 | 1077 |
BLIZZARD | 2745 | 78 M | 101 | 806 |
RIP CURRENT | 777 | 0 M | 577 | 529 |
DUST STORM | 427 | 1 M | 22 | 440 |
WINTER WEATHER | 7045 | 4 M | 33 | 398 |
TROPICAL STORM | 697 | 841 M | 66 | 383 |
COLD | 2387 | 151 M | 432 | 315 |
RAIN | 11765 | 144 M | 98 | 251 |