This is an analysis of data concerning weather events from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. It seeks to find the types of events that have historically been most harmful to population health (in terms of fatalities and injuries), and those that have had the greatest economic consequences (in terms of property and crop damage). The list below shows the top three event types that have resulted in the most damage of each kind:
The analysis starts with the bzipped CSV file available here.
To start, we assume that this file has been downloaded and is in the current directory.
library(data.table)
library(R.utils)
library(scales)
library(ggplot2)
library(magrittr)
library(hash)
library(stringdist)
We’ll cache this to save time.
bunzip2('repdata%2Fdata%2FStormData.csv.bz2', 'stormdata.csv', remove=FALSE, skip=TRUE)
sdata <- fread('stormdata.csv')
A preliminary analysis shows us that earlier years have less data. In this analysis, we’ll only use years starting from 1996.
sdata[, begin.date := as.Date(BGN_DATE, '%m/%d/%Y')]
sdata[, year := year(begin.date)]
sdata <- sdata[year >= 1996]
sdata <- sdata[PROPDMG != 0 | CROPDMG != 0 | FATALITIES != 0 | INJURIES != 0]
Property damage and crop damage are represented in this data set using a combination of two fields each: ‘PROPDMG’ or ‘CROPDMG’ representing the significant figures, and ‘PROPDMGEXP’ or ‘CROPDMGEXP’ representing powers of 10.
PROPDMGEXP and CROPDMGEXP have a few values that do not make sense, such as ‘+’, ‘-’ etc, but all of these occur before 1996, so they’ve already been filtered out.
First, we’ll set up a table that lists out the various unique values of PROPDMGEXP and CROPDMGEXP, and allots each of them a power of 10.
unique_exps <- unique(c(sdata$PROPDMGEXP, sdata$CROPDMGEXP))
exponents <- data.table(given=unique_exps, power.of.ten=0) %>%
.[given %in% c('B','b'), power.of.ten := 1000000000] %>%
.[given %in% c('M','m'), power.of.ten := 1000000] %>%
.[given %in% c('K','k'), power.of.ten := 1000] %>%
.[given %in% c('H','h'), power.of.ten := 100] %>%
.[given != ''] # Filter out the empty string
print(exponents)
## given power.of.ten
## 1: K 1e+03
## 2: M 1e+06
## 3: B 1e+09
We now set up a hash using these values for optimal performance, and create new columns within our data table representing the crop and property damage multipliers, actual crop and property damage in dollars, as well as the total damage.
multiplier <- hash(exponents$given, exponents$power.of.ten)
get.multiplier <- function(exp) {
if (exp == '') {1}
else {multiplier[[exp]]}
}
sdata <- sdata[, propdmg_multiplier := mapply(get.multiplier, PROPDMGEXP)] %>%
.[, cropdmg_multiplier := mapply(get.multiplier, CROPDMGEXP)] %>%
.[, actual_prop_dmg := PROPDMG * propdmg_multiplier] %>%
.[, actual_crop_dmg := CROPDMG * cropdmg_multiplier] %>%
.[, total_dmg := actual_prop_dmg + actual_crop_dmg]
# Take a peek at our new columns
columns.to.select <- c('PROPDMG', 'CROPDMG',
'PROPDMGEXP', 'CROPDMGEXP',
'propdmg_multiplier', 'cropdmg_multiplier',
'actual_prop_dmg', 'actual_crop_dmg', 'total_dmg')
print(sdata[
sample(nrow(sdata), 10),
columns.to.select, with=FALSE
])
## PROPDMG CROPDMG PROPDMGEXP CROPDMGEXP propdmg_multiplier
## 1: 4.5 0 M K 1e+06
## 2: 5.0 0 K 1e+03
## 3: 5.0 0 K 1e+03
## 4: 15.0 0 K K 1e+03
## 5: 0.0 1 K K 1e+03
## 6: 50.0 0 K K 1e+03
## 7: 1.0 0 K K 1e+03
## 8: 50.0 0 K K 1e+03
## 9: 250.0 0 K K 1e+03
## 10: 0.0 0 1e+00
## cropdmg_multiplier actual_prop_dmg actual_crop_dmg total_dmg
## 1: 1000 4500000 0 4500000
## 2: 1 5000 0 5000
## 3: 1 5000 0 5000
## 4: 1000 15000 0 15000
## 5: 1000 0 1000 1000
## 6: 1000 50000 0 50000
## 7: 1000 1000 0 1000
## 8: 1000 50000 0 50000
## 9: 1000 250000 0 250000
## 10: 1 0 0 0
According to the documentation, there are only 48 canonical event types. However, our data has a lot more:
length(unique(sdata$EVTYPE))
## [1] 222
Therefore, we need to fix these to include just the canonical 48. We’ll do this using a combination of substring analysis, closest-text matching and straight-up ignoring the values with low frequency.
First up, we’ll put the canonical types in a vector. We’ll use lower-case to avoid any case-related discrepancies.
official_evtypes <- c('Astronomical Low Tide',
'Avalanche',
'Blizzard',
'Coastal Flood',
'Cold/Wind Chill',
'Debris Flow',
'Dense Fog',
'Dense Smoke',
'Drought',
'Dust Devil',
'Dust Storm',
'Excessive Heat',
'Extreme Cold/Wind Chill',
'Flash Flood',
'Flood',
'Frost/Freeze',
'Funnel Cloud',
'Freezing Fog',
'Hail',
'Heat',
'Heavy Rain',
'Heavy Snow',
'High Surf',
'High Wind',
'Hurricane (Typhoon)',
'Ice Storm',
'Lake-Effect Snow',
'Lakeshore Flood',
'Lightning',
'Marine Hail',
'Marine High Wind',
'Marine Strong Wind',
'Marine Thunderstorm Wind',
'Rip Current',
'Seiche',
'Sleet',
'Storm Surge/Tide',
'Strong Wind',
'Thunderstorm Wind',
'Tornado',
'Tropical Depression',
'Tropical Storm',
'Tsunami',
'Volcanic Ash',
'Waterspout',
'Wildfire',
'Winter Storm',
'Winter Weather')
official_evtypes <- official_evtypes %>% tolower
We only really need a few columns of the data we have left (grouped by event type), so let’s get rid of everything else:
evtable <- sdata[,
.(count = .N, fat = sum(FATALITIES), inj = sum(INJURIES), dmg = sum(total_dmg)),
by = EVTYPE
]
print(evtable)
## EVTYPE count fat inj dmg
## 1: WINTER STORM 1460 191 1292 1544687250
## 2: TORNADO 12366 1511 20667 24900370720
## 3: TSTM WIND 61775 241 3629 5031941790
## 4: HIGH WIND 5402 235 1083 5881421660
## 5: FLASH FLOOD 19011 887 1674 16557105610
## ---
## 218: MARINE STRONG WIND 46 14 22 418330
## 219: ASTRONOMICAL LOW TIDE 2 0 0 320000
## 220: DENSE SMOKE 1 0 0 100000
## 221: MARINE HAIL 2 0 0 4000
## 222: FREEZING FOG 7 0 0 2182000
Next, we add a couple of columns: ‘type’ to represent a lower-case trimmed version of EVTYPE, and realtype, which will represent the cleansed canonical type (initially set to NA)
evtable[, type := tolower(trim(EVTYPE))] %>%
.[, realtype := '']
We use approximate string matching to fill in the canonical types as far as possible:
evtable[realtype == '', realtype := official_evtypes[amatch(type, official_evtypes, maxDist = 6)]]
For the ones that do not yet have a canonical type, we’ll perform a series of substring-based allotments. Eg: If the substring, ‘hurricane’ appears in the text, it can be given the canonical type ‘hurricane (typhoon)’
evtable[is.na(realtype) & (grepl('tstm', type) | grepl('thunderstorm', type)) & !grepl('marine', type), realtype := 'thunderstorm wind']
evtable[is.na(realtype) & grepl('hail',type) & grepl('marine', type) == FALSE, realtype := 'hail']
evtable[is.na(realtype) & grepl('hurricane',type), realtype := 'hurricane (typhoon)']
evtable[is.na(realtype) & grepl('fire', type), realtype := 'wildfire']
evtable[is.na(realtype) & grepl('snow', type), realtype := 'heavy snow']
evtable[is.na(realtype) & grepl('wind', type), realtype := 'string wind']
evtable[is.na(realtype) & grepl('freez', type), realtype := 'frost/freeze']
evtable[is.na(realtype) & grepl('flood', type), realtype := 'flood']
evtable[is.na(realtype) & grepl('fld', type), realtype := 'flood']
evtable[is.na(realtype) & grepl('surf', type), realtype := 'high surf']
evtable[is.na(realtype) & grepl('cold', type), realtype := 'cold/wind chill']
evtable[is.na(realtype) & grepl('rain', type), realtype := 'heavy rain']
evtable[is.na(realtype) & grepl('wint', type), realtype := 'winter weather']
evtable[is.na(realtype) & grepl('precip', type), realtype := 'heavy rain']
evtable[is.na(realtype) & grepl('hypothermia', type), realtype := 'cold/wind chill']
evtable[is.na(realtype) & grepl('hyperthermia', type), realtype := 'cold/wind chill']
evtable[is.na(realtype) & grepl('frost', type), realtype := 'frost/freeze']
evtable[is.na(realtype) & grepl('heat', type), realtype := 'heat']
evtable[is.na(realtype) & grepl('warm', type), realtype := 'heat']
evtable[is.na(realtype) & grepl('microburst', type), realtype := 'thunderstorm wind']
Unmatched realtypes are NA. Let’s see how many of those there are:
print(evtable[is.na(realtype)])
## EVTYPE count fat inj dmg type realtype
## 1: Marine Accident 1 1 2 50000 marine accident <NA>
## 2: Beach Erosion 1 0 0 100000 beach erosion <NA>
## 3: Landslump 1 0 0 570000 landslump <NA>
## 4: MUD SLIDE 2 0 0 100100 mud slide <NA>
## 5: BLACK ICE 1 1 24 0 black ice <NA>
## 6: DAM BREAK 2 0 0 1002000 dam break <NA>
## 7: LANDSLIDES 1 1 1 5000 landslides <NA>
## 8: ROCK SLIDE 1 0 0 150000 rock slide <NA>
## 9: ROGUE WAVE 1 0 2 0 rogue wave <NA>
## 10: BLOWING DUST 1 0 0 20000 blowing dust <NA>
As these have frequency of at most 2, they can safely be ignored:
evtable <- evtable[!is.na(realtype)]
We now group the data using the clean ‘realtype’ column instead of EVTYPE:
evtable <- evtable[,
.(events = sum(count), fatalities = sum(fat), injuries = sum(inj), damage = sum(dmg)),
by = realtype
]
print(evtable)
## realtype events fatalities injuries damage
## 1: winter storm 1460 191 1292 1544687250
## 2: tornado 12366 1511 20667 24900370720
## 3: high wind 67231 484 4724 10923869450
## 4: flash flood 19093 888 1674 16579317610
## 5: freezing fog 17 3 0 2808000
## 6: cold/wind chill 270 221 91 1361565900
## 7: lightning 11293 652 4143 752488520
## 8: hail 22683 7 713 17071742870
## 9: flood 10464 544 7884 149791901700
## 10: thunderstorm wind 43674 141 1526 3892249290
## 11: excessive heat 710 1797 6393 502060700
## 12: rip current 603 542 503 163000
## 13: heat 205 239 1313 2796400
## 14: heavy snow 1077 138 779 707319640
## 15: wildfire 1419 124 1510 8507299630
## 16: ice storm 651 87 341 3658252010
## 17: blizzard 228 70 385 532718950
## 18: storm surge/tide 216 13 42 47835579000
## 19: dust storm 96 11 376 8574000
## 20: strong wind 3414 110 299 241947740
## 21: dust devil 84 2 39 663630
## 22: high surf 191 140 207 109004500
## 23: heavy rain 1074 96 260 1323824240
## 24: avalanche 264 223 156 3711800
## 25: sleet 15 1 0 156925000
## 26: coastal flood 194 10 4 356025560
## 27: waterspout 29 2 2 5737200
## 28: frost/freeze 152 2 13 1231072000
## 29: seiche 140 65 48 14556434010
## 30: tropical storm 410 57 338 8320186550
## 31: winter weather 552 64 562 42310500
## 32: drought 264 9 9 14413689000
## 33: hurricane (typhoon) 73 64 1277 71913712800
## 34: extreme cold/wind chill 130 142 29 26453000
## 35: string wind 12 3 7 1056000
## 36: lake-effect snow 198 0 0 40182000
## 37: funnel cloud 9 0 1 134100
## 38: marine hail 13 0 10 20867000
## 39: volcanic ash 2 0 0 500000
## 40: marine high wind 128 10 9 6718010
## 41: dense fog 58 9 143 7319000
## 42: astronomical low tide 10 0 0 9745000
## 43: tropical depression 35 0 0 1737000
## 44: tsunami 14 33 129 144082000
## 45: lakeshore flood 5 0 0 7540000
## 46: marine thunderstorm wind 33 10 26 486400
## 47: marine strong wind 46 14 22 418330
## 48: dense smoke 1 0 0 100000
## realtype events fatalities injuries damage
As the data is now (relatively) clean, we can start to plot graphs to answer the questions we want to answer.
dmg.obj = ggplot(evtable[order(-damage)][1:10], aes(x = reorder(realtype, -damage), y = damage))
dmg.plot <- dmg.obj +
geom_col() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_y_continuous(labels = dollar) +
xlab('Type of event') +
ylab('Total damage (in $)')
print(dmg.plot)
The top three event types by economic damage are floods, hurricanes and storm surges.
fat.obj = ggplot(evtable[order(-fatalities)][1:10], aes(x = reorder(realtype, -fatalities), y = fatalities))
fat.plot <- fat.obj +
geom_col() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab('Type of event') +
ylab('Total fatalities')
print(fat.plot)
The top three event types by fatalities are excessive heat, tornadoes and flash floods.
inj.obj = ggplot(evtable[order(-injuries)][1:10], aes(x = reorder(realtype, -injuries), y = injuries))
inj.plot <- inj.obj +
geom_col() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab('Type of event') +
ylab('Total injuries')
print(inj.plot)
The top three event types by injuries are tornadoes, floods, and excessive heat.