Analysis of weather events in the United States

Created as part of a project for the Coursera course ‘Reproducible Research’ by Aneev K

Synopsis

This is an analysis of data concerning weather events from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. It seeks to find the types of events that have historically been most harmful to population health (in terms of fatalities and injuries), and those that have had the greatest economic consequences (in terms of property and crop damage). The list below shows the top three event types that have resulted in the most damage of each kind:

Total economic damage (property and crop): Flood, Hurricane (Typhoon), Storm Surge/Tide
Fatalities: Excessive heat, Tornado, Flash Flood
Injuries: Tornado, Flood, Excessive Heat

Preliminaries

The analysis starts with the bzipped CSV file available here.

To start, we assume that this file has been downloaded and is in the current directory.

Loading libraries

library(data.table)
library(R.utils)
library(scales)
library(ggplot2)
library(magrittr)
library(hash)
library(stringdist)

Reading in the data

We’ll cache this to save time.

bunzip2('repdata%2Fdata%2FStormData.csv.bz2', 'stormdata.csv', remove=FALSE, skip=TRUE)
sdata <- fread('stormdata.csv')

Data processing

Filtering by year

A preliminary analysis shows us that earlier years have less data. In this analysis, we’ll only use years starting from 1996.

sdata[, begin.date := as.Date(BGN_DATE, '%m/%d/%Y')]
sdata[, year := year(begin.date)]
sdata <- sdata[year >= 1996]

Filtering out events with no economic damage, fatalities, or injuries

sdata <- sdata[PROPDMG != 0 | CROPDMG != 0 | FATALITIES != 0 | INJURIES != 0]

Fixing PROPDMGEXP and CROPDMGEXP

Property damage and crop damage are represented in this data set using a combination of two fields each: ‘PROPDMG’ or ‘CROPDMG’ representing the significant figures, and ‘PROPDMGEXP’ or ‘CROPDMGEXP’ representing powers of 10.

PROPDMGEXP and CROPDMGEXP have a few values that do not make sense, such as ‘+’, ‘-’ etc, but all of these occur before 1996, so they’ve already been filtered out.

First, we’ll set up a table that lists out the various unique values of PROPDMGEXP and CROPDMGEXP, and allots each of them a power of 10.

unique_exps <- unique(c(sdata$PROPDMGEXP, sdata$CROPDMGEXP))

exponents <- data.table(given=unique_exps, power.of.ten=0) %>%
  .[given %in% c('B','b'), power.of.ten := 1000000000] %>%
  .[given %in% c('M','m'), power.of.ten := 1000000] %>%
  .[given %in% c('K','k'), power.of.ten := 1000] %>%
  .[given %in% c('H','h'), power.of.ten := 100] %>%
  .[given != ''] # Filter out the empty string

print(exponents)

##    given power.of.ten
## 1:     K        1e+03
## 2:     M        1e+06
## 3:     B        1e+09

We now set up a hash using these values for optimal performance, and create new columns within our data table representing the crop and property damage multipliers, actual crop and property damage in dollars, as well as the total damage.

multiplier <- hash(exponents$given, exponents$power.of.ten)

get.multiplier <- function(exp) {
  if (exp == '') {1}
  else {multiplier[[exp]]}
}

sdata <- sdata[, propdmg_multiplier := mapply(get.multiplier, PROPDMGEXP)] %>%
  .[, cropdmg_multiplier := mapply(get.multiplier, CROPDMGEXP)] %>%
  .[, actual_prop_dmg := PROPDMG * propdmg_multiplier] %>%
  .[, actual_crop_dmg := CROPDMG * cropdmg_multiplier] %>%
  .[, total_dmg := actual_prop_dmg + actual_crop_dmg]

# Take a peek at our new columns
columns.to.select <- c('PROPDMG', 'CROPDMG',
                       'PROPDMGEXP', 'CROPDMGEXP',
                       'propdmg_multiplier', 'cropdmg_multiplier',
                       'actual_prop_dmg', 'actual_crop_dmg', 'total_dmg')

print(sdata[
  sample(nrow(sdata), 10),
  columns.to.select, with=FALSE
])

##     PROPDMG CROPDMG PROPDMGEXP CROPDMGEXP propdmg_multiplier
##  1:     4.5       0          M          K              1e+06
##  2:     5.0       0          K                         1e+03
##  3:     5.0       0          K                         1e+03
##  4:    15.0       0          K          K              1e+03
##  5:     0.0       1          K          K              1e+03
##  6:    50.0       0          K          K              1e+03
##  7:     1.0       0          K          K              1e+03
##  8:    50.0       0          K          K              1e+03
##  9:   250.0       0          K          K              1e+03
## 10:     0.0       0                                    1e+00
##     cropdmg_multiplier actual_prop_dmg actual_crop_dmg total_dmg
##  1:               1000         4500000               0   4500000
##  2:                  1            5000               0      5000
##  3:                  1            5000               0      5000
##  4:               1000           15000               0     15000
##  5:               1000               0            1000      1000
##  6:               1000           50000               0     50000
##  7:               1000            1000               0      1000
##  8:               1000           50000               0     50000
##  9:               1000          250000               0    250000
## 10:                  1               0               0         0

Fixing the event types

According to the documentation, there are only 48 canonical event types. However, our data has a lot more:

length(unique(sdata$EVTYPE))

## [1] 222

Therefore, we need to fix these to include just the canonical 48. We’ll do this using a combination of substring analysis, closest-text matching and straight-up ignoring the values with low frequency.

First up, we’ll put the canonical types in a vector. We’ll use lower-case to avoid any case-related discrepancies.

official_evtypes <- c('Astronomical Low Tide',
                     'Avalanche',
                     'Blizzard',
                     'Coastal Flood',
                     'Cold/Wind Chill',
                     'Debris Flow',
                     'Dense Fog',
                     'Dense Smoke',
                     'Drought',
                     'Dust Devil',
                     'Dust Storm',
                     'Excessive Heat',
                     'Extreme Cold/Wind Chill',
                     'Flash Flood',
                     'Flood',
                     'Frost/Freeze',
                     'Funnel Cloud',
                     'Freezing Fog',
                     'Hail',
                     'Heat',
                     'Heavy Rain',
                     'Heavy Snow',
                     'High Surf',
                     'High Wind',
                     'Hurricane (Typhoon)',
                     'Ice Storm',
                     'Lake-Effect Snow',
                     'Lakeshore Flood',
                     'Lightning',
                     'Marine Hail',
                     'Marine High Wind',
                     'Marine Strong Wind',
                     'Marine Thunderstorm Wind',
                     'Rip Current',
                     'Seiche',
                     'Sleet',
                     'Storm Surge/Tide',
                     'Strong Wind',
                     'Thunderstorm Wind',
                     'Tornado',
                     'Tropical Depression',
                     'Tropical Storm',
                     'Tsunami',
                     'Volcanic Ash',
                     'Waterspout',
                     'Wildfire',
                     'Winter Storm',
                     'Winter Weather')

official_evtypes <- official_evtypes %>% tolower

We only really need a few columns of the data we have left (grouped by event type), so let’s get rid of everything else:

evtable <- sdata[,
  .(count = .N, fat = sum(FATALITIES), inj = sum(INJURIES), dmg = sum(total_dmg)),
  by = EVTYPE
]

print(evtable)

##                     EVTYPE count  fat   inj         dmg
##   1:          WINTER STORM  1460  191  1292  1544687250
##   2:               TORNADO 12366 1511 20667 24900370720
##   3:             TSTM WIND 61775  241  3629  5031941790
##   4:             HIGH WIND  5402  235  1083  5881421660
##   5:           FLASH FLOOD 19011  887  1674 16557105610
##  ---                                                   
## 218:    MARINE STRONG WIND    46   14    22      418330
## 219: ASTRONOMICAL LOW TIDE     2    0     0      320000
## 220:           DENSE SMOKE     1    0     0      100000
## 221:           MARINE HAIL     2    0     0        4000
## 222:          FREEZING FOG     7    0     0     2182000

Next, we add a couple of columns: ‘type’ to represent a lower-case trimmed version of EVTYPE, and realtype, which will represent the cleansed canonical type (initially set to NA)

evtable[, type := tolower(trim(EVTYPE))] %>%
  .[, realtype := '']

We use approximate string matching to fill in the canonical types as far as possible:

evtable[realtype == '', realtype := official_evtypes[amatch(type, official_evtypes, maxDist = 6)]]

For the ones that do not yet have a canonical type, we’ll perform a series of substring-based allotments. Eg: If the substring, ‘hurricane’ appears in the text, it can be given the canonical type ‘hurricane (typhoon)’

evtable[is.na(realtype) & (grepl('tstm', type) | grepl('thunderstorm', type)) & !grepl('marine', type), realtype := 'thunderstorm wind'] 
evtable[is.na(realtype) & grepl('hail',type) & grepl('marine', type) == FALSE, realtype := 'hail']
evtable[is.na(realtype) & grepl('hurricane',type), realtype := 'hurricane (typhoon)']
evtable[is.na(realtype) & grepl('fire', type), realtype := 'wildfire']
evtable[is.na(realtype) & grepl('snow', type), realtype := 'heavy snow']
evtable[is.na(realtype) & grepl('wind', type), realtype := 'string wind']
evtable[is.na(realtype) & grepl('freez', type), realtype := 'frost/freeze']
evtable[is.na(realtype) & grepl('flood', type), realtype := 'flood']
evtable[is.na(realtype) & grepl('fld', type), realtype := 'flood']
evtable[is.na(realtype) & grepl('surf', type), realtype := 'high surf']
evtable[is.na(realtype) & grepl('cold', type), realtype := 'cold/wind chill']
evtable[is.na(realtype) & grepl('rain', type), realtype := 'heavy rain']
evtable[is.na(realtype) & grepl('wint', type), realtype := 'winter weather']
evtable[is.na(realtype) & grepl('precip', type), realtype := 'heavy rain']
evtable[is.na(realtype) & grepl('hypothermia', type), realtype := 'cold/wind chill']
evtable[is.na(realtype) & grepl('hyperthermia', type), realtype := 'cold/wind chill']
evtable[is.na(realtype) & grepl('frost', type), realtype := 'frost/freeze']
evtable[is.na(realtype) & grepl('heat', type), realtype := 'heat']
evtable[is.na(realtype) & grepl('warm', type), realtype := 'heat']
evtable[is.na(realtype) & grepl('microburst', type), realtype := 'thunderstorm wind']

Unmatched realtypes are NA. Let’s see how many of those there are:

print(evtable[is.na(realtype)])

##              EVTYPE count fat inj     dmg            type realtype
##  1: Marine Accident     1   1   2   50000 marine accident     <NA>
##  2:   Beach Erosion     1   0   0  100000   beach erosion     <NA>
##  3:       Landslump     1   0   0  570000       landslump     <NA>
##  4:       MUD SLIDE     2   0   0  100100       mud slide     <NA>
##  5:       BLACK ICE     1   1  24       0       black ice     <NA>
##  6:       DAM BREAK     2   0   0 1002000       dam break     <NA>
##  7:      LANDSLIDES     1   1   1    5000      landslides     <NA>
##  8:      ROCK SLIDE     1   0   0  150000      rock slide     <NA>
##  9:      ROGUE WAVE     1   0   2       0      rogue wave     <NA>
## 10:    BLOWING DUST     1   0   0   20000    blowing dust     <NA>

As these have frequency of at most 2, they can safely be ignored:

evtable <- evtable[!is.na(realtype)]

We now group the data using the clean ‘realtype’ column instead of EVTYPE:

evtable <- evtable[,
  .(events = sum(count), fatalities = sum(fat), injuries = sum(inj), damage = sum(dmg)),
  by = realtype
]

print(evtable)

##                     realtype events fatalities injuries       damage
##  1:             winter storm   1460        191     1292   1544687250
##  2:                  tornado  12366       1511    20667  24900370720
##  3:                high wind  67231        484     4724  10923869450
##  4:              flash flood  19093        888     1674  16579317610
##  5:             freezing fog     17          3        0      2808000
##  6:          cold/wind chill    270        221       91   1361565900
##  7:                lightning  11293        652     4143    752488520
##  8:                     hail  22683          7      713  17071742870
##  9:                    flood  10464        544     7884 149791901700
## 10:        thunderstorm wind  43674        141     1526   3892249290
## 11:           excessive heat    710       1797     6393    502060700
## 12:              rip current    603        542      503       163000
## 13:                     heat    205        239     1313      2796400
## 14:               heavy snow   1077        138      779    707319640
## 15:                 wildfire   1419        124     1510   8507299630
## 16:                ice storm    651         87      341   3658252010
## 17:                 blizzard    228         70      385    532718950
## 18:         storm surge/tide    216         13       42  47835579000
## 19:               dust storm     96         11      376      8574000
## 20:              strong wind   3414        110      299    241947740
## 21:               dust devil     84          2       39       663630
## 22:                high surf    191        140      207    109004500
## 23:               heavy rain   1074         96      260   1323824240
## 24:                avalanche    264        223      156      3711800
## 25:                    sleet     15          1        0    156925000
## 26:            coastal flood    194         10        4    356025560
## 27:               waterspout     29          2        2      5737200
## 28:             frost/freeze    152          2       13   1231072000
## 29:                   seiche    140         65       48  14556434010
## 30:           tropical storm    410         57      338   8320186550
## 31:           winter weather    552         64      562     42310500
## 32:                  drought    264          9        9  14413689000
## 33:      hurricane (typhoon)     73         64     1277  71913712800
## 34:  extreme cold/wind chill    130        142       29     26453000
## 35:              string wind     12          3        7      1056000
## 36:         lake-effect snow    198          0        0     40182000
## 37:             funnel cloud      9          0        1       134100
## 38:              marine hail     13          0       10     20867000
## 39:             volcanic ash      2          0        0       500000
## 40:         marine high wind    128         10        9      6718010
## 41:                dense fog     58          9      143      7319000
## 42:    astronomical low tide     10          0        0      9745000
## 43:      tropical depression     35          0        0      1737000
## 44:                  tsunami     14         33      129    144082000
## 45:          lakeshore flood      5          0        0      7540000
## 46: marine thunderstorm wind     33         10       26       486400
## 47:       marine strong wind     46         14       22       418330
## 48:              dense smoke      1          0        0       100000
##                     realtype events fatalities injuries       damage

Results

As the data is now (relatively) clean, we can start to plot graphs to answer the questions we want to answer.

Economic damage

dmg.obj = ggplot(evtable[order(-damage)][1:10], aes(x = reorder(realtype, -damage), y = damage))
dmg.plot <- dmg.obj +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_continuous(labels = dollar) + 
  xlab('Type of event') + 
  ylab('Total damage (in $)')

print(dmg.plot)

The top three event types by economic damage are floods, hurricanes and storm surges.

Fatalities

fat.obj = ggplot(evtable[order(-fatalities)][1:10], aes(x = reorder(realtype, -fatalities), y = fatalities))
fat.plot <- fat.obj +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab('Type of event') + 
  ylab('Total fatalities')

print(fat.plot)

The top three event types by fatalities are excessive heat, tornadoes and flash floods.

Injuries

inj.obj = ggplot(evtable[order(-injuries)][1:10], aes(x = reorder(realtype, -injuries), y = injuries))
inj.plot <- inj.obj +
  geom_col() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  xlab('Type of event') + 
  ylab('Total injuries')

print(inj.plot)

The top three event types by injuries are tornadoes, floods, and excessive heat.