This analysis uses public information provided by the Department of Commerce, and in particular the National Oceanic & Atmospheric Administration. The analysis aims to answer two primary questions:

The source data tracks incidents from 1950 forward. However, more recent years have proven to be more reliable in the data; so, this analysis will be conducted from the year 2000 forward. The Data Processing describes all the transformation from the source data to the final form, and the Results section states the findings in a consumable form, both dataset and plot visualizations.

Author: mjfii
Date: 2014-09-21

Data Processing:

The following are steps to gather, clean, and the alter source data for use in later analytics and ultimately the below stated results. First, we will use all the following R libraries in the processing and analytic code chunks, and we will also make sure a working source data folder exists.

library(data.table)
suppressMessages(library(plyr))
suppressMessages(library(lubridate))
library(ggplot2)
suppressMessages(library(reshape))

setwd('~/R')
if (!file.exists('data/storm_data')) {
  dir.create('data/storm_data')
}

In order to make this analysis reproducible, we need to start with the original .bz2 data file. Below, we simply use the read.csv function to load the source data into memory. Note, since this file is very large, if we already have a tidy dataset from a prior processing, we will look for it and use it. From there, the below cleaning, subsetting, and aggregating is conducted. The primary transformation below the calculation of ‘DAMAGE’, which is defined in greater detail in the source code book, but in short, it translates ‘1.1M’ into $1,100,000, for aggregation purposes.

if (!file.exists('data/storm_data/tidy_storm_data.csv')) {
  # read the file and calc dates to subset by year
  raw<-data.table(read.csv('data/storm_data/FStormData.csv.bz2',stringsAsFactors=FALSE))
  raw<-raw[,YEAR:=year(as.POSIXlt(strptime(raw$BGN_DATE,'%m/%d/%Y %H:%M:%S')))]
  raw<-raw[YEAR>=2000,]

  # calculate total economic damage per code book specifications
  raw$CROPDMGEXP[raw$CROPDMGEXP=='']<-'1'
  raw$CROPDMGEXP[raw$CROPDMGEXP=='K']<-'1000'
  raw$CROPDMGEXP[raw$CROPDMGEXP=='M']<-'1000000'
  raw$CROPDMGEXP[raw$CROPDMGEXP=='B']<-'1000000000'
  raw$PROPDMGEXP[raw$PROPDMGEXP=='']<-'1'
  raw$PROPDMGEXP[raw$PROPDMGEXP=='K']<-'1000'
  raw$PROPDMGEXP[raw$PROPDMGEXP=='M']<-'1000000'
  raw$PROPDMGEXP[raw$PROPDMGEXP=='B']<-'1000000000'
  raw<-raw[,DAMAGE:=(as.numeric(PROPDMG)*as.numeric(PROPDMGEXP))+(as.numeric(CROPDMG)*as.numeric(CROPDMGEXP))]

  # trim up any white-space
  trim <- function(x) {
    gsub('(^[[:space:]]+|[[:space:]]+$)','',x)
  }
  raw$EVTYPE<-mapply(trim,raw$EVTYPE)
  
  # aggregate items into a clean version
  pre.agg<-raw[,c(8,4,23,24,39),with=FALSE]
  pre.agg<-ddply(pre.agg,c('EVTYPE'), summarize,FATALITIES=(sum(FATALITIES)),INJURIES=(sum(INJURIES)),DAMAGE=(sum(DAMAGE)))
  pre.agg<-data.table(pre.agg)[,INCIDENTS:=FATALITIES+INJURIES]
  pre.agg<-pre.agg[INCIDENTS>0 | DAMAGE>0,]
  pre.agg<-pre.agg[order(-pre.agg$INCIDENTS),]
  write.csv(pre.agg, file='data/storm_data/tidy_storm_data.csv',row.names=FALSE) 
} else {
  pre.agg<-data.table(read.csv('data/storm_data/tidy_storm_data.csv',stringsAsFactors=FALSE))
}

Since there are potential overlaps with EVTYPE in the raw data, e.g. ‘WILDFIRE’ and ‘WILD FOREST FIRE’, we want to combine identify these types as similar for aggregation purposes. The below fuzzy_lookup function uses the base agrep function to conduct this fuzzy lookup. This base function uses the Levenshtein edit distance to determine similarity. The initial data.table is passed into the function as an argument, and a similar data.table is returned with all the new aggregations in place.

fuzzy_lookup<-function(fuzzy_table) {
  
  # get the initial variables for the fuction
  fuzzy_table<-cbind(fuzzy_table,c(1:nrow(fuzzy_table)))
  setnames(fuzzy_table,6,'ROW')
  fuzzy_table$FUZZY<-as.character(NA)
  fuzzy_table_length<-length(fuzzy_table$FUZZY)
  luv<-fuzzy_table$EVTYPE
  
  # loop thru each row and find the closest match (if any)
  for (i in 1:(fuzzy_table_length)) {
    val<-fuzzy_table$EVTYPE[i]
    vec<-head(luv,i)
    vec<-vec[!is.element(vec,val)]
    fuzzy_value<-agrep(val,vec)
    fuzzy_value<-fuzzy_value[1]
    fuzzy_table$FUZZY[i]=vec[fuzzy_value]
  }
  
  # if a match was found, persist it to the source column and (re)aggregate
  fuzzy_table$EVTYPE<-ifelse(is.na(fuzzy_table$FUZZY),fuzzy_table$EVTYPE,fuzzy_table$FUZZY)
  fuzzy_table<-fuzzy_table[,c(1:5),with=FALSE]
  fuzzy_table<-ddply(fuzzy_table, c('EVTYPE'), summarize,FATALITIES=(sum(FATALITIES)),
                                                         INJURIES=(sum(INJURIES)),
                                                         INCIDENTS=(sum(INCIDENTS)),
                                                         DAMAGE=(sum(DAMAGE)))
  return(data.table(fuzzy_table))
}

Now that we have the fuzzy logic function, just pass our aggregated dataset in and we have our working ‘cleaned’ dataset returned and set to a variable.

cleaned.set<-fuzzy_lookup(pre.agg)

The below chunk of code uses our cleaned dataset to build and ‘incident’ dataset for analysis and visualization. The top 3 weather type by ‘Incident’ (Injuries plus Fatalities) are shown in the following segment.

incident.plot.data<-cleaned.set[INCIDENTS>0,]
total.incidents<-sum(cleaned.set$INCIDENTS)
incident.plot.data<-incident.plot.data[,INCIDENT_PERCENT:=round(incident.plot.data$INCIDENTS/total.incidents[1]*100,digits=1)]
incident.plot.data<-incident.plot.data[order(-incident.plot.data$INCIDENT_PERCENT),]
incident.plot.data<-head(incident.plot.data,n=3)[,c(1:4,6),with=FALSE]
setnames(incident.plot.data,c(1:5),c('Event','Fatalities','Injuries','Incidents','Incident %'))
incident.plot.data
##             Event Fatalities Injuries Incidents Incident %
## 1:        TORNADO       1193    15213     16406       39.9
## 2: EXCESSIVE HEAT       1244     4930      6174       15.0
## 3:      LIGHTNING        466     2993      3459        8.4

The below chunk of code uses our cleaned dataset to build and ‘damage’ dataset for analysis and visualization. The top 4 weather types by ‘Damage’ (Property $ plus Crop $) are shown in the following segment.

damage.plot.data<-cleaned.set[DAMAGE>0,]
total.damage<-sum(damage.plot.data$DAMAGE)
damage.plot.data<-damage.plot.data[,DAMAGE_PERCENT:=round(damage.plot.data$DAMAGE/total.damage[1]*100,digits=1)]
damage.plot.data<-damage.plot.data[,INCIDENT_PERCENT:=round(damage.plot.data$INCIDENTS/total.incidents[1]*100,digits=1)]
damage.plot.data<-damage.plot.data[order(-damage.plot.data$DAMAGE_PERCENT),]
damage.plot.data$DAMAGE<-damage.plot.data$DAMAGE/1000000
damage.plot.data<-damage.plot.data[,c(1,5,6,7),with=FALSE]
damage.plot.data<-head(damage.plot.data,n=4)
setnames(damage.plot.data,c(1:4),c('Event','Damage $M','Damage %','Incident %'))
damage.plot.data
##                Event Damage $M Damage % Incident %
## 1:       FLASH FLOOD    151695     42.8        4.8
## 2: HURRICANE/TYPHOON     75406     21.3        3.3
## 3:  STORM SURGE/TIDE     47813     13.5        0.0
## 4:           TORNADO     19696      5.6       39.9

Results:

Using the chunk of code below, we can see clearly that Tornados cause the risk to pubic health with a 39.9% stake of all weather related instances. ‘Excessive Heat’ and ‘Lightning’ round out the top three, with 15.0% and 8.4%, respectively. However, it is important to note that this number utilizes both ‘Fatalities’ and ‘Injuries’; in turn, ‘Excessive Heat’ causes the most fatalites.

# reshape the data and plot
incident.plot.data<-melt(incident.plot.data[,c(1:4),with=FALSE],id=c('Event'))
p<-qplot(variable,value,data=incident.plot.data,facets=Event~.,geom='bar',stat='identity',position='dodge')
p<-p+xlab('Incident Type by Event')
p<-p+ylab('Number of Incidents')
p

plot of chunk unnamed-chunk-7

The next result set seek to understand the monetary impact of any given weather event, and compare the percentage of the total impact accross all events to that of the incident percentages. ‘Flash Floods’ cause the most economic impact with 42.8% of the total. ‘Hurricanes’ and ‘Storm Surges’ round out the top three, with 21.3% and 13.5%, respectively. ‘Tornados’, while causing the most ‘health’ related incident only account for 5.6% or the total, or $19.6B since January 2000. The plot below shows this dichotomy in terms of percentage of totals.

# reshape the data and plot
damage.plot.data<-melt(damage.plot.data[,c(1,3,4),with=FALSE],id=c('Event'))
p<-qplot(variable,value,data=damage.plot.data,geom='bar',stat='identity',position='dodge')
p<-p + facet_wrap(~ Event)
p<-p+xlab('Top 4 Events by Percentage of Total Damage')
p<-p+ylab('Percentage (%)')
p

plot of chunk unnamed-chunk-8

References:

  1. Raw Source Data
  2. Storm Data Code Book and Documentation
  3. Storm Data FAQs

Other working datasets used within the above documentation:

pre.agg
##                   EVTYPE FATALITIES INJURIES    DAMAGE INCIDENTS
##   1:             TORNADO       1193    15213 1.970e+10     16406
##   2:      EXCESSIVE HEAT       1013     3708 4.968e+08      4721
##   3:           LIGHTNING        466     2993 6.014e+08      3459
##   4:           TSTM WIND        116     1753 2.424e+09      1869
##   5:   THUNDERSTORM WIND        130     1400 3.781e+09      1530
##  ---                                                            
## 103: TROPICAL DEPRESSION          0        0 1.737e+06         0
## 104:     TSTM WIND (G40)          0        0 6.000e+03         0
## 105:       TSTM WIND G45          0        0 1.000e+03         0
## 106:        VOLCANIC ASH          0        0 5.000e+05         0
## 107:           WHIRLWIND          0        0 7.000e+03         0
cleaned.set
##                       EVTYPE FATALITIES INJURIES INCIDENTS    DAMAGE
##  1:   ASTRONOMICAL HIGH TIDE          0        0         0 9.425e+06
##  2:    ASTRONOMICAL LOW TIDE          0        0         0 3.200e+05
##  3:                AVALANCHE        179      126       305 3.521e+06
##  4:                 BLIZZARD         19       23        42 1.248e+08
##  5:             BLOWING DUST          0        0         0 2.000e+04
##  6:               BRUSH FIRE          0        2         2 0.000e+00
##  7:            COASTAL FLOOD          3        2         5 2.168e+08
##  8:         COASTAL FLOODING          1        0         1 1.709e+07
##  9:             COLD WEATHER          2        0         2 0.000e+00
## 10:                DAM BREAK          0        0         0 1.000e+06
## 11:                DENSE FOG          9      143       152 7.319e+06
## 12:              DENSE SMOKE          0        0         0 1.000e+05
## 13:                  DROUGHT          0        4         4 9.982e+09
## 14:                 DROWNING          1        0         1 0.000e+00
## 15:           DRY MICROBURST          2       19        21 6.145e+05
## 16:               DUST DEVIL          2       36        38 6.080e+05
## 17:               DUST STORM         10      243       253 7.329e+06
## 18:           EXCESSIVE HEAT       1244     4930      6174 4.985e+08
## 19:  EXTREME COLD/WIND CHILL        257       36       293 1.700e+08
## 20:        EXTREME WINDCHILL          7        0         7 1.700e+07
## 21:         FALLING SNOW/ICE          1        1         2 0.000e+00
## 22:              FLASH FLOOD        866     1127      1993 1.517e+11
## 23:                      FOG         39      331       370 7.074e+06
## 24:                   FREEZE          0        0         0 5.335e+07
## 25:             FREEZING FOG          0        0         0 2.182e+06
## 26:             FROST/FREEZE          0        0         0 1.104e+09
## 27:             FUNNEL CLOUD          0        0         0 8.010e+04
## 28:              GUSTY WINDS          2        0         2 6.200e+04
## 29:                     HAIL          5      540       545 1.377e+10
## 30:           HAZARDOUS SURF          0        1         1 0.000e+00
## 31:               HEAVY RAIN         37      167       204 8.702e+08
## 32:               HEAVY SNOW         33      318       351 3.271e+08
## 33:     HEAVY SURF/HIGH SURF         44       74       118 1.058e+07
## 34:                HIGH SEAS          3        7        10 0.000e+00
## 35:                HIGH SURF         78      133       211 8.350e+07
## 36:       HIGH SURF ADVISORY          0        0         0 2.000e+05
## 37:               HIGH WATER          3        0         3 0.000e+00
## 38:                HIGH WIND        131      677       808 5.444e+09
## 39:        HURRICANE/TYPHOON         68     1291      1359 7.541e+10
## 40:              ICE ON ROAD          1        0         1 0.000e+00
## 41:                ICE STORM         30      115       145 2.929e+09
## 42:         LAKE-EFFECT SNOW          0        0         0 4.012e+07
## 43:          LAKESHORE FLOOD          0        0         0 7.540e+06
## 44:                LANDSLIDE         37       52        89 3.446e+08
## 45:         LATE SEASON SNOW          0        0         0 1.800e+05
## 46:      LIGHT FREEZING RAIN          0        0         0 3.060e+05
## 47:               LIGHT SNOW          1        2         3 1.064e+06
## 48:                LIGHTNING        466     2993      3459 6.014e+08
## 49:         MARINE HIGH WIND          1        1         2 1.301e+06
## 50:       MARINE STRONG WIND         14       22        36 4.183e+05
## 51: MARINE THUNDERSTORM WIND         10       26        36 4.864e+05
## 52:         MARINE TSTM WIND          9        9        18 5.421e+06
## 53:      MIXED PRECIPITATION          0        0         0 5.380e+05
## 54:                MUD SLIDE          0        0         0 1.325e+06
## 55:   NON-SEVERE WIND DAMAGE          0        7         7 5.000e+03
## 56:            NON TSTM WIND          0        0         0 4.000e+04
## 57:              RIP CURRENT        462      383       845 1.610e+05
## 58:               ROGUE WAVE          0        2         2 0.000e+00
## 59:               ROUGH SEAS          8        5        13 0.000e+00
## 60:                   SEICHE          0        0         0 2.100e+05
## 61:               SMALL HAIL          0        5         5 5.220e+06
## 62:             SNOW SQUALLS          0        0         0 1.500e+04
## 63:         STORM SURGE/TIDE         11        9        20 4.781e+10
## 64:              STRONG WIND        100      266       366 2.407e+08
## 65:        THUNDERSTORM WIND        131     1400      1531 3.781e+09
## 66:  THUNDERSTORM WIND (G40)          1        0         1 0.000e+00
## 67:                  TORNADO       1193    15213     16406 1.970e+10
## 68:      TROPICAL DEPRESSION          0        0         0 1.737e+06
## 69:           TROPICAL STORM         50      267       317 7.607e+09
## 70:                TSTM WIND        125     1807      1932 2.426e+09
## 71:          TSTM WIND (G45)          0        3         3 9.000e+03
## 72:           TSTM WIND/HAIL          2       34        36 4.895e+07
## 73:                  TSUNAMI         33      129       162 1.441e+08
## 74:        UNSEASONABLY WARM          0        1         1 0.000e+00
## 75:     URBAN/SML STREAM FLD         12       28        40 6.707e+06
## 76:             VOLCANIC ASH          0        0         0 5.000e+05
## 77:             WARM WEATHER          0        2         2 0.000e+00
## 78:               WATERSPOUT          2        1         3 5.308e+06
## 79:                WHIRLWIND          0        0         0 7.000e+03
## 80:         WILD/FOREST FIRE          9      286       295 2.364e+09
## 81:                 WILDFIRE         75      911       986 5.054e+09
## 82:             WINTER STORM        104      436       540 1.390e+09
## 83:           WINTER WEATHER         33      347       380 3.585e+07
## 84:       WINTER WEATHER/MIX         28      140       168 6.432e+06
##                       EVTYPE FATALITIES INJURIES INCIDENTS    DAMAGE