This analysis uses public information provided by the Department of Commerce, and in particular the National Oceanic & Atmospheric Administration. The analysis aims to answer two primary questions:
The source data tracks incidents from 1950 forward. However, more recent years have proven to be more reliable in the data; so, this analysis will be conducted from the year 2000 forward. The Data Processing describes all the transformation from the source data to the final form, and the Results section states the findings in a consumable form, both dataset and plot visualizations.
Author: mjfii
Date: 2014-09-21
The following are steps to gather, clean, and the alter source data for use in later analytics and ultimately the below stated results. First, we will use all the following R libraries in the processing and analytic code chunks, and we will also make sure a working source data folder exists.
library(data.table)
suppressMessages(library(plyr))
suppressMessages(library(lubridate))
library(ggplot2)
suppressMessages(library(reshape))
setwd('~/R')
if (!file.exists('data/storm_data')) {
dir.create('data/storm_data')
}
In order to make this analysis reproducible, we need to start with the original .bz2 data file. Below, we simply use the read.csv function to load the source data into memory. Note, since this file is very large, if we already have a tidy dataset from a prior processing, we will look for it and use it. From there, the below cleaning, subsetting, and aggregating is conducted. The primary transformation below the calculation of ‘DAMAGE’, which is defined in greater detail in the source code book, but in short, it translates ‘1.1M’ into $1,100,000, for aggregation purposes.
if (!file.exists('data/storm_data/tidy_storm_data.csv')) {
# read the file and calc dates to subset by year
raw<-data.table(read.csv('data/storm_data/FStormData.csv.bz2',stringsAsFactors=FALSE))
raw<-raw[,YEAR:=year(as.POSIXlt(strptime(raw$BGN_DATE,'%m/%d/%Y %H:%M:%S')))]
raw<-raw[YEAR>=2000,]
# calculate total economic damage per code book specifications
raw$CROPDMGEXP[raw$CROPDMGEXP=='']<-'1'
raw$CROPDMGEXP[raw$CROPDMGEXP=='K']<-'1000'
raw$CROPDMGEXP[raw$CROPDMGEXP=='M']<-'1000000'
raw$CROPDMGEXP[raw$CROPDMGEXP=='B']<-'1000000000'
raw$PROPDMGEXP[raw$PROPDMGEXP=='']<-'1'
raw$PROPDMGEXP[raw$PROPDMGEXP=='K']<-'1000'
raw$PROPDMGEXP[raw$PROPDMGEXP=='M']<-'1000000'
raw$PROPDMGEXP[raw$PROPDMGEXP=='B']<-'1000000000'
raw<-raw[,DAMAGE:=(as.numeric(PROPDMG)*as.numeric(PROPDMGEXP))+(as.numeric(CROPDMG)*as.numeric(CROPDMGEXP))]
# trim up any white-space
trim <- function(x) {
gsub('(^[[:space:]]+|[[:space:]]+$)','',x)
}
raw$EVTYPE<-mapply(trim,raw$EVTYPE)
# aggregate items into a clean version
pre.agg<-raw[,c(8,4,23,24,39),with=FALSE]
pre.agg<-ddply(pre.agg,c('EVTYPE'), summarize,FATALITIES=(sum(FATALITIES)),INJURIES=(sum(INJURIES)),DAMAGE=(sum(DAMAGE)))
pre.agg<-data.table(pre.agg)[,INCIDENTS:=FATALITIES+INJURIES]
pre.agg<-pre.agg[INCIDENTS>0 | DAMAGE>0,]
pre.agg<-pre.agg[order(-pre.agg$INCIDENTS),]
write.csv(pre.agg, file='data/storm_data/tidy_storm_data.csv',row.names=FALSE)
} else {
pre.agg<-data.table(read.csv('data/storm_data/tidy_storm_data.csv',stringsAsFactors=FALSE))
}
Since there are potential overlaps with EVTYPE in the raw data, e.g. ‘WILDFIRE’ and ‘WILD FOREST FIRE’, we want to combine identify these types as similar for aggregation purposes. The below fuzzy_lookup function uses the base agrep function to conduct this fuzzy lookup. This base function uses the Levenshtein edit distance to determine similarity. The initial data.table is passed into the function as an argument, and a similar data.table is returned with all the new aggregations in place.
fuzzy_lookup<-function(fuzzy_table) {
# get the initial variables for the fuction
fuzzy_table<-cbind(fuzzy_table,c(1:nrow(fuzzy_table)))
setnames(fuzzy_table,6,'ROW')
fuzzy_table$FUZZY<-as.character(NA)
fuzzy_table_length<-length(fuzzy_table$FUZZY)
luv<-fuzzy_table$EVTYPE
# loop thru each row and find the closest match (if any)
for (i in 1:(fuzzy_table_length)) {
val<-fuzzy_table$EVTYPE[i]
vec<-head(luv,i)
vec<-vec[!is.element(vec,val)]
fuzzy_value<-agrep(val,vec)
fuzzy_value<-fuzzy_value[1]
fuzzy_table$FUZZY[i]=vec[fuzzy_value]
}
# if a match was found, persist it to the source column and (re)aggregate
fuzzy_table$EVTYPE<-ifelse(is.na(fuzzy_table$FUZZY),fuzzy_table$EVTYPE,fuzzy_table$FUZZY)
fuzzy_table<-fuzzy_table[,c(1:5),with=FALSE]
fuzzy_table<-ddply(fuzzy_table, c('EVTYPE'), summarize,FATALITIES=(sum(FATALITIES)),
INJURIES=(sum(INJURIES)),
INCIDENTS=(sum(INCIDENTS)),
DAMAGE=(sum(DAMAGE)))
return(data.table(fuzzy_table))
}
Now that we have the fuzzy logic function, just pass our aggregated dataset in and we have our working ‘cleaned’ dataset returned and set to a variable.
cleaned.set<-fuzzy_lookup(pre.agg)
The below chunk of code uses our cleaned dataset to build and ‘incident’ dataset for analysis and visualization. The top 3 weather type by ‘Incident’ (Injuries plus Fatalities) are shown in the following segment.
incident.plot.data<-cleaned.set[INCIDENTS>0,]
total.incidents<-sum(cleaned.set$INCIDENTS)
incident.plot.data<-incident.plot.data[,INCIDENT_PERCENT:=round(incident.plot.data$INCIDENTS/total.incidents[1]*100,digits=1)]
incident.plot.data<-incident.plot.data[order(-incident.plot.data$INCIDENT_PERCENT),]
incident.plot.data<-head(incident.plot.data,n=3)[,c(1:4,6),with=FALSE]
setnames(incident.plot.data,c(1:5),c('Event','Fatalities','Injuries','Incidents','Incident %'))
incident.plot.data
## Event Fatalities Injuries Incidents Incident %
## 1: TORNADO 1193 15213 16406 39.9
## 2: EXCESSIVE HEAT 1244 4930 6174 15.0
## 3: LIGHTNING 466 2993 3459 8.4
The below chunk of code uses our cleaned dataset to build and ‘damage’ dataset for analysis and visualization. The top 4 weather types by ‘Damage’ (Property $ plus Crop $) are shown in the following segment.
damage.plot.data<-cleaned.set[DAMAGE>0,]
total.damage<-sum(damage.plot.data$DAMAGE)
damage.plot.data<-damage.plot.data[,DAMAGE_PERCENT:=round(damage.plot.data$DAMAGE/total.damage[1]*100,digits=1)]
damage.plot.data<-damage.plot.data[,INCIDENT_PERCENT:=round(damage.plot.data$INCIDENTS/total.incidents[1]*100,digits=1)]
damage.plot.data<-damage.plot.data[order(-damage.plot.data$DAMAGE_PERCENT),]
damage.plot.data$DAMAGE<-damage.plot.data$DAMAGE/1000000
damage.plot.data<-damage.plot.data[,c(1,5,6,7),with=FALSE]
damage.plot.data<-head(damage.plot.data,n=4)
setnames(damage.plot.data,c(1:4),c('Event','Damage $M','Damage %','Incident %'))
damage.plot.data
## Event Damage $M Damage % Incident %
## 1: FLASH FLOOD 151695 42.8 4.8
## 2: HURRICANE/TYPHOON 75406 21.3 3.3
## 3: STORM SURGE/TIDE 47813 13.5 0.0
## 4: TORNADO 19696 5.6 39.9
Using the chunk of code below, we can see clearly that Tornados cause the risk to pubic health with a 39.9% stake of all weather related instances. ‘Excessive Heat’ and ‘Lightning’ round out the top three, with 15.0% and 8.4%, respectively. However, it is important to note that this number utilizes both ‘Fatalities’ and ‘Injuries’; in turn, ‘Excessive Heat’ causes the most fatalites.
# reshape the data and plot
incident.plot.data<-melt(incident.plot.data[,c(1:4),with=FALSE],id=c('Event'))
p<-qplot(variable,value,data=incident.plot.data,facets=Event~.,geom='bar',stat='identity',position='dodge')
p<-p+xlab('Incident Type by Event')
p<-p+ylab('Number of Incidents')
p
The next result set seek to understand the monetary impact of any given weather event, and compare the percentage of the total impact accross all events to that of the incident percentages. ‘Flash Floods’ cause the most economic impact with 42.8% of the total. ‘Hurricanes’ and ‘Storm Surges’ round out the top three, with 21.3% and 13.5%, respectively. ‘Tornados’, while causing the most ‘health’ related incident only account for 5.6% or the total, or $19.6B since January 2000. The plot below shows this dichotomy in terms of percentage of totals.
# reshape the data and plot
damage.plot.data<-melt(damage.plot.data[,c(1,3,4),with=FALSE],id=c('Event'))
p<-qplot(variable,value,data=damage.plot.data,geom='bar',stat='identity',position='dodge')
p<-p + facet_wrap(~ Event)
p<-p+xlab('Top 4 Events by Percentage of Total Damage')
p<-p+ylab('Percentage (%)')
p
pre.agg
## EVTYPE FATALITIES INJURIES DAMAGE INCIDENTS
## 1: TORNADO 1193 15213 1.970e+10 16406
## 2: EXCESSIVE HEAT 1013 3708 4.968e+08 4721
## 3: LIGHTNING 466 2993 6.014e+08 3459
## 4: TSTM WIND 116 1753 2.424e+09 1869
## 5: THUNDERSTORM WIND 130 1400 3.781e+09 1530
## ---
## 103: TROPICAL DEPRESSION 0 0 1.737e+06 0
## 104: TSTM WIND (G40) 0 0 6.000e+03 0
## 105: TSTM WIND G45 0 0 1.000e+03 0
## 106: VOLCANIC ASH 0 0 5.000e+05 0
## 107: WHIRLWIND 0 0 7.000e+03 0
cleaned.set
## EVTYPE FATALITIES INJURIES INCIDENTS DAMAGE
## 1: ASTRONOMICAL HIGH TIDE 0 0 0 9.425e+06
## 2: ASTRONOMICAL LOW TIDE 0 0 0 3.200e+05
## 3: AVALANCHE 179 126 305 3.521e+06
## 4: BLIZZARD 19 23 42 1.248e+08
## 5: BLOWING DUST 0 0 0 2.000e+04
## 6: BRUSH FIRE 0 2 2 0.000e+00
## 7: COASTAL FLOOD 3 2 5 2.168e+08
## 8: COASTAL FLOODING 1 0 1 1.709e+07
## 9: COLD WEATHER 2 0 2 0.000e+00
## 10: DAM BREAK 0 0 0 1.000e+06
## 11: DENSE FOG 9 143 152 7.319e+06
## 12: DENSE SMOKE 0 0 0 1.000e+05
## 13: DROUGHT 0 4 4 9.982e+09
## 14: DROWNING 1 0 1 0.000e+00
## 15: DRY MICROBURST 2 19 21 6.145e+05
## 16: DUST DEVIL 2 36 38 6.080e+05
## 17: DUST STORM 10 243 253 7.329e+06
## 18: EXCESSIVE HEAT 1244 4930 6174 4.985e+08
## 19: EXTREME COLD/WIND CHILL 257 36 293 1.700e+08
## 20: EXTREME WINDCHILL 7 0 7 1.700e+07
## 21: FALLING SNOW/ICE 1 1 2 0.000e+00
## 22: FLASH FLOOD 866 1127 1993 1.517e+11
## 23: FOG 39 331 370 7.074e+06
## 24: FREEZE 0 0 0 5.335e+07
## 25: FREEZING FOG 0 0 0 2.182e+06
## 26: FROST/FREEZE 0 0 0 1.104e+09
## 27: FUNNEL CLOUD 0 0 0 8.010e+04
## 28: GUSTY WINDS 2 0 2 6.200e+04
## 29: HAIL 5 540 545 1.377e+10
## 30: HAZARDOUS SURF 0 1 1 0.000e+00
## 31: HEAVY RAIN 37 167 204 8.702e+08
## 32: HEAVY SNOW 33 318 351 3.271e+08
## 33: HEAVY SURF/HIGH SURF 44 74 118 1.058e+07
## 34: HIGH SEAS 3 7 10 0.000e+00
## 35: HIGH SURF 78 133 211 8.350e+07
## 36: HIGH SURF ADVISORY 0 0 0 2.000e+05
## 37: HIGH WATER 3 0 3 0.000e+00
## 38: HIGH WIND 131 677 808 5.444e+09
## 39: HURRICANE/TYPHOON 68 1291 1359 7.541e+10
## 40: ICE ON ROAD 1 0 1 0.000e+00
## 41: ICE STORM 30 115 145 2.929e+09
## 42: LAKE-EFFECT SNOW 0 0 0 4.012e+07
## 43: LAKESHORE FLOOD 0 0 0 7.540e+06
## 44: LANDSLIDE 37 52 89 3.446e+08
## 45: LATE SEASON SNOW 0 0 0 1.800e+05
## 46: LIGHT FREEZING RAIN 0 0 0 3.060e+05
## 47: LIGHT SNOW 1 2 3 1.064e+06
## 48: LIGHTNING 466 2993 3459 6.014e+08
## 49: MARINE HIGH WIND 1 1 2 1.301e+06
## 50: MARINE STRONG WIND 14 22 36 4.183e+05
## 51: MARINE THUNDERSTORM WIND 10 26 36 4.864e+05
## 52: MARINE TSTM WIND 9 9 18 5.421e+06
## 53: MIXED PRECIPITATION 0 0 0 5.380e+05
## 54: MUD SLIDE 0 0 0 1.325e+06
## 55: NON-SEVERE WIND DAMAGE 0 7 7 5.000e+03
## 56: NON TSTM WIND 0 0 0 4.000e+04
## 57: RIP CURRENT 462 383 845 1.610e+05
## 58: ROGUE WAVE 0 2 2 0.000e+00
## 59: ROUGH SEAS 8 5 13 0.000e+00
## 60: SEICHE 0 0 0 2.100e+05
## 61: SMALL HAIL 0 5 5 5.220e+06
## 62: SNOW SQUALLS 0 0 0 1.500e+04
## 63: STORM SURGE/TIDE 11 9 20 4.781e+10
## 64: STRONG WIND 100 266 366 2.407e+08
## 65: THUNDERSTORM WIND 131 1400 1531 3.781e+09
## 66: THUNDERSTORM WIND (G40) 1 0 1 0.000e+00
## 67: TORNADO 1193 15213 16406 1.970e+10
## 68: TROPICAL DEPRESSION 0 0 0 1.737e+06
## 69: TROPICAL STORM 50 267 317 7.607e+09
## 70: TSTM WIND 125 1807 1932 2.426e+09
## 71: TSTM WIND (G45) 0 3 3 9.000e+03
## 72: TSTM WIND/HAIL 2 34 36 4.895e+07
## 73: TSUNAMI 33 129 162 1.441e+08
## 74: UNSEASONABLY WARM 0 1 1 0.000e+00
## 75: URBAN/SML STREAM FLD 12 28 40 6.707e+06
## 76: VOLCANIC ASH 0 0 0 5.000e+05
## 77: WARM WEATHER 0 2 2 0.000e+00
## 78: WATERSPOUT 2 1 3 5.308e+06
## 79: WHIRLWIND 0 0 0 7.000e+03
## 80: WILD/FOREST FIRE 9 286 295 2.364e+09
## 81: WILDFIRE 75 911 986 5.054e+09
## 82: WINTER STORM 104 436 540 1.390e+09
## 83: WINTER WEATHER 33 347 380 3.585e+07
## 84: WINTER WEATHER/MIX 28 140 168 6.432e+06
## EVTYPE FATALITIES INJURIES INCIDENTS DAMAGE