Weather events can have important economic and human impacts on a community. Events such as tonados, hurricanes, droughts and floods can cause devastating losses of both economic capital and lives. Thus, preparedness for weather-related events is important for federal, state and municipal goverments. It is the purpose of this document to explore the storm database of the National Oceanogrphic and Atmospheric Administration (NOAA) in order to determine which types of weather events have had the largest, and smallest, impacts in terms of fatalities, injuries, property damage and crop damage.This analysis demonstrates that between 1950 and 2011, constrained by the events included in this data, that tornados have caused the most injuries and deaths, floods/surges have caused the most property damage, and droughts have caused the most crop damage.
The data were downloaded from the coursera web site here. The raw data file was then loaded into R using the command:
x<-read.csv("repdata-data-StormData.csv.bz2")
Preliminary examination of the data revealed that there were only a few of the 37 variables which would be of interest, namely FATALITIES, INJURIES, EVTYPE (event type) and indicators of economic damage, PROPDMG (property damage), PROPDMGEXP (property damage exponent), CROPDMG (crop damage) and CROPDMGEXP (crop damage exponent). Examination of these variables revealed several issues which needed to be addressed before meaningful analysis could be carried out.
With regard to the economic variables, both property damage and crop damage were broken up in to a coefficient and an exponent. The coefficient was fairly self-explanatory, but the exponent was not.
table(x$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
This code reveals that the exponent field is usually blank, or it may contain a number, a letter, or a special character. The numbers(n) were taken to mean 10^n. The letter were taken to mean 10^2 (h/H), 10^3 (K), 10^6 (m/M) or 10^9 (B). The entries with 7, 8 and B in the exponent field were looked at individually to determine if these high levels of economic damage made sense. This could be most easily determined by examining the REMARKS field.
x[,"REMARKS"]<-as.character(x[,"REMARKS"])
x[which(x$PROPDMGEXP=="8"),"REMARKS"]
## [1] "Several homes and several hundered vehicle were damaged. "
This remark would seem to justify the $100 million figure for damages.
The special characters, blanks and zeroes were all taken to mean 10^0=1. To evaluate if this made sense, the coefficients were totaled, considering that if there were only a small number of coefficients, then these odd exponents were unlikely to meaningfully impact the overall results.
tapply(x$PROPDMG,x$PROPDMGEXP,FUN=sum)
## - ? + 0 1 2
## 5.274e+02 1.500e+01 0.000e+00 1.170e+02 7.108e+03 0.000e+00 1.200e+01
## 3 4 5 6 7 8 B
## 2.000e+01 1.450e+01 2.105e+02 6.500e+01 8.200e+01 0.000e+00 2.759e+02
## h H K m M
## 2.000e+00 2.500e+01 1.074e+07 3.890e+01 1.407e+05
Note that this table reveals another issue. The coefficient for the one case with 8 for an exponent is 0, yet that storm did a lot of damage. So, the data was explored for other instances where a coefficient of zero was accompanied by a non-blank exponent.
#where property damage is zero but exponent is not
oddexp<-x[which(x$PROPDMG==0 & x$PROPDMGEXP!=""),]
table(oddexp$PROPDMGEXP)
##
## - ? + 0 1 2 3 4 5
## 0 0 8 0 7 25 12 3 0 10
## 6 7 8 B h H K m M
## 1 3 1 0 0 0 197184 0 11
There were over 197,000 instances. And once again, the REMARKS determined that these storms did, in fact, do considerable damage.
oddexp[which(oddexp$PROPDMGEXP==7),"REMARKS"]
## [1] "Over 40 building suffered structural damage to them. A few trees were knocked down. A man was injured in Liberty Hill as he sought shelter in his camp trailer. Eight power poles were blown down. "
## [2] "Several houses had windows broken and numerous cars were damaged. "
## [3] "A home was destroyed by a fire started by lightning. "
These same issues held true for the CROPDMGEXP variable. The final strategy for dealing with this was to impute a coefficient of 1 for those rows with a zero coefficient and a blank exponent, impute numerical exponents as described above, and create two new variables “TotalPD” (total property damage) and “TotalCD” (total crop damage).
x$PDExpNum<-0
x$PDExpNum[which(x$PROPDMGEXP=="B")]<-9
x$PDExpNum[which(x$PROPDMGEXP=="h" | x$PROPDMGEXP=="H")]<-2
x$PDExpNum[which(x$PROPDMGEXP=="K")]<-3
x$PDExpNum[which(x$PROPDMGEXP=="m" | x$PROPDMGEXP=="M")]<-6
x$PDExpNum[which(x$PROPDMGEXP=="" | x$PROPDMGEXP=="?" |x$PROPDMGEXP=="-" |
x$PROPDMGEXP=="+")]<-0
x$PDExpNum[which(x$PROPDMGEXP=="2")]<-2
x$PDExpNum[which(x$PROPDMGEXP=="3")]<-3
x$PDExpNum[which(x$PROPDMGEXP=="4")]<-4
x$PDExpNum[which(x$PROPDMGEXP=="5")]<-5
x$PDExpNum[which(x$PROPDMGEXP=="6")]<-6
x$PDExpNum[which(x$PROPDMGEXP=="7")]<-7
x$PDExpNum[which(x$PROPDMGEXP=="8")]<-8
#Same for crop damage
x$CDExpNum<-0
x$CDExpNum[which(x$CROPDMGEXP=="B")]<-9
x$CDExpNum[which(x$CROPDMGEXP=="h" | x$CROPDMGEXP=="H")]<-2
x$CDExpNum[which(x$CROPDMGEXP=="K")]<-3
x$CDExpNum[which(x$CROPDMGEXP=="m" | x$CROPDMGEXP=="M")]<-6
x$CDExpNum[which(x$CROPDMGEXP=="" | x$CROPDMGEXP=="?" |x$CROPDMGEXP=="-" |
x$CROPDMGEXP=="+")]<-0
x$CDExpNum[which(x$CROPDMGEXP=="2")]<-2
x$CDExpNum[which(x$CROPDMGEXP=="3")]<-3
x$CDExpNum[which(x$CROPDMGEXP=="4")]<-4
x$CDExpNum[which(x$CROPDMGEXP=="5")]<-5
x$CDExpNum[which(x$CROPDMGEXP=="6")]<-6
x$CDExpNum[which(x$CROPDMGEXP=="7")]<-7
x$CDExpNum[which(x$CROPDMGEXP=="8")]<-8
#imput a 1 for the coefficient for those with an exp but a 0 coef
x$PDCoef<-x$PROPDMG
x$PDCoef[which(x$PROPDMG==0 & x$PROPDMGEXP!="")]<-1
x$CDCoef<-x$CROPDMG
x$CDCoef[which(x$CROPDMG==0 & x$CROPDMGEXP!="")]<-1
#Calculate total damage in whole numbers
x$TotalPD<-x$PDCoef * 10^(x$PDExpNum)
x$TotalCD<-x$CDCoef * 10^(x$CDExpNum)
The next issue was the EVTYPE variable. Initial evaluation of the most common event types demonstrated considerable redundancies, inconsistent cases, misspellings and inconsisent abbreviations. For example “Hurricane” and “Hurricane/Tyhpoon” existed as separate types, as did “heat”, “excessive heat”, etc. To resolve this several steps were taken. First, the EVTYPE data was converted to all lower case. Then, after examining the most commonly used types, a heirarchy was constructed where more specific types of events took precedence over more general types. For instance, “hurricane” is more specific than “wind”, so “hurricane wind” and “hurricane rain” are both considered “hurricane” instead of “wind” or “rain”. This heirachy was enacted by creating a new variable “evcat” (event category) and using sequential grep statements on the EVTYPE variable. This heirachy was iteratively refined be examining the uncategorized EVTYPEs and incorporating them into the heirarchy. The end result was a 25-category factor which included the “Other” type. The most numerous type of event in the “Other” category was monthly event summaries - which could not be categorized by a specific weather event.
x$EVTYPE<-tolower(x$EVTYPE)
x$evcat<-"Other"
x$evcat[grep("dust",x$EVTYPE)]<-"dust"
x$evcat[grep("fog",x$EVTYPE)]<-"fog"
x$evcat[grep("landslide|mudslide",x$EVTYPE)]<-"landslide"
x$evcat[grep("drought|dry",x$EVTYPE)]<-"drought/dry"
x$evcat[grep("fire",x$EVTYPE)]<-"fire"
x$evcat[grep("heat|hot|warm",x$EVTYPE)]<-"heat"
x$evcat[grep("cold",x$EVTYPE)]<-"cold"
x$evcat[grep("ice|freez|frost|glaze|icy ",x$EVTYPE)]<-"ice/freeze"
x$evcat[grep("rain",x$EVTYPE)]<-"rain"
x$evcat[grep("flood|surge|surf|tide|fld",x$EVTYPE)]<-"flood/surge"
x$evcat[grep("tsunami",x$EVTYPE)]<-"tsunami"
x$evcat[grep("rip",x$EVTYPE)]<-"rip currents"
x$evcat[grep("hail",x$EVTYPE)]<-"hail"
x$evcat[grep("wind",x$EVTYPE)]<-"wind"
x$evcat[grep("lightning",x$EVTYPE)]<-"lightning"
x$evcat[grep("thunderstorm|tstm",x$EVTYPE)]<-"thunderstorm"
x$evcat[grep("spout|funnel",x$EVTYPE)]<-"funnel cloud/waterspout"
x$evcat[grep("microburst",x$EVTYPE)]<-"microburst"
x$evcat[grep("snow|blizzard|winter|wintry|sleet",x$EVTYPE)]<-"snow/blizzard"
x$evcat[grep("tornado",x$EVTYPE)]<-"tornado"
x$evcat[grep("tropical",x$EVTYPE)]<-"tropical storm/depression"
x$evcat[grep("hurricane|typhoon",x$EVTYPE)]<-"hurricane/typhoon"
x$evcat[grep("volcan",x$EVTYPE)]<-"volcano"
x$evcat[grep("avalanch",x$EVTYPE)]<-"avalanche"
uncat<-x[x$evcat=="Other",]
x$evcat<-factor(x$evcat)
The processing now complete, the totals of deaths, injuries, crop damage and property damage were computed.
finalDeaths<-data.frame(tapply(x$FATALITIES,x$evcat,FUN=sum,na.rm=TRUE))
names(finalDeaths)[1]<-"deaths"
finalInj<-data.frame(tapply(x$INJURIES,x$evcat,FUN=sum,na.rm=TRUE))
names(finalInj)[1]<-"injuries"
finalProp<-data.frame(tapply(x$TotalPD,x$evcat,FUN=sum,na.rm=TRUE))
names(finalProp)[1]<-"PropDamage"
finalCrop<-data.frame(tapply(x$TotalCD,x$evcat,FUN=sum,na.rm=TRUE))
names(finalCrop)[1]<-"CropDamage"
display.results<-cbind(finalDeaths,finalInj,prettyNum(finalProp,scientific=TRUE, digits=3),prettyNum(finalCrop,scientific=TRUE,digits=3))
display.results
## deaths injuries PropDamage CropDamage
## avalanche 224 171 8.86e+06 1.54e+05
## cold 215 280 1.24e+08 1.41e+09
## drought/dry 0 4 1.05e+09 1.60e+10
## dust 24 483 6.47e+06 3.35e+06
## fire 90 1608 8.50e+09 4.05e+08
## flood/surge 1735 8972 2.16e+11 1.23e+10
## fog 80 1076 2.37e+07 9.21e+05
## funnel cloud/waterspout 3 32 1.32e+07 3.44e+06
## hail 15 1371 1.62e+10 3.12e+09
## heat 3178 9243 2.16e+07 9.06e+08
## hurricane/typhoon 135 1333 8.54e+10 5.52e+09
## ice/freeze 114 2417 3.98e+09 7.03e+09
## landslide 44 55 3.26e+08 2.03e+07
## lightning 817 5231 9.52e+08 1.72e+07
## microburst 3 28 7.30e+06 1.50e+04
## Other 54 55 7.81e+06 1.48e+08
## rain 107 303 3.23e+09 8.12e+08
## rip currents 577 529 4.42e+05 2.77e+05
## snow/blizzard 551 3938 8.49e+09 3.18e+08
## thunderstorm 729 9544 1.12e+10 1.39e+09
## tornado 5661 91407 5.86e+10 4.40e+08
## tropical storm/depression 66 383 7.72e+09 6.95e+08
## tsunami 33 129 1.44e+08 3.80e+04
## volcano 0 0 5.03e+05 3.00e+03
## wind 690 1936 6.22e+09 9.08e+08
The following 3 plots demonstrate the human and economic impact of the most harmful 10 categories of events.
results<-cbind(finalDeaths,finalInj,finalProp,finalCrop)
barplot(sort(results$deaths,decreasing=TRUE)[1:10], las=2, cex.names=0.7,
main="Deaths by Event Category")
Plot 1: Total number of deaths in each category - only the 10 deadliest categories are displayed.
barplot(sort(results$injuries,decreasing=TRUE)[1:10], las=2, cex.names=0.7,
main="Injuries by Event Category")
Plot 2: Total number of injuries in each category - only the 10 most injurious categories are displayed.
barplot(sort(results$PropDamage+results$CropDamage,decreasing=TRUE)[1:10],
las=2, cex.names=0.7, main = "Crop + Property Damage by Event Category")
Plot 3: Total damage (property plus crop) in each category, in US Dollars - only the 10 most damaging categories are displayed.