Storm data for the United States from 1950 to 2011 were obtained from the NOAA Storm Database. After the dataset is cleaned to account for misspellings and abbreviations, the human and financial impacts of weather events are presented.
Tornadoes are the weather events with the greatest impact on human health, while Floods account for the largest financial impact.
This analysis is limited to describing impacts across the whole of the United States and does not take regional variability into account. Future analyses could provide detail at the national, regional (e.g. Pacific Northwest), state, and possibly county levels.
All analysis is based on the dataset provided by the NOAA Storm Database.
After reading in the data file, we extract only the 7 fields containing the information required to determine the human and economic impacts of the listed weather events.
read_get_Data<-function(url="https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2") {
tmp<-"./Data/NOAA_StormData.csv.bz2"
#make sure the target folder exists before downloading
if(!file.exists(dirname(tmp))){dir.create(dirname(tmp),recursive=TRUE)}
#if the file already exists, just read it, otherwise get it
#mode="wb" keeps the compressed file intact on Windows
if(!file.exists(tmp)){print("Downloading file.");download.file(url,destfile=tmp,mode="wb");
print("Reading downloaded file.")}
else{print("Reading saved file.")}
df<-read.csv(bzfile(tmp), strip.white=TRUE)
#reduce memory requirements: keep only the 7 fields needed
wdf<-df[,c("EVTYPE","INJURIES","FATALITIES","PROPDMG","PROPDMGEXP",
"CROPDMG","CROPDMGEXP")]
rm(df)
#Standardize Event Types
wdf$EVTYPE<-factor(toupper(wdf$EVTYPE))
return(wdf)
}
#store raw dataset in wdf (working data frame)
wdf<-read_get_Data()
## [1] "Downloading file."
## [1] "Reading downloaded file."
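The function above reads the whole file and then keeps seven fields. As an alternative sketch (not the approach used in this report), `read.csv` can skip unwanted columns at read time through its `colClasses` argument. The toy CSV text and the names `csv_text`, `wanted`, `hdr`, `classes` and `small` are illustrative assumptions, not part of the original analysis.

```r
# Toy CSV standing in for the storm data (values are made up).
csv_text <- "STATE,EVTYPE,INJURIES,FATALITIES,REMARKS
TX,TORNADO,3,1,none
FL,FLOOD,0,0,none"

# Hypothetical subset of the columns we want to keep.
wanted <- c("EVTYPE", "INJURIES", "FATALITIES")

# Read only the first row to learn the column names.
hdr <- names(read.csv(text = csv_text, nrows = 1))

# "NULL" tells read.csv to skip a column; NA lets it guess the type.
classes <- ifelse(hdr %in% wanted, NA, "NULL")

# Only the wanted columns are parsed and stored.
small <- read.csv(text = csv_text, colClasses = classes, strip.white = TRUE)
print(names(small))
```

With the real file, the `text =` argument would be replaced by the file path or connection; whether this saves much time on a compressed file is not something this sketch measures.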
There is a list of 48 allowed event types, but the data contain almost 900 unique entries, which implies that serious cleaning will be required.
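The comparison between recorded and allowed event types can be sketched on a toy example. The vectors `evtype` and `allowed` below are made-up stand-ins (the real list has 48 entries), so the counts are illustrative only.

```r
# Toy event labels (made up for illustration).
evtype <- c("TORNADO", "TSTM WIND", "FLOOD", "FLASH FLOODING", "TORNADO", "HAIL")

# Shortened, hypothetical stand-in for the list of allowed event types.
allowed <- c("TORNADO", "FLOOD", "HAIL", "THUNDERSTORM WIND", "FLASH FLOOD")

# Number of distinct labels actually present in the data.
n_unique <- length(unique(evtype))

# Labels that do not appear in the allowed list and so need cleaning.
nonstandard <- setdiff(unique(evtype), allowed)

# Fraction of rows carrying a non-standard label.
share_bad <- mean(!(evtype %in% allowed))

print(n_unique)
print(nonstandard)
print(share_bad)
```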
require(dplyr, quietly=TRUE, warn.conflicts=FALSE)
## Warning: package 'dplyr' was built under R version 3.1.3
clean_data1<-function(wdf){
allowedEV<-c("ASTRONOMICAL LOW TIDE","AVALANCHE","BLIZZARD","COASTAL FLOOD",
"COLD/WIND CHILL","DEBRIS FLOW","DENSE FOG","DENSE SMOKE",
"DROUGHT","DUST DEVIL","DUST STORM","EXCESSIVE HEAT",
"EXTREME COLD/WIND CHILL","FLASH FLOOD","FLOOD","FROST/FREEZE",
"FUNNEL CLOUD","FREEZING FOG","HAIL","HEAT","HEAVY RAIN",
"HEAVY SNOW","HIGH SURF","HIGH WIND","HURRICANE (TYPHOON)",
"ICE STORM","LAKE-EFFECT SNOW","LAKESHORE FLOOD","LIGHTNING",
"MARINE HAIL","MARINE HIGH WIND","MARINE STRONG WIND",
"MARINE THUNDERSTORM WIND", "RIP CURRENT","SEICHE","SLEET",
"STORM SURGE/TIDE","STRONG WIND","THUNDERSTORM WIND","TORNADO",
"TROPICAL DEPRESSION","TSUNAMI","VOLCANIC ASH","WATERSPOUT",
"TROPICAL STORM","WILDFIRE","WINTER STORM","WINTER WEATHER")
inEV<-wdf[,1]%in%allowedEV
#store good event types here
gdf<-data.frame(wdf[inEV==TRUE,])
#store event types needing further work here
wdf<-data.frame(wdf[inEV!=TRUE,])
return(list(gdf,wdf))
}
parts<-clean_data1(wdf) #list of good and bad dataframes
gdf<-parts[[1]] #good data frame
bdf<-parts[[2]] #bad data frame
After the initial matching to the allowed event types, 29.59%[1] of the entries remain to be fixed, spread over 852[2] unique event types.
How much will this affect the analysis? Looking only at fatalities, 13.78%[3] of them remain unassigned in the non-standard event type (EVTYPE) data set (the bad data frame). Clearly, more work is needed to clean this data set.
In this section, we process the ‘bad’ data left over from the first cleaning and correct common misspellings and abbreviations.
clean_data2<-function(wdf){
#fix event types as best we can.
allowedEV<-c("ASTRONOMICAL LOW TIDE","AVALANCHE","BLIZZARD","COASTAL FLOOD",
"COLD/WIND CHILL","DEBRIS FLOW","DENSE FOG","DENSE SMOKE",
"DROUGHT","DUST DEVIL","DUST STORM","EXCESSIVE HEAT",
"EXTREME COLD/WIND CHILL","FLASH FLOOD","FLOOD","FROST/FREEZE",
"FUNNEL CLOUD","FREEZING FOG","HAIL","HEAT","HEAVY RAIN",
"HEAVY SNOW","HIGH SURF","HIGH WIND","HURRICANE (TYPHOON)",
"ICE STORM","LAKE-EFFECT SNOW","LAKESHORE FLOOD","LIGHTNING",
"MARINE HAIL","MARINE HIGH WIND","MARINE STRONG WIND",
"MARINE THUNDERSTORM WIND", "RIP CURRENT","SEICHE","SLEET",
"STORM SURGE/TIDE","STRONG WIND","THUNDERSTORM WIND","TORNADO",
"TROPICAL DEPRESSION","TSUNAMI","VOLCANIC ASH","WATERSPOUT",
"TROPICAL STORM","WILDFIRE","WINTER STORM","WINTER WEATHER")
wdf$EVTYPE<-gsub("TSTM.*","THUNDERSTORM",wdf$EVTYPE)
wdf$EVTYPE<-gsub(".*THUNDERSTORM.*","THUNDERSTORM WIND",wdf$EVTYPE)
wdf$EVTYPE<-gsub("HURRICANE.*","HURRICANE (TYPHOON)", wdf$EVTYPE)
wdf$EVTYPE<-gsub(".*FLASH FLOOD.*","FLASH FLOOD",wdf$EVTYPE)
wdf$EVTYPE<-gsub("LIGHTNING.*","LIGHTNING",wdf$EVTYPE)
wdf$EVTYPE<-gsub(".*TORNADO.*","TORNADO",wdf$EVTYPE)
wdf$EVTYPE<-gsub(".*HAIL.*","HAIL",wdf$EVTYPE)
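#skip labels already set to FLASH FLOOD so they are not overwritten here
wdf$EVTYPE<-gsub("^(?!.*FLASH FLOOD).*FLOOD.*","FLOOD",wdf$EVTYPE,perl=TRUE)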
wdf$EVTYPE<-gsub(".*SLEET.*", "SLEET",wdf$EVTYPE)
wdf$EVTYPE<-gsub(".*WINDCHILL.*","EXTREME COLD/WIND CHILL", wdf$EVTYPE)
wdf$EVTYPE<-gsub(".*HEAT.*","EXCESSIVE HEAT",wdf$EVTYPE)
wdf$EVTYPE<-gsub(".*WARM.*","EXCESSIVE HEAT",wdf$EVTYPE)
wdf$EVTYPE<-gsub(".*SNOW.*","HEAVY SNOW",wdf$EVTYPE)
wdf$EVTYPE<-gsub("LANDSLIDE","DEBRIS FLOW",wdf$EVTYPE)
wdf$EVTYPE<-gsub(".*FOG.*","DENSE FOG",wdf$EVTYPE)
wdf$EVTYPE<-gsub(".*SMOKE.*","DENSE SMOKE",wdf$EVTYPE)
wdf$EVTYPE<-gsub(".*STORM SURGE.*","STORM SURGE/TIDE",wdf$EVTYPE)
wdf$EVTYPE<-gsub(".*TIDE.*","STORM SURGE/TIDE",wdf$EVTYPE)
wdf$EVTYPE<-gsub(".*WINTER STORM.*","WINTER STORM",wdf$EVTYPE)
wdf$EVTYPE<-gsub(".*FIRE.*","WILDFIRE",wdf$EVTYPE)
wdf$EVTYPE<-gsub(".*EXTREME.*COLD.*","EXTREME COLD/WIND CHILL",wdf$EVTYPE )
wdf$EVTYPE<-gsub(".*HIGH.*WIND.*","HIGH WIND",wdf$EVTYPE)
wdf$EVTYPE<-gsub(".*FLD.*","FLOOD",wdf$EVTYPE)
wdf$EVTYPE<-gsub(".*CURRENT.*","RIP CURRENT",wdf$EVTYPE)
inEV<-wdf[,1]%in%allowedEV
#store good event types here
gdf<-data.frame(wdf[inEV==TRUE,])
#store event types needing further work here
wdf<-data.frame(wdf[inEV!=TRUE,])
return(list(gdf,wdf))
}
parts<-clean_data2(bdf)
gdf<-rbind(gdf,parts[[1]]) #combine both sets of 'good' data
bdf<-parts[[2]] #leftover bad data
We now have 0.49%[1] of rows unaccounted for, with 365[2] unique event types.
Has this improved our data coverage? Looking again only at fatalities, 1.88%[3] remain unassigned to a weather event, so the bulk of the data is now covered.
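Because the substitutions are applied in sequence, their order matters: a broad pattern applied later can overwrite a more specific label assigned earlier. The sketch below shows this on three made-up labels and one possible guard, a negative lookahead with `perl = TRUE`. The names `labels`, `naive` and `guarded` are illustrative.

```r
# Made-up labels for illustration.
labels <- c("FLASH FLOODING", "RIVER FLOOD", "FLOODS")

# Naive order: the specific pattern runs first, then the broad
# pattern also matches the freshly assigned "FLASH FLOOD" label.
naive <- gsub(".*FLASH FLOOD.*", "FLASH FLOOD", labels)
naive <- gsub(".*FLOOD.*", "FLOOD", naive)

# Guarded version: the lookahead (?!.*FLASH FLOOD) makes the broad
# pattern skip any label that contains "FLASH FLOOD".
guarded <- gsub(".*FLASH FLOOD.*", "FLASH FLOOD", labels)
guarded <- gsub("^(?!.*FLASH FLOOD).*FLOOD.*", "FLOOD", guarded, perl = TRUE)

print(naive)
print(guarded)
```

In the naive version all three labels end up as FLOOD, while the guarded version keeps the first one as FLASH FLOOD.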
Next we will calculate the financial impacts of these weather events.
calc_StuffImpact<-function(wdf){
#translate K, M, B into power values (3, 6, 9)
#work on character values so this runs whether or not the field is a factor
wdf$PROPDMGEXP<-as.character(wdf$PROPDMGEXP)
#replace 0,K,M,B with appropriate power values(1,3,6,9)
wdf$PROPDMGEXP[wdf$PROPDMGEXP=="0"]<-"1"
wdf$PROPDMGEXP[wdf$PROPDMGEXP=="K"]<-"3"
wdf$PROPDMGEXP[wdf$PROPDMGEXP=="M"]<-"6"
wdf$PROPDMGEXP[wdf$PROPDMGEXP=="B"]<-"9"
#any other non-numeric code (e.g. "?" or "+") becomes NA, with a coercion warning
wdf$PROPDMGEXP<-as.numeric(wdf$PROPDMGEXP)
#same treatment for the Crop Exponent field
wdf$CROPDMGEXP<-as.character(wdf$CROPDMGEXP)
wdf$CROPDMGEXP[wdf$CROPDMGEXP=="0"]<-"1"
wdf$CROPDMGEXP[wdf$CROPDMGEXP=="K"]<-"3"
wdf$CROPDMGEXP[wdf$CROPDMGEXP=="M"]<-"6"
wdf$CROPDMGEXP[wdf$CROPDMGEXP=="B"]<-"9"
wdf$CROPDMGEXP<-as.numeric(wdf$CROPDMGEXP)
#calculate damage costs
wdf$PROPVAL<-wdf$PROPDMG*10^wdf$PROPDMGEXP
wdf$CROPVAL<-wdf$CROPDMG*10^wdf$CROPDMGEXP
wdf<-wdf%>%group_by(EVTYPE)%>%summarise(sumPROP=sum(PROPVAL,na.rm=TRUE),
sumCROP=sum(CROPVAL,na.rm=TRUE))
return(wdf)
}
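The damage calculation multiplies the recorded amount by ten raised to the exponent code's power. A small worked sketch with made-up values follows; it uses a named lookup vector instead of the in-place replacement in the function above, which is a swapped-in technique for brevity. The names `dmg`, `exp_code`, `power` and `value` are illustrative.

```r
# Made-up damage amounts and their exponent codes.
dmg <- c(2.5, 10, 1.2)
exp_code <- c("K", "M", "B")

# Lookup from exponent code to power of ten (K = thousands,
# M = millions, B = billions).
power <- c(K = 3, M = 6, B = 9)

# Dollar value = amount * 10^power, matched by code name.
value <- unname(dmg * 10^power[exp_code])

# 2.5 K -> 2,500; 10 M -> 10,000,000; 1.2 B -> 1,200,000,000
print(value)
```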
The following plot shows the human impacts of weather events.
plot_HumanImpact<-function(wdf){
#assumes col1=EVTYPE, col2=INJURIES, col3=FATALITIES
perDF<-wdf%>%group_by(EVTYPE)%>%summarise(sumINJ=sum(INJURIES),sumFAT=sum(FATALITIES))
#get top ten events of each type (Injuries/fatalities)
pltPerDF<-head(perDF[order(perDF$sumINJ,decreasing=TRUE),],10)
pltPerFDF<-head(perDF[order(perDF$sumFAT,decreasing=TRUE),],10)
par(mfrow=c(1,2), oma=c(0,0,1,0))
barplot(pltPerDF$sumINJ, names.arg=pltPerDF$EVTYPE, las=2, cex.axis=0.7,
cex.names=0.4,col="yellow",
ylab="Number of Injuries")
barplot(pltPerFDF$sumFAT, names.arg=pltPerFDF$EVTYPE, las=2, cex.axis=0.7,
cex.names=0.4,col="red",
ylab="Number of Fatalities")
title("Human Impact by Weather Event",outer=TRUE)
}
plot_HumanImpact(gdf)
From the plots, it is clear that Tornadoes are the most damaging weather events in terms of human impact (injury or death). Further, fatalities make up less than 10% of the combined injuries and fatalities, and the number of injuries from Thunderstorm Winds is close to the number of fatalities from Tornadoes.
plot_StuffImpact<-function(wdf){
#get top ten events of each type (Crop/Property damage)
pltPropDF<-head(wdf[order(wdf$sumPROP,decreasing=TRUE),],10)
pltCropDF<-head(wdf[order(wdf$sumCROP,decreasing=TRUE),],10)
par(mfrow=c(1,2), oma=c(0,0,1,0))
barplot(pltPropDF$sumPROP/1000000000, names.arg=pltPropDF$EVTYPE, las=2, cex.axis=0.7,
cex.names=0.4,col="lightgreen",
ylab="Property Damage (Billions of $)")
barplot(pltCropDF$sumCROP/1000000000, names.arg=pltCropDF$EVTYPE, las=2, cex.axis=0.7,
cex.names=0.4,col="lightblue",
ylab="Crop Damage (Billions of $)")
title("Financial Impact by Weather Event",outer=TRUE)
}
plot_StuffImpact(calc_StuffImpact(gdf))
## Warning in calc_StuffImpact(gdf): NAs introduced by coercion
## Warning in calc_StuffImpact(gdf): NAs introduced by coercion
From this plot we see that property damage far outweighs crop damage, and that Floods, Hurricanes, Tornadoes, and Storm Surges (Tides) cause the most damage. For farmers, Droughts and Floods are the most damaging weather events, as expected.
We have shown that Tornadoes are the most damaging weather events in terms of human impact, that Floods and Hurricanes have the largest financial effects, and that Droughts and Floods have the most impact on the agricultural sector.
This analysis is limited to describing impacts across the whole of the United States and does not take regional variability into account. Future analyses could provide detail at the national, regional (e.g. Pacific Northwest), state, and possibly county levels.
[1] Percent Bad Data:
format(((nrow(bdf)/(nrow(gdf)+nrow(bdf)))*100),nsmall=2,digits=2)
[2] Unique Event Types:
length(unique(bdf$EVTYPE))
[3] Percent Missing Fatalities:
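The expression behind [3] is not shown here. A plausible sketch, assuming it mirrors [1] with fatality counts in place of row counts, is given below; this is an assumption, not the original code. The toy frames `gdf_toy` and `bdf_toy` stand in for the good and bad data frames, and their values are made up.

```r
# Toy stand-ins for the 'good' and 'bad' data frames (values are made up).
gdf_toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(8, 1))
bdf_toy <- data.frame(EVTYPE = "URBAN SMALL STREAM", FATALITIES = 1)

# Fatalities still attached to non-standard event types.
bad_fat <- sum(bdf_toy$FATALITIES)

# All fatalities across both data frames.
all_fat <- sum(gdf_toy$FATALITIES) + bad_fat

# Percentage of fatalities not yet assigned to an allowed event type.
pct_missing <- bad_fat / all_fat * 100
print(format(pct_missing, nsmall = 2, digits = 2))
```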
sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
##
## locale:
## [1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Canada.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_0.4.1
##
## loaded via a namespace (and not attached):
## [1] assertthat_0.1 DBI_0.3.1 digest_0.6.8 evaluate_0.5.5
## [5] formatR_1.0 htmltools_0.2.6 knitr_1.9 lazyeval_0.1.10
## [9] magrittr_1.5 parallel_3.1.2 Rcpp_0.11.5 rmarkdown_0.5.1
## [13] stringr_0.6.2 tools_3.1.2 yaml_2.1.13