Synopsis

Storm data for the United States from 1950 to 2011 were obtained from the NOAA Storm Database. After the event types were cleaned to account for misspellings and abbreviations, the human and financial impacts of weather events are presented.

Tornadoes are the weather events with the greatest impact on human health, while Floods account for the largest financial impact.

This analysis is limited to describing impacts across the whole of the United States and does not take regional variability into account. Future analyses could add detail at the regional (e.g., Pacific Northwest), state, and possibly county levels.

Data Processing

All analysis is based on the dataset provided by the NOAA Storm Database.

After reading in the data file, we extract only the 7 fields containing the information required to determine the human and economic impacts of the listed weather events.

read_get_Data<-function(url="https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2") {
    tmp<-"./Data/NOAA_StormData.csv.bz2"
    #make sure the target directory exists before downloading
    dir.create(dirname(tmp), showWarnings=FALSE, recursive=TRUE)
    #setInternet2 only exists on Windows builds of R; older versions need it for https
    if(.Platform$OS.type=="windows" && getRversion()<"3.3.0") setInternet2(use=TRUE)
    #if the file already exists, just read it, otherwise get it
    if(!file.exists(tmp)){
        print("Downloading file.")
        #mode="wb" keeps the compressed file intact on Windows
        download.file(url, destfile=tmp, mode="wb")
        print("Reading downloaded file.")
    } else {
        print("Reading saved file.")
    }
    df<-read.csv(bzfile(tmp), strip.white=TRUE)
    #reduce memory requirements by keeping only the 7 fields we need
    wdf<-df[,c("EVTYPE","INJURIES","FATALITIES","PROPDMG","PROPDMGEXP",
               "CROPDMG","CROPDMGEXP")]
    rm(df)
    #Standardize Event Types
    wdf$EVTYPE<-factor(toupper(wdf$EVTYPE))
    return(wdf)
}

#store raw dataset in wdf (working data frame)
wdf<-read_get_Data()
## [1] "Downloading file."
## [1] "Reading downloaded file."

There is a list of 48 allowed event types, but the data contain almost 900 unique entries, which implies that substantial cleaning is required.
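
As a rough illustration of how that mismatch can be measured, the sketch below compares a small made-up vector of event labels against a shortened list of allowed types. The names `events` and `allowed` are hypothetical stand-ins for `wdf$EVTYPE` and the full 48-item list; the numbers it prints describe the toy data only.

```r
# Toy stand-ins (assumptions): a few raw labels and a shortened allowed list
events  <- c("TORNADO", "TSTM WIND", "FLOOD", "FLASH FLOODING", "TORNADO", "HAIL")
allowed <- c("TORNADO", "FLOOD", "HAIL", "THUNDERSTORM WIND", "FLASH FLOOD")

# Number of distinct labels present in the data
n_unique <- length(unique(events))

# Labels that do not exactly match an allowed type
unmatched <- setdiff(unique(events), allowed)

# Share of rows (not labels) that still need cleaning, in percent
pct_rows_unmatched <- 100 * mean(!(events %in% allowed))

print(n_unique)
print(unmatched)
print(pct_rows_unmatched)
```

Counting rows as well as labels matters here: many distinct misspellings can account for only a small share of the records, or the reverse.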

require(dplyr, quietly=TRUE, warn.conflicts=FALSE)
## Warning: package 'dplyr' was built under R version 3.1.3
clean_data1<-function(wdf){
    allowedEV<-c("ASTRONOMICAL LOW TIDE","AVALANCHE","BLIZZARD","COASTAL FLOOD",
                 "COLD/WIND CHILL","DEBRIS FLOW","DENSE FOG","DENSE SMOKE",
                 "DROUGHT","DUST DEVIL","DUST STORM","EXCESSIVE HEAT",
                 "EXTREME COLD/WIND CHILL","FLASH FLOOD","FLOOD","FROST/FREEZE",
                 "FUNNEL CLOUD","FREEZING FOG","HAIL","HEAT","HEAVY RAIN",
                 "HEAVY SNOW","HIGH SURF","HIGH WIND","HURRICANE (TYPHOON)",
                 "ICE STORM","LAKE-EFFECT SNOW","LAKESHORE FLOOD","LIGHTNING",
                 "MARINE HAIL","MARINE HIGH WIND","MARINE STRONG WIND",
                 "MARINE THUNDERSTORM WIND", "RIP CURRENT","SEICHE","SLEET",
                 "STORM SURGE/TIDE","STRONG WIND","THUNDERSTORM WIND","TORNADO",
                 "TROPICAL DEPRESSION","TSUNAMI","VOLCANIC ASH","WATERSPOUT",
                 "TROPICAL STORM","WILDFIRE","WINTER STORM","WINTER WEATHER")
    inEV<-wdf[,1]%in%allowedEV
    #store good event types here
    gdf<-data.frame(wdf[inEV==TRUE,])
    #store event types needing further work here
    wdf<-data.frame(wdf[inEV!=TRUE,])
    return(list(gdf,wdf))
}
#use a name other than 'ls' so the base function ls() is not masked
parts<-clean_data1(wdf) #list of good and bad dataframes
gdf<-parts[[1]] #good data frame
bdf<-parts[[2]] #bad data frame

After initial matching to the allowed event types, 29.59%[1] of the entries remain to be fixed, spread over 852[2] unique event types.

How much will this affect the analysis? Looking at fatalities alone, 13.78%[3] of them are still attached to non-standard event types (EVTYPE) in the 'bad' data frame. Clearly, more work is needed to clean this dataset.

In this section, we process the ‘bad’ data left over from the first cleaning and correct common misspellings and abbreviations.
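
The fatality share quoted above is a simple ratio of sums. The sketch below shows the arithmetic on two made-up data frames; `gdf_toy` and `bdf_toy` are hypothetical stand-ins for the matched and unmatched data, and their values are invented for illustration.

```r
# Toy data frames (assumptions) standing in for the matched and unmatched rows
gdf_toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(80, 10))
bdf_toy <- data.frame(EVTYPE = c("TSTM WIND", "HEAT WAVE"), FATALITIES = c(4, 6))

# Percent of all fatalities that sit in the unmatched rows
pct_unassigned <- 100 * sum(bdf_toy$FATALITIES) /
    (sum(gdf_toy$FATALITIES) + sum(bdf_toy$FATALITIES))

print(pct_unassigned)
```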

clean_data2<-function(wdf){
    #fix event types as best we can.
    allowedEV<-c("ASTRONOMICAL LOW TIDE","AVALANCHE","BLIZZARD","COASTAL FLOOD",
                 "COLD/WIND CHILL","DEBRIS FLOW","DENSE FOG","DENSE SMOKE",
                 "DROUGHT","DUST DEVIL","DUST STORM","EXCESSIVE HEAT",
                 "EXTREME COLD/WIND CHILL","FLASH FLOOD","FLOOD","FROST/FREEZE",
                 "FUNNEL CLOUD","FREEZING FOG","HAIL","HEAT","HEAVY RAIN",
                 "HEAVY SNOW","HIGH SURF","HIGH WIND","HURRICANE (TYPHOON)",
                 "ICE STORM","LAKE-EFFECT SNOW","LAKESHORE FLOOD","LIGHTNING",
                 "MARINE HAIL","MARINE HIGH WIND","MARINE STRONG WIND",
                 "MARINE THUNDERSTORM WIND", "RIP CURRENT","SEICHE","SLEET",
                 "STORM SURGE/TIDE","STRONG WIND","THUNDERSTORM WIND","TORNADO",
                 "TROPICAL DEPRESSION","TSUNAMI","VOLCANIC ASH","WATERSPOUT",
                 "TROPICAL STORM","WILDFIRE","WINTER STORM","WINTER WEATHER")
    wdf$EVTYPE<-gsub("TSTM.*","THUNDERSTORM",wdf$EVTYPE)
    wdf$EVTYPE<-gsub(".*THUNDERSTORM.*","THUNDERSTORM WIND",wdf$EVTYPE)
    wdf$EVTYPE<-gsub("HURRICANE.*","HURRICANE (TYPHOON)", wdf$EVTYPE)
    wdf$EVTYPE<-gsub(".*FLASH FLOOD.*","FLASH FLOOD",wdf$EVTYPE)
    wdf$EVTYPE<-gsub("LIGHTNING.*","LIGHTNING",wdf$EVTYPE)
    wdf$EVTYPE<-gsub(".*TORNADO.*","TORNADO",wdf$EVTYPE)
    wdf$EVTYPE<-gsub(".*HAIL.*","HAIL",wdf$EVTYPE)
    #generic flood rule; the lookahead skips entries already set to FLASH FLOOD
    wdf$EVTYPE<-gsub("^(?!.*FLASH).*FLOOD.*","FLOOD",wdf$EVTYPE,perl=TRUE)
    wdf$EVTYPE<-gsub(".*SLEET.*", "SLEET",wdf$EVTYPE)
    wdf$EVTYPE<-gsub(".*WINDCHILL.*","EXTREME COLD/WIND CHILL", wdf$EVTYPE)
    wdf$EVTYPE<-gsub(".*HEAT.*","EXCESSIVE HEAT",wdf$EVTYPE)
    wdf$EVTYPE<-gsub(".*WARM.*","EXCESSIVE HEAT",wdf$EVTYPE)
    wdf$EVTYPE<-gsub(".*SNOW.*","HEAVY SNOW",wdf$EVTYPE)
    wdf$EVTYPE<-gsub("LANDSLIDE","DEBRIS FLOW",wdf$EVTYPE)
    wdf$EVTYPE<-gsub(".*FOG.*","DENSE FOG",wdf$EVTYPE)
    wdf$EVTYPE<-gsub(".*SMOKE.*","DENSE SMOKE",wdf$EVTYPE)
    wdf$EVTYPE<-gsub(".*STORM SURGE.*","STORM SURGE/TIDE",wdf$EVTYPE)
    wdf$EVTYPE<-gsub(".*TIDE.*","STORM SURGE/TIDE",wdf$EVTYPE)
    wdf$EVTYPE<-gsub(".*WINTER STORM.*","WINTER STORM",wdf$EVTYPE)
    wdf$EVTYPE<-gsub(".*FIRE.*","WILDFIRE",wdf$EVTYPE)
    wdf$EVTYPE<-gsub(".*EXTREME.*COLD.*","EXTREME COLD/WIND CHILL",wdf$EVTYPE )
    wdf$EVTYPE<-gsub(".*HIGH.*WIND.*","HIGH WIND",wdf$EVTYPE)
    wdf$EVTYPE<-gsub(".*FLD.*","FLOOD",wdf$EVTYPE)
    wdf$EVTYPE<-gsub(".*CURRENT.*","RIP CURRENT",wdf$EVTYPE)
    inEV<-wdf[,1]%in%allowedEV
    #store good event types here
    gdf<-data.frame(wdf[inEV==TRUE,])
    #store event types needing further work here
    wdf<-data.frame(wdf[inEV!=TRUE,])
    return(list(gdf,wdf))
}

#use a name other than 'ls' so the base function ls() is not masked
parts<-clean_data2(bdf)
gdf<-rbind(gdf,parts[[1]]) #combine both sets of 'good' data
bdf<-parts[[2]] #leftover bad data

We now have 0.49%[1] of rows unaccounted for, with 365[2] unique event types remaining.

Has this improved our data coverage? Looking again at fatalities alone, only 1.88%[3] remain unassigned to a standard weather event, so the bulk of the data is now covered.
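
Because each pattern rule rewrites the whole label, the order of the rules matters: a broad rule can overwrite the result of a more specific one. The sketch below applies a few ordered rules to made-up labels. The `rules` vector and the lookahead pattern are assumptions made for this illustration, not the full rule set used in the analysis.

```r
# Toy labels (assumption): non-standard spellings of the kind found in EVTYPE
raw <- c("TSTM WIND", "FLASH FLOODING", "RIVER FLOOD", "URBAN/SML STREAM FLD")

# Ordered pattern -> replacement rules; more specific rules come first
rules <- c("TSTM.*"                = "THUNDERSTORM WIND",
           ".*FLASH FLOOD.*"       = "FLASH FLOOD",
           "^(?!.*FLASH).*FLOOD.*" = "FLOOD",   # skip labels containing FLASH
           ".*FLD.*"               = "FLOOD")

clean <- raw
for (pattern in names(rules)) {
    # perl = TRUE is needed for the negative lookahead in the third rule
    clean <- gsub(pattern, rules[[pattern]], clean, perl = TRUE)
}

print(clean)
```

Without the lookahead, a plain `.*FLOOD.*` rule applied after the flash-flood rule would turn "FLASH FLOOD" back into "FLOOD".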

Next we will calculate the financial impacts of these weather events.

calc_StuffImpact<-function(wdf){
    #translate K, M, B into power values (3, 6, 9)
    #work on upper-case character codes so lower-case entries (k, m) are
    #treated like K and M, and no factor levels have to be added by hand
    wdf$PROPDMGEXP<-toupper(as.character(wdf$PROPDMGEXP))
    #replace 0,K,M,B with appropriate power values(1,3,6,9)
    wdf$PROPDMGEXP[wdf$PROPDMGEXP=="0"]<-"1"
    wdf$PROPDMGEXP[wdf$PROPDMGEXP=="K"]<-"3"
    wdf$PROPDMGEXP[wdf$PROPDMGEXP=="M"]<-"6"
    wdf$PROPDMGEXP[wdf$PROPDMGEXP=="B"]<-"9"
    #codes that are not numeric (e.g. "?") become NA, with a warning
    wdf$PROPDMGEXP<-as.numeric(wdf$PROPDMGEXP)
    #same translation for the Crop Exponent field
    wdf$CROPDMGEXP<-toupper(as.character(wdf$CROPDMGEXP))
    wdf$CROPDMGEXP[wdf$CROPDMGEXP=="0"]<-"1"
    wdf$CROPDMGEXP[wdf$CROPDMGEXP=="K"]<-"3"
    wdf$CROPDMGEXP[wdf$CROPDMGEXP=="M"]<-"6"
    wdf$CROPDMGEXP[wdf$CROPDMGEXP=="B"]<-"9"
    wdf$CROPDMGEXP<-as.numeric(wdf$CROPDMGEXP)
    #calculate damage costs
    wdf$PROPVAL<-wdf$PROPDMG*10^wdf$PROPDMGEXP
    wdf$CROPVAL<-wdf$CROPDMG*10^wdf$CROPDMGEXP
    #total damage per event type; rows with an NA exponent are ignored
    wdf<-wdf%>%group_by(EVTYPE)%>%summarise(sumPROP=sum(PROPVAL,na.rm=TRUE),
                                            sumCROP=sum(CROPVAL,na.rm=TRUE))
    return(wdf)
}
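
To make the exponent arithmetic concrete, here is a minimal worked example on made-up records. The `dmg`, `code`, and `power` names are hypothetical; the lookup covers only the K, M, and B codes.

```r
# Hypothetical records: damage amount plus exponent code
dmg  <- c(25, 2.5, 1.1)
code <- c("K", "M", "B")

# Lookup from code to power of ten (K = 10^3, M = 10^6, B = 10^9)
power <- c(K = 3, M = 6, B = 9)

# Dollar value = amount * 10^power
value <- dmg * 10^power[code]

print(unname(value))
```

So a record of 25 with code K stands for 25,000 dollars, and 1.1 with code B stands for 1.1 billion dollars.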

Results

The following plot shows the human impacts of weather events.

plot_HumanImpact<-function(wdf){
    #assumes col1=EVTYPE, col2=INJURIES, col3=FATALITIES
    perDF<-wdf%>%group_by(EVTYPE)%>%summarise(sumINJ=sum(INJURIES),sumFAT=sum(FATALITIES))
    #get top ten events of each type (Injuries/fatalities)
    pltPerDF<-head(perDF[order(perDF$sumINJ,decreasing=TRUE),],10)
    pltPerFDF<-head(perDF[order(perDF$sumFAT,decreasing=TRUE),],10)
    par(mfrow=c(1,2), oma=c(0,0,1,0))
    barplot(pltPerDF$sumINJ, names.arg=pltPerDF$EVTYPE, las=2, cex.axis=0.7,
          cex.names=0.4,col="yellow",
          ylab="Number of Injuries")
    barplot(pltPerFDF$sumFAT, names.arg=pltPerFDF$EVTYPE, las=2, cex.axis=0.7,
        cex.names=0.4,col="red",
        ylab="Number of Fatalities")
    title("Human Impact by Weather Event",outer=TRUE)
}

plot_HumanImpact(gdf)

From the plots, it is easy to see that Tornadoes are the most damaging weather events from a human impact (injury or death) perspective. Further, fatalities make up less than 10% of the impacts, and the number of injuries from Thunderstorm Winds is close to the number of fatalities from Tornadoes.
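
The ranking behind these plots is a group-wise sum followed by a sort. The sketch below reproduces that idea in base R on made-up numbers; `toy` and its values are hypothetical, and `aggregate()` is used here in place of the dplyr pipeline.

```r
# Toy records (assumption): several rows per event type
toy <- data.frame(EVTYPE   = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
                  INJURIES = c(50, 5, 30, 20, 1))

# Sum injuries per event type
totals <- aggregate(INJURIES ~ EVTYPE, data = toy, FUN = sum)

# Keep the two event types with the most injuries
top2 <- head(totals[order(totals$INJURIES, decreasing = TRUE), ], 2)

print(top2)
```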

plot_StuffImpact<-function(wdf){
    #get top ten events of each type (Crop/Property damage)
    pltPropDF<-head(wdf[order(wdf$sumPROP,decreasing=TRUE),],10)
    pltCropDF<-head(wdf[order(wdf$sumCROP,decreasing=TRUE),],10)
    par(mfrow=c(1,2), oma=c(0,0,1,0))
    barplot(pltPropDF$sumPROP/1000000000, names.arg=pltPropDF$EVTYPE, las=2, cex.axis=0.7,
            cex.names=0.4,col="lightgreen",
            ylab="Property Damage (Billions of $)")
    barplot(pltCropDF$sumCROP/1000000000, names.arg=pltCropDF$EVTYPE, las=2, cex.axis=0.7,
            cex.names=0.4,col="lightblue",
            ylab="Crop Damage (Billions of $)")
    title("Financial Impact by Weather Event",outer=TRUE)
}


plot_StuffImpact(calc_StuffImpact(gdf))
## Warning in calc_StuffImpact(gdf): NAs introduced by coercion
## Warning in calc_StuffImpact(gdf): NAs introduced by coercion

From this plot we see that property damage far outweighs crop damage, and that Floods, Hurricanes, Tornadoes, and Storm Surges (Tides) cause the most damage. For farmers, Droughts and Floods are the most damaging weather events, as expected.

Summary

We have shown that Tornadoes are the most damaging weather events in terms of human impact, that Floods and Hurricanes cause the largest financial losses, and that Droughts and Floods have the greatest impact on the agricultural sector.

This analysis is limited to describing impacts across the whole of the United States and does not take regional variability into account. Future analyses could add detail at the regional (e.g., Pacific Northwest), state, and possibly county levels.


Inline code:

[1] Percent Bad Data:

format(((nrow(bdf)/(nrow(gdf)+nrow(bdf)))*100),nsmall=2,digits=2)

[2] Unique Events:

length(unique(bdf$EVTYPE))

[3] Percent Missing Fatalities:

format((sum(bdf$FATALITIES)/(sum(gdf$FATALITIES,na.rm=TRUE)+sum(bdf$FATALITIES)))*100,nsmall=2,digits=2)


System Environment:

sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## 
## locale:
## [1] LC_COLLATE=English_Canada.1252  LC_CTYPE=English_Canada.1252   
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C                   
## [5] LC_TIME=English_Canada.1252    
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] dplyr_0.4.1
## 
## loaded via a namespace (and not attached):
##  [1] assertthat_0.1  DBI_0.3.1       digest_0.6.8    evaluate_0.5.5 
##  [5] formatR_1.0     htmltools_0.2.6 knitr_1.9       lazyeval_0.1.10
##  [9] magrittr_1.5    parallel_3.1.2  Rcpp_0.11.5     rmarkdown_0.5.1
## [13] stringr_0.6.2   tools_3.1.2     yaml_2.1.13