Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
To find the types of events with the greatest effects on population health and the largest economic consequences, Storm Data from the National Weather Service (NOAA) was analysed. The data were collected from 1950 through November 2011. Because of the long time period and the changing event type definitions over that time, the data had to be cleaned up. For this analysis, human impact was quantified by casualties, defined as deaths + injuries. Economic impact was measured by summing property and crop damage.
Set local working directory.
setwd("~/code/DataScience/class/05_Reproducible Research/RepData_PeerAssessment2")
Load all necessary libraries.
library(stringdist)
Set global options.
Download the NOAA storm data if it is not already in the data directory, then load it.
if(!file.exists("data")){dir.create("data")}
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
filename <- "storm_data.csv.bz2"
if(!file.exists(paste("data/",filename,sep = ""))) {
download.file(fileUrl, destfile = paste("data/",filename,sep = ""), method = "curl")
}
stormdata <- read.csv(bzfile(paste("data/",filename,sep = "")), header = TRUE)
According to the National Weather Service Storm Data Documentation, there are 48 official event types (Section 2.1.1, p. 6). In contrast, the event type field (EVTYPE) in the NOAA database has 985 levels. Many of the reported event types are mistyped, use slightly different terminology, have formatting issues, or duplicate other weather events. As a result, the data have to be cleaned up before aggregating by the EVTYPE field, as follows:
Build lookup table:
evs <- c("Astronomical Low Tide", "Avalanche", "Blizzard", "Coastal Flood", "Cold/Wind Chill", "Debris Flow", "Dense Fog", "Dense Smoke", "Drought", "Dust Devil", "Dust Storm", "Excessive Heat", "Extreme Cold/Wind Chill", "Flash Flood", "Flood", "Frost/Freeze", "Funnel Cloud", "Freezing Fog", "Hail", "Heat", "Heavy Rain", "Heavy Snow", "High Surf", "High Wind", "Hurricane (Typhoon)", "Ice Storm", "Lake-Effect Snow", "Lakeshore Flood", "Lightning", "Marine Hail", "Marine High Wind", "Marine Strong Wind", "Marine Thunderstorm Wind", "Rip Current", "Seiche", "Sleet", "Storm Surge/Tide", "Strong Wind", "Thunderstorm Wind", "Tornado", "Tropical Depression", "Tropical Storm", "Tsunami", "Volcanic Ash", "Waterspout", "Wildfire", "Winter Storm", "Winter Weather", "OTHER")
## Create data frame of event types and their frequency
x <- as.data.frame(table(stormdata$EVTYPE))
## Order by frequency DESC
x <- x[with(x, order(-Freq)),]
## Approximate string matching
xm <- amatch(x$Var1, evs, maxDist = 25)
## Show the fit of the approximation (TRUE = no match found):
print(summary(is.na(xm)))
## Mode FALSE TRUE NA's
## logical 961 24 0
## Use xm to generate the mapping table (vectorized lookup)
x$event <- evs[xm]
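To illustrate what approximate matching does, here is a minimal sketch on made-up event names. It uses base R's `adist` (generalized Levenshtein distance) as a stand-in for `stringdist::amatch`, upper-cases both sides before comparing, and applies a small threshold of 4. The helper `nearest_event`, the sample strings, and the threshold are assumptions for illustration only, not part of the analysis.

```r
# Hypothetical subset of official event names and some messy raw labels
official <- c("Flash Flood", "Flood", "Hail", "Tornado")
raw <- c("FLASH FLOODING", "HAIL 075", "TORNADOES", "SUMMARY OF MAY 10")

# Return the closest official name, or NA if even the closest one
# is further away than max_dist edits
nearest_event <- function(raw, official, max_dist) {
  # Distance matrix: one row per raw label, one column per official name
  d <- adist(toupper(raw), toupper(official))
  best <- apply(d, 1, which.min)
  best_dist <- d[cbind(seq_along(raw), best)]
  ifelse(best_dist <= max_dist, official[best], NA_character_)
}

matched <- nearest_event(raw, official, max_dist = 4)
print(data.frame(raw = raw, matched = matched))
```

A small threshold leaves labels such as summary rows unmatched, whereas a very large one assigns almost every label to some official name, however distant.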
Only 24 of the 985 event types (2.44%) could not be matched by the approximation method. Therefore, the lookup table can be used to clean up the EVTYPE data in a new column, n_EVTYPE. The cleanup covers only the 17 event types with the highest frequency; the other event types were not remapped.
stormdata$EVTYPE <- gsub("[/\\& ]", "", toupper(stormdata[, "EVTYPE"]))
## Map the most frequent raw event types to official names.
## ".*" consumes the rest of the string, so variants such as "FLOODING" collapse to one label.
evmap <- c("^RIVERFLOOD.*" = "Flood", "^HURRICANE.*" = "Hurricane (Typhoon)", "^EXCESSIVEHEAT.*" = "Excessive Heat", "^FUNNELCLOUD.*" = "Funnel Cloud", "^WINTERSTORM.*" = "Winter Weather", "^HEAVYRAIN.*" = "Heavy Rain", "^HEAVYSNOW.*" = "Heavy Snow", "^LIGHTNING.*" = "Lightning", "^HIGHWIND.*" = "High Wind", "^FLOOD.*" = "Flood", "^FLASHFLOOD.*" = "Flash Flood", "^TORNADO.*" = "Tornado", "^THUNDERSTORMWIND.*" = "High Wind", "^TSTMWIND.*" = "High Wind", "^HAIL.*" = "Hail")
stormdata$n_EVTYPE <- toupper(stormdata$EVTYPE)
for (p in names(evmap)) {
stormdata$n_EVTYPE <- sub(p, evmap[[p]], stormdata$n_EVTYPE)
}
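When writing such patterns, note that a trailing `*` in a regular expression applies only to the preceding character, whereas `.*` matches the rest of the string. The following small sketch, using made-up labels, shows the difference when normalizing event names:

```r
raw <- c("FLOOD", "FLOODING", "FLOODS")

# "D*" means "zero or more D", so only the prefix is replaced
star_only <- sub("^FLOOD*", "Flood", raw)
print(star_only)  # "Flood" "FloodING" "FloodS"

# ".*" swallows the remainder, so all variants collapse to one label
dot_star <- sub("^FLOOD.*", "Flood", raw)
print(dot_star)   # "Flood" "Flood" "Flood"
```

Only the second form yields a single category that aggregates cleanly.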
The four columns CROPDMG, CROPDMGEXP, PROPDMG, and PROPDMGEXP track the expense amounts and have to be converted from mixed numeric/factor columns to numeric columns. The four new columns with the cleaned-up data are n_CROPDMG, n_CROPDMGEXP, n_PROPDMG, and n_PROPDMGEXP. Two additional columns, CROPDMGEXPENSE and PROPDMGEXPENSE, hold the expenses calculated as n_CROPDMG * n_CROPDMGEXP and n_PROPDMG * n_PROPDMGEXP respectively.
## Crop damage
stormdata$n_CROPDMG <- as.numeric(as.character(stormdata$CROPDMG))
stormdata[is.na(stormdata$n_CROPDMG), "n_CROPDMG"] <- 0 ## set the NA to 0
stormdata$n_CROPDMGEXP <- suppressWarnings(as.numeric(as.character(gsub("[Bb]", "1000000000", gsub("[Mm]", "1000000", gsub("[Kk]", "1000", stormdata$CROPDMGEXP))))))
stormdata[is.na(stormdata$n_CROPDMGEXP), "n_CROPDMGEXP"] <- 1 ## missing or unrecognized exponent: multiply by 1
stormdata$CROPDMGEXPENSE <- stormdata$n_CROPDMG * stormdata$n_CROPDMGEXP
## Property damage
stormdata$n_PROPDMG <- as.numeric(as.character(stormdata$PROPDMG))
stormdata[is.na(stormdata$n_PROPDMG), "n_PROPDMG"] <- 0 ## set the NA to 0
stormdata$n_PROPDMGEXP <- suppressWarnings(as.numeric(as.character(gsub("[Bb]", "1000000000", gsub("[Mm]", "1000000", gsub("[Kk]", "1000", stormdata$PROPDMGEXP))))))
stormdata[is.na(stormdata$n_PROPDMGEXP), "n_PROPDMGEXP"] <- 1 ## missing or unrecognized exponent: multiply by 1
stormdata$PROPDMGEXPENSE <- stormdata$n_PROPDMG * stormdata$n_PROPDMGEXP
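The exponent conversion can also be sketched as a small lookup on toy data. The helper `exp_to_multiplier` and the sample values are hypothetical. Note one deliberate difference, which is an assumption of this sketch: every code other than K, M, or B (including digits and blanks) is treated as a multiplier of 1, whereas the gsub approach above keeps digit codes as numeric multipliers.

```r
# Translate exponent codes (K, M, B; case-insensitive) into multipliers.
# Assumption: anything else, including blanks and digits, counts as 1.
exp_to_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(as.character(code))])
  mult[is.na(mult)] <- 1
  mult
}

toy <- data.frame(
  DMG = c(2.5, 10, 1, 7),
  EXP = c("K", "m", "B", ""),
  stringsAsFactors = FALSE
)
toy$EXPENSE <- toy$DMG * exp_to_multiplier(toy$EXP)
print(toy)
```

A named lookup vector keeps the mapping in one place and avoids turning the codes into digit strings first.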
The data are aggregated by the cleaned-up event column. The impact on population health is calculated by summing fatalities and injuries (referred to as sumCasualties). For economic impact, crop and property damage expenses are summed (referred to as sumEconomics).
sumCasualties <- aggregate(FATALITIES + INJURIES ~ n_EVTYPE, stormdata, sum, na.rm = TRUE)
colnames(sumCasualties) <- c("event_type", "casualties")
sumEconomics <- aggregate(CROPDMGEXPENSE + PROPDMGEXPENSE ~ n_EVTYPE, stormdata, sum, na.rm = TRUE)
colnames(sumEconomics) <- c("event_type", "expenses")
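As a worked example of this aggregation, the sketch below applies the same formula interface to a made-up four-row data frame (the values are invented for illustration). It also shows a detail worth keeping in mind: with the default `na.action = na.omit`, a row whose summed expression is NA is dropped before `sum` is called, so `na.rm = TRUE` does not recover it.

```r
toy <- data.frame(
  n_EVTYPE   = c("Tornado", "Flood", "Tornado", "Flood"),
  FATALITIES = c(2, 5, 1, 0),
  INJURIES   = c(10, NA, 4, 3)
)

# Row 2 has INJURIES = NA, so FATALITIES + INJURIES is NA and the
# whole row (including its 5 fatalities) is omitted from the totals
toy_casualties <- aggregate(FATALITIES + INJURIES ~ n_EVTYPE, toy, sum, na.rm = TRUE)
colnames(toy_casualties) <- c("event_type", "casualties")

# Sort so the most harmful event type comes first
toy_casualties <- toy_casualties[order(toy_casualties$casualties, decreasing = TRUE), ]
print(toy_casualties)
```

Replacing missing values before aggregating, as done for the damage columns, avoids silently losing such rows.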
Tornadoes are by far the most harmful of all event types, as measured by total casualties. High Wind, Excessive Heat, Flood and Lightning round out the top 5.
barplot(head(sumCasualties[order(sumCasualties$casualties, decreasing = TRUE), "casualties"], 5), names.arg = head(sumCasualties[order(sumCasualties$casualties, decreasing = TRUE), "event_type"], 5), cex.names = 0.5, main = "Top 5 Event Types by Total Casualties", xlab = "Event Type", ylab = "Casualties (Deaths + Injuries)", col = "red")
Floods cause by far the highest economic costs of all weather events, as measured by total crop and property damage. Hurricanes, tornadoes and river floods round out the top 5.
barplot(head(sumEconomics[order(sumEconomics$expenses, decreasing = TRUE), "expenses"]/1000000000, 5), names.arg = head(sumEconomics[order(sumEconomics$expenses, decreasing = TRUE), "event_type"], 5), cex.names = 0.5, main = "Top 5 Event Types by Total Economic Impact", xlab = "Event Type", ylab = "$ Billions (Property + Crop Damage)", col = "blue")