Siew Choo
date: Friday, March 20, 2015
For this project, we were presented with information from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries and property damage. The events were recorded from 1950 to November 2011.
As storms and other severe weather events can result in loss of lives and damage to property, preventing such outcomes to the extent possible is a key concern. With this motivation, we have explored the NOAA Storm Database in conjunction with the following related documentation:
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
to aim to identify the most hazardous weather events in terms of (1) the deaths and injuries and (2) the economic consequences they brought about.
This section follows and describes the steps I undertook to get the data ready to support analysis.
1) Read in the database file
2) Extract data relevant to the two questions
3) Consolidate the event type variable (ie. EVTYPE)
4) Prepare the plotting data.
The following libraries and functions are required.
## Load and attach required external packages.
library(ggplot2)
library(reshape2)
## Helper function to calculate the damage value.
getDamageValue <- function(value=0, exp=1) {
result <- 0
value <- as.numeric(value)
if (is.nan(value) | is.na(value) | value <= 0)
result <- 0
else {
multiplicand <- 1
switch (toupper(exp),
"K" = multiplicand <- 1000,
"M" = multiplicand <- 1000000,
"B" = multiplicand <- 1000000000
)
result <- value * multiplicand
}
return(result)
}
## Helper function to remap and consolidate the related event types.
if (!exists("remapEVTYPE", mode="function"))
source("remapEVTYPE.R")
The storm database file, repdata-data-StormData.csv.bz2, is assumed to be download and placed in the same directory as this file. It is a compressed CSV with 902297 rows and 37 variables.
## Given that the file is fairly large, check that the data has not already been loaded before reading it in.
if (!exists("dataDf"))
dataDf <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))
For Question 1,
## Select only columns relevant to Question 1.
q1PlotDataDf <- dataDf[, c("EVTYPE", "FATALITIES", "INJURIES")]
## Convert the character FATALITIES and INJURIES columns to numeric ones.
q1PlotDataDf$FATALITIES <- as.numeric(q1PlotDataDf$FATALITIES)
q1PlotDataDf$INJURIES <- as.numeric(q1PlotDataDf$INJURIES)
## If both fatalities and injuries are zeroes, then we are not interest to include the event in our plot.
q1PlotDataDf <- subset(q1PlotDataDf, FATALITIES + INJURIES > 0)
For Question 2,
## Select only columns relevant to Question 2.
q2PlotDataDf <- dataDf[, c("EVTYPE", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
## Convert the character PROPDMG and CROPDMG columns to numeric ones.
q2PlotDataDf$PROPDMG <- as.numeric(q2PlotDataDf$PROPDMG)
q2PlotDataDf$CROPDMG <- as.numeric(q2PlotDataDf$CROPDMG)
## If both PROPDMG and CROPDMG are zeroes, then we are not interest to include the event in our plot.
q2PlotDataDf <- subset(q2PlotDataDf, PROPDMG + CROPDMG > 0)
This step essentially calls for various related event types to be remapped and consolidated according to the information laid out in Section 7 of the NOOA Storm Data Documentation.
For Question 1,
## To reduce discrepancies brought on by typos,
## - Remove the leading and trailing whitespaces, then convert EVTYPE to uppercase.
q1PlotDataDf$EVTYPE <- toupper(gsub("^\\s+|\\s+$", "", q1PlotDataDf$EVTYPE))
## - Remove extra whitespaces within the EVTYPE.
q1PlotDataDf$EVTYPE <- gsub("[\\s]+", " ", q1PlotDataDf$EVTYPE)
## Call remapEVTYPE() to consolidate the different related EVTYPEs.
q1PlotDataDf <- remapEVTYPE(q1PlotDataDf)
For Question 2,
## To reduce discrepancies brought on by typos,
## - Remove the leading and trailing whitespaces, then convert EVTYPE to uppercase.
q2PlotDataDf$EVTYPE <- toupper(gsub("^\\s+|\\s+$", "", q2PlotDataDf$EVTYPE))
## - Remove extra whitespaces within the EVTYPE.
q2PlotDataDf$EVTYPE <- gsub("[\\s]+", " ", q2PlotDataDf$EVTYPE)
## Call remapEVTYPE() to consolidate the different related EVTYPEs.
q2PlotDataDf <- remapEVTYPE(q2PlotDataDf)
It is worth noting at this point that EVTYPE is now a character vector, not a factor as a result of the string manipulations above.
To get the data ready to support plotting, we did the following.
4.1) For each event, sum the pertinent values.
At the end of this step, there should be 1 row for each event type.
For Question 1,
## Aggregate the data by summing the values for *FATALITIES* and *INJURIES*.
dfMelt <- melt(q1PlotDataDf, id="EVTYPE")
q1PlotDataDf <- dcast(dfMelt, EVTYPE~variable, sum)
q1PlotDataDf$TOTAL <- rowSums(q1PlotDataDf[, c("FATALITIES", "INJURIES")])
## Reshape the dataset so that it becomes long and narrow.
## The original intention was to have ggplot facet on the *statistics* field.
q1PlotDataDf <- melt(q1PlotDataDf, id="EVTYPE", measure.vars=c("FATALITIES", "INJURIES", "TOTAL"))
colnames(q1PlotDataDf) <- c("EVTYPE", "statistics", "value")
For Question 2,
## Damage value is spread across two fields PROPDMG (value) AND PROPDMGEXP (magnitude).
## PROPDMGEXP: K for thousand, M for million, B for billion
## Compute the true damage values(PROPDMG=250, PROPDMGEXP="k" = 250000) and store that instead.
idx <- which(q2PlotDataDf$PROPDMG > 0)
for (i in idx) {
q2PlotDataDf$PROPDMG[i] <- getDamageValue(q2PlotDataDf$PROPDMG[i], q2PlotDataDf$PROPDMGEXP[i])
}
## Damage value is spread across two fields CROPDMG (value) AND CROPDMGEXP (magnitude).
## Likewise, compute the true damage value and store that.
idx <- which(q2PlotDataDf$CROPDMG > 0)
for (i in idx) {
q2PlotDataDf$CROPDMG[i] <- getDamageValue(q2PlotDataDf$CROPDMG[i], q2PlotDataDf$CROPDMGEXP[i])
}
## Delete the 2 EXP columns by assigning NULL to them since they are not needed.
q2PlotDataDf$PROPDMGEXP <- NULL
q2PlotDataDf$CROPDMGEXP <- NULL
## Aggregate the data by summing the values for *PROPDMG* and *CROPDMG*.
dfMelt <- melt(q2PlotDataDf, id="EVTYPE")
q2PlotDataDf <- dcast(dfMelt, EVTYPE~variable, sum)
q2PlotDataDf$TOTAL <- rowSums(q2PlotDataDf[, c("PROPDMG", "CROPDMG")])
## Reshape the dataset so that it becomes long and narrow.
## The original intention was to have ggplot facet on the *statistics* field.
q2PlotDataDf <- melt(q2PlotDataDf, id="EVTYPE", measure.vars=c("PROPDMG", "CROPDMG", "TOTAL"))
colnames(q2PlotDataDf) <- c("EVTYPE", "statistics", "value")
## convert the damage values to billions of USD for more user-friendly y-axis labels.
q2PlotDataDf$value <- q2PlotDataDf$value/1000000
4.2 Separate datasets to support facet plot making.
For Question 1,
## Prepare individual datasets to support production of the Q1 suite of plots by subsetting.
plotFatalitiesDf <- q1PlotDataDf[q1PlotDataDf$statistics=="FATALITIES",]
plotInjuriesDf <- q1PlotDataDf[q1PlotDataDf$statistics=="INJURIES",]
plotTotalDf <- q1PlotDataDf[q1PlotDataDf$statistics=="TOTAL",]
For Question 2,
## Prepare the plotting data for the PROPDMG PLOT by subsetting and ordering the rows.
plotPropDamageDf <- q2PlotDataDf[q2PlotDataDf$statistics=="PROPDMG",]
plotCropDamageDf <- q2PlotDataDf[q2PlotDataDf$statistics=="CROPDMG",]
plotTotalDamageDf <- q2PlotDataDf[q2PlotDataDf$statistics=="TOTAL",]
4.3 Massage data to chart only the top 10 events in descending order
For Question 1,
## Sort the datasets so that events with the highest death, injury and overall values are at the top.
fatalitiesIdx <- order(plotFatalitiesDf$value, decreasing=TRUE)
injuriesIdx <- order(plotInjuriesDf$value, decreasing=TRUE)
totalIdx <- order(plotTotalDf$value, decreasing=TRUE)
## Retain only rows for the top 10 event types.
plotFatalitiesDf <- plotFatalitiesDf[fatalitiesIdx[c(1:10)],]
plotInjuriesDf <- plotInjuriesDf[injuriesIdx[c(1:10)],]
plotTotalDf <- plotTotalDf[totalIdx[c(1:10)],]
## Explicitly reorder the factor levels based on the fatality/injuries/total numbers, so that the events will be charted in descending order. Otherwise, the plot gets ordered according to the factor levels (in alphabetical order).
plotFatalitiesDf$EVTYPE <- factor(plotFatalitiesDf$EVTYPE,
levels=plotFatalitiesDf$EVTYPE[order(plotFatalitiesDf$value, decreasing=TRUE)])
plotInjuriesDf$EVTYPE <- factor(plotInjuriesDf$EVTYPE,
levels=plotInjuriesDf$EVTYPE[order(plotInjuriesDf$value, decreasing=TRUE)])
plotTotalDf$EVTYPE <- factor(plotTotalDf$EVTYPE,
levels=plotTotalDf$EVTYPE[order(plotTotalDf$value, decreasing=TRUE)])
For Question 2,
## Sort the datasets so that events with the highest property, crop and overall damages are at the top.
propDamageIdx <- order(plotPropDamageDf$value, decreasing=TRUE)
cropDamageIdx <- order(plotCropDamageDf$value, decreasing=TRUE)
totalDamageIdx <- order(plotTotalDamageDf$value, decreasing=TRUE)
## Retain only the rows for the top 10 event types.
plotPropDamageDf <- plotPropDamageDf[propDamageIdx[c(1:10)],]
plotCropDamageDf <- plotCropDamageDf[cropDamageIdx[c(1:10)],]
plotTotalDamageDf <- plotTotalDamageDf[totalDamageIdx[c(1:10)],]
## Explicitly reorder the factor levels based on the property/crop/total damage values, so that the events will be charted in descending order. Otherwise, the plot gets ordered according to the factor levels (in alphabetical order).
plotPropDamageDf$EVTYPE <- factor(plotPropDamageDf$EVTYPE,
levels=plotPropDamageDf$EVTYPE[order(plotPropDamageDf$value, decreasing=TRUE)])
plotCropDamageDf$EVTYPE <- factor(plotCropDamageDf$EVTYPE,
levels=plotCropDamageDf$EVTYPE[order(plotCropDamageDf$value, decreasing=TRUE)])
plotTotalDamageDf$EVTYPE <- factor(plotTotalDamageDf$EVTYPE,
levels=plotTotalDamageDf$EVTYPE[order(plotTotalDamageDf$value, decreasing=TRUE)])
## Calculate the y-axis range upper limit.
yUpperLimit <- max(ceiling(plotFatalitiesDf$value/1000) * 1000)
## Create the FATALITIES PLOT and annotate it.
fatalitiesPlot <- ggplot(data=plotFatalitiesDf,
aes(x=EVTYPE, y=value)) +
geom_bar(fill="red", stat="identity") +
ylim(c(0, yUpperLimit)) +
labs(title=paste("Top 10 Death-Causing Events", "(United States)", sep="\n")) +
labs(x="Event Type", y="Number of Deaths") +
theme(axis.text.x = element_text(angle=30, hjust=1))
print(fatalitiesPlot)
## Calculate the y-axis range upper limit.
yUpperLimit <- max(ceiling(plotInjuriesDf$value/10000) * 10000)
## Create the INJURIES PLOT and annotate it.
injuriesPlot <- ggplot(data=plotInjuriesDf,
aes(x=EVTYPE, y=value)) +
geom_bar(fill="hotpink", stat="identity") +
ylim(c(0, yUpperLimit)) +
labs(title=paste("Top 10 Injury-Causing Events", "(United States)", sep="\n")) +
labs(x="Event Type", y="Number of Injuries") +
theme(axis.text.x = element_text(angle=30, hjust=1))
print(injuriesPlot)
## Calculate the y-axis range upper limit.
yUpperLimit <- max(ceiling(plotTotalDf$value/10000) * 10000)
## Create the combined fatalities and injuries plot and annotate it.
totalPlot <- ggplot(data=plotTotalDf,
aes(x=EVTYPE, y=value)) +
geom_bar(fill="deeppink", stat="identity") +
ylim(c(0, yUpperLimit)) +
labs(title=paste("Top 10 Death and Injury Causing Events", "(United States)", sep="\n")) +
labs(x="Event Type", y="Number of Deaths and Injuries") +
theme(axis.text.x = element_text(angle=30, hjust=1))
print(totalPlot)
Question 1
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Based on the charts, the 3 top weather events causing the greatest death and injury tolls are:
| Harmful | Fatality | Injury | Fatality & Injury |
|---|---|---|---|
| Most | TORNADO | TORNADO | TORNADO |
| : | EXCESSIVE HEAT | MARINE THUNDERSTORM WIND | EXCESSIVE HEAT |
| Least | FLASH FLOOD | FLOOD | MARINE THUNDERSTORM WIND |
Hence, tornado emerges as the single most harmful weather event with respect to population health in the United States, both in terms of loss of lives and injuries.
yUpperLimit <- max(ceiling(plotPropDamageDf$value))
## Create the PROPDMG PLOT and annotate it.
propDamagePlot <- ggplot(data=plotPropDamageDf,
aes(x=EVTYPE, y=value)) +
geom_bar(fill="blue", stat="identity") +
ylim(c(0, yUpperLimit)) +
labs(title=paste("Top 10 Property Damaging Events, 1950-2011", "(United States)", sep="\n")) +
labs(x="Event Type", y="Cost in Billions of Dollars") +
theme(axis.text.x = element_text(angle=30, hjust=1))
print(propDamagePlot)
## Create the CROPDMG PLOT and annotate it.
cropDamagePlot <- ggplot(data=plotCropDamageDf,
aes(x=EVTYPE, y=value)) +
geom_bar(fill="green", stat="identity") +
ylim(c(0, 150000)) +
labs(title=paste("Top 10 Crop Damaging Events, 1950-2011", "(United States)", sep="\n")) +
labs(x="Event Type", y="Cost in Billions of Dollars") +
theme(axis.text.x = element_text(angle=30, hjust=1))
print(cropDamagePlot)
## Calculate the y-axis range upper limit.
yUpperLimit <- max(ceiling(plotTotalDamageDf$value))
## Create the combined PROPDMG and CROPDMG plot and annotate it.
totalDamagePlot <- ggplot(data=plotTotalDamageDf,
aes(x=EVTYPE, y=value)) +
geom_bar(fill="cyan", stat="identity") +
ylim(c(0, yUpperLimit)) +
labs(title=paste("Top 10 Property and Crop Damaging Events, 1950-2011", "(United States)", sep="\n")) +
labs(x="Event Type", y="Cost in Billions of Dollars") +
theme(axis.text.x = element_text(angle=30, hjust=1))
print(totalDamagePlot)
Question 2
Across the United States, which types of events have the greatest economic consequences?
Based on the charts, the 3 top weather events causing the greatest damages to property and crop are:
| Damaging | Property | Crop | Property & Crop |
|---|---|---|---|
| Most | FLOOD | DROUGHT | FLOOD |
| : | HURRICANE/TYPHOON | FROST/FREEZE | HURRICANE/TYPHOON |
| Least | TORNADO | FLASH FLOOD | TORNADO |
In terms of economic consequences, flood emerges as the overall most damaging weather event across the United States, in terms of dollar value.