Reproducible Research: Peer Assessment 2 (NOAA Storm Data Analysis)

Siew Choo
date: Friday, March 20, 2015

Synopsis

For this project, we were presented with information from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries and property damage. The events were recorded from 1950 to November 2011.

As storms and other severe weather events can result in loss of lives and damage to property, preventing such outcomes to the extent possible is a key concern. With this motivation, we have explored the NOAA Storm Database in conjunction with the following related documentation:

National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ

to aim to identify the most hazardous weather events in terms of (1) the deaths and injuries and (2) the economic consequences they brought about.

Data Processing

This section follows and describes the steps I undertook to get the data ready to support analysis.
1) Read in the database file
2) Extract data relevant to the two questions
3) Consolidate the event type variable (ie. EVTYPE)
4) Prepare the plotting data.

The following libraries and functions are required.

## Load and attach required external packages.
library(ggplot2)
library(reshape2)
  
## Helper function to calculate the damage value.
getDamageValue <- function(value=0, exp=1) {
    result <- 0
    
    value <- as.numeric(value)
    if (is.nan(value) | is.na(value) | value <= 0)
        result <- 0
    else {
        multiplicand <- 1
        
        switch (toupper(exp),
                "K" = multiplicand <- 1000,
                "M" = multiplicand <- 1000000,
                "B" = multiplicand <- 1000000000
        )
        
        result <- value * multiplicand
    }
    
    return(result)
}  

## Helper function to remap and consolidate the related event types.
if (!exists("remapEVTYPE", mode="function"))
    source("remapEVTYPE.R")

1) Read in the database file

The storm database file, repdata-data-StormData.csv.bz2, is assumed to be download and placed in the same directory as this file. It is a compressed CSV with 902297 rows and 37 variables.

## Given that the file is fairly large, check that the data has not already been loaded before reading it in.
if (!exists("dataDf")) 
    dataDf <- read.csv(bzfile("repdata-data-StormData.csv.bz2"))

2) Extract data relevant to the two questions

For Question 1,

## Select only columns relevant to Question 1.
q1PlotDataDf <- dataDf[, c("EVTYPE", "FATALITIES", "INJURIES")]

## Convert the character FATALITIES and INJURIES columns to numeric ones.
q1PlotDataDf$FATALITIES <- as.numeric(q1PlotDataDf$FATALITIES)
q1PlotDataDf$INJURIES <- as.numeric(q1PlotDataDf$INJURIES)

## If both fatalities and injuries are zeroes, then we are not interest to include the event in our plot.
q1PlotDataDf <- subset(q1PlotDataDf, FATALITIES + INJURIES > 0)

For Question 2,

## Select only columns relevant to Question 2.
q2PlotDataDf <- dataDf[, c("EVTYPE", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]

## Convert the character PROPDMG and CROPDMG columns to numeric ones.
q2PlotDataDf$PROPDMG <- as.numeric(q2PlotDataDf$PROPDMG)
q2PlotDataDf$CROPDMG <- as.numeric(q2PlotDataDf$CROPDMG)

## If both PROPDMG and CROPDMG are zeroes, then we are not interest to include the event in our plot.
q2PlotDataDf <- subset(q2PlotDataDf, PROPDMG + CROPDMG > 0)

3) Consolidate the event type variable (ie. EVTYPE)

This step essentially calls for various related event types to be remapped and consolidated according to the information laid out in Section 7 of the NOOA Storm Data Documentation.

For Question 1,

## To reduce discrepancies brought on by typos, 
## - Remove the leading and trailing whitespaces, then convert EVTYPE to uppercase.
q1PlotDataDf$EVTYPE <- toupper(gsub("^\\s+|\\s+$", "", q1PlotDataDf$EVTYPE))
## - Remove extra whitespaces within the EVTYPE.
q1PlotDataDf$EVTYPE <- gsub("[\\s]+", " ", q1PlotDataDf$EVTYPE)

## Call remapEVTYPE() to consolidate the different related EVTYPEs. 
q1PlotDataDf <- remapEVTYPE(q1PlotDataDf)

For Question 2,

## To reduce discrepancies brought on by typos, 
## - Remove the leading and trailing whitespaces, then convert EVTYPE to uppercase.
q2PlotDataDf$EVTYPE <- toupper(gsub("^\\s+|\\s+$", "", q2PlotDataDf$EVTYPE))
## - Remove extra whitespaces within the EVTYPE.
q2PlotDataDf$EVTYPE <- gsub("[\\s]+", " ", q2PlotDataDf$EVTYPE)

## Call remapEVTYPE() to consolidate the different related EVTYPEs. 
q2PlotDataDf <- remapEVTYPE(q2PlotDataDf)

It is worth noting at this point that EVTYPE is now a character vector, not a factor as a result of the string manipulations above.

4) Prepare the plotting data

To get the data ready to support plotting, we did the following.

4.1) For each event, sum the pertinent values.
At the end of this step, there should be 1 row for each event type.

For Question 1,

## Aggregate the data by summing the values for *FATALITIES* and *INJURIES*.
dfMelt <- melt(q1PlotDataDf, id="EVTYPE")
q1PlotDataDf <- dcast(dfMelt, EVTYPE~variable, sum)
q1PlotDataDf$TOTAL <- rowSums(q1PlotDataDf[, c("FATALITIES", "INJURIES")])

## Reshape the dataset so that it becomes long and narrow.
## The original intention was to have ggplot facet on the *statistics* field.
q1PlotDataDf <- melt(q1PlotDataDf, id="EVTYPE", measure.vars=c("FATALITIES", "INJURIES", "TOTAL"))
colnames(q1PlotDataDf) <- c("EVTYPE", "statistics", "value")

For Question 2,

## Damage value is spread across two fields PROPDMG (value) AND PROPDMGEXP (magnitude).
## PROPDMGEXP: K for thousand, M for million, B for billion
## Compute the true damage values(PROPDMG=250, PROPDMGEXP="k" = 250000) and store that instead.
idx <- which(q2PlotDataDf$PROPDMG > 0)
for (i in idx) {
    q2PlotDataDf$PROPDMG[i] <- getDamageValue(q2PlotDataDf$PROPDMG[i], q2PlotDataDf$PROPDMGEXP[i])
}

## Damage value is spread across two fields CROPDMG (value) AND CROPDMGEXP (magnitude).
## Likewise, compute the true damage value and store that.
idx <- which(q2PlotDataDf$CROPDMG > 0)
for (i in idx) {
    q2PlotDataDf$CROPDMG[i] <- getDamageValue(q2PlotDataDf$CROPDMG[i], q2PlotDataDf$CROPDMGEXP[i])
}

## Delete the 2 EXP columns by assigning NULL to them since they are not needed.
q2PlotDataDf$PROPDMGEXP <- NULL
q2PlotDataDf$CROPDMGEXP <- NULL

## Aggregate the data by summing the values for *PROPDMG* and *CROPDMG*.
dfMelt <- melt(q2PlotDataDf, id="EVTYPE")
q2PlotDataDf <- dcast(dfMelt, EVTYPE~variable, sum)
q2PlotDataDf$TOTAL <- rowSums(q2PlotDataDf[, c("PROPDMG", "CROPDMG")])

## Reshape the dataset so that it becomes long and narrow.
## The original intention was to have ggplot facet on the *statistics* field.
q2PlotDataDf <- melt(q2PlotDataDf, id="EVTYPE", measure.vars=c("PROPDMG", "CROPDMG", "TOTAL"))
colnames(q2PlotDataDf) <- c("EVTYPE", "statistics", "value")

## convert the damage values to billions of USD for more user-friendly y-axis labels.
q2PlotDataDf$value <- q2PlotDataDf$value/1000000

4.2 Separate datasets to support facet plot making.

For Question 1,

## Prepare individual datasets to support production of the Q1 suite of plots by subsetting.
plotFatalitiesDf <- q1PlotDataDf[q1PlotDataDf$statistics=="FATALITIES",]
plotInjuriesDf <- q1PlotDataDf[q1PlotDataDf$statistics=="INJURIES",]
plotTotalDf <- q1PlotDataDf[q1PlotDataDf$statistics=="TOTAL",]

For Question 2,

## Prepare the plotting data for the PROPDMG PLOT by subsetting and ordering the rows.
plotPropDamageDf <- q2PlotDataDf[q2PlotDataDf$statistics=="PROPDMG",]
plotCropDamageDf <- q2PlotDataDf[q2PlotDataDf$statistics=="CROPDMG",]
plotTotalDamageDf <- q2PlotDataDf[q2PlotDataDf$statistics=="TOTAL",]

4.3 Massage data to chart only the top 10 events in descending order

For Question 1,

## Sort the datasets so that events with the highest death, injury and overall values are at the top. 
fatalitiesIdx <- order(plotFatalitiesDf$value, decreasing=TRUE)
injuriesIdx <- order(plotInjuriesDf$value, decreasing=TRUE)
totalIdx <- order(plotTotalDf$value, decreasing=TRUE)

## Retain only rows for the top 10 event types.
plotFatalitiesDf <- plotFatalitiesDf[fatalitiesIdx[c(1:10)],]
plotInjuriesDf <- plotInjuriesDf[injuriesIdx[c(1:10)],]
plotTotalDf <- plotTotalDf[totalIdx[c(1:10)],]

## Explicitly reorder the factor levels based on the fatality/injuries/total numbers, so that the events will be charted in descending order. Otherwise, the plot gets ordered according to the factor levels (in alphabetical order).
plotFatalitiesDf$EVTYPE <- factor(plotFatalitiesDf$EVTYPE, 
                           levels=plotFatalitiesDf$EVTYPE[order(plotFatalitiesDf$value, decreasing=TRUE)])
plotInjuriesDf$EVTYPE <- factor(plotInjuriesDf$EVTYPE, 
                         levels=plotInjuriesDf$EVTYPE[order(plotInjuriesDf$value, decreasing=TRUE)])
plotTotalDf$EVTYPE <- factor(plotTotalDf$EVTYPE, 
                      levels=plotTotalDf$EVTYPE[order(plotTotalDf$value, decreasing=TRUE)])

For Question 2,

## Sort the datasets so that events with the highest property, crop and overall damages are at the top. 
propDamageIdx <- order(plotPropDamageDf$value, decreasing=TRUE)
cropDamageIdx <- order(plotCropDamageDf$value, decreasing=TRUE)
totalDamageIdx <- order(plotTotalDamageDf$value, decreasing=TRUE)

## Retain only the rows for the top 10 event types.
plotPropDamageDf <- plotPropDamageDf[propDamageIdx[c(1:10)],]
plotCropDamageDf <- plotCropDamageDf[cropDamageIdx[c(1:10)],]
plotTotalDamageDf <- plotTotalDamageDf[totalDamageIdx[c(1:10)],]

## Explicitly reorder the factor levels based on the property/crop/total damage values, so that the events will be charted in descending order. Otherwise, the plot gets ordered according to the factor levels (in alphabetical order).
plotPropDamageDf$EVTYPE <- factor(plotPropDamageDf$EVTYPE, 
                           levels=plotPropDamageDf$EVTYPE[order(plotPropDamageDf$value, decreasing=TRUE)])
plotCropDamageDf$EVTYPE <- factor(plotCropDamageDf$EVTYPE, 
                           levels=plotCropDamageDf$EVTYPE[order(plotCropDamageDf$value, decreasing=TRUE)])
plotTotalDamageDf$EVTYPE <- factor(plotTotalDamageDf$EVTYPE, 
                            levels=plotTotalDamageDf$EVTYPE[order(plotTotalDamageDf$value, decreasing=TRUE)])

Result

## Calculate the y-axis range upper limit.
yUpperLimit <- max(ceiling(plotFatalitiesDf$value/1000) * 1000)

## Create the FATALITIES PLOT and annotate it.
fatalitiesPlot <- ggplot(data=plotFatalitiesDf, 
                         aes(x=EVTYPE, y=value)) + 
                         geom_bar(fill="red", stat="identity") +
                         ylim(c(0, yUpperLimit)) +
                         labs(title=paste("Top 10 Death-Causing Events", "(United States)", sep="\n")) +
                         labs(x="Event Type", y="Number of Deaths") +
                         theme(axis.text.x = element_text(angle=30, hjust=1))

print(fatalitiesPlot)

## Calculate the y-axis range upper limit.
yUpperLimit <- max(ceiling(plotInjuriesDf$value/10000) * 10000)

## Create the INJURIES PLOT and annotate it.
injuriesPlot <- ggplot(data=plotInjuriesDf, 
                       aes(x=EVTYPE, y=value)) + 
                       geom_bar(fill="hotpink", stat="identity") +
                       ylim(c(0, yUpperLimit)) +
                       labs(title=paste("Top 10 Injury-Causing Events", "(United States)", sep="\n")) +
                       labs(x="Event Type", y="Number of Injuries") +
                       theme(axis.text.x = element_text(angle=30, hjust=1))

print(injuriesPlot)

## Calculate the y-axis range upper limit.
yUpperLimit <- max(ceiling(plotTotalDf$value/10000) * 10000)

## Create the combined fatalities and injuries plot and annotate it.
totalPlot <- ggplot(data=plotTotalDf, 
                    aes(x=EVTYPE, y=value)) + 
                    geom_bar(fill="deeppink", stat="identity") + 
                    ylim(c(0, yUpperLimit)) +
                    labs(title=paste("Top 10 Death and Injury Causing Events", "(United States)", sep="\n")) + 
                    labs(x="Event Type", y="Number of Deaths and Injuries") +
                    theme(axis.text.x = element_text(angle=30, hjust=1))

print(totalPlot)

Question 1
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Based on the charts, the 3 top weather events causing the greatest death and injury tolls are:

Harmful	Fatality	Injury	Fatality & Injury
Most	TORNADO	TORNADO	TORNADO
:	EXCESSIVE HEAT	MARINE THUNDERSTORM WIND	EXCESSIVE HEAT
Least	FLASH FLOOD	FLOOD	MARINE THUNDERSTORM WIND

Hence, tornado emerges as the single most harmful weather event with respect to population health in the United States, both in terms of loss of lives and injuries.

yUpperLimit <- max(ceiling(plotPropDamageDf$value))

## Create the PROPDMG PLOT and annotate it.
propDamagePlot <- ggplot(data=plotPropDamageDf, 
                         aes(x=EVTYPE, y=value)) + 
                         geom_bar(fill="blue", stat="identity") +
                         ylim(c(0, yUpperLimit)) +
                         labs(title=paste("Top 10 Property Damaging Events, 1950-2011", "(United States)", sep="\n")) + 
                         labs(x="Event Type", y="Cost in Billions of Dollars") +
                         theme(axis.text.x = element_text(angle=30, hjust=1))

print(propDamagePlot)

## Create the CROPDMG PLOT and annotate it.
cropDamagePlot <- ggplot(data=plotCropDamageDf, 
                       aes(x=EVTYPE, y=value)) + 
                       geom_bar(fill="green", stat="identity") +
                       ylim(c(0, 150000)) +
                       labs(title=paste("Top 10 Crop Damaging Events, 1950-2011", "(United States)", sep="\n")) + 
                       labs(x="Event Type", y="Cost in Billions of Dollars") +
                       theme(axis.text.x = element_text(angle=30, hjust=1))

print(cropDamagePlot)

## Calculate the y-axis range upper limit.
yUpperLimit <- max(ceiling(plotTotalDamageDf$value))

## Create the combined PROPDMG and CROPDMG plot and annotate it.
totalDamagePlot <- ggplot(data=plotTotalDamageDf, 
                    aes(x=EVTYPE, y=value)) + 
                    geom_bar(fill="cyan", stat="identity") + 
                    ylim(c(0, yUpperLimit)) +
                    labs(title=paste("Top 10 Property and Crop Damaging Events, 1950-2011", "(United States)", sep="\n")) + 
                    labs(x="Event Type", y="Cost in Billions of Dollars") +
                    theme(axis.text.x = element_text(angle=30, hjust=1))

print(totalDamagePlot)

Question 2
Across the United States, which types of events have the greatest economic consequences?

Based on the charts, the 3 top weather events causing the greatest damages to property and crop are:

Damaging	Property	Crop	Property & Crop
Most	FLOOD	DROUGHT	FLOOD
:	HURRICANE/TYPHOON	FROST/FREEZE	HURRICANE/TYPHOON
Least	TORNADO	FLASH FLOOD	TORNADO

In terms of economic consequences, flood emerges as the overall most damaging weather event across the United States, in terms of dollar value.