This report analyzes the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, and shows the adverse consequences of severe weather events by category in the U.S. from 1950 to November 2011.
# Initialize global settings and load the libraries silently.
suppressPackageStartupMessages(library(knitr))
suppressPackageStartupMessages(library(plyr))
suppressPackageStartupMessages(library(reshape2))
suppressPackageStartupMessages(library(ggplot2))
# Echo the code of every chunk (this is already knitr's default).
opts_chunk$set(echo = TRUE)
The analysis starts from the raw bz2-compressed CSV file. Only the relevant columns are extracted: event type, fatalities, injuries, and property and crop damage, each damage value being stored together with a separate exponent code. The consequences of each event type are calculated by summing all the casualties and damage recorded for that type.
# Read the bz2 file directly with read.csv().
d <- read.csv("repdata_data_StormData.csv.bz2")
d <- d[, c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")]
# Map the exponent codes to powers of ten ("H" -> 2, "K" -> 3, "M" -> 6, "B" -> 9).
# Blank and symbol codes become 0; digit codes are left unchanged.
d$PROPDMGEXP <- mapvalues(d$PROPDMGEXP,
    from = c("", "-", "?", "+", "B", "H", "h", "K", "M", "m"),
    to = c("0", "0", "0", "0", "9", "2", "2", "3", "6", "6"))
d$CROPDMGEXP <- mapvalues(d$CROPDMGEXP,
    from = c("", "?", "B", "K", "k", "M", "m"),
    to = c("0", "0", "9", "3", "3", "6", "6"))
# Calculate the two types of damage in dollars: value * 10 ^ exponent.
d$PROPDMG <- d$PROPDMG * 10 ^ as.numeric(as.character(d$PROPDMGEXP))
d$CROPDMG <- d$CROPDMG * 10 ^ as.numeric(as.character(d$CROPDMGEXP))
# Sum the casualties and damage for each event type.
d <- ddply(d, .(EVTYPE), summarise,
    FATALITIES = sum(FATALITIES),
    INJURIES = sum(INJURIES),
    PROPDMG = sum(PROPDMG),
    CROPDMG = sum(CROPDMG))
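The decoding and summation steps can be checked by hand on a small table. The following is a minimal sketch in base R only (no plyr); the toy values, the lookup vector `exp.lookup` and the helper `decode.exponent` are illustrative assumptions and not part of the analysis itself.

```r
# Toy records standing in for the storm data (all values are made up).
toy <- data.frame(
  EVTYPE     = c("FLOOD", "FLOOD", "HAIL", "HAIL"),
  PROPDMG    = c(2.5, 10, 1, 3),
  PROPDMGEXP = c("M", "K", "", "5"),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from letter code to power of ten.
exp.lookup <- c("B" = 9, "M" = 6, "m" = 6, "K" = 3, "k" = 3, "H" = 2, "h" = 2)

# Hypothetical helper: turn exponent codes into numeric powers of ten.
decode.exponent <- function(code) {
  code <- as.character(code)
  power <- unname(exp.lookup[code])          # NA for codes not in the lookup
  is.digit <- grepl("^[0-9]$", code)
  power[is.digit] <- as.numeric(code[is.digit])  # digit codes are used directly
  power[is.na(power)] <- 0                   # blank or symbol codes count as 10^0
  power
}

# Damage in dollars, then the total per event type.
toy$DAMAGE <- toy$PROPDMG * 10 ^ decode.exponent(toy$PROPDMGEXP)
totals <- aggregate(DAMAGE ~ EVTYPE, data = toy, FUN = sum)
print(totals)
```

Here the two FLOOD rows contribute 2.5 * 10^6 and 10 * 10^3 dollars, and the two HAIL rows contribute 1 * 10^0 and 3 * 10^5 dollars.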
There are two questions that need to be answered:
1. Across the United States, which types of events are most harmful with respect to population health?
2. Which types of events have the greatest economic consequences?
To answer the first question, the 10 event types with the most fatalities are selected, and a stacked bar plot shows the casualties (fatalities and injuries) each of them has caused.
by.casualties <- d[, c("EVTYPE", "FATALITIES", "INJURIES")]
by.casualties <- by.casualties[order(by.casualties$FATALITIES,
    decreasing = TRUE), ]
by.casualties <- by.casualties[1:10, ]
# Get the factor levels in the right order.
by.casualties$EVTYPE <- factor(by.casualties$EVTYPE,
    levels = by.casualties$EVTYPE[order(by.casualties$FATALITIES,
        decreasing = TRUE)])
# Change the data format from wide to long for drawing the plot.
by.casualties <- melt(by.casualties, id.vars = "EVTYPE",
    variable.name = "CASUALTYTYPE",
    value.name = "COUNT")
g.casualties <- ggplot(by.casualties, aes(x = EVTYPE,
    y = COUNT,
    fill = CASUALTYTYPE))
# Stack fatalities and injuries.
g.casualties + geom_bar(stat = "identity", position = "stack") +
    scale_fill_discrete(name = "",
        breaks = c("INJURIES", "FATALITIES"),
        labels = c("Injured", "Dead")) +
    labs(x = "Event") +
    labs(y = "Casualties") +
    labs(title = "Top 10 Events by Casualties") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
As the plot shows, tornadoes are by far the most harmful events with respect to population health, in terms of both fatalities and injuries, followed by excessive heat and flash floods.
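The wide-to-long step used for the stacked bars can be illustrated on a toy table. The sketch below builds the long layout by hand in base R instead of calling `melt()`, so it is an equivalent illustration under assumed, made-up counts and not the code of the analysis.

```r
# Wide format: one row per event type, one column per casualty type
# (counts are made up for illustration).
wide <- data.frame(
  EVTYPE     = c("TORNADO", "HEAT"),
  FATALITIES = c(5, 3),
  INJURIES   = c(7, 1),
  stringsAsFactors = FALSE
)

# Long format: one row per (event type, casualty type) pair, which is
# the layout a stacked bar plot with a fill aesthetic expects.
long <- data.frame(
  EVTYPE       = rep(wide$EVTYPE, times = 2),
  CASUALTYTYPE = rep(c("FATALITIES", "INJURIES"), each = nrow(wide)),
  COUNT        = c(wide$FATALITIES, wide$INJURIES),
  stringsAsFactors = FALSE
)
print(long)

# The height of each stacked bar is the sum of COUNT per event type.
bar.height <- tapply(long$COUNT, long$EVTYPE, sum)
```

Each event type now appears once per casualty type, so the bar for an event is the stack of its FATALITIES and INJURIES rows.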
For the second question, the top 10 events are selected for each type of damage in the same way. Plots are drawn separately.
by.prop <- d[, c("EVTYPE", "PROPDMG")]
by.prop <- by.prop[order(by.prop$PROPDMG,
    decreasing = TRUE), ][1:10, ]
# Necessary for the events to be in the right order in the plot.
by.prop$EVTYPE <- factor(by.prop$EVTYPE,
    levels = by.prop$EVTYPE[order(by.prop$PROPDMG,
        decreasing = TRUE)])
# Change the unit to billions.
g.prop <- ggplot(by.prop, aes(x = EVTYPE, y = PROPDMG/10^9))
g.prop + geom_bar(stat = "identity", fill = "red") +
    labs(x = "Event") +
    labs(y = "Property Damage in Billion Dollars") +
    labs(title = "Top 10 Events by Property Damage") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
by.crop <- d[, c("EVTYPE", "CROPDMG")]
by.crop <- by.crop[order(by.crop$CROPDMG,
    decreasing = TRUE), ][1:10, ]
by.crop$EVTYPE <- factor(by.crop$EVTYPE,
    levels = by.crop$EVTYPE[order(by.crop$CROPDMG,
        decreasing = TRUE)])
g.crop <- ggplot(by.crop, aes(x = EVTYPE, y = CROPDMG/10^9))
# Use a different color.
g.crop + geom_bar(stat = "identity", fill = "blue") +
    labs(x = "Event") +
    labs(y = "Crop Damage in Billion Dollars") +
    labs(title = "Top 10 Events by Crop Damage") +
    theme(axis.text.x = element_text(angle = 45, hjust = 1))
As the two plots show, floods (too much water: very bad) cause the most property damage, while droughts (too little water: very bad too) cause the most crop damage.
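The property and crop rankings follow the same pattern: sort by one column, keep the first rows, and fix the factor levels so the plot keeps that order. A possible way to factor this out is sketched below; the helper `top.events` and the toy numbers are hypothetical and only illustrate the selection step.

```r
# Hypothetical helper: the n event types with the largest value in `column`,
# with EVTYPE turned into a factor whose levels follow the ranking.
# Assumes each EVTYPE appears only once (as after the per-event summation).
top.events <- function(data, column, n = 10) {
  ranked <- data[order(data[[column]], decreasing = TRUE), c("EVTYPE", column)]
  ranked <- head(ranked, n)
  ranked$EVTYPE <- factor(ranked$EVTYPE, levels = ranked$EVTYPE)
  ranked
}

# Toy totals per event type (made-up numbers, arbitrary units).
toy <- data.frame(
  EVTYPE  = c("FLOOD", "DROUGHT", "HAIL", "TORNADO"),
  PROPDMG = c(40, 1, 16, 5),
  CROPDMG = c(6, 14, 3, 0.5),
  stringsAsFactors = FALSE
)

top.prop <- top.events(toy, "PROPDMG", n = 2)
top.crop <- top.events(toy, "CROPDMG", n = 2)
print(top.prop)
print(top.crop)
```

With such a helper, the same call could serve both damage types, differing only in the column name passed in.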
The End.