Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and economic damages, and understanding the relationships between storm events and impacts is critical to establishing proper preventative and reactive plans.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage. The raw data consist of 37 variables and 902,297 observations from 1950 to 2011.
This research seeks to identify which storm events have the greatest health consequences and which storm events have the greatest economic consequences. To answer these questions, this analysis aggregates the data by storm event type to identify the top 10 storm events that drive fatalities, injuries, property damage, crop damage, and total economic damage.
Our analysis finds that tornados have caused the greatest number of fatalities and injuries across the U.S. from 1950-2011. Furthermore, we find that floods have caused the greatest total economic impact of all storm events totalling approximately $150 billion from 1950-2011.
The first step is to load the storm data using the following code. Setting cache=TRUE in the knitr chunk options avoids repeating the lengthy load of this large dataset each time the report is rebuilt:
stormDF <- read.csv(bzfile("repdata-data-StormData.csv.bz2"), stringsAsFactors = FALSE)
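As a quick check of this loading approach, the following minimal sketch writes a tiny bzip2-compressed CSV to a temporary file and reads it back the same way. The file contents and the names `tmp` and `toyDF` are invented for illustration and are not part of the NOAA data.

```r
# Hypothetical toy file: column names mirror the NOAA data, values are invented.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, "w")
writeLines(c("EVTYPE,FATALITIES", "TORNADO,3", "FLOOD,1"), con)
close(con)

# read.csv() accepts a connection, so the archive does not need to be unpacked first.
toyDF <- read.csv(bzfile(tmp), stringsAsFactors = FALSE)
str(toyDF)

# Remove the temporary file.
unlink(tmp)
```

With stringsAsFactors = FALSE the event type column arrives as plain character strings, which is convenient for the re-coding steps later in the analysis.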
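The objective of this research is to answer two questions about storm events in the U.S. using the NOAA database: which storm events have the greatest health consequences, and which storm events have the greatest economic consequences?
To address these two questions we can reduce the NOAA data to only the variables relevant to them. These variables are BGN_DATE (event date), EVTYPE (event type), FATALITIES, INJURIES, PROPDMG and PROPDMGEXP (property damage value and its magnitude code), and CROPDMG and CROPDMGEXP (crop damage value and its magnitude code).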
To reduce the data we can generate a new dataframe by subsetting stormDF for these eight identified variables:
stormDF2 <- subset(stormDF, select = c(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
The dataset stormDF2 now needs to be cleaned and formatted.
# Format date:
stormDF2$BGN_DATE <- as.Date(stormDF2$BGN_DATE, "%m/%d/%Y")
# Format event type as a factor:
stormDF2$EVTYPE <- as.factor(stormDF2$EVTYPE)
# Format damage valuation magnitude as characters for re-coding:
stormDF2$PROPDMGEXP <- as.character(stormDF2$PROPDMGEXP)
stormDF2$CROPDMGEXP <- as.character(stormDF2$CROPDMGEXP)
# Create numeric values to represent the coded damage valuation magnitudes
# (i.e. "h" = 100; "k" = 1,000; "m" = 1,000,000; etc.)
## re-code PROPDMGEXP variable
stormDF2$PROPDMGEXP[(stormDF2$PROPDMGEXP == "h") | (stormDF2$PROPDMGEXP == "H")] <- 100
stormDF2$PROPDMGEXP[(stormDF2$PROPDMGEXP == "k") | (stormDF2$PROPDMGEXP == "K")] <- 1000
stormDF2$PROPDMGEXP[(stormDF2$PROPDMGEXP == "m") | (stormDF2$PROPDMGEXP == "M")] <- 1000000
stormDF2$PROPDMGEXP[(stormDF2$PROPDMGEXP == "b") | (stormDF2$PROPDMGEXP == "B")] <- 1000000000
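## re-code "1" before "0": otherwise values just re-coded from "0" to 1 would be re-coded again to 10
stormDF2$PROPDMGEXP[(stormDF2$PROPDMGEXP == "1")] <- 10
stormDF2$PROPDMGEXP[(stormDF2$PROPDMGEXP == "0")] <- 1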
stormDF2$PROPDMGEXP[(stormDF2$PROPDMGEXP == "2")] <- 100
stormDF2$PROPDMGEXP[(stormDF2$PROPDMGEXP == "3")] <- 1000
stormDF2$PROPDMGEXP[(stormDF2$PROPDMGEXP == "4")] <- 10000
stormDF2$PROPDMGEXP[(stormDF2$PROPDMGEXP == "5")] <- 100000
stormDF2$PROPDMGEXP[(stormDF2$PROPDMGEXP == "6")] <- 1000000
stormDF2$PROPDMGEXP[(stormDF2$PROPDMGEXP == "7")] <- 10000000
stormDF2$PROPDMGEXP[(stormDF2$PROPDMGEXP == "8")] <- 100000000
stormDF2$PROPDMGEXP[(stormDF2$PROPDMGEXP == "") | (stormDF2$PROPDMGEXP == "?")] <- 0
stormDF2$PROPDMGEXP[(stormDF2$PROPDMGEXP == "+") | (stormDF2$PROPDMGEXP == "-")] <- 0
## re-code CROPDMGEXP variable
stormDF2$CROPDMGEXP[(stormDF2$CROPDMGEXP == "h") | (stormDF2$CROPDMGEXP == "H")] <- 100
stormDF2$CROPDMGEXP[(stormDF2$CROPDMGEXP == "k") | (stormDF2$CROPDMGEXP == "K")] <- 1000
stormDF2$CROPDMGEXP[(stormDF2$CROPDMGEXP == "m") | (stormDF2$CROPDMGEXP == "M")] <- 1000000
stormDF2$CROPDMGEXP[(stormDF2$CROPDMGEXP == "b") | (stormDF2$CROPDMGEXP == "B")] <- 1000000000
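## re-code "1" before "0": otherwise values just re-coded from "0" to 1 would be re-coded again to 10
stormDF2$CROPDMGEXP[(stormDF2$CROPDMGEXP == "1")] <- 10
stormDF2$CROPDMGEXP[(stormDF2$CROPDMGEXP == "0")] <- 1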
stormDF2$CROPDMGEXP[(stormDF2$CROPDMGEXP == "2")] <- 100
stormDF2$CROPDMGEXP[(stormDF2$CROPDMGEXP == "3")] <- 1000
stormDF2$CROPDMGEXP[(stormDF2$CROPDMGEXP == "4")] <- 10000
stormDF2$CROPDMGEXP[(stormDF2$CROPDMGEXP == "5")] <- 100000
stormDF2$CROPDMGEXP[(stormDF2$CROPDMGEXP == "6")] <- 1000000
stormDF2$CROPDMGEXP[(stormDF2$CROPDMGEXP == "7")] <- 10000000
stormDF2$CROPDMGEXP[(stormDF2$CROPDMGEXP == "8")] <- 100000000
stormDF2$CROPDMGEXP[(stormDF2$CROPDMGEXP == "") | (stormDF2$CROPDMGEXP == "?")] <- 0
stormDF2$CROPDMGEXP[(stormDF2$CROPDMGEXP == "+") | (stormDF2$CROPDMGEXP == "-")] <- 0
# convert PROPDMGEXP & CROPDMGEXP variables to integers
stormDF2$PROPDMGEXP <- as.integer(stormDF2$PROPDMGEXP)
stormDF2$CROPDMGEXP <- as.integer(stormDF2$CROPDMGEXP)
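The same mapping can also be expressed as a lookup table. The sketch below is an alternative formulation on invented toy codes, not the code used in this analysis; the names `codes`, `multiplier`, and `recoded` are hypothetical.

```r
# Toy exponent codes (invented for illustration).
codes <- c("K", "m", "", "0", "1", "B", "?", "5")

# Named lookup vector: each code maps to one multiplier, mirroring the mapping above.
multiplier <- c(h = 1e2, k = 1e3, m = 1e6, b = 1e9,
                "0" = 1, "1" = 10, "2" = 1e2, "3" = 1e3, "4" = 1e4,
                "5" = 1e5, "6" = 1e6, "7" = 1e7, "8" = 1e8)

# Look up the lower-cased codes; anything absent from the table
# ("", "?", "+", "-") comes back as NA and is set to 0.
recoded <- unname(multiplier[tolower(codes)])
recoded[is.na(recoded)] <- 0
recoded
```

Because each code is looked up exactly once, a value that has already been re-coded can never be matched again by a later rule, so the result does not depend on the order of the statements.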
At this point we have a clean and tidy data set; however, we still need to combine the separate economic consequence variables. First, we compute total property and crop damage values by multiplying the valuation and magnitude columns for both the property damage (PROPDMG * PROPDMGEXP) and crop damage (CROPDMG * CROPDMGEXP) variables. Then we add these two values to obtain an aggregated total economic damage valuation (total_property_damage + total_crop_damage).
stormDF2$total_property_damage <- stormDF2$PROPDMG * stormDF2$PROPDMGEXP
stormDF2$total_crop_damage <- stormDF2$CROPDMG * stormDF2$CROPDMGEXP
stormDF2$total_economic_damage <- stormDF2$total_property_damage + stormDF2$total_crop_damage
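To make the arithmetic concrete, here is a small worked example on two invented records, assuming the magnitude columns have already been converted to numeric multipliers as described above. The data frame `toy` and its values are hypothetical.

```r
# Invented records: property damage of 25 (x 1,000) and 2.5 (x 1,000,000);
# crop damage of 0 (x 0) and 10 (x 1,000).
toy <- data.frame(PROPDMG = c(25, 2.5), PROPDMGEXP = c(1000, 1e6),
                  CROPDMG = c(0, 10),   CROPDMGEXP = c(0, 1000))

# Same three steps as in the analysis, applied to the toy data.
toy$total_property_damage <- toy$PROPDMG * toy$PROPDMGEXP
toy$total_crop_damage <- toy$CROPDMG * toy$CROPDMGEXP
toy$total_economic_damage <- toy$total_property_damage + toy$total_crop_damage
toy$total_economic_damage
```

The first record totals $25,000 and the second $2,510,000 ($2,500,000 in property damage plus $10,000 in crop damage).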
To address the first question, we’ll summarize the dataframe so that we can create barplots. We first sum the total fatalities and injuries per event type, then sort the results and extract the top 10 events that cause the most fatalities and the top 10 events that cause the most injuries.
# install.packages("plyr") required
# install.packages("reshape") required
library(plyr)
library(reshape)
# aggregate the data by event type using ddply() to sum fatalities and injuries across storm events
healthDF <- ddply(stormDF2, .(EVTYPE), summarize, Fatalities=sum(FATALITIES), Injuries=sum(INJURIES))
# extract the top 10 fatality and injury causing events
fatalityTop10 <- healthDF[order(-healthDF$Fatalities),][1:10,1:2]
injuryTop10 <- healthDF[order(-healthDF$Injuries),][1:10,c(1,3)]
# set the ordinal levels of the events for graphing purposes
fatalityTop10 <- transform(fatalityTop10, EVTYPE = ordered(fatalityTop10$EVTYPE, levels = fatalityTop10$EVTYPE))
injuryTop10 <- transform(injuryTop10, EVTYPE = ordered(injuryTop10$EVTYPE, levels = injuryTop10$EVTYPE))
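The aggregate, sort, and re-level pattern above can be illustrated on a few invented rows. This is a minimal base-R sketch that uses aggregate() in place of ddply(), so it runs without plyr; the objects `toy`, `toyHealth`, and `toyTop` are hypothetical and a top 3 stands in for the top 10.

```r
# Invented toy data standing in for stormDF2.
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
                  FATALITIES = c(5, 2, 7, 4, 1, 0))

# Sum fatalities per event type (base-R counterpart of the ddply() call).
toyHealth <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Sort in descending order and keep the top 3 rows.
toyTop <- head(toyHealth[order(-toyHealth$FATALITIES), ], 3)

# Fix the level order so a bar chart follows the ranking rather than the alphabet.
toyTop$EVTYPE <- ordered(toyTop$EVTYPE, levels = as.character(toyTop$EVTYPE))
levels(toyTop$EVTYPE)
```

The resulting levels are TORNADO, HEAT, FLOOD, which is the order in which ggplot2 would draw the bars. Without the re-leveling step the bars would appear in alphabetical order.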
We can now generate a panel plot using ggplot2 to identify the events that have the greatest contribution to health consequences.
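# install.packages("ggplot2") required
# install.packages("scales") required
library(ggplot2)
library(scales)
# grid ships with base R and provides grid.draw(), used below to draw the combined panels
library(grid)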
fatalityPlot1 <- ggplot(fatalityTop10, aes(x = EVTYPE, y = Fatalities)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_y_continuous(labels = comma) +
xlab("") +
ylab("Fatalities") +
ggtitle("Top 10 Storm Events\nCausing Fatalities")
injuryPlot1 <- ggplot(injuryTop10, aes(x = EVTYPE, y = Injuries)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_y_continuous(labels = comma) +
xlab("") +
ylab("Injuries") +
ggtitle("Top 10 Storm Events\nCausing Injuries")
grid.draw(cbind(ggplotGrob(fatalityPlot1), ggplotGrob(injuryPlot1), size="last"))
As the plots show, tornados cause the greatest number of fatalities and injuries by a wide margin.
Next, we can apply the same logic to identify the events which cause the greatest economic impacts. First, we’ll split the dataframe in order to sum the total property damages and crop damages per event type and then we’ll sort the data and extract the top 10 events which cause the most economic damages to properties and to crops.
# aggregate the data by event type using ddply() to sum property and crop damage across storm events
economicDF <- ddply(stormDF2, .(EVTYPE), summarize, property=sum(total_property_damage), crop=sum(total_crop_damage))
# extract the top 10 property and crop damage causing events
propertyTop10 <- economicDF[order(-economicDF$property),][1:10,1:2]
cropTop10 <- economicDF[order(-economicDF$crop),][1:10,c(1,3)]
# set the ordinal levels of the events for graphing purposes
propertyTop10 <- transform(propertyTop10, EVTYPE = ordered(propertyTop10$EVTYPE, levels = propertyTop10$EVTYPE))
cropTop10 <- transform(cropTop10, EVTYPE = ordered(cropTop10$EVTYPE, levels = cropTop10$EVTYPE))
Finally, we can generate a panel plot using ggplot2 to identify the events that have the greatest contribution to economic consequences.
propertyPlot1 <- ggplot(propertyTop10, aes(x = EVTYPE, y = property/1000000000)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_y_continuous(labels = dollar) +
xlab("") +
ylab("Property Damage Value\n(U.S. $ Billion)") +
ggtitle("Top 10 Storm Events\nCausing Property Damage")
cropPlot1 <- ggplot(cropTop10, aes(x = EVTYPE, y = crop/1000000000)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_y_continuous(labels = dollar) +
xlab("") +
ylab("Crop Damage Value\n(U.S. $ Billion)") +
ggtitle("Top 10 Storm Events\nCausing Crop Damage")
grid.draw(cbind(ggplotGrob(propertyPlot1), ggplotGrob(cropPlot1), size="last"))
As the plots show, floods cause the greatest amount of property damage and droughts cause the greatest amount of crop damage.
Finally, we can analyze the events that cause the greatest overall economic impact by analyzing the total_economic_damage variable we created earlier, which is the summation of total property damage and total crop damage.
totaleconomicDF <- ddply(stormDF2, .(EVTYPE), summarize, total_damage=sum(total_economic_damage))
# extract the top 10 total economic damage causing events
damageTop10 <- totaleconomicDF[order(-totaleconomicDF$total_damage),][1:10,1:2]
# set the ordinal levels of the events for graphing purposes
damageTop10 <- transform(damageTop10, EVTYPE = ordered(damageTop10$EVTYPE, levels = damageTop10$EVTYPE))
# generate the barplot
ggplot(damageTop10, aes(x = EVTYPE, y = total_damage/1000000000)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_y_continuous(labels = dollar) +
xlab("") +
ylab("Total Economic Damage Value\n(U.S. $ Billion)") +
ggtitle("Top 10 Storm Events\nCausing Total Economic Damage")
As this chart shows, floods cause the greatest total economic impact across the U.S., followed by hurricanes/typhoons and then tornados.