This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
First, we load the required packages.
library(ggplot2)
library(plyr)
library(reshape2)
To begin the analysis, we download the compressed data file and read it into R.
#setInternet2(TRUE)
## Create temporary file
f <- tempfile()
## Download the data
download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
f)
## Import the data into R
data <- read.csv(bzfile(f), stringsAsFactors = FALSE)
The events in the database start in 1950 and end in November 2011. The earlier years generally contain fewer recorded events, most likely because good records are lacking. Since older measurements may also be prone to large errors, we restrict the analysis to events that began on or after 1 January 1991, which is the cutoff applied in the code below (roughly the last two decades of the database).
# Set format
data$BGN_DATE <- as.Date(data$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
# Restrict to events that began on or after 1991-01-01
data <- data[data$BGN_DATE >= as.Date("1991-01-01"), ]
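As a quick check of the date handling above, here is a minimal sketch that parses a few made-up date strings and applies the same cutoff. The sample values are illustrative assumptions, not rows taken from the database.

```r
# Hypothetical sample values in the same "m/d/Y H:M:S" layout as BGN_DATE
sample_dates <- c("4/18/1950 0:00:00", "1/1/1991 0:00:00", "11/28/2011 0:00:00")

# Parse to Date; the time-of-day part is discarded
parsed <- as.Date(sample_dates, format = "%m/%d/%Y %H:%M:%S")

# Keep only dates on or after the cutoff used in the analysis
cutoff <- as.Date("1991-01-01")
kept <- parsed[parsed >= cutoff]
print(kept)
```

Any string that does not match the format would become `NA`, so checking `sum(is.na(parsed))` after parsing is a cheap safeguard.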
Processing the categories data
Note that the event categories are not clean: some are misspelled and some are duplicated. Further research could clean the data in more detail. Because of time constraints, we performed only basic processing steps: converting all characters to upper case and trimming leading and trailing spaces from the category names.
# Define trim function
trim <- function(x) gsub("^\\s+|\\s+$", "", x)
# Clean Event Type...
data$EVTYPE <- toupper(data$EVTYPE)
data$EVTYPE <- trim(data$EVTYPE)
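To illustrate what this cleaning does and does not achieve, the sketch below applies the same two steps to a handful of made-up labels. The sample labels are assumptions chosen for illustration.

```r
# Hypothetical raw labels showing case and whitespace variants
raw <- c("Tornado", " TORNADO", "tornado ", "Flash Flood", "FLASH FLOOD")

# Same helper as in the analysis: strip leading and trailing whitespace
trim <- function(x) gsub("^\\s+|\\s+$", "", x)

cleaned <- trim(toupper(raw))

# Count distinct labels before and after cleaning
n_before <- length(unique(raw))
n_after <- length(unique(cleaned))
print(c(before = n_before, after = n_after))
```

Variants that differ only by case or surrounding spaces collapse into one label, while genuine misspellings would still remain as separate categories.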
Processing the public health damage data
The public health damage data need to be summarized to show the number of injuries and fatalities by event type. Next, the top 10 events by injuries and the top 10 events by fatalities are selected. Finally, the data are melted with the reshape2 package so that they can be used later in ggplot charts.
# Make sums of injuries and fatalities
sumhealth <- ddply(data, .(EVTYPE), summarise, fatalities = sum(FATALITIES),
injuries = sum(INJURIES))
# Select the ten most harmful events
topfatalities <- head(sumhealth[order(sumhealth$fatalities, decreasing = TRUE), ], n = 10)[, c(1, 2)]
topinjuries <- head(sumhealth[order(sumhealth$injuries, decreasing = TRUE), ], n = 10)[, c(1, 3)]
## Prepare data for the barchart
forchart <- melt(topfatalities)
## Using EVTYPE as id variables
forchart2 <- melt(topinjuries)
## Using EVTYPE as id variables
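The summarise-and-rank pattern used above can be sketched in base R on toy data. This is an illustrative equivalent, not the code used in the analysis, and the event names and counts are made up.

```r
# Toy data; values are invented for illustration
toy <- data.frame(
  EVTYPE = c("TORNADO", "HEAT", "TORNADO", "FLOOD", "HEAT", "FLOOD"),
  FATALITIES = c(5, 3, 2, 1, 3, 0),
  stringsAsFactors = FALSE
)

# Sum fatalities per event type (base-R counterpart of ddply + summarise)
sums <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)

# Sort in decreasing order and keep the top two rows
top2 <- head(sums[order(sums$FATALITIES, decreasing = TRUE), ], n = 2)
print(top2)
```

On the full dataset, `n = 10` would take the place of `n = 2`.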
Processing the economic damage data
The economic damage data is present in the form of a base and a multiplier (in the form of abbreviations). Hence, we multiply the base numbers by multipliers.
# Property damage multiplier: prepare and use to multiply the damage
data$PROPDMGEXP[is.na(data$PROPDMGEXP)] <- 0
data$PROPDMGEXP[data$PROPDMGEXP == ""] <- 1
data$PROPDMGEXP[grep("[-+?]", data$PROPDMGEXP)] <- 1
data$PROPDMGEXP[grep("[Hh]", data$PROPDMGEXP)] <- 100
data$PROPDMGEXP[grep("[Kk]", data$PROPDMGEXP)] <- 1000
data$PROPDMGEXP[grep("[Mm]", data$PROPDMGEXP)] <- 1e+06
data$PROPDMGEXP[grep("[Bb]", data$PROPDMGEXP)] <- 1e+09
data$PROPDMGEXP <- as.numeric(data$PROPDMGEXP)
data$PROPDMG <- data$PROPDMGEXP * data$PROPDMG
# Crop damage multiplier: prepare and use to multiply the damage
data$CROPDMGEXP[is.na(data$CROPDMGEXP)] <- 0
data$CROPDMGEXP[data$CROPDMGEXP == ""] <- 1
data$CROPDMGEXP[grep("[-+?]", data$CROPDMGEXP)] <- 1
data$CROPDMGEXP[grep("[Hh]", data$CROPDMGEXP)] <- 100
data$CROPDMGEXP[grep("[Kk]", data$CROPDMGEXP)] <- 1000
data$CROPDMGEXP[grep("[Mm]", data$CROPDMGEXP)] <- 1e+06
data$CROPDMGEXP[grep("[Bb]", data$CROPDMGEXP)] <- 1e+09
data$CROPDMGEXP <- as.numeric(data$CROPDMGEXP)
data$CROPDMG <- data$CROPDMGEXP * data$CROPDMG
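The base-times-multiplier idea can also be expressed as a small lookup function. The sketch below is one possible formulation, and `exp_to_multiplier` is a hypothetical helper name. It assumes that every code other than H, K, M, or B (including empty, missing, and digit codes) maps to a multiplier of 1. This differs from the code above, which sets missing codes to 0 and keeps digit codes as literal multipliers.

```r
# Map exponent codes to numeric multipliers.
# Assumption: anything other than H/K/M/B (any case) counts as 1.
exp_to_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- lookup[toupper(code)]
  mult[is.na(mult)] <- 1
  unname(mult)
}

# Hypothetical base amounts and exponent codes
base <- c(25, 2.5, 1, 10)
codes <- c("K", "m", "B", "")
damage <- base * exp_to_multiplier(codes)
print(damage)
```

A lookup vector keeps the mapping in one place, which makes it easier to reuse for both the property and crop columns.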
As with the health data, the economic damage figures are first summarized by event type. Next, the top 10 events with the highest economic impact (defined as damage to crops plus damage to property) are selected. Finally, the data are prepared for ggplot charts with the melt function.
# Make sums of crop and property damage
sumecon <- ddply(data, .(EVTYPE), summarise, cropdmg = sum(CROPDMG), propdmg = sum(PROPDMG))
sumecon$totaldamage <- sumecon$cropdmg + sumecon$propdmg
## Select top 10
topecon <- head(sumecon[order(sumecon$totaldamage, decreasing = TRUE), ], n = 10)
## Prepare data for the barchart
forchart3 <- melt(topecon)
## Using EVTYPE as id variables
forchart3 <- forchart3[forchart3$variable != "totaldamage", ]
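To show the long layout that melting produces, here is a base-R sketch that builds the same shape by hand from a toy wide table. The numbers are made up, and the column names `variable` and `value` mirror the ones used in the plotting code.

```r
# Toy wide table: one row per event, one column per damage type
wide <- data.frame(
  EVTYPE = c("FLOOD", "HAIL"),
  cropdmg = c(5, 3),
  propdmg = c(100, 15),
  stringsAsFactors = FALSE
)

# Long layout: one row per (event, damage type) pair
long <- data.frame(
  EVTYPE = rep(wide$EVTYPE, times = 2),
  variable = rep(c("cropdmg", "propdmg"), each = nrow(wide)),
  value = c(wide$cropdmg, wide$propdmg),
  stringsAsFactors = FALSE
)
print(long)
```

In this shape, `variable` can be mapped to the fill colour of a stacked bar chart while `value` gives the bar height.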
The following table and chart present the 10 most damaging events from the perspective of fatalities.
topfatalities
## **EVTYPE fatalities**
## Make the barchart
ggplot(forchart, aes(x = factor(EVTYPE), y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 90), plot.title = element_text(face = "bold"),
        legend.position = "none") +
  labs(x = "Weather event", y = "Number of fatalities") +
  ggtitle("Fatalities")
The event with the highest number of fatalities in the period analyzed (1991 onward) was EXCESSIVE HEAT, followed closely by TORNADO.
The following table and chart present the 10 most damaging events from the perspective of injuries.
topinjuries
## **EVTYPE injuries**
## Make the barchart
ggplot(forchart2, aes(x = factor(EVTYPE), y = value, fill = variable)) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(axis.text.x = element_text(angle = 90), plot.title = element_text(face = "bold"),
        legend.position = "none") +
  labs(x = "Weather event", y = "Number of injuries") +
  ggtitle("Injuries")
The event with the highest number of injuries in the period analyzed (1991 onward) was TORNADO.
To sum up, TORNADO seems to be the most damaging event from the perspective of public health.
The following table and chart present the 10 most damaging events from the perspective of economic damage.
topecon
## Make the barchart
ggplot(forchart3, aes(x = factor(EVTYPE), y = value, fill = variable)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90), plot.title = element_text(face = "bold"),
        legend.position = "top") +
  labs(x = "Weather event", y = "Economic damage") +
  scale_fill_discrete(name = "Type of damage", labels = c("Crop", "Property")) +
  ggtitle("Economic impact")
From the above we can see that the largest aggregate property damage is again due to tornadoes. Coastal erosion has the highest average per-event property damage.
Tornadoes dominate the weather events harmful to population health. Figure 1 above displays the most harmful weather events, plotting fatalities against injuries, conditional on either injuries or fatalities being reported. The size of each bubble indicates the total count of such events. After tornadoes, the next most dangerous event types as measured by fatalities are those involving heat and those involving flooding. Tornadoes are also much more prevalent among weather events involving either injuries or fatalities. In terms of injuries, heat, floods, and thunder are about equally damaging (after tornadoes).
Tornadoes appear to be the deadliest weather phenomenon occurring in the USA, followed by heat and fires, and then hurricanes. The number of tornado victims exceeds 5,000 over the period considered.
Hurricanes hurt the economy the most, owing to a combination of adverse effects such as floods; storms and tornadoes follow. The total damage from hurricanes and their associated effects over the reported period exceeds $300B.