In this report we aim to answer two questions regarding storms in the U.S.:
1. Across the United States, which types of events are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?
To answer this question, data was obtained from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which includes data on major hydrological events from 1950 to 2011. We conclude that tornados have been most harmful to population health and that flooding has been most damaging to the economy.
This analysis assumes the presence of the downloaded file in the working directory.
library(ggplot2)
library(plyr)
library(dplyr)
We first load in the data to R using the read.csv function.
dat <- read.csv("repdata-data-StormData.csv")
dim(dat)
## [1] 902297 37
For ease of readability, column names have been converted to lower case and _ replaced with .
names(dat) <- gsub("_", ".", tolower(names(dat)))
names(dat)[1:5]
## [1] "state.." "bgn.date" "bgn.time" "time.zone" "county"
To answer our question, we are primarily interested in the columns relating to the event type, fatalaties/injuries, and damages. Let’s retain those columns along with dates, state info, and remarks.
dat <- select(dat, bgn.date, state, evtype, fatalities, injuries,
propdmg, propdmgexp, cropdmg, cropdmgexp, remarks, refnum)
head(dat)
## bgn.date state evtype fatalities injuries propdmg propdmgexp
## 1 4/18/1950 0:00:00 AL TORNADO 0 15 25.0 K
## 2 4/18/1950 0:00:00 AL TORNADO 0 0 2.5 K
## 3 2/20/1951 0:00:00 AL TORNADO 0 2 25.0 K
## 4 6/8/1951 0:00:00 AL TORNADO 0 2 2.5 K
## 5 11/15/1951 0:00:00 AL TORNADO 0 2 2.5 K
## 6 11/15/1951 0:00:00 AL TORNADO 0 6 2.5 K
## cropdmg cropdmgexp remarks refnum
## 1 0 1
## 2 0 2
## 3 0 3
## 4 0 4
## 5 0 5
## 6 0 6
Let’s begin by looking at the different kinds of weather events.
len <- length(unique(dat$evtype))
It appears there are 985 unique weather events recorded in this dataset. This doesn’t quite add up, because the documentation provided by the NOAA indicates 48 different types of events, with some more specific categorization of major categories such as tornados, hurricanes, and floods. Let’s look at the most frequently occuring events.
tab <- sort(table(dat$evtype), decreasing = TRUE)
tab.percent = (tab / nrow(dat)) * 100
names(tab.percent[tab.percent >= 1])
## [1] "HAIL" "TSTM WIND" "THUNDERSTORM WIND"
## [4] "TORNADO" "FLASH FLOOD" "FLOOD"
## [7] "THUNDERSTORM WINDS" "HIGH WIND" "LIGHTNING"
## [10] "HEAVY SNOW" "HEAVY RAIN" "WINTER STORM"
It appears not all of these categeries are necessarily unique. For example, TSTM wind, thunderstorm wind, and thunderstorm winds are all really the same thing. Let’s address this when we get to the analysis and know which events we will be focusing on.
Now let’s look at the property and crop damage data. Property damaage is recorded with a number in the propdmg column and either a “K”, “M”, or “B” in the propdmgexp column, indicating whether the number represents thousands, millions, or billions respectively. The same is true of the crop damage. Let’s look at the most frequently occuring units to get a sense of the data, and to examine outliers.
prop.exp.tab <- sort(table(dat$propdmgexp), decreasing = TRUE)
prop.exp.tab[1:10]
##
## K M 0 B 5 1 2 ? m
## 465934 424665 11330 216 40 28 25 13 8 7
crop.exp.tab <- sort(table(dat$cropdmgexp), decreasing = TRUE)
crop.exp.tab[1:10]
##
## K M k 0 B ? 2 m <NA>
## 618413 281832 1994 21 19 9 7 1 1 NA
There do appear to be some various other characters, perhaps indicating other powers of 10. These are infrequent enough that they can be ignored, and just the data containing K, M, and B considered. (Along with their lower-case equivalents). Let’s create new columns that contain the property and crop damage as pure numerical values, and lastly, a column containing the total combined property and crop damages and a column containing total fatalities + injuries.
dat$prop <- rep(NA, times = nrow(dat)) # initialize columns with NAs
dat$crop <- rep(NA, times = nrow(dat))
dat$total.dmg <- rep(NA, times = nrow(dat))
dat$total.harm <- rep(NA, times = nrow(dat))
propk <- dat$propdmgexp == "K" | dat$propdmgexp == "k"
dat[propk, ]$prop <- dat[propk, ]$propdmg * 1000
propm <- dat$propdmgexp == "M" | dat$propdmgexp == "m"
dat[propm, ]$prop <- dat[propm, ]$propdmg * 1000000
propb <- dat$propdmgexp == "B" | dat$propdmgexp == "b"
dat[propb, ]$prop <- dat[propb, ]$propdmg * 1000000000
cropk <- dat$cropdmgexp == "K" | dat$cropdmgexp == "k"
dat[cropk, ]$crop <- dat[cropk, ]$cropdmg * 1000
cropm <- dat$cropdmgexp == "M" | dat$cropdmgexp == "m"
dat[cropm, ]$crop <- dat[cropm, ]$cropdmg * 1000000
cropb <- dat$cropdmgexp == "B" | dat$cropdmgexp == "b"
dat[cropb, ]$crop <- dat[cropb, ]$cropdmg * 1000000000
dat$total.dmg <- dat$crop + dat$prop
dat$total.harm <- dat$injuries + dat$fatalities
Let’s start by looking at what kinds of events contain the total number of injuries and fatalities.
total.harm <- tapply(dat$total.harm, dat$evtype, sum, na.rm = TRUE)
total.harm <- sort(total.harm, decreasing = TRUE)
total.harm[1:10]
## TORNADO EXCESSIVE HEAT TSTM WIND FLOOD
## 96979 8428 7461 7259
## LIGHTNING HEAT FLASH FLOOD ICE STORM
## 6046 3037 2755 2064
## THUNDERSTORM WIND WINTER STORM
## 1621 1527
It is clear that tornados account for the greatest number of fatalaties and injuries from 1950 to 2011. For the sake of further analysis, though, let’s modify our dataset to account for some of the duplicate categories. For the purposes of this analysis, “FLOOD,” “RIVER FLOOD,” and “FLASH FLOOD” are considered to be the same event, as are “HEAT” and “EXCESSIVE HEAT.”
tstm <- dat$evtype == "TSTM WIND" | dat$evtype == "THUNDERSTORM WINDS"
dat[tstm, ]$evtype <- "THUNDERSTORM WIND"
flash.flood <- dat$evtype == "FLASH FLOOD" | dat$evtype == "RIVER FLOOD"
dat[flash.flood, ]$evtype <- "FLOOD"
heat <- dat$evtype == "HEAT"
dat[heat, ]$evtype <- "EXCESSIVE HEAT"
hurricane <- grep("(HURRICANE )|HURRICANE", dat$evtype) # All hurricanes lumped together
dat[hurricane, ]$evtype <- "HURRICANE/TYPHOON"
dat$evtype <- as.factor(dat$evtype)
Now let’s reexamine which events have produced the greatest number of injuries and fatalities.
total.harm <- tapply(dat$total.harm, dat$evtype, sum, na.rm = TRUE)
total.harm <- sort(total.harm, decreasing = TRUE)
events <- names(total.harm)
harm <- as.numeric(total.harm)
harm.df <- data.frame(Event = factor(events, levels = events),
Harm = harm)
g <- ggplot(harm.df[1:5, ], aes(x = Event, y = Harm))
g + geom_bar(stat = "identity") +
ggtitle("Total Injuries and Fatalities from Hydrological Events (1950 to 2011)") +
ylab("Total Injuries and Fatalities") +
theme(text = element_text(size=10))
Lastly, let’s look at what events accounted for the most property and crop damage.
total.dmg <- tapply(dat$total.dmg, dat$evtype, sum, na.rm = TRUE)
total.dmg <- sort(total.dmg, decreasing = TRUE)
events <- names(total.dmg)
dmg <- as.numeric(total.dmg)
dmg.df <- data.frame(Event = factor(events, levels = events),
Damage = dmg)
g <- ggplot(dmg.df[1:5, ], aes(x = Event, y = Damage))
g + geom_bar(stat = "identity") +
ggtitle("Total Property and Crop Damage from Hydrological Events (1950 to 2011)") +
ylab("Total Property and Crop Damage") +
theme(text = element_text(size=10))