This paper examines which types of weather events are the most harmful, both with respect to population health, and to the economy. It finds that tornadoes cause the most injuries and deaths, but floods cause the most economic damage.
The data comes from the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which tracks the characteristics of major storms and weather events in the United States.
This assignment asks two questions:
- Across the United States, which types of events (as indicated in the EVTYPE
variable) are most harmful with respect to population health?
- Across the United States, which types of events have the greatest economic
consequences?
First, let’s download and unzip the data file.
# Get the data
require(utils)
## Set up a data folder
if(!file.exists("./data")) {
dir.create("./data")
}
## Download data file and unzip
if( !file.exists("./data/stormData.csv")) {
urlData <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
temp <- tempfile()
download.file( url = urlData
, destfile = temp
, method = "curl"
)
unzip(temp, "./data/")
unlink(temp)
dateDownloaded <- date()
}
Next, let’s open the data file.
# Open the data file
stormDataRaw <- read.csv(file="./data/repdata-data-StormData.csv.bz2")
dim(stormDataRaw)
## [1] 902297 37
There are 902,297 rows of data, which track 37 different variables.
The first question is:
- Across the United States, which types of events (as indicated in the EVTYPE
variable) are most harmful with respect to population health?
Let’s define “harm to popualation health” as either death or injury directly caused by the event. Let us weight them the same, so that one death is considered the same amount of harm as an injury. While it’s much worse to die, the economic impact of a prolonged injury can cause more economic damage, so it is hard to tell which has a greater impact on the economy. Consequently, we’ll weight them the same.
We will need information from three columns:
* EVTYPE - The type of event
* FATALITIES - Fatalities that were directly caused by the event.
* INJURIES - Injuries that were directly caused by the event.
To calculate the weather event with the greatest health impact, we will need to sum deaths, injuries, and the sum of both, for each event type. Then we’ll have a nice data set we can sort in descending order to find the worst health events.
# Load necessary packages
require(dplyr)
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following object is masked from 'package:stats':
##
## filter
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require(tidyr)
## Loading required package: tidyr
require(ggplot2)
## Loading required package: ggplot2
require(scales)
## Loading required package: scales
# Clean the data
q1Data <- stormDataRaw %>%
select(EVTYPE, FATALITIES, INJURIES) %>%
group_by(EVTYPE) %>%
summarize(Death = sum(FATALITIES)
, Injury = sum(INJURIES) ) %>%
mutate(HealthEffects = Death + Injury) %>%
arrange(desc(HealthEffects))
head(q1Data)
## Source: local data frame [6 x 4]
##
## EVTYPE Death Injury HealthEffects
## 1 TORNADO 5633 91346 96979
## 2 EXCESSIVE HEAT 1903 6525 8428
## 3 TSTM WIND 504 6957 7461
## 4 FLOOD 470 6789 7259
## 5 LIGHTNING 816 5230 6046
## 6 HEAT 937 2100 3037
The table above shows the most harmful health effects broken down by event type. Tornadoes are the worst.
The second question is:
- Across the United States, which types of events have the greatest economic
consequences?
Let’s define “economic consequences” as either property or crop damage, as per the columns “PROPDMG” and “CROPDMG”. (That is, we are excluding the cost to the economy of injuries and deaths, as we do not have recorded data about how much injuries and deaths have cost.)
The table records property and crop damage as decimal numbers, with
separate columns telling us the exponent (i.e., thousands, millions, or billions). It looks like this:
exponentExample <- stormDataRaw %>%
select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
filter(!is.na(PROPDMGEXP)) %>%
filter(!is.na(CROPDMGEXP)) %>%
filter(CROPDMG > 0)
head(exponentExample)
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 HURRICANE OPAL/HIGH WINDS 0.1 B 10 M
## 2 THUNDERSTORM WINDS 5.0 M 500 K
## 3 HURRICANE ERIN 25.0 M 1 M
## 4 HURRICANE OPAL 48.0 M 4 M
## 5 HURRICANE OPAL 20.0 m 10 m
## 6 THUNDERSTORM WINDS 50.0 K 50 K
To find the events with the greatest economic impact, first, let’s prepare a data set with just the columns we want:
q2Columns <- stormDataRaw %>%
select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
head(q2Columns)
## EVTYPE PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1 TORNADO 25.0 K 0
## 2 TORNADO 2.5 K 0
## 3 TORNADO 25.0 K 0
## 4 TORNADO 2.5 K 0
## 5 TORNADO 2.5 K 0
## 6 TORNADO 2.5 K 0
We need to find whole numbers for property damage and crop damage, by multiplying PROPDMG by the number represented in PROPDMGEXP. For example, if PROPDMGEXP has an M, we must multiply PROPDMG by 1,000,000. We’ll need to prepare a lookup table to do this.
Let’s look at all the unique values in either PROPDMGEXP or CROPDMGEXP to see what we have to translate.
unique(q2Columns$PROPDMGEXP)
## [1] K M B m + 0 5 6 ? 4 2 3 h 7 H - 1 8
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(q2Columns$CROPDMGEXP)
## [1] M K m B ? 0 k 2
## Levels: ? 0 2 B k K m M
Let’s prepare lookup tables for both Property damage and Crop damage, by substituting the unique characters listed above for exponents they represent. For instance, we’ll tell the computer to substitute an “M” for ten to the power of “6”, so we can multiply that exponent by the value in either PROPDMG or CROPDMG.
# Prepare Lookup table
letter <- c("-", "?", "+", "0", "1", "2", "3", "4", "5", "6", "7", "8"
, "B", "h", "H", "k", "K", "m", "M", "")
multiplier <- c(0
, 0
, 0
, 0
, 1
, 2
, 3
, 4
, 5
, 6
, 7
, 8
, 9
, 2
, 2
, 3
, 3
, 6
, 6
, 0)
lookup <- as.data.frame(cbind(letter, multiplier))
# Format the lookup table to character and numeric, rather than factors
lookup$letter <- as.character(lookup$letter)
lookup$multiplier <- as.numeric(lookup$multiplier)
#Format the original table
q2Columns$PROPDMGEXP <- as.character(q2Columns$PROPDMGEXP, na.rm=TRUE)
q2Columns$CROPDMGEXP <- as.character(q2Columns$CROPDMGEXP, na.rm=TRUE)
# Replace the letters in PROPDMGEXP and CROPDMGEXP with the numbers in our
# lookup table
q2ColumnsA <- q2Columns %>%
left_join(x = q2Columns
, y = lookup
, by = c("PROPDMGEXP" = "letter")
, copy=FALSE) %>%
rename(PropertyExp = multiplier)
head(q2ColumnsA)
## PROPDMGEXP EVTYPE PROPDMG CROPDMG CROPDMGEXP PropertyExp
## 1 K TORNADO 25.0 0 4
## 2 K TORNADO 2.5 0 4
## 3 K TORNADO 25.0 0 4
## 4 K TORNADO 2.5 0 4
## 5 K TORNADO 2.5 0 4
## 6 K TORNADO 2.5 0 4
q2ColumnsB <- q2ColumnsA %>%
left_join(x = q2ColumnsA
, y = lookup
, by = c("CROPDMGEXP" = "letter")
, copy = FALSE
, na.rm= TRUE) %>%
rename(CropExp = multiplier)
q2ColumnsC <- q2ColumnsB %>%
mutate(TotalProperty = PROPDMG * 10 ^ PropertyExp
, TotalCrop = CROPDMG * 10 ^ CropExp
, TotalCost = TotalProperty + TotalCrop) %>%
select(EVTYPE, PROPDMG, PropertyExp, TotalProperty, CROPDMG, CropExp, TotalCrop, TotalCost) %>%
group_by(EVTYPE) %>%
summarize(cost = sum(TotalCost)) %>%
arrange(desc(cost))
head(q2ColumnsC)
## Source: local data frame [6 x 2]
##
## EVTYPE cost
## 1 FLOOD 1.503197e+12
## 2 HURRICANE/TYPHOON 7.191371e+11
## 3 TORNADO 5.736233e+11
## 4 STORM SURGE 4.332354e+11
## 5 HAIL 1.876122e+11
## 6 FLASH FLOOD 1.824399e+11
After our transformation, the table shows that the most costly storm events are flood, hurricane, tornado, storm surge, and hail. Floods cause roughly 3 times the economic damage as tornadoes.
Tornadoes have the biggest effect on human life, as can be seen by this chart:
# Prepare the data to be in a stacked bar chart.
q1DataChart <- q1Data %>%
gather(key="DeathOrInjury", value="Count", -EVTYPE, -HealthEffects) %>%
filter(Count > 0) %>%
arrange(desc(HealthEffects)) %>%
select(EVTYPE, DeathOrInjury, Count, HealthEffects)
# Order the result from largest to smallest
q1Data$EVTYPE <- as.character(q1Data$EVTYPE)
topFiveEffects <- as.character(q1Data$EVTYPE[1:5])
q1DataChart$EVTYPE <- factor(q1DataChart$EVTYPE
, levels = topFiveEffects
)
# Format the Events in comma form
q1DataChart$Count <- comma(q1DataChart$Count)
# Plot in a stacked bar chart
q1Plot <- ggplot(q1DataChart
, aes(x=q1DataChart$EVTYPE
, y=comma(q1DataChart$Count)
, fill=q1DataChart$DeathOrInjury
)
)
g <- testPlot +
geom_bar(stat = "identity") +
xlab("Events") +
ylab("Number of People Directly Affected") +
ggtitle("Top 5 Most Dangerous Storm Events") +
labs(fill="") +
theme_bw() +
scale_fill_manual(values=c("red", "black"), limits=c("Injury", "Death")) +
scale_y_continuous(labels=comma)
g
We can see that tornadoes cause more than ten times the number of deaths as the next most harmful event, which is excessive heat.
Floods have the greatest economic impact, as can be seen by this chart.
q2ColumnsD <- q2ColumnsC[q2ColumnsC$cost == q2ColumnsC$cost[1:5], ]
# Order the result from largest to smallest
q2ColumnsD$EVTYPE <- as.character(q2ColumnsD$EVTYPE)
topFiveEffectsEconomic <- as.character(q2ColumnsD$EVTYPE[1:5])
q2ColumnsD$EVTYPE <- factor(q2ColumnsD$EVTYPE
, levels = topFiveEffectsEconomic
)
q2Data <- ggplot(q2ColumnsD
, aes(x=q2ColumnsD$EVTYPE
, y=q2ColumnsD$cost
)
)
q2Plot <- q2Data +
geom_bar(stat = "identity") +
xlab("Events") +
ylab("Economic Impact ($)") +
ggtitle("Top 5 Most Costly Storm Events") +
labs(fill="") +
theme_bw()
q2Plot
Floods cause roughly three times the economic damage as tornadoes.