US Storm Data is analysed to determine the most harmful events for 1) population health and 2) economic damage. It is found that tornado events result in most fatalities as well as injuries, even when all high-temperature related events are considered as one category. It is found that flood events cause the largest amount of economical damage. Both findings appear reasonable because i) tornados are very fast and can result in debries becoming as dangerous as bullets and ii) floods can inflict large amounts of damage because they affect large areas (but are maybe better evaded and so lead to less fatalities).
Because the two research questions focus on different aspects of the dataset and to be more efficient, the dataset is subsetted such that it contains only these variables which are relevant for the specific research question.
The respective variables are written to disc (in rds format) and then reloaded, no external data processing is done.
To note, the data processing is inefficient at first run but is efficient on subsequent runs.
Processing for Research Question 1
For research question 1, the following variables are considered relevant and extracted to a separate, reduced dataset dred1 (subsequently then loaded again if available as d1):
STATEEVTYPEFATALITIESINJURIES# Initial load and dataset reduction.
if(!file.exists("./repdata_data_StormData.csv.bz2"))
{
temp <- "./repdata_data_StormData.csv.bz2"
download.file(
"http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
temp, mode="wb"
)
}
if(!file.exists("./dred1.rds"))
{
d <- read.csv("repdata_data_StormData.csv.bz2", header=T)
dred1 <- d[, c("STATE", "EVTYPE", "FATALITIES", "INJURIES")]
saveRDS(dred1, file="./dred1.rds")
rm(ls="d")
}
d1 <- readRDS("./dred1.rds")
This transformation is justified because it drastically increases dataset handling efficiency and reduces memory requirements of the computer.
As a second transformation, all EVTYPE factor labels are converted to upper case, this gives a reduction of length(levels(d1$EVTYPE)) - length(levels(factor(toupper(d1$EVTYPE)))) = 87 factors which only differ in their factor level labels cases.
d1$EVTYPE <- factor(toupper(d1$EVTYPE))
str(d1)
## 'data.frame': 902297 obs. of 4 variables:
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 898 levels " HIGH SURF ADVISORY",..: 758 758 758 758 758 758 758 758 758 758 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
Processing for Research Question 2
For research question 2, the following variables are considered relevant and extracted to a separate, reduced dataset dred2 (subsequently then loaded again if available as d2):
STATEPROPDMGPROPDMGEXP# Initial load and dataset reduction.
if(!file.exists("./repdata_data_StormData.csv.bz2"))
{
temp <- "./repdata_data_StormData.csv.bz2"
download.file(
"http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
temp, mode="wb"
)
}
if(!file.exists("./dred2.rds"))
{
d <- read.csv("repdata_data_StormData.csv.bz2", header=T)
dred2 <- d[, c("STATE", "EVTYPE", "PROPDMG", "PROPDMGEXP")]
saveRDS(dred2, file="./dred2.rds")
rm(ls="d")
}
d2 <- readRDS("dred2.rds")
The inspection of the d2 dataset reveils that the PROPDMGEXP variable characterizes the magnitude of the PROPDMG variable, but in a non-tidy way.
Therefore the following rules are applied:
PROPDMGEXP is '-/+/0/?' it is defined as 0PROPDMGEXP 'h/H' are defined as 2, 'K' is defined as 3, 'm/M' are defined as 6, 'B' is defined as 9# Normalize ambiguous factor levels.
# Normalization of factors appears easier when the factors are first converted
# to characters followed by converstion to a numerical type for subsequent
# use as exponents.
d2$PROPDMGEXP <- as.character(d2$PROPDMGEXP)
d2[which(d2$PROPDMGEXP==""), ]$PROPDMGEXP <- "0"
d2[which(d2$PROPDMGEXP=="-"), ]$PROPDMGEXP <- "0"
d2[which(d2$PROPDMGEXP=="?"), ]$PROPDMGEXP <- "0"
d2[which(d2$PROPDMGEXP=="+"), ]$PROPDMGEXP <- "0"
d2[which(d2$PROPDMGEXP=="h"), ]$PROPDMGEXP <- "2"
d2[which(d2$PROPDMGEXP=="H"), ]$PROPDMGEXP <- "2"
d2[which(d2$PROPDMGEXP=="K"), ]$PROPDMGEXP <- "3"
d2[which(d2$PROPDMGEXP=="m"), ]$PROPDMGEXP <- "6"
d2[which(d2$PROPDMGEXP=="M"), ]$PROPDMGEXP <- "6"
d2[which(d2$PROPDMGEXP=="B"), ]$PROPDMGEXP <- "9"
d2$PROPDMGEXP <- as.numeric(d2$PROPDMGEXP)
Conversion to a data.table object allows to easily create a new variable representing the total damage:
library(data.table)
d2 <- as.data.table(d2)
d2[, dmg:=d2$PROPDMG*10^d2$PROPDMGEXP]
## STATE EVTYPE PROPDMG PROPDMGEXP dmg
## 1: AL TORNADO 25.0 3 25000
## 2: AL TORNADO 2.5 3 2500
## 3: AL TORNADO 25.0 3 25000
## 4: AL TORNADO 2.5 3 2500
## 5: AL TORNADO 2.5 3 2500
## ---
## 902293: WY HIGH WIND 0.0 3 0
## 902294: MT HIGH WIND 0.0 3 0
## 902295: AK HIGH WIND 0.0 3 0
## 902296: AK BLIZZARD 0.0 3 0
## 902297: AL HEAVY SNOW 0.0 3 0
This dataset is then used in the analysis (of question 2) in the next section.
Question 1: What are the most harmful event types with respect to population health?
Population health indicators are considered to be
d1$FATALITIESd1$INJURIESIt appears as if temperature and tornados have highest fatality rates:
head(d1[order(d1$FAT, decreasing=T), ], 20)
## STATE EVTYPE FATALITIES INJURIES
## 198704 IL HEAT 583 0
## 862634 MO TORNADO 158 1150
## 68670 MI TORNADO 116 785
## 148852 TX TORNADO 114 597
## 355128 IL EXCESSIVE HEAT 99 0
## 67884 MA TORNADO 90 1228
## 46309 KS TORNADO 75 270
## 371112 PA EXCESSIVE HEAT 74 135
## 230927 PA EXCESSIVE HEAT 67 0
## 78567 MS TORNADO 57 504
## 247938 WI EXTREME HEAT 57 0
## 6370 AR TORNADO 50 325
## 598500 TX EXCESSIVE HEAT 49 0
## 606363 CA EXCESSIVE HEAT 46 18
## 860386 AL TORNADO 44 800
## 157885 TX TORNADO 42 1700
## 362850 MO EXCESSIVE HEAT 42 397
## 629242 NY EXCESSIVE HEAT 42 0
## 78123 MS TORNADO 38 270
## 83578 MO TORNADO 37 176
In addition to temperature and tornados, it seems as if hurricanes, floods and rain are also responsible for a number of fatalities (these will be addressed below).
However, what if there are far more tornados than heat events so that the sum of tornado related fatalities would exceed heat related fatalities?
This will be further inspected below.
In order to analyse this, all event types related to temperature and tornados are combined into a generalized HEAT/TORNADO category and then the number of related fatalities and are counted.
In order to normalize the categories, all levels relating either the expressions 'WARM', 'HEAT' or 'TORNADO' are collected:
# Unlike e.g. Bash, 'grep' in R returns indeces, not matched strings.
heat_related_levels <- grep("HEAT|WARM", x=levels(d1$EV))
heat_related_events <- d1$EV %in% levels(d1$EV)[heat_related_levels]
d1[heat_related_events, ]$EVTYPE <- factor("HEAT")
# 'TORNADO' related:
wind_related_levels <- grep("STORM|WIND", x=levels(d1$EV))
wind_related_events <- d1$EV %in% levels(d1$EV)[wind_related_levels]
d1[wind_related_events, ]$EVTYPE <- factor("WIND")
For simplicity, a new data frame is created and then analysed using a barplot:
# (Rotate x-axis labels and adjust margins.)
par(las=2); par(mar=c(8, 4, 2, 1))
k <- as.data.frame(tapply(d1$FATALITIES, INDEX=d1$EVTYPE, FUN=sum))
colnames(k) <- "FATALITIES"
barplot(k[order(k$FATALITIES, decreasing=T), ][1:10],
main="10 Most Harmful Event Types [FATALITIES]", ylab="Fatalities")
So indeed, as was guessed above, even if the most fatal event is a heat-type event, the total number of wind related fatalities is larger.
Injuries can be analysed analogously.
par(las=2); par(mar=c(8, 4, 2, 1))
k <- as.data.frame(tapply(d1$INJURIES, INDEX=d1$EVTYPE, FUN=sum))
colnames(k) <- "INJURIES"
barplot(k[order(k$INJURIES, decreasing=T), ][1:10],
main="10 Most Harmful Event Types [INJURIES]", ylab="Injuries")
To summarize, both with respect to fatalities and injuries, tornados are the most harmful events.
Question 2: Which events result in highest economic damage?
The events causing the largest economic damage are identified in the following figure.
par(las=2); par(mar=c(8, 6, 2, 1))
k <- as.data.frame(tapply(d2$dmg, INDEX=d1$EVTYPE, FUN=sum))
colnames(k) <- "DAMAGE"
barplot(k[order(k$DAMAGE, decreasing=T), ][1:10],
main="Top 10 Most Damaging Event Types", ylab="Damage")
The most damaging events are thus floods.
.bz2 file extractionSTORM/HURRICANE, FLOOD/WATER and possibly others