In this analysis we aim to describe which types of events are most harmful to population health and to the economy across the United States. To investigate this, we explored the NOAA Storm Database. From this data, we found that tornadoes are the most harmful events to population health, in terms of both injuries and fatalities. With respect to economic consequences, floods rank first.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. It can be downloaded from: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
Some documentation of the database is also available.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
First, let’s download the required file:
file_name <- "repdata_data_StormData.csv.bz2"
file_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists(file_name)) {
  # mode = "wb" keeps the compressed file intact when downloading on Windows
  download.file(file_url, destfile = file_name, mode = "wb")
}
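We have obtained a file in “bz2” format. According to the instructions, it’s a compressed CSV file; the read options used below (header row, comma separator, “.” as decimal mark, empty strings as missing values) reflect its layout.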
With that information, we can load the CSV file into a “df_storm_data” data frame. According to the R documentation, read.table can read directly from a bz2-compressed file, so we don’t need to decompress it first. We keep the headers and read strings as factors.
Note: it’s a big file and reading it takes quite a while. We can’t afford to read it every time we process the document. On the other hand, the file is not going to change, so we can cache the result.
df_storm_data <- read.table(file = file_name,
                            header = TRUE,
                            sep = ",",
                            dec = ".",
                            na.strings = "",
                            stringsAsFactors = TRUE)
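The note above about caching can be made concrete in more than one way. The original approach is not shown (knitr's `cache = TRUE` chunk option is one possibility). The following is only a minimal sketch of a manual alternative, assuming a hypothetical helper `load_with_cache()` that stores the parsed data frame in an RDS file. It uses `read.csv` rather than the `read.table` call above, and it is demonstrated on a tiny made-up bz2 file, not on the real data.

```r
# Hypothetical helper (not part of the original analysis):
# parse the CSV once, then reuse the saved RDS copy on later runs.
load_with_cache <- function(csv_path, cache_path) {
  if (file.exists(cache_path)) {
    return(readRDS(cache_path))
  }
  data <- read.csv(csv_path, na.strings = "", stringsAsFactors = TRUE)
  saveRDS(data, cache_path)
  data
}

# Toy demonstration with made-up values written to a bz2-compressed CSV.
toy_csv <- tempfile(fileext = ".csv.bz2")
con <- bzfile(toy_csv, "w")
write.csv(data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                     INJURIES = c(3, 1)),
          con, row.names = FALSE)
close(con)

toy_cache <- tempfile(fileext = ".rds")
first_read  <- load_with_cache(toy_csv, toy_cache)  # parses the CSV, writes the cache
second_read <- load_with_cache(toy_csv, toy_cache)  # served from the RDS file
```

The cache file has to be deleted by hand if the source file ever changes, which is acceptable here because the file is static.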
Let’s check the dimensions and field names:
dim(df_storm_data)
## [1] 902297 37
names(df_storm_data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
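Only seven of the 37 fields listed above are used in the rest of the analysis. As a possible way to shorten the read time, the sketch below skips unused columns at read time by setting their `colClasses` entry to `"NULL"`. The helper name `read_selected_columns()` is hypothetical, the example runs on a small made-up CSV, and this approach has not been checked against the real file.

```r
# Hypothetical helper: read only the columns named in `keep`.
read_selected_columns <- function(path, keep) {
  # Read the header plus one row just to learn the column names.
  header <- names(read.csv(path, nrows = 1))
  # "NULL" drops a column; NA lets R guess the class as usual.
  classes <- ifelse(header %in% keep, NA_character_, "NULL")
  read.csv(path, colClasses = classes,
           na.strings = "", stringsAsFactors = TRUE)
}

# Toy demonstration with made-up values.
toy_path <- tempfile(fileext = ".csv")
write.csv(data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                     REMARKS = c("long text", "more text"),
                     INJURIES = c(3, 0)),
          toy_path, row.names = FALSE)

toy_subset <- read_selected_columns(toy_path, keep = c("EVTYPE", "INJURIES"))
names(toy_subset)
```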
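Using the dplyr package, we select the columns of interest, convert the property and crop damage figures to dollars using their exponent codes (H, K, M, B; any other code counts as 0), and normalise the event types to upper case: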
if (!require("dplyr")) {
  install.packages("dplyr")
  require("dplyr")
}
df_mydata <- df_storm_data %>%
  select(EVTYPE,
         INJURIES, FATALITIES,
         PROPDMG, PROPDMGEXP,
         CROPDMG, CROPDMGEXP) %>%
  # Property damage in dollars; unrecognised or missing exponent codes give 0
  mutate(PROPERTIES_DAMAGE =
           ifelse(PROPDMGEXP %in% c("h", "H"), PROPDMG * 10^2,
           ifelse(PROPDMGEXP %in% c("k", "K"), PROPDMG * 10^3,
           ifelse(PROPDMGEXP %in% c("m", "M"), PROPDMG * 10^6,
           ifelse(PROPDMGEXP %in% c("b", "B"), PROPDMG * 10^9,
                  0))))) %>%
  # Crop damage in dollars, using the same exponent rules
  mutate(CROPS_DAMAGE =
           ifelse(CROPDMGEXP %in% c("h", "H"), CROPDMG * 10^2,
           ifelse(CROPDMGEXP %in% c("k", "K"), CROPDMG * 10^3,
           ifelse(CROPDMGEXP %in% c("m", "M"), CROPDMG * 10^6,
           ifelse(CROPDMGEXP %in% c("b", "B"), CROPDMG * 10^9,
                  0))))) %>%
  # Upper-case event types so that e.g. "Flood" and "FLOOD" are grouped together
  mutate(EVENT_TYPE = toupper(EVTYPE)) %>%
  select(EVENT_TYPE, PROPERTIES_DAMAGE, CROPS_DAMAGE,
         INJURIES, FATALITIES)
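The exponent rule applied above (H = 10^2, K = 10^3, M = 10^6, B = 10^9, anything else = 0) can also be written as a lookup table. The following is a minimal base R sketch of that same rule, not the code used in this analysis. The helper name `damage_multiplier()` and the toy values are made up for illustration.

```r
# Hypothetical helper: map exponent codes to multipliers, case-insensitively.
# Codes outside H/K/M/B (including NA) map to 0, mirroring the nested ifelse above.
damage_multiplier <- function(exp_code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(as.character(exp_code))])
  mult[is.na(mult)] <- 0
  mult
}

# Toy demonstration with made-up values.
toy_value <- c(2.5, 10, 1, 7, 3)
toy_exp   <- factor(c("K", "m", "B", NA, "5"))
toy_dollars <- toy_value * damage_multiplier(toy_exp)
toy_dollars
```

A lookup vector keeps the list of accepted codes in one place, which makes it easier to see which codes are being treated as zero.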
We’ll separate harm to population health into injuries and fatalities.
To address the injury risk, we’ll make a “df_injuries” data frame as follows:
df_injuries <- df_mydata %>%
  select(EVENT_TYPE, INJURIES) %>%
  group_by(EVENT_TYPE) %>%
  summarise(INJURIES = sum(INJURIES)) %>%
  arrange(desc(INJURIES))
Let’s plot it.
The ggplot2 graphics package will be needed:
if (!require("ggplot2")) {
  install.packages("ggplot2")
  require("ggplot2")
}
ggplot(data = df_injuries[1:10, ],
       aes(x = reorder(EVENT_TYPE, -INJURIES),
           y = INJURIES)) +
  geom_bar(stat = "identity", fill = "green") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Top 10 injury causing events",
       x = "Event type",
       y = "Number of injuries")
Tornadoes are clearly the main injury-causing event, followed by thunderstorm winds (TSTM WIND), floods and excessive heat.
To address the death risk, we’ll make a “df_fatalities” data frame as follows:
df_fatalities <- df_mydata %>%
  select(EVENT_TYPE, FATALITIES) %>%
  group_by(EVENT_TYPE) %>%
  summarise(FATALITIES = sum(FATALITIES)) %>%
  arrange(desc(FATALITIES))
Let’s plot it.
ggplot(data = df_fatalities[1:10, ],
       aes(x = reorder(EVENT_TYPE, -FATALITIES),
           y = FATALITIES)) +
  geom_bar(stat = "identity", fill = "green") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Top 10 death causing events",
       x = "Event type",
       y = "Number of fatalities")
Tornadoes are clearly the main death-causing event, followed by excessive heat, flash floods and heat.
To address the economic impact, we’ll make a “df_economics” data frame as follows:
The “reshape2” package will be needed for melting:
if (!require("reshape2")) {
  install.packages("reshape2")
  require("reshape2")
}
df_economics <- df_mydata %>%
  select(EVENT_TYPE, PROPERTIES_DAMAGE, CROPS_DAMAGE) %>%
  group_by(EVENT_TYPE) %>%
  summarise(PROPERTIES_DAMAGE = sum(PROPERTIES_DAMAGE),
            CROPS_DAMAGE = sum(CROPS_DAMAGE)) %>%
  mutate(TOTAL_DAMAGE = PROPERTIES_DAMAGE + CROPS_DAMAGE)
df_economics <- melt(df_economics,
                     id = c("EVENT_TYPE", "TOTAL_DAMAGE"),
                     measure.vars = c("PROPERTIES_DAMAGE", "CROPS_DAMAGE"))
df_economics <- df_economics %>% arrange(desc(TOTAL_DAMAGE))
Let’s plot it.
# After melting, each event has two rows (property and crops),
# so the first 20 rows correspond to the top 10 events by total damage.
ggplot(data = df_economics[1:20, ],
       aes(x = reorder(EVENT_TYPE, -value),
           y = value,
           fill = variable)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Top 10 property and crop damaging events",
       x = "Event type",
       y = "Total damage in dollars")
Floods clearly have the greatest impact on the economy, followed by hurricanes/typhoons, tornadoes and storm surges. In most cases, the impact comes from damage to property rather than to crops.
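The economic plot takes the first 20 rows of the melted data frame, which works because every event contributes exactly two rows with the same total. A more explicit alternative is to pick the top events before reshaping. The sketch below illustrates that idea in base R only. The helper name `top_events_long()` and the damage figures are made up, and it is not the code that produced the plot.

```r
# Hypothetical helper: keep the n events with the largest total damage,
# then stack property and crop damage into a long table for plotting.
top_events_long <- function(totals, n = 10) {
  totals$TOTAL_DAMAGE <- totals$PROPERTIES_DAMAGE + totals$CROPS_DAMAGE
  top <- head(totals[order(-totals$TOTAL_DAMAGE), ], n)
  rbind(
    data.frame(EVENT_TYPE = top$EVENT_TYPE,
               TOTAL_DAMAGE = top$TOTAL_DAMAGE,
               variable = "PROPERTIES_DAMAGE",
               value = top$PROPERTIES_DAMAGE),
    data.frame(EVENT_TYPE = top$EVENT_TYPE,
               TOTAL_DAMAGE = top$TOTAL_DAMAGE,
               variable = "CROPS_DAMAGE",
               value = top$CROPS_DAMAGE)
  )
}

# Toy demonstration with made-up damage figures.
toy_totals <- data.frame(
  EVENT_TYPE = c("FLOOD", "HAIL", "DROUGHT"),
  PROPERTIES_DAMAGE = c(100, 10, 50),
  CROPS_DAMAGE = c(5, 1, 60)
)
toy_long <- top_events_long(toy_totals, n = 2)
toy_long
```

Selecting by event first means the number of plotted events no longer depends on how many rows each event occupies after reshaping.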