Looking at the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database which recorded storm data from 1950 to November of 2011, the author is interested in answering questions as to what are the effects of these recorded events in terms of population safety and the economy by looking at the casualties incurred and the amount of property or crop damage respectively.
By virtue of literate statistical programming, the author has compelled to include processes important to the pipelined approach of answering questions with data from data processing, data visualization and analysis and result, all of which are outlined in their respective sections below.
Result shows that TORNADO, EXCESSIVE HEAT and TSTM WIND are the three dominant EVENT TYPES that caused harm to US population while DROUGHT, FLOOD, HURRICANE/TYPHOON and TORNADO are four EVENT TYPES that largely affected the economy in terms of the amount of crops and properties destroyed.
In this section, the author preprocess the data in such a way that it would be ideal in answering the following questions:
Across the United States, which types of events, are the most harmful with respect to population health. This can be answered by looking at the fatalities and injuries per event type.
Across the United States, which types of events have the greatest economic consequences? We can answer this by looking at the following variables:
PROPDMG - for property damage base value
PROPDMGEXP - Property damage base value modifier
CROPDMG - Crop damage base value
CROPDMGEXP - Crop damage base value modifier
The rules on how to resolve the real value costs can be found here: https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html.
Generally, the dataset must be wrangled into two parts - (1) effects of storm events on population health and (2) effects of storm events on the economy. All of the data manipulations are done using dplyr techniques.
# load necessary packages
library(dplyr)
library(ggplot2)
library(gtable)
library(grid)
library(gridExtra)
# load dataset
storm.data.raw <- read.csv("./repdata%2Fdata%2FStormData.csv.bz2")
# reduce dataset columns to useful ones
columns <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG","PROPDMGEXP",
"CROPDMG", "CROPDMGEXP", "REFNUM")
storm.data.reduced <- storm.data.raw[, columns]
# sample dataset to minimize for testing
storm.data.sample <- sample_n(storm.data.reduced, 300)
# Add all recorded fatalities and injuries by event type
health.outcomes <- storm.data.reduced %>%
group_by(EVTYPE) %>%
summarize(FATALITIES = sum(FATALITIES),
INJURIES = sum(INJURIES))
# THE FOLLOWING BLOCK DOES THE FOLLOWING:
# * Separate health.outcomes by fatalities and injuries
# * Remove event types with 0 total
# * Arrange event types from highest to lowest
# 1. Fatalities
fatalities <- arrange(health.outcomes[, c("EVTYPE", "FATALITIES")],
desc(FATALITIES))
fatalities <- fatalities[fatalities$FATALITIES != 0, ]
# 2. Injuries
injuries <- arrange(health.outcomes[, c("EVTYPE", "INJURIES")],
desc(INJURIES))
injuries <- injuries[injuries$INJURIES != 0, ]
# Visualize the first few rows of fatalities and injuries
# 1. Fatalities
head(fatalities)
## # A tibble: 6 x 2
## EVTYPE FATALITIES
## <fct> <dbl>
## 1 TORNADO 5633
## 2 EXCESSIVE HEAT 1903
## 3 FLASH FLOOD 978
## 4 HEAT 937
## 5 LIGHTNING 816
## 6 TSTM WIND 504
# 2. Injuries
head(injuries)
## # A tibble: 6 x 2
## EVTYPE INJURIES
## <fct> <dbl>
## 1 TORNADO 91346
## 2 TSTM WIND 6957
## 3 FLOOD 6789
## 4 EXCESSIVE HEAT 6525
## 5 LIGHTNING 5230
## 6 HEAT 2100
resolve.real.value <- function (value = 0, modifier = "") {
# make the modifier lowercase
modifier <- tolower(modifier)
# This is a not-so-pretty chain of ifelse but this is the only solution that
# is accepted by any dplyr verbs.
# What I did here was the following:
# if the modifier is a number
# then by our reference on how to handle 'EXP columns, we multiply the
# values by 10
# else ( meaning that the modifier is not a number)
# exponentiate the value by the following rules:
# h = 2
# k = 3
# m = 6
# b = 9
# + = 1
# - = 0
# ? = 0
# empty = 0
ifelse (! is.na(suppressWarnings(as.numeric(modifier))),
value * 10,
ifelse(modifier == "h", value * (10 ^ 2),
ifelse(modifier == "k", value * (10 ^ 3),
ifelse(modifier == "m", value * (10 ^ 6),
ifelse(modifier == "b", value * (10 ^ 9),
ifelse(modifier == "+", value * 1, 0))))))
}
# Add calculated columns(CROP.DAMAGE and PROPERTY.DAMAGE) in the reduced dataset
# by computing the real value amount resolving base value and modifier (see the
# function above)
storm.data.reduced <- storm.data.reduced %>%
mutate(CROP.DAMAGE.COST = resolve.real.value(value = CROPDMG,
modifier = CROPDMGEXP),
PROPERTY.DAMAGE.COST = resolve.real.value(value = PROPDMG,
modifier = PROPDMGEXP))
# Generate boxplot for CROP and PROPERTY
p1 <- ggplot(data = storm.data.reduced,
aes(x = "CROP", y = CROP.DAMAGE.COST)) +
geom_boxplot() +
labs(title = "Crop Damage Boxplot")
p2 <- ggplot(data = storm.data.reduced,
aes(x = "PROPERTY", y = PROPERTY.DAMAGE.COST)) +
geom_boxplot() +
labs(title = "Property Damage Boxplot")
grid.arrange(p1, p2, ncol = 2)
# next line adds border
grid.rect(width = 1, height = 1, gp = gpar(lwd = 2, col = "blue",
fill = NA))
As you can see from the boxplot, there are values that are way off the system of values in both property and crop damages. Printing these rows, we have:
# print suspected outliers for crop damage cost
storm.data.reduced[storm.data.reduced$CROP.DAMAGE.COST > 4e+09, ]
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 198389 RIVER FLOOD 0 0 5 B 5
## 211900 ICE STORM 0 0 500 K 5
## CROPDMGEXP REFNUM CROP.DAMAGE.COST PROPERTY.DAMAGE.COST
## 198389 B 198375 5e+09 5e+09
## 211900 B 211887 5e+09 5e+05
# print suspected outliers for property damage cost
storm.data.reduced[storm.data.reduced$PROPERTY.DAMAGE.COST > 9e+10, ]
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 605953 FLOOD 0 0 115 B 32.5 M
## REFNUM CROP.DAMAGE.COST PROPERTY.DAMAGE.COST
## 605953 605943 32500000 1.15e+11
Some of the values are quite extreme thus, the author investigated if the remarks associated with each suspected outlier match the reported values. Looking at the raw records of these suspected outlier observations are too long that the author did not show result. But we can look at the raw data by running the following lines of code.
# FOR CROP DAMAGE COST
storm.data.raw[storm.data.raw$REFNUM == 198375, ]
storm.data.raw[storm.data.raw$REFNUM == 211887, ]
# FOR PROPERTY DAMAGE COST
storm.data.raw[storm.data.raw$REFNUM == 605953, ]
The first result of the crop damage cost outlier analysis points to the Great Flood of 1993 which by looking at the REMARKS variable and the supplemental Wikipedia page (https://en.wikipedia.org/wiki/Great_Flood_of_1993) is one of the most devastating events with billions of total crop and property damage combined. So the values are justifiable. In the other hand, the last result for this category points to an event in 1994 called the Delta Ice Storm (a supplemental resource can be found here: https://www.weather.gov/jan/1994_deltaicestorm) where crop and property damage are also justified. Therefore, no modifications are necessary for these two observation.
Let us now look at the second category with one suspected outlier. Unfortunately, the reported 115 Billion property damage during the extreme flooding in Downtown Napa does not match with the records where property damage only amounts to 70 million. Therefore, a modification is needed to rectify this mismatch.
# fix the observation with wrong reported property damage
storm.data.reduced[storm.data.reduced$REFNUM == 605943, "PROPDMG"] <- 70
storm.data.reduced[storm.data.reduced$REFNUM == 605943, "PROPDMGEXP"] <- "m"
storm.data.reduced[storm.data.reduced$REFNUM == 605943, "PROPERTY.DAMAGE.COST"] <- 70000000
storm.data.reduced[storm.data.reduced$REFNUM == 605943, ]
## EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 605953 FLOOD 0 0 70 m 32.5 M
## REFNUM CROP.DAMAGE.COST PROPERTY.DAMAGE.COST
## 605953 605943 32500000 7e+07
Now that we have an acceptable analytic data for answering economic questions, we now summarize our data by aggregating event type and computing total damage cost per event type. Like so:
# Add all Crop Damage and Property Damage costs by event type
economic.consequences <- storm.data.reduced %>%
group_by(EVTYPE) %>%
summarize(CROP.TOTAL.DAMAGE = sum(CROP.DAMAGE.COST),
PROPERTY.TOTAL.DAMAGE = sum(PROPERTY.DAMAGE.COST))
# THE FOLLOWING BLOCK DOES THE FOLLOWING:
# * Separate economic consequences by crop and property
# * Remove event types with 0 total damage cost
# * Arrange event types from highest to lowest total damage
# 1. CROPS
economic.consequences.crop <- transmute(economic.consequences, EVTYPE,
CROP.TOTAL.DAMAGE)
economic.consequences.crop <-
economic.consequences.crop[economic.consequences.crop$CROP.TOTAL.DAMAGE != 0,]
economic.consequences.crop <- economic.consequences.crop %>%
arrange(desc(CROP.TOTAL.DAMAGE))
# 2. PROPERTY
economic.consequences.property <- transmute(economic.consequences, EVTYPE,
PROPERTY.TOTAL.DAMAGE)
economic.consequences.property <-
economic.consequences.property[economic.consequences.property$
PROPERTY.TOTAL.DAMAGE != 0, ]
economic.consequences.property <- economic.consequences.property %>%
arrange(desc(PROPERTY.TOTAL.DAMAGE))
In answering the questions as to which types of storm events have caused population hazards and economic loss, the author presents several charts that are divided into categories. For population hazards, we have fatalities and injuries and for the economic loss, it is interesting to know total damage in terms of crops and properties per event type. We generate Pareto charts on the Top 20 storm event types on these four categories and provide analysis for each. The function that returns the Pareto Chart is as follows:
# the following function generates panel plots (two Pareto charts, one per each category)
generate.plot <- function (data, title, xlabel, ylabel, num.of.categories) {
# convert the argument data to a dataframe
data <- data.frame(data)
# extract the column names of the data to generalize reference of columns
column.names <- colnames(data)
# add additional columns for processing
# 1. CUMULATIVE SUM - either cumulative sum of fatalities or injuries OR property or crop total damage
# 2. PROPORTION - compute percentage per EVENT TYPE
# 3. CUMULATIVE.PROPORTION - same as 1 but for PROPORTION
data$CUMULATIVE.SUM <- cumsum(data[, column.names[2]])
data[, column.names[1]] <- factor(data[, column.names[1]],
levels = data[, column.names[1]])
data$PROPORTION <- data[, column.names[2]] / sum(data[, column.names[2]]) * 100
data$CUMULATIVE.PROPORTION <- cumsum(data$PROPORTION)
# We generate two plots, one for sum and one for proportion to show the tick marks of sum and
# proportion in one plot. There is no straightforward way to do it in dplyr as far as the
# author knows but this is does the job
# generate plot for sum
# # Compute how much many percent does the top categories incurred
total.percent <- data[num.of.categories, "CUMULATIVE.PROPORTION"]
tp.subtitle <- paste(round(total.percent), "% OF TOTAL", column.names[2],
" ARE COVERED BY THE EVENT TYPES SHOWN")
p1 <- data[1:num.of.categories,] %>%
ggplot(aes(x = EVTYPE)) +
geom_bar(aes_string(y = column.names[2]), fill = "blue", stat = "identity") +
geom_point(aes(y = CUMULATIVE.SUM, group = 1),
color = "darkblue",
pch=16,
size=2) +
geom_line(aes(y = CUMULATIVE.SUM, group = 1),
colour = "slateblue1",
size = 1) +
theme(axis.text.x = element_text(angle=90, vjust = 0.6)) +
labs(title = title,
subtitle = tp.subtitle,
x = xlabel,
y = ylabel)
# generate plot for proportion
p2 <- data[1:num.of.categories,] %>%
ggplot(aes(x = EVTYPE)) +
geom_bar(aes(y = PROPORTION), fill = "blue", stat = "identity") +
geom_point(aes(y = CUMULATIVE.PROPORTION, group = 1),
color = "darkblue",
pch=16,
size=2) +
geom_line(aes(y = CUMULATIVE.PROPORTION, group = 1),
colour = "slateblue1",
size = 1) +
scale_y_continuous(labels=function(x) paste0(" ", x,"%")) +
theme(panel.background = element_rect(fill = NA))
# The following block puts two plots on top of each other tweaking the location of the
# percent tick marks in the process to move it to the right of the plot and return the result
# of overlapping as one plot.
# extract gtable
g1 <- ggplot_gtable(ggplot_build(p1))
g2 <- ggplot_gtable(ggplot_build(p2))
# overlap the panel of 2nd plot on that of 1st plot
pp <- c(subset(g1$layout, name == "panel", se = t:r))
g <- gtable_add_grob(g1, g2$grobs[[which(g2$layout$name == "panel")]], pp$t,
pp$l, pp$b, pp$l)
# axis tweaks
ia <- which(g2$layout$name == "axis-l")
ga <- g2$grobs[[ia]]
ax <- ga$children[[2]]
ax$widths <- rev(ax$widths)
ax$grobs <- rev(ax$grobs)
ax$grobs[[1]]$x <- ax$grobs[[1]]$x - unit(1, "npc") + unit(0.15, "cm")
g <- gtable_add_cols(g, g2$widths[g2$layout[ia, ]$l], length(g$widths) - 1)
g <- gtable_add_grob(g, ax, pp$t, length(g$widths) - 1, pp$b)
# return the generated plot
g
}
# Generate Pareto Chart for Fatalities Category
plot1 <- generate.plot(fatalities,
title = "FIG 1A. TOP 20 EVENTS AND THEIR EFFECT ON POPULATION HEALTH - FATALITIES",
xlabel = "EVENT TYPES",
ylabel = "CASUALTIES",
num.of.categories = 20)
# Injuries
plot2 <- generate.plot(injuries,
title = "FIG 1B. TOP 20 EVENTS AND THEIR EFFECT ON POPULATION HEALTH - INJURIES",
xlabel = "EVENT TYPES",
ylabel = "CASUALTIES",
num.of.categories = 20)
grid.arrange(plot1, plot2)
# next line adds border
grid.rect(width = 1, height = 1, gp = gpar(lwd = 2, col = "blue",
fill = NA))
Looking at Figure 1A, we see that TORNADO and EXCESSIVE.HEAT are the two main cause of fatalities with roughly 37% and 12% of the total death casualties respectively. All of the EVENT TYPES shown in this figure encompass 89% of the total death casualties.
Looking at Figure 1B, we see that TORNADO and TSTM WIND are the two main cause of injuries. What’s interesting here is that roughly 70% of the total injured casualties are caused by tornadoes (100000 if we roughly translate from the graph). All of the EVENT TYPES shown in this figure amounts to 96% of the total damage reported by this dataset from 1950 to 2011.
# Effects of Storms on Crops
plot1 <- generate.plot(economic.consequences.crop,
title = "FIG 2A. TOP 20 EVENTS AND THEIR EFFECT ON THE ECONOMY - CROPS",
xlabel = "EVENT TYPES",
ylabel = "COST ($)",
num.of.categories = 20)
# Effects of Storm Events on Properties
plot2 <- generate.plot(economic.consequences.property,
title = "FIGURE 2B. TOP 20 EVENTS AND THEIR EFFECT ON THE ECONOMY - PROPERTIES",
xlabel = "EVENT TYPES",
ylabel = "COST ($)",
num.of.categories = 20)
grid.arrange(plot1, plot2)
# next line adds border
grid.rect(width = 1, height = 1, gp = gpar(lwd = 2, col = "blue",
fill = NA))
By observing the first Pareto Chart on the above plot (FIG 2B), we can see that roughly 10.3 Billion dollars in crop damages are caused by DROUGHTS which accounted for roughly 23% of the total crop damage between 1950 and November 2011. Additionally, FLOOD damage caused about 5 Billion dollars in damages or roughly 15 percent of the total crop damages.
In the other hand, property damage were largely caused by HURRICANE/TYPHOON with reported cost of 50 Billion dollars in total damages from 1950 to 2011. Close on the second is TORNADO with 50 Billion. These two event types accounted for 20% and 15% of the total property damage respectively from 1950 to 2011.
In dollars, event types reported by both figures encompass 96% of all the damages reported by this dataset from 1950 to 2011.
Result shows that TORNADO, EXCESSIVE HEAT and TSTM WIND were the three dominant EVENT TYPES that caused harm to US population while DROUGHT, FLOOD, HURRICANE/TYPHOON and TORNADO were four EVENT TYPES that largely affected the economy in terms of the amount of crops and properties destroyed.
With this type of analysis, the author observed that looking at the long term effects of storm in US Population Health and Economy is an evocation of serious problems caused by these natural disasters. Disaster preparedness with respect to these types of natural phenomenon as well as robust state laws are necessary to lessen if not mitigate their serious impact to US economy and US citizens.