Synopsis

Looking at the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database which recorded storm data from 1950 to November of 2011, the author is interested in answering questions as to what are the effects of these recorded events in terms of population safety and the economy by looking at the casualties incurred and the amount of property or crop damage respectively.

By virtue of literate statistical programming, the author has compelled to include processes important to the pipelined approach of answering questions with data from data processing, data visualization and analysis and result, all of which are outlined in their respective sections below.

Result shows that TORNADO, EXCESSIVE HEAT and TSTM WIND are the three dominant EVENT TYPES that caused harm to US population while DROUGHT, FLOOD, HURRICANE/TYPHOON and TORNADO are four EVENT TYPES that largely affected the economy in terms of the amount of crops and properties destroyed.

Data Processing

In this section, the author preprocess the data in such a way that it would be ideal in answering the following questions:

  1. Across the United States, which types of events, are the most harmful with respect to population health. This can be answered by looking at the fatalities and injuries per event type.

  2. Across the United States, which types of events have the greatest economic consequences? We can answer this by looking at the following variables:

    • PROPDMG - for property damage base value

    • PROPDMGEXP - Property damage base value modifier

    • CROPDMG - Crop damage base value

    • CROPDMGEXP - Crop damage base value modifier

The rules on how to resolve the real value costs can be found here: https://rstudio-pubs-static.s3.amazonaws.com/58957_37b6723ee52b455990e149edde45e5b6.html.

Generally, the dataset must be wrangled into two parts - (1) effects of storm events on population health and (2) effects of storm events on the economy. All of the data manipulations are done using dplyr techniques.

# load necessary packages
library(dplyr)
library(ggplot2)
library(gtable)
library(grid)
library(gridExtra)
# load dataset
storm.data.raw <- read.csv("./repdata%2Fdata%2FStormData.csv.bz2")

# reduce dataset columns to useful ones
columns <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG","PROPDMGEXP", 
             "CROPDMG", "CROPDMGEXP", "REFNUM")
storm.data.reduced <- storm.data.raw[, columns]

# sample dataset to minimize for testing
storm.data.sample <- sample_n(storm.data.reduced, 300)

Effects of Storm Events on Population Health

# Add all recorded fatalities and injuries by event type
health.outcomes <- storm.data.reduced %>%
  group_by(EVTYPE) %>%
  summarize(FATALITIES = sum(FATALITIES),
            INJURIES = sum(INJURIES))

# THE FOLLOWING BLOCK DOES THE FOLLOWING:
# * Separate health.outcomes by fatalities and injuries
#   * Remove event types with 0 total
#   * Arrange event types from highest to lowest
# 1. Fatalities
fatalities <- arrange(health.outcomes[, c("EVTYPE", "FATALITIES")], 
                      desc(FATALITIES))
fatalities <- fatalities[fatalities$FATALITIES != 0, ]
# 2. Injuries
injuries <- arrange(health.outcomes[, c("EVTYPE", "INJURIES")], 
                    desc(INJURIES))
injuries <- injuries[injuries$INJURIES != 0, ]

# Visualize the first few rows of fatalities and injuries
# 1. Fatalities
head(fatalities)
## # A tibble: 6 x 2
##   EVTYPE         FATALITIES
##   <fct>               <dbl>
## 1 TORNADO              5633
## 2 EXCESSIVE HEAT       1903
## 3 FLASH FLOOD           978
## 4 HEAT                  937
## 5 LIGHTNING             816
## 6 TSTM WIND             504
# 2. Injuries
head(injuries)
## # A tibble: 6 x 2
##   EVTYPE         INJURIES
##   <fct>             <dbl>
## 1 TORNADO           91346
## 2 TSTM WIND          6957
## 3 FLOOD              6789
## 4 EXCESSIVE HEAT     6525
## 5 LIGHTNING          5230
## 6 HEAT               2100

Effects of Storm Events on the Economy

resolve.real.value <- function (value = 0, modifier = "") {
  # make the modifier lowercase
  modifier <- tolower(modifier)
  # This is a not-so-pretty chain of ifelse but this is the only solution that
  # is accepted by any dplyr verbs. 
  # What I did here was the following:
  # if the modifier is a number
  #   then by our reference on how to handle 'EXP columns, we multiply the 
  #   values by 10
  # else ( meaning that the modifier is not a number)
  #   exponentiate the value by the following rules:
  #     h = 2
  #     k = 3
  #     m = 6
  #     b = 9
  #     + = 1
  #     - = 0
  #     ? = 0
  #     empty = 0
  ifelse (! is.na(suppressWarnings(as.numeric(modifier))),
          value * 10,
          ifelse(modifier == "h", value * (10 ^ 2),
                 ifelse(modifier == "k", value * (10 ^ 3),
                        ifelse(modifier == "m", value * (10 ^ 6),
                               ifelse(modifier == "b", value * (10 ^ 9), 
                                      ifelse(modifier == "+", value * 1, 0))))))
}

# Add calculated columns(CROP.DAMAGE and PROPERTY.DAMAGE) in the reduced dataset
# by computing the real value amount resolving base value and modifier (see the 
# function above)
storm.data.reduced <- storm.data.reduced %>%
  mutate(CROP.DAMAGE.COST = resolve.real.value(value = CROPDMG, 
                                          modifier = CROPDMGEXP),
         PROPERTY.DAMAGE.COST = resolve.real.value(value = PROPDMG, 
                                              modifier = PROPDMGEXP))

# Generate boxplot for CROP and PROPERTY
p1 <- ggplot(data = storm.data.reduced, 
       aes(x = "CROP", y = CROP.DAMAGE.COST)) +
  geom_boxplot() +
  labs(title = "Crop Damage Boxplot")
p2 <- ggplot(data = storm.data.reduced, 
       aes(x = "PROPERTY", y = PROPERTY.DAMAGE.COST)) +
  geom_boxplot() +
  labs(title = "Property Damage Boxplot")

grid.arrange(p1, p2, ncol = 2)

# next line adds border
grid.rect(width = 1, height = 1, gp = gpar(lwd = 2, col = "blue", 
                                               fill = NA))

As you can see from the boxplot, there are values that are way off the system of values in both property and crop damages. Printing these rows, we have:

# print suspected outliers for crop damage cost
storm.data.reduced[storm.data.reduced$CROP.DAMAGE.COST > 4e+09, ]
##             EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 198389 RIVER FLOOD          0        0       5          B       5
## 211900   ICE STORM          0        0     500          K       5
##        CROPDMGEXP REFNUM CROP.DAMAGE.COST PROPERTY.DAMAGE.COST
## 198389          B 198375            5e+09                5e+09
## 211900          B 211887            5e+09                5e+05
# print suspected outliers for property damage cost
storm.data.reduced[storm.data.reduced$PROPERTY.DAMAGE.COST > 9e+10, ]
##        EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 605953  FLOOD          0        0     115          B    32.5          M
##        REFNUM CROP.DAMAGE.COST PROPERTY.DAMAGE.COST
## 605953 605943         32500000             1.15e+11

Some of the values are quite extreme thus, the author investigated if the remarks associated with each suspected outlier match the reported values. Looking at the raw records of these suspected outlier observations are too long that the author did not show result. But we can look at the raw data by running the following lines of code.

# FOR CROP DAMAGE COST
storm.data.raw[storm.data.raw$REFNUM == 198375, ]
storm.data.raw[storm.data.raw$REFNUM == 211887, ]

# FOR PROPERTY DAMAGE COST
storm.data.raw[storm.data.raw$REFNUM == 605953, ]

The first result of the crop damage cost outlier analysis points to the Great Flood of 1993 which by looking at the REMARKS variable and the supplemental Wikipedia page (https://en.wikipedia.org/wiki/Great_Flood_of_1993) is one of the most devastating events with billions of total crop and property damage combined. So the values are justifiable. In the other hand, the last result for this category points to an event in 1994 called the Delta Ice Storm (a supplemental resource can be found here: https://www.weather.gov/jan/1994_deltaicestorm) where crop and property damage are also justified. Therefore, no modifications are necessary for these two observation.

Let us now look at the second category with one suspected outlier. Unfortunately, the reported 115 Billion property damage during the extreme flooding in Downtown Napa does not match with the records where property damage only amounts to 70 million. Therefore, a modification is needed to rectify this mismatch.

# fix the observation with wrong reported property damage
storm.data.reduced[storm.data.reduced$REFNUM == 605943, "PROPDMG"] <- 70
storm.data.reduced[storm.data.reduced$REFNUM == 605943, "PROPDMGEXP"] <- "m"
storm.data.reduced[storm.data.reduced$REFNUM == 605943, "PROPERTY.DAMAGE.COST"] <- 70000000
storm.data.reduced[storm.data.reduced$REFNUM == 605943, ]
##        EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 605953  FLOOD          0        0      70          m    32.5          M
##        REFNUM CROP.DAMAGE.COST PROPERTY.DAMAGE.COST
## 605953 605943         32500000                7e+07

Now that we have an acceptable analytic data for answering economic questions, we now summarize our data by aggregating event type and computing total damage cost per event type. Like so:

# Add all Crop Damage and Property Damage costs by event type
economic.consequences <- storm.data.reduced %>%
  group_by(EVTYPE) %>%
  summarize(CROP.TOTAL.DAMAGE = sum(CROP.DAMAGE.COST),
            PROPERTY.TOTAL.DAMAGE = sum(PROPERTY.DAMAGE.COST))

# THE FOLLOWING BLOCK DOES THE FOLLOWING:
# * Separate economic consequences by crop and property
#   * Remove event types with 0 total damage cost
#   * Arrange event types from highest to lowest total damage 
# 1. CROPS
economic.consequences.crop <- transmute(economic.consequences, EVTYPE, 
                                        CROP.TOTAL.DAMAGE)
economic.consequences.crop <- 
  economic.consequences.crop[economic.consequences.crop$CROP.TOTAL.DAMAGE != 0,]
economic.consequences.crop <- economic.consequences.crop %>%
  arrange(desc(CROP.TOTAL.DAMAGE))

# 2. PROPERTY
economic.consequences.property <- transmute(economic.consequences, EVTYPE, 
                                        PROPERTY.TOTAL.DAMAGE)
economic.consequences.property <- 
  economic.consequences.property[economic.consequences.property$
                                   PROPERTY.TOTAL.DAMAGE != 0, ]
economic.consequences.property <- economic.consequences.property %>%
  arrange(desc(PROPERTY.TOTAL.DAMAGE))

Data Analysis

In answering the questions as to which types of storm events have caused population hazards and economic loss, the author presents several charts that are divided into categories. For population hazards, we have fatalities and injuries and for the economic loss, it is interesting to know total damage in terms of crops and properties per event type. We generate Pareto charts on the Top 20 storm event types on these four categories and provide analysis for each. The function that returns the Pareto Chart is as follows:

# the following function generates panel plots (two Pareto charts, one per each category)
generate.plot <- function (data, title, xlabel, ylabel, num.of.categories) {
  # convert the argument data to a dataframe
  data <- data.frame(data)
  # extract the column names of the data to generalize reference of columns
  column.names <- colnames(data)
  # add additional columns for processing
  # 1. CUMULATIVE SUM - either cumulative sum of fatalities or injuries OR property or crop total damage
  # 2. PROPORTION - compute percentage per EVENT TYPE
  # 3. CUMULATIVE.PROPORTION - same as 1 but for PROPORTION
  data$CUMULATIVE.SUM <- cumsum(data[, column.names[2]])
  data[, column.names[1]] <- factor(data[, column.names[1]], 
                                    levels = data[, column.names[1]])
  data$PROPORTION <- data[, column.names[2]] / sum(data[, column.names[2]]) * 100
  data$CUMULATIVE.PROPORTION <- cumsum(data$PROPORTION)
  
  # We generate two plots, one for sum and one for proportion to show the tick marks of sum and 
  # proportion in one plot. There is no straightforward way to do it in dplyr as far as the 
  # author knows but this is does the job 
  # generate plot for sum
  
  # # Compute how much many percent does the top categories incurred
  total.percent <- data[num.of.categories, "CUMULATIVE.PROPORTION"]
  tp.subtitle <- paste(round(total.percent), "% OF TOTAL", column.names[2], 
                       " ARE COVERED BY THE EVENT TYPES SHOWN")
  
  p1 <- data[1:num.of.categories,] %>%
    ggplot(aes(x = EVTYPE)) +
    geom_bar(aes_string(y = column.names[2]), fill = "blue", stat = "identity") +
    geom_point(aes(y = CUMULATIVE.SUM, group = 1), 
               color = "darkblue", 
               pch=16, 
               size=2) +
    geom_line(aes(y = CUMULATIVE.SUM, group = 1), 
              colour = "slateblue1", 
              size = 1) +
    theme(axis.text.x = element_text(angle=90, vjust = 0.6)) +
    labs(title = title, 
       subtitle = tp.subtitle, 
       x = xlabel, 
       y = ylabel)
  
  # generate plot for proportion
  p2 <- data[1:num.of.categories,] %>%
    ggplot(aes(x = EVTYPE)) +
    geom_bar(aes(y = PROPORTION), fill = "blue", stat = "identity") +
    geom_point(aes(y = CUMULATIVE.PROPORTION, group = 1), 
               color = "darkblue", 
               pch=16, 
               size=2) +
    geom_line(aes(y = CUMULATIVE.PROPORTION, group = 1), 
              colour = "slateblue1", 
              size = 1) +
    scale_y_continuous(labels=function(x) paste0(" ", x,"%")) +
    theme(panel.background = element_rect(fill = NA))
  
  # The following block puts two plots on top of each other tweaking the location of the 
  # percent tick marks in the process to move it to the right of the plot and return the result 
  # of overlapping as one plot.
    # extract gtable
    g1 <- ggplot_gtable(ggplot_build(p1))
    g2 <- ggplot_gtable(ggplot_build(p2))
    # overlap the panel of 2nd plot on that of 1st plot
    pp <- c(subset(g1$layout, name == "panel", se = t:r))
    g <- gtable_add_grob(g1, g2$grobs[[which(g2$layout$name == "panel")]], pp$t, 
        pp$l, pp$b, pp$l)
    # axis tweaks
    ia <- which(g2$layout$name == "axis-l")
    ga <- g2$grobs[[ia]]
    ax <- ga$children[[2]]
    ax$widths <- rev(ax$widths)
    ax$grobs <- rev(ax$grobs)
    ax$grobs[[1]]$x <- ax$grobs[[1]]$x - unit(1, "npc") + unit(0.15, "cm")
    g <- gtable_add_cols(g, g2$widths[g2$layout[ia, ]$l], length(g$widths) - 1)
    g <- gtable_add_grob(g, ax, pp$t, length(g$widths) - 1, pp$b)

    # return the generated plot
    g
}

Effects on population health

# Generate Pareto Chart for Fatalities Category
plot1 <- generate.plot(fatalities, 
                       title = "FIG 1A. TOP 20 EVENTS AND THEIR EFFECT ON POPULATION HEALTH - FATALITIES", 
                       xlabel = "EVENT TYPES", 
                       ylabel = "CASUALTIES",
                       num.of.categories = 20)

# Injuries
plot2 <- generate.plot(injuries, 
                       title = "FIG 1B. TOP 20 EVENTS AND THEIR EFFECT ON POPULATION HEALTH - INJURIES", 
                       xlabel = "EVENT TYPES", 
                       ylabel = "CASUALTIES", 
                       num.of.categories = 20)

grid.arrange(plot1, plot2)

# next line adds border
grid.rect(width = 1, height = 1, gp = gpar(lwd = 2, col = "blue", 
                                               fill = NA))

Looking at Figure 1A, we see that TORNADO and EXCESSIVE.HEAT are the two main cause of fatalities with roughly 37% and 12% of the total death casualties respectively. All of the EVENT TYPES shown in this figure encompass 89% of the total death casualties.

Looking at Figure 1B, we see that TORNADO and TSTM WIND are the two main cause of injuries. What’s interesting here is that roughly 70% of the total injured casualties are caused by tornadoes (100000 if we roughly translate from the graph). All of the EVENT TYPES shown in this figure amounts to 96% of the total damage reported by this dataset from 1950 to 2011.

Economic Consequences

# Effects of Storms on Crops
plot1 <- generate.plot(economic.consequences.crop, 
                       title = "FIG 2A. TOP 20 EVENTS AND THEIR EFFECT ON THE ECONOMY - CROPS",
                       xlabel = "EVENT TYPES", 
                       ylabel = "COST ($)",
                       num.of.categories = 20)

# Effects of Storm Events on Properties
plot2 <- generate.plot(economic.consequences.property, 
                       title = "FIGURE 2B. TOP 20 EVENTS AND THEIR EFFECT ON THE ECONOMY - PROPERTIES",
                       xlabel = "EVENT TYPES", 
                       ylabel = "COST ($)",
                       num.of.categories = 20)

grid.arrange(plot1, plot2)

# next line adds border
grid.rect(width = 1, height = 1, gp = gpar(lwd = 2, col = "blue", 
                                               fill = NA))

By observing the first Pareto Chart on the above plot (FIG 2B), we can see that roughly 10.3 Billion dollars in crop damages are caused by DROUGHTS which accounted for roughly 23% of the total crop damage between 1950 and November 2011. Additionally, FLOOD damage caused about 5 Billion dollars in damages or roughly 15 percent of the total crop damages.

In the other hand, property damage were largely caused by HURRICANE/TYPHOON with reported cost of 50 Billion dollars in total damages from 1950 to 2011. Close on the second is TORNADO with 50 Billion. These two event types accounted for 20% and 15% of the total property damage respectively from 1950 to 2011.

In dollars, event types reported by both figures encompass 96% of all the damages reported by this dataset from 1950 to 2011.

Results

Result shows that TORNADO, EXCESSIVE HEAT and TSTM WIND were the three dominant EVENT TYPES that caused harm to US population while DROUGHT, FLOOD, HURRICANE/TYPHOON and TORNADO were four EVENT TYPES that largely affected the economy in terms of the amount of crops and properties destroyed.

With this type of analysis, the author observed that looking at the long term effects of storm in US Population Health and Economy is an evocation of serious problems caused by these natural disasters. Disaster preparedness with respect to these types of natural phenomenon as well as robust state laws are necessary to lessen if not mitigate their serious impact to US economy and US citizens.