Synopsis

Storms and other severe weather events can cause both public health (fatalities, injuries) and economic problems (property, crop damage) for a population. This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This documents the occurrence of storms and other significant weather phenomena having sufficient intensity to cause loss of life, injuries, significant property damage, and/or disruption to commerce.

Tornado seems to have the greatest health impact both in terms of fatalities and injuries. It is followed by Thunderstorm Wind in terms of injuries, and Excessive Heat in terms of fatalities. The top ten causes of injuries and fatalities are largely the same.

Flood has caused the most property damage while Drought has caused the most crop damage. What a contrast!!

Load Required Libraries

library(dplyr)
library(ggplot2)
library(reshape2)

Loading and Reading Data

fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if(!file.exists("data")) {
    dir.create("data")
    download.file(fileUrl, destfile = "data/stormdata.csv.bz2")
}

## A total of 902,297 observations
sdata_f <- read.csv("data/stormdata.csv.bz2")

Data Transformations

Filter Out Non-impact Observations

Subset the observations which have health or economic impacts. The original database has 902,297 observations.

################################################################################
##
## We don't need data where there are no fatalities/injuries or economic impact
## Left with 254,633 observations only (647,664 less observations to worry)
##
################################################################################
sdata <- subset(sdata_f,
                FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0)

After filtering out the non-impact observation, we are left with only 254,633 observations!! Significantly less observations to analyze.

Transform Event Type to Permitted Storm Data Events

The storm data has been transformed using the National Weather Service Instruction 10-1605, August 17, 2007\(^{[1]}\). The events in the database had to be conformed to the permitted storm data events. Some of the instructions to highlight are:

  1. A gustnado is a small, and usually weak whirlwind which forms as an eddy in thunderstorm outflows. They do not connect with any cloud-base rotation and are not tornadoes. Since their origin is associated with cumuliform clouds, gustnadoes will be classified as Thunderstorm Wind events.
  2. Unusually large amount of rain which does not cause a flash flood or flood, but causes damage, e.g., roof collapse or other human/economic impact. Heavy rain situations, resulting in urban and/or small stream flooding, should be classified as a Heavy Rain event.
  3. The event name of Landslide was renamed to Debris Flow.
  4. When events such as mudslides or lahars are caused primarily by volcanic activity, or when rainfall is not the primary cause, then document them as a Debris Flow.
  5. Downbursts, including dry, or wet, microbursts or macrobursts, will be classified as Thunderstorm Wind events.

There are a lot of free flow texts not conforming to the permitted storm data events appear in the database including the use of abbreviations and spelling mistakes. The best effort has been made to transform all these transcription errors to the permitted events. Below is the transformation table that has been used for the same.

################################################################################
##
## Create a substitution list containing "from" and "to" columns. The "from"
## contains the regular expressions that will be "grep"ed in the database and
## will be substituted by the permitted storm data even in the respective "to"
## list.
##
################################################################################
subsList <- rbind(c("TSTM|THUN|TUND|THUD|BURST",   "THUNDERSTORM WIND"),
                  c("STORM|GUSTNADO",              "THUNDERSTORM WIND"),
                  c("FIRE",                        "WILDFIRE"),
                  c("WINTRY MIX|RAIN/SNOW",        "SLEET"),
                  c("HEAVY MIX|MIXED PREC",        "SLEET"),
                  c("URBAN|HEAVY RAIN",            "HEAVY RAIN"),
                  c("HEAVY SWE|HEAVY SEA",         "HEAVY RAIN"),
                  c("HEAVY SHO|^RAIN(.*)",         "HEAVY RAIN"),
                  c("RECORD(.*)RAIN|UNSE(.*)RAIN", "HEAVY RAIN"),
                  c("HVY(.*)RAIN|TORR(.*)RAIN",    "HEAVY RAIN"),
                  c("HEAVY PREC|EXCES(.*)RAIN",    "HEAVY RAIN"),
                  c("HURRICANE|TYPHOON",           "HURRICANE(TYPHOON)"),
                  c("HIGH WIND|^WIND(.*)",         "HIGH WIND"),
                  c("TORN|WHIRL",                  "TORNADO"),
                  c("FLASH|RIVER|DAM BREAK",       "FLASH FLOOD"),
                  c("MAJ(.*)FLOOD|RUR(.*)FLOOD",   "FLASH FLOOD"),
                  c("STR(.*)FLOOD|LAK(.*)FLOOD",   "LAKESHORE FLOOD"),
                  c("COASTAL",                     "COASTAL FLOOD"),
                  c("^FLOOD|FLOODING",             "FLOOD"),
                  c("HAIL",                        "HAIL"),
                  c("BLIZZARD",                    "BLIZZARD"),
                  c("EXTENDED(.*)COLD",            "EXTREME COLD/WIND CHILL"),
                  c("EXTREME(.*)COLD",             "EXTREME COLD/WIND CHILL"),
                  c("EXTREME(.*)WIND",             "EXTREME COLD/WIND CHILL"),
                  c("RECORD(.*)COLD",              "EXTREME COLD/WIND CHILL"),
                  c("UNSEAS(.*)COLD|HYPOTHERM",    "EXTREME COLD/WIND CHILL"),
                  c("FOG",                         "DENSE FOG"),
                  c("STORM SURGE",                 "STORM SURGE/TIDE"),
                  c("HEAVY(.*)SNOW|RECORD SNOW",   "HEAVY SNOW"),
                  c("^SNOW(.*)|BLOWING(.*)SNOW",   "HEAVY SNOW"),
                  c("LATE(.*)SNOW|EXCE(.*)SNOW",   "HEAVY SNOW"),
                  c("SLIDE|LAND|EROS",             "DEBRIS FLOW"),
                  c("GUSTY|STRONG WIND",           "STRONG WIND"),
                  c("HYPERTH|UNS(.*)WARM",         "EXCESSIVE HEAT"),
                  c("RECORD HEAT|EXCE(.*)HEAT",    "EXCESSIVE HEAT"),
                  c("EXTREME HEAT",                "EXCESSIVE HEAT"),
                  c("HIGH TIDE",                   "STORM SURGE/TIDE"),
                  c("LIGHT SNOW|LAKE(.*)SNOW",     "LAKE-EFFECT SNOW"),
                  c("TROPICAL STORM",              "TROPICAL STORM"),
                  c("HEAT WAVE|WARM",              "HEAT"),
                  c("WINTER STORM",                "WINTER STORM"),
                  c("ICE",                         "ICE STORM"),
                  c("SURF|^HIGH(.*)S|^HIGH(.*)W",  "HIGH SURF"),
                  c("SLEET",                       "SLEET"),
                  c("FREEZ|ICY ROADS|FROST",       "FROST/FREEZE"),
                  c("WINTER WEATHER|SQUALL",       "WINTER WEATHER"),
                  c("WATERSPOUT",                  "WATERSPOUT"),
                  c("^COLD(.*)|LOW(.*)TEMP",       "COLD/WIND CHILL"),
                  c("RIP CURRENT",                 "RIP CURRENT"),
                  c("GLAZE",                       "FREEZING FOG"),
                  c("GRADIENT WIND",               "TROPICAL DEPRESSION"),
                  c("LIG",                         "LIGHTNING"),
                  c("DUST",                        "DUST STORM"),
                  c("MARINE|ROUGH SEAS|ROGUE WAV", "MARINE STRONG WIND"),
                  c("AVALAN",                      "AVALANCHE")
                  )
colnames(subsList) <- c("from", "to")
for(i in 1:nrow(subsList)) {
    sdata$EVTYPE[grep(subsList[i, "from"],
                      sdata$EVTYPE, ignore.case = TRUE)] <- subsList[i, "to"]
}

Health Impact

Data Processing

Let’s first examine the leading causes of deaths.

################################################################################
##
## Filter out the fatalities data, group them by events, and get the total
## fatalities for each of these groups.
##
################################################################################
fatalities_t <-
sdata %>%
    select(EVTYPE, FATALITIES) %>%
    group_by(EVTYPE) %>%
    summarise(fatalities = sum(FATALITIES))

The top ten causes of death are:

################################################################################
##
## Arrange the fatalities table in decreasing order of fatalities and display
## the first 10 observations.
##
################################################################################
arrange(fatalities_t, -fatalities)[1:10, ]
## # A tibble: 10 x 2
##    EVTYPE                  fatalities
##    <chr>                        <dbl>
##  1 TORNADO                       5637
##  2 EXCESSIVE HEAT                2061
##  3 THUNDERSTORM WIND             1182
##  4 HEAT                          1118
##  5 FLASH FLOOD                   1040
##  6 LIGHTNING                      817
##  7 RIP CURRENT                    572
##  8 HIGH SURF                      496
##  9 FLOOD                          477
## 10 EXTREME COLD/WIND CHILL        316

What are the leading causes of injuries?

################################################################################
##
## Filter out the injuries data, group them by events, and get the total
## injuries for each of these groups.
##
################################################################################
injuries_t <-
sdata %>%
    select(EVTYPE, INJURIES) %>%
    group_by(EVTYPE) %>%
    summarise(injuries = sum(INJURIES))

The top ten causes of injuries are:

################################################################################
##
## Arrange the injuries table in decreasing order of injuries and display the
## first 10 observations.
##
################################################################################
arrange(injuries_t, -injuries)[1:10, ]
## # A tibble: 10 x 2
##    EVTYPE            injuries
##    <chr>                <dbl>
##  1 TORNADO              91407
##  2 THUNDERSTORM WIND    13787
##  3 FLOOD                 6792
##  4 EXCESSIVE HEAT        6747
##  5 LIGHTNING             5231
##  6 HEAT                  2496
##  7 HIGH SURF             1849
##  8 FLASH FLOOD           1803
##  9 WILDFIRE              1608
## 10 HAIL                  1371

The number of injuries caused by Tornado is much higher than others. Let’s note this figure as this is kind of an outlier compared to others.

################################################################################
##
## Note the unusually higher number of injuries caused by TORNADO. This is kind
## of an outlier compared to others.
##
################################################################################
tornado_inj <- filter(injuries_t, EVTYPE =="TORNADO")[["injuries"]]

Health Impact Results

We notice that the top ten causes of injuries and deaths are roughly the same. So we will merge the two tables - fatalities and injuries and plot them together.

################################################################################
##
## Merge the two tables - fatalities and injuries. Melt the top ten observations
## so that both the fatalities and injuries can be plotted on to the same plot.
##
################################################################################
health_impact <-
merge(fatalities_t, injuries_t) %>%
  arrange(-fatalities)

health_impact_plot <- melt(health_impact[1:10, ])
## Using EVTYPE as id variables
g <- ggplot(health_impact_plot, aes(reorder(EVTYPE, -value),
                                    value, fill = variable))

The injuries caused by Tornado was an outlier. So plot the graph, zooming in to other observations and just labeling the Tornado injuries. This gives a better graph which would otherwise would have been squished just because of one data point.

################################################################################
##
## Since the injuries caused by tornado is an outlier, we will zoom in to other
## observations by limiting the "ylim" of the plot and just label the outlier.
##
################################################################################
g + geom_bar(stat = "identity", position = "dodge") +
    coord_cartesian(ylim = c(0, 15000)) +
    geom_text(x = 1.25, y = 15400, label = tornado_inj, col = "brown") +
    scale_x_discrete(guide = guide_axis(angle = 60)) +
    labs(fill = "Impact Type") +
    labs(x = "", y = "Health Impact on the Population") +
    labs(title = "Health Impact of Various Natural Calamities on the Population")

Economic Impact

Data Processing

################################################################################
##
## Select only the economic impact related columns from the database
##
################################################################################
damages <-
sdata %>%
    select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

################################################################################
##
## The damages PROPDMG and CROPDMG are supplemented with PROPDMGEXP and
## CROPDMGEXP respectively. These contain the expressions (like K, M, B, etc.)
## to denote by how much should the ?ROPDMG value be multiplied to get the
## effective damage values.
##
## Get all the unique expressions that are possible and sort them
##
################################################################################
dmg_exp <-
sdata$PROPDMGEXP %>%
    unique %>%
    sort

################################################################################
##
## These unique expressions correspond to these numbers that needs to be
## multiplied.
##
################################################################################
mult <- c(1, 1, 1, 10^0, 10^2, 10^3, 10^4, 10^5, 10^6,
          10^7, 10^9, 10^2, 10^2, 10^3, 10^6, 10^6)

The expression and there corresponding multipliers are:

mult_table <- data.frame(dmg_exp, mult)
mult_table
##    dmg_exp  mult
## 1          1e+00
## 2        - 1e+00
## 3        + 1e+00
## 4        0 1e+00
## 5        2 1e+02
## 6        3 1e+03
## 7        4 1e+04
## 8        5 1e+05
## 9        6 1e+06
## 10       7 1e+07
## 11       B 1e+09
## 12       h 1e+02
## 13       H 1e+02
## 14       K 1e+03
## 15       m 1e+06
## 16       M 1e+06
################################################################################
##
## Substitute the expression with their corresponding multiplier values for both
## property and crop related expressions.
##
################################################################################
damages$PROPDMGEXP <- mult_table$mult[match(damages$PROPDMGEXP,
                                            mult_table$dmg_exp)]
damages$CROPDMGEXP <- mult_table$mult[match(damages$CROPDMGEXP,
                                            mult_table$dmg_exp)]

################################################################################
##
## Calculate the total property and crop damage by multiplying the values with
## corresponding multipliers grouped by event. The damage values are calculated
## in USD billions.
##
################################################################################
economic_impact <-
damages %>%
    mutate(PROPDMG = (PROPDMG * PROPDMGEXP)) %>%
    mutate(CROPDMG = (CROPDMG * CROPDMGEXP)) %>%
    group_by(EVTYPE) %>%
    summarise(propdmg = sum(PROPDMG)/(10^9), cropdmg = sum(CROPDMG)/(10^9))
## `summarise()` ungrouping output (override with `.groups` argument)

Get the leading cause of property damage and crop damage.

######################################################
## Get the leading cause of property and crop damage
#####################################################
arrange(economic_impact, -propdmg)[1:10, ]
## # A tibble: 10 x 3
##    EVTYPE             propdmg cropdmg
##    <chr>                <dbl>   <dbl>
##  1 FLOOD               145.     5.79 
##  2 HURRICANE(TYPHOON)   85.4    5.52 
##  3 THUNDERSTORM WIND    79.4   NA    
##  4 TORNADO              57.0    0.415
##  5 FLASH FLOOD          22.9   NA    
##  6 HAIL                 15.7   NA    
##  7 WILDFIRE              8.50   0.403
##  8 HIGH SURF             6.11  NA    
##  9 HEAVY RAIN            3.34   0.816
## 10 DROUGHT               1.05  14.0
arrange(economic_impact, -cropdmg)[1:10, ]
## # A tibble: 10 x 3
##    EVTYPE                    propdmg cropdmg
##    <chr>                       <dbl>   <dbl>
##  1 DROUGHT                   1.05     14.0  
##  2 FLOOD                   145.        5.79 
##  3 HURRICANE(TYPHOON)       85.4       5.52 
##  4 FROST/FREEZE              0.0283    2.00 
##  5 EXTREME COLD/WIND CHILL   0.133     1.36 
##  6 HEAVY RAIN                3.34      0.816
##  7 EXCESSIVE HEAT            0.00787   0.497
##  8 TORNADO                  57.0       0.415
##  9 HEAT                      0.0125    0.407
## 10 WILDFIRE                  8.50      0.403

We notice that the leading cause of property damage and crop damage are different. So it’s better to examine them separately.

Property Damage: Results

################################################################################
##
## Extract just the property damage figures, arrange them in descending order of
## impact, and plot it.
##
################################################################################
propdmg_t <-
economic_impact %>%
    select(EVTYPE, propdmg) %>%
    arrange(-propdmg)

g <- ggplot(propdmg_t[1:10, ], aes(reorder(EVTYPE, -propdmg),
                                   propdmg, fill = EVTYPE))
g + geom_bar(stat = "identity") +
    scale_x_discrete(guide = guide_axis(angle = 60)) +
    theme(legend.position = "none") +
    labs(x = "", y = "Property Damage (in USD Billion)") +
    labs(title = "Impact of Various Natural Calamities on Property")

The leading cause of property damage is Flood followed by Hurricane(Typhoon) and Thunderstorm Wind.

Crop Damage: Results

################################################################################
##
## Extract just the crop damage figures, arrange them in descending order of
## impact, and plot it.
##
################################################################################
cropdmg_t <-
economic_impact %>%
    select(EVTYPE, cropdmg) %>%
    arrange(-cropdmg)

g <- ggplot(cropdmg_t[1:10, ], aes(reorder(EVTYPE, -cropdmg), cropdmg, fill = EVTYPE))
g + geom_bar(stat = "identity") +
    scale_x_discrete(guide = guide_axis(angle = 60)) +
    theme(legend.position = "none") +
    labs(x = "", y = "Crop Damage (in USD Billion)") +
    labs(title = "Impact of Various Natural Calamities on Crop")

The leading cause of property damage is Drought followed by Flood and Hurricane(Typhoon).

Results

Tornado seems to have the greatest health impact both in terms of fatalities and injuries. It is followed by Thunderstorm Wind in terms of injuries, and Excessive Heat in terms of fatalities. The top ten causes of injuries and fatalities are largely the same.

Flood has caused the most property damage while Drought has caused the most crop damage. What a contrast!!