Synopsis

In this analysis, historical storm data collected by the NOAA was examined with the goal of answering two specific questions. 1) Across the United States, which types of events are most harmful with respect to population health? 2) Across the United States, which types of events have the greatest economic consequences? The most complex aspect of answering these questions related to the characteristics of the data frame, and most of the analysis below is aimed at distilling the pertinent information from the dataset. This analysis was conducted entirely in R, and the relevant code has been provided along with commentary.

Data Processing

The original datafile (which can be downloaded here) contained 902297 rows and 37 columns, so it’s somewhat unwieldy. We’ll be using the free R Software to conduct our analysis. We’ll also include the code so that the analysis can be performed so long as the datafile is located in the working directory of the user’s R software.

Let’s begin by grabbing our data. Since R tends to hold objects in a system’s RAM, we’ll only load the columns related to our goals.

select.vec <- c(rep("NULL", 3), "character", rep("NULL", 2), rep("character", 2),
    rep("NULL", 14), rep("numeric", 3), "character", "numeric", "character", rep("NULL", 9))
storms <- read.csv(bzfile("repdata-data-StormData.csv.bz2"), colClasses = select.vec)
#to make coding easier, we'll make column names lowercase
colnames(storms) <- tolower(colnames(storms))

A quick exploration of the dataframe reveals that there may be some data-entry errors in a few of the columns. The “propdmgexp” and“cropdmgexp” variables, in particular, seem to be affected. According to the pg. 12 of the provided NOAA documentation, damage estimates are to be followed by “K” when they represent thousands of dollars, “M” when they represent millions, and “B” when they represent billions. It would seem to follow, then, that “propdmgexp” and “cropdmgexp” should each consist of those three levels. However, this is not what we observe.

levels(storms$propdmgexp)
##  [1] ""  "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(storms$cropdmgexp)
## [1] ""  "?" "0" "2" "B" "k" "K" "m" "M"

The apparent inconsistencies in the data need to be addressed before the analysis can proceed. Unfortunately, we can’t be certain of the intent with which each of these entries were recorded. We’ll be forced to make a few assumptions. First, we’ll eliminate the case-disparities within the factors by making all the levels lowercase.

storms$propdmgexp <- factor(tolower(storms$propdmgexp))
storms$cropdmgexp <- factor(tolower(storms$cropdmgexp))

Next, we’ll group every entry not identified with a letter into the “k” group. This is because the documentation only specifies three levels, and it seems reasonable to conclude that none of the the data would have been recorded at a multiplier smaller than “thousands”. (We don’t, however, want to assign these into the “millions” or “billions” groups without explicit cause because that might cause way too much weight to be given). Before we completely reconfigure these entries, we would like to flag each of these entries as a potential source of error in our analysis, so we’ll temporarily assign them into a “?” group.

levels(storms$propdmgexp) <- c(rep("?", times = 13), "b", "h", "k", "m")
levels(storms$cropdmgexp) <- c(rep("?", times = 4), "b", "k", "m")

Moving on to potentially erroneous entries that contain a letter, we find there are only seven instances in the entire dataset (for comparison, there are well over 400,000 non-alphabetical entries in “propdmgexp” and “cropdmgexp”). Surprisingly, all of the potentially erroneous alphabetical entries are the letter “h”. My first instinct was to assume these entries were intended to represent “billions” based on the proximity of “h” and “b” on a standard keyboard. However, I decided to classify these entries the same way as the non-alphabetical entries (into the “k” level) because an examination of the relevent rows revealed these entries were related to thunderstorm events and it seemed unlikely individual thunderstorms would cause billions of dollars in damage. Now that we have reached a decision on all of our potential errors, we’ll flag them and then re-classify them.

#flagging potential data-entry errors
storms$flag <- (grepl("[^bkm]", storms$propdmgexp)|grepl("[^bkm]", storms$cropdmgexp))
#re-classifying "mysterious" levels
levels(storms$propdmgexp) <- c("k", "b", "k", "k", "m")
levels(storms$cropdmgexp) <- c("k", "b", "k", "m")

Finally, we need to apply our multipliers to the numbers in our “propdmg” and “cropdmg” variables. To avoid accidentally repeating this process and distorting the numbers, we’ll discard the multiplier variables (“propdmgexp” and “cropdmgexp”) afterwards.

storms$propdmg[storms$propdmgexp == "b"] <- 
            storms$propdmg[storms$propdmgexp == "b"]*(10^9)
storms$propdmg[storms$propdmgexp == "k"] <-
            storms$propdmg[storms$propdmgexp == "k"]*(10^3)
storms$propdmg[storms$propdmgexp == "m"] <- 
            storms$propdmg[storms$propdmgexp == "m"]*(10^6)
storms$propdmgexp <- NULL
storms$cropdmg[storms$cropdmgexp == "b"] <- 
            storms$cropdmg[storms$cropdmgexp == "b"]*(10^9)
storms$cropdmg[storms$cropdmgexp == "k"] <-
            storms$cropdmg[storms$cropdmgexp == "k"]*(10^3)
storms$cropdmg[storms$cropdmgexp == "m"] <- 
            storms$cropdmg[storms$cropdmgexp == "m"]*(10^6)
storms$cropdmgexp <- NULL

Analysis

Now that our data are in a more practical form, we can proceed with our analysis. Let’s sum the number of fatalities and injuries by the type of storm event. We’ll cobble these totals together in a dataframe along with their ranks (where the entries with the largest totals are assigned a rank of 1).

fatalities.sum.type <- as.vector(tapply(storms$fatalities, storms$evtype, sum))
injuries.sum.type <- as.vector(tapply(storms$injuries, storms$evtype, sum))
fatalities.rank <- rank(-fatalities.sum.type)
injuries.rank <- rank(-injuries.sum.type)
health.damage <- data.frame(event.type = levels(storms$evtype),
        fatalities = fatalities.sum.type,
        fatal.rank = fatalities.rank,
        injuries = injuries.sum.type,
        injuries.rank = injuries.rank)

It turns out that the dataframe formed is quite large (the “evtype” variable has 985 levels). So we’ll have to find a way to focus in on the “most damaging” storm types when it comes to health. Let’s look at entries that rank in the top 25 in both fatalities and injuries.

#top 25 in both fatalities and injuries
health.top25 <- health.damage[fatalities.rank<=25 & injuries.rank<=25,]
health.top25
##            event.type fatalities fatal.rank injuries injuries.rank
## 30           BLIZZARD        101       19.5      805            17
## 130    EXCESSIVE HEAT       1903        2.0     6525             4
## 153       FLASH FLOOD        978        3.0     1777             8
## 170             FLOOD        470        7.0     6789             3
## 275              HEAT        937        4.0     2100             6
## 278         HEAT WAVE        172       13.0      309            24
## 310        HEAVY SNOW        127       16.0     1021            14
## 359         HIGH WIND        248        9.0     1137            13
## 427         ICE STORM         89       24.0     1975             7
## 464         LIGHTNING        816        5.0     5230             5
## 760 THUNDERSTORM WIND        133       15.0     1488             9
## 834           TORNADO       5633        1.0    91346             1
## 856         TSTM WIND        504        6.0     6957             2
## 957          WILDFIRE         75       25.0      911            15
## 972      WINTER STORM        206       11.0     1321            11

That’s much more manageable. Let’s refine things even further to see which storm types are in the top 10 in both fatalities and injuries.

#top 10 in both fatalities and injuries
health.top10 <- health.damage[fatalities.rank<=10 & injuries.rank<=10,]
health.top10
##         event.type fatalities fatal.rank injuries injuries.rank
## 130 EXCESSIVE HEAT       1903          2     6525             4
## 153    FLASH FLOOD        978          3     1777             8
## 170          FLOOD        470          7     6789             3
## 275           HEAT        937          4     2100             6
## 464      LIGHTNING        816          5     5230             5
## 834        TORNADO       5633          1    91346             1
## 856      TSTM WIND        504          6     6957             2

If the numbers still seem a little hard to put in perspective, we can try sorting the data, and even make a plot of the two rankings.

#top 25 both health damage ranks - figure#
f.ordered.health <- health.top25[
  order(health.top25$fatalities, health.top25$injuries, decreasing = TRUE),
    c("event.type", "fatalities", "fatal.rank")]
names.f <- f.ordered.health$event.type
length.f <- length(f.ordered.health$fatal.rank)
i.ordered.health <- health.top25[
    order(health.top25$injuries, health.top25$fatalities, decreasing = TRUE),
    c("event.type", "injuries", "injuries.rank")]
names.i <- i.ordered.health$event.type
length.i <- length(i.ordered.health$injuries.rank)
par(mfrow = c(1,2), mar = c(1, 3.6, 2, 0), oma = c(0, 0, 1, 0), mgp = c(2.3,1,0))
plot(y = f.ordered.health$fatal.rank, x = 1:length.f,
    ylim = c(25,1), xlim = c(1,50), pch = 19, xlab = "",
    xaxt = "n", ylab = "Ranks in Descending Order",
    main = "Fatality Ranks", cex.lab = 1.15)
text(y = f.ordered.health$fatal.rank, x = 1:length.f, names.f, pos=4)
par(mar = c(1, 0.6, 2, 3))
plot(y = i.ordered.health$injuries.rank, x = 1:length.i,
    ylim = c(25,1), xlim = c(1,50), pch = 19, xlab = "", xaxt = "n",
    yaxt = "n", ylab = "", main = "Injury Ranks")
text(y = i.ordered.health$injuries.rank, x = 1:length.i, names.i, pos=4)

We can follow a similar process to determine economic damage. We’ll create an economic damage dataframe:

###economic damage == propdmg | cropdmg
propdmg.sum.type <- as.vector(tapply(storms$propdmg, storms$evtype, sum))
cropdmg.sum.type <- as.vector(tapply(storms$cropdmg, storms$evtype, sum))
propdmg.rank <- rank(-propdmg.sum.type)
cropdmg.rank <- rank(-cropdmg.sum.type)
econ.damage <- data.frame(event.type = levels(storms$evtype),
    propdmg = propdmg.sum.type,
        propdmg.rank = propdmg.rank,
        cropdmg = cropdmg.sum.type,
        cropdmg.rank = cropdmg.rank)

And we’ll identify the entries that are in both the top 25 in property and crop damage.

#top 25 in both property and crop damage
econ.damage[propdmg.rank<=25 & cropdmg.rank<=25,]
##             event.type      propdmg propdmg.rank     cropdmg cropdmg.rank
## 95             DROUGHT   1046106000           23 13972566000            1
## 153        FLASH FLOOD  16141368610            5  1421317100            8
## 170              FLOOD 144657716800            1  5661968450            2
## 244               HAIL  15732594420            6  3025977450            5
## 310         HEAVY SNOW    932590840           24   134653100           25
## 359          HIGH WIND   5270081260           10   638571300           13
## 402          HURRICANE  11868319010            7  2741910000            6
## 411  HURRICANE/TYPHOON  69305840000            2  2607872800            7
## 427          ICE STORM   3944977810           15  5022113500            4
## 590        RIVER FLOOD   5118945500           11  5029459000            3
## 760  THUNDERSTORM WIND   3483265140           16   414843050           18
## 786 THUNDERSTORM WINDS   1742126050           20   190742700           22
## 834            TORNADO  56937459180            3   415113110           17
## 848     TROPICAL STORM   7703890550            8   678346000           12
## 856          TSTM WIND   4484983440           14   554007350           14
## 957           WILDFIRE   4765114000           12   295472800           20

To see the entries ranked in the top 10 for both property and crop damage

#top 10 in both property and crop damage
econ.damage[propdmg.rank<=10 & cropdmg.rank<=10,]
##            event.type      propdmg propdmg.rank    cropdmg cropdmg.rank
## 153       FLASH FLOOD  16141368610            5 1421317100            8
## 170             FLOOD 144657716800            1 5661968450            2
## 244              HAIL  15732594420            6 3025977450            5
## 402         HURRICANE  11868319010            7 2741910000            6
## 411 HURRICANE/TYPHOON  69305840000            2 2607872800            7

Again, if we want to a visual of the entries sorted by property and crop damage ranks, we can achieve that via the following code:

#top 25 both economic damage ranks - figure#
econ.top25 <- econ.damage[propdmg.rank<=25 & cropdmg.rank<=25,]
p.ordered.econ <- econ.top25[
    order(econ.top25$propdmg, econ.top25$cropdmg, decreasing = TRUE),
    c("event.type", "propdmg", "propdmg.rank")]
names.p <- p.ordered.econ$event.type
length.p <- length(p.ordered.econ$propdmg.rank)
c.ordered.econ <- econ.top25[
    order(econ.top25$cropdmg, econ.top25$propdmg, decreasing = TRUE),
    c("event.type", "cropdmg", "cropdmg.rank")]
names.c <- c.ordered.econ$event.type
length.c <- length(c.ordered.econ$cropdmg.rank)
par(mfrow = c(1,2), mar = c(1, 3.6, 2, 0), oma = c(0, 0, 1, 0), mgp = c(2.3,1,0))
plot(y = p.ordered.econ$propdmg.rank, x = 1:length.p,
    ylim = c(25,1), xlim = c(1,70), pch = 19, xlab = "",
    xaxt = "n", ylab = "Ranks in Descending Order",
    main = "Property Damage", cex.lab = 1.15)
text(y = p.ordered.econ$propdmg.rank, x = 1:length.p, names.p, pos=4)
par(mar = c(1, 0.6, 2, 3))
plot(y = c.ordered.econ$cropdmg.rank, x = 1:length.c,
    ylim = c(25,1), xlim = c(1,70), pch = 19, xlab = "", xaxt = "n",
    yaxt = "n", ylab = "", main = "Crop Damage")
text(y = c.ordered.econ$cropdmg.rank, x = 1:length.c, names.c, pos=4)

These results are intriguing, but this analysis presumes that the “evtype”" variable (which identifies the type of storm) is properly inputted. A simple exploration reveals this is not the case. As an example, look at this snippet of the “levels” present in the “evtype”" variable.

levels(storms$evtype)[20:40]
##  [1] "BEACH EROSIN"                   "Beach Erosion"                 
##  [3] "BEACH EROSION"                  "BEACH EROSION/COASTAL FLOOD"   
##  [5] "BEACH FLOOD"                    "BELOW NORMAL PRECIPITATION"    
##  [7] "BITTER WIND CHILL"              "BITTER WIND CHILL TEMPERATURES"
##  [9] "Black Ice"                      "BLACK ICE"                     
## [11] "BLIZZARD"                       "BLIZZARD AND EXTREME WIND CHIL"
## [13] "BLIZZARD AND HEAVY SNOW"        "Blizzard Summary"              
## [15] "BLIZZARD WEATHER"               "BLIZZARD/FREEZING RAIN"        
## [17] "BLIZZARD/HEAVY SNOW"            "BLIZZARD/HIGH WIND"            
## [19] "BLIZZARD/WINTER STORM"          "BLOW-OUT TIDE"                 
## [21] "BLOW-OUT TIDES"

The redundancy that exists within the “evtype” variable undermines any validity our previous analysis could profess to offer. What if the most damaging types of storms have been split into so many pieces that their totals escape our notice? Adding to the predicament is the notion that some storm-type labels are distinct in appearance but highly similar in meaning. For example, “Coastal Storm”, “Hurricane”, “Typhoon”, etc. all appear to be describing what is essentially the same phenomena, yet they are present as distinct levels within the “evtype” variable. It’s probably improper, and it’s certainly arbitrary, to separate these effects. Unfortunately, the size (985 levels) and complexity of the naming system within the “evtype” variable make direct intervention seem impractical. As a sort of last resort, I constructed 15 broad categories to collect labels that appeared to be describing similar phenomena. This was a crude work-around, and it introduces its own bias into the analysis, but I wanted to provide at least some sort of check against the potentially devastating confounders present in our initial analysis.

#addressing redundance issues #creating broad categories
damage <- data.frame(health.damage, econ.damage[,2:5])
damage$event.type <- tolower(damage$event.type)
damage$event.cat <- character(length(damage$event.type))
damage[grepl("flood|(small stream)",damage$event.type),"event.cat"] <- "flood"
damage[grepl("flash flood",damage$event.type),"event.cat"] <- "flash.flood"
damage[grepl("slide|erosion|landslump|avalanc",damage$event.type),"event.cat"] <- "land.slide"
damage[grepl("surf|seas|tide|swells|current",damage$event.type),"event.cat"] <- "water.current"
damage[grepl("cold|freez|frost|chill|glaze|hypothermia|exposure",damage$event.type),"event.cat"] <- "cold" 
damage[grepl("snow|ice|blizzard|winter|wintry",damage$event.type),"event.cat"] <- "snow.ice"
damage[grepl("(mixed precip)|(excessive wetness)|(cool and wet)",damage$event.type),"event.cat"] <- "mixed.precip" 
damage[grepl("wind|tornado|nado|waterspout|whirlwind|(dust devil)|(dust storm)",damage$event.type),"event.cat"] <- "wind.storm"
damage[grepl("microburst|thunder|rain|shower|lightning|hail|tstm",damage$event.type),"event.cat"] <- "rain.storm"
damage[grepl("hurricane|(tropical depression)|(tropical storm)|(coastal storm)",damage$event.type),"event.cat"] <- "tropical.storm"
damage[grepl("mishap|accident",damage$event.type),"event.cat"] <- "accident"
damage[grepl("tsunami",damage$event.type),"event.cat"] <- "tsunami"
damage[grepl("fire",damage$event.type),"event.cat"] <- "fire"
damage[grepl("heat|drought|(unseasonably warm)",damage$event.type),"event.cat"] <- "heat"
damage$event.cat <- factor(damage$event.cat)
levels(damage$event.cat) <- c("misc",levels(damage$event.cat)[2:15])

Now that we’ve created these new categories, we can create a new dataframe for the sums of fatalities, injuries, property damages, and crop damages by category.

fatal2 <- as.vector(tapply(damage[,2], damage$event.cat, sum))
injuries2 <- as.vector(tapply(damage[,4], damage$event.cat, sum))
propdmg2 <- as.vector(tapply(damage[,6], damage$event.cat, sum))
cropdmg2 <- as.vector(tapply(damage[,8], damage$event.cat, sum))
dmg.vec <- gl(4,15,labels = c("fatalities", "injuries", "propdmg", "cropdmg"))
cat.vec <- factor(rep(levels(damage$event.cat), times = 4))
damage2 <- data.frame(total = c(fatal2, injuries2, propdmg2, cropdmg2), dmg.vec, cat.vec)

Results

Our initial analysis suggested that Tornado, Heat, and Flood were probably the most damaging hazards to human health. They also revealed that Flood & Hurricane were probably the most damaging economically. We chose to revisit this analysis after attempting to address some of the redundancies in the storm type variable (“evtype”). While it was necessary, in the spirit of validity, to see if the results were consistent when the issues with “evtype” were mitigated, it turns out that the results were not that different. “Windstorms”, which include tornadoes, still represented the most dangerous events to human health (by a wide margin). Heat and flood also stood out from the pack, as did “Rain Storms” (thunderstorms, hail, etc.), which made a more obvious appearance than in the initial analysis. When it came to economic damage, “Flood” and “Tropical Storm” finished in the lead once again.

library(ggplot2)
ggplot(data = damage2, aes(cat.vec, total))+
geom_point()+facet_grid(dmg.vec ~ ., scales = "free")+
theme(axis.text.x = element_text(angle=90, size = 14, vjust = 0.25),
    strip.text.y = element_text(size = 14))+
labs(x = NULL, y = NULL)

Obviously, there are caveats with any analysis. As an example, it may seem surprising that “flash floods” and “floods” were created as separate categories. This was done because flash floods are often considered spontaneous while floods are often considered seasonal. Fortunately, it doesn’t appear the results would be substantially different if these two categories were combined. Nevertheless, this example highlights the reality that this analysis (like any analysis) was forced to rely on assumptions. I did my best to handle what I perceived to be obstacles to proper examination, but it’s possible I erred in my application. Either way, I think it might be beneficial to streamline the data-collection process and standardize the methods for determining event type. This might result in a more stable dataset on which to conduct analysis.

Overall, it seems that the the most dangerous event in terms of human health is a tornado. Notable contenders include excessive heat and thunderstorms. In terms of economic damage, the heaviest hitters are floods and tropical storms, though heat proved especially perilous for crops.

Thank you for taking the time to read this report. I hope you’re having a great day.