In this analysis, historical storm data collected by the NOAA was examined with the goal of answering two specific questions. 1) Across the United States, which types of events are most harmful with respect to population health? 2) Across the United States, which types of events have the greatest economic consequences? The most complex aspect of answering these questions related to the characteristics of the data frame, and most of the analysis below is aimed at distilling the pertinent information from the dataset. This analysis was conducted entirely in R, and the relevant code has been provided along with commentary.
The original datafile (which can be downloaded here) contained 902297 rows and 37 columns, so it’s somewhat unwieldy. We’ll be using the free R Software to conduct our analysis. We’ll also include the code so that the analysis can be performed so long as the datafile is located in the working directory of the user’s R software.
Let’s begin by grabbing our data. Since R tends to hold objects in a system’s RAM, we’ll only load the columns related to our goals.
select.vec <- c(rep("NULL", 3), "character", rep("NULL", 2), rep("character", 2),
rep("NULL", 14), rep("numeric", 3), "character", "numeric", "character", rep("NULL", 9))
storms <- read.csv(bzfile("repdata-data-StormData.csv.bz2"), colClasses = select.vec)
#to make coding easier, we'll make column names lowercase
colnames(storms) <- tolower(colnames(storms))
A quick exploration of the dataframe reveals that there may be some data-entry errors in a few of the columns. The “propdmgexp” and“cropdmgexp” variables, in particular, seem to be affected. According to the pg. 12 of the provided NOAA documentation, damage estimates are to be followed by “K” when they represent thousands of dollars, “M” when they represent millions, and “B” when they represent billions. It would seem to follow, then, that “propdmgexp” and “cropdmgexp” should each consist of those three levels. However, this is not what we observe.
levels(storms$propdmgexp)
## [1] "" "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K"
## [18] "m" "M"
levels(storms$cropdmgexp)
## [1] "" "?" "0" "2" "B" "k" "K" "m" "M"
The apparent inconsistencies in the data need to be addressed before the analysis can proceed. Unfortunately, we can’t be certain of the intent with which each of these entries were recorded. We’ll be forced to make a few assumptions. First, we’ll eliminate the case-disparities within the factors by making all the levels lowercase.
storms$propdmgexp <- factor(tolower(storms$propdmgexp))
storms$cropdmgexp <- factor(tolower(storms$cropdmgexp))
Next, we’ll group every entry not identified with a letter into the “k” group. This is because the documentation only specifies three levels, and it seems reasonable to conclude that none of the the data would have been recorded at a multiplier smaller than “thousands”. (We don’t, however, want to assign these into the “millions” or “billions” groups without explicit cause because that might cause way too much weight to be given). Before we completely reconfigure these entries, we would like to flag each of these entries as a potential source of error in our analysis, so we’ll temporarily assign them into a “?” group.
levels(storms$propdmgexp) <- c(rep("?", times = 13), "b", "h", "k", "m")
levels(storms$cropdmgexp) <- c(rep("?", times = 4), "b", "k", "m")
Moving on to potentially erroneous entries that contain a letter, we find there are only seven instances in the entire dataset (for comparison, there are well over 400,000 non-alphabetical entries in “propdmgexp” and “cropdmgexp”). Surprisingly, all of the potentially erroneous alphabetical entries are the letter “h”. My first instinct was to assume these entries were intended to represent “billions” based on the proximity of “h” and “b” on a standard keyboard. However, I decided to classify these entries the same way as the non-alphabetical entries (into the “k” level) because an examination of the relevent rows revealed these entries were related to thunderstorm events and it seemed unlikely individual thunderstorms would cause billions of dollars in damage. Now that we have reached a decision on all of our potential errors, we’ll flag them and then re-classify them.
#flagging potential data-entry errors
storms$flag <- (grepl("[^bkm]", storms$propdmgexp)|grepl("[^bkm]", storms$cropdmgexp))
#re-classifying "mysterious" levels
levels(storms$propdmgexp) <- c("k", "b", "k", "k", "m")
levels(storms$cropdmgexp) <- c("k", "b", "k", "m")
Finally, we need to apply our multipliers to the numbers in our “propdmg” and “cropdmg” variables. To avoid accidentally repeating this process and distorting the numbers, we’ll discard the multiplier variables (“propdmgexp” and “cropdmgexp”) afterwards.
storms$propdmg[storms$propdmgexp == "b"] <-
storms$propdmg[storms$propdmgexp == "b"]*(10^9)
storms$propdmg[storms$propdmgexp == "k"] <-
storms$propdmg[storms$propdmgexp == "k"]*(10^3)
storms$propdmg[storms$propdmgexp == "m"] <-
storms$propdmg[storms$propdmgexp == "m"]*(10^6)
storms$propdmgexp <- NULL
storms$cropdmg[storms$cropdmgexp == "b"] <-
storms$cropdmg[storms$cropdmgexp == "b"]*(10^9)
storms$cropdmg[storms$cropdmgexp == "k"] <-
storms$cropdmg[storms$cropdmgexp == "k"]*(10^3)
storms$cropdmg[storms$cropdmgexp == "m"] <-
storms$cropdmg[storms$cropdmgexp == "m"]*(10^6)
storms$cropdmgexp <- NULL
Now that our data are in a more practical form, we can proceed with our analysis. Let’s sum the number of fatalities and injuries by the type of storm event. We’ll cobble these totals together in a dataframe along with their ranks (where the entries with the largest totals are assigned a rank of 1).
fatalities.sum.type <- as.vector(tapply(storms$fatalities, storms$evtype, sum))
injuries.sum.type <- as.vector(tapply(storms$injuries, storms$evtype, sum))
fatalities.rank <- rank(-fatalities.sum.type)
injuries.rank <- rank(-injuries.sum.type)
health.damage <- data.frame(event.type = levels(storms$evtype),
fatalities = fatalities.sum.type,
fatal.rank = fatalities.rank,
injuries = injuries.sum.type,
injuries.rank = injuries.rank)
It turns out that the dataframe formed is quite large (the “evtype” variable has 985 levels). So we’ll have to find a way to focus in on the “most damaging” storm types when it comes to health. Let’s look at entries that rank in the top 25 in both fatalities and injuries.
#top 25 in both fatalities and injuries
health.top25 <- health.damage[fatalities.rank<=25 & injuries.rank<=25,]
health.top25
## event.type fatalities fatal.rank injuries injuries.rank
## 30 BLIZZARD 101 19.5 805 17
## 130 EXCESSIVE HEAT 1903 2.0 6525 4
## 153 FLASH FLOOD 978 3.0 1777 8
## 170 FLOOD 470 7.0 6789 3
## 275 HEAT 937 4.0 2100 6
## 278 HEAT WAVE 172 13.0 309 24
## 310 HEAVY SNOW 127 16.0 1021 14
## 359 HIGH WIND 248 9.0 1137 13
## 427 ICE STORM 89 24.0 1975 7
## 464 LIGHTNING 816 5.0 5230 5
## 760 THUNDERSTORM WIND 133 15.0 1488 9
## 834 TORNADO 5633 1.0 91346 1
## 856 TSTM WIND 504 6.0 6957 2
## 957 WILDFIRE 75 25.0 911 15
## 972 WINTER STORM 206 11.0 1321 11
That’s much more manageable. Let’s refine things even further to see which storm types are in the top 10 in both fatalities and injuries.
#top 10 in both fatalities and injuries
health.top10 <- health.damage[fatalities.rank<=10 & injuries.rank<=10,]
health.top10
## event.type fatalities fatal.rank injuries injuries.rank
## 130 EXCESSIVE HEAT 1903 2 6525 4
## 153 FLASH FLOOD 978 3 1777 8
## 170 FLOOD 470 7 6789 3
## 275 HEAT 937 4 2100 6
## 464 LIGHTNING 816 5 5230 5
## 834 TORNADO 5633 1 91346 1
## 856 TSTM WIND 504 6 6957 2
If the numbers still seem a little hard to put in perspective, we can try sorting the data, and even make a plot of the two rankings.
#top 25 both health damage ranks - figure#
f.ordered.health <- health.top25[
order(health.top25$fatalities, health.top25$injuries, decreasing = TRUE),
c("event.type", "fatalities", "fatal.rank")]
names.f <- f.ordered.health$event.type
length.f <- length(f.ordered.health$fatal.rank)
i.ordered.health <- health.top25[
order(health.top25$injuries, health.top25$fatalities, decreasing = TRUE),
c("event.type", "injuries", "injuries.rank")]
names.i <- i.ordered.health$event.type
length.i <- length(i.ordered.health$injuries.rank)
par(mfrow = c(1,2), mar = c(1, 3.6, 2, 0), oma = c(0, 0, 1, 0), mgp = c(2.3,1,0))
plot(y = f.ordered.health$fatal.rank, x = 1:length.f,
ylim = c(25,1), xlim = c(1,50), pch = 19, xlab = "",
xaxt = "n", ylab = "Ranks in Descending Order",
main = "Fatality Ranks", cex.lab = 1.15)
text(y = f.ordered.health$fatal.rank, x = 1:length.f, names.f, pos=4)
par(mar = c(1, 0.6, 2, 3))
plot(y = i.ordered.health$injuries.rank, x = 1:length.i,
ylim = c(25,1), xlim = c(1,50), pch = 19, xlab = "", xaxt = "n",
yaxt = "n", ylab = "", main = "Injury Ranks")
text(y = i.ordered.health$injuries.rank, x = 1:length.i, names.i, pos=4)
We can follow a similar process to determine economic damage. We’ll create an economic damage dataframe:
###economic damage == propdmg | cropdmg
propdmg.sum.type <- as.vector(tapply(storms$propdmg, storms$evtype, sum))
cropdmg.sum.type <- as.vector(tapply(storms$cropdmg, storms$evtype, sum))
propdmg.rank <- rank(-propdmg.sum.type)
cropdmg.rank <- rank(-cropdmg.sum.type)
econ.damage <- data.frame(event.type = levels(storms$evtype),
propdmg = propdmg.sum.type,
propdmg.rank = propdmg.rank,
cropdmg = cropdmg.sum.type,
cropdmg.rank = cropdmg.rank)
And we’ll identify the entries that are in both the top 25 in property and crop damage.
#top 25 in both property and crop damage
econ.damage[propdmg.rank<=25 & cropdmg.rank<=25,]
## event.type propdmg propdmg.rank cropdmg cropdmg.rank
## 95 DROUGHT 1046106000 23 13972566000 1
## 153 FLASH FLOOD 16141368610 5 1421317100 8
## 170 FLOOD 144657716800 1 5661968450 2
## 244 HAIL 15732594420 6 3025977450 5
## 310 HEAVY SNOW 932590840 24 134653100 25
## 359 HIGH WIND 5270081260 10 638571300 13
## 402 HURRICANE 11868319010 7 2741910000 6
## 411 HURRICANE/TYPHOON 69305840000 2 2607872800 7
## 427 ICE STORM 3944977810 15 5022113500 4
## 590 RIVER FLOOD 5118945500 11 5029459000 3
## 760 THUNDERSTORM WIND 3483265140 16 414843050 18
## 786 THUNDERSTORM WINDS 1742126050 20 190742700 22
## 834 TORNADO 56937459180 3 415113110 17
## 848 TROPICAL STORM 7703890550 8 678346000 12
## 856 TSTM WIND 4484983440 14 554007350 14
## 957 WILDFIRE 4765114000 12 295472800 20
To see the entries ranked in the top 10 for both property and crop damage
#top 10 in both property and crop damage
econ.damage[propdmg.rank<=10 & cropdmg.rank<=10,]
## event.type propdmg propdmg.rank cropdmg cropdmg.rank
## 153 FLASH FLOOD 16141368610 5 1421317100 8
## 170 FLOOD 144657716800 1 5661968450 2
## 244 HAIL 15732594420 6 3025977450 5
## 402 HURRICANE 11868319010 7 2741910000 6
## 411 HURRICANE/TYPHOON 69305840000 2 2607872800 7
Again, if we want to a visual of the entries sorted by property and crop damage ranks, we can achieve that via the following code:
#top 25 both economic damage ranks - figure#
econ.top25 <- econ.damage[propdmg.rank<=25 & cropdmg.rank<=25,]
p.ordered.econ <- econ.top25[
order(econ.top25$propdmg, econ.top25$cropdmg, decreasing = TRUE),
c("event.type", "propdmg", "propdmg.rank")]
names.p <- p.ordered.econ$event.type
length.p <- length(p.ordered.econ$propdmg.rank)
c.ordered.econ <- econ.top25[
order(econ.top25$cropdmg, econ.top25$propdmg, decreasing = TRUE),
c("event.type", "cropdmg", "cropdmg.rank")]
names.c <- c.ordered.econ$event.type
length.c <- length(c.ordered.econ$cropdmg.rank)
par(mfrow = c(1,2), mar = c(1, 3.6, 2, 0), oma = c(0, 0, 1, 0), mgp = c(2.3,1,0))
plot(y = p.ordered.econ$propdmg.rank, x = 1:length.p,
ylim = c(25,1), xlim = c(1,70), pch = 19, xlab = "",
xaxt = "n", ylab = "Ranks in Descending Order",
main = "Property Damage", cex.lab = 1.15)
text(y = p.ordered.econ$propdmg.rank, x = 1:length.p, names.p, pos=4)
par(mar = c(1, 0.6, 2, 3))
plot(y = c.ordered.econ$cropdmg.rank, x = 1:length.c,
ylim = c(25,1), xlim = c(1,70), pch = 19, xlab = "", xaxt = "n",
yaxt = "n", ylab = "", main = "Crop Damage")
text(y = c.ordered.econ$cropdmg.rank, x = 1:length.c, names.c, pos=4)
These results are intriguing, but this analysis presumes that the “evtype”" variable (which identifies the type of storm) is properly inputted. A simple exploration reveals this is not the case. As an example, look at this snippet of the “levels” present in the “evtype”" variable.
levels(storms$evtype)[20:40]
## [1] "BEACH EROSIN" "Beach Erosion"
## [3] "BEACH EROSION" "BEACH EROSION/COASTAL FLOOD"
## [5] "BEACH FLOOD" "BELOW NORMAL PRECIPITATION"
## [7] "BITTER WIND CHILL" "BITTER WIND CHILL TEMPERATURES"
## [9] "Black Ice" "BLACK ICE"
## [11] "BLIZZARD" "BLIZZARD AND EXTREME WIND CHIL"
## [13] "BLIZZARD AND HEAVY SNOW" "Blizzard Summary"
## [15] "BLIZZARD WEATHER" "BLIZZARD/FREEZING RAIN"
## [17] "BLIZZARD/HEAVY SNOW" "BLIZZARD/HIGH WIND"
## [19] "BLIZZARD/WINTER STORM" "BLOW-OUT TIDE"
## [21] "BLOW-OUT TIDES"
The redundancy that exists within the “evtype” variable undermines any validity our previous analysis could profess to offer. What if the most damaging types of storms have been split into so many pieces that their totals escape our notice? Adding to the predicament is the notion that some storm-type labels are distinct in appearance but highly similar in meaning. For example, “Coastal Storm”, “Hurricane”, “Typhoon”, etc. all appear to be describing what is essentially the same phenomena, yet they are present as distinct levels within the “evtype” variable. It’s probably improper, and it’s certainly arbitrary, to separate these effects. Unfortunately, the size (985 levels) and complexity of the naming system within the “evtype” variable make direct intervention seem impractical. As a sort of last resort, I constructed 15 broad categories to collect labels that appeared to be describing similar phenomena. This was a crude work-around, and it introduces its own bias into the analysis, but I wanted to provide at least some sort of check against the potentially devastating confounders present in our initial analysis.
#addressing redundance issues #creating broad categories
damage <- data.frame(health.damage, econ.damage[,2:5])
damage$event.type <- tolower(damage$event.type)
damage$event.cat <- character(length(damage$event.type))
damage[grepl("flood|(small stream)",damage$event.type),"event.cat"] <- "flood"
damage[grepl("flash flood",damage$event.type),"event.cat"] <- "flash.flood"
damage[grepl("slide|erosion|landslump|avalanc",damage$event.type),"event.cat"] <- "land.slide"
damage[grepl("surf|seas|tide|swells|current",damage$event.type),"event.cat"] <- "water.current"
damage[grepl("cold|freez|frost|chill|glaze|hypothermia|exposure",damage$event.type),"event.cat"] <- "cold"
damage[grepl("snow|ice|blizzard|winter|wintry",damage$event.type),"event.cat"] <- "snow.ice"
damage[grepl("(mixed precip)|(excessive wetness)|(cool and wet)",damage$event.type),"event.cat"] <- "mixed.precip"
damage[grepl("wind|tornado|nado|waterspout|whirlwind|(dust devil)|(dust storm)",damage$event.type),"event.cat"] <- "wind.storm"
damage[grepl("microburst|thunder|rain|shower|lightning|hail|tstm",damage$event.type),"event.cat"] <- "rain.storm"
damage[grepl("hurricane|(tropical depression)|(tropical storm)|(coastal storm)",damage$event.type),"event.cat"] <- "tropical.storm"
damage[grepl("mishap|accident",damage$event.type),"event.cat"] <- "accident"
damage[grepl("tsunami",damage$event.type),"event.cat"] <- "tsunami"
damage[grepl("fire",damage$event.type),"event.cat"] <- "fire"
damage[grepl("heat|drought|(unseasonably warm)",damage$event.type),"event.cat"] <- "heat"
damage$event.cat <- factor(damage$event.cat)
levels(damage$event.cat) <- c("misc",levels(damage$event.cat)[2:15])
Now that we’ve created these new categories, we can create a new dataframe for the sums of fatalities, injuries, property damages, and crop damages by category.
fatal2 <- as.vector(tapply(damage[,2], damage$event.cat, sum))
injuries2 <- as.vector(tapply(damage[,4], damage$event.cat, sum))
propdmg2 <- as.vector(tapply(damage[,6], damage$event.cat, sum))
cropdmg2 <- as.vector(tapply(damage[,8], damage$event.cat, sum))
dmg.vec <- gl(4,15,labels = c("fatalities", "injuries", "propdmg", "cropdmg"))
cat.vec <- factor(rep(levels(damage$event.cat), times = 4))
damage2 <- data.frame(total = c(fatal2, injuries2, propdmg2, cropdmg2), dmg.vec, cat.vec)
Our initial analysis suggested that Tornado, Heat, and Flood were probably the most damaging hazards to human health. They also revealed that Flood & Hurricane were probably the most damaging economically. We chose to revisit this analysis after attempting to address some of the redundancies in the storm type variable (“evtype”). While it was necessary, in the spirit of validity, to see if the results were consistent when the issues with “evtype” were mitigated, it turns out that the results were not that different. “Windstorms”, which include tornadoes, still represented the most dangerous events to human health (by a wide margin). Heat and flood also stood out from the pack, as did “Rain Storms” (thunderstorms, hail, etc.), which made a more obvious appearance than in the initial analysis. When it came to economic damage, “Flood” and “Tropical Storm” finished in the lead once again.
library(ggplot2)
ggplot(data = damage2, aes(cat.vec, total))+
geom_point()+facet_grid(dmg.vec ~ ., scales = "free")+
theme(axis.text.x = element_text(angle=90, size = 14, vjust = 0.25),
strip.text.y = element_text(size = 14))+
labs(x = NULL, y = NULL)
Obviously, there are caveats with any analysis. As an example, it may seem surprising that “flash floods” and “floods” were created as separate categories. This was done because flash floods are often considered spontaneous while floods are often considered seasonal. Fortunately, it doesn’t appear the results would be substantially different if these two categories were combined. Nevertheless, this example highlights the reality that this analysis (like any analysis) was forced to rely on assumptions. I did my best to handle what I perceived to be obstacles to proper examination, but it’s possible I erred in my application. Either way, I think it might be beneficial to streamline the data-collection process and standardize the methods for determining event type. This might result in a more stable dataset on which to conduct analysis.
Overall, it seems that the the most dangerous event in terms of human health is a tornado. Notable contenders include excessive heat and thunderstorms. In terms of economic damage, the heaviest hitters are floods and tropical storms, though heat proved especially perilous for crops.
Thank you for taking the time to read this report. I hope you’re having a great day.