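In the following analysis, we attempt to answer two important questions: which types of storm events are most harmful to population health, and which have the greatest economic consequences?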
The analysis was done in R, using data from the NOAA Storm Database (1950-2011).
As we see below, the analysis concludes that tornadoes have been the most harmful in terms of health consequences (both injuries and deaths), while floods have had the most devastating economic consequences.
First, we load the dataset.
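# setInternet2() only existed on Windows and is no longer needed for HTTPS downloads in current R.
#setInternet2(use = TRUE)
#download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", "repdata_data_StormData.csv.bz2", mode = "wb")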
data <- read.csv("repdata_data_StormData.csv.bz2", header=TRUE)
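The loading step can be made repeatable by downloading the archive only when it is not already on disk. The following is a minimal sketch under that assumption; the helper name `fetch_if_missing` is hypothetical and not part of the original analysis.

```r
# Hypothetical helper: download `url` to `dest` only if `dest` is absent.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps the compressed file intact on Windows
    download.file(url, dest, mode = "wb")
  }
  dest
}

# Possible usage with the file name used above (not run here):
# path <- fetch_if_missing(
#   "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#   "repdata_data_StormData.csv.bz2")
# data <- read.csv(path, header = TRUE)
```

read.csv() can read the .bz2 file directly, so no separate decompression step is required.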
Here, we scale the damage values by their associated exponent variables (VERY important!) and then drop rows with no recorded impact, i.e. rows where every damage and health variable is less than or equal to zero.
lookup <- c(k = 10^3, K = 10^3, m = 10^6, M = 10^6, B = 10^9,
"2" = 10^2, "3" = 10^3, "4" = 10^4, "5" = 10^5, "6" = 10^6, "7" = 10^7, "8" = 10^8,
"-" = 0, "0" = 0, "?" = 0, h = 0)
data$CROPDMG <- data$CROPDMG * lookup[as.character(data$CROPDMGEXP)]
data$PROPDMG <- data$PROPDMG * lookup[as.character(data$PROPDMGEXP)]
data <- subset(data, FATALITIES > 0 | INJURIES > 0 | CROPDMG > 0 | PROPDMG > 0)
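To make the exponent step concrete, here is a small sketch of how indexing a named vector by character behaves. The toy values are made up, and the "treat unmapped codes as a multiplier of 1" policy is an assumption shown only for comparison; the analysis above leaves such rows as NA, which subset() and na.rm = TRUE then effectively treat as zero.

```r
# Toy data standing in for PROPDMG / PROPDMGEXP (values are made up).
toy <- data.frame(PROPDMG = c(25, 2.5, 10, 5),
                  PROPDMGEXP = c("K", "M", "", "B"),
                  stringsAsFactors = FALSE)

lookup <- c(k = 10^3, K = 10^3, m = 10^6, M = 10^6, B = 10^9)

# Character indexing looks each code up by name; codes that are not
# names of the vector (here the empty string) come back as NA.
mult <- unname(lookup[toy$PROPDMGEXP])

# Alternative policy (an assumption): unmapped codes get a multiplier of 1.
mult_filled <- ifelse(is.na(mult), 1, mult)

toy$PROPDMG_USD <- toy$PROPDMG * mult_filled
```

Checking `table(data$PROPDMGEXP, useNA = "ifany")` before applying the lookup is one way to see which codes would end up as NA.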
From here on, we use data.table, since it is a faster way to accomplish our task.
The relevant variables are each summed by EVTYPE. Then, since we are only interested in the most impactful event types, we take the top 15 in each category. We look at injuries and fatalities separately but treat economic cost as a single total (crop damage + property damage).
NOTE that we don’t amalgamate the EVTYPE values (next section below) until AFTER we significantly subset the data. This does not materially affect our results, given how heavily the totals are concentrated in the top 15 events (see the commented-out code for details).
library(data.table)
data <- data.table(data)
summary <- data[, list(Injured = sum(INJURIES),
Fatalities = sum(FATALITIES),
Dmg = sum(CROPDMG,PROPDMG, na.rm=TRUE)),
by=EVTYPE]
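The grouped summary above can be tried on a few made-up rows to see what it produces. This sketch assumes the data.table package is installed; the toy values are illustrative only.

```r
library(data.table)

# Made-up rows using the same column names as the storm data.
toy <- data.table(EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
                  INJURIES = c(10, 2, 5, 1),
                  FATALITIES = c(1, 0, 2, 3),
                  CROPDMG = c(0, 500, NA, 0),
                  PROPDMG = c(1000, 2000, 300, 0))

# One row per EVTYPE; na.rm = TRUE lets the NA crop value count as zero.
totals <- toy[, list(Injured = sum(INJURIES),
                     Fatalities = sum(FATALITIES),
                     Dmg = sum(CROPDMG, PROPDMG, na.rm = TRUE)),
              by = EVTYPE]

# Top 2 event types by injuries, analogous to the top 15 used in the text.
top2 <- totals[order(-Injured)][1:2]
```

Note that naming the result `summary`, as the analysis does, masks the name of base R's summary() for variable lookups in that session; a name such as `totals` avoids the ambiguity.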
#As we can see if we ran this code, over 93% of the injuries happened in the top 15 injury-causing
#events. Thus, removing all but the top 15 has little to no effect on our results.
#injuries <- summary[order(-summary$Injured), ]
#sum(injuries$Injured[1:15])/sum(injuries$Injured[1:438])
#More than 85% of the fatalities happened in the top 15 events, ordered by fatalities.
#fatalities <- summary[order(-summary$Fatalities), ]
#sum(fatalities$Fatalities[1:15])/sum(fatalities$Fatalities[1:438])
#Lastly, over 92% of the damage is explained by the top 15 events, when ordered by damages.
#damage <- summary[order(-summary$Dmg), ]
#sum(damage$Dmg[1:15])/sum(damage$Dmg[1:438])
injuries <- summary[order(-summary$Injured), ][1:15, ]
fatalities <- summary[order(-summary$Fatalities), ][1:15, ]
damage <- summary[order(-summary$Dmg), ][1:15, ]
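The commented-out checks above hard-code the number of event types (438). A small helper can compute the same share for any vector; this is a sketch, and the function name `top_share` is hypothetical.

```r
# Share of the total accounted for by the n largest values of x.
top_share <- function(x, n = 15) {
  x <- sort(x, decreasing = TRUE)  # sort() also drops NA values
  sum(head(x, n)) / sum(x)
}

# Made-up example: the two largest values cover 80 of 100.
share <- top_share(c(50, 30, 10, 5, 5), n = 2)
```

Applied to the real table, a call such as `top_share(summary$Injured)` would give the fraction of injuries covered by the top 15 event types.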
Lastly, we adjust the EVTYPE names to amalgamate ones that clearly belong together (e.g. “TSTM WIND” and “THUNDERSTORM WIND” become just “THUNDERSTORM WIND”, and “HURRICANE/TYPHOON” and “HURRICANE” become “HURRICANE/TYPHOON”). One could make different arguments for how the EVTYPEs should be aggregated (or not), but upon manual inspection and comparison to page 6 of the data documentation, this seems to be the most logical and correct method. Note, for example, that the documentation treats Flash Flood and Flood as two separate items!
library(data.table)
injuries$EVTYPE[grepl("TSTM WIND",injuries$EVTYPE)] <- "THUNDERSTORM WIND"
fatalities$EVTYPE[grepl("TSTM WIND",fatalities$EVTYPE)] <- "THUNDERSTORM WIND"
damage$EVTYPE[grepl("TSTM WIND",damage$EVTYPE)] <- "THUNDERSTORM WIND"
injuries$EVTYPE[grepl("HURRICANE",injuries$EVTYPE)] <- "HURRICANE/TYPHOON"
fatalities$EVTYPE[grepl("HURRICANE",fatalities$EVTYPE)] <- "HURRICANE/TYPHOON"
damage$EVTYPE[grepl("HURRICANE",damage$EVTYPE)] <- "HURRICANE/TYPHOON"
injuries$EVTYPE[grepl("RIP CURRENTS",injuries$EVTYPE)] <- "RIP CURRENT"
fatalities$EVTYPE[grepl("RIP CURRENTS",fatalities$EVTYPE)] <- "RIP CURRENT"
damage$EVTYPE[grepl("RIP CURRENTS",damage$EVTYPE)] <- "RIP CURRENT"
injuries$EVTYPE[grepl("HEAT WAVE",injuries$EVTYPE)] <- "HEAT"
fatalities$EVTYPE[grepl("HEAT WAVE",fatalities$EVTYPE)] <- "HEAT"
damage$EVTYPE[grepl("HEAT WAVE",damage$EVTYPE)] <- "HEAT"
injuries$EVTYPE[grepl("RIVER FLOOD",injuries$EVTYPE)] <- "FLOOD"
fatalities$EVTYPE[grepl("RIVER FLOOD",fatalities$EVTYPE)] <- "FLOOD"
damage$EVTYPE[grepl("RIVER FLOOD",damage$EVTYPE)] <- "FLOOD"
#Now we re-summarize the data:
injuries <- injuries[, list(Injured = sum(Injured),
Fatalities = sum(Fatalities),
Dmg = sum(Dmg)),
by=EVTYPE]
fatalities <- fatalities[, list(Injured = sum(Injured),
Fatalities = sum(Fatalities),
Dmg = sum(Dmg)),
by=EVTYPE]
damage <- damage[, list(Injured = sum(Injured),
Fatalities = sum(Fatalities),
Dmg = sum(Dmg)),
by=EVTYPE]
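The fifteen near-identical replacement lines above could also be written as one pattern-to-label map applied in a loop. The sketch below assumes the same five patterns; the names `recode_map` and `merge_evtype` are hypothetical.

```r
# Pattern -> canonical label, mirroring the replacements used above.
recode_map <- c("TSTM WIND"    = "THUNDERSTORM WIND",
                "HURRICANE"    = "HURRICANE/TYPHOON",
                "RIP CURRENTS" = "RIP CURRENT",
                "HEAT WAVE"    = "HEAT",
                "RIVER FLOOD"  = "FLOOD")

merge_evtype <- function(evtype, map = recode_map) {
  evtype <- as.character(evtype)  # also handles factor input
  for (pattern in names(map)) {
    # fixed = TRUE matches the pattern as a literal substring
    evtype[grepl(pattern, evtype, fixed = TRUE)] <- map[[pattern]]
  }
  evtype
}

merged <- merge_evtype(c("TSTM WIND", "HURRICANE", "FLOOD",
                         "RIVER FLOOD", "TORNADO"))
```

Like the grepl() calls above, this matches substrings, so any label that merely contains a pattern (for example one ending in "TSTM WIND") is folded into the same category.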
We are now ready to chart and interpret our results.
library(ggplot2)
qplot(Injured, EVTYPE, data = injuries)
qplot(Fatalities, EVTYPE, data = fatalities)
We can clearly see that tornadoes have the most devastating impact health-wise: they are the leading EVTYPE for both injuries and deaths.
qplot(Dmg, EVTYPE, data = damage)
Here, we see that floods are the most detrimental from an economic standpoint, with hurricanes and tornadoes also taking a serious toll.
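The charts above use qplot(), which is marked as deprecated in recent ggplot2 releases. An equivalent dot plot can be built with ggplot() directly, and sorting the event types by value makes the ranking easier to read. This is a sketch on made-up totals, not the figures from the analysis.

```r
library(ggplot2)

# Made-up totals standing in for the `injuries` table.
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD", "HEAT"),
                  Injured = c(100, 20, 50))

# reorder() turns EVTYPE into a factor whose levels follow Injured,
# so the axis is sorted by magnitude rather than alphabetically.
toy$EVTYPE <- reorder(toy$EVTYPE, toy$Injured)

p <- ggplot(toy, aes(x = Injured, y = EVTYPE)) +
  geom_point() +
  labs(x = "Injuries", y = "Event type")
```

Printing `p` draws the chart; the same pattern would apply to the fatalities and damage tables by swapping the column mapped to x.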
One limitation worth mentioning: we have weighted property damage and crop damage equally, although one could argue that crop damage has a longer-lasting economic impact (i.e. more of a ripple effect). We understood the question to mean immediate financial impact as reported in the data, as opposed to impact on the overall U.S. economy. Also, we could not find or reasonably deduce what “h” meant in the exponent variable, so we treated it as zero.