This analysis aims to address two questions:
The analysis makes use of the NOAA NWS Storm Events Database, which may be obtained here. The data are subsetted to 1996-2011. Data are then aggregated across the 48 NWS Storm Event types and harm is measured in both total harm and harm per event.
These two approaches reveal some significant distinctions, as absolute damage is not necessarily tied to damage per event. In terms of absolute damage to human health, property, and agriculture, the most damaging events are tornado, flood, and drought respectively. However, in terms of damage per event, the most damaging event with respect to health is tsunami, and the most damaging event with respect to economics in general is hurricane/typhoon.
First, we download and load the file into a data.table. This allows for more efficient processing and aggregation than a data.frame. We need only load a subset of columns, as (for instance) precise geographical analysis and changes over time are both beyond the scope of this particular study.
library(data.table)
if (!file.exists("repdata_data_StormData.csv")) {
if (!file.exists("repdata_data_StormData.csv.bz2")) {
download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile = "repdata_data_StormData.csv.bz2")
}
bunzip2("repdata_data_StormData.csv.bz2")
}
rawData <- fread("repdata_data_StormData.csv",
nrows = 902297,
header = TRUE,
na.strings = "",
sep = ",",
select = c(2,8,23,24,25,26,27,28),
showProgress = FALSE)
The official NWS classification of storm events has been revised three times (see details); for this reason, only data from 1996 to present can be expected to contain the full range of events needed for analysis. We therefore can subset the data set to use only those rows reported in 1996 or later.
rawData$year <- year(as.Date(rawData$BGN_DATE, format = "%m/%d/%Y"))
post1996 <- rawData[rawData$year >= 1996, ]
Given that the Storm Events Database begins in 1950, this may seem like a substantial reduction amount of data under consideration, but it actually retains 72% of the data.
nrow(post1996) / nrow(rawData)
## [1] 0.7242959
Additionally, it may actually make the analysis more fair; 1996 dollar values and technology levels are much closer to 2011 dollar values and technology levels than they are to the 1951 equivalents.
The post-1996 data set contains human injury and fatality data expressed as raw numbers. Crop and property damage, however, are expressed with some exponent, e.g. 10 K meaning 10,000. To perform analysis on these figures, we can apply the exponents to the damage figures contained in the data.
applyExponent <- function (x, exponent) {
if (!is.na(x) && !is.na(exponent)) {
if (exponent == "K") { x * 10**3 }
else if (exponent == "M") { x * 10**6 }
else if (exponent == "B") { x * 10**9 }
else { x }
} else {
x
}
}
post1996$PROPVALUE <- mapply(x = post1996$PROPDMG, y = post1996$PROPDMGEXP,
FUN = function (x, y) applyExponent(x, y))
post1996$CROPVALUE <- mapply(x = post1996$CROPDMG, y = post1996$CROPDMGEXP,
FUN = function (x, y) applyExponent(x, y))
The next step is to organize the data into an easily aggregated form. Because the EVTYPE data is a somewhat free-form column, we need to map it onto the Event Types defined in section 7 of National Weather Service Instruction 10-1605.
post1996$EVTYPE <- gsub("[^[:alpha:] /]", "", post1996$EVTYPE) # Remove everything but numbers and slashes
post1996$EVTYPE <- gsub(" +", " ", post1996$EVTYPE) # Remove all sequences of multiple spaces
post1996$EVTYPE <- gsub("/$", "", post1996$EVTYPE) # Remove trailing slashes
post1996$EVTYPE <- gsub(" ?/ ?", "/", post1996$EVTYPE) # Remove any slashes around spaces
post1996$EVTYPE <- trimws(tolower(post1996$EVTYPE)) # Convert all text to lowercase and remove
# leading/trailing spaces.
The event database contains some rows that provide only summary information on periods of time, without being explicitly linked to an event classification. These rows can be excluded from the analysis.
post1996 <- post1996[!grepl("summary", post1996$EVTYPE)]
The next step is to classify events by searching for text. We can first create a column in the data:
post1996$eventType <- "Other"
At this point, everything has an eventType of “Other”. At any point, the items not classified with a true eventType can be viewed via the following:
unclassifiedTypes <- levels(as.factor(post1996[post1996$eventType=="Other",]$EVTYPE))
Initially, there are 352 distinct EVTYPE values.
length(unclassifiedTypes)
## [1] 352
The process of reclassification consists of making changes to the eventType value based on text contained in EVTYPE. This is probably this single most error-prone component of the analysis. The NWS classifications have very similar text-based descriptions in some places, so the general process taken for classification was as follows:
The specific textual analysis code is presented below.
post1996[post1996$EVTYPE == "astronomical low tide", ]$eventType <- "Astronomical Low Tide"
post1996[grepl("avalanche|slide", post1996$EVTYPE), ]$eventType <- "Avalanche"
post1996[grepl("blizzard", post1996$EVTYPE), ]$eventType <- "Blizzard"
post1996[grepl("coastal ?flood", post1996$EVTYPE), ]$eventType <- "Coastal Flood"
post1996[grepl("cold|chill|cool|low temp|hypothermia", post1996$EVTYPE) &
!grepl("extreme", post1996$EVTYPE), ]$eventType <- "Cold/Wind Chill"
# "Debris Flow": no obvious corresponding values
post1996[grepl("fog|vog", post1996$EVTYPE) & !grepl("freezing", post1996$EVTYPE), ]$eventType <- "Dense Fog"
post1996[grepl("smoke", post1996$EVTYPE), ]$eventType <- "Dense Smoke"
post1996[grepl("drought|dry|dri", post1996$EVTYPE), ]$eventType <- "Drought"
post1996[grepl("devil|devel", post1996$EVTYPE), ]$eventType <- "Dust Devil"
post1996[grepl("dust", post1996$EVTYPE) & !grepl("devil|devel", post1996$EVTYPE), ]$eventType <- "Dust Storm"
post1996[grepl("excessive heat", post1996$EVTYPE), ]$eventType <- "Excessive Heat"
post1996[grepl("cold|chill", post1996$EVTYPE) &
grepl("extreme", post1996$EVTYPE), ]$eventType <- "Extreme Cold/Wind Chill"
post1996[grepl("flood", post1996$EVTYPE) &
grepl("flash", post1996$EVTYPE), ]$eventType <- "Flash Flood"
post1996[grepl("flood|high water", post1996$EVTYPE) &
!grepl("flash", post1996$EVTYPE), ]$eventType <- "Flood"
post1996[grepl("frost|freeze", post1996$EVTYPE), ]$eventType <- "Frost/Freeze"
post1996[grepl("funnel", post1996$EVTYPE), ]$eventType <- "Funnel Cloud"
post1996[grepl("freezing fog", post1996$EVTYPE), ]$eventType <- "Freezing Fog"
post1996[grepl("hail", post1996$EVTYPE) &
!grepl("marine", post1996$EVTYPE), ]$eventType <- "Hail"
post1996[grepl("hot|warm|heat|hyperthermia", post1996$EVTYPE) &
!grepl("excessive", post1996$EVTYPE), ]$eventType <- "Heat"
post1996[grepl("rain", post1996$EVTYPE) &
!grepl("freezing", post1996$EVTYPE), ]$eventType <- "Heavy Rain"
post1996[grepl("snow", post1996$EVTYPE) &
grepl("heavy", post1996$EVTYPE), ]$eventType <- "Heavy Snow"
post1996[grepl("surf|seas|swell", post1996$EVTYPE), ]$eventType <- "High Surf"
post1996[grepl("wind", post1996$EVTYPE) &
!grepl("tstm|thunderstorm", post1996$EVTYPE), ]$eventType <- "High Wind"
post1996[grepl("hurricane|typhoon|floyd", post1996$EVTYPE), ]$eventType <- "Hurricane (Typhoon)"
post1996[grepl("ice|icy", post1996$EVTYPE), ]$eventType <- "Ice Storm"
post1996[grepl("lake ?effect", post1996$EVTYPE), ]$eventType <- "Lake-Effect Snow"
# "Lakeshore Flood": no obvious corresponding values
post1996[grepl("lightning", post1996$EVTYPE), ]$eventType <- "Lightning"
post1996[grepl("hail", post1996$EVTYPE) &
grepl("marine", post1996$EVTYPE), ]$eventType <- "Marine Hail"
# "Marine High Wind": no obvious corresponding values
# "Marine Strong Wind": no obvious corresponding values
post1996[grepl("coastal", post1996$EVTYPE), ]$eventType <- "Marine Thunderstorm Wind"
post1996[grepl("current", post1996$EVTYPE), ]$eventType <- "Rip Current"
post1996[grepl("seiche", post1996$EVTYPE), ]$eventType <- "Seiche"
post1996[grepl("sleet", post1996$EVTYPE), ]$eventType <- "Sleet"
post1996[grepl("surge|tide", post1996$EVTYPE), ]$eventType <- "Storm Surge/Tide"
post1996[grepl("(strong|gusty) wind|gusts|winds|wind damage|high wind", post1996$EVTYPE) &
!grepl("marine", post1996$EVTYPE), ]$eventType <- "Strong Wind"
post1996[grepl("tstm|thunderstorm", post1996$EVTYPE), ]$eventType <- "Thunderstorm Wind"
post1996[grepl("tornado", post1996$EVTYPE), ]$eventType <- "Tornado"
post1996[grepl("tropical depression", post1996$EVTYPE), ]$eventType <- "Tropical Depression"
post1996[grepl("tropical storm", post1996$EVTYPE), ]$eventType <- "Tropical Storm"
post1996[grepl("tsunami|rogue wave", post1996$EVTYPE), ]$eventType <- "Tsunami"
post1996[grepl("volcan", post1996$EVTYPE), ]$eventType <- "Volcanic Ash"
post1996[grepl("waterspout", post1996$EVTYPE), ]$eventType <- "Waterspout"
post1996[grepl("fire", post1996$EVTYPE), ]$eventType <- "Wildfire"
post1996[grepl("winter storm|freezing rain|freezing drizzle", post1996$EVTYPE), ]$eventType <- "Winter Storm"
post1996[grepl("winter|wintry|snow", post1996$EVTYPE) &
!grepl("storm|heavy", post1996$EVTYPE), ]$eventType <- "Winter Weather"
Anything not classified via the above code will keep the default eventType of “Other”.
After the processing above, the post-1996 data set contains the 48 NWS event types and an “other” category for those items whose classification was unclear.
We can aggregate the data by eventType now. There is no particularly numerically meaningful way of comparing fatalities and injuries, so we might be inclined to treat these as separate values. However, there is actually a relatively strong positive correlation between injuries and fatalities in the data set.
cor(post1996$FATALITIES, post1996$INJURIES)
## [1] 0.4262357
Therefore let us combine these values into a general “human toll” value:
healthEffects <- post1996[, list("totalHumans" = sum(FATALITIES+INJURIES),
"avgHumans" = mean(FATALITIES+INJURIES),
"numObservations" = .N),
by = eventType]
The correlation between crop damage and property damage is negligible:
cor(post1996$CROPVALUE, post1996$PROPVALUE)
## [1] 0.04759622
So we should not combine these.
economicEffects <- post1996[, list("totalProp" = sum(PROPVALUE),
"avgProp" = mean(PROPVALUE),
"totalCrop" = sum(CROPVALUE),
"avgCrop" = mean(CROPVALUE),
"numObservations" = .N),
by = eventType]
It is worth considering a few assumptions of this analysis. The fundamental question under consideration is which types of events have the greatest consequences in terms of harm.
Both data points are important, as weather is a generally local phenomenon and preparation for weather also is likely to be conducted at a local level. The data here could be used to determine geographical frequency of events, but not other factors in terms of actual likelihood. As an example, consider tsunamis. These are a very uncommon event in any location. Both Newport, RI and Des Moines, IA have not experienced recent tsunamis. Having at least some level of readiness for a tsunami in a coastal city like Newport, though, makes much more sense than tsunami preparations in a city like Des Moines (over 300 miles from the Great Lakes).
In short, we need to consider both the sum of damage (to understand the overall risk from event types) and the average damage (to understand the risk of individual events).
Having aggregated the data appropriately, we need only sort and plot:
library(ggplot2)
library(gridExtra)
mostEffect <- healthEffects[order(-totalHumans), .(eventType, totalHumans)][1:5]
totalPlot <- ggplot(mostEffect, aes(x=factor(eventType, levels=unique(eventType)), y=totalHumans)) +
geom_bar(stat="identity", colour="black", fill="darkred") +
ylab("Fatalities + injuries") +
xlab(NULL) +
theme_classic()
mostAvgEffect <- healthEffects[order(-avgHumans), .(eventType, avgHumans)][1:5]
avgPlot <- ggplot(mostAvgEffect, aes(x=factor(eventType, levels=unique(eventType)), y=avgHumans)) +
geom_bar(stat="identity", colour="black", fill="darkred") +
ylab("Fatalities + injuries per event") +
xlab(NULL) +
theme_classic()
grid.arrange(totalPlot, avgPlot, top="Figure 1: Fatalities and injuries by event type, 1996-2011")
In terms of absolute total of human health effects, the most harmful event is tornadoes, followed by excessive heat and floods.
However, the most harmful events in terms of fatalities and injuries per event are tsunamis and hurricanes/typhoons. It is worth noting that these two events are significantly less common than other events. This suggests that some additional risk assessment is most important in preparing for dangerous weather events; a tsunami is a rare occurrence, but the most likely to cause fatalities or injuries in the population when it happens.
We can perform an analogous comparison to the human toll analysis for property damage here.
mostProp <- economicEffects[order(-totalProp), .(eventType, totalProp)][1:5]
totalPlot <- ggplot(mostProp, aes(x=factor(eventType, levels=unique(eventType)), y=totalProp/10**9)) +
geom_bar(stat="identity", colour="black", fill="green") +
ylab("Total damage (billions $)") +
xlab(NULL) +
theme_classic()
mostAvgProp <- economicEffects[order(-avgProp), .(eventType, avgProp)][1:5]
avgPlot <- ggplot(mostAvgProp, aes(x=factor(eventType, levels=unique(eventType)), y=avgProp/10**6)) +
geom_bar(stat="identity", colour="black", fill="green") +
ylab("Damage per event (millions $)") +
xlab(NULL) +
theme_classic()
grid.arrange(totalPlot, avgPlot, top="Figure 2: Property damage by event type, 1996-2011")
Here we see that the most damaging events overall are flood, hurricane/typhoon, and storm surge/tide; all water-related events. However, the shape of the damage per event graph is extremely steep; hurricane/typhoon events cause vast amounts of property damage per event compared to all other types. Given that these events are already the second most harmful in terms of property damage cost, this suggests that planning absolutely needs to take local risk into consideration.
Crop damage is calculated with the same set of calculations:
mostCrop <- economicEffects[order(-totalCrop), .(eventType, totalCrop)][1:5]
totalPlot <- ggplot(mostCrop, aes(x=factor(eventType, levels=unique(eventType)), y=totalCrop/10**9)) +
geom_bar(stat="identity", colour="black", fill="blue") +
ylab("Total damage (billions $)") +
xlab(NULL) +
theme_classic()
mostAvgCrop <- economicEffects[order(-avgCrop), .(eventType, avgCrop)][1:5]
avgPlot <- ggplot(mostAvgCrop, aes(x=factor(eventType, levels=unique(eventType)), y=avgCrop/10**6)) +
geom_bar(stat="identity", colour="black", fill="blue") +
ylab("Damage per event (millions $)") +
xlab(NULL) +
theme_classic()
grid.arrange(totalPlot, avgPlot, top="Figure 3: Crop damage by event type, 1996-2011")
Here, the biggest cause of damage by overall cost is drought, which is unique as it does not appear as a top cause for either the human health toll or the property damage data. Hurricane/typhoon and flood are the next largest.
In terms of damage per event, we see again that hurricanes/typhoons are substantially more damaging single events than anything else.