Severe weather events can have profound impacts on both finances and public health, often leading to damages, injuries and fatalities. As such, it is vital to minimise the effects of these events wherever possible. To that end, this analysis explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which documents details of severe weather events across the United States. In particular, the analysis asks which types of events have the greatest economic consequences and which are most harmful to public health.
The analysis shows that between 1996 and 2011, Flooding, Hurricanes and Storm Surge/Tide events had the greatest economic consequences, causing combined damages of approximately $275 Billion.
Over the same period, Tornadoes, Excessive Heat and Flooding (or Flash Flooding) led to a high number of injuries and fatalities. Indeed, over 34,000 injuries and 4,000 deaths are attributable to these events.
To begin, the data is loaded into R.
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, "stormdata.csv.bz2")
stormdata <- read.csv("stormdata.csv.bz2", header = TRUE, na.strings = "")
According to the NOAA, data only started being collected on all event types in January 1996. Given that the focus of this analysis is to compare which types of events were the most severe, we can filter on events which began on or after January 1st 1996.
suppressMessages(library(dplyr))
stormdata$BGN_DATE <- as.Date(stormdata$BGN_DATE, format = "%m/%d/%Y")
stormdata <- filter(stormdata, BGN_DATE >= as.Date("1996-01-01"))
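The date conversion and cut-off can be illustrated on a few hand-written values. This is a minimal sketch: the sample strings are made up, and it assumes the raw BGN_DATE values consist of a month/day/year date followed by a time.

```r
# Assumed raw format (month/day/year followed by a time); values are made up
raw_dates <- c("12/31/1995 0:00:00", "1/1/1996 0:00:00", "4/18/2011 0:00:00")

# as.Date() reads only as much of each string as the format needs,
# so the trailing time text is ignored
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Same comparison as the filter above, done on a plain vector
keep <- parsed >= as.Date("1996-01-01")
print(data.frame(raw_dates, parsed, keep))
```

Only the rows flagged in keep would survive the filter.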
The focus of the analysis is on the financial and public health impacts, as measured by injuries, fatalities, property damage and crop damage. Hence, we can further reduce the size of the data set by keeping only rows which have at least one non-zero value for fatalities, injuries, property damage or crop damage.
stormdata <- filter(stormdata, FATALITIES > 0 | INJURIES > 0 | PROPDMG > 0 | CROPDMG > 0)
Looking at the event types, there are 222 distinct values rather than the 48 official types listed in the NOAA documentation.
length(unique(stormdata$EVTYPE))
## [1] 222
Several of the wind-related events have a gale force measurement in the name, for example THUNDERSTORM WIND (G40). This is not a feature of the official list, so the names are tidied by removing these measurements. In addition, any leading, trailing or repeated spaces are removed and the event names are converted to uppercase.
library(stringr)
cleanevents <- function(x){
  x <- gsub("\\s*[0-9]+", "", x)
  x <- gsub("\\s*\\(.*\\)", "", x)
  x <- str_squish(x) %>% toupper()
  x
}
stormdata$EVTYPE <- cleanevents(stormdata$EVTYPE)
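To see what the cleaning rules do, the sketch below applies the same two regular expressions to a few made-up labels. The function name clean_events_base is hypothetical, and the whitespace handling uses base R (trimws and gsub) as a stand-in for str_squish so that the example runs without extra packages.

```r
# Hypothetical base-R stand-in for the cleaning function above
clean_events_base <- function(x) {
  x <- gsub("\\s*[0-9]+", "", x)      # drop digits and any whitespace before them
  x <- gsub("\\s*\\(.*\\)", "", x)    # drop parenthesised text such as "(G)"
  x <- gsub("\\s+", " ", trimws(x))   # trim the ends and collapse repeated spaces
  toupper(x)
}

# Made-up labels; note that any parenthesised text is removed, not only measurements
samples <- c("THUNDERSTORM WIND (G40)", "  Tstm   Wind 45 ", "Hurricane (Typhoon)")
cleaned <- clean_events_base(samples)
print(cleaned)
```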
Following this, there are still 173 distinct event types, so some work remains to deal with unmatched events. Matching each of them manually to one of the 48 official events would be time-consuming. Instead, we will use the function amatch from the stringdist package, which tries to find matches between the events in the data set and the official list of events.
This function takes a parameter maxDist, which sets the maximum string distance allowed between two strings for them to be matched. Choosing its value is somewhat subjective: too small and not enough of the events will be matched; too large and all the events will be matched, but potentially incorrectly. Even a compromise between the two would be a subjective choice.
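The trade-off can be illustrated with a small sketch. It uses base R's adist (Levenshtein distance) as a stand-in, which is an assumption: amatch uses a different default distance, so exact values may differ. The helper nearest_match and the toy labels are hypothetical.

```r
# Toy subset of the official list and two made-up raw labels
official_toy <- c("THUNDERSTORM WIND", "HIGH WIND", "FLASH FLOOD")
raw <- c("TSTM WIND", "FLASH FLOODING")

# Hypothetical helper: nearest official name, or NA if it is further than max_dist
nearest_match <- function(x, table, max_dist) {
  d <- adist(x, table)                      # distance matrix, one row per element of x
  best <- apply(d, 1, which.min)            # column index of the closest official name
  best_d <- d[cbind(seq_along(x), best)]    # distance to that closest name
  ifelse(best_d <= max_dist, table[best], NA_character_)
}

strict <- nearest_match(raw, official_toy, max_dist = 3)
loose  <- nearest_match(raw, official_toy, max_dist = 8)
print(data.frame(raw, strict, loose))
```

With the larger threshold every label receives a match, but the closest string is not necessarily the intended category, so the matches are worth inspecting.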
The approach we will take is to compute the smallest value maxDist can take in order to match all the events.
library(stringdist)
official <- toupper(c("Astronomical Low Tide",
"Avalanche",
"Blizzard",
"Coastal Flood",
"Cold/Wind Chill",
"Debris Flow",
"Dense Fog",
"Dense Smoke",
"Drought",
"Dust Devil",
"Dust Storm",
"Excessive Heat",
"Extreme Cold/Wind Chill",
"Flash Flood",
"Flood",
"Frost/Freeze",
"Funnel Cloud",
"Freezing Fog",
"Hail",
"Heat",
"Heavy Rain",
"Heavy Snow",
"High Surf",
"High Wind",
"Hurricane (Typhoon)",
"Ice Storm",
"Lake-Effect Snow",
"Lakeshore Flood",
"Lightning",
"Marine Hail",
"Marine High Wind",
"Marine Strong Wind",
"Marine Thunderstorm Wind",
"Rip Current",
"Seiche",
"Sleet",
"Storm Surge/Tide",
"Strong Wind",
"Thunderstorm Wind",
"Tornado",
"Tropical Depression",
"Tropical Storm",
"Tsunami",
"Volcanic Ash",
"Waterspout",
"Wildfire",
"Winter Storm",
"Winter Weather"))
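# Increase maxDist until every distinct event type has a match
indicator <- FALSE
m <- 1
while (!indicator) {
  matching <- amatch(unique(stormdata$EVTYPE), official, maxDist = m)
  indicator <- all(!is.na(matching))
  m <- m + 1
}
# m was incremented once more after the final pass, so subtract 1
m - 1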
## [1] 16
The output of this computation shows that 16 is the smallest value of maxDist for which every event is matched.
We then run the matching using amatch with this value for maxDist and confirm all the events have been matched.
matching <- amatch(stormdata$EVTYPE, official, maxDist = m - 1)
sum(is.na(matching))
## [1] 0
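The smallest workable threshold can also be cross-checked without a loop: it is the largest of the per-label nearest distances. The sketch below shows this on toy data and builds a small review table at the same time. It uses base R's adist (Levenshtein distance) rather than the distance used by amatch, so it is an approximation under that assumption, and all labels are made up.

```r
# Toy data: a few official names and made-up raw labels
official_toy <- c("THUNDERSTORM WIND", "HIGH WIND", "FLASH FLOOD", "WILDFIRE")
raw <- c("TSTM WIND", "FLASH FLOODING", "WILD FIRES")

d <- adist(raw, official_toy)              # distance from each raw label to each official name
best <- apply(d, 1, which.min)             # index of the nearest official name

# Review table: one row per raw label, for a manual sanity check
review <- data.frame(raw = raw,
                     matched = official_toy[best],
                     distance = apply(d, 1, min))

# The smallest threshold that matches every label is the largest nearest distance
min_max_dist <- max(review$distance)
print(review)
print(min_max_dist)
```

Rows with a large distance are the ones most likely to have been assigned to the wrong official event.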
Next we substitute the events in the data set for their corresponding matched event and confirm we are left with only the official 48 events.
# Vectorised lookup: replace each event with its matched official name
stormdata$EVTYPE <- official[matching]
unique(stormdata$EVTYPE)
## [1] "WINTER STORM" "TORNADO"
## [3] "HIGH WIND" "FLASH FLOOD"
## [5] "FREEZING FOG" "FUNNEL CLOUD"
## [7] "LIGHTNING" "HAIL"
## [9] "FLOOD" "MARINE HAIL"
## [11] "EXCESSIVE HEAT" "RIP CURRENT"
## [13] "HEAT" "HEAVY SNOW"
## [15] "WILDFIRE" "ICE STORM"
## [17] "BLIZZARD" "STORM SURGE/TIDE"
## [19] "COASTAL FLOOD" "DUST STORM"
## [21] "STRONG WIND" "DUST DEVIL"
## [23] "HIGH SURF" "HEAVY RAIN"
## [25] "AVALANCHE" "SLEET"
## [27] "DROUGHT" "WATERSPOUT"
## [29] "FROST/FREEZE" "SEICHE"
## [31] "TROPICAL STORM" "DEBRIS FLOW"
## [33] "COLD/WIND CHILL" "HURRICANE (TYPHOON)"
## [35] "EXTREME COLD/WIND CHILL" "DENSE FOG"
## [37] "WINTER WEATHER" "THUNDERSTORM WIND"
## [39] "LAKE-EFFECT SNOW" "VOLCANIC ASH"
## [41] "MARINE HIGH WIND" "ASTRONOMICAL LOW TIDE"
## [43] "TROPICAL DEPRESSION" "TSUNAMI"
## [45] "LAKESHORE FLOOD" "MARINE THUNDERSTORM WIND"
## [47] "MARINE STRONG WIND" "DENSE SMOKE"
Having dealt with the event types, we can next move onto tidying up the property and crop damages. These are split into columns for values (PROPDMG & CROPDMG) and columns for their exponents (PROPDMGEXP & CROPDMGEXP). Let’s take a look at the exponents.
exponents <- unique(c(unique(stormdata$PROPDMGEXP), unique(stormdata$CROPDMGEXP)))
exponents
## [1] "K" NA "M" "B"
The NOAA documentation explains that K represents thousands, M represents millions and B represents billions. We convert these exponents to their numeric multipliers, treating missing exponents as 0.
suppressMessages(library(tidyr))
fixexp <- function(x){
  x <- str_replace(x, "[Kk]", "1000") %>%
    str_replace("[Mm]", "1000000") %>%
    str_replace("[Bb]", "1000000000") %>%
    replace_na("0")
  suppressWarnings(as.numeric(x))
}
stormdata$PROPDMGEXP <- fixexp(stormdata$PROPDMGEXP)
stormdata$CROPDMGEXP <- fixexp(stormdata$CROPDMGEXP)
Finally, we combine the values and exponents to get actual values for property and crop damages.
stormdata$PropertyDamage <- stormdata$PROPDMG * stormdata$PROPDMGEXP
stormdata$CropDamage <- stormdata$CROPDMG * stormdata$CROPDMGEXP
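As a worked example of the value-times-multiplier step, the sketch below uses a named lookup vector instead of string replacement. The helper exp_multiplier and the sample figures are hypothetical. Unrecognised codes are mapped to 0 here, which is an assumption of this sketch rather than the behaviour of the function above.

```r
# Hypothetical helper: map exponent codes to numeric multipliers
exp_multiplier <- function(code) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(code)])   # NA for missing or unrecognised codes
  mult[is.na(mult)] <- 0                  # treat those as a multiplier of 0
  mult
}

# Made-up damage values and exponent codes
dmg <- c(25, 1.5, 2, 10)
exp_code <- c("K", "M", "B", NA)

# 25 K is 25,000; 1.5 M is 1,500,000; 2 B is 2,000,000,000; missing code gives 0
actual <- dmg * exp_multiplier(exp_code)
print(actual)
```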
This concludes the tidying and pre-processing of the data.
Having tidied the data set, we can now address the two focuses of the analysis: which events have the greatest financial impact and which have the greatest public health impact.
We compute the property and crop damages by event type.
propertyDamage <- tapply(stormdata$PropertyDamage, stormdata$EVTYPE, sum, na.rm = TRUE)
cropDamage <- tapply(stormdata$CropDamage, stormdata$EVTYPE, sum, na.rm = TRUE)
Next, we create a data frame of the damages by event type, compute the total damages and filter on the top 10 events by total damages.
suppressMessages(library(reshape2))
damagesdf <- data.frame(Event = names(propertyDamage),
                        Property = as.numeric(propertyDamage),
                        Crop = as.numeric(cropDamage),
                        stringsAsFactors = TRUE) %>%
  mutate(Total = rowSums(across(where(is.numeric)))) %>%
  arrange(desc(Total)) %>%
  head(10) %>%
  melt(id.vars = c("Event", "Total"), value.name = "Damages")
Note that the top 10 events by total damages account for nearly 94% of all damages; this justifies only focusing on the top 10 events.
sum(damagesdf$Damages)/sum(propertyDamage, cropDamage)
## [1] 0.9373861
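The share calculation is the sum of the largest n totals divided by the sum of all totals. A minimal sketch with a hypothetical helper top_share and made-up totals:

```r
# Made-up totals per event type (illustrative only, not taken from the data)
totals <- c(FLOOD = 150, HURRICANE = 75, HAIL = 20, DROUGHT = 5)

# Hypothetical helper: share of the overall total held by the n largest entries
top_share <- function(totals, n) {
  ranked <- sort(totals, decreasing = TRUE)
  sum(head(ranked, n)) / sum(ranked)
}

# (150 + 75) / 250
share <- top_share(totals, 2)
print(share)
```

The same helper applies unchanged to fatalities or injuries per event type.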
We now plot the top 10 events by total damages.
library(ggplot2)
library(scales)
damagesdf$Event <- factor(damagesdf$Event, levels = unique(damagesdf$Event))
ggplot(damagesdf, aes(x = Damages, y = Event, fill = variable)) +
  geom_bar(stat = "identity") +
  scale_x_continuous(labels = comma) +
  labs(title = "US Severe Weather Damages 1996 - 2011") +
  xlab("Damages ($)")
Clearly, the vast majority of damages come from property damage rather than crop damage. The exception is drought, for which the damages consist mostly of crop damage.
Flooding was by far the most damaging event financially, costing around $150 Billion. Hurricanes and Storm Surge/Tide were also particularly costly, at around $75 Billion and $50 Billion respectively. Reducing the damages caused by these three events (especially floods) should be a focus.
We compute the number of fatalities for each event type and filter on the top 20.
fatalities <- tapply(stormdata$FATALITIES, stormdata$EVTYPE, sum)
fatalitiesdf <- data.frame(Event = names(fatalities), Fatalities = as.numeric(fatalities), stringsAsFactors = TRUE) %>%
  arrange(desc(Fatalities)) %>%  # sort by the data frame column, not the global vector
  head(20)
Note that the top 20 events account for over 94% of all fatalities; this justifies only focusing on the top 20 events.
sum(fatalitiesdf$Fatalities)/sum(stormdata$FATALITIES)
## [1] 0.9421667
We now plot the top 20 events by fatalities.
fatalitiesdf$Event <- factor(fatalitiesdf$Event, levels = fatalitiesdf$Event)
ggplot(fatalitiesdf, aes(x = Fatalities, y = Event)) +
  geom_bar(stat = "identity", fill = "red") +
  geom_text(aes(label = Fatalities), hjust = -0.1, colour = "black") +
  xlim(0, 2200) +
  labs(title = "US Severe Weather Fatalities 1996 - 2011")
Excessive Heat and Tornadoes were the deadliest events, with over 3,300 deaths attributable to them. Flash Floods were also deadly, with 889 deaths.
We next compute the number of injuries for each event type and filter on the top 20.
injuries <- tapply(stormdata$INJURIES, stormdata$EVTYPE, sum)
injuriesdf <- data.frame(Event = names(injuries), Injuries = as.numeric(injuries), stringsAsFactors = TRUE) %>%
  arrange(desc(Injuries)) %>%  # sort by the data frame column, not the global vector
  head(20)
Note that the top 20 events account for over 97% of all injuries; this justifies only focusing on the top 20 events.
sum(injuriesdf$Injuries)/sum(stormdata$INJURIES)
## [1] 0.9740233
We now plot the top 20 events by injuries.
injuriesdf$Event <- factor(injuriesdf$Event, levels = injuriesdf$Event)
ggplot(injuriesdf, aes(x = Injuries, y = Event)) +
  geom_bar(stat = "identity", fill = "red") +
  geom_text(aes(label = Injuries), hjust = -0.1, colour = "black") +
  xlim(0, 25000) +
  labs(title = "US Severe Weather Injuries 1996 - 2011") +
  theme_bw()
Tornadoes were by far the most dangerous event with respect to injuries, causing over 20,000 of them. Excessive Heat and Flooding were also dangerous, with a combined 14,191 injuries attributable to the two events.
Overall, Tornadoes and Excessive Heat led to many injuries and fatalities; as such, any effort to reduce the public health impact of severe weather events should begin with these two events. Flooding (or Flash Flooding) was also dangerous and should likewise be a focus of efforts to reduce injuries and fatalities.