Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The objective of this analysis to identify which severe weather events 1) are most harmful with respect to population health, 2) have the greatest economic consequences.
The global options for the document are set with echo = TRUE so that reviewers can see the code, data is lazy-loaded from previously saved databases with cache = TRUE, and figures sizes are pre-set.
# set the global options
knitr::opts_chunk$set(echo = TRUE, cache = TRUE, warning = FALSE, message = FALSE, fig.width=12, fig.height=8)
The analysis requires the r packages dplyr, lubridate, stringr, ggplot2, scales, and gridExtra.
# load the required packages
library(dplyr)
library(lubridate)
library(stringr)
library(ggplot2)
library(scales)
library(gridExtra)
The data for the analysis is downloaded and the generated csv is read in R.
# download file
if (!"StormData.csv.bz2" %in% dir("./")) {
download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", destfile = "StormData.csv.bz2", mode="wb")
}
# read file into r
dataset <- read.csv("StormData.csv.bz2", stringsAsFactors = FALSE)
The documentation of the database is available at the following links
Exploring the original dataset, we find out that there are 902,297 rows and 37 columns in total. The events in the database start in the year 1950 and end in November 2011.
# dimensions of dataset
dim(dataset)
## [1] 902297 37
# names of the columns
names(dataset)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
We create a subset with only the information relevant for the analysis,ie, types of weather events, and their impact on health and the economy. The selected columns are:
# Subset dataset
impact <- select(dataset, BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
dim(impact)
## [1] 902297 8
In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of reliable/complete records. This is shown by plotting the histogram of the observations over the years.
# To assist in date analysis, dates are converted to POSIXct format
impact <- mutate(impact, BGN_DATE = mdy(str_extract(impact$BGN_DATE, "[^ ]+")), YEAR = year(BGN_DATE))
# Plot the histogram of observations over the years
ggplot(data = impact, aes(impact$YEAR)) +
geom_histogram(binwidth = 3) +
scale_x_continuous(limits=c(1950,2012), breaks = seq(1950,2011, by=5)) +
geom_vline(xintercept=1995) +
ggtitle("Histogram of Weather Events") +
xlab("Year") +
ylab("Frequency")+
theme_bw() +
theme(
panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_line(colour = "lightgray"),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
legend.position = "bottom",
legend.title = element_blank(),
legend.key = element_blank()
)
Fig.1 Frequency of Observations from 1950 to 2011
As we can see from the above chart, the dataset starts to have more records after the year 1995. So we decide to retain only the rows after 1995 and discard the remaining rows.
impact <- filter(impact, YEAR>=1995)
dim(impact)
## [1] 681500 9
The list of events within the column impact$EVTYPE include several repetitions, typos, and categories are generally not consistent.
To group and cleanse the impact$EVTYPE column, we use the list of events included in the storm data event table, available from the National Weather Service Instruction Directive.
# clean EVTYPE
impact <- mutate(impact, EVTYPE = ifelse(grepl("astro", EVTYPE, ignore.case = TRUE), "Astronomical Low Tide", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("avalanche|slide|slump|landspout", EVTYPE, ignore.case = TRUE), "Avalanche", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("blizzard", EVTYPE, ignore.case = TRUE), "Blizzard", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("coast|beach|tide", EVTYPE, ignore.case = TRUE), "Coastal Flood", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("cold|hypo|cool|low", EVTYPE, ignore.case = TRUE), "Cold/Windhill", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("debris", EVTYPE, ignore.case = TRUE), "Debris Flow", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("fog|vog", EVTYPE, ignore.case = TRUE), "Dense Fog", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("smoke", EVTYPE, ignore.case = TRUE), "Dense Smoke", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("drought|dry|driest", EVTYPE, ignore.case = TRUE), "Drought", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("devil|devel", EVTYPE, ignore.case = TRUE), "Dust Devil", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("dust storm|blowing|saharan", EVTYPE, ignore.case = TRUE), "Dust Storm", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("heat|record high|warm|hot|temperature|wet", EVTYPE, ignore.case = TRUE), "Excessive Heat", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("flood|sea|water|swell|stream|dam|floyd", EVTYPE, ignore.case = TRUE), "Flood", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("flash", EVTYPE, ignore.case = TRUE), "Flash Flood", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("freez|frost", EVTYPE, ignore.case = TRUE), "Frost/Freeze", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("funnel|cloud", EVTYPE, ignore.case = TRUE), "Funnelloud", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("freezing fog", EVTYPE, ignore.case = TRUE), "Freezing Fog", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("hail", EVTYPE, ignore.case = TRUE), "Hail", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("rain|precipitation|shower", EVTYPE, ignore.case = TRUE), "Heavy Rain", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("snow", EVTYPE, ignore.case = TRUE), "Heavy Snow", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("surf", EVTYPE, ignore.case = TRUE), "High Surf", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("wind|wnd", EVTYPE, ignore.case = TRUE), "High Wind", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("hurrican|typhoon", EVTYPE, ignore.case = TRUE), "Hurricane", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("ice|glaze|icy", EVTYPE, ignore.case = TRUE), "Ice Storm", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("lake.snow", EVTYPE, ignore.case = TRUE), "Lake-Effect Snow", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("lake.flood", EVTYPE, ignore.case = TRUE), "Lakeshore Flood", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("lightning|LIGNTNING", EVTYPE, ignore.case = TRUE), "Lightning", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("marine hail", EVTYPE, ignore.case = TRUE), "Marine Hail", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("marine high|mishap|accident|drowning|wave", EVTYPE, ignore.case = TRUE), "Marine High Wind", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("marine strong wind", EVTYPE, ignore.case = TRUE), "Marine Strong Wind", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("thunder|TSTM", EVTYPE, ignore.case = TRUE), "Thunderstorm Wind", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("MARINE TSTM WIND|MARINE THUNDERSTORM WIND", EVTYPE, ignore.case = TRUE), "Marine Thunderstorm Wind", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("rip", EVTYPE, ignore.case = TRUE), "Ripurrent", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("seiche", EVTYPE, ignore.case = TRUE), "Seiche", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("sleet", EVTYPE, ignore.case = TRUE), "Sleet", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("surge", EVTYPE, ignore.case = TRUE), "Storm Surge/Tide", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("^strong wind", EVTYPE, ignore.case = TRUE), "Strong Wind", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("tornado|burst|gustnado", EVTYPE, ignore.case = TRUE), "Tornado", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("depression", EVTYPE, ignore.case = TRUE), "Tropical Depression", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("tropical storm", EVTYPE, ignore.case = TRUE), "Tropical Storm", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("tsunami", EVTYPE, ignore.case = TRUE), "Tsunami", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("volcanic", EVTYPE, ignore.case = TRUE), "Volcanic Ash", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("waterspout", EVTYPE, ignore.case = TRUE), "Waterspout", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("fire", EVTYPE, ignore.case = TRUE), "Wildfire", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("winter storm", EVTYPE, ignore.case = TRUE), "Winter Storm", EVTYPE))
impact <- mutate(impact, EVTYPE = ifelse(grepl("winter weather|wintry|wintery", EVTYPE, ignore.case = TRUE), "Winter Weather", EVTYPE))
The variables PROPDMGEXP and CROPDMGEXP are then converted to numeric
# Format damage valuation magnitude as characters for re-coding:
impact$PROPDMGEXP <- as.character(impact$PROPDMGEXP)
impact$CROPDMGEXP <- as.character(impact$CROPDMGEXP)
# Create numeric values to represent the coded damage valuation magnitudes
# (i.e. "h" = 100; "k" = 1,000; "m" = 1,000,000; etc.)
# re-code PROPDMGEXP variable
impact$PROPDMGEXP[(impact$PROPDMGEXP == "h") | (impact$PROPDMGEXP == "H")] <- 100
impact$PROPDMGEXP[(impact$PROPDMGEXP == "k") | (impact$PROPDMGEXP == "K")] <- 1000
impact$PROPDMGEXP[(impact$PROPDMGEXP == "m") | (impact$PROPDMGEXP == "M")] <- 1000000
impact$PROPDMGEXP[(impact$PROPDMGEXP == "b") | (impact$PROPDMGEXP == "B")] <- 1000000000
impact$PROPDMGEXP[(impact$PROPDMGEXP == "0")] <- 1
impact$PROPDMGEXP[(impact$PROPDMGEXP == "1")] <- 10
impact$PROPDMGEXP[(impact$PROPDMGEXP == "2")] <- 100
impact$PROPDMGEXP[(impact$PROPDMGEXP == "3")] <- 1000
impact$PROPDMGEXP[(impact$PROPDMGEXP == "4")] <- 10000
impact$PROPDMGEXP[(impact$PROPDMGEXP == "5")] <- 100000
impact$PROPDMGEXP[(impact$PROPDMGEXP == "6")] <- 1000000
impact$PROPDMGEXP[(impact$PROPDMGEXP == "7")] <- 10000000
impact$PROPDMGEXP[(impact$PROPDMGEXP == "8")] <- 100000000
impact$PROPDMGEXP[(impact$PROPDMGEXP == "") | (impact$PROPDMGEXP == "?")] <- 0
impact$PROPDMGEXP[(impact$PROPDMGEXP == "+") | (impact$PROPDMGEXP == "-")] <- 0
# re-code CROPDMGEXP variable
impact$CROPDMGEXP[(impact$CROPDMGEXP == "h") | (impact$CROPDMGEXP == "H")] <- 100
impact$CROPDMGEXP[(impact$CROPDMGEXP == "k") | (impact$CROPDMGEXP == "K")] <- 1000
impact$CROPDMGEXP[(impact$CROPDMGEXP == "m") | (impact$CROPDMGEXP == "M")] <- 1000000
impact$CROPDMGEXP[(impact$CROPDMGEXP == "b") | (impact$CROPDMGEXP == "B")] <- 1000000000
impact$CROPDMGEXP[(impact$CROPDMGEXP == "0")] <- 1
impact$CROPDMGEXP[(impact$CROPDMGEXP == "1")] <- 10
impact$CROPDMGEXP[(impact$CROPDMGEXP == "2")] <- 100
impact$CROPDMGEXP[(impact$CROPDMGEXP == "3")] <- 1000
impact$CROPDMGEXP[(impact$CROPDMGEXP == "4")] <- 10000
impact$CROPDMGEXP[(impact$CROPDMGEXP == "5")] <- 100000
impact$CROPDMGEXP[(impact$CROPDMGEXP == "6")] <- 1000000
impact$CROPDMGEXP[(impact$CROPDMGEXP == "7")] <- 10000000
impact$CROPDMGEXP[(impact$CROPDMGEXP == "8")] <- 100000000
impact$CROPDMGEXP[(impact$CROPDMGEXP == "") | (impact$CROPDMGEXP == "?")] <- 0
impact$CROPDMGEXP[(impact$CROPDMGEXP == "+") | (impact$CROPDMGEXP == "-")] <- 0
# convert PROPDMGEXP & CROPDMGEXP variables to integers
impact$PROPDMGEXP <- as.integer(impact$PROPDMGEXP)
impact$CROPDMGEXP <- as.integer(impact$CROPDMGEXP)
We get total property and crop damage values by multiplying the valuation and magnitude columns for both the property damage (PROPDMG * PROPDMGEXP) and crop damage (CROPDMG * CROPDMGEXP) variables. Then we combine these two values to calculate an aggregated total economic damage valuation (total_property_damage + total_crop_damage).
impact$total_property_damage <- impact$PROPDMG * impact$PROPDMGEXP
impact$total_crop_damage <- impact$CROPDMG * impact$CROPDMGEXP
impact$total_economic_damage <- impact$total_property_damage + impact$total_crop_damage
EVTYPE variable) are most harmful with respect to population health?In this section we want to find the top 10 of weather events that caused the most fatalities and injuries.
We group fatalities and injuries by event type and get the top 10 event types which are the most harmful to public health:
# group fatalities by event type and sort event types by number of fatalities in decreasing order
aggrfatalities <- aggregate(FATALITIES~EVTYPE, data = impact, "sum")
fatalities <- aggrfatalities[order(-aggrfatalities$FATALITIES), ][1:10, ]
fatalities$EVTYPE <- factor(fatalities$EVTYPE, levels = fatalities$EVTYPE)
fatalities
## EVTYPE FATALITIES
## 10 Excessive Heat 3092
## 105 Tornado 1545
## 11 Flood 1450
## 19 High Wind 1269
## 23 Lightning 730
## 33 Ripurrent 564
## 1 Avalanche 267
## 113 Winter Storm 195
## 18 High Surf 163
## 20 Hurricane 133
# group injuries by event type and sort event types by number of injuries in decreasing order
aggrinjuries <- aggregate(INJURIES~EVTYPE, data = impact, "sum")
injuries <- aggrinjuries[order(-aggrinjuries$INJURIES), ][1:10, ]
injuries$EVTYPE <- factor(injuries$EVTYPE, levels = injuries$EVTYPE)
injuries
## EVTYPE INJURIES
## 105 Tornado 21783
## 10 Excessive Heat 9107
## 11 Flood 8659
## 19 High Wind 7366
## 23 Lightning 4633
## 111 Wildfire 1458
## 20 Hurricane 1330
## 113 Winter Storm 1298
## 14 Hail 1021
## 4 Dense Fog 994
As we can see from the chart below, Excessive Heat is to blame for the largest number of fatalities, while Tornado caused the highest number of injuries.
# fatalities plot
fatalities_plot <- ggplot(fatalities, aes(x=EVTYPE, y=FATALITIES, fill=FATALITIES)) + geom_bar(stat = "identity") +
ylab("Number of Fatalities") +
ggtitle ("Total fatalities per event type") +
theme(axis.text=element_text(size=11),
axis.title.x=element_blank(),
panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_line(colour = "lightgray"),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank())
# injuries plot
injuries_plot <- ggplot(injuries, aes(x=EVTYPE, y=INJURIES, fill=INJURIES)) + geom_bar(stat = "identity") +
ylab("Number of Injuries") +
ggtitle ("Total injuries per event type") +
theme(axis.text=element_text(size=11),
axis.title.x=element_blank(),
panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_line(colour = "lightgray"),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank())
# arrange plots in one chart
grid.arrange(fatalities_plot, injuries_plot, nrow=2)
Fig.2 Impact of Weather Events on Public Health from 1995 to 2011
We can apply the same logic to identify the events which cause the greatest economic impacts. In order to extract the top 10 events which cause the most economic damages, we will use the total_economic_damage variable, which included both damage to properties and to crops.
# group economic damage by event type and sort event types by economic impact in decreasing order
aggr_economic_impact <- aggregate(total_economic_damage~EVTYPE, data = impact, "sum")
economic_impact <- aggr_economic_impact [order(-aggr_economic_impact$total_economic_damage), ][1:10, ]
economic_impact$EVTYPE <- factor(economic_impact$EVTYPE, levels = economic_impact$EVTYPE)
economic_impact
## EVTYPE total_economic_damage
## 11 Flood 172908397730
## 20 Hurricane 90655952810
## 37 Storm Surge/Tide 43193541000
## 105 Tornado 25237340320
## 14 Hail 18035235127
## 19 High Wind 17806783671
## 6 Drought 14970175380
## 107 Tropical Storm 8352221550
## 111 Wildfire 8163274130
## 16 Heavy Rain 3895108240
Then we plot the top 10 events with the greatest economic impact.
# economic impact plot
economic_plot <- ggplot(economic_impact, aes(x=EVTYPE, y=total_economic_damage/1000000000, fill=total_economic_damage/1000000000)) + geom_bar(stat = "identity") +
ylab("USD (Billions)") +
ggtitle ("Total economic damage per event type") +
scale_fill_continuous(name="USD (Billions)") +
theme(axis.text=element_text(size=10),
axis.title.x=element_blank(),
panel.background = element_blank(),
panel.border = element_blank(),
panel.grid.major = element_line(colour = "lightgray"),
panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank())
economic_plot
Fig.3 Impact of Weather Events on Economy from 1995 to 2011
As we can see from Fig.3, the weather events which caused the greatest economic damage from 1995 to 2011 were Floods (~$173 billions), followed by Hurricanes (~$91 billions), and Storm Surges (~$43 billions).