This data analysis looks at the severe weather events recorded in the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database in the period from 1950 to 2011. The data contains information on fatalities and injuries as well as property and crop damage that resulted from severe weather conditions. The information was grouped and aggregated to present the total health and total damage figures over the period for the most significant events. Clustering the weather events into the 48 standard NOAA weather event types was considered, but the standard “pmatch” function, together with small adjustments to improve the grouping of key events, showed that the top 10 events cover well over 98% of the total victims and damage. Hence more sophisticated clustering was deemed unnecessary. Tornadoes, thunderstorms and excessive heat are the key causes of weather victims, while floods are the most significant cause of damage, followed by hurricanes and tornadoes.
Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
Download, decompress and read data.
# Download dataset
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists("StormData.csv.bz2")) {
  download.file(fileUrl, destfile = "StormData.csv.bz2", method = "curl")
  message("Storm data downloaded on: ", date())
}
# Unzip dataset and read into dataframe
connection <- bzfile("StormData.csv.bz2", "r")
stormData <- read.table(connection, sep = ",", header = TRUE, fill = TRUE)
close(connection)
# Show dimensions of data
dim(stormData)
## [1] 902297 37
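As a side note, R's file connections detect bzip2 compression on their own, so `read.csv()` can usually be pointed straight at a `.bz2` file without opening a `bzfile()` connection by hand. The sketch below shows this on a small temporary file. The toy columns and the `tmp` path are made up for illustration and are not part of the storm data.

```r
# Write a tiny CSV through a bzip2 connection (toy data, for illustration only)
tmp <- tempfile(fileext = ".csv.bz2")
toy <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(1, 0))
con <- bzfile(tmp, "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the path with file(), which detects the compression itself
roundTrip <- read.csv(tmp, stringsAsFactors = FALSE)
dim(roundTrip)
```

Compared with `read.table()`, `read.csv()` also defaults to `header = TRUE`, `sep = ","`, `fill = TRUE` and `comment.char = ""`, which suits free-text remark columns.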
Load all required libraries and define a short function to format large numbers.
library(dplyr)
library(stringr)
library(scales)
library(tidyr)
library(ggplot2)
Print <- function(x) formatC(x, decimal.mark="", big.mark=",", digits = 0, format = "f")
Showing the health and economic effects with the current list of event types (EVTYPE) is not very effective as this list contains 985 different weather events.
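To reduce and normalize this list, the following steps are taken. First, only the records that report at least one fatality or injury, or some property or crop damage, are kept: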
healthEffect <- filter(stormData, FATALITIES > 0 | INJURIES > 0)
damageEffect <- filter(stormData, PROPDMG > 0 | CROPDMG > 0)
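Calculate the actual damage by multiplying each damage figure by the magnitude given in its exponent column (K for thousands, M for millions, B for billions; an empty exponent leaves the figure unchanged). Records with invalid exponents are a small minority and are ignored to avoid making invalid damage assumptions.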
for (i in seq_along(damageEffect$EVTYPE)) {
  # Property damage calculation
  if (damageEffect$PROPDMGEXP[i] == "K" | damageEffect$PROPDMGEXP[i] == "k") {
    damageEffect$PROPDAMAGE[i] <- damageEffect$PROPDMG[i] * 1000
  } else if (damageEffect$PROPDMGEXP[i] == "M" | damageEffect$PROPDMGEXP[i] == "m") {
    damageEffect$PROPDAMAGE[i] <- damageEffect$PROPDMG[i] * 1000000
  } else if (damageEffect$PROPDMGEXP[i] == "B" | damageEffect$PROPDMGEXP[i] == "b") {
    damageEffect$PROPDAMAGE[i] <- damageEffect$PROPDMG[i] * 1000000000
  } else if (damageEffect$PROPDMGEXP[i] == "") {
    damageEffect$PROPDAMAGE[i] <- damageEffect$PROPDMG[i]
  }
  # Crop damage calculation
  if (damageEffect$CROPDMGEXP[i] == "K" | damageEffect$CROPDMGEXP[i] == "k") {
    damageEffect$CROPDAMAGE[i] <- damageEffect$CROPDMG[i] * 1000
  } else if (damageEffect$CROPDMGEXP[i] == "M" | damageEffect$CROPDMGEXP[i] == "m") {
    damageEffect$CROPDAMAGE[i] <- damageEffect$CROPDMG[i] * 1000000
  } else if (damageEffect$CROPDMGEXP[i] == "B" | damageEffect$CROPDMGEXP[i] == "b") {
    damageEffect$CROPDAMAGE[i] <- damageEffect$CROPDMG[i] * 1000000000
  } else if (damageEffect$CROPDMGEXP[i] == "") {
    damageEffect$CROPDAMAGE[i] <- damageEffect$CROPDMG[i]
  }
}
The total property damage in the measured period is 427,324,917,627 dollars. The total crop damage in the measured period is 49,104,191,921 dollars.
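The exponent handling can also be sketched as a vectorised lookup, which avoids looping over every record and makes the treatment of unrecognised exponents explicit. The helper name `expMultiplier` and the toy data frame below are assumptions for illustration and not part of the original analysis. In this sketch, unrecognised exponents get a multiplier of 0 so that they contribute nothing to the totals.

```r
# Hypothetical helper: map exponent codes to numeric multipliers.
# "" -> 1, K/k -> 1e3, M/m -> 1e6, B/b -> 1e9, anything else -> 0 (ignored)
expMultiplier <- function(expCode) {
  codes <- c("", "K", "M", "B")
  mults <- c(1, 1e3, 1e6, 1e9)
  mult <- mults[match(toupper(as.character(expCode)), codes)]
  mult[is.na(mult)] <- 0
  mult
}

# Toy records (made-up values) covering each case
toy <- data.frame(
  PROPDMG = c(25, 2.5, 1, 10, 7),
  PROPDMGEXP = c("K", "m", "B", "", "+"),
  stringsAsFactors = FALSE
)
toy$PROPDAMAGE <- toy$PROPDMG * expMultiplier(toy$PROPDMGEXP)
toy
```

The same helper could be applied to the CROPDMG and CROPDMGEXP columns.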
The number of severe weather event types is still unmanageably large. So for comprehensive reporting we need to normalize the events to the 48 standard Storm Data Events as defined in paragraph 2.1.1 of the Storm Data Documentation referenced above. These standard types are defined in the provided dataset: stormEventTable.csv. We match the weather event types against these standard weather event types to cluster them.
eventTypes <- read.table("stormEventTable.csv", sep = ",", header = TRUE)
eventTypes
## weather
## 1 Astronomical Low Tide
## 2 Avalanche
## 3 Blizzard
## 4 Coastal Flood
## 5 Cold/Wind Chill
## 6 Debris Flow
## 7 Dense Fog
## 8 Dense Smoke
## 9 Drought
## 10 Dust Devil
## 11 Dust Storm
## 12 Excessive Heat
## 13 Extreme Cold/Wind Chill
## 14 Flash Flood
## 15 Flood
## 16 Frost/Freeze
## 17 Funnel Cloud
## 18 Freezing Fog
## 19 Hail
## 20 Heat
## 21 Heavy Rain
## 22 Heavy Snow
## 23 High Surf
## 24 High Wind
## 25 Hurricane (Typhoon)
## 26 Ice Storm
## 27 Lake-Effect Snow
## 28 Lakeshore Flood
## 29 Lightning
## 30 Marine Hail
## 31 Marine High Wind
## 32 Marine Strong Wind
## 33 Marine Thunderstorm Wind
## 34 Rip Current
## 35 Seiche
## 36 Sleet
## 37 Storm Surge/Tide
## 38 Strong Wind
## 39 Thunderstorm Wind
## 40 Tornado
## 41 Tropical Depression
## 42 Tropical Storm
## 43 Tsunami
## 44 Volcanic Ash
## 45 Waterspout
## 46 Wildfire
## 47 Winter Storm
## 48 Winter Weather
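Match the weather events in the records against the standard weather types as defined by NOAA. To improve matching, the following changes have been made to the event types: “TSTM” is expanded to “THUNDERSTORM”, “HURRICANE/TYPHOON” and “HURRICANE OPAL” are renamed to “Hurricane (Typhoon)”, “RIVER FLOOD” is reduced to “FLOOD”, and the plural “TORNADOES” is replaced by the singular “TORNADO”.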
# First match health data
healthEffect$EVTYPE <- str_replace_all(healthEffect$EVTYPE, "TSTM", "THUNDERSTORM")
healthEffect$EVTYPE <- str_replace_all(healthEffect$EVTYPE, "HURRICANE/TYPHOON", "Hurricane (Typhoon)")
healthEffect$EVTYPE <- str_replace_all(healthEffect$EVTYPE, "HURRICANE OPAL", "Hurricane (Typhoon)")
healthEffect$EVTYPE <- str_replace_all(healthEffect$EVTYPE, "RIVER FLOOD", "FLOOD")
healthEffect$EVTYPE <- str_replace_all(healthEffect$EVTYPE, "TORNADOES", "TORNADO ")
healthEffect$STDTYP <- pmatch(toupper(healthEffect$EVTYPE), toupper(eventTypes$weather), duplicates.ok = TRUE)
# Next match damage data
damageEffect$EVTYPE <- str_replace_all(damageEffect$EVTYPE, "TSTM", "THUNDERSTORM")
damageEffect$EVTYPE <- str_replace_all(damageEffect$EVTYPE, "HURRICANE/TYPHOON", "Hurricane (Typhoon)")
damageEffect$EVTYPE <- str_replace_all(damageEffect$EVTYPE, "HURRICANE OPAL", "Hurricane (Typhoon)")
damageEffect$EVTYPE <- str_replace_all(damageEffect$EVTYPE, "RIVER FLOOD", "FLOOD")
damageEffect$EVTYPE <- str_replace_all(damageEffect$EVTYPE, "TORNADOES", "TORNADO ")
damageEffect$STDTYP <- pmatch(toupper(damageEffect$EVTYPE), toupper(eventTypes$weather), duplicates.ok = TRUE)
After matching, 7.03% remained unmatched to a standard weather type.
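To illustrate how `pmatch()` behaves in this kind of matching, the sketch below runs it on a handful of labels. An observed label matches when it equals a standard type or is an unambiguous prefix of exactly one; everything else comes back as NA. The `standard` and `observed` vectors are toy values chosen for illustration, and the last line shows one possible way to compute an unmatched share, which is not necessarily how the figure above was derived.

```r
# Toy subset of standard types (illustration only, not the full 48-entry table)
standard <- toupper(c("Heat", "High Wind", "Storm Surge/Tide", "Tornado"))
observed <- c("TORNADO", "STORM SURGE", "HEAT", "WIND", "H")

# Exact matches first, then unique partial (prefix) matches; NA otherwise
idx <- pmatch(observed, standard, duplicates.ok = TRUE)
data.frame(observed, matched = standard[idx])

# Share of labels without a standard type, in percent
unmatchedPct <- 100 * mean(is.na(idx))
unmatchedPct
```

Here "WIND" stays unmatched because it is not a prefix of any standard type, and "H" stays unmatched because it is a prefix of more than one.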
Calculate the total number of victims and present them by (standard) weather event type in descending order. The variable Index gives the index in the standard weather event table. If the standard event type could not be found, the variable Index is NA, and additional matching rules are required.
mostVCT <- group_by(healthEffect, EVTYPE)
mostVCT <- summarise(mostVCT, Index = min(STDTYP), Fatalities = sum(FATALITIES), Injuries = sum(INJURIES), Victims = sum(FATALITIES + INJURIES))
mostVCT <- arrange(mostVCT, desc(Victims))
head(mostVCT, n=10)
## Source: local data frame [10 x 5]
##
##               EVTYPE Index Fatalities Injuries Victims
##                (chr) (int)      (dbl)    (dbl)   (dbl)
## 1            TORNADO    40       5633    91346   96979
## 2  THUNDERSTORM WIND    39        637     8445    9082
## 3     EXCESSIVE HEAT    12       1903     6525    8428
## 4              FLOOD    15        472     6791    7263
## 5          LIGHTNING    29        816     5230    6046
## 6               HEAT    20        937     2100    3037
## 7        FLASH FLOOD    14        978     1777    2755
## 8          ICE STORM    26         89     1975    2064
## 9       WINTER STORM    47        206     1321    1527
## 10         HIGH WIND    24        248     1137    1385
The total number of victims (fatalities plus injuries) is 155,673 people. The top 10 list above covers 99% of the total victims, hence further modelling to cluster event types into standard types was not really necessary.
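One way to compute a coverage figure of this kind is to divide the sum of the n largest per-event totals by the overall total. The sketch below does this on made-up numbers. The helper name `topShare` and the `victims` vector are hypothetical and only illustrate the arithmetic.

```r
# Hypothetical totals per event type (made-up numbers, not the storm data)
victims <- c(A = 500, B = 300, C = 120, D = 50, E = 30)

# Share of the overall total covered by the n largest entries, in percent
topShare <- function(x, n) {
  100 * sum(head(sort(x, decreasing = TRUE), n)) / sum(x)
}

topShare(victims, 3)
```

Applied to a summary table, `x` would be the column of per-event totals and `n` the length of the top list.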
top10vct <- gather(mostVCT[1:10, c(1,3,4)], "Impact", "Victims", convert = TRUE, Fatalities, Injuries)
ggplot(top10vct, aes(x = reorder(EVTYPE, -Victims), Victims, fill = factor(Impact))) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(text = element_text(size = 15, face = "bold"),
        legend.text = element_text(size = 12, face = "plain"),
        axis.text.x = element_text(size = 12, face = "plain", angle = 60, vjust = 0.5)) +
  guides(fill = guide_legend(title = "Type of victims")) +
  ggtitle("Top 10 Severe weather impact on health") +
  scale_x_discrete("Severe weather event") +
  scale_y_log10("Total victims")
mostDMG <- group_by(damageEffect, EVTYPE)
mostDMG <- summarise(mostDMG, Index = min(STDTYP), Properties = sum(PROPDAMAGE), Crop = sum(CROPDAMAGE), Total.damage = sum(PROPDAMAGE + CROPDAMAGE))
mostDMG <- arrange(mostDMG, desc(Total.damage))
head(mostDMG, n=10)
## Source: local data frame [10 x 5]
##
##                 EVTYPE Index   Properties        Crop Total.damage
##                  (chr) (int)        (dbl)       (dbl)        (dbl)
## 1                FLOOD    15 149776655307 10691427450 160468082757
## 2  Hurricane (Typhoon)    25  72478686000  2626872800  75105558800
## 3              TORNADO    40  56937435483   414953110  57352388593
## 4          STORM SURGE    37  43323536000        5000  43323541000
## 5                 HAIL    19  15732591777  3025954453  18758546230
## 6          FLASH FLOOD    14  16141136717  1421317100  17562453817
## 7              DROUGHT     9   1046106000 13972566000  15018672000
## 8            HURRICANE    25  11868319010  2741910000  14610229010
## 9            ICE STORM    26   3944952810  5022113500   8967066310
## 10   THUNDERSTORM WIND    39   7968149582   968850400   8936999982
The total damage (property plus crop) is 476,429,109,548 US$. The top 10 list above covers 98.1% of the total damage, hence further modelling to cluster event types into standard types was not really necessary.
top10dmg <- gather(mostDMG[1:10, c(1,3,4)], "Impact", "Damage", convert = TRUE, Properties, Crop)
ggplot(top10dmg, aes(x = reorder(EVTYPE, -Damage), Damage, fill = factor(Impact))) +
  geom_bar(stat = "identity", position = "dodge") +
  theme(text = element_text(size = 15, face = "bold"),
        legend.text = element_text(size = 12, face = "plain"),
        axis.text.x = element_text(size = 12, face = "plain", angle = 60, vjust = 0.5)) +
  guides(fill = guide_legend(title = "Type of damage")) +
  ggtitle("Top 10 Severe weather impact on damage [in US$]") +
  scale_x_discrete("Severe weather event") +
  scale_y_log10("Total damage")
Note: With a few minor steps to improve clustering towards the standard weather types, the percentage of victims or damage covered can easily be increased to above 99.5%, but this won’t change the top list. E.g. group all Flood or Hail events together.
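A minimal sketch of such a grouping step, assuming the standard type index (STDTYP) is used as the grouping key instead of the raw EVTYPE label: spellings that map to the same standard type are then summed together. The toy records and the DAMAGE column below are made up for illustration, and base R `aggregate()` is used here in place of the dplyr calls above.

```r
# Toy records: two spellings that share standard index 25 (made-up damage values)
toy <- data.frame(
  EVTYPE = c("HURRICANE", "Hurricane (Typhoon)", "FLOOD", "HAIL"),
  STDTYP = c(25L, 25L, 15L, 19L),
  DAMAGE = c(15, 75, 160, 19),
  stringsAsFactors = FALSE
)

# Sum damage per standard type instead of per raw event label.
# Rows with an NA index are dropped by the formula interface, so unmatched
# labels would still need additional matching rules first.
byType <- aggregate(DAMAGE ~ STDTYP, data = toy, FUN = sum)
byType <- byType[order(-byType$DAMAGE), ]
byType
```

In this toy example the two hurricane spellings collapse into a single row with index 25.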
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.11.4 (El Capitan)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_2.1.0 tidyr_0.4.1 scales_0.4.0 stringr_1.0.0 dplyr_0.4.3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.3 digest_0.6.8 assertthat_0.1 grid_3.2.2
## [5] plyr_1.8.3 R6_2.1.2 gtable_0.2.0 DBI_0.3.1
## [9] formatR_1.2.1 magrittr_1.5 evaluate_0.8 stringi_1.0-1
## [13] lazyeval_0.1.10 rmarkdown_0.9.5 tools_3.2.2 munsell_0.4.3
## [17] yaml_2.1.13 parallel_3.2.2 colorspace_1.2-6 htmltools_0.2.6
## [21] knitr_1.11