Severe weather events have population health and economic consequences. This analysis explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to answer basic questions about severe weather events. In particular the database allows us to analyze severe weather events across the United States and population health consequences through fatalities and injuries, and economic consequences through property and crop damage (in USD), which will be combined for total economic damage in US dollars (USD).
The below analysis demonstrates that across the United States tornados are most harmful with respect to population health have the greatest economic consequence. It is hoped that these results might be helpful to a government or municipal manager responsible for preparing for severe weather events as they prioritize resources for different types of events. However, this report does not make any specific recommendations.
For reproducibility all steps to the analysis pipeline are included in this report, from obtaining the data from NOAA, processing code and analysis code. Additionally, an appendix includes session information for the reader regarding the software environment on which this analysis was performed.
setwd("~/severe_weather")
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:stats':
##
## filter
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(reshape2)
Note this step is only done once so after the initial download this code chunk was commented out, it is here for reproducibility
#download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","severe_weather_data.csv.bz2")
stormData<-read.csv("severe_weather_data.csv.bz2")
stormConsequences <- subset(stormData,FATALITIES > 0 | INJURIES > 0, select = c(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP,CROPDMG,CROPDMGEXP))
It is clear that there are EVTYPE entries that are entered incorrectly and a duplication of levels (event types).
lvl1<-nlevels(stormConsequences$EVTYPE)
There are 985 levels of EVTYPE (factor variable) prior to trying to clean.
Based on the NOAA Storm Data Documentation there are only 48 Storm Data Event Table (Section 2.1.1). Using this table we can select key words from the character data in the EVTYPE variable for cleaning.
The first step to cleaning this includes making all event types upper case, and trimming white space.
# Change to all upper case
stormConsequences$EVTYPE <- toupper(stormConsequences$EVTYPE)
# Remove symbol
stormConsequences$EVTYPE <- gsub(pattern = "[^A-Z]+", " ", stormConsequences$EVTYPE)
# Remove plurals
stormConsequences$EVTYPE <- gsub(pattern = "S\\b", " ", stormConsequences$EVTYPE)
# Fixing flood
stormConsequences$EVTYPE <- gsub(pattern = "FLOODING|FLDG|FLD|FLASH FLOOD", "FLOOD",
stormConsequences$EVTYPE)
stormConsequences[grep("\\bFLOOD\\b", stormConsequences$EVTYPE, perl = T, ignore.case = T),
"EVTYPE"] <- "FLOOD"
# Remove single symbols
stormConsequences$EVTYPE <- gsub(pattern = "\\b[A-Z]\\b", " ", stormConsequences$EVTYPE)
# Fixing HEAT
stormConsequences[grep("\\bheat\\b", stormConsequences$EVTYPE, perl = T, ignore.case = T),
"EVTYPE"] <- "HEAT"
# Fixing WIND
stormConsequences[grep("\\bwind\\b", stormConsequences$EVTYPE, perl = T, ignore.case = T),
"EVTYPE"] <- "WIND"
# Fixing HURRICANE
stormConsequences[grep("\\bhurricane\\b", stormConsequences$EVTYPE, perl = T, ignore.case = T),
"EVTYPE"] <- "HURRICANE"
# Fixing STORM
stormConsequences[grep("\\bSTORM\\b", stormConsequences$EVTYPE, perl = T, ignore.case = T),
"EVTYPE"] <- "STORM"
# Fixing FIRE
stormConsequences[grep("\\bfire\\b|wildfire", stormConsequences$EVTYPE, perl = T, ignore.case = T),
"EVTYPE"] <- "FIRE"
# Fixing RAIN
stormConsequences[grep("\\bRAIN\\b", stormConsequences$EVTYPE, perl = T, ignore.case = T),
"EVTYPE"] <- "RAIN"
# Remove multiple spaces
stormConsequences$EVTYPE <- gsub(pattern = " +", " ", stormConsequences$EVTYPE)
# Remove leading and trailing white space
stormConsequences$EVTYPE <- trimws(stormConsequences$EVTYPE)
#Change EVTYPE back to a factor variable
stormConsequences$EVTYPE <- as.factor(stormConsequences$EVTYPE)
lvl2<-nlevels(stormConsequences$EVTYPE)
There are now 103 levels of EVTYPE (factor variable).
This is improved - still not the 48 levels of EVTYPE based on the NOAA Storm Data Documentation, but for now will be considered sufficient to proceed. Additional cleaning would require assumptions regarding how to reclassify data. For example, we would have to assume variations of EVTYPE for some levels are not meaningful. Combining such levels may lead to an overestimate of the number of events based on the assumptions of our coding to combine levels.
PROPDMGEXP and CROPDMGEXP include codes for magnitude of damage in USD (H - hundreds, K - thousands, M - Millions, B - Billions)
unique(stormConsequences$PROPDMGEXP)
## [1] K M B 5 0 m - 7 H
## Levels: - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(stormConsequences$CROPDMGEXP)
## [1] M K 0 B
## Levels: ? 0 2 B k K m M
To determine economic damage in total USD, we first cleaned PROPDMGEXP and CROPDMGEXP. This process does not correct for provblems of missing values.
# trim and uppercase
stormConsequences$PROPDMGEXP <- trimws(toupper(stormConsequences$PROPDMGEXP))
# substitute symbols with numeric values
stormConsequences[stormConsequences$PROPDMGEXP == "" | stormConsequences$PROPDMGEXP == "+" | stormConsequences$PROPDMGEXP ==
"-" | stormConsequences$PROPDMGEXP == "?", "PROPDMGEXP"] = "0"
stormConsequences[stormConsequences$PROPDMGEXP == "B", "PROPDMGEXP"] = "9"
stormConsequences[stormConsequences$PROPDMGEXP == "H", "PROPDMGEXP"] = "2"
stormConsequences[stormConsequences$PROPDMGEXP == "K", "PROPDMGEXP"] = "3"
stormConsequences[stormConsequences$PROPDMGEXP == "M", "PROPDMGEXP"] = "6"
# trim and uppercase
stormConsequences$CROPDMGEXP <- trimws(toupper(stormConsequences$CROPDMGEXP))
# substitute symbols with numeric values
stormConsequences[stormConsequences$CROPDMGEXP == "" | stormConsequences$CROPDMGEXP == "+" | stormConsequences$CROPDMGEXP ==
"-" | stormConsequences$CROPDMGEXP == "?", "CROPDMGEXP"] = "0"
stormConsequences[stormConsequences$CROPDMGEXP == "B", "CROPDMGEXP"] = "9"
stormConsequences[stormConsequences$CROPDMGEXP == "H", "CROPDMGEXP"] = "2"
stormConsequences[stormConsequences$CROPDMGEXP == "K", "CROPDMGEXP"] = "3"
stormConsequences[stormConsequences$CROPDMGEXP == "M", "CROPDMGEXP"] = "6"
# Convert columns to numeric data type
stormConsequences$PROPDMG <- as.numeric(stormConsequences$PROPDMG)
stormConsequences$PROPDMGEXP <- as.numeric(stormConsequences$PROPDMGEXP)
stormConsequences$CROPDMG <- as.numeric(stormConsequences$CROPDMG)
stormConsequences$CROPDMGEXP <- as.numeric(stormConsequences$CROPDMGEXP)
# Create numeric vector with magnitude applied
PROPDMG <- 10^stormConsequences$PROPDMGEXP * stormConsequences$PROPDMG
CROPDMG <- 10^stormConsequences$CROPDMGEXP * stormConsequences$CROPDMG
# Apply vectors to data set and create calculated column with total damage
stormConsequences$PROPDMG <- PROPDMG
stormConsequences$CROPDMG <- CROPDMG
stormConsequences$ECONOMICDMG_TOTAL <- PROPDMG + CROPDMG
Economic damage total (USD) is an appropriate combination since both property and crop damage are measured using the same metric (USD), therefore they are comparable.
Discovered that the data set has NA values for stormConsequences$ECONOMICDMG_TOTAL, so this code makes a (major) assumption that NAs were intentionally entered when there was no damage (= 0), and replaces the (12) NA values with zero.
stormConsequences$ECONOMICDMG_TOTAL[is.na(stormConsequences$ECONOMICDMG_TOTAL)] <- 0
stormConsequences$FATALITIES <- as.numeric(stormConsequences$FATALITIES)
stormConsequences$INJURIES <- as.numeric(stormConsequences$INJURIES)
Note that we did not create a total for population health. Even though both fatalities and injuries are both counts, they represent fundamentally different metrics so combining them would require too many assumptions for this analysis.
sumFATALITIES <- stormConsequences %>% group_by(EVTYPE) %>% summarise(FATALITIES = sum(FATALITIES))
sumINJURIES <- stormConsequences %>% group_by(EVTYPE) %>% summarise(INJURIES = sum(INJURIES))
sumECONOMICDMG <- stormConsequences %>% group_by(EVTYPE) %>% summarise(ECONOMICDMG = sum(ECONOMICDMG_TOTAL))
Results are presented based on the primary analysis questions.
Fatalities
The following table and plot present the 10 most damaging events from the perspective of fatalities.
topFatalities <- head(sumFATALITIES[order(sumFATALITIES$FATALITIES, decreasing = T),], n = 10)[, c(1, 2)]
topFatalities
## Source: local data frame [10 x 2]
##
## EVTYPE FATALITIES
## 1 TORNADO 5633
## 2 HEAT 3138
## 3 FLOOD 1533
## 4 WIND 1433
## 5 LIGHTNING 817
## 6 RIP CURRENT 572
## 7 STORM 420
## 8 AVALANCHE 224
## 9 EXTREME COLD 162
## 10 HURRICANE 133
topFATALITIESPlot <- melt(topFatalities)
## Using EVTYPE as id variables
ggplot(topFATALITIESPlot,aes(x = factor(topFATALITIESPlot$EVTYPE), y = topFATALITIESPlot$value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge", fill="red", colour="red") + theme(axis.text.x = element_text(angle = -270), plot.title = element_text(face = "bold")) + labs(x = "Severe Weather Event", y = "Number of Fatalities") + theme(legend.position = "none") + ggtitle("Impact of Severe Weather on Fatalities (top 10)")
Injuries
The following table and plot present the 10 most damaging events from the perspective of injuries
topInjuries <- head(sumINJURIES[order(sumINJURIES$INJURIES, decreasing = T),], n = 10)[, c(1, 2)]
topInjuries
## Source: local data frame [10 x 2]
##
## EVTYPE INJURIES
## 1 TORNADO 91364
## 2 WIND 11489
## 3 HEAT 9224
## 4 FLOOD 8675
## 5 LIGHTNING 5230
## 6 STORM 4196
## 7 FIRE 1608
## 8 HAIL 1361
## 9 HURRICANE 1328
## 10 HEAVY SNOW 1021
topInjuriesPlot <- melt(topInjuries)
## Using EVTYPE as id variables
ggplot(topInjuriesPlot,aes(x = factor(topInjuriesPlot$EVTYPE), y = topInjuriesPlot$value, fill = variable)) +
geom_bar(stat = "identity", position = "dodge", fill="blue", colour="blue") + theme(axis.text.x = element_text(angle = -270), plot.title = element_text(face = "bold")) + labs(x = "Severe Weather Event", y = "Number of Injuries") + theme(legend.position = "none") + ggtitle("Impact of Severe Weather on Injuries (top 10)")
The following table and plot present the 10 most damaging events from the perspective of the economy total - both property and crops combined
topECONOMICDMG <- head(sumECONOMICDMG[order(sumECONOMICDMG$ECONOMICDMG, decreasing = T),], n = 10)[, c(1, 2)]
topECONOMICDMG
## Source: local data frame [10 x 2]
##
## EVTYPE ECONOMICDMG
## 1 TORNADO 42028680105
## 2 HURRICANE 41529410800
## 3 STORM 16806219500
## 4 FLOOD 11044632690
## 5 WIND 8286339433
## 6 FIRE 4855739200
## 7 HAIL 3672085701
## 8 BLIZZARD 638806000
## 9 HEAT 505957550
## 10 TYPHOON 500100000
topECONOMICDMGPlot <- melt(topECONOMICDMG)
## Using EVTYPE as id variables
ggplot(topECONOMICDMGPlot,aes(x = factor(topECONOMICDMGPlot$EVTYPE), y = topECONOMICDMGPlot$value, fill = variable)) + geom_bar(stat = "identity", position = "dodge", fill="GREEN", colour="GREEN") + theme(axis.text.x = element_text(angle = -270), plot.title = element_text(face = "bold")) + labs(x = "Severe Weather Event", y = "Cost of Damage (USD)") + theme(legend.position = "none") + ggtitle("Impact of Severe Weather on Economic Damage (top 10)")
Tornados are the severe weather event withthe greatest number of fatalities, injuries and the greatest economic damage as considered cost in USD. Beyond the overall greatest impact of tornados, there is variation in the remaining top 10. For example, the second greatest event for fatalities is heat, for injuries is wind, and for economic damage is flooding. We cannot rule out the impact of our cleaning process on these results. Many of these top 10 events were part of our cleaning process and we are not sure how a more thorough cleaning process might change the distribution - particularly for the 2nd - 10th highest rated events. Tornado’s are very damaging (overall) and seem to be so in a marked manner (note the overall height of the Tornado fatalities, injuries and USD). It is unlikely that more thorough cleaning would change the position of the tornado event as #1 overall for population health; and only mildly likely that flood or wind would overtake tornado for economic impact (particularl since both wind and flood were cleaned using a similiar process as was used for tornado.
sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] reshape2_1.4.1 dplyr_0.4.1 ggplot2_1.0.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.11.6 knitr_1.12.3 magrittr_1.5 MASS_7.3-43
## [5] munsell_0.4.2 colorspace_1.2-6 stringr_1.0.0 plyr_1.8.2
## [9] tools_3.2.2 parallel_3.2.2 grid_3.2.2 gtable_0.1.2
## [13] DBI_0.3.1 htmltools_0.3 yaml_2.1.13 lazyeval_0.1.10
## [17] digest_0.6.8 assertthat_0.1 formatR_1.3 evaluate_0.8.3
## [21] rmarkdown_0.9.5 labeling_0.3 stringi_0.4-1 scales_0.2.4
## [25] proto_0.3-10