Severe weather events have population health and economic consequences. This analysis explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database to answer basic questions about severe weather events. In particular the database allows us to analyze severe weather events across the United States and population health consequences through fatalities and injuries, and economic consequences through property and crop damage (in USD), which will be combined for total economic damage in US dollars (USD).

The below analysis demonstrates that across the United States tornados are most harmful with respect to population health have the greatest economic consequence. It is hoped that these results might be helpful to a government or municipal manager responsible for preparing for severe weather events as they prioritize resources for different types of events. However, this report does not make any specific recommendations.

For reproducibility all steps to the analysis pipeline are included in this report, from obtaining the data from NOAA, processing code and analysis code. Additionally, an appendix includes session information for the reader regarding the software environment on which this analysis was performed.

Set working directory and load required R packages

setwd("~/severe_weather")
library(ggplot2)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following object is masked from 'package:stats':
## 
##     filter
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(reshape2)

Data Processing

Downloading the data

Note this step is only done once so after the initial download this code chunk was commented out, it is here for reproducibility

#download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2","severe_weather_data.csv.bz2")

Load the data into R and subset only the necessary data for this analysis

stormData<-read.csv("severe_weather_data.csv.bz2")

Subset only the necessary data for this analysis

stormConsequences <- subset(stormData,FATALITIES > 0 | INJURIES > 0, select = c(EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP,CROPDMG,CROPDMGEXP))

Cleaning EVTYPE factors

It is clear that there are EVTYPE entries that are entered incorrectly and a duplication of levels (event types).

lvl1<-nlevels(stormConsequences$EVTYPE)

There are 985 levels of EVTYPE (factor variable) prior to trying to clean.

Based on the NOAA Storm Data Documentation there are only 48 Storm Data Event Table (Section 2.1.1). Using this table we can select key words from the character data in the EVTYPE variable for cleaning.

The first step to cleaning this includes making all event types upper case, and trimming white space.

# Change to all upper case
stormConsequences$EVTYPE <- toupper(stormConsequences$EVTYPE)

# Remove symbol
stormConsequences$EVTYPE <- gsub(pattern = "[^A-Z]+", " ", stormConsequences$EVTYPE)
# Remove plurals
stormConsequences$EVTYPE <- gsub(pattern = "S\\b", " ", stormConsequences$EVTYPE)
# Fixing flood
stormConsequences$EVTYPE <- gsub(pattern = "FLOODING|FLDG|FLD|FLASH FLOOD", "FLOOD", 
    stormConsequences$EVTYPE)
stormConsequences[grep("\\bFLOOD\\b", stormConsequences$EVTYPE, perl = T, ignore.case = T), 
    "EVTYPE"] <- "FLOOD"
# Remove single symbols
stormConsequences$EVTYPE <- gsub(pattern = "\\b[A-Z]\\b", " ", stormConsequences$EVTYPE)
# Fixing HEAT
stormConsequences[grep("\\bheat\\b", stormConsequences$EVTYPE, perl = T, ignore.case = T), 
    "EVTYPE"] <- "HEAT"
# Fixing WIND
stormConsequences[grep("\\bwind\\b", stormConsequences$EVTYPE, perl = T, ignore.case = T), 
    "EVTYPE"] <- "WIND"
# Fixing HURRICANE
stormConsequences[grep("\\bhurricane\\b", stormConsequences$EVTYPE, perl = T, ignore.case = T), 
    "EVTYPE"] <- "HURRICANE"
# Fixing STORM
stormConsequences[grep("\\bSTORM\\b", stormConsequences$EVTYPE, perl = T, ignore.case = T), 
    "EVTYPE"] <- "STORM"
# Fixing FIRE
stormConsequences[grep("\\bfire\\b|wildfire", stormConsequences$EVTYPE, perl = T, ignore.case = T), 
    "EVTYPE"] <- "FIRE"
# Fixing RAIN
stormConsequences[grep("\\bRAIN\\b", stormConsequences$EVTYPE, perl = T, ignore.case = T), 
    "EVTYPE"] <- "RAIN"
# Remove multiple spaces
stormConsequences$EVTYPE <- gsub(pattern = "  +", " ", stormConsequences$EVTYPE)
# Remove leading and trailing white space
stormConsequences$EVTYPE <- trimws(stormConsequences$EVTYPE)
#Change EVTYPE back to a factor variable
stormConsequences$EVTYPE <- as.factor(stormConsequences$EVTYPE)
lvl2<-nlevels(stormConsequences$EVTYPE)

There are now 103 levels of EVTYPE (factor variable).

This is improved - still not the 48 levels of EVTYPE based on the NOAA Storm Data Documentation, but for now will be considered sufficient to proceed. Additional cleaning would require assumptions regarding how to reclassify data. For example, we would have to assume variations of EVTYPE for some levels are not meaningful. Combining such levels may lead to an overestimate of the number of events based on the assumptions of our coding to combine levels.

Cleaning PROPDMGEXP / CROPDMGEXP / PROPDMG / CROPDMG

PROPDMGEXP and CROPDMGEXP include codes for magnitude of damage in USD (H - hundreds, K - thousands, M - Millions, B - Billions)

unique(stormConsequences$PROPDMGEXP)
##  [1] K M   B 5 0 m - 7 H
## Levels:  - ? + 0 1 2 3 4 5 6 7 8 B h H K m M
unique(stormConsequences$CROPDMGEXP)
## [1]   M K 0 B
## Levels:  ? 0 2 B k K m M

To determine economic damage in total USD, we first cleaned PROPDMGEXP and CROPDMGEXP. This process does not correct for provblems of missing values.

# trim and uppercase
stormConsequences$PROPDMGEXP <- trimws(toupper(stormConsequences$PROPDMGEXP))
# substitute symbols with numeric values
stormConsequences[stormConsequences$PROPDMGEXP == "" | stormConsequences$PROPDMGEXP == "+" | stormConsequences$PROPDMGEXP == 
    "-" | stormConsequences$PROPDMGEXP == "?", "PROPDMGEXP"] = "0"
stormConsequences[stormConsequences$PROPDMGEXP == "B", "PROPDMGEXP"] = "9"
stormConsequences[stormConsequences$PROPDMGEXP == "H", "PROPDMGEXP"] = "2"
stormConsequences[stormConsequences$PROPDMGEXP == "K", "PROPDMGEXP"] = "3"
stormConsequences[stormConsequences$PROPDMGEXP == "M", "PROPDMGEXP"] = "6"

# trim and uppercase
stormConsequences$CROPDMGEXP <- trimws(toupper(stormConsequences$CROPDMGEXP))
# substitute symbols with numeric values
stormConsequences[stormConsequences$CROPDMGEXP == "" | stormConsequences$CROPDMGEXP == "+" | stormConsequences$CROPDMGEXP == 
    "-" | stormConsequences$CROPDMGEXP == "?", "CROPDMGEXP"] = "0"
stormConsequences[stormConsequences$CROPDMGEXP == "B", "CROPDMGEXP"] = "9"
stormConsequences[stormConsequences$CROPDMGEXP == "H", "CROPDMGEXP"] = "2"
stormConsequences[stormConsequences$CROPDMGEXP == "K", "CROPDMGEXP"] = "3"
stormConsequences[stormConsequences$CROPDMGEXP == "M", "CROPDMGEXP"] = "6"

Convert cleaned data into numeric data to allow creating numeric values of damage in USD

# Convert columns to numeric data type
stormConsequences$PROPDMG <- as.numeric(stormConsequences$PROPDMG)
stormConsequences$PROPDMGEXP <- as.numeric(stormConsequences$PROPDMGEXP)
stormConsequences$CROPDMG <- as.numeric(stormConsequences$CROPDMG)
stormConsequences$CROPDMGEXP <- as.numeric(stormConsequences$CROPDMGEXP)

# Create numeric vector with magnitude applied
PROPDMG <- 10^stormConsequences$PROPDMGEXP * stormConsequences$PROPDMG
CROPDMG <- 10^stormConsequences$CROPDMGEXP * stormConsequences$CROPDMG

# Apply vectors to data set and create calculated column with total damage
stormConsequences$PROPDMG <- PROPDMG
stormConsequences$CROPDMG <- CROPDMG
stormConsequences$ECONOMICDMG_TOTAL <- PROPDMG + CROPDMG

Economic damage total (USD) is an appropriate combination since both property and crop damage are measured using the same metric (USD), therefore they are comparable.

Discovered that the data set has NA values for stormConsequences$ECONOMICDMG_TOTAL, so this code makes a (major) assumption that NAs were intentionally entered when there was no damage (= 0), and replaces the (12) NA values with zero.

stormConsequences$ECONOMICDMG_TOTAL[is.na(stormConsequences$ECONOMICDMG_TOTAL)] <- 0

Convert FATALITIES and INJURIES to numeric data

stormConsequences$FATALITIES <- as.numeric(stormConsequences$FATALITIES)
stormConsequences$INJURIES <- as.numeric(stormConsequences$INJURIES)

Note that we did not create a total for population health. Even though both fatalities and injuries are both counts, they represent fundamentally different metrics so combining them would require too many assumptions for this analysis.

Aggregate sums of damage and population health data by event type

sumFATALITIES <- stormConsequences %>% group_by(EVTYPE) %>% summarise(FATALITIES = sum(FATALITIES))
sumINJURIES <- stormConsequences %>% group_by(EVTYPE) %>% summarise(INJURIES = sum(INJURIES))
sumECONOMICDMG <- stormConsequences %>% group_by(EVTYPE) %>% summarise(ECONOMICDMG = sum(ECONOMICDMG_TOTAL))

Results

Results are presented based on the primary analysis questions.

Which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Fatalities
The following table and plot present the 10 most damaging events from the perspective of fatalities.

topFatalities <- head(sumFATALITIES[order(sumFATALITIES$FATALITIES, decreasing = T),], n = 10)[, c(1, 2)]
topFatalities
## Source: local data frame [10 x 2]
## 
##          EVTYPE FATALITIES
## 1       TORNADO       5633
## 2          HEAT       3138
## 3         FLOOD       1533
## 4          WIND       1433
## 5     LIGHTNING        817
## 6   RIP CURRENT        572
## 7         STORM        420
## 8     AVALANCHE        224
## 9  EXTREME COLD        162
## 10    HURRICANE        133
topFATALITIESPlot <- melt(topFatalities)
## Using EVTYPE as id variables
ggplot(topFATALITIESPlot,aes(x = factor(topFATALITIESPlot$EVTYPE), y = topFATALITIESPlot$value, fill = variable)) + 
    geom_bar(stat = "identity", position = "dodge", fill="red", colour="red") + theme(axis.text.x = element_text(angle = -270), plot.title = element_text(face = "bold")) + labs(x = "Severe Weather Event", y = "Number of Fatalities") + theme(legend.position = "none") + ggtitle("Impact of Severe Weather on Fatalities (top 10)")

Injuries
The following table and plot present the 10 most damaging events from the perspective of injuries

topInjuries <- head(sumINJURIES[order(sumINJURIES$INJURIES, decreasing = T),], n = 10)[, c(1, 2)]
topInjuries
## Source: local data frame [10 x 2]
## 
##        EVTYPE INJURIES
## 1     TORNADO    91364
## 2        WIND    11489
## 3        HEAT     9224
## 4       FLOOD     8675
## 5   LIGHTNING     5230
## 6       STORM     4196
## 7        FIRE     1608
## 8        HAIL     1361
## 9   HURRICANE     1328
## 10 HEAVY SNOW     1021
topInjuriesPlot <- melt(topInjuries)
## Using EVTYPE as id variables
ggplot(topInjuriesPlot,aes(x = factor(topInjuriesPlot$EVTYPE), y = topInjuriesPlot$value, fill = variable)) + 
    geom_bar(stat = "identity", position = "dodge", fill="blue", colour="blue") + theme(axis.text.x = element_text(angle = -270), plot.title = element_text(face = "bold")) + labs(x = "Severe Weather Event", y = "Number of Injuries") + theme(legend.position = "none") + ggtitle("Impact of Severe Weather on Injuries (top 10)")

Which types of events have the greatest economic consequences?

The following table and plot present the 10 most damaging events from the perspective of the economy total - both property and crops combined

topECONOMICDMG <- head(sumECONOMICDMG[order(sumECONOMICDMG$ECONOMICDMG, decreasing = T),], n = 10)[, c(1, 2)]
topECONOMICDMG
## Source: local data frame [10 x 2]
## 
##       EVTYPE ECONOMICDMG
## 1    TORNADO 42028680105
## 2  HURRICANE 41529410800
## 3      STORM 16806219500
## 4      FLOOD 11044632690
## 5       WIND  8286339433
## 6       FIRE  4855739200
## 7       HAIL  3672085701
## 8   BLIZZARD   638806000
## 9       HEAT   505957550
## 10   TYPHOON   500100000
topECONOMICDMGPlot <- melt(topECONOMICDMG)
## Using EVTYPE as id variables
ggplot(topECONOMICDMGPlot,aes(x = factor(topECONOMICDMGPlot$EVTYPE), y = topECONOMICDMGPlot$value, fill = variable)) + geom_bar(stat = "identity", position = "dodge", fill="GREEN", colour="GREEN") + theme(axis.text.x = element_text(angle = -270), plot.title = element_text(face = "bold")) + labs(x = "Severe Weather Event", y = "Cost of Damage (USD)") + theme(legend.position = "none") + ggtitle("Impact of Severe Weather on Economic Damage (top 10)")

Brief Discussion / Summary

Tornados are the severe weather event withthe greatest number of fatalities, injuries and the greatest economic damage as considered cost in USD. Beyond the overall greatest impact of tornados, there is variation in the remaining top 10. For example, the second greatest event for fatalities is heat, for injuries is wind, and for economic damage is flooding. We cannot rule out the impact of our cleaning process on these results. Many of these top 10 events were part of our cleaning process and we are not sure how a more thorough cleaning process might change the distribution - particularly for the 2nd - 10th highest rated events. Tornado’s are very damaging (overall) and seem to be so in a marked manner (note the overall height of the Tornado fatalities, injuries and USD). It is unlikely that more thorough cleaning would change the position of the tornado event as #1 overall for population health; and only mildly likely that flood or wind would overtake tornado for economic impact (particularl since both wind and flood were cleaned using a similiar process as was used for tornado.

Appendix

Software environment on which this analysis was performed

sessionInfo()
## R version 3.2.2 (2015-08-14)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] reshape2_1.4.1 dplyr_0.4.1    ggplot2_1.0.1 
## 
## loaded via a namespace (and not attached):
##  [1] Rcpp_0.11.6      knitr_1.12.3     magrittr_1.5     MASS_7.3-43     
##  [5] munsell_0.4.2    colorspace_1.2-6 stringr_1.0.0    plyr_1.8.2      
##  [9] tools_3.2.2      parallel_3.2.2   grid_3.2.2       gtable_0.1.2    
## [13] DBI_0.3.1        htmltools_0.3    yaml_2.1.13      lazyeval_0.1.10 
## [17] digest_0.6.8     assertthat_0.1   formatR_1.3      evaluate_0.8.3  
## [21] rmarkdown_0.9.5  labeling_0.3     stringi_0.4-1    scales_0.2.4    
## [25] proto_0.3-10