Synopsis / Executive Summary

Using the NOAA storm database, this analysis answers two questions:

  1. Which types of events are most harmful to population health? ANSWER: TORNADO.

  2. Which types of events have the greatest economic consequences? ANSWER: FLOOD.

Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This markdown document explores the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The analysis presented below seeks to answer two questions:

  1. Which types of events are most harmful to population health?

  2. Which types of events have the greatest economic consequences?

Data Processing

To start, we download the data from the internet in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size.

There is also some documentation of the database that shows how some of the variables are constructed/defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records.
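
One way to check how sparse the early records are is to count events per year. The sketch below is illustrative only: `bgn_date` is a small made-up vector written in the same format as the BGN_DATE column, not real records, and `event_year` and `events_per_year` are hypothetical names.

```r
# Made-up sample in the BGN_DATE format ("month/day/year h:mm:ss"); not real records
bgn_date <- c("4/18/1950 0:00:00", "6/8/1950 0:00:00", "2/20/1951 0:00:00",
              "11/15/2011 0:00:00", "11/16/2011 0:00:00", "11/28/2011 0:00:00")

# as.Date ignores the trailing time once the date format has been matched
event_year <- format(as.Date(bgn_date, format = "%m/%d/%Y"), "%Y")

# Number of recorded events in each year
events_per_year <- table(event_year)
print(events_per_year)
```

On the full data set, the same two steps could be applied to the BGN_DATE column once the file has been read.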

Because data processing and exploration are iterative, I included code to ensure the file is downloaded only once and kept locally. This saves significant time over the course of the analysis.

#Load needed libraries
library(ggplot2)
library(data.table)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#Download and read data file once to save time
data_file <- "StormData.csv.bz2"

if (!file.exists(data_file)) {
  download_url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
  download.file(download_url, destfile = data_file)
}

#read.csv reads bzip2-compressed files directly, so no separate unzip step is needed
#Only read the file if it is not already loaded in the session
if (!exists("RawStormData")) {
  RawStormData <- read.csv(data_file)
}
  
#Initial look at data
str(RawStormData)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

This initial look at the data shows 902297 observations of 37 variables. Since we are only interested in answering two questions, we can drop most of the columns and retain only those pertaining to event types and health or economic impacts.

The code below subsets the data and converts some columns into usable numbers. The property and crop exponent columns (PROPDMGEXP and CROPDMGEXP) are order-of-magnitude codes. They must be converted into numeric multipliers and applied to the property and crop damage values so that the damage amounts can be compared.

#Keep only the relevant columns, then only rows with a nonzero health or economic impact
relevantcolumns <- c("EVTYPE", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")
StormData <- as.data.table(RawStormData[, relevantcolumns])
StormData <- StormData[INJURIES > 0 | FATALITIES > 0 | PROPDMG > 0 | CROPDMG > 0]


#Make all exponents upper case to minimize SCALE conversions, since some were entered lower case
StormData$PROPDMGEXP <- toupper(StormData$PROPDMGEXP)
StormData$CROPDMGEXP <- toupper(StormData$CROPDMGEXP)

#Add column to change all crop exponents into a number that can be multiplied by the value
StormData$CROPDMGSCALE[(StormData$CROPDMGEXP == "")] <- 10^0
StormData$CROPDMGSCALE[(StormData$CROPDMGEXP == "?")] <- 10^0
StormData$CROPDMGSCALE[(StormData$CROPDMGEXP == "0")] <- 10^0
StormData$CROPDMGSCALE[(StormData$CROPDMGEXP == "2")] <- 10^2
StormData$CROPDMGSCALE[(StormData$CROPDMGEXP == "K")] <- 10^3
StormData$CROPDMGSCALE[(StormData$CROPDMGEXP == "M")] <- 10^6
StormData$CROPDMGSCALE[(StormData$CROPDMGEXP == "B")] <- 10^9

#Add column to change all property exponents into a number that can be multiplied by the value
StormData$PROPDMGSCALE[(StormData$PROPDMGEXP == "")] <- 10^0
StormData$PROPDMGSCALE[(StormData$PROPDMGEXP == "-")] <- 10^0
StormData$PROPDMGSCALE[(StormData$PROPDMGEXP == "?")] <- 10^0
StormData$PROPDMGSCALE[(StormData$PROPDMGEXP == "+")] <- 10^0
StormData$PROPDMGSCALE[(StormData$PROPDMGEXP == "0")] <- 10^0
StormData$PROPDMGSCALE[(StormData$PROPDMGEXP == "1")] <- 10^1
StormData$PROPDMGSCALE[(StormData$PROPDMGEXP == "2")] <- 10^2
StormData$PROPDMGSCALE[(StormData$PROPDMGEXP == "3")] <- 10^3
StormData$PROPDMGSCALE[(StormData$PROPDMGEXP == "4")] <- 10^4
StormData$PROPDMGSCALE[(StormData$PROPDMGEXP == "5")] <- 10^5
StormData$PROPDMGSCALE[(StormData$PROPDMGEXP == "6")] <- 10^6
StormData$PROPDMGSCALE[(StormData$PROPDMGEXP == "7")] <- 10^7
StormData$PROPDMGSCALE[(StormData$PROPDMGEXP == "8")] <- 10^8
StormData$PROPDMGSCALE[(StormData$PROPDMGEXP == "H")] <- 10^2
StormData$PROPDMGSCALE[(StormData$PROPDMGEXP == "K")] <- 10^3
StormData$PROPDMGSCALE[(StormData$PROPDMGEXP == "M")] <- 10^6
StormData$PROPDMGSCALE[(StormData$PROPDMGEXP == "B")] <- 10^9

#Add columns to sum health and economic impacts
StormData <- mutate(StormData, TOTALHEALTH = FATALITIES + INJURIES)
StormData <- mutate(StormData, TOTALECON = PROPDMG * PROPDMGSCALE + CROPDMG * CROPDMGSCALE)

#Look at modified data to verify changes
str(StormData)
## Classes 'data.table' and 'data.frame':   254633 obs. of  11 variables:
##  $ EVTYPE      : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ FATALITIES  : num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES    : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG     : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP  : chr  "K" "K" "K" "K" ...
##  $ CROPDMG     : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP  : chr  "" "" "" "" ...
##  $ CROPDMGSCALE: num  1 1 1 1 1 1 1 1 1 1 ...
##  $ PROPDMGSCALE: num  1000 1000 1000 1000 1000 1000 1000 1000 1000 1000 ...
##  $ TOTALHEALTH : num  15 0 2 2 2 6 1 0 15 0 ...
##  $ TOTALECON   : num  25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Now we see that there are 254633 observations of 11 variables, including the new columns of total health impacts and total economic impact.
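
The exponent-to-multiplier conversion can also be written more compactly as a lookup. The following is a minimal sketch, not the method used above: `exp_scale` and `exponent_to_scale` are hypothetical names, and the mapping simply mirrors the assignments in this report (letters H, K, M, B; a single digit n meaning 10^n; anything else treated as 1).

```r
# Hypothetical lookup table: letter exponent code -> numeric multiplier
exp_scale <- c("H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9)

exponent_to_scale <- function(exp_code) {
  exp_code <- toupper(exp_code)

  # Named-vector lookup; codes not in the table (including "") come back as NA
  scale <- unname(exp_scale[exp_code])

  # A single digit n is read as 10^n, matching the assignments in this report
  is_digit <- grepl("^[0-9]$", exp_code)
  scale[is_digit] <- 10^as.numeric(exp_code[is_digit])

  # Everything else ("", "?", "+", "-") is treated as a multiplier of 1
  scale[is.na(scale)] <- 1
  scale
}

# Example call on a few exponent codes, including lower case entries
exponent_to_scale(c("", "k", "M", "B", "2", "?"))
```

A function like this could fill both PROPDMGSCALE and CROPDMGSCALE with one call each, at the cost of hiding which codes actually occur in each column.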

NOTE: Because no preference was given for valuing a fatality over an injury, or property damage over crop damage, these values were simply summed on the assumption that a total count would suffice. A glance at the data shows the fatality and crop damage counts are well below the injury and property damage counts, as would be expected, so unless a very large multiplier were applied to fatalities or crop damage, the leading event types would be the same.
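
The effect of weighting fatalities more heavily than injuries can be checked with a small sensitivity test. The sketch below uses an invented toy data frame (`toy`), a hypothetical helper (`top_event`), and an arbitrary `fatality_weight`; none of these come from the NOAA data, and the toy numbers are chosen only to show that a large enough weight can change the ranking.

```r
# Invented toy data, NOT taken from the storm database
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "FLOOD", "HEAT"),
  FATALITIES = c(1, 0, 2, 5),
  INJURIES   = c(14, 15, 3, 2)
)

# Hypothetical helper: event type with the largest weighted health impact
top_event <- function(df, fatality_weight = 1) {
  score  <- df$FATALITIES * fatality_weight + df$INJURIES
  totals <- tapply(score, df$EVTYPE, sum)   # sum the score within each event type
  names(which.max(totals))
}

top_event(toy)                        # unweighted sum, as used in this report
top_event(toy, fatality_weight = 10)  # fatalities counted ten times as heavily
```

Running the same comparison on StormData with a few candidate weights would show whether the leading event type is sensitive to that choice.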

The next step in the analysis was to group the data by event type and sum the values to determine which events were most impactful.

#Group and add health impacts from events
HealthCost <- StormData %>% 
  group_by(EVTYPE) %>% 
  summarise(TOTALHEALTH = sum(TOTALHEALTH)) %>% 
  arrange(desc(TOTALHEALTH))
  
#Group and add economic impacts from events
EconCost <- StormData %>% 
  group_by(EVTYPE) %>% 
  summarise(TOTALECON = sum(TOTALECON)) %>% 
  arrange(desc(TOTALECON))

Results

The code and plot below show that tornadoes are by far the leading cause of injuries/fatalities among all event types. If the totals were closer in magnitude, additional processing might have been needed: there appear to be duplicate event types under different names, such as TSTM WIND and THUNDERSTORM WIND, or FLOOD and FLASH FLOOD. Tornadoes being the leading cause of injuries/fatalities seems plausible, as most other weather events are either less severe and/or allow plenty of time to prepare.

#Plot Top 10 events HealthCost
HealthPlot <- ggplot(HealthCost[1:10,], aes(x=reorder(EVTYPE, -TOTALHEALTH),y=TOTALHEALTH,color=EVTYPE)) + 
  geom_bar(stat="identity", fill="white") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  xlab("Event Type") + ylab("Number of Injuries (inc. Fatal)") +
  theme(legend.position="none") +
  ggtitle("Injuries/Fatalities in the US From Severe Weather Events")
print(HealthPlot)

The code and plot below show that floods are the leading weather event type for economic impacts. As with the injury/fatality plot, there are duplicates here, such as FLOOD, FLASH FLOOD, and RIVER FLOOD, or HURRICANE/TYPHOON and HURRICANE, but the difference in magnitude between the two leading event types makes further processing unnecessary.

#Plot Top 10 events EconCost
EconPlot <- ggplot(EconCost[1:10,], aes(x=reorder(EVTYPE, -TOTALECON),y=TOTALECON,color=EVTYPE)) + 
  geom_bar(stat="identity", fill="white") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  xlab("Event Type") + ylab("Economic cost ($s)") +
  theme(legend.position="none") +
  ggtitle("Economic Loss in the US From Severe Weather Events")
print(EconPlot)
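
If the totals had been closer, the apparent duplicate event types noted in the Results could be merged before grouping. The following is a rough sketch under stated assumptions: `normalize_evtype` is a hypothetical helper, the patterns cover only the duplicates named in this report, and whether FLASH FLOOD and RIVER FLOOD should really be counted as FLOOD is a judgment call, not something the database prescribes.

```r
# Hypothetical helper: collapse the duplicate event names mentioned in this report
normalize_evtype <- function(evtype) {
  evtype <- toupper(trimws(evtype))

  # TSTM WIND and THUNDERSTORM WIND(S) variants -> THUNDERSTORM WIND
  evtype <- sub("^TSTM WIND.*", "THUNDERSTORM WIND", evtype)
  evtype <- sub("^THUNDERSTORM WINDS?.*", "THUNDERSTORM WIND", evtype)

  # FLOOD, FLASH FLOOD and RIVER FLOOD variants -> FLOOD (an assumption)
  evtype <- sub("^(FLASH |RIVER )?FLOOD.*", "FLOOD", evtype)

  # HURRICANE and HURRICANE/TYPHOON variants -> HURRICANE
  evtype <- sub("^HURRICANE.*", "HURRICANE", evtype)

  evtype
}

# Example call on the names discussed in the text
normalize_evtype(c("TSTM WIND", "Thunderstorm Winds", "FLASH FLOOD",
                   "RIVER FLOOD", "HURRICANE/TYPHOON", "TORNADO"))
```

Applying such a function to EVTYPE before the group_by step would combine these rows into single categories in both plots.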