Synopsis

Data collected in the NOAA Storm Database gives key insight into how different kinds of storms affect lives and property. The analysis shows that, while the storms that occur most often have the highest aggregate impact on people's lives and property, certain events are more devastating on a case-by-case basis. The researcher recommends that public safety officials put consistent, passive policies in place to counteract the effects of frequently occurring events (e.g. building regulations that make structures more resistant to tornadoes), and allocate more emergency resources to the more devastating events (e.g. policing of areas affected by a riptide).

Preprocessing

The data is retrieved from the NOAA Storm Database.

# Setup
library(ggplot2)
library(gridExtra)
library(tidyr)


# Pre-Processing

# Directory
setwd('C:\\Users\\Brendan\\Documents\\Coursera\\Course 4\\Project2')

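# Source URL (bzip2-compressed CSV)
file <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

# Download once; keep the .bz2 extension and use binary mode so the archive is not corrupted
if(!file.exists("StormData.csv.bz2")) download.file(file, "StormData.csv.bz2", mode = "wb")

# Import (read.csv decompresses bzip2 files transparently)
StormData_raw <- read.csv("StormData.csv.bz2")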
StormData <- StormData_raw

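The first thing we need to transform is the pair of damage exponent variables, CROPDMGEXP and PROPDMGEXP. Each holds a code (H, K, M, or B, for hundreds, thousands, millions, or billions) giving the multiplier that must be applied to the corresponding damage value to put it on the correct scale. Any other code is treated here as a multiplier of 1.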

# Multiply out the Exponential Crop Damage
table(StormData$CROPDMGEXP)
## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994
StormData$CROPDMGEXP <- as.character(StormData$CROPDMGEXP)
StormData[StormData$CROPDMGEXP == 'H' | StormData$CROPDMGEXP == 'h',"CROPDMGEXP"] <- '100'
StormData[StormData$CROPDMGEXP == 'K' | StormData$CROPDMGEXP == 'k',"CROPDMGEXP"] <- '1000'
StormData[StormData$CROPDMGEXP == 'M' | StormData$CROPDMGEXP == 'm',"CROPDMGEXP"] <- '1000000'
StormData[StormData$CROPDMGEXP == 'B' | StormData$CROPDMGEXP == 'b',"CROPDMGEXP"] <- '1000000000'
StormData[substr(StormData$CROPDMGEXP, 1, 1) != '1',"CROPDMGEXP"] <- '1'
StormData$CROPDMGEXP <- as.numeric(StormData$CROPDMGEXP)
StormData$CROPDMG <- StormData$CROPDMG * StormData$CROPDMGEXP

# Multiply out the Exponential Property Damage
table(StormData$PROPDMGEXP)
## 
##             -      ?      +      0      1      2      3      4      5 
## 465934      1      8      5    216     25     13      4      4     28 
##      6      7      8      B      h      H      K      m      M 
##      4      5      1     40      1      6 424665      7  11330
StormData$PROPDMGEXP <- as.character(StormData$PROPDMGEXP)
StormData[StormData$PROPDMGEXP == 'H' | StormData$PROPDMGEXP == 'h',"PROPDMGEXP"] <- '100'
StormData[StormData$PROPDMGEXP == 'K' | StormData$PROPDMGEXP == 'k',"PROPDMGEXP"] <- '1000'
StormData[StormData$PROPDMGEXP == 'M' | StormData$PROPDMGEXP == 'm',"PROPDMGEXP"] <- '1000000'
StormData[StormData$PROPDMGEXP == 'B' | StormData$PROPDMGEXP == 'b',"PROPDMGEXP"] <- '1000000000'
StormData[substr(StormData$PROPDMGEXP, 1, 1) != '1',"PROPDMGEXP"] <- '1'
StormData$PROPDMGEXP <- as.numeric(StormData$PROPDMGEXP)
StormData$PROPDMG <- StormData$PROPDMG * StormData$PROPDMGEXP
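The two recoding blocks above apply the same rule to two columns, so the rule can also be written once as a lookup. The following is only a sketch: `exp_to_multiplier` is a hypothetical helper name, not part of the original analysis, and it keeps the same assumption as the code above that any code other than H, K, M, or B means a multiplier of 1.

```r
# Hypothetical helper: map damage-exponent codes to numeric multipliers.
# Codes other than H/K/M/B (blank, digits, symbols) fall back to 1,
# mirroring the treatment in the recoding above.
exp_to_multiplier <- function(code) {
  lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(lookup[toupper(as.character(code))])
  mult[is.na(mult)] <- 1  # unmatched names index to NA
  mult
}

# Toy codes covering both cases and the fallback
codes <- c("K", "m", "", "?", "B", "2", "h")
multipliers <- exp_to_multiplier(codes)
print(multipliers)
```

Under these assumptions, the crop damage could be scaled from the raw codes with `StormData$CROPDMG * exp_to_multiplier(StormData$CROPDMGEXP)`, and property damage in the same way.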

There are some basic variable types we need to adjust, as they were imported incorrectly.

# Set Variable Types
StormData$STATE__ <- factor(StormData$STATE__)
StormData$BGN_DATE <- as.Date(StormData$BGN_DATE, format = '%m/%d/%Y')
StormData$COUNTY <- factor(StormData$COUNTY)
StormData$COUNTY_END <- factor(StormData$COUNTY_END)

The types of events recorded varied over the time period covered by this dataset. As detailed on this NOAA page, https://www.ncdc.noaa.gov/stormevents/details.jsp?type=eventtype, the full suite of 48 event types has only been collected since 1996. We therefore restrict the dataset to 1996 and later.

StormData <- StormData[StormData$BGN_DATE >= "1996-01-01",]
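As a quick check of the date handling, the sketch below parses a few toy strings and applies the same cutoff. The values are made up, and the trailing time component is an assumption about how BGN_DATE is stored; `as.Date` with this format reads the leading month/day/year and ignores whatever follows.

```r
# Toy date strings (hypothetical values; the " 0:00:00" suffix is an assumption)
raw_dates <- c("4/18/1950 0:00:00", "1/1/1996 0:00:00", "12/31/1995 0:00:00")

# Same format string as used for BGN_DATE above
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Compare against an explicit Date rather than a bare string
cutoff <- as.Date("1996-01-01")
keep <- parsed >= cutoff
print(data.frame(raw_dates, parsed, keep))
```

Only the 1996 entry is kept. Strings that fail to parse become NA, so it can be worth checking `sum(is.na(parsed))` before filtering.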

We run into a problem inherent in the data collection process. There do not appear to be any restrictions on how event types are entered: the check below reports 106, where NOAA defines only 48 event types. Recategorizing the entries would be highly subjective and hard to reproduce, so the researcher settles for looking at the 50 most frequently recorded event types.

# Count the event types; NOAA defines only 48
# Note: this call counts distinct frequency values in the table; length(unique(StormData$EVTYPE)) would count the labels themselves
length(unique(table(StormData$EVTYPE)))
## [1] 106
# The EVTYPE variable is not clean.  Cleaning it is complicated and subjective, so we stick to an analysis of the most frequently recorded events
StormData$EVTYPE <- as.character(StormData$EVTYPE)
StormData$EVTYPE <- tolower(StormData$EVTYPE)
StormData$EVTYPE <- gsub("[[:punct:]]", " ", StormData$EVTYPE)

ev_words <- as.data.frame(table(StormData$EVTYPE))
ev_words$rank <- rank(-ev_words$Freq, ties.method = "first")
ev_words <- ev_words[ev_words$rank <= 50,]

StormData <- StormData[StormData$EVTYPE %in% ev_words$Var1,]
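The difference between counting labels and counting frequencies, and the top-N selection itself, can be illustrated on a toy vector. This is a sketch on made-up labels; `top_n_labels` is a hypothetical helper, not part of the original analysis.

```r
# Toy event labels (hypothetical data, not from the Storm Database)
ev <- c(rep("hail", 4), rep("flood", 3), "tornado", "fog")

freq <- table(ev)

# Distinct labels vs. distinct frequency values: these are different questions
n_labels <- length(unique(ev))    # hail, flood, tornado, fog
n_counts <- length(unique(freq))  # frequencies 4, 3, 1 (tornado and fog tie)

# Hypothetical helper: names of the n most frequent labels
top_n_labels <- function(x, n) {
  ranked <- names(sort(table(x), decreasing = TRUE))
  ranked[seq_len(min(n, length(ranked)))]
}

top2 <- top_n_labels(ev, 2)
print(list(n_labels = n_labels, n_counts = n_counts, top2 = top2))
```

Because tied labels share a frequency, `length(unique(table(x)))` can be smaller than the number of distinct labels, which is why the two counts should not be used interchangeably.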

Results

Now that we have our final dataset, we will look at the frequency and severity of the different events.

Frequency

## Get a count of the frequency of these events
# Reorder by Count
StormData <- within(StormData, 
                   EVTYPE <- factor(EVTYPE, 
                                      levels=names(sort(table(EVTYPE), 
                                                        decreasing=TRUE))))

# Bar Chart of EVTYPEs
p1 <- ggplot(data = StormData, aes(x = EVTYPE)) + geom_bar() + labs(title="# Of Events", x = NA, y = NA) + theme(axis.text.x = element_text(angle = 90), axis.title.x=element_blank(), axis.title.y=element_blank())
p2 <- ggplot(data = StormData, aes(x = EVTYPE)) + geom_bar() + scale_y_log10() + labs(title="Log # Of Events", x = NA, y = NA) + theme(axis.text.x = element_text(angle = 90), axis.title.x=element_blank(), axis.title.y=element_blank())

We plot both the total number of events and the log number of events, so we can get an understanding of the relative occurrence of the kinds of storms.

grid.arrange(p1, p2, top = "Frequency of Events", bottom = "Events")

Clearly, the storms that occur most often are not necessarily the ones you would expect to cause much damage. Hail, wind, and thunderstorms probably cause fairly light damage and very rarely affect human life. It is surprising to see how commonly flash floods occur.

Aggregate Damage

#### SUM
# Sum by event type
evtype_damage_sum <- with(StormData, aggregate(list(FATALITIES, INJURIES, PROPDMG, CROPDMG),
                                           by = list(EVTYPE), sum))
colnames(evtype_damage_sum) <- c('EVTYPE', 'FATALITIES', 'INJURIES', 'PROPDMG', 'CROPDMG')
evtype_damage_sum <- gather(evtype_damage_sum, measure, damage, FATALITIES:CROPDMG, factor_key = FALSE)
evtype_damage_sum_people <- evtype_damage_sum[evtype_damage_sum$measure %in% c('FATALITIES', 'INJURIES') & evtype_damage_sum$damage != 0,]
evtype_damage_sum_objects <- evtype_damage_sum[!(evtype_damage_sum$measure %in% c('FATALITIES', 'INJURIES')) & evtype_damage_sum$damage != 0,]

# Develop plots
# People
p1 <- ggplot(data = evtype_damage_sum_people,
             aes(x=reorder(EVTYPE, damage, function(x){ -sum(log(x)) }), y = damage, fill = measure))
p1 <- p1 + geom_bar(stat = 'identity') + scale_y_log10()
p1 <- p1 + labs(title="People Affected",
                x = NA, y = "Log People Affected")
p1 <- p1 + theme(axis.text.x = element_text(angle = 90), axis.title.x=element_blank(), legend.position = "bottom")
# Property
p2 <- ggplot(data = evtype_damage_sum_objects,
             aes(x=reorder(EVTYPE, damage, function(x){ -sum(log(x)) }), y = damage, fill = measure))
p2 <- p2 + geom_bar(stat = 'identity') + scale_y_log10()
p2 <- p2 + labs(title="Property Damage",
                x = NA, y = "Log Damage to Property")
p2 <- p2 + theme(axis.text.x = element_text(angle = 90), axis.title.x=element_blank(), legend.position = "bottom")

We look at the log damage so we can get a better measure of relative damage amongst the lower-tier events.

grid.arrange(p1, p2, ncol = 2, bottom = 'Event')
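The bar order in these plots comes from `reorder()` with the summary function `-sum(log(x))`: each event type is scored by the negated sum of the logs of its measures, and levels are sorted by ascending score, so the event with the largest combined log damage comes first. A minimal sketch on made-up numbers (the data frame `d` is hypothetical):

```r
# Toy long-format data: two measures per event type (hypothetical values)
d <- data.frame(
  ev     = c("a", "a", "b", "b", "c", "c"),
  damage = c(10, 100, 1000, 1000, 1, 10)
)

# Score per level: a = -log(1000), b = -log(1e6), c = -log(10)
scored <- reorder(d$ev, d$damage, function(x) -sum(log(x)))

# Levels are sorted by ascending score, i.e. largest combined log damage first
ord <- levels(scored)
print(ord)
```

A zero in `damage` would give `log(0) = -Inf`, which is one reason the plotting data above drops rows where the damage equals 0.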

Flash floods do, in fact, damage a significant amount of property, much more than the other most frequently occurring events. They also rank fairly high in harm to human life. Improving early detection of flash floods may be a worthwhile use of resources, given the damage it might mitigate and the ease with which a detection system might be built: there is a large body of historical examples on which to develop and test one.

Average Damage

##### AVERAGE
# Average by event type
evtype_damage_mean <- with(StormData, aggregate(list(FATALITIES, INJURIES, PROPDMG, CROPDMG),
                                               by = list(EVTYPE), mean))
colnames(evtype_damage_mean) <- c('EVTYPE', 'FATALITIES', 'INJURIES', 'PROPDMG', 'CROPDMG')
evtype_damage_mean <- gather(evtype_damage_mean, measure, damage, FATALITIES:CROPDMG, factor_key = FALSE)
evtype_damage_mean_people <- evtype_damage_mean[evtype_damage_mean$measure %in% c('FATALITIES', 'INJURIES') & evtype_damage_mean$damage != 0,]
evtype_damage_mean_objects <- evtype_damage_mean[!(evtype_damage_mean$measure %in% c('FATALITIES', 'INJURIES')) & evtype_damage_mean$damage != 0,]

# Develop plots
# People
p1 <- ggplot(data = evtype_damage_mean_people,
             aes(x=reorder(EVTYPE, damage, function(x){ -sum(x) }), y = damage, fill = measure))
p1 <- p1 + geom_bar(stat = 'identity')
p1 <- p1 + labs(title="People Affected",
                x = NA, y = "People Affected")
p1 <- p1 + theme(axis.text.x = element_text(angle = 90), axis.title.x=element_blank(), legend.position = "bottom")
# Property
p2 <- ggplot(data = evtype_damage_mean_objects,
             aes(x=reorder(EVTYPE, damage, function(x){ -sum(log(x)) }), y = damage, fill = measure))
p2 <- p2 + geom_bar(stat = 'identity') + scale_y_log10()
p2 <- p2 + labs(title="Property Damage",
               x = NA, y = "Log Damage to Property")
p2 <- p2 + theme(axis.text.x = element_text(angle = 90), axis.title.x=element_blank(), legend.position = "bottom")

We take the log of the damage to property to adjust our scale, but there is no need to do so when comparing the damage to people.

grid.arrange(p1, p2, ncol = 2, bottom = 'Event')
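The contrast between the aggregate and average views can be seen on a toy example: a frequent, mild event can lead on totals while a rare, severe one leads on the per-event mean. The numbers below are invented purely for illustration and do not come from the Storm Database.

```r
# Toy data: eight mild "hail" events and one severe "heat" event (hypothetical)
toy <- data.frame(
  EVTYPE     = c(rep("hail", 8), "heat"),
  FATALITIES = c(rep(1, 8), 5)
)

# Total vs. per-event average by event type
totals <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = sum)
means  <- aggregate(FATALITIES ~ EVTYPE, data = toy, FUN = mean)

print(totals)
print(means)
```

Here hail has the larger total, but heat has the larger mean, which is the pattern the two sets of plots are designed to separate.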

On a per-event basis, property and crops are hit hardest by tropical storms and by events driven by dry climate conditions, such as wildfire and drought. The researcher recommends increasing resources during dry seasons and hurricane seasons as the most effective way to limit damage to crops and property.

Excessive heat is by far the most dangerous event people face, and the researcher recommends that stations be set up during heat waves to allow people to recuperate. It is also vital to note that the risk of fatality relative to injury is much higher for riptides and avalanches. Because the risk of fatality is so high, preventative measures need to be put in place, such as patrolling coastal areas during riptides or levying severe fines for trespassing in avalanche risk zones. Because a life, once lost, cannot be recovered, these events deserve more preventative action than others, where post-event medical equipment can be of more use.