Introduction

The National Weather Service (NWS) tracks and records severe weather events across the United States in a Storm Events Database. This report uses the NWS dataset to identify the types of severe weather events that, in aggregate, have the largest impact on public health, and those that result in the most property and crop damage. It focuses on the most recent set of data (1996-2011), for which more complete records were kept. Because all event types are compared over the same period of time, the report presents the sum of all reported injuries and fatalities, as well as the sum of all property damage and crop damage. These totals give an idea of the overall health and economic impact of each storm type relative to the others across the same timeframe, as opposed to the average impact of a single weather event.

Data Processing

Downloading and Importing the File Into the R Workspace

A link to download the original data set was provided by the instructors of the Coursera Reproducible Research course, and the file was downloaded May 19th, 2015. The URL is provided here.

https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

download.file("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2", 
              destfile="./repdata-data-StormData.csv.bz2", method="curl")
storm_data <- read.csv('./repdata-data-StormData.csv.bz2', header=T)
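When the report is re-run, it can be convenient to skip the download if the file is already on disk. The following is a minimal sketch of that idea; the helper name fetch_if_missing is hypothetical and is not part of the original analysis.

```r
# Hypothetical helper (not in the original report): download a file only
# when it is not already present, then return its path invisibly.
fetch_if_missing <- function(url, destfile, ...) {
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile, ...)
  }
  invisible(destfile)
}

# Intended usage (commented out because it needs network access):
# fetch_if_missing(
#   "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
#   "./repdata-data-StormData.csv.bz2", method = "curl")
# storm_data <- read.csv("./repdata-data-StormData.csv.bz2", header = TRUE)
```

read.csv() can read the bzip2-compressed file directly, so no separate decompression step is needed.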

Data Exploration

The NWS’ webpage on the Storm Events Database (https://www.ncdc.noaa.gov/stormevents/details.jsp) was consulted to better understand the structure and makeup of the data. The site indicates that complete data are available from January 1996 forward for 48 categories of severe weather events, defined in Directive 10-1605. While we could include data from all years, it would not be fair or accurate to compare the impact of Tornados recorded since 1950 with the impact of other events recorded only since 1996. I have therefore elected to use only data collected in the 1996-2011 time period, and the data set was filtered down to events that began on or after Jan 1, 1996. A total of 653530 records were retained, ranging from the Jan 1, 1996 cutoff up to November 30, 2011. The BGN_DATE column was used as the date source for the filtering, and was converted to the POSIXct format.

library(dplyr)
## 
## Attaching package: 'dplyr'
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(stringdist)
library(stringr)
library(ggplot2)

storm_data$BGN_DATE <- as.POSIXct(storm_data$BGN_DATE, format="%m/%d/%Y %H:%M:%S")
storm_data <- storm_data %>% filter(BGN_DATE >= as.POSIXct('1/1/1996', format="%m/%d/%Y")) 
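To make the date handling concrete, here is a small sketch on made-up rows using the same format string as above. The toy data frame and the explicit tz = "UTC" are assumptions for illustration; the report itself relies on the session's default time zone.

```r
# Toy rows (made up) with BGN_DATE strings in the "%m/%d/%Y %H:%M:%S" layout.
toy <- data.frame(
  BGN_DATE = c("12/31/1995 0:00:00", "1/1/1996 0:00:00", "11/30/2011 0:00:00"),
  EVTYPE   = c("HAIL", "TSTM WIND", "FLOOD"),
  stringsAsFactors = FALSE
)

# Parse the strings; the time zone is fixed here only to keep the
# comparison independent of the machine running the sketch.
toy$BGN_DATE <- as.POSIXct(toy$BGN_DATE, format = "%m/%d/%Y %H:%M:%S", tz = "UTC")
cutoff <- as.POSIXct("1/1/1996", format = "%m/%d/%Y", tz = "UTC")

# Keep events that began on or after the cutoff.
kept <- toy[toy$BGN_DATE >= cutoff, ]
kept$EVTYPE
```

The event dated December 31, 1995 is dropped, while the event falling exactly on the cutoff is kept because the comparison is inclusive.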

Formatting of Property Damage and Crop Damage

The total property damage and crop damage are provided by four columns in the database: PROPDMG and PROPDMGEXP, and CROPDMG and CROPDMGEXP. The *EXP columns represent the multiplier for the base dollar amount, and take one of three values in the filtered data set: K (thousands), M (millions), or B (billions) of US dollars. These values were converted to the appropriate numeric multiplier (1E3 for thousands, 1E6 for millions, or 1E9 for billions; any other value is mapped to 0), and two new columns, PROPDMG_CASH and CROPDMG_CASH, were generated by multiplying the raw dollar amount by the multiplier. The same step also trims whitespace from EVTYPE and converts it to upper case. The code is as follows:

storm_data <- storm_data %>% mutate(EVTYPE = str_trim(toupper(EVTYPE)), 
        PROPDMGEXP = ifelse(PROPDMGEXP == 'B',1E9,
                            ifelse(PROPDMGEXP == 'K', 1E3,
                                   ifelse(PROPDMGEXP == 'M', 1E6,0))),
        CROPDMGEXP = ifelse(CROPDMGEXP == 'B', 1E9,
                            ifelse(CROPDMGEXP == 'K', 1E3,
                                   ifelse(CROPDMGEXP == 'M', 1E6,0))))

storm_data <- storm_data %>% mutate(PROPDMG_CASH = PROPDMG*PROPDMGEXP, CROPDMG_CASH = CROPDMG*CROPDMGEXP) %>% 
  arrange(BGN_DATE)
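The nested ifelse() calls above amount to a lookup table. As an alternative sketch (not the code used for this report), the same mapping can be written with a named vector; the names exp_multiplier and to_multiplier are hypothetical.

```r
# Lookup table from exponent code to dollar multiplier.
exp_multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: map exponent codes to multipliers.
# Codes other than K, M, or B (including blank or missing ones) map to 0.
to_multiplier <- function(exp_code) {
  m <- unname(exp_multiplier[as.character(exp_code)])
  m[is.na(m)] <- 0
  m
}

multipliers <- to_multiplier(c("K", "M", "B", "", "0"))

# A record with PROPDMG = 25 and PROPDMGEXP = "K" represents 25,000 dollars.
damage_cash <- 25 * to_multiplier("K")
```

Adding a new code to this version only requires a new entry in exp_multiplier rather than another level of nesting.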

Formatting EVTYPEs

The EVTYPE column represents the assigned name of the severe weather event, and the column was extremely messy, containing misspellings and variations on the official 48 event types. A complete and thorough encoding of all event types to the official 48 events would be a manual and difficult process using regex, and so I decided to apply string clustering to the event types, because this is supposed to be a learning experience. To cluster the strings, I used a hierarchical clustering algorithm (the hclust function in base R) on the Jaro-Winkler string distance implemented in the stringdist package. Initial investigations revealed that a cutoff of 0.14 in the Jaro-Winkler distance was suitable for capturing misspellings and subtle naming variations, and each event was assigned to its associated cluster. Cluster assignments were appended back to the 653530 records using the merge function.

set.seed(42)
# There is no point clustering multiple instances of the same EVTYPE, so let's limit it to the unique events. 
EVTYPES <- unique(storm_data$EVTYPE)
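# Create the distance matrix with the Jaro-Winkler distance metric. It's normalized on string distance so a single metric will work for strings of various lengths like we have in our data set. 
# Note: with stringdist's default prefix factor (p = 0), method = "jw" returns the plain Jaro distance; a value of p > 0 would apply the Winkler prefix adjustment.
distance_matrix <- stringdistmatrix(EVTYPES,EVTYPES,method = "jw")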
# We put the names back into the distance matrix so we can plot it. 
rownames(distance_matrix) <- EVTYPES
EVTYPES_hc <- hclust(as.dist(distance_matrix))

# Let's grab clusters at height 0.14 in the hierarchical tree. 
EV_cuts <- cutree(EVTYPES_hc, h=0.14)
EV_cuts <- as.data.frame(EV_cuts)
EV_cuts$Event_Type <- attr(EV_cuts, 'row.names')
colnames(EV_cuts) <- c("Cluster","Event_Type")

storm_data <- merge(storm_data, EV_cuts, by.x="EVTYPE", by.y="Event_Type", all.x=T, all.y=F)

cluster_demo <- storm_data %>% group_by(EVTYPE, Cluster) %>% summarize(COUNT = n()) %>% ungroup() %>% 
  arrange(Cluster)

To demonstrate the grouping, two clusters (Urban/Small Stream Flooding - Cluster 5, and TSTM Winds - Cluster 10) are shown here.

subset(cluster_demo, Cluster == 5)
## Source: local data frame [3 x 3]
## 
##                  EVTYPE Cluster COUNT
## 1 URBAN/SMALL STRM FLDG       5     1
## 2  URBAN/SML STREAM FLD       5  3392
## 3 URBAN/SML STREAM FLDG       5     1
subset(cluster_demo, Cluster == 10)
## Source: local data frame [6 x 3]
## 
##          EVTYPE Cluster  COUNT
## 1     TSTM WIND      10 128668
## 2  TSTM WIND 40      10      1
## 3  TSTM WIND 45      10      1
## 4 TSTM WIND G45      10      1
## 5    TSTM WINDS      10      1
## 6      TSTM WND      10      1
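The mechanics of cutting a hierarchical tree at a fixed height can be seen on a handful of names. The sketch below is an illustration only: so that it runs without extra packages, it swaps in base R's adist() (Levenshtein edit distance) divided by the longer string's length instead of the stringdist metric used in this report, and the 0.25 cutoff is an arbitrary choice for this toy metric.

```r
# A few event names, some of which are obvious variants of each other.
event_names <- c("TSTM WIND", "TSTM WINDS", "TSTM WND",
                 "FLASH FLOOD", "FLASH FLOODING", "HAIL")

# Edit distance scaled to [0, 1] by the longer of the two strings
# (a stand-in for the normalized string distance used in the report).
edit_dist <- adist(event_names)
max_len   <- outer(nchar(event_names), nchar(event_names), pmax)
norm_dist <- edit_dist / max_len
dimnames(norm_dist) <- list(event_names, event_names)

# Complete-linkage clustering (the hclust default), cut at a fixed height.
toy_hc       <- hclust(as.dist(norm_dist))
toy_clusters <- cutree(toy_hc, h = 0.25)

# List the members of each cluster.
split(names(toy_clusters), toy_clusters)
```

The three TSTM spellings fall into one cluster, the two FLASH FLOOD spellings into another, and HAIL stays on its own. Raising the cut height merges more names together, which is the trade-off the cutoff controls.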

Now that we see how the event types will be grouped (and understand the limits of what string clustering can group), we will proceed to aggregate the clusters.

The dplyr package was used to group events based on their cluster. The name of each cluster was taken from the first EVTYPE in that cluster. Note that merge() sorts its result on the key column by default, so at this point the rows are ordered by EVTYPE rather than by date, and the first EVTYPE is effectively the alphabetically first name in each cluster. (Following the code and process in this report, this will lead to reproducible cluster names in the final output.) The sum() function in dplyr’s summarize() was used to calculate the total number of fatalities and injuries, as well as the total cost of property and crop damage for each cluster.

storm_data_summarized <- storm_data %>% group_by(Cluster) %>% 
  summarize(EVTYPE = first(EVTYPE), COUNT = n(), FATALITIES = sum(FATALITIES, na.rm=T), 
            INJURIES = sum(INJURIES, na.rm=T), PROPDMG = sum(PROPDMG_CASH, na.rm=T), 
            CROPDMG = sum(CROPDMG_CASH, na.rm=T)) %>% ungroup()
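To show what this aggregation produces, here is a small sketch on made-up rows. It uses base R's tapply() rather than dplyr so that it runs without extra packages; the toy data frame and its values are assumptions for illustration only.

```r
# Made-up records already assigned to two clusters.
toy <- data.frame(
  Cluster    = c(1, 1, 2, 2, 2),
  EVTYPE     = c("TSTM WIND", "TSTM WINDS", "FLOOD", "FLOODS", "FLOOD"),
  FATALITIES = c(1, 0, 2, NA, 3),
  INJURIES   = c(4, 1, 0, 2, 5),
  stringsAsFactors = FALSE
)

# One row per cluster: label from the first EVTYPE, record count,
# and totals that ignore missing values (like sum(..., na.rm = TRUE) above).
toy_summary <- data.frame(
  Cluster    = sort(unique(toy$Cluster)),
  EVTYPE     = as.vector(tapply(toy$EVTYPE, toy$Cluster, function(x) x[1])),
  COUNT      = as.vector(tapply(toy$EVTYPE, toy$Cluster, length)),
  FATALITIES = as.vector(tapply(toy$FATALITIES, toy$Cluster, sum, na.rm = TRUE)),
  INJURIES   = as.vector(tapply(toy$INJURIES, toy$Cluster, sum, na.rm = TRUE)),
  stringsAsFactors = FALSE
)
toy_summary
```

Each cluster collapses to a single row, so the variant spellings FLOOD and FLOODS contribute to one combined total.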

Results

Largest Impact to Public Health

First we will look at those storm types that have the largest effects on population health using the total number of fatalities and injuries. The top ten will be reported for fatalities and injuries.

## These reorderings ensure that the event types are plotted in order of total fatalities or injuries. By default the bars would be arranged alphabetically by EVTYPE. 
storm_data_sum_fatalityorder <- transform(storm_data_summarized, EVTYPE = reorder(EVTYPE,FATALITIES)) %>% 
  arrange(desc(FATALITIES))
storm_data_sum_injuryorder <- transform(storm_data_summarized, EVTYPE = reorder(EVTYPE,INJURIES)) %>% 
  arrange(desc(INJURIES))

## geom_bar(stat="identity") draws the precomputed totals rather than counting rows.
ggplot(storm_data_sum_fatalityorder[1:10,], aes(x=EVTYPE, y=FATALITIES)) + geom_bar(stat="identity") + 
  ylab("Total Fatalities") + xlab("Event Type") + ggtitle("Top 10 Fatal Storm Types") + coord_flip()
ggplot(storm_data_sum_injuryorder[1:10,], aes(x=EVTYPE, y=INJURIES)) + geom_bar(stat="identity") + 
  ylab("Total Injuries") + xlab("Event Type") + ggtitle("Top 10 Injurious Storm Types") + coord_flip()

From these plots we can see that Excessive Heat is the deadliest severe weather event on record between 1996 and 2011. This is interesting because people rarely think of heat as deadly. It is likely that if more information were provided on these deaths, we would see a high proportion of elderly individuals who pass away during heat waves. The second most deadly storm type is Tornados, followed by Flash Floods, Lightning, and Rip Currents.

The top 5 most injurious severe weather events look very similar to the deadly events, although the data suggest that Tornados outpace any other event by roughly a factor of 3. Flooding, Excessive Heat, Lightning, and TSTM WIND (Thunderstorm Winds) round out the top 5.

Impact of Severe Weather On Property and Crops

Next we will look at the total dollars lost to storms, with contributions from property damage, and crop damage.

## Once again these reorderings keep the storm types from being plotted in alphabetical order
storm_data_sum_croporder <- transform(storm_data_summarized, EVTYPE = reorder(EVTYPE,CROPDMG)) %>% 
  arrange(desc(CROPDMG))
storm_data_sum_proporder <- transform(storm_data_summarized, EVTYPE = reorder(EVTYPE,PROPDMG)) %>% 
  arrange(desc(PROPDMG))

## geom_bar(stat="identity") draws the precomputed totals rather than counting rows.
ggplot(storm_data_sum_proporder[1:10,], aes(x=EVTYPE, y=PROPDMG)) + geom_bar(stat="identity") + 
  ylab("US Dollars") + xlab("Event Type") + ggtitle("Total Property Damage") + coord_flip()
ggplot(storm_data_sum_croporder[1:10,], aes(x=EVTYPE, y=CROPDMG)) + geom_bar(stat="identity") + 
  ylab("US Dollars") + xlab("Event Type") + ggtitle("Total Crop Damage") + coord_flip()

The highest amounts of property damage originate from severe weather events that any American would be familiar with. The top five are flooding, hurricanes, storm surge, tornados, and flash flooding. It is fairly clear from this list that most damage comes from large amounts of water being deposited in places where people live, and storm-associated winds.

On the other hand, the most damaging weather event for crops is drought, followed closely by flooding and Hurricanes (represented by the third and fourth bars). In addition to other water-related events (Flash Flood, Heavy Rain, and Tropical Storms), we also see Extreme Cold and Frost/Freeze contributing to crop damage.

As always, thank you to my Coursera classmates for taking the time to read and review!