Synopsis

The following analysis uses the Storm Data dataset to investigate which types of weather events have shown the largest impact, in terms of both economic impact and impact to human health. The data cleaning process is demonstrated, in which only the most recent records are kept (a justification is given in the appendix). The cleaned data show that tornadoes are the most harmful in both respects, showing high rates of fatalities, injuries and property damage. However tornadoes are not the only significant source of harm identified in the dataset.

Data Processing

The following section shows preprocessing performed on the raw Storm Data dataset.

Dependencies for this analysis:

library(dplyr)
library(ggplot2)
library(lubridate)
library(xtable)
library(knitr)

Lets download the storm data from the course website:

zippedFilename <- 'data/stormData.csv.bz2'
dir.create('data', showWarnings = FALSE)
download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2', destfile = zippedFilename)

(Date downloaded: 22 November 2015)

… and read the data into R:

#small function for reading the data file, which will be used here as well as in the appendix
readDatafile <- function(){
  stormData <- read.csv(zippedFilename)
  stormData$BGN_DATE  <-  as.POSIXct(stormData$BGN_DATE, format = "%m/%d/%Y")
  stormData$Event.Type <- stormData$EVTYPE #ne-named for easier reading
  stormData
}

stormData <- readDatafile()

I’ve decided that it is most straight-forward to only use data from 2007 to present, as these data are much cleaner than prior data. See the appendix for further justification of this.

stormData <-filter(stormData, BGN_DATE >= ymd("2007-01-01"))
stormData <- droplevels(stormData)

For value of property and crop damage, values are stored as dollar-multiplier pairs, which are stored in separate columns. The multipliers are:

levels(stormData$PROPDMGEXP)
## [1] "0" "B" "K" "M"

where K = thousands, M = millions, B = billions and ‘0’ corresponds to a single event where damages were recorded as zero.

For assessing economic consequences, create a ‘Total.Damages’ column, where these multipliers have been applied, and crop and property damages have been combined.

multiplyDamages <- function(value, multiplier){
  switch(as.character(multiplier),
    K = value * 1e3,
    M = value * 1e6,
    B = value * 1e9,
    '0' = value * 0
    )
}

stormData <- 
  stormData %>%
  rowwise() %>%
  mutate(Total.Damages = multiplyDamages(PROPDMG, PROPDMGEXP)
                       + multiplyDamages(CROPDMG, CROPDMGEXP))

Now the data are clean and in a convenient format, produce a summary table of the statistics of interest for each type of event. It is from this table we will draw our results.

In this table, compute the total damage, fatalities and injuries for each event type:

impactSummary <- 
  stormData %>% 
  ungroup() %>%
  group_by(Event.Type) %>% 
  summarise(Total.Damages = sum(Total.Damages), 
            Fatalities = sum(FATALITIES), 
            Injuries = sum(INJURIES)) %>%
  arrange(desc(Total.Damages))

An example excerpt from the generated summary table:

kable(head(impactSummary))
Event.Type Total.Damages Fatalities Injuries
FLOOD 16855305800 160 171
TORNADO 14732283740 863 9608
HAIL 6967790600 2 180
FLASH FLOOD 5752594130 293 316
STORM SURGE/TIDE 4641493000 11 5
THUNDERSTORM WIND 3771557690 130 1391

Results

This section addresses the specific questions put forth for this assignment.

Q1: Across the United States, which types of events (as indicated in the Event.Type variable) are most harmful with respect to population health?

This question is answered by investigating total fatalities and injuries for each event type.

The following plot shows total fatalities for each event type over the selected period (from 2007 up to 2012):

qplot(
  data = mutate(impactSummary, Event.Type = reorder(Event.Type, desc(Fatalities))), 
  x = Event.Type, 
  y = Fatalities, 
  geom = 'bar', 
  stat = 'identity',
  main = 'Total number of fatalities aggregated over all event types, 2007-2012',
  xlab = 'Event Type',
  ylab = 'Number of fatalities'
) + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

This plot shows tornadoes to be by far the largest cause of fatalities in the present dataset.

The dataset also provides figures for number of injuries for each event. The following plot shows the relationship between fatalities and injuries for the six event types with the highest fatality counts:

qplot(
  data = impactSummary %>% arrange(desc(Fatalities)) %>% slice(1:6), 
  x = Fatalities, 
  y = Injuries, 
  shape = Event.Type, 
  color = Event.Type, 
  size = I(5),
  main = 'Number of fatalities vs. number of injuries for six event\n types with highest fatality rates, 2007-2012',
  xlab = 'Number of fatalities',
  ylab = 'Number of injuries'
) + theme(legend.title=element_blank())

As the plot indicates, the relationship between injuries and fatalities is not strictly linear for these event types. For instance, flash flooding has a relatively high fatality rate compared to injury rate, while lightning is the opposite.

Which event type can be said to have the worst effect on public health depends partly a given fatality/injury ‘exchange rate’, or how much significance is placed on a single injury with respect to a single fatality. Determining this rate is outside of the scope of this analysis.

That said, it is clear to see that in the given dataset tornadoes have the greatest impact on public health, as they have contributed to far more injuries and fatalities than any other type of event.

For reference, here are the 5 largest causes of fatalities:

kable(
  impactSummary %>% 
    arrange(desc(Fatalities)) %>% 
    select(Event.Type, Fatalities) %>% 
    slice(1:5)
)
Event.Type Fatalities
TORNADO 863
FLASH FLOOD 293
RIP CURRENT 207
HEAT 182
FLOOD 160

And here are the 5 largest causes of injuries:

kable(
  impactSummary %>% 
    arrange(desc(Injuries)) %>% 
    select(Event.Type, Injuries) %>% 
    slice(1:5)
)
Event.Type Injuries
TORNADO 9608
THUNDERSTORM WIND 1391
LIGHTNING 923
EXCESSIVE HEAT 880
HEAT 702

Q2: Across the United States, which types of events have the greatest economic consequences?

This question is answered by investigating the sum of all property and crop damages for each event type.

The following plot shows the total of all property and crop damages over the selected period (from 2007 up to 2012).

qplot(
  data = mutate(impactSummary, Event.Type = reorder(Event.Type, desc(Total.Damages))), 
  x = Event.Type, 
  y = Total.Damages, 
  geom = 'bar', 
  stat = 'identity',
  main = 'Total damages (dollars) aggregated over all event types, 2007-2012',
  xlab = 'Event Type',
  ylab = 'Total Damages (dollars)'
) + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))

From this we can see that a fairly small number of event types account for a large majority of total economic damages. For instance, the top three event types contribute greater aggregate damages than all other event types combined:

with(impactSummary, sum(Total.Damages[1:3]) > sum(Total.Damages[-(1:3)]))
## [1] TRUE

These top three event types, which are responsible for the majority of economic damages in the dataset, are shown below:

kable(impactSummary[1:3,c('Event.Type', "Total.Damages")])
Event.Type Total.Damages
FLOOD 16855305800
TORNADO 14732283740
HAIL 6967790600

Apendix

Justification for subsetting from 2007 to present

According to http://www.ncdc.noaa.gov/stormevents/details.jsp, only a limited number of event types were recorded prior to 1996. Hence it seems that to avoid over-representing certain event types, only data from 1996 onward should be used for this analysis.

Investigating further, and in light of the discussion presented in The History of the Storm Events Database, I had a look into how event codes have been assigned throughout the dataset. The aforementioned document states that the cleanest available data is from 2006 to the present. The following steps investigate this:

Here I create a function which, for a given year and dataset field, will tell me how many unique values there are for that field, from the given year to the present. Specifically I’m looking to see if there is a proliferation of event codes as we head backwards in time.

stormData <- readDatafile()

categoriesSinceYear <- function(category, givenYear) {
  subsetSinceYear <- 
    stormData %>% 
    filter(BGN_DATE >= ymd(paste0(givenYear,"-01-01")))
  
  length(unique(subsetSinceYear[,category]))
  }

This function can then be used to look at how many unique event codes exist if we subset the data from a given year onwards:

nEvTypes <- sapply(1990:2010, function(x) {categoriesSinceYear('Event.Type', x)})
data.frame(year = 1990:2010, nEvTypes = nEvTypes)
##    year nEvTypes
## 1  1990      985
## 2  1991      985
## 3  1992      985
## 4  1993      985
## 5  1994      930
## 6  1995      799
## 7  1996      515
## 8  1997      357
## 9  1998      278
## 10 1999      236
## 11 2000      196
## 12 2001      169
## 13 2002      119
## 14 2003       70
## 15 2004       57
## 16 2005       57
## 17 2006       54
## 18 2007       48
## 19 2008       48
## 20 2009       47
## 21 2010       46

From this you can see that if we include all data from 1990 to today, our dataset would have almost a thousand different event codes. These codes are hard to make sense of, because of the potential for overlap and duplication (e.g. there are codes for both ‘STRONG WIND’ and ‘HIGH WIND’) Keep in mind also that there are only 48 event types officially defined in NWS Directive 10-1605.

On the other hand, if we subset our data from 2006 onwards, we will have a much tidier 54 different codes, or if we subset from 2007 we have the official number of 48 different events.

In light of this, I’ve chosen to subset from 2007 onwards, with the justification that this leaves a much tidier dataset to work with. Note that this comes with limitations, for instance I am less likely to have accounted for more rare, high-impact weather events. Notably, hurricane Katrina is not included in the subset of data used for this analysis.

Source for this presentation

This presentation was written in R markdown with knitr. The source is available in my github repository.