The following analysis uses the Storm Data dataset to investigate which types of weather events have shown the largest impact, in terms of both economic impact and impact to human health. The data cleaning process is demonstrated, in which only the most recent records are kept (a justification is given in the appendix). The cleaned data show that tornadoes are the most harmful in both respects, showing high rates of fatalities, injuries and property damage. However tornadoes are not the only significant source of harm identified in the dataset.
The following section shows preprocessing performed on the raw Storm Data dataset.
Dependencies for this analysis:
library(dplyr)
library(ggplot2)
library(lubridate)
library(xtable)
library(knitr)
Lets download the storm data from the course website:
zippedFilename <- 'data/stormData.csv.bz2'
dir.create('data', showWarnings = FALSE)
download.file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2', destfile = zippedFilename)
(Date downloaded: 22 November 2015)
… and read the data into R:
#small function for reading the data file, which will be used here as well as in the appendix
readDatafile <- function(){
stormData <- read.csv(zippedFilename)
stormData$BGN_DATE <- as.POSIXct(stormData$BGN_DATE, format = "%m/%d/%Y")
stormData$Event.Type <- stormData$EVTYPE #ne-named for easier reading
stormData
}
stormData <- readDatafile()
I’ve decided that it is most straight-forward to only use data from 2007 to present, as these data are much cleaner than prior data. See the appendix for further justification of this.
stormData <-filter(stormData, BGN_DATE >= ymd("2007-01-01"))
stormData <- droplevels(stormData)
For value of property and crop damage, values are stored as dollar-multiplier pairs, which are stored in separate columns. The multipliers are:
levels(stormData$PROPDMGEXP)
## [1] "0" "B" "K" "M"
where K = thousands, M = millions, B = billions and ‘0’ corresponds to a single event where damages were recorded as zero.
For assessing economic consequences, create a ‘Total.Damages’ column, where these multipliers have been applied, and crop and property damages have been combined.
multiplyDamages <- function(value, multiplier){
switch(as.character(multiplier),
K = value * 1e3,
M = value * 1e6,
B = value * 1e9,
'0' = value * 0
)
}
stormData <-
stormData %>%
rowwise() %>%
mutate(Total.Damages = multiplyDamages(PROPDMG, PROPDMGEXP)
+ multiplyDamages(CROPDMG, CROPDMGEXP))
Now the data are clean and in a convenient format, produce a summary table of the statistics of interest for each type of event. It is from this table we will draw our results.
In this table, compute the total damage, fatalities and injuries for each event type:
impactSummary <-
stormData %>%
ungroup() %>%
group_by(Event.Type) %>%
summarise(Total.Damages = sum(Total.Damages),
Fatalities = sum(FATALITIES),
Injuries = sum(INJURIES)) %>%
arrange(desc(Total.Damages))
An example excerpt from the generated summary table:
kable(head(impactSummary))
| Event.Type | Total.Damages | Fatalities | Injuries |
|---|---|---|---|
| FLOOD | 16855305800 | 160 | 171 |
| TORNADO | 14732283740 | 863 | 9608 |
| HAIL | 6967790600 | 2 | 180 |
| FLASH FLOOD | 5752594130 | 293 | 316 |
| STORM SURGE/TIDE | 4641493000 | 11 | 5 |
| THUNDERSTORM WIND | 3771557690 | 130 | 1391 |
This section addresses the specific questions put forth for this assignment.
This question is answered by investigating total fatalities and injuries for each event type.
The following plot shows total fatalities for each event type over the selected period (from 2007 up to 2012):
qplot(
data = mutate(impactSummary, Event.Type = reorder(Event.Type, desc(Fatalities))),
x = Event.Type,
y = Fatalities,
geom = 'bar',
stat = 'identity',
main = 'Total number of fatalities aggregated over all event types, 2007-2012',
xlab = 'Event Type',
ylab = 'Number of fatalities'
) + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
This plot shows tornadoes to be by far the largest cause of fatalities in the present dataset.
The dataset also provides figures for number of injuries for each event. The following plot shows the relationship between fatalities and injuries for the six event types with the highest fatality counts:
qplot(
data = impactSummary %>% arrange(desc(Fatalities)) %>% slice(1:6),
x = Fatalities,
y = Injuries,
shape = Event.Type,
color = Event.Type,
size = I(5),
main = 'Number of fatalities vs. number of injuries for six event\n types with highest fatality rates, 2007-2012',
xlab = 'Number of fatalities',
ylab = 'Number of injuries'
) + theme(legend.title=element_blank())
As the plot indicates, the relationship between injuries and fatalities is not strictly linear for these event types. For instance, flash flooding has a relatively high fatality rate compared to injury rate, while lightning is the opposite.
Which event type can be said to have the worst effect on public health depends partly a given fatality/injury ‘exchange rate’, or how much significance is placed on a single injury with respect to a single fatality. Determining this rate is outside of the scope of this analysis.
That said, it is clear to see that in the given dataset tornadoes have the greatest impact on public health, as they have contributed to far more injuries and fatalities than any other type of event.
For reference, here are the 5 largest causes of fatalities:
kable(
impactSummary %>%
arrange(desc(Fatalities)) %>%
select(Event.Type, Fatalities) %>%
slice(1:5)
)
| Event.Type | Fatalities |
|---|---|
| TORNADO | 863 |
| FLASH FLOOD | 293 |
| RIP CURRENT | 207 |
| HEAT | 182 |
| FLOOD | 160 |
And here are the 5 largest causes of injuries:
kable(
impactSummary %>%
arrange(desc(Injuries)) %>%
select(Event.Type, Injuries) %>%
slice(1:5)
)
| Event.Type | Injuries |
|---|---|
| TORNADO | 9608 |
| THUNDERSTORM WIND | 1391 |
| LIGHTNING | 923 |
| EXCESSIVE HEAT | 880 |
| HEAT | 702 |
This question is answered by investigating the sum of all property and crop damages for each event type.
The following plot shows the total of all property and crop damages over the selected period (from 2007 up to 2012).
qplot(
data = mutate(impactSummary, Event.Type = reorder(Event.Type, desc(Total.Damages))),
x = Event.Type,
y = Total.Damages,
geom = 'bar',
stat = 'identity',
main = 'Total damages (dollars) aggregated over all event types, 2007-2012',
xlab = 'Event Type',
ylab = 'Total Damages (dollars)'
) + theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5))
From this we can see that a fairly small number of event types account for a large majority of total economic damages. For instance, the top three event types contribute greater aggregate damages than all other event types combined:
with(impactSummary, sum(Total.Damages[1:3]) > sum(Total.Damages[-(1:3)]))
## [1] TRUE
These top three event types, which are responsible for the majority of economic damages in the dataset, are shown below:
kable(impactSummary[1:3,c('Event.Type', "Total.Damages")])
| Event.Type | Total.Damages |
|---|---|
| FLOOD | 16855305800 |
| TORNADO | 14732283740 |
| HAIL | 6967790600 |
According to http://www.ncdc.noaa.gov/stormevents/details.jsp, only a limited number of event types were recorded prior to 1996. Hence it seems that to avoid over-representing certain event types, only data from 1996 onward should be used for this analysis.
Investigating further, and in light of the discussion presented in The History of the Storm Events Database, I had a look into how event codes have been assigned throughout the dataset. The aforementioned document states that the cleanest available data is from 2006 to the present. The following steps investigate this:
Here I create a function which, for a given year and dataset field, will tell me how many unique values there are for that field, from the given year to the present. Specifically I’m looking to see if there is a proliferation of event codes as we head backwards in time.
stormData <- readDatafile()
categoriesSinceYear <- function(category, givenYear) {
subsetSinceYear <-
stormData %>%
filter(BGN_DATE >= ymd(paste0(givenYear,"-01-01")))
length(unique(subsetSinceYear[,category]))
}
This function can then be used to look at how many unique event codes exist if we subset the data from a given year onwards:
nEvTypes <- sapply(1990:2010, function(x) {categoriesSinceYear('Event.Type', x)})
data.frame(year = 1990:2010, nEvTypes = nEvTypes)
## year nEvTypes
## 1 1990 985
## 2 1991 985
## 3 1992 985
## 4 1993 985
## 5 1994 930
## 6 1995 799
## 7 1996 515
## 8 1997 357
## 9 1998 278
## 10 1999 236
## 11 2000 196
## 12 2001 169
## 13 2002 119
## 14 2003 70
## 15 2004 57
## 16 2005 57
## 17 2006 54
## 18 2007 48
## 19 2008 48
## 20 2009 47
## 21 2010 46
From this you can see that if we include all data from 1990 to today, our dataset would have almost a thousand different event codes. These codes are hard to make sense of, because of the potential for overlap and duplication (e.g. there are codes for both ‘STRONG WIND’ and ‘HIGH WIND’) Keep in mind also that there are only 48 event types officially defined in NWS Directive 10-1605.
On the other hand, if we subset our data from 2006 onwards, we will have a much tidier 54 different codes, or if we subset from 2007 we have the official number of 48 different events.
In light of this, I’ve chosen to subset from 2007 onwards, with the justification that this leaves a much tidier dataset to work with. Note that this comes with limitations, for instance I am less likely to have accounted for more rare, high-impact weather events. Notably, hurricane Katrina is not included in the subset of data used for this analysis.
This presentation was written in R markdown with knitr. The source is available in my github repository.