Synopsis

This report is an analysis of this dataset of storm and weather events, published by the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, in order to identify the most significant factors in terms of health and economic impact.

The weather event causing most fatalities in the US between 1995 and 2011 is Excessive Heat, closely followed by Tornado. Tornados cause by far the most injuries of any weather event.

In terms of economic cost, Flood has by far the most signficant impact, then Hurricane/Typhoon, followed by Storm Surge and Tornado.

The fine granularity of weather event categories could have an impact on the results. Further analysis of this is recommended.


Data Processing

Sourcing the Data

filename <- "StormData.csv.bz2"
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"

if (!file.exists(filename)) {
    download.file(url=url, destfile = filename, method="curl")
}
raw.data <- read.csv(
    filename, header=TRUE, na.strings=c(""))

First peek at the data

Having downloaded the data and read the file into variable raw.data, let’s look at its structure:

dim(raw.data)
## [1] 902297     37
str(raw.data)
## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : Factor w/ 16335 levels "1/1/1966 0:00:00",..: 6523 6523 4242 11116 2224 2224 2260 383 3980 3980 ...
##  $ BGN_TIME  : Factor w/ 3608 levels "00:00:00 AM",..: 272 287 2705 1683 2584 3186 242 1683 3186 3186 ...
##  $ TIME_ZONE : Factor w/ 22 levels "ADT","AKS","AST",..: 7 7 7 7 7 7 7 7 7 7 ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: Factor w/ 29600 levels "5NM E OF MACKINAC BRIDGE TO PRESQUE ISLE LT MI",..: 13512 1872 4597 10591 4371 10093 1972 23872 24417 4597 ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 834 834 834 834 834 834 834 834 834 834 ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : Factor w/ 34 levels "  N"," NW","E",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ BGN_LOCATI: Factor w/ 54428 levels " Christiansburg",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_DATE  : Factor w/ 6662 levels "1/1/1993 0:00:00",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_TIME  : Factor w/ 3646 levels " 0900CST"," 200CST",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : Factor w/ 23 levels "E","ENE","ESE",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ END_LOCATI: Factor w/ 34505 levels " CANTON"," TULIA",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 18 levels "-","?","+","0",..: 16 16 16 16 16 16 16 16 16 16 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 8 levels "?","0","2","B",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ WFO       : Factor w/ 541 levels " CI","%SD","$AC",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ STATEOFFIC: Factor w/ 249 levels "ALABAMA, Central",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ ZONENAMES : Factor w/ 25111 levels "                                                                                                                               "| __truncated__,..: NA NA NA NA NA NA NA NA NA NA ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : Factor w/ 436780 levels "\t","\t\t","\t\t\t\t",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

In summary, this dataset contains 902297 observations of 37 variables.

How complete is the data? Let’s look for NAs:

nas <- c()
for (i in 1:37) nas[i] <- sum(is.na(raw.data[,i]))
names(nas) <- names(raw.data)
print(nas[nas>0])
## COUNTYNAME    BGN_AZI BGN_LOCATI   END_DATE   END_TIME COUNTYENDN 
##       1589     547332     287743     243411     238978     902297 
##    END_AZI END_LOCATI          F PROPDMGEXP CROPDMGEXP        WFO 
##     724837     499225     843563     465934     618413     142069 
## STATEOFFIC  ZONENAMES   LATITUDE LATITUDE_E    REMARKS 
##     248769     594029         47         40     287433

There are quite a lot of NAs but not in variables I’m interested in for this report. I can handle Property/Crop damage exponents (PROPDMGEXP and CROPDMGEXP) by treating the associated damage number as an absolute value (i.e. no exponent).

Data Formatting

Firstly, let’s remove columns we’re not interested in:

dtc <- raw.data[,
                -c(1,3,4,5,6,9,10,11,12,13,14,15,16,17,18,19,20,21,22,
                   29,30,31,32,33,34,35,36,37)]

The BGN_DATE variable is a factor, so this should be formatted as a Date; we can then create a new variable containing just the year:

date_format <- "%m/%d/%Y %H:%M:%S"
dtc[,"BGN_DATE"] <- as.Date(dtc[,"BGN_DATE"], date_format)
dtc$year <- as.POSIXlt(dtc[,"BGN_DATE"])$year+1900

The EVTYPE variable contains unnecessary whitespace, let’s remove it:

library(stringr)
dtc$EVTYPE <- factor(str_trim(dtc$EVTYPE))

Property and Crop damage values come in a pair of columns, containing a number and optional exponent component, ‘K’ (thousands), ‘M’ (millions) or ‘B’ (billions), respectively. Assessing property damage requires turning the numerical and exponential variables into a purely numerical value that can be used arithmetically. However, the values in the PROPDMGEXP and CROPDMGEXP variables are not limited to the expected values:

table(dtc$PROPDMGEXP)
## 
##      -      ?      +      0      1      2      3      4      5      6 
##      1      8      5    216     25     13      4      4     28      4 
##      7      8      B      h      H      K      m      M 
##      5      1     40      1      6 424665      7  11330
table(dtc$CROPDMGEXP)
## 
##      ?      0      2      B      k      K      m      M 
##      7     19      1      9     21 281832      1   1994
exp <- c("k", "K", "m", "M", "b", "B", "", NA)

table(dtc$CROPDMGEXP %in% exp)
## 
##  FALSE   TRUE 
##     27 902270
table(dtc$PROPDMGEXP %in% exp)
## 
##  FALSE   TRUE 
##    321 901976

The majority of the exponent variables contain expected values, but some don’t. I’ll remove these and then use a helper function (makenum) to create new columns (CROPDMGNUM and PROPDMGNUM) with absolute numbers from the raw number and exponent variables (i.e ‘1’, ‘B’ becomes ‘1,000,000,000’):

dtc <- dtc[dtc$PROPDMGEXP %in% exp &
           dtc$CROPDMGEXP %in% exp,]
dtc$PROPDMGEXP <- factor(dtc$PROPDMGEXP)
dtc$CROPDMGEXP <- factor(dtc$CROPDMGEXP)

source("functions.R")
dtc$PROPDMGNUM <- mapply(makenum, dtc$PROPDMG, dtc$PROPDMGEXP)
dtc$CROPDMGNUM <- mapply(makenum, dtc$CROPDMG, dtc$CROPDMGEXP)

The source of functions.R can be viewed here.

The processed data now look like this:

str(dtc)
## 'data.frame':    901949 obs. of  12 variables:
##  $ BGN_DATE  : Date, format: "1950-04-18" "1950-04-18" ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 977 levels "?","ABNORMAL WARMTH",..: 826 826 826 826 826 826 826 826 826 826 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: Factor w/ 4 levels "B","K","m","M": 2 2 2 2 2 2 2 2 2 2 ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 5 levels "B","k","K","m",..: NA NA NA NA NA NA NA NA NA NA ...
##  $ year      : num  1950 1950 1951 1951 1951 ...
##  $ PROPDMGNUM: num  25000 2500 25000 2500 2500 2500 2500 2500 25000 25000 ...
##  $ CROPDMGNUM: num  0 0 0 0 0 0 0 0 0 0 ...
head(dtc)
##     BGN_DATE STATE  EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 1 1950-04-18    AL TORNADO          0       15    25.0          K       0
## 2 1950-04-18    AL TORNADO          0        0     2.5          K       0
## 3 1951-02-20    AL TORNADO          0        2    25.0          K       0
## 4 1951-06-08    AL TORNADO          0        2     2.5          K       0
## 5 1951-11-15    AL TORNADO          0        2     2.5          K       0
## 6 1951-11-15    AL TORNADO          0        6     2.5          K       0
##   CROPDMGEXP year PROPDMGNUM CROPDMGNUM
## 1       <NA> 1950      25000          0
## 2       <NA> 1950       2500          0
## 3       <NA> 1951      25000          0
## 4       <NA> 1951       2500          0
## 5       <NA> 1951       2500          0
## 6       <NA> 1951       2500          0

Initial Questions

The following questions were addressed in order to get some familiarity with the data.

  1. How many events were recorded per year?
yd <- as.data.frame(tapply(dtc$year, dtc$year, length))
yd$year <- row.names(yd)
row.names(yd) <- NULL
names(yd) <- c("count", "year")

library(ggplot2)
library(grid)
ye <- ggplot(yd, aes(x=year, y=count))
ye <- ye + geom_bar(stat="identity")
ye <- ye + xlab("Year") + ylab("Observations")
ye <- ye + ggtitle("Figure 1: Total Observations per Year")
ye <- ye + theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(ye)

print(min(dtc$year)); nrow(dtc[dtc$year==min(dtc$year),])
## [1] 1950
## [1] 223
print(max(dtc$year)); nrow(dtc[dtc$year==max(dtc$year),])
## [1] 2011
## [1] 62173

This shows that there were very few events (223) recorded in the earliest year (1950) contained in the dataset, rising slowly during the 70s, 80s and early 90s before suddenly increasing in the mid 90s and then rising more rapidly to a peak of 62173 in 2011

Given the low rate of data collection in the earlier years and the increased relevance (and presumably lower sampling error rate) of later data, I shall limit this analysis to data recorded between 1995 and 2011 (inclusive).

dtc <- dtc[dtc$year >=1995,]
  1. What are the most frequently occurring event types
by.evtype <- tapply(dtc$year, dtc$EVTYPE, length)
by.evtype <- as.data.frame(by.evtype[order(by.evtype, decreasing=TRUE)])
names(by.evtype) <- "count"
head(by.evtype)
##                    count
## HAIL              215898
## TSTM WIND         128927
## THUNDERSTORM WIND  81743
## FLASH FLOOD        52667
## FLOOD              24640
## TORNADO            24320

The most frequently occuring event over the period is Hail, with nearly ten times as many recorded events as Flood and Tornado.


Results

Most Harmful Weather Event

Here I group injuries and fatalities data by event type, summing up the number of each per event type. I then display the top ten of each in a panel plot.

byinjuries <- as.data.frame(tapply(dtc$INJURIES, dtc$EVTYPE, sum))
byinjuries$EVTYPE<-row.names(byinjuries)
byinjuries <- byinjuries[order(byinjuries[1], decreasing=TRUE),]
names(byinjuries) <- c("injuries", "evtype")
row.names(byinjuries) <- NULL

byfatalities <- as.data.frame(tapply(dtc$FATALITIES, dtc$EVTYPE, sum))
byfatalities$EVTYPE<-row.names(byfatalities)
byfatalities <- byfatalities[order(byfatalities[1], decreasing=TRUE),]
names(byfatalities) <- c("fatalities", "evtype")
row.names(byfatalities) <- NULL

library(ggplot2)
library(gridExtra)
gf <- ggplot(byfatalities[1:10,], aes(x=evtype, y=fatalities))
gf <- gf + geom_bar(stat="identity")
gf <- gf + xlab("") + ylab("Fatalities")
gf <- gf + theme(axis.text.x = element_text(angle = 45, hjust = 1))

gi <- ggplot(byinjuries[1:10,], aes(x=evtype, y=injuries))
gi <- gi + geom_bar(stat="identity")
gi <- gi + xlab("Weather Event Type") + ylab("Injuries")
gi <- gi + theme(axis.text.x = element_text(angle = 45, hjust = 1))

grid.arrange(gf, gi, nrow=2,
             main="Figure 2: Health impact of US weather events (1995-2011)")

This figure plots the top 10 causes of fatalities and injuries, with the total number of each in the period 1995-2011. It shows that Excessive Heat causes the most fatalities, followed by Tornado. Tornados cause the most injuries by a significant margin.


Most Economically Damaging Weather Event

Create a new variable TOTALDMG which is the sum of the CROPDMGNUM and PROPDMGNUM variables. Then group the data by event type, recording the sum of TOTALDMG of all observations of that weather event type.

dtc$TOTALDMG <- dtc$CROPDMGNUM + dtc$PROPDMGNUM
bydmg <- as.data.frame(tapply(dtc$TOTALDMG, dtc$EVTYPE, sum))
bydmg$EVTYPE<-row.names(bydmg)
bydmg <- bydmg[order(bydmg[1], decreasing=TRUE),]
names(bydmg) <- c("cost", "evtype")

bn <- 10^9
scale <- signif(max(bydmg$cost, na.rm=TRUE), 1) / bn

library(ggplot2)
gc <- ggplot(bydmg[1:10,], aes(x=factor(evtype), y=cost))
gc <- gc + geom_bar(stat="identity")
gc <- gc + xlab("Event Type") + ylab("Total Cost ($)")
gc <- gc + ggtitle("Figure 3: Total cost of US weather events (1995-2011)")
gc <- gc + scale_y_continuous(
    breaks=bn * seq(0,150, by=25),
    labels=paste0(seq(0,150, by=25), "bn"),
    limits=bn * c(0,155)
)
gc <- gc + theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(gc)

This figure shows the total cost of the top ten weather event types. It shows that the most costly type of weather event is Flood, followed by Hurricane/Typhoon, then Storm Surge and Tornado.

The results indicate that the granularity of weather event categorisation could affect the stated impact of certain types of weather event. For instance, perhaps the results would be different if High Wind, Hurricane, Hurricane/Typhoon, Tornado & Tropical Storm were all treated as the same type of event.


Appendix 1 - Session Info

This report was produced on the following hardware setup:

Attribute Detail
OS OS X Yosemite 10.10.1
Processor 3.4 GHz Intel Core i7
Memory 16GB 1333 MHz DDR3
Graphics AMD Radeon HD 6970M 2048 MB

The R environment is as follows:

sessionInfo()
## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] gridExtra_0.9.1 ggplot2_1.0.0   stringr_0.6.2   knitr_1.8      
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5   formatR_1.0     
##  [5] gtable_0.1.2     htmltools_0.2.6  labeling_0.3     lattice_0.20-29 
##  [9] MASS_7.3-35      munsell_0.4.2    plyr_1.8.1       proto_0.3-10    
## [13] Rcpp_0.11.3      reshape2_1.4     rmarkdown_0.3.10 scales_0.2.4    
## [17] tools_3.1.2      yaml_2.1.13

R Studio version used is:

Version 0.98.1091 – © 2009-2014 RStudio, Inc.
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/600.2.5 (KHTML, like Gecko)