Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

Using the “Storm Data” published by the National Oceanic and Atmospheric Administration (NOAA), this paper looks at the human and economic impact of extreme weather conditions.
We use fatality and injury counts to assess the human impact, and crop and property damage estimates to assess the economic impact.

Tornadoes have have the biggest impact on people by an order of magnitude, accounting for 65% of all injuries and 37% of fatalities. Excessive Heat and Flooding also account for a significant proportion of fatalities.

In terms of economic impact, FLOODS, HURRICANE/TYPHOON, TORNADO and STORM SURGE come top, accounting for more than 67% of all economic damage between them.

Data Processing

Load required libraries:

library(plyr)     # For 'arrange', to sort our data
library(reshape2) # For melting wide to tall data
library(ggplot2)  # For awsome plotting

Introduction

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site: Storm Data [47Mb]

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Downloading

Download and decompress using bash rather than R - it seems much faster! the R fragment below uses the alternate knitr engine by specifying engine='bash', which allows us to execute bash commands instead of R!

# If the bz2 file already exists, no need to wget it
[ -f StormData.csv.bz2 ] || wget -O StormData.csv.bz2 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
# If the csv file already exists, no need to bunzip it
[ -f StormData.csv ] || bunzip2 -k StormData.csv.bz2

Reading and initial profiling

The data has a particularly complex format, involving multi-line quoted strings with embeded self-escaping quotation marks. Having tried using data.table’s fread for increased performance, we found it was unable to cope with the complexity of the file format. The records towards the end of the file are significanlty more complex in this respect that the early ones.

When reading in the data frame, we have trimmed the columns down to just read in those we are interested in by reading in the first few rows to get the names and inferred types. Then change the types of the unwanted columns to NULL (meaning ignore) and force the two EXP columns to be factors, and finally read in the data with a reduced number of columns and an explicit number of rows.

Having read in the data, we parse out a date.begin from the text of DATE_BGN, we will not look at time granularity, not county level granulatiy in this investigation due to time constraints.

localUnbzippedFile <- "StormData.csv"
wanted.cols <- c('EVTYPE', 'FATALITIES', 'INJURIES', 'PROPDMG', 'PROPDMGEXP', 'CROPDMG', 'CROPDMGEXP', 'BGN_DATE', 'STATE')
data <- read.csv(localUnbzippedFile, nrows=100) 
# Pull the guessed set of classes to use as our colClasses in the full run
classes <- sapply(data, class)
classes[!(names(classes) %in% wanted.cols)] <- 'NULL'
classes[c('PROPDMGEXP', 'CROPDMGEXP')] <- 'factor'
data <- read.csv(localUnbzippedFile, nrows=902297, colClasses=classes)
data$date.begin <- as.Date(data$BGN_DATE, format='%m/%d/%Y')
print(object.size(data), units='auto')
## 53 Mb

Data Summarisation

We use the plyr library to summarise the entire data set by EVTYPE.

data.sums <- ddply(data, .(EVTYPE), numcolwise(sum, na.rm = TRUE))

Results

We will examine the economic impact, injuries and fatalities sepeartely.

There are 985 EVTYPEs present in the data. To deal with this, we will reclassify some oth the lower values into an “Other” collection so that results can be presented effectively. The cut off is chosen such that the “Other” category constains less than 1% of the impact.

Additionally, we refactor the EVTYPE factor so that its levels are in order of significance, and add a percentage column.

proc_summary <- function(df, measure) {
  # NB: df is already summarised by EVTYPE
  df <- arrange(df, desc(df[[measure]]))

  colTotal <- sum(df[[measure]], rm.na=TRUE)
  percent <- df[[measure]] / colTotal
  density <- cumsum(percent)
  cutOff <- which(density>0.99)[[1]]
  result <- rbind(
    cbind(EVTYPE='Other', numcolwise(sum, rm.na=TRUE)(df[-(1:cutOff),])),
    df[1:cutOff,]
  )
  # Re-sort to get 'Other' into the correct location
  result <- arrange(result, desc(result[[measure]]))
  # Re-factor EVTYPE so that most significan item appear first in plots
  result$EVTYPE <- factor(as.character(result$EVTYPE),levels=as.character(result$EVTYPE));
  result$Percent <- paste0(formatC(100 * result[[measure]] / colTotal, digits=3),'%')
  result
}

Economic Impacts

data.damage <- proc_summary(data.sums[,c(1,4:6)], 'damage.total')
# Put some nicer column names on
colnames(data.damage)[2:4] <- c('Crops', 'Property', 'Total')
# Take a look at the top few records
head(data.damage,n=20)
##               EVTYPE     Crops  Property     Total Percent
## 1              FLOOD 5.662e+09 1.447e+11 1.503e+11   31.5%
## 2  HURRICANE/TYPHOON 2.608e+09 6.931e+10 7.191e+10   15.1%
## 3            TORNADO 4.150e+08 5.695e+10 5.736e+10     12%
## 4        STORM SURGE 5.000e+03 4.332e+10 4.332e+10   9.08%
## 5               HAIL 3.026e+09 1.574e+10 1.876e+10   3.93%
## 6        FLASH FLOOD 1.421e+09 1.682e+10 1.824e+10   3.82%
## 7            DROUGHT 1.397e+10 1.046e+09 1.502e+10   3.15%
## 8          HURRICANE 2.742e+09 1.187e+10 1.461e+10   3.06%
## 9        RIVER FLOOD 5.029e+09 5.119e+09 1.015e+10   2.13%
## 10         ICE STORM 5.022e+09 3.945e+09 8.967e+09   1.88%
## 11    TROPICAL STORM 6.783e+08 7.704e+09 8.382e+09   1.76%
## 12      WINTER STORM 2.694e+07 6.688e+09 6.715e+09   1.41%
## 13         HIGH WIND 6.386e+08 5.270e+09 5.909e+09   1.24%
## 14          WILDFIRE 2.955e+08 4.765e+09 5.061e+09   1.06%
## 15         TSTM WIND 5.540e+08 4.485e+09 5.039e+09   1.06%
## 16  STORM SURGE/TIDE 8.500e+05 4.641e+09 4.642e+09  0.973%
## 17             Other 1.361e+09 3.274e+09 4.635e+09  0.971%
## 18 THUNDERSTORM WIND 4.148e+08 3.483e+09 3.898e+09  0.817%
## 19    HURRICANE OPAL 1.900e+07 3.173e+09 3.192e+09  0.669%
## 20  WILD/FOREST FIRE 1.068e+08 3.002e+09 3.109e+09  0.651%
# And finally plot
data.damage.tall <- melt(data.damage[,1:4], id='EVTYPE') # Melt to produce multiple data series for plotting
names(data.damage.tall)[2:3] <- c('Damage Type', 'Cost') # Nicer names
qplot(data=data.damage.tall, x=EVTYPE, y=Cost, colour=`Damage Type`,
      main="Total Costs of Weather Events\n1950-2011",
      ylab="Cost in US$") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_log10(limits=10^c(0,12),breaks=10^c(3,6,9,12),labels=c('$1k','$1M', '$1B', '$1T')) +
  geom_point(size = 5, alpha=I(0.4))

plot of chunk economic

Human Impacts

Fatalities

fatalities <- proc_summary(data.sums[,1:2], 'FATALITIES')
head(fatalities,n=20)
##                     EVTYPE FATALITIES Percent
## 1                  TORNADO       5633   37.2%
## 2           EXCESSIVE HEAT       1903   12.6%
## 3              FLASH FLOOD        978   6.46%
## 4                     HEAT        937   6.19%
## 5                LIGHTNING        816   5.39%
## 6                TSTM WIND        504   3.33%
## 7                    FLOOD        470    3.1%
## 8              RIP CURRENT        368   2.43%
## 9                HIGH WIND        248   1.64%
## 10               AVALANCHE        224   1.48%
## 11            WINTER STORM        206   1.36%
## 12            RIP CURRENTS        204   1.35%
## 13               HEAT WAVE        172   1.14%
## 14            EXTREME COLD        160   1.06%
## 15                   Other        151  0.997%
## 16       THUNDERSTORM WIND        133  0.878%
## 17              HEAVY SNOW        127  0.839%
## 18 EXTREME COLD/WIND CHILL        125  0.825%
## 19             STRONG WIND        103   0.68%
## 20                BLIZZARD        101  0.667%

In the plot below, we deliberately leave off the most significant cause of death so as to allow better separation of other items in the plot.

qplot(data=fatalities[2:20,], x=EVTYPE, y=FATALITIES,
      main="Top 20 Weather conditions fatalities in 1950-2011\n(With top item removed for better scaling)") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylim(c(0,NA))

plot of chunk damage.human.fatalities.plot

Injuries

injuries <- proc_summary(data.sums[,c(1,3)], 'INJURIES')
head(injuries,n=20)
##                EVTYPE INJURIES Percent
## 1             TORNADO    91346     65%
## 2           TSTM WIND     6957   4.95%
## 3               FLOOD     6789   4.83%
## 4      EXCESSIVE HEAT     6525   4.64%
## 5           LIGHTNING     5230   3.72%
## 6                HEAT     2100   1.49%
## 7           ICE STORM     1975   1.41%
## 8         FLASH FLOOD     1777   1.26%
## 9   THUNDERSTORM WIND     1488   1.06%
## 10              Other     1384  0.985%
## 11               HAIL     1361  0.968%
## 12       WINTER STORM     1321   0.94%
## 13  HURRICANE/TYPHOON     1275  0.907%
## 14          HIGH WIND     1137  0.809%
## 15         HEAVY SNOW     1021  0.727%
## 16           WILDFIRE      911  0.648%
## 17 THUNDERSTORM WINDS      908  0.646%
## 18           BLIZZARD      805  0.573%
## 19                FOG      734  0.522%
## 20   WILD/FOREST FIRE      545  0.388%

Again, in the plot below, we deliberately leave off the most significant cause of injuries so as to allow better separation of other items in the plot.

qplot(data=injuries[2:20,], x=EVTYPE, y=INJURIES, 
      main="Top 20 Weather conditions causing injuries in 1950-2011\n(With top item removed for better scaling)") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylim(c(0,NA))

plot of chunk damage.human.injuries.plot

Further improvement

First, colapsing the less that 1% into ‘other’ could be improved upon by examining the definitive set of EVTYPEs per the documentation and then using the edit-distance string metric to reclassify other EVTYPEs.

Second, the economic costs should really be adjusted for inflation, given the long tmie period over which the data was collected.

Appendix A - Operating Environment

The analysis was carried out under the following R environment.

sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8     LC_MONETARY=en_AU.UTF-8   
##  [6] LC_MESSAGES=en_AU.UTF-8    LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_1.0.0 reshape2_1.4  plyr_1.8.1   
## 
## loaded via a namespace (and not attached):
##  [1] codetools_0.2-9  colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5   formatR_1.0      grid_3.1.1       gtable_0.1.2     htmltools_0.2.6  knitr_1.6       
## [10] labeling_0.3     MASS_7.3-35      munsell_0.4.2    proto_0.3-10     Rcpp_0.11.3      rmarkdown_0.3.3  scales_0.2.4     stringr_0.6.2    tools_3.1.1     
## [19] yaml_2.1.13