Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

Using the “Storm Data” published by the National Oceanic and Atmospheric Administration (NOAA), this paper looks at the human and economic impact of extreme weather conditions.
We use fatality and injury counts to assess the human impact, and crop and property damage estimates to assess the economic impact.

Tornadoes have have the biggest impact on people by an order of magnitude, accounting for 65% of all injuries and 37% of fatalities. Excessive Heat and Flooding also account for a significant proportion of fatalities.

In terms of economic impact, FLOODS, HURRICANE/TYPHOON, TORNADO and STORM SURGE come top, accounting for more than 67% of all economic damage between them.

Data Processing

Load required libraries:

library(plyr)     # For 'arrange', to sort our data
library(reshape2) # For melting wide to tall data
library(ggplot2)  # For awsome plotting

Introduction

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site: Storm Data [47Mb]

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

Downloading

Download and decompress using bash rather than R - it seems much faster! the R fragment below uses the alternate knitr engine by specifying engine='bash', which allows us to execute bash commands instead of R!

# If the bz2 file already exists, no need to wget it
[ -f StormData.csv.bz2 ] || wget -O StormData.csv.bz2 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
# If the csv file already exists, no need to bunzip it
[ -f StormData.csv ] || bunzip2 -k StormData.csv.bz2

Reading and initial profiling

The data has a particularly complex format, involving multi-line quoted strings with embeded self-escaping quotation marks. Having tried using data.table’s fread for increased performance, we found it was unable to cope with the complexity of the file format. The records towards the end of the file are significanlty more complex in this respect that the early ones.

When reading in the data frame, we have trimmed the columns down to just read in those we are interested in by reading in the first few rows to get the names and inferred types. Then change the types of the unwanted columns to NULL (meaning ignore) and force the two EXP columns to be factors, and finally read in the data with a reduced number of columns and an explicit number of rows.

Having read in the data, we parse out a date.begin from the text of DATE_BGN, we will not look at time granularity, not county level granulatiy in this investigation due to time constraints.

localUnbzippedFile <- "StormData.csv"
wanted.cols <- c('EVTYPE', 'FATALITIES', 'INJURIES', 'PROPDMG', 'PROPDMGEXP', 'CROPDMG', 'CROPDMGEXP', 'BGN_DATE', 'STATE')
data <- read.csv(localUnbzippedFile, nrows=100) 
# Pull the guessed set of classes to use as our colClasses in the full run
classes <- sapply(data, class)
classes[!(names(classes) %in% wanted.cols)] <- 'NULL'
classes[c('PROPDMGEXP', 'CROPDMGEXP')] <- 'factor'
data <- read.csv(localUnbzippedFile, nrows=902297, colClasses=classes)
data$date.begin <- as.Date(data$BGN_DATE, format='%m/%d/%Y')
print(object.size(data), units='auto')

## 53 Mb

Cleaning Damange Related Columns

Economic consequences are inferred from the financial damage estimates captured in the data, separated into “Property Damage”" and “Crop Damage”. (See §2.7, pp.12-13 of Storm Data Documentation)

The damage related figures are spread over four fields in the input data. Two for Crop Damange and two for Property Damage. In both cases, the significant figures have been seperated from a magnitude multiplier (“Exponent”, denoted with an EXP suffix in the data) of “M”, “K”, “B”, etc.. The use of these two fields needs to be checked for consistency, and then recomposed back into values that can be used for computation.

Crop Damage

known.suffixes <- c('K','M','B','k','m','b')
# Look at the Exp field
crop.exp <- table(data$CROPDMGEXP)
crop.exp

## 
##             ?      0      2      B      k      K      m      M 
## 618413      7     19      1      9     21 281832      1   1994

# Nerveous about the blanks, '?', '2', and '0'; want to see what the individual summaries look like for these
for (exp in setdiff(names(crop.exp),known.suffixes)) {
  cat('\nFor CROPDMGEXP == "',exp,'"\n',sep='')
  print(table(data$CROPDMG[data$CROPDMGEXP==exp]))
}

## 
## For CROPDMGEXP == ""
## 
##      0      3      4 
## 618410      1      2 
## 
## For CROPDMGEXP == "?"
## 
## 0 
## 7 
## 
## For CROPDMGEXP == "0"
## 
##  0  5 20 25 50 60 
##  7  6  1  2  2  1 
## 
## For CROPDMGEXP == "2"
## 
## 0 
## 1

Around half-a million zero CROPDMG have a blank CROPDMGEXP, this is entirely reasonable. More worryingly is how to interptret the other CROPDMGs values when CROPDMGEXP is blank. Rather have these as incorrect values, we choose to exclude them from the analysis, and will have them mapped to NAs and later excluded.

Property Damage

# Look at the Exp field
prop.exp <- table(data$PROPDMGEXP)
prop.exp

## 
##             -      ?      +      0      1      2      3      4      5      6      7      8      B      h      H      K      m      M 
## 465934      1      8      5    216     25     13      4      4     28      4      5      1     40      1      6 424665      7  11330

# Nerveous about the blanks, '?', '-', and '+'; want to see what the individual summaries look like for these
for (exp in setdiff(names(prop.exp),c(known.suffixes,0:9))) {
  cat('\nFor PROPDMGEXP == "',exp,'"\n',sep='')
  print(table(data$PROPDMG[data$PROPDMGEXP==exp]))
}

## 
## For PROPDMGEXP == ""
## 
##      0   0.41      1      2      3      4      5      6      7      8      9     10     20     35     75 
## 465858      1      4      7     16      9     11      6      3      2      3      8      4      1      1 
## 
## For PROPDMGEXP == "-"
## 
## 15 
##  1 
## 
## For PROPDMGEXP == "?"
## 
## 0 
## 8 
## 
## For PROPDMGEXP == "+"
## 
##  2 15 20 60 
##  1  1  2  1 
## 
## For PROPDMGEXP == "h"
## 
## 2 
## 1 
## 
## For PROPDMGEXP == "H"
## 
## 2 3 5 
## 1 1 4

The property set has many instances where a numerical value is being used for the EXP field. It seems reasonable to think this is being used by some as a “scientific notation” and the intention is 10^EXP. H is a recognised SI prefix for “hecto” == 100, and while it is not in common usage, it seems appropriate to accept it in the data as being such.

A 0 in the significant field would seem to not require any exponent to be enterred, and an exponent of ? seems to be being used to mean not available / unknown. Given that this data pertains to extreme weather conditions and that, by definition, these are “rare” events, I do not wish to impute missing values, and would rather eliminate data points that can not be realisticly interpreted.

Overall exponent strategy

For the purposes of data cleaning, we accept anything with a zero significant as meaning “zero”, any of the SI prefixes or a numeric exponent for use as a multiplicand, and will convert any other combinations to NA’s, which we will later filter. Specifically, we are therefore filtering out “exponents” of “+”, and “-”.

exp.list <- 0:9
names(exp.list) <- 0:9
exp.list <- c(exp.list,h=2,H=2,k=3,K=3,m=6,M=6,b=9,B=9)
exp_clean <- function(singificand, exponent) {
  # Re-order based on factor labels
  exp.lookup <- exp.list[levels(exponent)]
  # Then lookup based on factor values
  result <- singificand * 10^exp.lookup[exponent]
  # If significand saw zero, we don't mind how 'bad' the exponent was
  result[singificand==0] <- 0
  result
}
data$damage.crops <- with(data=data, exp_clean(CROPDMG,CROPDMGEXP))
data$damage.property <- with(data=data, exp_clean(PROPDMG,PROPDMGEXP))
data$damage.total <- data$damage.crops + data$damage.property

# Drop the original columns for our "narrow" version
data[,c('PROPDMG', 'PROPDMGEXP', 'CROPDMG', 'CROPDMGEXP')] <- list(NULL)

# Check we have not too many na's (as a %-age)
100 * sum(is.na(data$damage.total)) / length(data$damage.total)

## [1] 0.00942

Data Summarisation

We use the plyr library to summarise the entire data set by EVTYPE.

data.sums <- ddply(data, .(EVTYPE), numcolwise(sum, na.rm = TRUE))

Results

We will examine the economic impact, injuries and fatalities sepeartely.

There are 985 EVTYPEs present in the data. To deal with this, we will reclassify some oth the lower values into an “Other” collection so that results can be presented effectively. The cut off is chosen such that the “Other” category constains less than 1% of the impact.

Additionally, we refactor the EVTYPE factor so that its levels are in order of significance, and add a percentage column.

proc_summary <- function(df, measure) {
  # NB: df is already summarised by EVTYPE
  df <- arrange(df, desc(df[[measure]]))

  colTotal <- sum(df[[measure]], rm.na=TRUE)
  percent <- df[[measure]] / colTotal
  density <- cumsum(percent)
  cutOff <- which(density>0.99)[[1]]
  result <- rbind(
    cbind(EVTYPE='Other', numcolwise(sum, rm.na=TRUE)(df[-(1:cutOff),])),
    df[1:cutOff,]
  )
  # Re-sort to get 'Other' into the correct location
  result <- arrange(result, desc(result[[measure]]))
  # Re-factor EVTYPE so that most significan item appear first in plots
  result$EVTYPE <- factor(as.character(result$EVTYPE),levels=as.character(result$EVTYPE));
  result$Percent <- paste0(formatC(100 * result[[measure]] / colTotal, digits=3),'%')
  result
}

Economic Impacts

data.damage <- proc_summary(data.sums[,c(1,4:6)], 'damage.total')
# Put some nicer column names on
colnames(data.damage)[2:4] <- c('Crops', 'Property', 'Total')
# Take a look at the top few records
head(data.damage,n=20)

##               EVTYPE     Crops  Property     Total Percent
## 1              FLOOD 5.662e+09 1.447e+11 1.503e+11   31.5%
## 2  HURRICANE/TYPHOON 2.608e+09 6.931e+10 7.191e+10   15.1%
## 3            TORNADO 4.150e+08 5.695e+10 5.736e+10     12%
## 4        STORM SURGE 5.000e+03 4.332e+10 4.332e+10   9.08%
## 5               HAIL 3.026e+09 1.574e+10 1.876e+10   3.93%
## 6        FLASH FLOOD 1.421e+09 1.682e+10 1.824e+10   3.82%
## 7            DROUGHT 1.397e+10 1.046e+09 1.502e+10   3.15%
## 8          HURRICANE 2.742e+09 1.187e+10 1.461e+10   3.06%
## 9        RIVER FLOOD 5.029e+09 5.119e+09 1.015e+10   2.13%
## 10         ICE STORM 5.022e+09 3.945e+09 8.967e+09   1.88%
## 11    TROPICAL STORM 6.783e+08 7.704e+09 8.382e+09   1.76%
## 12      WINTER STORM 2.694e+07 6.688e+09 6.715e+09   1.41%
## 13         HIGH WIND 6.386e+08 5.270e+09 5.909e+09   1.24%
## 14          WILDFIRE 2.955e+08 4.765e+09 5.061e+09   1.06%
## 15         TSTM WIND 5.540e+08 4.485e+09 5.039e+09   1.06%
## 16  STORM SURGE/TIDE 8.500e+05 4.641e+09 4.642e+09  0.973%
## 17             Other 1.361e+09 3.274e+09 4.635e+09  0.971%
## 18 THUNDERSTORM WIND 4.148e+08 3.483e+09 3.898e+09  0.817%
## 19    HURRICANE OPAL 1.900e+07 3.173e+09 3.192e+09  0.669%
## 20  WILD/FOREST FIRE 1.068e+08 3.002e+09 3.109e+09  0.651%

# And finally plot
data.damage.tall <- melt(data.damage[,1:4], id='EVTYPE') # Melt to produce multiple data series for plotting
names(data.damage.tall)[2:3] <- c('Damage Type', 'Cost') # Nicer names
qplot(data=data.damage.tall, x=EVTYPE, y=Cost, colour=`Damage Type`,
      main="Total Costs of Weather Events\n1950-2011",
      ylab="Cost in US$") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  scale_y_log10(limits=10^c(0,12),breaks=10^c(3,6,9,12),labels=c('$1k','$1M', '$1B', '$1T')) +
  geom_point(size = 5, alpha=I(0.4))

plot of chunk economic

Human Impacts

Fatalities

fatalities <- proc_summary(data.sums[,1:2], 'FATALITIES')
head(fatalities,n=20)

##                     EVTYPE FATALITIES Percent
## 1                  TORNADO       5633   37.2%
## 2           EXCESSIVE HEAT       1903   12.6%
## 3              FLASH FLOOD        978   6.46%
## 4                     HEAT        937   6.19%
## 5                LIGHTNING        816   5.39%
## 6                TSTM WIND        504   3.33%
## 7                    FLOOD        470    3.1%
## 8              RIP CURRENT        368   2.43%
## 9                HIGH WIND        248   1.64%
## 10               AVALANCHE        224   1.48%
## 11            WINTER STORM        206   1.36%
## 12            RIP CURRENTS        204   1.35%
## 13               HEAT WAVE        172   1.14%
## 14            EXTREME COLD        160   1.06%
## 15                   Other        151  0.997%
## 16       THUNDERSTORM WIND        133  0.878%
## 17              HEAVY SNOW        127  0.839%
## 18 EXTREME COLD/WIND CHILL        125  0.825%
## 19             STRONG WIND        103   0.68%
## 20                BLIZZARD        101  0.667%

In the plot below, we deliberately leave off the most significant cause of death so as to allow better separation of other items in the plot.

qplot(data=fatalities[2:20,], x=EVTYPE, y=FATALITIES,
      main="Top 20 Weather conditions fatalities in 1950-2011\n(With top item removed for better scaling)") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylim(c(0,NA))

plot of chunk damage.human.fatalities.plot

Injuries

injuries <- proc_summary(data.sums[,c(1,3)], 'INJURIES')
head(injuries,n=20)

##                EVTYPE INJURIES Percent
## 1             TORNADO    91346     65%
## 2           TSTM WIND     6957   4.95%
## 3               FLOOD     6789   4.83%
## 4      EXCESSIVE HEAT     6525   4.64%
## 5           LIGHTNING     5230   3.72%
## 6                HEAT     2100   1.49%
## 7           ICE STORM     1975   1.41%
## 8         FLASH FLOOD     1777   1.26%
## 9   THUNDERSTORM WIND     1488   1.06%
## 10              Other     1384  0.985%
## 11               HAIL     1361  0.968%
## 12       WINTER STORM     1321   0.94%
## 13  HURRICANE/TYPHOON     1275  0.907%
## 14          HIGH WIND     1137  0.809%
## 15         HEAVY SNOW     1021  0.727%
## 16           WILDFIRE      911  0.648%
## 17 THUNDERSTORM WINDS      908  0.646%
## 18           BLIZZARD      805  0.573%
## 19                FOG      734  0.522%
## 20   WILD/FOREST FIRE      545  0.388%

Again, in the plot below, we deliberately leave off the most significant cause of injuries so as to allow better separation of other items in the plot.

qplot(data=injuries[2:20,], x=EVTYPE, y=INJURIES, 
      main="Top 20 Weather conditions causing injuries in 1950-2011\n(With top item removed for better scaling)") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  ylim(c(0,NA))

plot of chunk damage.human.injuries.plot

Further improvement

First, colapsing the less that 1% into ‘other’ could be improved upon by examining the definitive set of EVTYPEs per the documentation and then using the edit-distance string metric to reclassify other EVTYPEs.

Second, the economic costs should really be adjusted for inflation, given the long tmie period over which the data was collected.

Appendix A - Operating Environment

The analysis was carried out under the following R environment.

sessionInfo()

## R version 3.1.1 (2014-07-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
## 
## locale:
##  [1] LC_CTYPE=en_AU.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8        LC_COLLATE=en_AU.UTF-8     LC_MONETARY=en_AU.UTF-8   
##  [6] LC_MESSAGES=en_AU.UTF-8    LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
## [1] ggplot2_1.0.0 reshape2_1.4  plyr_1.8.1   
## 
## loaded via a namespace (and not attached):
##  [1] codetools_0.2-9  colorspace_1.2-4 digest_0.6.4     evaluate_0.5.5   formatR_1.0      grid_3.1.1       gtable_0.1.2     htmltools_0.2.6  knitr_1.6       
## [10] labeling_0.3     MASS_7.3-35      munsell_0.4.2    proto_0.3-10     Rcpp_0.11.3      rmarkdown_0.3.3  scales_0.2.4     stringr_0.6.2    tools_3.1.1     
## [19] yaml_2.1.13

The Human and Economic Impacts of Extreme Weather

Derek Slone-Zhen

2014-10-27 08:35:04