Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
Using the “Storm Data” published by the National Oceanic and Atmospheric Administration (NOAA), this paper looks at the human and economic impact of extreme weather conditions.
We use fatality and injury counts to assess the human impact, and crop and property damage estimates to assess the economic impact.
Tornadoes have have the biggest impact on people by an order of magnitude, accounting for 65% of all injuries and 37% of fatalities. Excessive Heat and Flooding also account for a significant proportion of fatalities.
In terms of economic impact, FLOODS, HURRICANE/TYPHOON, TORNADO and STORM SURGE come top, accounting for more than 67% of all economic damage between them.
Load required libraries:
library(plyr) # For 'arrange', to sort our data
library(reshape2) # For melting wide to tall data
library(ggplot2) # For awsome plotting
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site: Storm Data [47Mb]
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events FAQ
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
Download and decompress using bash rather than R - it seems much faster! the R fragment below uses the alternate knitr engine by specifying engine='bash', which allows us to execute bash commands instead of R!
# If the bz2 file already exists, no need to wget it
[ -f StormData.csv.bz2 ] || wget -O StormData.csv.bz2 'https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2'
# If the csv file already exists, no need to bunzip it
[ -f StormData.csv ] || bunzip2 -k StormData.csv.bz2
The data has a particularly complex format, involving multi-line quoted strings with embeded self-escaping quotation marks. Having tried using data.table’s fread for increased performance, we found it was unable to cope with the complexity of the file format. The records towards the end of the file are significanlty more complex in this respect that the early ones.
When reading in the data frame, we have trimmed the columns down to just read in those we are interested in by reading in the first few rows to get the names and inferred types. Then change the types of the unwanted columns to NULL (meaning ignore) and force the two EXP columns to be factors, and finally read in the data with a reduced number of columns and an explicit number of rows.
Having read in the data, we parse out a date.begin from the text of DATE_BGN, we will not look at time granularity, not county level granulatiy in this investigation due to time constraints.
localUnbzippedFile <- "StormData.csv"
wanted.cols <- c('EVTYPE', 'FATALITIES', 'INJURIES', 'PROPDMG', 'PROPDMGEXP', 'CROPDMG', 'CROPDMGEXP', 'BGN_DATE', 'STATE')
data <- read.csv(localUnbzippedFile, nrows=100)
# Pull the guessed set of classes to use as our colClasses in the full run
classes <- sapply(data, class)
classes[!(names(classes) %in% wanted.cols)] <- 'NULL'
classes[c('PROPDMGEXP', 'CROPDMGEXP')] <- 'factor'
data <- read.csv(localUnbzippedFile, nrows=902297, colClasses=classes)
data$date.begin <- as.Date(data$BGN_DATE, format='%m/%d/%Y')
print(object.size(data), units='auto')
## 53 Mb
We use the plyr library to summarise the entire data set by EVTYPE.
data.sums <- ddply(data, .(EVTYPE), numcolwise(sum, na.rm = TRUE))
We will examine the economic impact, injuries and fatalities sepeartely.
There are 985 EVTYPEs present in the data. To deal with this, we will reclassify some oth the lower values into an “Other” collection so that results can be presented effectively. The cut off is chosen such that the “Other” category constains less than 1% of the impact.
Additionally, we refactor the EVTYPE factor so that its levels are in order of significance, and add a percentage column.
proc_summary <- function(df, measure) {
# NB: df is already summarised by EVTYPE
df <- arrange(df, desc(df[[measure]]))
colTotal <- sum(df[[measure]], rm.na=TRUE)
percent <- df[[measure]] / colTotal
density <- cumsum(percent)
cutOff <- which(density>0.99)[[1]]
result <- rbind(
cbind(EVTYPE='Other', numcolwise(sum, rm.na=TRUE)(df[-(1:cutOff),])),
df[1:cutOff,]
)
# Re-sort to get 'Other' into the correct location
result <- arrange(result, desc(result[[measure]]))
# Re-factor EVTYPE so that most significan item appear first in plots
result$EVTYPE <- factor(as.character(result$EVTYPE),levels=as.character(result$EVTYPE));
result$Percent <- paste0(formatC(100 * result[[measure]] / colTotal, digits=3),'%')
result
}
data.damage <- proc_summary(data.sums[,c(1,4:6)], 'damage.total')
# Put some nicer column names on
colnames(data.damage)[2:4] <- c('Crops', 'Property', 'Total')
# Take a look at the top few records
head(data.damage,n=20)
## EVTYPE Crops Property Total Percent
## 1 FLOOD 5.662e+09 1.447e+11 1.503e+11 31.5%
## 2 HURRICANE/TYPHOON 2.608e+09 6.931e+10 7.191e+10 15.1%
## 3 TORNADO 4.150e+08 5.695e+10 5.736e+10 12%
## 4 STORM SURGE 5.000e+03 4.332e+10 4.332e+10 9.08%
## 5 HAIL 3.026e+09 1.574e+10 1.876e+10 3.93%
## 6 FLASH FLOOD 1.421e+09 1.682e+10 1.824e+10 3.82%
## 7 DROUGHT 1.397e+10 1.046e+09 1.502e+10 3.15%
## 8 HURRICANE 2.742e+09 1.187e+10 1.461e+10 3.06%
## 9 RIVER FLOOD 5.029e+09 5.119e+09 1.015e+10 2.13%
## 10 ICE STORM 5.022e+09 3.945e+09 8.967e+09 1.88%
## 11 TROPICAL STORM 6.783e+08 7.704e+09 8.382e+09 1.76%
## 12 WINTER STORM 2.694e+07 6.688e+09 6.715e+09 1.41%
## 13 HIGH WIND 6.386e+08 5.270e+09 5.909e+09 1.24%
## 14 WILDFIRE 2.955e+08 4.765e+09 5.061e+09 1.06%
## 15 TSTM WIND 5.540e+08 4.485e+09 5.039e+09 1.06%
## 16 STORM SURGE/TIDE 8.500e+05 4.641e+09 4.642e+09 0.973%
## 17 Other 1.361e+09 3.274e+09 4.635e+09 0.971%
## 18 THUNDERSTORM WIND 4.148e+08 3.483e+09 3.898e+09 0.817%
## 19 HURRICANE OPAL 1.900e+07 3.173e+09 3.192e+09 0.669%
## 20 WILD/FOREST FIRE 1.068e+08 3.002e+09 3.109e+09 0.651%
# And finally plot
data.damage.tall <- melt(data.damage[,1:4], id='EVTYPE') # Melt to produce multiple data series for plotting
names(data.damage.tall)[2:3] <- c('Damage Type', 'Cost') # Nicer names
qplot(data=data.damage.tall, x=EVTYPE, y=Cost, colour=`Damage Type`,
main="Total Costs of Weather Events\n1950-2011",
ylab="Cost in US$") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
scale_y_log10(limits=10^c(0,12),breaks=10^c(3,6,9,12),labels=c('$1k','$1M', '$1B', '$1T')) +
geom_point(size = 5, alpha=I(0.4))
fatalities <- proc_summary(data.sums[,1:2], 'FATALITIES')
head(fatalities,n=20)
## EVTYPE FATALITIES Percent
## 1 TORNADO 5633 37.2%
## 2 EXCESSIVE HEAT 1903 12.6%
## 3 FLASH FLOOD 978 6.46%
## 4 HEAT 937 6.19%
## 5 LIGHTNING 816 5.39%
## 6 TSTM WIND 504 3.33%
## 7 FLOOD 470 3.1%
## 8 RIP CURRENT 368 2.43%
## 9 HIGH WIND 248 1.64%
## 10 AVALANCHE 224 1.48%
## 11 WINTER STORM 206 1.36%
## 12 RIP CURRENTS 204 1.35%
## 13 HEAT WAVE 172 1.14%
## 14 EXTREME COLD 160 1.06%
## 15 Other 151 0.997%
## 16 THUNDERSTORM WIND 133 0.878%
## 17 HEAVY SNOW 127 0.839%
## 18 EXTREME COLD/WIND CHILL 125 0.825%
## 19 STRONG WIND 103 0.68%
## 20 BLIZZARD 101 0.667%
In the plot below, we deliberately leave off the most significant cause of death so as to allow better separation of other items in the plot.
qplot(data=fatalities[2:20,], x=EVTYPE, y=FATALITIES,
main="Top 20 Weather conditions fatalities in 1950-2011\n(With top item removed for better scaling)") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylim(c(0,NA))
injuries <- proc_summary(data.sums[,c(1,3)], 'INJURIES')
head(injuries,n=20)
## EVTYPE INJURIES Percent
## 1 TORNADO 91346 65%
## 2 TSTM WIND 6957 4.95%
## 3 FLOOD 6789 4.83%
## 4 EXCESSIVE HEAT 6525 4.64%
## 5 LIGHTNING 5230 3.72%
## 6 HEAT 2100 1.49%
## 7 ICE STORM 1975 1.41%
## 8 FLASH FLOOD 1777 1.26%
## 9 THUNDERSTORM WIND 1488 1.06%
## 10 Other 1384 0.985%
## 11 HAIL 1361 0.968%
## 12 WINTER STORM 1321 0.94%
## 13 HURRICANE/TYPHOON 1275 0.907%
## 14 HIGH WIND 1137 0.809%
## 15 HEAVY SNOW 1021 0.727%
## 16 WILDFIRE 911 0.648%
## 17 THUNDERSTORM WINDS 908 0.646%
## 18 BLIZZARD 805 0.573%
## 19 FOG 734 0.522%
## 20 WILD/FOREST FIRE 545 0.388%
Again, in the plot below, we deliberately leave off the most significant cause of injuries so as to allow better separation of other items in the plot.
qplot(data=injuries[2:20,], x=EVTYPE, y=INJURIES,
main="Top 20 Weather conditions causing injuries in 1950-2011\n(With top item removed for better scaling)") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
ylim(c(0,NA))
First, colapsing the less that 1% into ‘other’ could be improved upon by examining the definitive set of EVTYPEs per the documentation and then using the edit-distance string metric to reclassify other EVTYPEs.
Second, the economic costs should really be adjusted for inflation, given the long tmie period over which the data was collected.
The analysis was carried out under the following R environment.
sessionInfo()
## R version 3.1.1 (2014-07-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
##
## locale:
## [1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C LC_TIME=en_AU.UTF-8 LC_COLLATE=en_AU.UTF-8 LC_MONETARY=en_AU.UTF-8
## [6] LC_MESSAGES=en_AU.UTF-8 LC_PAPER=en_AU.UTF-8 LC_NAME=C LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_1.0.0 reshape2_1.4 plyr_1.8.1
##
## loaded via a namespace (and not attached):
## [1] codetools_0.2-9 colorspace_1.2-4 digest_0.6.4 evaluate_0.5.5 formatR_1.0 grid_3.1.1 gtable_0.1.2 htmltools_0.2.6 knitr_1.6
## [10] labeling_0.3 MASS_7.3-35 munsell_0.4.2 proto_0.3-10 Rcpp_0.11.3 rmarkdown_0.3.3 scales_0.2.4 stringr_0.6.2 tools_3.1.1
## [19] yaml_2.1.13