Statistical analysis of a comprehensive weather database, compiled by the US authority NOAA over 60 years, reveals that both the number of fatalities and the economic damage caused by extreme weather events are increasing. The database actually consists of two methodologically distinct subsets: data collected from the 1950s to 1993, and data from 1993 to 2011. The pre-1993 data were classified into only three distinct extreme-weather event types (“tornado”, “thunderstorm wind”, “hail”), with tornadoes causing the greatest number of fatalities and the largest damage. In contrast, the data from 1993 onwards (including 1993) use a much finer granularity of 985 distinct event types, of which wind-related events and extreme heat/drought-related event types contribute roughly equally to economic damage and the number of fatalities.
Whereas the number of fatalities per year has remained within the same order of magnitude, the amount of damage done by extreme weather events has risen dramatically (about 6-fold) between the two periods: since 1997 it has exceeded 5 billion dollars each year, and it has exceeded 10 billion dollars three times in the past 20 years.
The weather database analyzed here is a copy acquired from the web.
Source URL: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
Date accessed: July 23rd 2014
Description: U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, containing 1.23 million records. This database tracks characteristics of major storms and weather events in the United States from the 1950s to 2011, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
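For reproducibility, the compressed file can be retrieved directly from the source URL into the project data directory. This is a minimal sketch; it assumes the ‘projectdata’ subdirectory used throughout this report already exists, and method = "curl" is used because the default download method may not support https on this R version.
# Download the compressed storm data into the project data directory (sketch)
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(url, destfile = "projectdata/repdata_data_StormData.csv.bz2", method = "curl")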
Detailed documentation of the database is also available; it describes how some of the variables are constructed and defined.
In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
All calculations were performed with R version 3.1.0 (2014-04-10).
For details about the software used, see bottom of report.
Load the libraries needed by the code below.
library(plyr) # mutate function
library(XML) # parse HTML helper table
library(lubridate) # parse character strings to date type
##
## Attaching package: 'lubridate'
##
## The following object is masked from 'package:plyr':
##
## here
library(ggplot2) #plots
library(Hmisc) # contents() function, a better summary(); can also be used to assign labels to data frame columns
## Loading required package: grid
## Loading required package: lattice
## Loading required package: survival
## Loading required package: splines
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
##
## The following objects are masked from 'package:plyr':
##
## is.discrete, summarize
##
## The following objects are masked from 'package:base':
##
## format.pval, round.POSIXt, trunc.POSIXt, units
Set the working directory. The data file and the PDF files containing the documentation are stored in a subdirectory ‘projectdata’ inside this directory.
setwd("/home/knut/Documents/coursera/datascience/5-reprod-research/RepData_PeerAssessment2")
Uncompress the data with bunzip2, a common tool provided by most Unix operating systems. By default, bzip2 will not overwrite existing output files.
cmd <- "bunzip2 -k projectdata/repdata_data_StormData.csv.bz2"  # -k keeps the original .bz2 file
system(cmd)
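On systems where the bunzip2 command-line tool is not available, the same decompression can be done from within R, for instance with the R.utils package. This is a sketch under the assumption that R.utils is installed:
# Decompress within R; remove=FALSE keeps the original .bz2 file,
# skip=TRUE leaves an already-existing output file untouched.
R.utils::bunzip2("projectdata/repdata_data_StormData.csv.bz2", remove = FALSE, skip = TRUE)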
Try to determine how many rows there are in the data file.
filename <- "projectdata/repdata_data_StormData.csv"
nrow <- gsub('\\s\\S+$', "", system(paste("wc -l", filename), intern = TRUE), perl=TRUE)
The csv file contains 1232705 newlines. This is only an estimate of the row count, because the file might contain “multiline strings”.
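Where wc is not available, the line count can also be obtained in R itself, reading the file in chunks to keep memory use low (a sketch, subject to the same multiline-string caveat):
# Count lines by reading the file in chunks of 100,000 lines.
con <- file(filename, open = "r")
nlines <- 0
repeat {
  chunk <- readLines(con, n = 100000)
  if (length(chunk) == 0) break
  nlines <- nlines + length(chunk)
}
close(con)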
Set all relevant locale environment variables to English with a modern (multibyte) character set, to make sure the data are read in correctly.
Sys.setlocale('LC_ALL', 'en_US.UTF-8')
## [1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=de_DE.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=de_DE.UTF-8;LC_IDENTIFICATION=C"
Inspect a subset of the data file.
initial <- read.csv(filename, na.strings=c("","<NA>"), quote="",header=TRUE, nrows=187000, stringsAsFactors=FALSE)
classes <- sapply(initial, class)
classes
## X.STATE__. X.BGN_DATE. X.BGN_TIME. X.TIME_ZONE. X.COUNTY.
## "numeric" "character" "character" "character" "numeric"
## X.COUNTYNAME. X.STATE. X.EVTYPE. X.BGN_RANGE. X.BGN_AZI.
## "character" "character" "character" "numeric" "logical"
## X.BGN_LOCATI. X.END_DATE. X.END_TIME. X.COUNTY_END. X.COUNTYENDN.
## "logical" "logical" "logical" "numeric" "logical"
## X.END_RANGE. X.END_AZI. X.END_LOCATI. X.LENGTH. X.WIDTH.
## "numeric" "logical" "logical" "numeric" "numeric"
## X.F. X.MAG. X.FATALITIES. X.INJURIES. X.PROPDMG.
## "character" "numeric" "numeric" "numeric" "numeric"
## X.PROPDMGEXP. X.CROPDMG. X.CROPDMGEXP. X.WFO. X.STATEOFFIC.
## "character" "numeric" "logical" "character" "logical"
## X.ZONENAMES. X.LATITUDE. X.LONGITUDE. X.LATITUDE_E. X.LONGITUDE_.
## "logical" "numeric" "numeric" "numeric" "numeric"
## X.REMARKS. X.REFNUM.
## "logical" "numeric"
The database contains 37 columns. Not all of these are needed, though.
Read in the data file, ignoring columns that contain only NA values (found by trial and error), and turn the uppercase column names into lowercase.
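The all-NA columns could alternatively be identified programmatically from the column classes sampled above, since entirely empty columns are read as “logical”. This is only a sketch; the hand-built colClasses vector below additionally drops columns that are simply not needed for this analysis.
# Columns read as "logical" in the sample contain only NA values in this file;
# marking them as "NULL" makes read.csv skip them (sketch only).
auto_classes <- classes
auto_classes[classes == "logical"] <- "NULL"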
important_columns = c("NULL", "character", rep("NULL", 4), "character", "character", rep("NULL", 14), "numeric", "numeric", "numeric", "character", "numeric", "character", rep("NULL", 3), "numeric", "numeric", rep("NULL", 3), "numeric")
storms <- read.csv(filename,colClasses=important_columns)
names(storms) <- tolower(names(storms))
contents(storms)
##
## Data frame:storms 902297 observations and 12 variables Maximum # NAs:47
##
## Storage NAs
## bgn_date character 0
## state character 0
## evtype character 0
## fatalities double 0
## injuries double 0
## propdmg double 0
## propdmgexp character 0
## cropdmg double 0
## cropdmgexp character 0
## latitude double 47
## longitude double 0
## refnum double 0
The number of NA values is 0 for 11 out of 12 columns; only the nonessential “latitude” column contains missing values (47 NAs).
Fix the Napa Valley flood 115-billion-dollar error/outlier, where an “M” was miscoded as a “B”. Source: http://pubs.usgs.gov/of/2006/1182/pdf/ofr2006-1182.pdf
The abstract claims total damage of $300M. Indeed, it looks like the “B” should have been an “M”.
storms$propdmgexp[storms$refnum == 605943] <- "M"
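A quick sanity check, sketched here, confirms that the affected record now carries the corrected exponent:
# Show the corrected record; the propdmg value together with exponent "M"
# is now interpreted as millions rather than billions of dollars.
storms[storms$refnum == 605943, c("bgn_date", "state", "evtype", "propdmg", "propdmgexp")]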
Convert the “bgn_date” column to a date/time type, ignoring time zones and time of day.
storms$datetime <- parse_date_time(gsub('\\s\\S+$', "", storms$bgn_date, perl=TRUE), "%m/%d/%Y", truncated=3)
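A brief check that the conversion succeeded (a sketch): no dates should fail to parse, and the range should span the 1950s to 2011.
# Any NA here would indicate dates that could not be parsed.
sum(is.na(storms$datetime))
range(storms$datetime, na.rm = TRUE)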
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
This can be answered by determining the number of fatalities by year:
qplot(year(datetime), fatalities, data=storms, geom="bar", stat="identity", xlab="year")
The plot indicates that there is a notable increase in fatalities after 1993. This could be caused by more severe weather, or it could reflect a systematic issue: the event classification scheme changed in 1993.
I’ll introduce an artificial grouping variable to give a quantitative view of the data:
storms$period <- ifelse(year(storms$datetime) >= 1993, "postincl1993", "pre1993")
xtabs(fatalities ~period, data=storms)
## period
## postincl1993 pre1993
## 10865 4280
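The number of distinct event types used in each period illustrates the change in classification granularity mentioned in the synopsis (a minimal sketch):
# Count unique event-type labels per period.
tapply(storms$evtype, storms$period, function(x) length(unique(x)))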
The following table shows that fatalities have been attributed to a much more fine-grained set of event types since 1993:
xtabs(fatalities ~evtype + period, data=storms[storms$fatalities > 3,])
## period
## evtype postincl1993 pre1993
## AVALANCHE 10 0
## BLIZZARD 25 0
## COLD AND SNOW 14 0
## COLD/WIND CHILL 5 0
## DUST STORM 10 0
## EXCESSIVE HEAT 1287 0
## EXTREME COLD 29 0
## EXTREME COLD/WIND CHILL 34 0
## EXTREME HEAT 88 0
## FLASH FLOOD 220 0
## FLASH FLOOD/FLOOD 6 0
## FLASH FLOODING 4 0
## FLOOD 97 0
## FLOOD/FLASH FLOOD 5 0
## FOG 16 0
## GLAZE 5 0
## HAIL 0 4
## HEAT 735 0
## HEAT WAVE 150 0
## HEAT WAVE DROUGHT 4 0
## HEAT WAVES 5 0
## HEAVY RAIN 23 0
## HEAVY SNOW 15 0
## HEAVY SURF/HIGH SURF 5 0
## HIGH SURF 22 0
## HIGH WIND 45 0
## HIGH WIND/SEAS 4 0
## HURRICANE 36 0
## HURRICANE/TYPHOON 46 0
## ICE STORM 20 0
## LANDSLIDE 28 0
## LIGHTNING 9 0
## MARINE MISHAP 6 0
## MARINE TSTM WIND 5 0
## Mudslide 4 0
## RECORD/EXCESSIVE HEAT 17 0
## RIP CURRENT 6 0
## RIP CURRENTS/HEAVY SURF 4 0
## ROUGH SEAS 4 0
## STORM SURGE 10 0
## STORM SURGE/TIDE 11 0
## STRONG WIND 5 0
## THUNDERSTORM WIND 7 0
## THUNDERSTORM WINDS 8 0
## TORNADO 1037 2800
## TORNADOES, TSTM WIND, HAIL 25 0
## TROPICAL STORM 30 0
## TROPICAL STORM GORDON 8 0
## TSTM WIND 5 34
## TSUNAMI 32 0
## UNSEASONABLY WARM 10 0
## UNSEASONABLY WARM AND DRY 29 0
## WILDFIRE 38 0
## WILD/FOREST FIRE 4 0
## WINTER STORM 35 0
## WINTER STORMS 10 0
## WINTER WEATHER/MIX 11 0
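A more compact answer to the question can be obtained by summing fatalities per event type for the post-1993 period and listing the largest contributors (a sketch):
# Total fatalities per event type since 1993, largest first.
post <- storms[storms$period == "postincl1993", ]
fat <- aggregate(fatalities ~ evtype, data = post, FUN = sum)
head(fat[order(-fat$fatalities), ], 10)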
Determine the total amount of economic damage by year. In the database, the damage amount is encoded with an exponent code: “K” for thousands of dollars and “M” for millions of dollars.
storms$damageprop <- ifelse(storms$propdmgexp == "M", storms$propdmg *1000000, ifelse(storms$propdmgexp == "K",storms$propdmg *1000, 0))
storms$damagecrop <- ifelse(storms$cropdmgexp == "M", storms$cropdmg *1000000, ifelse(storms$cropdmgexp == "K",storms$cropdmg *1000, 0))
storms$damage <-storms$damagecrop + storms$damageprop
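Note that this conversion treats every exponent code other than “K” and “M” (for example “B”, lowercase variants, or empty strings) as zero damage. A tabulation of the codes, sketched here, shows how often the other values occur:
# How often does each property-damage exponent code occur?
table(storms$propdmgexp, useNA = "ifany")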
Plotting the total damage by year yields the following plot:
qplot(year(datetime), damage, data=storms, geom="bar", stat="identity", xlab="year", ylab="Damage to crops and property, in US dollars")
The following table lists, for each event type and each period (post-1993 and pre-1993), the damage done in US dollars (column “Freq”):
damage <- as.data.frame(xtabs(damage ~ evtype + period, data=storms))
damage[damage$Freq > 1000000000,]
## evtype period Freq
## 88 DROUGHT postincl1993 1.352e+10
## 133 EXTREME COLD postincl1993 1.361e+09
## 147 FLASH FLOOD postincl1993 1.656e+10
## 164 FLOOD postincl1993 2.793e+10
## 206 FROST/FREEZE postincl1993 1.104e+09
## 238 HAIL postincl1993 1.695e+10
## 284 HEAVY RAIN postincl1993 1.428e+09
## 304 HEAVY SNOW postincl1993 1.067e+09
## 354 HIGH WIND postincl1993 4.609e+09
## 397 HURRICANE postincl1993 8.910e+09
## 406 HURRICANE/TYPHOON postincl1993 4.904e+09
## 424 ICE STORM postincl1993 3.967e+09
## 759 THUNDERSTORM WIND postincl1993 3.898e+09
## 783 THUNDERSTORM WINDS postincl1993 1.924e+09
## 830 TORNADO postincl1993 2.144e+10
## 844 TROPICAL STORM postincl1993 3.232e+09
## 854 TSTM WIND postincl1993 5.039e+09
## 953 WILDFIRE postincl1993 4.021e+09
## 956 WILD/FOREST FIRE postincl1993 1.609e+09
## 972 WINTER STORM postincl1993 1.715e+09
## 1815 TORNADO pre1993 3.060e+10
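The total damage per period, which underlies the increase described in the synopsis, can be computed directly (a sketch):
# Total property and crop damage (in US dollars) per period.
xtabs(damage ~ period, data = storms)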
In the roughly 20-year period from 1993 to 2011, several of these event types each contributed as much damage as the tornado-related damage that occurred during the preceding 40-year period (1950s to 1993).
This document was prepared for the Coursera.com class Reproducible Research:
Peer Assessment 2
The instructions for this assignment can be found on this page: https://github.com/rdpeng/RepData_PeerAssessment1.
Please read this first for context and details.
Knut Behrends
knb@gfz-potsdam.de
July 2014
sessionInfo()
## R version 3.1.0 (2014-04-10)
## Platform: x86_64-pc-linux-gnu (64-bit)
##
## locale:
## [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
## [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
## [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] splines grid stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] Hmisc_3.14-4 Formula_1.1-1 survival_2.37-7 lattice_0.20-29
## [5] ggplot2_1.0.0 lubridate_1.3.3 XML_3.98-1 plyr_1.8.1
## [9] rj_2.0.2-1 knitr_1.6 colorout_1.0-3
##
## loaded via a namespace (and not attached):
## [1] cluster_1.15.2 colorspace_1.2-4 dichromat_2.0-0
## [4] digest_0.6.4 evaluate_0.5.3 formatR_0.10
## [7] gtable_0.1.2 htmltools_0.2.4 labeling_0.2
## [10] latticeExtra_0.6-26 MASS_7.3-31 memoise_0.2.1
## [13] munsell_0.4.2 proto_0.3-10 RColorBrewer_1.0-5
## [16] Rcpp_0.11.1 reshape2_1.2.2 rmarkdown_0.2.49
## [19] scales_0.2.3 stringr_0.6.2 tools_3.1.0