by Steven Muschalik, August 2015

Synopsis

Severe weather events acorss the United States can wreck havoc in communities and put significant strain on municipalities and emergency services if there is inadequate forsight and planning. Many extreme weather events can result in fatalities, injuries, crop and property damage. Preparing for such events has significant benefits for public health and economic recovery but require the appropriate allocation of resources and financial aid.

Statistical techniques allow us to estimate the damage of various significant weather events in advance so that municipalities, hospitals and emergency services can allocate resources to help save lives and property.

In this report we explore a) which types of events are most harmful with respect to population health and b) which types of events have the greatest economic consequences?

  1. The top 3 fatalities are caused by: tornados, heat & draught, winds & storms.

    The top 3 injuries are caused by: tornados, wind & storm, heat & drought.

  2. Property and crop damage costs: The top 3 are winds & storm, flooding, and tornados.

    Human cost of fatalities and injury: The top 3 are tornados, heat & drought, and wind & storm.

    The combined costs of human, property and crop damage: wind & storm, flooding, and tornados.

Data Processing

For this report, we draw on The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database [47Mb] which contains information on the major weather events in the United States from 1950 - 2011. For further information visit National Weather Service Storm Data Documentation. Questions and answers can be found here at National Climatic Data Center Storm Events FAQ

Set Global Settings

Before we an begin any statistical analysis we need to set global parameters and load the libraries.

# setwd("E:/Courses/Coursera/Reproducible Research/Week 3/PeerAssessment2")
setwd("F:/Courses/Coursera/Reproducible Research/Week 3/PeerAssessment2")

knitr::opts_chunk$set(echo = TRUE, message = FALSE, warnings = FALSE, fig.width = 10, fig.height = 6)

Load Libraries

Check if the necessary packages are already installed

mypackages <- c("ggplot2", "knitr", "plyr", "gridExtra", "pander", "pastecs")
if (length(setdiff(mypackages, rownames(installed.packages()))) > 0) {
  install.packages(setdiff(mypackages, rownames(installed.packages())))  
}
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
library(knitr)
## Warning: package 'knitr' was built under R version 3.1.3
library(plyr)
## Warning: package 'plyr' was built under R version 3.1.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.1.3
library(pander)
## Warning: package 'pander' was built under R version 3.1.3
library(pastecs)
## Warning: package 'pastecs' was built under R version 3.1.3

Loading Data

The directory needs to be set in which the files will be downloaded. Then the data will be downloaded from the NOAA Storm Database

url <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists("stormdata.csv.bz2")) {
    download.file(url, destfile = "stormdata.csv.bz2")
}

# Check if the table storm exists in memory already
if (!"storm" %in% ls()) {
    #Read in 5 rows to check the class, then read in the full csv with colClasses to speed up reading
    tab5rows <- read.csv("stormdata.csv.bz2", header = TRUE, nrows = 5)
    classes <- sapply(tab5rows, class)
    storm <- read.csv("stormdata.csv.bz2", header = TRUE, sep = ",", nrows = 902297)
}

dim(storm)
## [1] 902297     38

The table storm has 902297 rows and 38 columns.

Pre-process Data

To pre-process the data we make the column names lower case.

names(storm) <- tolower(names(storm))
names(storm)
##  [1] "state__"    "bgn_date"   "bgn_time"   "time_zone"  "county"    
##  [6] "countyname" "state"      "evtype"     "bgn_range"  "bgn_azi"   
## [11] "bgn_locati" "end_date"   "end_time"   "county_end" "countyendn"
## [16] "end_range"  "end_azi"    "end_locati" "length"     "width"     
## [21] "f"          "mag"        "fatalities" "injuries"   "propdmg"   
## [26] "propdmgexp" "cropdmg"    "cropdmgexp" "wfo"        "stateoffic"
## [31] "zonenames"  "latitude"   "longitude"  "latitude_e" "longitude_"
## [36] "remarks"    "refnum"     "year"

We modify the date colum from factor to date class.

storm$bgn_date <- as.Date(storm$bgn_date, format="%m/%d/%Y")

We add an aditional year column derived from the event date and count how many records each year has. We then transpose the rows into columns for the resulting table.

storm$year <- as.numeric(format(storm$bgn_date, "%Y"))
minyear <- min(storm$year)
maxyear <- max(storm$year)
t(count(storm, "year"))
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## year 1950 1951 1952 1953 1954 1955 1956 1957 1958  1959  1960  1961  1962
## freq  223  269  272  492  609 1413 1703 2184 2213  1813  1945  2246  2389
##      [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## year  1963  1964  1965  1966  1967  1968  1969  1970  1971  1972  1973
## freq  1968  2348  2855  2388  2688  3312  2926  3215  3471  2168  4463
##      [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35]
## year  1974  1975  1976  1977  1978  1979  1980  1981  1982  1983  1984
## freq  5386  4975  3768  3728  3657  4279  6146  4517  7132  8322  7335
##      [,36] [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46]
## year  1985  1986  1987  1988  1989  1990  1991  1992  1993  1994  1995
## freq  7979  8726  7367  7257 10410 10946 12522 13534 12607 20631 27970
##      [,47] [,48] [,49] [,50] [,51] [,52] [,53] [,54] [,55] [,56] [,57]
## year  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006
## freq 32270 28680 38128 31289 34471 34962 36293 39752 39363 39184 44034
##      [,58] [,59] [,60] [,61] [,62]
## year  2007  2008  2009  2010  2011
## freq 43289 55663 45817 48161 62174

We have data from 1950 to 2011 so we have 62 years of data. A histrogram is generated to visualise the amount of records that has been collected for each year.

par(mfrow = c(2,1))
hist(storm$year, xlab = "Year", main = paste("Histogram of Events Per Year", minyear, "to", maxyear), col = "lightgreen", breaks=seq(minyear,maxyear,l=maxyear-minyear+1))

cutoff <- 1994

gmonth <- as.numeric(format(storm$bgn_date, "%m"))
h <- hist(gmonth, xlab = "Month", main = paste("Histogram of Events by Month", cutoff, "to", maxyear), col = "magenta", breaks = seq(1, 12, l = 12+1))
    xfit <- seq(min(gmonth), max(gmonth), length = 30) 
    yfit <- dnorm(xfit, mean = mean(gmonth), sd = sd(gmonth)) 
    yfit <- yfit * diff(h$mids[1:2]) * length(gmonth) 
    lines(xfit, yfit, col = "black", lwd = 2)

stat.desc(gmonth, basic=F)  #Using pastecs
##       median         mean      SE.mean CI.mean.0.95          var 
##  6.000000000  6.068808829  0.002554438  0.005006612  5.887624388 
##      std.dev     coef.var 
##  2.426442744  0.399821911

From the yearly histogram we can see that we the number of recorded events increases significantly from 1994 - 2011, so we will limit our data range to that period. Note also that most of the damage occurs in the middle of the year when temperatures are warmer, which generally means more energy and water in the atmosphere. The histogram of events by month resembles a normal distribution, but is slightly skewed.

Filtering for relevant events based on the storm data documentation we select the following columns:

Column Data
bgn_date Date the storm event began
evtype Type of storm event
fatalities Number directly killed
injuries Number directly injured
propdmg Property damage in whole numbers
propdmgexp Contains multipliers for propdmg as factors
cropdmg Crop damage in whole numbers
cropdmgexp Contains multipliers for cropdmg as factors

We subset the storm data after 1994 and create a new object events with 8 columns.

events <- storm[storm$year >= cutoff, c("bgn_date", "evtype", "fatalities", "injuries", "propdmg", "propdmgexp", "cropdmg", "cropdmgexp")]
events <- events[storm$fatalities >= 0 | storm$injuries >= 0 | storm$propdmg >= 0 | storm$cropdmg >= 0, ]
dim(events)
## [1] 902297      8

The table events has 902297 rows and 8 columns.

The columns propdmgexp and cropdmgexp contain multiplier factors. To keep the column lengths variable and narrow we also apply the transpose function.

t(summary(events$propdmgexp))
##             - ? +   0  1  2 3 4  5 6 7 8  B h H      K m    M   NA's
## [1,] 306087 1 6 5 215 25 13 4 4 28 3 5 1 37 1 6 387417 7 8266 200166
t(summary(events$cropdmgexp))
##             ?  0 2 B  k      K m    M   NA's
## [1,] 419089 5 17 1 8 21 281034 1 1955 200166

Calculate Mulipliers

We create new columns proploss and croploss which now contain the multiplier as numbers instead of factors.

We need to convert the multipliers H, K, M and B into Hundred (H), Thousand (K), Million (M) and Billion (B). The other exponential multipliers 0-8 in propdmgexp and cropdmgexp are exponents e.g. 10^1, 10^2, 10^3, 10^4 etc.

# Hundred (H), Thousand (K), Million (M), Billion (B)
multiplierChar = c("?", "", "-", "+", "0", "1", "2", "3", "4", "5", "6", "7", "8", "h", "H", "k", "K", "m", "M", "b", "B")
multiplier = c(0, 0, 0, 0, 0, 1e+01,  1e+02, 1e+03, 1e+04, 1e+05, 1e+06, 1e+07, 1e+08, 1e+02, 1e+02, 1e+03, 1e+03, 1e+06, 1e+06, 1e+09, 1e+09)

# use propdmgex to get property damage costs
events$proploss = multiplier[match(events$propdmgexp, multiplierChar)]

# use cropdmgexp to get property damage costs
events$croploss = multiplier[match(events$cropdmgexp, multiplierChar)]

str(events)
## 'data.frame':    902297 obs. of  10 variables:
##  $ bgn_date  : Date, format: "1995-01-06" "1995-01-22" ...
##  $ evtype    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 201 629 429 657 657 410 786 786 834 244 ...
##  $ fatalities: num  0 0 0 0 0 2 0 0 0 0 ...
##  $ injuries  : num  0 0 2 0 0 0 0 0 0 0 ...
##  $ propdmg   : num  0 0 0 0 0 0.1 50 5 500 0 ...
##  $ propdmgexp: Factor w/ 19 levels "","-","?","+",..: 1 1 1 1 1 14 17 19 17 1 ...
##  $ cropdmg   : num  0 0 0 0 0 10 0 500 0 0 ...
##  $ cropdmgexp: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 9 1 7 1 1 ...
##  $ proploss  : num  0e+00 0e+00 0e+00 0e+00 0e+00 1e+09 1e+03 1e+06 1e+03 0e+00 ...
##  $ croploss  : num  0e+00 0e+00 0e+00 0e+00 0e+00 1e+06 0e+00 1e+03 0e+00 0e+00 ...

Event Categories

There are many overlapping categories as the data is quite messy. There are various typos and overlapping categories which makes grouping difficult. Here is a selection of the the first 18 categories from 985 unique categories.

categories <- unique(events$evtype)
length(categories)
## [1] 931
head(categories, n=18)
##  [1] FREEZING RAIN             SNOW                     
##  [3] ICE STORM/FLASH FLOOD     SNOW/ICE                 
##  [5] HURRICANE OPAL/HIGH WINDS THUNDERSTORM WINDS       
##  [7] TORNADO                   HAIL                     
##  [9] RECORD COLD               HURRICANE ERIN           
## [11] HURRICANE OPAL            HEAVY RAIN               
## [13] LIGHTNING                 THUNDERSTORM WIND        
## [15] DENSE FOG                 RIP CURRENT              
## [17] THUNDERSTORM WINS         FLASH FLOOD              
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND

Here we try to reduce the number of event categories for better understanding. We end up with 11 classes of weather events in the new column damagesource, including “other”.

events$damagesource <- "Other"
events[grepl("tornado|torndao|spout|funnel|whirlwind", events$evtype, ignore.case = TRUE), "damagesource"] <- "Tornado" 
events[grepl("flood|blow-out|swells|fld|dam break|rising water|stream fl|seiche|small stream", events$evtype, ignore.case = TRUE), "damagesource"] <- "Flooding" 
events[grepl("wind|storm|wnd|hurricane|typhoon|tstm|thunderstorm|burst|lightning|lighting|ligntning|hail|turbulence|gustnado", events$evtype, ignore.case = TRUE), "damagesource"] <- "Wind & Storm" 
events[grepl("warmth|warm|heat|hot|drought|hyperthermia|temperature record|record temperature|record high", events$evtype, ignore.case = TRUE), "damagesource"] <- "Heat & Drought" 
events[grepl("seas|high water|tide|tsunami|wave|current|marine|drowning|surf|coastal surge", events$evtype, ignore.case = TRUE), "damagesource"] <- "High Seas" 
events[grepl("precipitation|rain|heavy shower|drizzle|wet|precip|depression|fog|wall cloud", events$evtype, ignore.case = TRUE), "damagesource"] <- "Precipitation & Fog" 
events[grepl("cold|cool|ice|icy|frost|freeze|snow|winter|wintry|wintery|blizzard|chill|freezing|low temp|avalanche|avalance|glaze|sleet|hypothermia", events$evtype, ignore.case = TRUE), "damagesource"] <- "Snow & Ice" 
events[grepl("slide|erosion|slump", events$evtype, ignore.case = TRUE), "damagesource"] <- "Landslide & Erosion" 
events[grepl("dust|saharan|dry", events$evtype, ignore.case = TRUE), "damagesource"] <- "Dust & Saharan Winds" 
events[grepl("fire|smoke|volcanic", events$evtype, ignore.case = TRUE), "damagesource"] <- "Fire & Volcanic Activity" 
events[grepl("other|heavy mix|urban small|urban and small|apache county|high", events$evtype, ignore.case = TRUE), "damagesource"] <- "Other"

Let’s check the new unique sources of damage in damagesource.

unique(events$damagesource)
##  [1] "Snow & Ice"               "Other"                   
##  [3] "Wind & Storm"             "Tornado"                 
##  [5] "Precipitation & Fog"      "High Seas"               
##  [7] "Flooding"                 "Heat & Drought"          
##  [9] "Dust & Saharan Winds"     "Fire & Volcanic Activity"
## [11] "Landslide & Erosion"
events$damagesource <- as.factor(events$damagesource)

Estimate value of human life lost and cost of injuries

To calculate total damages we need to estimate the value of a human life. We select $8 million as an average of these agencies who produced their estimates.

  • Environmental Protection Agency (EPA) - $9.1 million
  • Food and Drug Administration (FDA) - $7.9 million
  • Transportation Department - $6 million

We also want to estimate the cost of an average injury for a severe weather event. As a comparison we use The Cost of Injuries Sustained in Road Crashes to estimate the injury cost, which we compare to the an injury causing external damage by a car accident. The cost for hospitalisation for this injury is around $15000.

Now we create five totals columns to sum or multiply the values from existing columns.

humancost <- 8*10^6   # 8 million dollars
injurycost <- 15000
events$totalproploss <- events$propdmg * events$proploss
events$totalcroploss <- events$cropdmg * events$croploss
events$totalvalueloss <- events$totalproploss + events$totalcroploss 
events$totalhumanloss <- events$fatalities * humancost + events$injuries * injurycost
events$totaleconloss <- events$totalvalueloss + events$totalhumanloss

Results

We generate a new table with the averages for the new columns above for fatalities, injuries, property loss, crop loss, total value loss and human losses after 1994. To do this, we sum the values then divide by years. I’ve done this so that the large figures are not so overwhelming and it is easier for governments to allocate resources and funds each year to the expected damage.

years <- maxyear - cutoff
totals <- ddply(events, .(damagesource), summarize,
      totalfatal = sum(fatalities)/years,
      totalinjured = sum(injuries)/years,
      totalpropdmg = sum(totalproploss)/years,
      totalcropdmg = sum(totalcroploss)/years,
      totalvaluedmg = sum(totalvalueloss)/years,
      totalhumandmg = sum(totalhumanloss)/years,
      totalecondmg = sum(totaleconloss)/years)

totals <- totals[complete.cases(totals),]

stat.desc(totals[2:7], basic=F)  #Using pastecs
##                totalfatal totalinjured totalpropdmg totalcropdmg
## median         63.1764706 2.255882e+02 3.454309e+08 4.288492e+07
## mean           59.6529412 3.828882e+02 2.209064e+09 2.518485e+08
## SE.mean        17.6960854 1.368539e+02 1.235454e+09 1.021949e+08
## CI.mean.0.95   40.0313263 3.095851e+02 2.794790e+09 2.311808e+08
## var          3131.5143791 1.872900e+05 1.526346e+19 1.044379e+17
## std.dev        55.9599355 4.327702e+02 3.906847e+09 3.231685e+08
## coef.var        0.9380918 1.130278e+00 1.768553e+00 1.283186e+00
##              totalvaluedmg totalhumandmg
## median        6.979131e+08  5.086756e+08
## mean          2.460912e+09  4.829669e+08
## SE.mean       1.281062e+09  1.429100e+08
## CI.mean.0.95  2.897963e+09  3.232849e+08
## var           1.641119e+19  2.042327e+17
## std.dev       4.051073e+09  4.519211e+08
## coef.var      1.646167e+00  9.357186e-01

Top 10 Severe Weather Events

Ranking the weather damage costs by human damage (fatality + injury costs)

head(totals[order(-totals$totalhumandmg, -totals$totalfatal),], n=10)
##                damagesource totalfatal totalinjured totalpropdmg
## 4            Heat & Drought 173.588235   519.470588   62104217.6
## 10                  Tornado  93.823529  1330.882353 1508466218.2
## 11             Wind & Storm  93.058824   836.411765 9616807745.9
## 3                  Flooding  88.176471   507.882353 9534668642.4
## 9                Snow & Ice  74.588235   365.411765  408201538.2
## 5                 High Seas  51.764706    69.764706  282660163.5
## 8       Precipitation & Fog  10.529412    79.705882  190931128.8
## 2  Fire & Volcanic Activity   5.117647    85.764706  463719323.5
## 1      Dust & Saharan Winds   3.294118    30.058824     765278.2
## 6       Landslide & Erosion   2.588235     3.529412   22314594.1
##    totalcropdmg totalvaluedmg totalhumandmg totalecondmg
## 4   871819722.4  9.339239e+08    1396497941   2330421881
## 10   21284265.3  1.529750e+09     770551471   2300301954
## 11  620587664.1  1.023740e+10     757016765  10994412175
## 3   417095388.2  9.951764e+09     713030000  10664794031
## 9   500182820.6  9.083844e+08     602187059   1510571418
## 5      384117.6  2.830443e+08     415164118    698208399
## 8    62047400.0  2.529785e+08      85430882    338409411
## 2    23722448.8  4.874418e+08      42227647    529669419
## 1      183235.3  9.485135e+05      26803824     27752337
## 6     1177470.6  2.349206e+07      20758824     44250888

Ranking the weather damage costs by value damage (property and crops)

head(totals[order(-totals$totalvaluedmg, -totals$totalpropdmg),], n=10)
##                damagesource totalfatal totalinjured totalpropdmg
## 11             Wind & Storm  93.058824   836.411765 9616807745.9
## 3                  Flooding  88.176471   507.882353 9534668642.4
## 10                  Tornado  93.823529  1330.882353 1508466218.2
## 4            Heat & Drought 173.588235   519.470588   62104217.6
## 9                Snow & Ice  74.588235   365.411765  408201538.2
## 2  Fire & Volcanic Activity   5.117647    85.764706  463719323.5
## 5                 High Seas  51.764706    69.764706  282660163.5
## 8       Precipitation & Fog  10.529412    79.705882  190931128.8
## 6       Landslide & Erosion   2.588235     3.529412   22314594.1
## 1      Dust & Saharan Winds   3.294118    30.058824     765278.2
##    totalcropdmg totalvaluedmg totalhumandmg totalecondmg
## 11  620587664.1  1.023740e+10     757016765  10994412175
## 3   417095388.2  9.951764e+09     713030000  10664794031
## 10   21284265.3  1.529750e+09     770551471   2300301954
## 4   871819722.4  9.339239e+08    1396497941   2330421881
## 9   500182820.6  9.083844e+08     602187059   1510571418
## 2    23722448.8  4.874418e+08      42227647    529669419
## 5      384117.6  2.830443e+08     415164118    698208399
## 8    62047400.0  2.529785e+08      85430882    338409411
## 6     1177470.6  2.349206e+07      20758824     44250888
## 1      183235.3  9.485135e+05      26803824     27752337

Plots

Total Fatalities and Injuries by Event Type

The total average fatalities and injurines from different weather event types are plotted here from 1950 to 2011.

gfi1 <- ggplot(data = totals, aes(y=totalfatal, x=reorder(damagesource, totalfatal))) + 
      geom_bar(stat="identity", fill="orange") + coord_flip() +
      ylab("Total Fatalities") + 
      xlab("Event Type") +
      ggtitle(paste("Top Average Fatalities per Year\n by Event Type for", cutoff, "to", maxyear))

gfi2 <- ggplot(data = totals, aes(y=totalinjured, x=reorder(damagesource, totalinjured))) +
      geom_bar(stat="identity", fill="red") + coord_flip() +
      ylab("Total Injuries") +
      xlab("Event Type") + 
      ggtitle(paste("Top Average Injuries per Year\n by Event Type", cutoff, "to", maxyear))

grid.arrange(gfi1, gfi2, nrow = 2, ncol = 1)

Total Economic Loss by Event Type

The average economic impact per year of different weather event types is plotted here from 1950 to 2011.

units <- 1e+09
gei1 <- ggplot(data = totals, aes(y=totalvaluedmg/units, x=reorder(damagesource, totalpropdmg))) + 
      geom_bar(stat="identity", fill="orange") + coord_flip() +
      ylab("Economic Cost in Billions") + 
      xlab("Event Type") +
      ggtitle(paste("Total Average Economic Cost per Year of Property and Crop Damage\n by Event Type for", cutoff, "to", maxyear))

gei2 <- ggplot(data = totals, aes(y=totalhumandmg/units, x=reorder(damagesource, totalhumandmg))) + 
      geom_bar(stat="identity", fill="red") + coord_flip() +
      ylab("Economic Cost in Billions") +
      xlab("Event Type") + 
      ggtitle(paste("Total Average Economic Human Cost per Year in Terms of Fatalities and Injury\n by Event Type for", cutoff, "to", maxyear))

gei3 <- ggplot(data = totals, aes(y=totalecondmg/units, x=reorder(damagesource, totalecondmg))) + 
      geom_bar(stat="identity", fill="blue")+ coord_flip() + 
      ylab("Economic Cost in Billions") + 
      xlab("Event Type") + 
      ggtitle(paste("Total Average Economic Cost per Year of Human, Property and Crop Damage\n by Event Type for", cutoff, "to", maxyear))

grid.arrange(gei1, gei2, gei3, nrow = 3, ncol = 1)

Limitations and Further Study

One limitation of this study is that earlier data before 1994 tends to be more scarce, and potentially less accurate. It is also difficult to simplify the classification of weather events as some have mixed weather types. Another limitation is the accuracy of some of the collected data. There may still be hidden errors in the classification of events due to typos for event types and in dollar or multiplier values, which means some event costs may be under or overestimated by a factor of up to a billion in the worst case.

The final economic cost assumes constants for the value of a human life, and the average cost of injury. These variables may need to be adjusted depending on the assumptions one wants to use, or the available hospital cost data.

Lastly, this study does not look at the geography of the events in the United States, which would require further research to inform the different states and municipalities of weather events which are of particular risk to them locally.

In the future the dataset may also be useful in tracking climate change over the decades, for example if the locations of tornados or hurricanes have moved over years, or if their stregth has increased. It may also be important to augment the data with information on property or investment costs over geographic areas as some areas such as cities have a higher value per square meter than in the countryside.