by Steven Muschalik, August 2015
Severe weather events acorss the United States can wreck havoc in communities and put significant strain on municipalities and emergency services if there is inadequate forsight and planning. Many extreme weather events can result in fatalities, injuries, crop and property damage. Preparing for such events has significant benefits for public health and economic recovery but require the appropriate allocation of resources and financial aid.
Statistical techniques allow us to estimate the damage of various significant weather events in advance so that municipalities, hospitals and emergency services can allocate resources to help save lives and property.
In this report we explore a) which types of events are most harmful with respect to population health and b) which types of events have the greatest economic consequences?
The top 3 fatalities are caused by: tornados, heat & draught, winds & storms.
The top 3 injuries are caused by: tornados, wind & storm, heat & drought.
Property and crop damage costs: The top 3 are winds & storm, flooding, and tornados.
Human cost of fatalities and injury: The top 3 are tornados, heat & drought, and wind & storm.
The combined costs of human, property and crop damage: wind & storm, flooding, and tornados.
For this report, we draw on The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database [47Mb] which contains information on the major weather events in the United States from 1950 - 2011. For further information visit National Weather Service Storm Data Documentation. Questions and answers can be found here at National Climatic Data Center Storm Events FAQ
Before we an begin any statistical analysis we need to set global parameters and load the libraries.
# setwd("E:/Courses/Coursera/Reproducible Research/Week 3/PeerAssessment2")
setwd("F:/Courses/Coursera/Reproducible Research/Week 3/PeerAssessment2")
knitr::opts_chunk$set(echo = TRUE, message = FALSE, warnings = FALSE, fig.width = 10, fig.height = 6)
Check if the necessary packages are already installed
mypackages <- c("ggplot2", "knitr", "plyr", "gridExtra", "pander", "pastecs")
if (length(setdiff(mypackages, rownames(installed.packages()))) > 0) {
install.packages(setdiff(mypackages, rownames(installed.packages())))
}
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.1.3
library(knitr)
## Warning: package 'knitr' was built under R version 3.1.3
library(plyr)
## Warning: package 'plyr' was built under R version 3.1.3
library(gridExtra)
## Warning: package 'gridExtra' was built under R version 3.1.3
library(pander)
## Warning: package 'pander' was built under R version 3.1.3
library(pastecs)
## Warning: package 'pastecs' was built under R version 3.1.3
The directory needs to be set in which the files will be downloaded. Then the data will be downloaded from the NOAA Storm Database
url <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
if (!file.exists("stormdata.csv.bz2")) {
download.file(url, destfile = "stormdata.csv.bz2")
}
# Check if the table storm exists in memory already
if (!"storm" %in% ls()) {
#Read in 5 rows to check the class, then read in the full csv with colClasses to speed up reading
tab5rows <- read.csv("stormdata.csv.bz2", header = TRUE, nrows = 5)
classes <- sapply(tab5rows, class)
storm <- read.csv("stormdata.csv.bz2", header = TRUE, sep = ",", nrows = 902297)
}
dim(storm)
## [1] 902297 38
The table storm has 902297 rows and 38 columns.
To pre-process the data we make the column names lower case.
names(storm) <- tolower(names(storm))
names(storm)
## [1] "state__" "bgn_date" "bgn_time" "time_zone" "county"
## [6] "countyname" "state" "evtype" "bgn_range" "bgn_azi"
## [11] "bgn_locati" "end_date" "end_time" "county_end" "countyendn"
## [16] "end_range" "end_azi" "end_locati" "length" "width"
## [21] "f" "mag" "fatalities" "injuries" "propdmg"
## [26] "propdmgexp" "cropdmg" "cropdmgexp" "wfo" "stateoffic"
## [31] "zonenames" "latitude" "longitude" "latitude_e" "longitude_"
## [36] "remarks" "refnum" "year"
We modify the date colum from factor to date class.
storm$bgn_date <- as.Date(storm$bgn_date, format="%m/%d/%Y")
We add an aditional year column derived from the event date and count how many records each year has. We then transpose the rows into columns for the resulting table.
storm$year <- as.numeric(format(storm$bgn_date, "%Y"))
minyear <- min(storm$year)
maxyear <- max(storm$year)
t(count(storm, "year"))
## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13]
## year 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962
## freq 223 269 272 492 609 1413 1703 2184 2213 1813 1945 2246 2389
## [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
## year 1963 1964 1965 1966 1967 1968 1969 1970 1971 1972 1973
## freq 1968 2348 2855 2388 2688 3312 2926 3215 3471 2168 4463
## [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35]
## year 1974 1975 1976 1977 1978 1979 1980 1981 1982 1983 1984
## freq 5386 4975 3768 3728 3657 4279 6146 4517 7132 8322 7335
## [,36] [,37] [,38] [,39] [,40] [,41] [,42] [,43] [,44] [,45] [,46]
## year 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
## freq 7979 8726 7367 7257 10410 10946 12522 13534 12607 20631 27970
## [,47] [,48] [,49] [,50] [,51] [,52] [,53] [,54] [,55] [,56] [,57]
## year 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006
## freq 32270 28680 38128 31289 34471 34962 36293 39752 39363 39184 44034
## [,58] [,59] [,60] [,61] [,62]
## year 2007 2008 2009 2010 2011
## freq 43289 55663 45817 48161 62174
We have data from 1950 to 2011 so we have 62 years of data. A histrogram is generated to visualise the amount of records that has been collected for each year.
par(mfrow = c(2,1))
hist(storm$year, xlab = "Year", main = paste("Histogram of Events Per Year", minyear, "to", maxyear), col = "lightgreen", breaks=seq(minyear,maxyear,l=maxyear-minyear+1))
cutoff <- 1994
gmonth <- as.numeric(format(storm$bgn_date, "%m"))
h <- hist(gmonth, xlab = "Month", main = paste("Histogram of Events by Month", cutoff, "to", maxyear), col = "magenta", breaks = seq(1, 12, l = 12+1))
xfit <- seq(min(gmonth), max(gmonth), length = 30)
yfit <- dnorm(xfit, mean = mean(gmonth), sd = sd(gmonth))
yfit <- yfit * diff(h$mids[1:2]) * length(gmonth)
lines(xfit, yfit, col = "black", lwd = 2)
stat.desc(gmonth, basic=F) #Using pastecs
## median mean SE.mean CI.mean.0.95 var
## 6.000000000 6.068808829 0.002554438 0.005006612 5.887624388
## std.dev coef.var
## 2.426442744 0.399821911
From the yearly histogram we can see that we the number of recorded events increases significantly from 1994 - 2011, so we will limit our data range to that period. Note also that most of the damage occurs in the middle of the year when temperatures are warmer, which generally means more energy and water in the atmosphere. The histogram of events by month resembles a normal distribution, but is slightly skewed.
Filtering for relevant events based on the storm data documentation we select the following columns:
| Column | Data |
|---|---|
| bgn_date | Date the storm event began |
| evtype | Type of storm event |
| fatalities | Number directly killed |
| injuries | Number directly injured |
| propdmg | Property damage in whole numbers |
| propdmgexp | Contains multipliers for propdmg as factors |
| cropdmg | Crop damage in whole numbers |
| cropdmgexp | Contains multipliers for cropdmg as factors |
We subset the storm data after 1994 and create a new object events with 8 columns.
events <- storm[storm$year >= cutoff, c("bgn_date", "evtype", "fatalities", "injuries", "propdmg", "propdmgexp", "cropdmg", "cropdmgexp")]
events <- events[storm$fatalities >= 0 | storm$injuries >= 0 | storm$propdmg >= 0 | storm$cropdmg >= 0, ]
dim(events)
## [1] 902297 8
The table events has 902297 rows and 8 columns.
The columns propdmgexp and cropdmgexp contain multiplier factors. To keep the column lengths variable and narrow we also apply the transpose function.
t(summary(events$propdmgexp))
## - ? + 0 1 2 3 4 5 6 7 8 B h H K m M NA's
## [1,] 306087 1 6 5 215 25 13 4 4 28 3 5 1 37 1 6 387417 7 8266 200166
t(summary(events$cropdmgexp))
## ? 0 2 B k K m M NA's
## [1,] 419089 5 17 1 8 21 281034 1 1955 200166
We create new columns proploss and croploss which now contain the multiplier as numbers instead of factors.
We need to convert the multipliers H, K, M and B into Hundred (H), Thousand (K), Million (M) and Billion (B). The other exponential multipliers 0-8 in propdmgexp and cropdmgexp are exponents e.g. 10^1, 10^2, 10^3, 10^4 etc.
# Hundred (H), Thousand (K), Million (M), Billion (B)
multiplierChar = c("?", "", "-", "+", "0", "1", "2", "3", "4", "5", "6", "7", "8", "h", "H", "k", "K", "m", "M", "b", "B")
multiplier = c(0, 0, 0, 0, 0, 1e+01, 1e+02, 1e+03, 1e+04, 1e+05, 1e+06, 1e+07, 1e+08, 1e+02, 1e+02, 1e+03, 1e+03, 1e+06, 1e+06, 1e+09, 1e+09)
# use propdmgex to get property damage costs
events$proploss = multiplier[match(events$propdmgexp, multiplierChar)]
# use cropdmgexp to get property damage costs
events$croploss = multiplier[match(events$cropdmgexp, multiplierChar)]
str(events)
## 'data.frame': 902297 obs. of 10 variables:
## $ bgn_date : Date, format: "1995-01-06" "1995-01-22" ...
## $ evtype : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 201 629 429 657 657 410 786 786 834 244 ...
## $ fatalities: num 0 0 0 0 0 2 0 0 0 0 ...
## $ injuries : num 0 0 2 0 0 0 0 0 0 0 ...
## $ propdmg : num 0 0 0 0 0 0.1 50 5 500 0 ...
## $ propdmgexp: Factor w/ 19 levels "","-","?","+",..: 1 1 1 1 1 14 17 19 17 1 ...
## $ cropdmg : num 0 0 0 0 0 10 0 500 0 0 ...
## $ cropdmgexp: Factor w/ 9 levels "","?","0","2",..: 1 1 1 1 1 9 1 7 1 1 ...
## $ proploss : num 0e+00 0e+00 0e+00 0e+00 0e+00 1e+09 1e+03 1e+06 1e+03 0e+00 ...
## $ croploss : num 0e+00 0e+00 0e+00 0e+00 0e+00 1e+06 0e+00 1e+03 0e+00 0e+00 ...
There are many overlapping categories as the data is quite messy. There are various typos and overlapping categories which makes grouping difficult. Here is a selection of the the first 18 categories from 985 unique categories.
categories <- unique(events$evtype)
length(categories)
## [1] 931
head(categories, n=18)
## [1] FREEZING RAIN SNOW
## [3] ICE STORM/FLASH FLOOD SNOW/ICE
## [5] HURRICANE OPAL/HIGH WINDS THUNDERSTORM WINDS
## [7] TORNADO HAIL
## [9] RECORD COLD HURRICANE ERIN
## [11] HURRICANE OPAL HEAVY RAIN
## [13] LIGHTNING THUNDERSTORM WIND
## [15] DENSE FOG RIP CURRENT
## [17] THUNDERSTORM WINS FLASH FLOOD
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
Here we try to reduce the number of event categories for better understanding. We end up with 11 classes of weather events in the new column damagesource, including “other”.
events$damagesource <- "Other"
events[grepl("tornado|torndao|spout|funnel|whirlwind", events$evtype, ignore.case = TRUE), "damagesource"] <- "Tornado"
events[grepl("flood|blow-out|swells|fld|dam break|rising water|stream fl|seiche|small stream", events$evtype, ignore.case = TRUE), "damagesource"] <- "Flooding"
events[grepl("wind|storm|wnd|hurricane|typhoon|tstm|thunderstorm|burst|lightning|lighting|ligntning|hail|turbulence|gustnado", events$evtype, ignore.case = TRUE), "damagesource"] <- "Wind & Storm"
events[grepl("warmth|warm|heat|hot|drought|hyperthermia|temperature record|record temperature|record high", events$evtype, ignore.case = TRUE), "damagesource"] <- "Heat & Drought"
events[grepl("seas|high water|tide|tsunami|wave|current|marine|drowning|surf|coastal surge", events$evtype, ignore.case = TRUE), "damagesource"] <- "High Seas"
events[grepl("precipitation|rain|heavy shower|drizzle|wet|precip|depression|fog|wall cloud", events$evtype, ignore.case = TRUE), "damagesource"] <- "Precipitation & Fog"
events[grepl("cold|cool|ice|icy|frost|freeze|snow|winter|wintry|wintery|blizzard|chill|freezing|low temp|avalanche|avalance|glaze|sleet|hypothermia", events$evtype, ignore.case = TRUE), "damagesource"] <- "Snow & Ice"
events[grepl("slide|erosion|slump", events$evtype, ignore.case = TRUE), "damagesource"] <- "Landslide & Erosion"
events[grepl("dust|saharan|dry", events$evtype, ignore.case = TRUE), "damagesource"] <- "Dust & Saharan Winds"
events[grepl("fire|smoke|volcanic", events$evtype, ignore.case = TRUE), "damagesource"] <- "Fire & Volcanic Activity"
events[grepl("other|heavy mix|urban small|urban and small|apache county|high", events$evtype, ignore.case = TRUE), "damagesource"] <- "Other"
Let’s check the new unique sources of damage in damagesource.
unique(events$damagesource)
## [1] "Snow & Ice" "Other"
## [3] "Wind & Storm" "Tornado"
## [5] "Precipitation & Fog" "High Seas"
## [7] "Flooding" "Heat & Drought"
## [9] "Dust & Saharan Winds" "Fire & Volcanic Activity"
## [11] "Landslide & Erosion"
events$damagesource <- as.factor(events$damagesource)
To calculate total damages we need to estimate the value of a human life. We select $8 million as an average of these agencies who produced their estimates.
We also want to estimate the cost of an average injury for a severe weather event. As a comparison we use The Cost of Injuries Sustained in Road Crashes to estimate the injury cost, which we compare to the an injury causing external damage by a car accident. The cost for hospitalisation for this injury is around $15000.
Now we create five totals columns to sum or multiply the values from existing columns.
humancost <- 8*10^6 # 8 million dollars
injurycost <- 15000
events$totalproploss <- events$propdmg * events$proploss
events$totalcroploss <- events$cropdmg * events$croploss
events$totalvalueloss <- events$totalproploss + events$totalcroploss
events$totalhumanloss <- events$fatalities * humancost + events$injuries * injurycost
events$totaleconloss <- events$totalvalueloss + events$totalhumanloss
We generate a new table with the averages for the new columns above for fatalities, injuries, property loss, crop loss, total value loss and human losses after 1994. To do this, we sum the values then divide by years. I’ve done this so that the large figures are not so overwhelming and it is easier for governments to allocate resources and funds each year to the expected damage.
years <- maxyear - cutoff
totals <- ddply(events, .(damagesource), summarize,
totalfatal = sum(fatalities)/years,
totalinjured = sum(injuries)/years,
totalpropdmg = sum(totalproploss)/years,
totalcropdmg = sum(totalcroploss)/years,
totalvaluedmg = sum(totalvalueloss)/years,
totalhumandmg = sum(totalhumanloss)/years,
totalecondmg = sum(totaleconloss)/years)
totals <- totals[complete.cases(totals),]
stat.desc(totals[2:7], basic=F) #Using pastecs
## totalfatal totalinjured totalpropdmg totalcropdmg
## median 63.1764706 2.255882e+02 3.454309e+08 4.288492e+07
## mean 59.6529412 3.828882e+02 2.209064e+09 2.518485e+08
## SE.mean 17.6960854 1.368539e+02 1.235454e+09 1.021949e+08
## CI.mean.0.95 40.0313263 3.095851e+02 2.794790e+09 2.311808e+08
## var 3131.5143791 1.872900e+05 1.526346e+19 1.044379e+17
## std.dev 55.9599355 4.327702e+02 3.906847e+09 3.231685e+08
## coef.var 0.9380918 1.130278e+00 1.768553e+00 1.283186e+00
## totalvaluedmg totalhumandmg
## median 6.979131e+08 5.086756e+08
## mean 2.460912e+09 4.829669e+08
## SE.mean 1.281062e+09 1.429100e+08
## CI.mean.0.95 2.897963e+09 3.232849e+08
## var 1.641119e+19 2.042327e+17
## std.dev 4.051073e+09 4.519211e+08
## coef.var 1.646167e+00 9.357186e-01
Ranking the weather damage costs by human damage (fatality + injury costs)
head(totals[order(-totals$totalhumandmg, -totals$totalfatal),], n=10)
## damagesource totalfatal totalinjured totalpropdmg
## 4 Heat & Drought 173.588235 519.470588 62104217.6
## 10 Tornado 93.823529 1330.882353 1508466218.2
## 11 Wind & Storm 93.058824 836.411765 9616807745.9
## 3 Flooding 88.176471 507.882353 9534668642.4
## 9 Snow & Ice 74.588235 365.411765 408201538.2
## 5 High Seas 51.764706 69.764706 282660163.5
## 8 Precipitation & Fog 10.529412 79.705882 190931128.8
## 2 Fire & Volcanic Activity 5.117647 85.764706 463719323.5
## 1 Dust & Saharan Winds 3.294118 30.058824 765278.2
## 6 Landslide & Erosion 2.588235 3.529412 22314594.1
## totalcropdmg totalvaluedmg totalhumandmg totalecondmg
## 4 871819722.4 9.339239e+08 1396497941 2330421881
## 10 21284265.3 1.529750e+09 770551471 2300301954
## 11 620587664.1 1.023740e+10 757016765 10994412175
## 3 417095388.2 9.951764e+09 713030000 10664794031
## 9 500182820.6 9.083844e+08 602187059 1510571418
## 5 384117.6 2.830443e+08 415164118 698208399
## 8 62047400.0 2.529785e+08 85430882 338409411
## 2 23722448.8 4.874418e+08 42227647 529669419
## 1 183235.3 9.485135e+05 26803824 27752337
## 6 1177470.6 2.349206e+07 20758824 44250888
Ranking the weather damage costs by value damage (property and crops)
head(totals[order(-totals$totalvaluedmg, -totals$totalpropdmg),], n=10)
## damagesource totalfatal totalinjured totalpropdmg
## 11 Wind & Storm 93.058824 836.411765 9616807745.9
## 3 Flooding 88.176471 507.882353 9534668642.4
## 10 Tornado 93.823529 1330.882353 1508466218.2
## 4 Heat & Drought 173.588235 519.470588 62104217.6
## 9 Snow & Ice 74.588235 365.411765 408201538.2
## 2 Fire & Volcanic Activity 5.117647 85.764706 463719323.5
## 5 High Seas 51.764706 69.764706 282660163.5
## 8 Precipitation & Fog 10.529412 79.705882 190931128.8
## 6 Landslide & Erosion 2.588235 3.529412 22314594.1
## 1 Dust & Saharan Winds 3.294118 30.058824 765278.2
## totalcropdmg totalvaluedmg totalhumandmg totalecondmg
## 11 620587664.1 1.023740e+10 757016765 10994412175
## 3 417095388.2 9.951764e+09 713030000 10664794031
## 10 21284265.3 1.529750e+09 770551471 2300301954
## 4 871819722.4 9.339239e+08 1396497941 2330421881
## 9 500182820.6 9.083844e+08 602187059 1510571418
## 2 23722448.8 4.874418e+08 42227647 529669419
## 5 384117.6 2.830443e+08 415164118 698208399
## 8 62047400.0 2.529785e+08 85430882 338409411
## 6 1177470.6 2.349206e+07 20758824 44250888
## 1 183235.3 9.485135e+05 26803824 27752337
The total average fatalities and injurines from different weather event types are plotted here from 1950 to 2011.
gfi1 <- ggplot(data = totals, aes(y=totalfatal, x=reorder(damagesource, totalfatal))) +
geom_bar(stat="identity", fill="orange") + coord_flip() +
ylab("Total Fatalities") +
xlab("Event Type") +
ggtitle(paste("Top Average Fatalities per Year\n by Event Type for", cutoff, "to", maxyear))
gfi2 <- ggplot(data = totals, aes(y=totalinjured, x=reorder(damagesource, totalinjured))) +
geom_bar(stat="identity", fill="red") + coord_flip() +
ylab("Total Injuries") +
xlab("Event Type") +
ggtitle(paste("Top Average Injuries per Year\n by Event Type", cutoff, "to", maxyear))
grid.arrange(gfi1, gfi2, nrow = 2, ncol = 1)
The average economic impact per year of different weather event types is plotted here from 1950 to 2011.
units <- 1e+09
gei1 <- ggplot(data = totals, aes(y=totalvaluedmg/units, x=reorder(damagesource, totalpropdmg))) +
geom_bar(stat="identity", fill="orange") + coord_flip() +
ylab("Economic Cost in Billions") +
xlab("Event Type") +
ggtitle(paste("Total Average Economic Cost per Year of Property and Crop Damage\n by Event Type for", cutoff, "to", maxyear))
gei2 <- ggplot(data = totals, aes(y=totalhumandmg/units, x=reorder(damagesource, totalhumandmg))) +
geom_bar(stat="identity", fill="red") + coord_flip() +
ylab("Economic Cost in Billions") +
xlab("Event Type") +
ggtitle(paste("Total Average Economic Human Cost per Year in Terms of Fatalities and Injury\n by Event Type for", cutoff, "to", maxyear))
gei3 <- ggplot(data = totals, aes(y=totalecondmg/units, x=reorder(damagesource, totalecondmg))) +
geom_bar(stat="identity", fill="blue")+ coord_flip() +
ylab("Economic Cost in Billions") +
xlab("Event Type") +
ggtitle(paste("Total Average Economic Cost per Year of Human, Property and Crop Damage\n by Event Type for", cutoff, "to", maxyear))
grid.arrange(gei1, gei2, gei3, nrow = 3, ncol = 1)
One limitation of this study is that earlier data before 1994 tends to be more scarce, and potentially less accurate. It is also difficult to simplify the classification of weather events as some have mixed weather types. Another limitation is the accuracy of some of the collected data. There may still be hidden errors in the classification of events due to typos for event types and in dollar or multiplier values, which means some event costs may be under or overestimated by a factor of up to a billion in the worst case.
The final economic cost assumes constants for the value of a human life, and the average cost of injury. These variables may need to be adjusted depending on the assumptions one wants to use, or the available hospital cost data.
Lastly, this study does not look at the geography of the events in the United States, which would require further research to inform the different states and municipalities of weather events which are of particular risk to them locally.
In the future the dataset may also be useful in tracking climate change over the decades, for example if the locations of tornados or hurricanes have moved over years, or if their stregth has increased. It may also be important to augment the data with information on property or investment costs over geographic areas as some areas such as cities have a higher value per square meter than in the countryside.