In this report I analyze the health and economic impact of severe weather events in the United States, using data recorded from 1950 until 2011. My goal was to get a simple and broad overview without going into too many details. I selected the columns to analyze and performed some minor data cleaning.
I chose to only look at data from 1996 onward, for reasons explained in detail in the document.
The worst three types of event since 1996 in terms of health impact are tornadoes (22,178 deaths and injuries), excessive heat (8,188 deaths and injuries) and floods (7,172 deaths and injuries).
The worst three types of event since 1996 in terms of economic damage are hurricanes/typhoons (~72 billion dollars), storm surges (~43 billion dollars) and floods (~34 billion dollars).
More details and the top 20 event types in both categories are available in the results section.
library(ggplot2)
Sys.setlocale("LC_ALL", "English")
# When using an HTTPS URL we cannot use the standard download method, see
# http://stackoverflow.com/questions/19890633/r-produces-unsupported-url-scheme-error-when-getting-data-from-https-sites
# I changed the URL to HTTP (no SSL), which in this case is the simplest solution.
if (! file.exists("StormData.csv.bz2")) {
download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile = "StormData.csv.bz2",
method = "curl")
}
# reading a compressed file: See http://stackoverflow.com/questions/25948777/extract-bz2-file-in-r
data <- read.csv("StormData.csv.bz2", stringsAsFactors = FALSE)
The documentation of the data is available as the document "STORM DATA PREPARATION" at http://www.nws.noaa.gov/directives/ (search for "STORM DATA PREPARATION"). The current URL (as of 22 Apr, 2015) is http://www.nws.noaa.gov/directives/sym/pd01016005curr.pdf.
I decided to use only these columns because all the others contain data that is not relevant to the questions I set out to answer. For example, I do not need the exact location, because all data is from the US and the questions are about the US as a whole. Of the date and length columns I only use BGN_DATE (the start date of the event). The exact time is not relevant, nor is the duration or the area over which an event extends.
I only care about type of event, date, and damage to people and property, so I keep only the following columns:
data <- data[, c("BGN_DATE",
"STATE",
"EVTYPE",
"FATALITIES",
"INJURIES",
"PROPDMG", "PROPDMGEXP",
"CROPDMG", "CROPDMGEXP")]
Columns PROPDMGEXP and CROPDMGEXP will be used to adjust the damage dollar amounts and then discarded, see below.
I have to trim the EVTYPE strings because some have leading and/or trailing whitespace:
# Define a helper function to trim leading or trailing whitespace from strings
trim <- function (x) gsub("^\\s+|\\s+$", "", x)
EVTYPE_trimmed <- unique(trim(data$EVTYPE))
EVTYPE_original <- unique(data$EVTYPE)
# Are there any EVTYPE strings with leading or trailing whitespace?
EVTYPE_original[!(EVTYPE_original %in% EVTYPE_trimmed)]
## [1] " COASTAL FLOOD" " TSTM WIND" " LIGHTNING"
## [4] " TSTM WIND (G45)" " WIND" " FLASH FLOOD"
## [7] " WATERSPOUT" " HIGH SURF ADVISORY"
# Remove any leading or trailing whitespace from the EVTYPE strings before doing
# any further processing:
data$EVTYPE <- trim(data$EVTYPE)
The vast majority of EVTYPE strings are uppercase.
The number of unique EVTYPE strings before the change is:
length(unique(data$EVTYPE))
## [1] 977
Convert all EVTYPE to upper case:
data$EVTYPE <- toupper(data$EVTYPE)
The number of unique EVTYPE strings after the change is:
length(unique(data$EVTYPE))
## [1] 890
This took care of a number of duplicate entries, where one version of the string was all uppercase and the other version was not.
I will look at the contents of the EVTYPE column later, after examining if it makes sense to ignore older events.
After filtering the columns I convert date strings to dates:
data$BGN_DATE <- as.Date(data$BGN_DATE, format = "%m/%d/%Y")
Next I examine which years we have data for and how many events there are per year, to evaluate how unevenly the measurements are distributed across the years:
# Adjust output width upwards for the table that is the result of this chunk
options(width = 108)
# By using lists I can give names to the columns created in the result data frame
ev_by_year <- aggregate(list(Count = data$EVTYPE),
by = list(Year = 1900 + as.POSIXlt(data$BGN_DATE)$year),
FUN = length)
# For printing: use the horizontal space on the page. This object is used only
# for displaying information, so we can focus entirely on formatting and readability.
# The t() command returns a matrix!
ev_by_year_P <- t(ev_by_year)
# Use the first row of years as column names...
colnames(ev_by_year_P) <- paste("| ", ev_by_year_P[1,])
# ...and then throw away the original row with the years.
# This turns the matrix into a (numeric) vector!
ev_by_year_P <- ev_by_year_P[-1,]
# This creates a vector of strings.
ev_by_year_P <- format(ev_by_year_P, big.mark = ",", scientific = FALSE)
# Print the now horizontal table and apply some formatting to the numbers
noquote(ev_by_year_P)
## | 1950 | 1951 | 1952 | 1953 | 1954 | 1955 | 1956 | 1957 | 1958 | 1959 | 1960 | 1961 | 1962
## 223 269 272 492 609 1,413 1,703 2,184 2,213 1,813 1,945 2,246 2,389
## | 1963 | 1964 | 1965 | 1966 | 1967 | 1968 | 1969 | 1970 | 1971 | 1972 | 1973 | 1974 | 1975
## 1,968 2,348 2,855 2,388 2,688 3,312 2,926 3,215 3,471 2,168 4,463 5,386 4,975
## | 1976 | 1977 | 1978 | 1979 | 1980 | 1981 | 1982 | 1983 | 1984 | 1985 | 1986 | 1987 | 1988
## 3,768 3,728 3,657 4,279 6,146 4,517 7,132 8,322 7,335 7,979 8,726 7,367 7,257
## | 1989 | 1990 | 1991 | 1992 | 1993 | 1994 | 1995 | 1996 | 1997 | 1998 | 1999 | 2000 | 2001
## 10,410 10,946 12,522 13,534 12,607 20,631 27,970 32,270 28,680 38,128 31,289 34,471 34,962
## | 2002 | 2003 | 2004 | 2005 | 2006 | 2007 | 2008 | 2009 | 2010 | 2011
## 36,293 39,752 39,363 39,184 44,034 43,289 55,663 45,817 48,161 62,174
The next two plots serve to assess the implications of removing years from the analysis by examining how the number of reported events is distributed over the years.
###############################################################################
# Plot option #1: dot plot
###############################################################################
g <- ggplot(ev_by_year, aes(Year, Count))
g <- g + geom_point()
# The default scale has too few ticks and labels. Create a more reasonable
# amount of ticks and labels - but make it dependent on the actual data.
g <- g + scale_x_continuous(breaks = round(seq(min(ev_by_year$Year), max(ev_by_year$Year), by = 5), 1))
# The calculations for the y-axis look so complicated because I don't want the
# ticks/labels to start at an uneven number, but use full thousands. Negative
# values for round() make it round to the nearest 100/1000/etc.
g <- g + scale_y_continuous(breaks = seq(round(min(ev_by_year$Count), -floor(log10(max(ev_by_year$Count)))),
max(ev_by_year$Count),
by = 10^floor(log10(max(ev_by_year$Count)))))
# Vertical text on the x-axis allows for more labels
g <- g + theme(axis.text.x = element_text(angle = 90, vjust = .2, hjust = 1))
g <- g + labs(x = "Year")
g <- g + labs(y = "Number of Events")
g <- g + labs(title = "Weather Events Recorded per Year")
g
The above plot already shows that
- the number of weather events recorded has increased, and
- around 1995 the magnitude of the year-over-year increase has itself risen sharply.
Let us examine a boxplot to clearly see where the bulk of the recorded events per year is located:
###############################################################################
# Plot option #2: box plot
###############################################################################
# ggplot2's boxplot function - unlike the standard plot function boxplot() - always
# needs a factor in addition to a vector. That is why I give it a "Fake factor" that
# consists only of one element. geom_boxplot() creates a boxplot for each factor.
g <- ggplot(data, aes(y = BGN_DATE, x = factor("Events")))
g <- g + geom_boxplot()
g <- g + theme(strip.text.x = element_blank())
g <- g + labs(x = "")
g <- g + labs(y = "Year")
g <- g + labs(title = "Distribution over Time of Weather Events Recorded")
g
The amount of data collected increases throughout the entire measurement period. The last year for which data is available shows the largest jump. This can also clearly be seen in the quartile border dates reported further down.
The results from the box and dot plots of the number of events recorded per year (the code for both is available above) show that there was a significant jump in the number of events recorded per year around 1995, which can only be attributed to a change in recording behavior, since there was no correspondingly large jump in actual weather patterns. As a result, most of the data is concentrated in the period after 1995.
Here is a text summary of the quartile boundary dates of our data set:
as.Date(quantile(as.integer(data$BGN_DATE)), origin = "1970-01-01")
## 0% 25% 50% 75% 100%
## "1950-01-03" "1995-04-20" "2002-03-18" "2007-07-28" "2011-11-30"
The dates shown are the border dates of the respective percentiles. The 0% and the 100% percentile are the minimum (min(data$BGN_DATE): 1950-01-03) and maximum date (max(data$BGN_DATE): 2011-11-30) in our data set, respectively.
The 25%, 50% and 75% dates are the ones where 25%, 50% and 75% of all the data is before those dates.
We can see that the first 45 years (1950-1995) contain only about 25% of the available data!
Let’s prepare to examine the CROPDMGEXP and PROPDMGEXP damage value modifiers.
From page 12 of “STORM DATA PREPARATION” (available at http://www.nws.noaa.gov/directives/, search for “STORM DATA PREPARATION”, current URL (as of 22 Apr, 2015) is http://www.nws.noaa.gov/directives/sym/pd01016005curr.pdf):
Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.
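To make the convention concrete, here is a minimal sketch (my own illustration, not code from the analysis itself) that turns such a magnitude character into a numeric multiplier:

# Illustration only: map a magnitude character to its multiplier.
# The unnamed default of 1 stands in for "no modifier".
exp_to_multiplier <- function(e) {
    switch(toupper(e), "K" = 1e3, "M" = 1e6, "B" = 1e9, 1)
}
# The example from the documentation: 1.55B means $1,550,000,000
1.55 * exp_to_multiplier("B")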
################################################################################
# JUST SETUP, this section has no output
# A function is defined - but not run!
################################################################################
modifierUsageByYear <- function (range) {
# This creates a list of data frames with one list item per year
yearList <- apply(as.matrix(range), 1, FUN = function (year) {
startDate <- as.Date(paste("01/01/", year, sep = ""), format = "%d/%m/%Y")
endDate <- as.Date(paste("31/12/", year, sep = ""), format = "%d/%m/%Y")
# Treating the table result as data frame makes the next steps easier and clearer.
a <- as.data.frame(table(data[data$BGN_DATE >= startDate & data$BGN_DATE <= endDate, "PROPDMGEXP"]))
b <- as.data.frame(table(data[data$BGN_DATE >= startDate & data$BGN_DATE <= endDate, "CROPDMGEXP"]))
# Intermediate step on the way to get the total number for all modifiers
# for either crop or property damage values - place them in two columns
# but at least we now have them in the same data frame.
r <- merge(a, b, by = "Var1", all = TRUE)
# Replace all NA with 0. Those were the result of adding numbers and NA
# which in this specific case actually IS zero (NA very often is **not**
# the same as zero!)
r[is.na(r)] <- 0
# The addition of both modifier columns is what we want (not the sum() function!)
r$Count <- r$Freq.x + r$Freq.y
# Drop the two columns for PROPDMGEXP and CROPDMGEXP modifier frequency
# since we are only interested in the total
r <- r[, -c(2,3)]
# return the result, a data frame with two columns (modifier and count)
r
})
# Create a single data frame from the list of data frames. The modifier string
# column gets filled with all modifiers that appear in any of the years, and
# the numeric counter columns are appended one for each year.
yearList <- Reduce(function(x, y) merge(x, y, all = TRUE, by = "Var1"), yearList)
# Replace all NA with 0. Those were the result of adding numbers and NA
# which in this specific case actually IS zero (NA very often is **not**
# the same as zero!)
yearList[is.na(yearList)] <- 0
# Meaningful column names: Those are years (except the 1st one)
colnames(yearList) <- c("Modifier", as.character(range))
# row names are just numeric row indexes 1,2,3,4,...
print(yearList, row.names = FALSE)
}
Now let us have a look at which modifiers are actually being used in which years. I only examine years from 1980 onward, because data that is even older does not help much in determining the impact of severe weather events today; see my detailed argument below, after the cutoff year for my analysis is selected.
# Adjust output width upwards for the table that is the result of this chunk
options(width = 108)
# In two steps so that we get two tables, otherwise we would get a table that
# is too wide and columns wrapping around, making it harder to read since the
# first column will be missing from the wrapped-around columns.
modifierUsageByYear(1980:1995)
## Modifier 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995
## 11320 8204 13083 15649 13650 15185 16603 14039 13741 19899 20628 23836 25664 18817 30572 45350
## K 847 758 1038 867 822 676 743 644 676 803 1090 1122 1265 6055 10262 9565
## M 125 72 143 128 198 97 106 51 97 118 174 86 139 330 365 711
## ? 0 0 0 0 0 0 0 0 0 0 0 0 0 4 1 10
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 37 194
## 6 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 3
## B 0 0 0 0 0 0 0 0 0 0 0 0 0 4 1 8
## + 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 4
## 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 3
## 5 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 26
## h 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0
## m 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 7
## k 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18 3
## - 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 25
## 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14
## 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4
## 7 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5
## 8 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
## H 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6
modifierUsageByYear(1996:2011)
## Modifier 1996 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011
## 53191 45344 59932 51053 55059 57346 59990 66107 66684 67532 67016 0 0 0 0 0
## K 10663 11486 15611 11052 13496 12159 12225 12789 11523 10379 20480 86024 110457 91117 95764 123399
## M 686 529 712 472 386 418 371 606 514 446 570 554 867 517 556 941
## B 0 1 1 1 1 1 0 2 5 11 2 0 2 0 2 7
## 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
As we can see from the results, the year 1995 (and, to a lesser extent, 1994) stands out because it introduces a lot of undefined damage value modifiers.
There is only one unusual value in the period after 1996, a “0”, so let’s examine the event:
print(data[(data$PROPDMGEXP == "0" | data$CROPDMGEXP == "0") &
data$BGN_DATE >= as.Date("01/01/1996", format = "%m/%d/%Y"), ],
row.names = FALSE)
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 2011-07-27 FLASH FLOOD 0 0 0 0 0 K
This event does not have any fatalities/injuries or damage associated with it, so not knowing what the “0” for PROPDMGEXP is supposed to mean will have no influence on the result.
Because of the sharp increase in the number of recorded events around 1995 and the many undefined damage value modifiers introduced in 1994 and 1995, I choose to disregard all data prior to the following cutoff year:
# I will ignore all data prior to this year:
CUTOFF_YEAR <- 1996
This is going to leave me with this much of the total data available:
excludedRows <- nrow(data[data$BGN_DATE < as.Date(paste("01/01/", CUTOFF_YEAR, sep = ""), format = "%m/%d/%Y"), ])
includedRows <- nrow(data[data$BGN_DATE >= as.Date(paste("01/01/", CUTOFF_YEAR, sep = ""), format = "%m/%d/%Y"), ])
# Percent of rows kept for analysis
noquote(paste(round(100 * includedRows / (excludedRows + includedRows), 1), "%", sep = ""))
## [1] 72.4%
So leaving out the very old data and the hard-to-interpret damage value data of 1994 and 1995 still leaves me with almost 3/4 of all data, and some of the reasons I gave, like the missing adjustment for inflation and the quickly increasing number of observations per period, would even suggest putting the cutoff year later.
# Remove all rows of data from years prior to the (beginning of the) chosen cutoff year
data <- data[data$BGN_DATE >= as.Date(paste("01/01/", CUTOFF_YEAR, sep = ""), format = "%m/%d/%Y"), ]
The number of events per year in the selected years ranges from (min to max):
noquote(format(min(ev_by_year[ev_by_year$Year >= CUTOFF_YEAR, "Count"]), big.mark = ","))
## [1] 28,680
noquote(format(max(ev_by_year[ev_by_year$Year >= CUTOFF_YEAR, "Count"]), big.mark = ","))
## [1] 62,174
Let us examine how many bad modifiers there are, i.e. values that are neither empty (multiplier 1), "K" (1,000), "M" (1,000,000) nor "B" (1,000,000,000):
There is 1 bad property damage value modifier and 0 bad crop damage value modifiers.
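For reference, here is a small sketch (my reconstruction; the actual inline expressions are not visible in the rendered text) of how these counts can be computed:

# Count modifiers that are neither empty nor K, M or B (case-insensitive)
valid <- c("", "K", "M", "B")
sum(!(toupper(data$PROPDMGEXP) %in% valid))  # bad property damage modifiers -> 1
sum(!(toupper(data$CROPDMGEXP) %in% valid))  # bad crop damage modifiers     -> 0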
Here is an overview of all modifiers (the column without a header is for "no modifier", i.e. an empty string):
# Property damage
table(data$PROPDMGEXP)
##
## 0 B K M
## 276185 1 32 369938 7374
# Crop damage
table(data$CROPDMGEXP)
##
## B K M
## 373069 4 278686 1771
Modify the dollar damage amounts by multiplying them with the multiplier that corresponds to their PROPDMGEXP/CROPDMGEXP modifier:
# K = Thousand
# M = Million
# B = Billion
# Initialize the multipliers for crop and property damage to 1
multiplierC <- rep(1, nrow(data))
multiplierP <- rep(1, nrow(data))
multiplierC[which(toupper(data$CROPDMGEXP) == "K")] <- 1e3
multiplierC[which(toupper(data$CROPDMGEXP) == "M")] <- 1e6
multiplierC[which(toupper(data$CROPDMGEXP) == "B")] <- 1e9
multiplierP[which(toupper(data$PROPDMGEXP) == "K")] <- 1e3
multiplierP[which(toupper(data$PROPDMGEXP) == "M")] <- 1e6
multiplierP[which(toupper(data$PROPDMGEXP) == "B")] <- 1e9
# Multiply the dollar values with the respective multiplier
data$CROPDMG <- data$CROPDMG * multiplierC
data$PROPDMG <- data$PROPDMG * multiplierP
# Create a new column with the combined total damage amount
data$DAMAGE <- data$CROPDMG + data$PROPDMG
I do not adjust damages for inflation.
Here is preliminary information for how to adjust for inflation, and why:
The farther in the past a recorded event lies, the less comparable its reported damage values are to today's, due to the cumulative effect of inflation. For example, using available CPI (Consumer Price Index) data for the US we can calculate that $100 spent in 1980 are worth about $286 in 2015!
So we would have to calculate a vector that backwards-accumulates the effect of inflation (further in the past equals higher adjustments) and multiply the damage values with that vector. The adjustment depends on the date.
Inflation data (consumer price index) for 1950-2012 can be downloaded from https://research.stlouisfed.org/fred2/series/CPIAUCSL/downloaddata. There are also R packages (aimed at traders) that allow downloading inflation numbers from a federal database on the fly.
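A minimal sketch of what such an adjustment could look like, using placeholder CPI values purely for illustration (the real series would come from the CPIAUCSL data linked above; all object and column names below are my own and not part of this analysis):

# Hypothetical yearly CPI table - the numbers are rough placeholders
cpi <- data.frame(Year = c(1996, 2005, 2011),
                  CPI  = c(157, 195, 225))
# Express everything in 2011 dollars: older years get larger adjustment factors
cpi$Factor <- cpi$CPI[cpi$Year == 2011] / cpi$CPI
# Example: 1 million dollars of damage reported in 1996, expressed in 2011 dollars
1e6 * cpi$Factor[cpi$Year == 1996]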
Now that the actual damage dollar estimates are available, and before going on to the final analysis, I check for outliers in the damage data to see whether my result might depend on very few extreme values.
I only have to look at outliers in the positive direction, since the lower limit of "0" is very common: format(sum(data$DAMAGE == 0), big.mark = ",", scientific = FALSE) shows 459,005 rows.
So I order the data a) by damage and b) by "health impact" (the sum of fatalities and injuries; since I am only looking for extremes it does not matter that I simply add those numbers), both in descending order.
The top single events in terms of economic damage (Note: even though I removed a lot of rows before this point every row still is one event):
print(
format(
head(
data[
order(data$DAMAGE, decreasing = TRUE), c("BGN_DATE","EVTYPE","CROPDMG","PROPDMG","DAMAGE")],
10),
big.mark = ",", scientific = FALSE),
row.names = FALSE)
## BGN_DATE EVTYPE CROPDMG PROPDMG DAMAGE
## 2006-01-01 FLOOD 32,500,000 115,000,000,000 115,032,500,000
## 2005-08-29 STORM SURGE 0 31,300,000,000 31,300,000,000
## 2005-08-28 HURRICANE/TYPHOON 0 16,930,000,000 16,930,000,000
## 2005-08-29 STORM SURGE 0 11,260,000,000 11,260,000,000
## 2005-10-24 HURRICANE/TYPHOON 0 10,000,000,000 10,000,000,000
## 2005-08-29 HURRICANE/TYPHOON 1,510,000,000 5,880,000,000 7,390,000,000
## 2005-08-28 HURRICANE/TYPHOON 0 7,350,000,000 7,350,000,000
## 2004-08-13 HURRICANE/TYPHOON 285,000,000 5,420,000,000 5,705,000,000
## 2001-06-05 TROPICAL STORM 0 5,150,000,000 5,150,000,000
## 2004-09-04 HURRICANE/TYPHOON 93,200,000 4,830,000,000 4,923,200,000
The top single events in terms of “harmful for population health”:
print(
format(
head(
data[
order(data$FATALITIES + data$INJURIES, decreasing = TRUE), c("BGN_DATE","EVTYPE","FATALITIES","INJURIES")],
10),
big.mark = ",", scientific = FALSE),
row.names = FALSE)
## BGN_DATE EVTYPE FATALITIES INJURIES
## 2011-05-22 TORNADO 158 1,150
## 2011-04-27 TORNADO 44 800
## 1998-10-17 FLOOD 2 800
## 2004-08-13 HURRICANE/TYPHOON 7 780
## 1998-10-17 FLOOD 0 750
## 2011-04-27 TORNADO 20 700
## 1998-10-17 FLOOD 11 600
## 1998-10-17 FLOOD 0 550
## 2007-08-04 EXCESSIVE HEAT 2 519
## 1998-10-17 FLOOD 6 500
There does seem to be a single event that stands out in the economic damages category.
That event is the Napa Valley flood from December 2005 to January 2006. All sources report no more than 100-200 million dollars in damages.
Therefore this is clearly an erroneous data entry for “PROPDMG”, mistakenly using a “B” (billions) modifier instead of an “M” (millions).
Since this one error very heavily skews the result for economic damages I am going to manually change this datum from 115 BILLION in damages to 115 MILLION:
To check what changed I print the changed row before and after the change below.
# Create a logical vector to find the row with the outlier - from the above I know it is
# exactly one row.
v <- data$BGN_DATE == as.Date("2006-01-01") & data$EVTYPE == "FLOOD" & data$DAMAGE == max(data$DAMAGE)
# Check BEFORE the change - this should result in one row being printed:
# The first column is the row ID, which I don't suppress this time because it helps
# verify that the correct row has been modified (see "after" output below too)
format(data[v, ], big.mark = ",", scientific = FALSE)
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 605953 2006-01-01 FLOOD 0 0 115,000,000,000 B 32,500,000 M
## DAMAGE
## 605953 115,032,500,000
# Replace the PROPDMG and DAMAGE column values in the outlier row by the same values divided by 1,000
data[v, c("PROPDMG","DAMAGE")] <- data[v, c("PROPDMG","DAMAGE")] / 1000
# Unused, but for cosmetics and in case something is added to this code later:
# Fix the modifier too, change it from "B" to "M"
data[v, "PROPDMGEXP"] <- "M"
# Check the result (complete row):
format(data[v, ], big.mark = ",", scientific = FALSE)
## BGN_DATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP DAMAGE
## 605953 2006-01-01 FLOOD 0 0 115,000,000 M 32,500,000 M 115,032,500
How many unique EVTYPE strings do we have?
length(unique(data$EVTYPE))
## [1] 430
That is a lot — but we only need to consider those EVTYPE strings that had adverse health or economic effects:
# How does one weigh a death vs. "only" an injury, especially if an injury could be anything
# from total paralysis from the neck down to minor skin abrasions?
# To answer "...which types of events are most harmful with respect to population health?"
# I add deaths and injuries and order by the resulting total, descending. This lumps
# incomparable things into one number, but I have to in order to come up with a single
# hierarchy of the "badness" of weather event types.
data$INJURY_AND_DEATH <- data$INJURIES + data$FATALITIES
# Sum up the columns FATALITIES + INJURIES per unique EVTYPE string. The resulting data
# frame will have a column "Type" with the EVTYPE string and a column "InjuriesAndDeath"
# with the sum of FATALITIES and INJURIES for that EVTYPE.
outcomes <- aggregate(list(InjuriesAndDeath = data$INJURY_AND_DEATH),
by = list(Type = data$EVTYPE),
FUN = sum)
# Sum up the column DAMAGE per unique EVTYPE string. The resulting data frame will have
# a column "Type" with the EVTYPE string and a column "Damage" with the damage dollar amount.
# Merge the resulting data frame with the previous result on "Type", which adds a
# fourth column "Damage".
outcomes <- merge(outcomes, aggregate(list(Damage = data$DAMAGE),
by = list(Type = data$EVTYPE),
FUN = sum),
by = "Type")
# Remove all outcomes that had no adverse health or property effects
outcomes <- subset(outcomes, InjuriesAndDeath > 0 | Damage > 0)
This leaves us with length(outcomes$Type) = 183 unique EVTYPE strings for events that are significant for the analysis.
I decided not to merge similar EVTYPE strings into broader categories. The main reason is comparability of results: highly subjective decisions about which EVTYPE strings to merge into one make the result (even) less comparable.
For example, one could summarize several EVTYPE types under a single string. What is the difference between “FLASH FLOOD” and “FLOOD” after all?
However, that process is too subjective, different people analyzing this data will inevitably come up with different policies! One person will put the above example into one item “FLOOD”, the next person might say that flash floods are very different events, happening in very different areas from other floods. Coastal and river floods are different for some people and not for others.
This subjectivity was already present when the data was entered, and I do not want to add my own on top, not even for strings where it may seem "obvious" that they are one and the same, like "WINTER WEATHER MIX" and "WINTER WEATHER/MIX". I do not really know what the people entering the data were thinking; maybe the distinction does have a meaning to them.
More importantly, as I said, the result of this process is much more subjective than any of the changes I applied so far. For example, is "cold rain" part of "cold" (just like the many other kinds of "cold weather" that were easy to subsume under "cold") or part of "rain" (which also had easy subsumption cases)? When you start with the various kinds of floods, what is a flood due to heavy wind: part of "flood" or part of "heavy wind"? Is "COLD AND SNOW" "cold weather" or "snow"? Or should I put all of snow under "cold weather"? But snow is a form of precipitation, so when I keep "rain", also keeping "snow" makes sense. There also is "WINTER WEATHER" (in several variants); is that "COLD" or "SNOW" (and which)? Wind and hail: is that WIND or HAIL? "EXTREME WINDCHILL": COLD or WIND? And which kinds of flood should I subsume under one type of flood? Coastal floods, flash floods (not uncommon in the desert) and river floods are VERY different things, but the latter two are actually part of "RAIN"…
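Just to make explicit what such a merge policy would look like if one did choose to apply it, here is a small sketch (applied to a copy so the actual analysis data stays untouched; the groupings are hypothetical and are not used anywhere in this report):

# Hypothetical merge policy, shown only to illustrate the subjectivity discussed above
merged <- data$EVTYPE
merged[merged %in% c("RIP CURRENT", "RIP CURRENTS")]           <- "RIP CURRENT"
merged[merged %in% c("FLOOD", "FLASH FLOOD", "COASTAL FLOOD")] <- "FLOOD"
length(unique(merged))  # fewer unique types than length(unique(data$EVTYPE))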
A log plot does not make sense for this analysis because we explicitly want to concentrate on the events with the highest number of injuries and/or fatalities, or the highest property or crop damage values.
My analysis is for the United States as a whole, not broken down by state. I refer to this forum discussion and others.
I also disregard the time component in the data (apart from having selected only data from 1996 onward). Making predictions and analyzing past developments over time is not part of the questions, and above I discussed reasons why, based on this particular data set, it may not be a good idea anyway.
- Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
The 20 worst types of event since 1996 in terms of health-related harm, using the total sum of fatalities and injuries per event type over the entire selected period, are:
I chose to add fatalities and injuries together to have just one total count, since it is not quite clear how I would weigh fatalities vs. injuries, and since my goal was to get a very broad overview.
# Subset: Ordered by number of fatalities and injuries and ignore the economic-damages column
# and all event types that had no health impact
health_outcomes <- subset(outcomes[order(outcomes$InjuriesAndDeath, decreasing = TRUE), ],
InjuriesAndDeath > 0,
c("Type", "InjuriesAndDeath"))
# Make the "Type" column an *ordered*(!) factor, using the order established just above.
# This is to prevent the plot function from doing its own ordering of the categorical
# variable "Event" used for the x-axis.
health_outcomes$Type <- factor(health_outcomes$Type, levels = health_outcomes$Type)
# Print a table of the 20 worst event types for health impact (the bar plot below shows the top 15)
print(noquote(format(head(health_outcomes, 20), big.mark = ",", scientific = FALSE)), row.names = FALSE)
## Type InjuriesAndDeath
## TORNADO 22,178
## EXCESSIVE HEAT 8,188
## FLOOD 7,172
## LIGHTNING 4,792
## TSTM WIND 3,870
## FLASH FLOOD 2,561
## THUNDERSTORM WIND 1,530
## WINTER STORM 1,483
## HEAT 1,459
## HURRICANE/TYPHOON 1,339
## HIGH WIND 1,318
## WILDFIRE 986
## HEAVY SNOW 805
## FOG 772
## HAIL 720
## WILD/FOREST FIRE 557
## RIP CURRENT 549
## RIP CURRENTS 496
## BLIZZARD 455
## ICE STORM 400
g <- ggplot(head(health_outcomes, 15), aes(x = Type, y = InjuriesAndDeath))
g <- g + geom_bar(stat = "identity", aes(fill = as.factor(InjuriesAndDeath)))
g <- g + guides(fill = guide_legend(title = "Total Count", reverse = TRUE))
g <- g + theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1))
g <- g + labs(x = "")
g <- g + labs(y = "Deaths and Injuries")
g <- g + labs(title = paste("Deaths and Injuries for the 20 Worst Types of Event since", CUTOFF_YEAR))
g
Note: If some event types (x-axis) seem to appear twice (e.g. "RIP CURRENT" and "RIP CURRENTS"), that is a consequence of not merging similar EVTYPE strings, as discussed above, and not something I am going to address in this project.
- Across the United States, which types of events have the greatest economic consequences?
The 20 worst types of event since 1996 in terms of economic damage, using the total damage in billions of dollars per event type over the entire selected period, are:
I added crop and property damage to have just one sum.
# Subset: Ordered by total damage; ignore the health-impact column
# and all event types that had no economic impact
economic_outcomes <- subset(outcomes[order(outcomes$Damage, decreasing = TRUE), ],
Damage > 0,
c("Type", "Damage"))
# Convert the dollar amounts from "dollars" to "billions of dollars"
economic_outcomes$Damage <- economic_outcomes$Damage / 1e9
# Make the "Type" column an *ordered*(!) factor, using the order established just above.
# This is to prevent the plot function from doing its own ordering of the categorical
# variable "Event" used for the x-axis.
economic_outcomes$Type <- factor(economic_outcomes$Type, levels = economic_outcomes$Type)
# For display remove the number of digits
economic_outcomes$Damage <- round(economic_outcomes$Damage, digits = 2)
# Print a table of the 20 worst event types for economic damage (the bar plot below shows the top 15)
print(noquote(format(head(economic_outcomes, 20), big.mark = ",", scientific = FALSE)), row.names = FALSE)
## Type Damage
## HURRICANE/TYPHOON 71.91
## STORM SURGE 43.19
## FLOOD 34.00
## TORNADO 24.90
## HAIL 17.07
## FLASH FLOOD 16.56
## HURRICANE 14.55
## DROUGHT 14.41
## TROPICAL STORM 8.32
## HIGH WIND 5.88
## WILDFIRE 5.05
## TSTM WIND 5.04
## STORM SURGE/TIDE 4.64
## THUNDERSTORM WIND 3.78
## ICE STORM 3.66
## WILD/FOREST FIRE 3.11
## WINTER STORM 1.54
## EXTREME COLD 1.33
## HEAVY RAIN 1.31
## FROST/FREEZE 1.10
g <- ggplot(head(economic_outcomes, 15), aes(x = Type, y = Damage))
g <- g + geom_bar(stat = "identity", aes(fill = as.factor(Damage)))
g <- g + guides(fill = guide_legend(title = "Total Billion USD", reverse = TRUE))
g <- g + theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1))
g <- g + labs(x = "")
g <- g + labs(y = "Economic Damage (in Billion US-Dollars)")
g <- g + labs(title = paste("Total Economic Damage for the 20 Worst Types of Event since", CUTOFF_YEAR))
g
Note: Same as above, if some event types (x-axis) seem to appear twice, that is the EVTYPE issue discussed earlier and not something I am going to address in this project.