Synopsis

In this report I analyze the health and economic impact of severe weather events in the United States, using data recorded from 1950 to 2011. My goal was to get a simple and broad overview without going into too much detail. I selected the relevant parts of the data and performed some minor data cleaning.

I chose to only look at data after 1996 for reasons explained in detail in the document.

The worst three types of event in terms of health impact since 1996 are tornadoes (22,178 deaths and injuries), excessive heat (8,188 deaths and injuries) and floods (7,172 deaths and injuries).

The worst three types of event in terms of economic damage since 1996 are hurricanes/typhoons (~72 billion dollars), storm surges (~43 billion dollars) and floods (~34 billion dollars, after correcting one misreported flood damage value as described in the outlier section).

More details and the top 20 event types in both categories are available in the results section.

Data Processing

Loading the data

library(ggplot2)
Sys.setlocale("LC_ALL", "English")
# When using an HTTPS URL we cannot use the standard download method, see
# http://stackoverflow.com/questions/19890633/r-produces-unsupported-url-scheme-error-when-getting-data-from-https-sites
# I changed the URL to HTTP (no SSL), which in this case is the simplest solution.
if (! file.exists("StormData.csv.bz2")) {
    download.file("http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                  destfile = "StormData.csv.bz2",
                  method = "curl")
}

# reading a compressed file: See http://stackoverflow.com/questions/25948777/extract-bz2-file-in-r
data <- read.csv("StormData.csv.bz2", stringsAsFactors = FALSE)

Finding the Documentation for the data

The documentation of the data is available as document “STORM DATA PREPARATION” at http://www.nws.noaa.gov/directives/ by searching for “STORM DATA PREPARATION”. The current URL (as of 22 Apr, 2015) is http://www.nws.noaa.gov/directives/sym/pd01016005curr.pdf.

Filtering the data set’s columns

I decided to use only the columns listed below because all others contain data not relevant to the questions I set out to answer. For example, I don’t need the exact location because all data is from the US and the questions are about the US as a whole. Of the date and length columns I only use BGN_DATE (the start date of the event); the exact time is not relevant, nor are the duration and the area over which an event extends.

I only care about type of event, date, and damage to people and property, so I keep only the following columns:

data <- data[, c("BGN_DATE",
                 "STATE",
                 "EVTYPE",
                 "FATALITIES",
                 "INJURIES",
                 "PROPDMG", "PROPDMGEXP",
                 "CROPDMG", "CROPDMGEXP")]

Columns PROPDMGEXP and CROPDMGEXP will be used to adjust the damage dollar amounts and then discarded, see below.

Strip whitespace from EVTYPE

I have to trim the EVTYPE strings because some have leading and/or trailing whitespace:

# Define a helper function to trim leading or trailing whitespace from strings
trim <- function (x) gsub("^\\s+|\\s+$", "", x)

EVTYPE_trimmed  <- unique(trim(data$EVTYPE))
EVTYPE_original <- unique(data$EVTYPE)

# Are there any EVTYPE strings with leading or trailing whitespace?
EVTYPE_original[!(EVTYPE_original %in% EVTYPE_trimmed)]
## [1] " COASTAL FLOOD"        " TSTM WIND"            " LIGHTNING"           
## [4] " TSTM WIND (G45)"      " WIND"                 " FLASH FLOOD"         
## [7] " WATERSPOUT"           "   HIGH SURF ADVISORY"
# Remove any leading or trailing whitespace from the EVTYPE strings before doing
# any further processing:
data$EVTYPE <- trim(data$EVTYPE)

Make all strings in EVTYPE consistently UPPERCASE

The vast majority of EVTYPE strings are uppercase.

The number of unique EVTYPE strings before the change is:

length(unique(data$EVTYPE))
## [1] 977

Convert all EVTYPE to upper case:

data$EVTYPE <- toupper(data$EVTYPE)

The number of unique EVTYPE strings after the change is:

length(unique(data$EVTYPE))
## [1] 890

This took care of a number of duplicate entries, where one version of the string was all uppercase and the other version was not.
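As an aside, here is a minimal sketch (not part of the processing pipeline) that lists which strings were merged by the case-folding, based on the case-sensitive unique values saved above in EVTYPE_trimmed before the toupper() conversion:

# ILLUSTRATIVE ONLY: which trimmed EVTYPE strings collapse into the same value
# after case-folding? EVTYPE_trimmed was created above, before toupper() was applied.
folded     <- toupper(EVTYPE_trimmed)
collisions <- EVTYPE_trimmed[folded %in% folded[duplicated(folded)]]
sort(collisions)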

I will look at the contents of the EVTYPE column later, after examining if it makes sense to ignore older events.

Convert the date column to Date objects

After filtering the columns I convert date strings to dates:

data$BGN_DATE <- as.Date(data$BGN_DATE, format = "%m/%d/%Y")

Examine the number of events reported per year

Examine which years we have data for and how many events were recorded per year, to get a sense of how unevenly the measurements are distributed over time:

# Adjust output width upwards for the table that is the result of this chunk
options(width = 108)

# By using lists I can give names to the columns created in the result data frame
ev_by_year <- aggregate(list(Count = data$EVTYPE),
                        by = list(Year = 1900 + as.POSIXlt(data$BGN_DATE)$year),
                        FUN = length)

# For printing: use the horizontal space on the page, this item is used only
# for displaying information so we can focus on formatting and readability entirely
# The t() command returns a matrix!
ev_by_year_P <- t(ev_by_year)

# Use the first row of years as column names...
colnames(ev_by_year_P) <- paste("| ", ev_by_year_P[1,])

# ...and then throw away the original row with the years.
# This turns the matrix into a (numeric) vector!
ev_by_year_P <- ev_by_year_P[-1,]

# This creates a vector of strings.
ev_by_year_P <- format(ev_by_year_P, big.mark = ",", scientific = FALSE)

# Print the now horizontal table and apply some formatting to the numbers
noquote(ev_by_year_P)
## |  1950 |  1951 |  1952 |  1953 |  1954 |  1955 |  1956 |  1957 |  1958 |  1959 |  1960 |  1961 |  1962 
##     223     269     272     492     609   1,413   1,703   2,184   2,213   1,813   1,945   2,246   2,389 
## |  1963 |  1964 |  1965 |  1966 |  1967 |  1968 |  1969 |  1970 |  1971 |  1972 |  1973 |  1974 |  1975 
##   1,968   2,348   2,855   2,388   2,688   3,312   2,926   3,215   3,471   2,168   4,463   5,386   4,975 
## |  1976 |  1977 |  1978 |  1979 |  1980 |  1981 |  1982 |  1983 |  1984 |  1985 |  1986 |  1987 |  1988 
##   3,768   3,728   3,657   4,279   6,146   4,517   7,132   8,322   7,335   7,979   8,726   7,367   7,257 
## |  1989 |  1990 |  1991 |  1992 |  1993 |  1994 |  1995 |  1996 |  1997 |  1998 |  1999 |  2000 |  2001 
##  10,410  10,946  12,522  13,534  12,607  20,631  27,970  32,270  28,680  38,128  31,289  34,471  34,962 
## |  2002 |  2003 |  2004 |  2005 |  2006 |  2007 |  2008 |  2009 |  2010 |  2011 
##  36,293  39,752  39,363  39,184  44,034  43,289  55,663  45,817  48,161  62,174

The next two plots serve to determine the implications of removing years from the analysis by checking for the distribution of the number of reported events over the years.

###############################################################################
# Plot option #1: dot plot
###############################################################################
g <- ggplot(ev_by_year, aes(Year, Count))
g <- g + geom_point()

# The default scale has too few ticks and labels. Create a more reasonable
# amount of ticks and labels - but make it dependent on the actual data.
g <- g + scale_x_continuous(breaks = round(seq(min(ev_by_year$Year), max(ev_by_year$Year), by = 5), 1))

# The calculations for the y-axis look so complicated because I don't want the
# ticks/labels to start at an uneven number, but use full thousands. Negative
# values for round() make it round to the nearest 100/1000/etc.
g <- g + scale_y_continuous(breaks = seq(round(min(ev_by_year$Count), -floor(log10(max(ev_by_year$Count)))),
                                         max(ev_by_year$Count),
                                         by = 10^floor(log10(max(ev_by_year$Count)))))

# Vertical text on the x-axis allows for more labels
g <- g + theme(axis.text.x = element_text(angle = 90, vjust = .2, hjust = 1))

g <- g + labs(x = "Year")
g <- g + labs(y = "Number of Events")
g <- g + labs(title = "Weather Events Recorded per Year")
g

The above plot already shows that:

  • the number of weather events recorded has increased over time
  • around 1995 the increase per year itself became much steeper

Let us examine a boxplot to clearly see where the bulk of the recorded events per year is located:

###############################################################################
# Plot option #2: box plot
###############################################################################
# ggplot2's geom_boxplot() - unlike the base graphics function boxplot() - always
# needs a factor in addition to a vector. That is why I give it a "Fake factor" that
# consists only of one element. geom_boxplot() creates a boxplot for each factor.
g <- ggplot(data, aes(y = BGN_DATE, x = factor("Events")))
g <- g + geom_boxplot()
g <- g + theme(strip.text.x = element_blank()) 
g <- g + labs(x = "")
g <- g + labs(y = "Year")
g <- g + labs(title = "Distribution over Time of Weather Events Recorded")
g

The amount of data collected increases throughout the entire measurement period, and the last year for which data is available shows the largest jump. This can also clearly be seen in the quartile boundary dates reported further down.

The box and dot plots of the number of events recorded per year (not shown here, but the code is available above) show a significant jump around 1995, which can only be attributed to a change in recording behavior, since there was no comparably large jump in actual weather patterns. As a result, most of the data is concentrated in the period after 1995.

Here is a text summary of the quartile boundary dates of our data set:

as.Date(quantile(as.integer(data$BGN_DATE)), origin = "1970-01-01")
##           0%          25%          50%          75%         100% 
## "1950-01-03" "1995-04-20" "2002-03-18" "2007-07-28" "2011-11-30"

The dates shown are the border dates of the respective percentiles. The 0% and the 100% percentile are the minimum (min(data$BGN_DATE): 1950-01-03) and maximum date (max(data$BGN_DATE): 2011-11-30) in our data set, respectively.

The 25%, 50% and 75% dates are the ones where 25%, 50% and 75% of all the data is before those dates.

We can see that the first 45 years (1950-1995) barely contain 25% of the available data!

Examine the crop and property damage value modifier columns

Let’s prepare to examine the CROPDMGEXP and PROPDMGEXP damage value modifiers.

From page 12 of “STORM DATA PREPARATION” (available at http://www.nws.noaa.gov/directives/, search for “STORM DATA PREPARATION”, current URL (as of 22 Apr, 2015) is http://www.nws.noaa.gov/directives/sym/pd01016005curr.pdf):

Estimates should be rounded to three significant digits, followed by an alphabetical character signifying the magnitude of the number, i.e., 1.55B for $1,550,000,000. Alphabetical characters used to signify magnitude include “K” for thousands, “M” for millions, and “B” for billions.
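To make this convention concrete, here is a small illustrative helper (hypothetical, not used in the processing below, which handles the modifiers directly) that expands a value/modifier pair into a plain dollar amount:

# ILLUSTRATIVE ONLY: expand a damage value and its magnitude character into dollars.
# Empty or unknown modifiers fall back to a multiplier of 1, mirroring the
# treatment applied to the full data set further below.
expandDamage <- function (value, modifier) {
    multiplier <- c(K = 1e3, M = 1e6, B = 1e9)[toupper(modifier)]
    multiplier[is.na(multiplier)] <- 1
    value * multiplier
}

expandDamage(1.55, "B")   # 1,550,000,000 - the example from the documentation
expandDamage(25, "K")     # 25,000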

################################################################################
# JUST SETUP, this section has no output
# A function is defined - but not run!
################################################################################

modifierUsageByYear <- function (range) {
    # This creates a list of data frames with one list item per year
    yearList <- apply(as.matrix(range), 1, FUN = function (year) {
        startDate <- as.Date(paste("01/01/", year, sep = ""), format = "%d/%m/%Y")
        endDate   <- as.Date(paste("31/12/", year, sep = ""), format = "%d/%m/%Y")

        # Treating the table result as data frame makes the next steps easier and clearer.
        a <- as.data.frame(table(data[data$BGN_DATE >= startDate & data$BGN_DATE <= endDate, "PROPDMGEXP"]))
        b <- as.data.frame(table(data[data$BGN_DATE >= startDate & data$BGN_DATE <= endDate, "CROPDMGEXP"]))

        # Intermediate step on the way to get the total number for all modifiers
        # for either crop or property damage values - place them in two columns
        # but at least we now have them in the same data frame.
        r <- merge(a, b, by = "Var1", all = TRUE)

        # Replace all NA with 0. Those were the result of adding numbers and NA
        # which in this specific case actually IS zero (NA very often is **not**
        # the same as zero!)
        r[is.na(r)] <- 0

        # The addition of both modifier columns is what we want (not the sum() function!)
        r$Count <- r$Freq.x + r$Freq.y

        # Drop the two columns for PROPDMGEXP and CROPDMGEXP modifier frequency
        # since we are only interested in the total
        r <- r[, -c(2,3)]

        # return the result, a data frame with two columns (modifier and count)
        r
    })

    # Create a single data frame from the list of data frames. The modifier string
    # column gets filled with all modifiers that appear in any of the years, and
    # the numeric counter columns are appended one for each year.
    yearList <- Reduce(function(x, y) merge(x, y, all = TRUE, by = "Var1"), yearList)

    # Replace all NA with 0. Those were the result of adding numbers and NA
    # which in this specific case actually IS zero (NA very often is **not**
    # the same as zero!)
    yearList[is.na(yearList)] <- 0

    # Meaningful column names: Those are years (except the 1st one)
    colnames(yearList) <- c("Modifier", as.character(range))

    # row names are just numeric row indexes 1,2,3,4,...
    print(yearList, row.names = FALSE)
}

Now let us have a look at which modifiers are actually used in which years. I only examine years from 1980 onward, because even older data does not help much in assessing the impact of severe weather events today; see my detailed argument below, where I select the cutoff year for the analysis.

# Adjust output width upwards for the table that is the result of this chunk
options(width = 108)

# In two steps so that we get two tables, otherwise we would get a table that
# is too wide and columns wrapping around, making it harder to read since the
# first column will be missing from the wrapped-around columns.
modifierUsageByYear(1980:1995)
##  Modifier  1980 1981  1982  1983  1984  1985  1986  1987  1988  1989  1990  1991  1992  1993  1994  1995
##           11320 8204 13083 15649 13650 15185 16603 14039 13741 19899 20628 23836 25664 18817 30572 45350
##         K   847  758  1038   867   822   676   743   644   676   803  1090  1122  1265  6055 10262  9565
##         M   125   72   143   128   198    97   106    51    97   118   174    86   139   330   365   711
##         ?     0    0     0     0     0     0     0     0     0     0     0     0     0     4     1    10
##         0     0    0     0     0     0     0     0     0     0     0     0     0     0     3    37   194
##         6     0    0     0     0     0     0     0     0     0     0     0     0     0     1     0     3
##         B     0    0     0     0     0     0     0     0     0     0     0     0     0     4     1     8
##         +     0    0     0     0     0     0     0     0     0     0     0     0     0     0     1     4
##         4     0    0     0     0     0     0     0     0     0     0     0     0     0     0     1     3
##         5     0    0     0     0     0     0     0     0     0     0     0     0     0     0     2    26
##         h     0    0     0     0     0     0     0     0     0     0     0     0     0     0     1     0
##         m     0    0     0     0     0     0     0     0     0     0     0     0     0     0     1     7
##         k     0    0     0     0     0     0     0     0     0     0     0     0     0     0    18     3
##         -     0    0     0     0     0     0     0     0     0     0     0     0     0     0     0     1
##         1     0    0     0     0     0     0     0     0     0     0     0     0     0     0     0    25
##         2     0    0     0     0     0     0     0     0     0     0     0     0     0     0     0    14
##         3     0    0     0     0     0     0     0     0     0     0     0     0     0     0     0     4
##         7     0    0     0     0     0     0     0     0     0     0     0     0     0     0     0     5
##         8     0    0     0     0     0     0     0     0     0     0     0     0     0     0     0     1
##         H     0    0     0     0     0     0     0     0     0     0     0     0     0     0     0     6
modifierUsageByYear(1996:2011)
##  Modifier  1996  1997  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007   2008  2009  2010   2011
##           53191 45344 59932 51053 55059 57346 59990 66107 66684 67532 67016     0      0     0     0      0
##         K 10663 11486 15611 11052 13496 12159 12225 12789 11523 10379 20480 86024 110457 91117 95764 123399
##         M   686   529   712   472   386   418   371   606   514   446   570   554    867   517   556    941
##         B     0     1     1     1     1     1     0     2     5    11     2     0      2     0     2      7
##         0     0     0     0     0     0     0     0     0     0     0     0     0      0     0     0      1

As we can see from the results, the year 1995 and, to a somewhat lesser extent, 1994 stand out because they introduce a lot of undefined damage value modifiers.

There is only one unusual value in the period from 1996 onward, a “0”, so let’s examine that event:

print(data[(data$PROPDMGEXP == "0" | data$CROPDMGEXP == "0") &
            data$BGN_DATE >= as.Date("01/01/1996", format = "%m/%d/%Y"), ],
      row.names = FALSE)
##    BGN_DATE      EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
##  2011-07-27 FLASH FLOOD          0        0       0          0       0          K

This event does not have any fatalities/injuries or damage associated with it, so not knowing what the “0” for PROPDMGEXP is supposed to mean will have no influence on the result.

Select a cutoff date and remove rows older than that date

Because of

  • The distribution of the number of recorded events over time
  • The cleanliness of the data (damage value modifiers)
  • The higher impact of disregarding inflation for years that are further in the past, which understates the cost of those events (past dollars are worth more today!)
  • The bias of over-valuing more recent events, both because of inflation and because of the ever increasing number of reported events, a trend that shows no sign of slowing
  • Our focus on the impact of severe weather today, with no regard for trends and no projections into the future (see the task description: we are simply asked what the situation is, not what we might predict)

I choose to disregard all data prior to this year:

# I will ignore all data prior to this year:
CUTOFF_YEAR <- 1996

This is going to leave me with the following share of the total data available:

excludedRows <- nrow(data[data$BGN_DATE <  as.Date(paste("01/01/", CUTOFF_YEAR, sep = ""), format = "%m/%d/%Y"), ])
includedRows <- nrow(data[data$BGN_DATE >= as.Date(paste("01/01/", CUTOFF_YEAR, sep = ""), format = "%m/%d/%Y"), ])

# Percent of rows kept for analysis
noquote(paste(round(100 * includedRows / (excludedRows + includedRows), 1), "%", sep = ""))
## [1] 72.4%

So leaving out the very old data and the hard-to-interpret damage value data of 1994 and 1995 still leaves me with almost 3/4 of all data, and some of the reasons I gave, like the missing adjustment for inflation and the quickly increasing number of observations per period, would even suggest putting the cutoff year later.

# Remove all rows of data from years prior to the (beginning of the) chosen cutoff year
data <- data[data$BGN_DATE >= as.Date(paste("01/01/", CUTOFF_YEAR, sep = ""), format = "%m/%d/%Y"), ]

The number of events per year in the selected period ranges from (min to max):

noquote(format(min(ev_by_year[ev_by_year$Year >= CUTOFF_YEAR, "Count"]), big.mark = ","))
## [1] 28,680
noquote(format(max(ev_by_year[ev_by_year$Year >= CUTOFF_YEAR, "Count"]), big.mark = ","))
## [1] 62,174

Damages

Modifier columns

Let us examine how many bad modifiers there are, i.e. values that are neither empty (multiplier 1), “K” (1,000), “M” (1,000,000) nor “B” (1,000,000,000):

There is 1 bad property damage value modifier and 0 bad crop damage value modifiers.
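The counts above were computed inline; a minimal sketch of the check, using the set of accepted modifiers just listed, could look like this:

# Accepted modifiers: empty (multiplier 1), "K", "M" and "B" (case-insensitive)
validModifiers <- c("", "K", "M", "B")

# Count the values that are not in the accepted set
sum(!(toupper(data$PROPDMGEXP) %in% validModifiers))   # 1 (the "0" examined above)
sum(!(toupper(data$CROPDMGEXP) %in% validModifiers))   # 0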

Here is an overview of all modifiers (the column without a header is for “no modifier”, i.e. an empty string):

# Property damage
table(data$PROPDMGEXP)
## 
##             0      B      K      M 
## 276185      1     32 369938   7374
# Crop damage
table(data$CROPDMGEXP)
## 
##             B      K      M 
## 373069      4 278686   1771

Adjusting damages for the character multiplier

Modify the dollar damage amounts by

  • multiplying property and crop damage values by the modifier number, if present (by 1 otherwise)
  • adding the two damage types for property and crop damage to get one combined total damage dollar amount per event
# K = Thousand
# M = Million
# B = Billion

# Initialize the multipliers for crop and property damage to 1
multiplierC <- rep(1, nrow(data))
multiplierP <- rep(1, nrow(data))

multiplierC[which(toupper(data$CROPDMGEXP) == "K")] <- 1e3
multiplierC[which(toupper(data$CROPDMGEXP) == "M")] <- 1e6
multiplierC[which(toupper(data$CROPDMGEXP) == "B")] <- 1e9

multiplierP[which(toupper(data$PROPDMGEXP) == "K")] <- 1e3
multiplierP[which(toupper(data$PROPDMGEXP) == "M")] <- 1e6
multiplierP[which(toupper(data$PROPDMGEXP) == "B")] <- 1e9

# Multiply the dollar values with the respective multiplier
data$CROPDMG <- data$CROPDMG * multiplierC
data$PROPDMG <- data$PROPDMG * multiplierP

# Create a new column with the combined total damage amount
data$DAMAGE <- data$CROPDMG + data$PROPDMG

Adjusting damages for inflation

I do not adjust damages for inflation.

Here is preliminary information for how to adjust for inflation, and why:

The farther in the past a recorded event occurred, the less comparable its reported damage values are to today's, due to the cumulative effect of inflation. For example, using available CPI (Consumer Price Index) data for the US we can calculate that $100 spent in 1980 is worth about $286 in 2015!

So we would have to calculate a vector that accumulates the effect of inflation backwards in time (the further in the past, the higher the adjustment) and multiply the damage values by that vector. The adjustment depends on the date of each event.

Inflation data (the consumer price index for 1950-2012) can be downloaded from https://research.stlouisfed.org/fred2/series/CPIAUCSL/downloaddata

There are also R packages (aimed at traders) that allow downloading inflation numbers from a federal database on the fly.
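For reference, here is a minimal sketch of such an adjustment. It assumes the CPI series has been downloaded from the FRED link above as “CPIAUCSL.csv” with one row per month and columns DATE and VALUE (the column names may differ depending on the export format); it is not run for this report:

# NOT RUN for this report - a sketch of a possible CPI-based adjustment.
# Assumes "CPIAUCSL.csv" (from the FRED link above) with columns DATE and VALUE.
cpi <- read.csv("CPIAUCSL.csv", stringsAsFactors = FALSE)
cpi$DATE <- as.Date(cpi$DATE)

# Reference level: the most recent CPI value in the file
cpiNow <- cpi$VALUE[which.max(cpi$DATE)]

# Look up the CPI of the month each event occurred in and scale the damage
# up to "current" dollars: the older the event, the larger the adjustment.
eventMonth <- format(data$BGN_DATE, "%Y-%m")
cpiMonth   <- cpi$VALUE[match(eventMonth, format(cpi$DATE, "%Y-%m"))]

data$DAMAGE_ADJUSTED <- data$DAMAGE * cpiNow / cpiMonth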

Outliers

Now that the actual damage dollar estimates are available, and before going on to the final analysis, I check for outliers in the damage data to see whether my result might depend on very few extreme values.

I only have to look at outliers in the positive direction, since the lower limit of “0” is very common: format(sum(data$DAMAGE == 0), big.mark = ",", scientific = FALSE) shows 459,005 rows.

So I order the data a) by damage and b) by “health impact” (the sum of fatalities and injuries; since I am looking for extremes it does not matter that I simply add those numbers), both in descending order.

The top single events in terms of economic damage (note: even though I removed a lot of rows before this point, every remaining row is still one event):

print(
    format(
        head(
            data[
                order(data$DAMAGE, decreasing = TRUE), c("BGN_DATE","EVTYPE","CROPDMG","PROPDMG","DAMAGE")],
            10),
        big.mark = ",", scientific = FALSE),
    row.names = FALSE)
##    BGN_DATE            EVTYPE       CROPDMG         PROPDMG          DAMAGE
##  2006-01-01             FLOOD    32,500,000 115,000,000,000 115,032,500,000
##  2005-08-29       STORM SURGE             0  31,300,000,000  31,300,000,000
##  2005-08-28 HURRICANE/TYPHOON             0  16,930,000,000  16,930,000,000
##  2005-08-29       STORM SURGE             0  11,260,000,000  11,260,000,000
##  2005-10-24 HURRICANE/TYPHOON             0  10,000,000,000  10,000,000,000
##  2005-08-29 HURRICANE/TYPHOON 1,510,000,000   5,880,000,000   7,390,000,000
##  2005-08-28 HURRICANE/TYPHOON             0   7,350,000,000   7,350,000,000
##  2004-08-13 HURRICANE/TYPHOON   285,000,000   5,420,000,000   5,705,000,000
##  2001-06-05    TROPICAL STORM             0   5,150,000,000   5,150,000,000
##  2004-09-04 HURRICANE/TYPHOON    93,200,000   4,830,000,000   4,923,200,000

The top single events in terms of “harmful for population health”:

print(
    format(
        head(
            data[
                order(data$FATALITIES + data$INJURIES, decreasing = TRUE), c("BGN_DATE","EVTYPE","FATALITIES","INJURIES")],
            10),
        big.mark = ",", scientific = FALSE),
    row.names = FALSE)
##    BGN_DATE            EVTYPE FATALITIES INJURIES
##  2011-05-22           TORNADO        158    1,150
##  2011-04-27           TORNADO         44      800
##  1998-10-17             FLOOD          2      800
##  2004-08-13 HURRICANE/TYPHOON          7      780
##  1998-10-17             FLOOD          0      750
##  2011-04-27           TORNADO         20      700
##  1998-10-17             FLOOD         11      600
##  1998-10-17             FLOOD          0      550
##  2007-08-04    EXCESSIVE HEAT          2      519
##  1998-10-17             FLOOD          6      500

There does seem to be a single event that stands out in the economic damages category.

That event is the Napa Valley flood of December 2005 to January 2006. All sources report no more than 100-200 million dollars in damages.

Therefore this is clearly an erroneous data entry for “PROPDMG”, mistakenly using a “B” (billions) modifier instead of an “M” (millions).

Since this one error very heavily skews the result for economic damages I am going to manually change this datum from 115 BILLION in damages to 115 MILLION:

To check what changed I print the changed row before and after the change below.

# Create a logical vector to find the row with the outlier - from the above I know it is
# exactly one row.
v <- data$BGN_DATE == as.Date("2006-01-01") & data$EVTYPE == "FLOOD" & data$DAMAGE == max(data$DAMAGE)

# Check BEFORE the change - this should result in one row being printed:
# The first column is the row ID, which I don't suppress this time because it helps
# verify that the correct row has been modified (see "after" output below too)
format(data[v, ], big.mark = ",", scientific = FALSE)
##          BGN_DATE EVTYPE FATALITIES INJURIES         PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP
## 605953 2006-01-01  FLOOD          0        0 115,000,000,000          B 32,500,000          M
##                 DAMAGE
## 605953 115,032,500,000
# Scale the PROPDMG and DAMAGE values in the outlier row down by a factor of 1,000
data[v, c("PROPDMG","DAMAGE")] <- data[v, c("PROPDMG","DAMAGE")] / 1000

# Unused, but for cosmetics and in case something is added to this code later:
# Fix the modifier too, change it from "B" to "M"
data[v, "PROPDMGEXP"] <- "M"

# Check the result (complete row):
format(data[v, ], big.mark = ",", scientific = FALSE)
##          BGN_DATE EVTYPE FATALITIES INJURIES     PROPDMG PROPDMGEXP    CROPDMG CROPDMGEXP      DAMAGE
## 605953 2006-01-01  FLOOD          0        0 115,000,000          M 32,500,000          M 115,032,500

Event types: string column EVTYPE revisited

How many unique EVTYPE strings do we have?

length(unique(data$EVTYPE))
## [1] 430

That is a lot — but we only need to consider those EVTYPE strings that had adverse health or economic effects:

# How does one weigh a death vs. "only" an injury, especially if an injury could be anything
# from total paralysis from the neck down to minor skin abrasions?
# To answer "...which types of events are most harmful with respect to population health?"
# I add deaths and injuries and order by the resulting total, descending. This lumps
# incomparable things into one number, but I have to in order to come up with a single
# hierarchy of the "badness" of weather event types.
data$INJURY_AND_DEATH <- data$INJURIES + data$FATALITIES

# Sum up the columns FATALITIES + INJURIES per unique EVTYPE string. The resulting data
# frame will have a column "Type" with the EVTYPE string and a column "InjuriesAndDeath"
# with the sum of FATALITIES and INJURIES for that EVTYPE.
outcomes <- aggregate(list(InjuriesAndDeath = data$INJURY_AND_DEATH),
                      by = list(Type = data$EVTYPE),
                      FUN = sum)

# Sum up the column DAMAGE per unique EVTYPE string. The resulting data frame will have
# a column "Type" with the EVTYPE string and a column "Damage" with the damage dollar amount.
# Merge the resulting data frame with the previous result on "Type", which adds a
# fourth column "Damage".
outcomes <- merge(outcomes, aggregate(list(Damage = data$DAMAGE),
                                      by = list(Type = data$EVTYPE),
                                      FUN = sum),
                  by = "Type")

# Remove all outcomes that had no adverse health or property effects
outcomes <- subset(outcomes, InjuriesAndDeath > 0 | Damage > 0)

This leaves us with length(outcomes$Type) = 183 unique EVTYPE strings for events of significance for the analysis.

I choose NOT to make any more changes to EVTYPE!

Main reason: comparability of results. Highly subjective decisions about which EVTYPE strings to merge into one make the result (even) less comparable.

For example, one could summarize several EVTYPE types under a single string. What is the difference between “FLASH FLOOD” and “FLOOD” after all?

However, that process is too subjective: different people analyzing this data will inevitably come up with different policies! One person will merge the above example into one item, “FLOOD”; the next person might say that flash floods are very different events, happening in very different areas from other floods. Coastal and river floods are different for some people and not for others.

This subjectivity was already present when the data was entered, and I do not want to add my own on top of it, not even for strings where it may seem “obvious” that they are one and the same, like “WINTER WEATHER MIX” and “WINTER WEATHER/MIX”. I do not really know what the people entering the data were thinking; maybe the distinction does have a meaning to them.

More importantly, as I said, the result of this process is much more subjective than any of the changes I applied thus far. For example, is “cold rain” part of “cold” (just like the many other kinds of “cold weather” that would be easy to subsume under “cold”) or part of “rain” (which also has easy subsumption cases)? When you start with the various kinds of floods, what is a flood due to heavy wind: part of “flood” or part of “heavy wind”? Is “COLD AND SNOW” “cold weather” or “snow”? Or should I put all of snow under “cold weather”? But snow is a form of precipitation, so keeping “snow” alongside “rain” also makes sense. There is also “WINTER WEATHER” (in several variants): is that “COLD” or “SNOW” (and which)? Wind and hail: is that WIND or HAIL? “EXTREME WINDCHILL”: COLD or WIND? And which kinds of flood should I subsume under one type of flood? Coastal floods, flash floods (not uncommon in the desert) and river floods are VERY different things. But the latter two are actually part of “RAIN”…
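For illustration only, this is roughly what such a merge could look like if one did commit to a policy; the mapping below is one subjective choice among many and is NOT applied anywhere in this analysis:

# NOT APPLIED in this analysis - a purely illustrative merging policy.
mergeEvtype <- function (evtype) {
    evtype <- sub("^RIP CURRENTS$", "RIP CURRENT", evtype)             # plural vs. singular
    evtype <- sub("^WILD/FOREST FIRE$", "WILDFIRE", evtype)            # treat as one category
    evtype <- sub("^WINTER WEATHER/MIX$", "WINTER WEATHER MIX", evtype)
    evtype
}

# Example usage (again: not done in this report):
# data$EVTYPE <- mergeEvtype(data$EVTYPE)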

Results

A log plot does not make sense for this analysis because we explicitly want to concentrate on the events with the highest number of injuries and/or fatalities, or the highest property or crop damage values.

My analysis is for the United States as a whole, not broken down by state. I refer to this forum discussion and others.

I also disregard the time component in the data (apart from having selected only data from 1996 onward). Making predictions and analysing past developments over time is not part of the questions, and above I discussed why it may not be a good idea in any case with this particular data set.

 

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

The 20 worst types of event since 1996 in terms of health-related harm, measured as the total sum of fatalities and injuries per event type over the entire selected period, are:

I chose to add fatalities and injuries together to have just one total count, since it is not quite clear how I would weigh fatalities vs. injuries, and since my goal was to get a very broad overview.

# Subset: Ordered by number of fatalities and injuries and ignore the economic-damages column
# and all event types that had no health impact
health_outcomes <- subset(outcomes[order(outcomes$InjuriesAndDeath, decreasing = TRUE), ],
                          InjuriesAndDeath > 0,
                          c("Type", "InjuriesAndDeath"))

# Make the "Type" column an *ordered*(!) factor, using the order established just above.
# This is to prevent the plot function from doing its own ordering of the categorical
# variable "Event" used for the x-axis.
health_outcomes$Type <- factor(health_outcomes$Type, levels = health_outcomes$Type)

# Print a table of the 20 worst event types for health impact and draw a bar plot of the top 15
print(noquote(format(head(health_outcomes, 20), big.mark = ",", scientific = FALSE)), row.names = FALSE)
##               Type InjuriesAndDeath
##            TORNADO           22,178
##     EXCESSIVE HEAT            8,188
##              FLOOD            7,172
##          LIGHTNING            4,792
##          TSTM WIND            3,870
##        FLASH FLOOD            2,561
##  THUNDERSTORM WIND            1,530
##       WINTER STORM            1,483
##               HEAT            1,459
##  HURRICANE/TYPHOON            1,339
##          HIGH WIND            1,318
##           WILDFIRE              986
##         HEAVY SNOW              805
##                FOG              772
##               HAIL              720
##   WILD/FOREST FIRE              557
##        RIP CURRENT              549
##       RIP CURRENTS              496
##           BLIZZARD              455
##          ICE STORM              400
g <- ggplot(head(health_outcomes, 15), aes(x = Type, y = InjuriesAndDeath))
g <- g + geom_bar(stat = "identity", aes(fill = as.factor(InjuriesAndDeath)))
g <- g + guides(fill = guide_legend(title = "Total Count", reverse = TRUE))
g <- g + theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1))
g <- g + labs(x = "")
g <- g + labs(y = "Deaths and Injuries")
g <- g + labs(title = paste("Deaths and Injuries for the 20 Worst Types of Event since", CUTOFF_YEAR))
g

Note: if some event types on the x-axis seem to appear twice (e.g. “RIP CURRENT” and “RIP CURRENTS”), that is a known problem, but, as discussed in the EVTYPE section above, not one that I am going to address in this project.
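A minimal, purely illustrative sketch of how one could flag such near-duplicates among the top event types (approximate string matching; nothing in the analysis is changed based on this):

# ILLUSTRATIVE ONLY: flag groups of top-20 event types that look like near-duplicates.
topTypes <- as.character(head(health_outcomes$Type, 20))
for (t in topTypes) {
    similar <- agrep(t, topTypes, max.distance = 0.1, value = TRUE)
    if (length(similar) > 1) print(similar)
}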

 

  2. Across the United States, which types of events have the greatest economic consequences?

The 20 worst types of event since 1996 in terms of economic damage, measured as the total damage in billions of dollars per event type over the entire selected period, are:

I added crop and property damage to have just one sum.

# Subset: Ordered by economic damage; ignore the health-impact column
# and all event types that had no economic impact
economic_outcomes <- subset(outcomes[order(outcomes$Damage, decreasing = TRUE), ],
                            Damage > 0,
                            c("Type", "Damage"))

# Convert the dollar amounts from "dollars" to "billions of dollars"
economic_outcomes$Damage <- economic_outcomes$Damage / 1e9

# Make the "Type" column an *ordered*(!) factor, using the order established just above.
# This is to prevent the plot function from doing its own ordering of the categorical
# variable "Event" used for the x-axis.
economic_outcomes$Type <- factor(economic_outcomes$Type, levels = economic_outcomes$Type)

# For display, round the dollar amounts to two decimal places
economic_outcomes$Damage <- round(economic_outcomes$Damage, digits = 2)

# Print a table of the 20 worst event types for economic damage and draw a bar plot of the top 15
print(noquote(format(head(economic_outcomes, 20), big.mark = ",", scientific = FALSE)), row.names = FALSE)
##               Type Damage
##  HURRICANE/TYPHOON  71.91
##        STORM SURGE  43.19
##              FLOOD  34.00
##            TORNADO  24.90
##               HAIL  17.07
##        FLASH FLOOD  16.56
##          HURRICANE  14.55
##            DROUGHT  14.41
##     TROPICAL STORM   8.32
##          HIGH WIND   5.88
##           WILDFIRE   5.05
##          TSTM WIND   5.04
##   STORM SURGE/TIDE   4.64
##  THUNDERSTORM WIND   3.78
##          ICE STORM   3.66
##   WILD/FOREST FIRE   3.11
##       WINTER STORM   1.54
##       EXTREME COLD   1.33
##         HEAVY RAIN   1.31
##       FROST/FREEZE   1.10
g <- ggplot(head(economic_outcomes, 15), aes(x = Type, y = Damage))
g <- g + geom_bar(stat = "identity", aes(fill = as.factor(Damage)))
g <- g + guides(fill = guide_legend(title = "Total Billion USD", reverse = TRUE))
g <- g + theme(axis.text.x = element_text(angle = 60, vjust = 1, hjust = 1))
g <- g + labs(x = "")
g <- g + labs(y = "Economic Damage (in Billion US-Dollars)")
g <- g + labs(title = paste("Total Economic Damage for the 20 Worst Types of Event since", CUTOFF_YEAR))
g

Note: same as above, if some event types on the x-axis seem to appear twice, that is a known problem, but, as discussed in the EVTYPE section above, not one that I am going to address in this project.