Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project intends to address two basic questions:

  1. Across the United States, which types of events are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

These questions are addressed by exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database which tracks characteristics of major storms and weather events in the United States such as when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

The top 22 (out of 985) events causing the largest health and economic impacts make up over 95% of the total. For health impacts, in order of highest to lowest impact were: TORNADO, EXCESSIVE HEAT, TSTM WIND, FLOOD, LIGHTNING, HEAT, FLASH FLOOD, ICE STORM, THUNDERSTORM WIND, WINTER STORM, HIGH WIND, HAIL, HURRICANE/TYPHOON, HEAVY SNOW, WILDFIRE, THUNDERSTORM WINDS, BIZZARD, FOG, RIP CURRENT, WILD/FOREST FIRE, RIP CURRENTS, and HEAT WAVE.

For economic impacts, in order of highest to lowest impact were: FLOOD, HURRICANE/TYPHOON, TORNADO, STORM SURGE, HAIL, FLASH FLOOD, DROUGHT, HURRICANE, RIVER FLOOD, ICE STORM, TROPICAL STORM, WINTER STORM, HIGH WIND, WILDFIRE, TSTM WIND, STORM SURGE/TIDE, THUNDERSTORM WIND, HURRICANE OPAL, WILD/FOREST FIRE, HEAVY RAIN/SEVERE WEATHER, THUNDERSTORM WINDS, and a group labeled TORNADOES, TSTM WIND, and HAIL.

Data Processing

## Download data from the data file from the repo and load into a data frame
require(downloader)
## Loading required package: downloader
require(dplyr)
## Loading required package: dplyr
## Warning: package 'dplyr' was built under R version 3.2.2
## 
## Attaching package: 'dplyr'
## 
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## 
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
require(ggplot2)
## Loading required package: ggplot2
fileUrl <- paste0("https://github.com/MichaelSzczepaniak/",
                  "StormDataAnalysis1950to2011/raw/master/",
                  "repdata_data_StormData.csv.bz2")
dataDir <- "data"; renameZipTo <- "data.bz2"; renameDataTo <- "data.csv"
download(fileUrl, file.path(".", dataDir, renameZipTo))
cat(format(Sys.time(), "%T"), "START storm.data load...\n")
## 12:58:28 START storm.data load...
# read the downloaded file into a data frame so we can work with it
storm.data <- read.csv(bzfile(file.path(".", dataDir, renameZipTo)))
cat(format(Sys.time(), "%T"), "FINISH storm.data load...\n")
## 13:00:54 FINISH storm.data load...

Cleaning the downloaded data

The first task after downloading the data was selecting only the columns that are needed for the analysis.

storm.event.data <- select(storm.data, EVTYPE, FATALITIES, INJURIES,
                           PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

Next, a column called impact was created to be used to flag rows that showed no health or economic impact. Rows that show no impact were then removed along with the impact column.

storm.event.data <- mutate(storm.event.data,
                           impact = (FATALITIES+INJURIES+PROPDMG+CROPDMG) > 0)
storm.event.data <- filter(storm.event.data, impact == TRUE)
storm.event.data <- select(storm.event.data, EVTYPE, FATALITIES, INJURIES,
                           PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

The next task was to convert the PROPDMG and CROPDMG columns into their actual values. These values were coded in the database as the significant figures related to property and crop damage respectively. The column to the immediate right of PROPDMG is labeled PROPDMGEXP and this column contains a code such as K, M, or B which indicate whether the PROPDMG value should be multiplied by 1,000, 1,000,000, or 1,000,000,000 respectively. The CROPDMBEXP column to the immediate right of the CROPDMG column serves the same purpose for the crop damage numbers.

The decodeDMG function was written for this purpose and was then applied to the PROPDMG and CROPDMG columns to obtain the actual values for these quantites.

## Returns df with valueColumn updated according to codeColumn
## Warning: A single call to this function took between 3 and 9 mins to run
## on the storm.event.data data frame with ~254k rows.  TODO impl. caching
decodeDMG <- function(df, codeColumn, valueColumn) {
    row.count <- length(df[, codeColumn])
    for(i in 1:row.count) {
        code <- df[i, codeColumn]
        if(toupper(code) == 'K') {
            df[i, valueColumn] = df[i, valueColumn]*1000
        }
        if(toupper(code) == 'M') {
            df[i, valueColumn] = df[i, valueColumn]*1000000
        }
        if(toupper(code) == 'B') {
            df[i, valueColumn] = df[i, valueColumn]*1000000000
        }
    }
    # leave the original value as is if code is not k, K, m, M, b, or B
    
    return(df)
}

cat(format(Sys.time(), "%T"), "decodeDMG starting to decode PROP...\n")
## 13:00:55 decodeDMG starting to decode PROP...
storm.event.data <- decodeDMG(storm.event.data, 'PROPDMGEXP', 'PROPDMG')
cat(format(Sys.time(), "%T"), "decodeDMG FINISHED decoding PROP\n")
## 13:08:02 decodeDMG FINISHED decoding PROP
cat(format(Sys.time(), "%T"), "decodeDMG starting to decode CROP...\n")
## 13:08:02 decodeDMG starting to decode CROP...
storm.event.data <- decodeDMG(storm.event.data, 'CROPDMGEXP', 'CROPDMG')
cat(format(Sys.time(), "%T"), "decodeDMG FINISHED decoding CROP\n")
## 13:11:11 decodeDMG FINISHED decoding CROP

At this point, the storm.event.data data frame is ready for analysis.

Results

Determining the most significant events

Before plotting the data, the number of events in the cleaned data were determined to be:

length(unique(storm.event.data$EVTYPE))
## [1] 488

Because this is a rather large number of events, the following code was run to determine the most significant in terms of health and economic impact. The dotted line in the plots below delineate the 95% mark of the accumulated health or economic strom impact.

# start by summarizing the data
sdata.by.ev <- group_by(storm.event.data, EVTYPE)
sdata.summ <- summarize(sdata.by.ev, fatality = sum(FATALITIES),
                        injury = sum(INJURIES),
                        property = sum(PROPDMG),
                        crop = sum(CROPDMG))
# rank the events by health
health.summ <- select(sdata.summ, EVTYPE, fatality, injury)
health.summ <- mutate(health.summ, total.health = fatality + injury)
health.summ <- arrange(health.summ, desc(total.health))
health.summ <- mutate(health.summ,
                      accum.health = cumsum(health.summ$total.health) /
                                     sum(health.summ$total.health))
# rank the events by economic loss
economic.summ <- select(sdata.summ, EVTYPE, property, crop)
economic.summ <- mutate(economic.summ, total.damage = property + crop)
economic.summ <- arrange(economic.summ, desc(total.damage))
economic.summ <- mutate(economic.summ,
                        accum.econ = cumsum(economic.summ$total.damage) /
                                     sum(economic.summ$total.damage))

# plot the rankings using the base R graphics system
par(mfrow = c(1, 2), mar = c(8, 4, 4, 2))
barplot(health.summ$accum.health[1:22] * 100,
        names.arg = health.summ$EVTYPE[1:22],
        las = 2,
        cex.names = 0.5,
        ylab = "Accumlated Impact %",
        ylim = c(30, 100), xpd = FALSE,
        main = "Top 22 Health Impact Events:\nInjuries + Fatalities")
abline(h=95, lty=2)  # designate the 95% cut off
barplot(economic.summ$accum.econ[1:22] * 100,
        names.arg = economic.summ$EVTYPE[1:22],
        las = 2,
        cex.names = 0.5,
        ylab = "Accumlated Impact",
        ylim = c(30, 100), xpd = FALSE,
        main = "Top 22 Economic Impact Events:\nPropery + Crop Damage")
abline(h=95, lty=2)  # designate the 95% cut off

Storm Weather Impact on Health

## Before plotting, restructure health data to use factors (makes ggplot happy)
##### health

require(tidyr)
## Loading required package: tidyr
## Warning: package 'tidyr' was built under R version 3.2.2
health.summ.tidy <- select(health.summ[1:22,], EVTYPE, fatality, injury)
health.summ.tidy <- gather(health.summ.tidy, Outcome, count, -EVTYPE)
health.summ.tidy <- mutate(health.summ.tidy,
                           EVTYPE = factor(EVTYPE,
                                           levels = unique(EVTYPE),
                                           ordered = TRUE),
                           Outcome = as.factor(Outcome))
# create health stacked bar
p <- ggplot(health.summ.tidy,
            aes(x = EVTYPE, y = count, fill = Outcome, order = EVTYPE))
p <- p + geom_bar(stat = "identity")
p <- p + ggtitle("Historical Fatalities and Injuries Count (Top 22 Events)")
p <- p + ylab("Fatalities and Injuries") + xlab("Weather Event")
p <- p + theme(legend.position=c(0.25, 0.75))
p <- p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)

The plot above does not account for difference in impact between injuries and fatalities. One reference was found here which postulates two quantities to assist in discussing this area:

VSL = value of a statistical life and VSI = value of a statistical injury

On pages 128-129, the author estimates VSL to be $7,600,000 in 2007 dollars and VSI to $76,000 in 2006. This might imply weighting fatalities 100 more than injuries, but more research would be required to better understand this relationship. The reader is left to decide what (if any) weighting might be applied to injuries relative to fatalities.

Storm Weather Impact on Property and Crop Damage

## Restructure property and crop data to use factors.

economic.summ.tidy <- select(economic.summ[1:22,], EVTYPE, property, crop)
economic.summ.tidy <- gather(economic.summ.tidy, Damage, cost, -EVTYPE)
economic.summ.tidy <- mutate(economic.summ.tidy,
                             EVTYPE = factor(EVTYPE,
                                             levels = unique(EVTYPE),
                                             ordered = TRUE),
                             Damage = as.factor(Damage))
# create economic stacked bar
billion <- 1000000000
p <- ggplot(economic.summ.tidy,
            aes(x = EVTYPE, y = cost / billion, fill = Damage, order = EVTYPE))
p <- p + geom_bar(stat = "identity")
p <- p + ggtitle("Historical Property & Crop Damage (Top 22 Events)")
p <- p + ylab("Damage (Billions USD)") + xlab("Weather Event")
p <- p + theme(legend.position=c(0.25, 0.75))
p <- p + theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)

Conclusions

Because tornados and excessive heat had the highest counts of both injuries and fatalities, they can be deemed the most harmful to population health regardless of how fatalities and injuries are weighted relative to each other with respect to their impact.

Floods, huricanes, and tornados had the most economic consequences in terms of total damage (property and crop) with floods having over 2X impact hurricanes and tornadoes.