NWS STORM DATA ASSESSMENT ON POPULATION HEALTH AND ECONOMIC CONSEQUENCES

INTRODUCTION & SYNOPSIS

This analysis considers the National Weather Service's storm data from 1989 to November 2011. The purpose is to determine the types of events that have the greatest impact on population health and economic losses. From 1989 to 2011, tornadoes were responsible for the most injuries, by far, of any storm event type. However, excessive heat caused more deaths during this same time period. In addition, out of over $451 billion in total damage, 89% was to property and 11% was to crops. Of note, drought causes mostly crop damage while storm surge causes mostly property damage, and tornadoes are not responsible for the most damage - flooding causes the most economic damage overall.

DATA PROCESSING

Data pre-processing

Load necessary R packages.

#install.packages("knitr")
library(knitr)

#install.packages("stats")
library(stats)

#install.packages("ggplot2")
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.2.4

#install.packages("reshape")
library(reshape)

#install.packages("dplyr")
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:reshape':
## 
##     rename

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Load the data.

storm <- read.csv("repdata-data-StormData.csv.bz2", header=TRUE)

Our goal is to identify events that are most harmful to population health. To answer this question, we will consider “harmful” to be counts of injuries and fatalities resulting from each event.

We also need to identify types of events with the greatest economic consequences. This will be answered with the dollar amount values in the property damage and crop damage variables.

Data preparation

Data for earlier years is sparse, so we must determine how many records are available for each year.

First we must convert the field BGN_DATE from a factor variable to a date variable and determine how to sub-set the data by year.

storm$begin_date <- as.Date(storm$BGN_DATE, format = "%m/%d/%Y")
storm$begin_date_year <- as.numeric(format(storm$begin_date, "%Y"))

table(storm$begin_date_year)

## 
##  1950  1951  1952  1953  1954  1955  1956  1957  1958  1959  1960  1961 
##   223   269   272   492   609  1413  1703  2184  2213  1813  1945  2246 
##  1962  1963  1964  1965  1966  1967  1968  1969  1970  1971  1972  1973 
##  2389  1968  2348  2855  2388  2688  3312  2926  3215  3471  2168  4463 
##  1974  1975  1976  1977  1978  1979  1980  1981  1982  1983  1984  1985 
##  5386  4975  3768  3728  3657  4279  6146  4517  7132  8322  7335  7979 
##  1986  1987  1988  1989  1990  1991  1992  1993  1994  1995  1996  1997 
##  8726  7367  7257 10410 10946 12522 13534 12607 20631 27970 32270 28680 
##  1998  1999  2000  2001  2002  2003  2004  2005  2006  2007  2008  2009 
## 38128 31289 34471 34962 36293 39752 39363 39184 44034 43289 55663 45817 
##  2010  2011 
## 48161 62174

Looking at a summary table of the number of records for each year, it appears that 1989 sees a significant jump in the number of events. Therefore, we will consider only records with events that began in 1989 or later.

storm_1989 <- subset(storm, begin_date_year >= 1989)

We now have all weather events for 1989 to 2011. However, many of the event types have typos or extraneous spaces. We will remove leading and trailing spaces and ensure all text is in uppercase, and create a new field for event type (preserving the original event type in case it is needed).

storm_1989$event_type <- gsub("^\\s+|\\s+$", "",toupper(storm_1989$EVTYPE))

RESULTS

1. Events and Population Health

We will summarize injuries and fatalities by event type, sort in descending order, and view the top events for both fatalities and injuries.

inj_fat <- data.frame(aggregate(cbind(FATALITIES, INJURIES) ~ event_type, data = storm_1989, sum))

fatalities <- inj_fat[order(-inj_fat$FATALITIES), ]
injuries <- inj_fat[order(-inj_fat$INJURIES), ]

head(fatalities)

##         event_type FATALITIES INJURIES
## 108 EXCESSIVE HEAT       1903     6525
## 750        TORNADO       1802    27944
## 130    FLASH FLOOD        978     1777
## 235           HEAT        937     2100
## 410      LIGHTNING        816     5230
## 146          FLOOD        470     6789

head(injuries)

##         event_type FATALITIES INJURIES
## 750        TORNADO       1802    27944
## 146          FLOOD        470     6789
## 108 EXCESSIVE HEAT       1903     6525
## 771      TSTM WIND        356     5404
## 410      LIGHTNING        816     5230
## 235           HEAT        937     2100

As seen by the sorted tables, tornadoes are responsible for far more injuries than any other event type, but excessive heat is responsible for more fatalities. These differences are worth noting, as shown in the plot of the top five event types by fatalities.

To further understand the data, we will analyze the top five event types by fatalities and view the number of injuries associated with them as well. To do this, we take the top event types with 470 or more fatalities, melt the table, and create a clustered bar chart.

fatalities_top <- subset(fatalities, FATALITIES >= 470)
injuries_top <- subset(injuries, INJURIES >= 2100)

fatalities_melt <- melt(fatalities_top, id.vars="event_type")

png("figure/fatalities-injuries.png", width = 650, height = 480)

ggplot(fatalities_melt, aes(factor(event_type), value, fill=variable)) + 
        geom_bar(stat="identity", position="dodge") + 
        scale_fill_brewer(palette="Set1") + 
        ggtitle("Fatalities and Injuries by Top Five Events") +
        xlab("Event Type") +
        ylab("Total Fatalities or Injuries")

dev.off()

## quartz_off_screen 
##                 2

fatalities-injuries plot

We can therefore conclude that tornados cause the most injuries by far, and excessive heat causes the most fatalities, closely followed by tornadoes.

2. Events and Economic Consequences

To analyze property damage (PROPDMG) and crop damage (CROPDMG) figures, we must note that there are additional fields indicating the units associated with these figures (PROPDMGEXP and CROPDMGEXP respectively). If the value is K, the dollar amount is in thousands, if the value is M, the dollar amount is in millions, etc. While this is not necessarily sound database design, we will account for it by creating two new variables that will be the units by which we multiply the dollar amounts.

storm_1989 <- mutate(storm_1989, propdmgunit=0, cropdmgunit=0)

prop_hundreds <- grep("h|H", storm_1989$PROPDMGEXP)
storm_1989$propdmgunit[prop_hundreds] = 100

prop_thousands <- grep("k|K", storm_1989$PROPDMGEXP)
storm_1989$propdmgunit[prop_thousands] = 1000

prop_millions <- grep("m|M", storm_1989$PROPDMGEXP)
storm_1989$propdmgunit[prop_millions] = 1000000

prop_billions <- grep("b|B", storm_1989$PROPDMGEXP)
storm_1989$propdmgunit[prop_billions] = 1000000000

crop_thousands <- grep("k|K", storm_1989$CROPDMGEXP)
storm_1989$cropdmgunit[crop_thousands] = 1000

crop_millions <- grep("m|M", storm_1989$CROPDMGEXP)
storm_1989$cropdmgunit[crop_millions] = 1000000

crop_billions <- grep("b|B", storm_1989$CROPDMGEXP)
storm_1989$cropdmgunit[crop_billions] = 1000000000

Now that we have the proper units, we can multiply the property and crop damage values by these units and aggregate by event type. The top 10 event types are determined and ordered by total damage.

storm_1989$property_damage <- storm_1989$PROPDMG * storm_1989$propdmgunit

storm_1989$crop_damage <- storm_1989$CROPDMG * storm_1989$cropdmgunit

storm_1989$total_damage <- storm_1989$property_damage + storm_1989$crop_damage

damage <- data.frame(aggregate(cbind(property_damage, crop_damage, total_damage) ~ event_type, data = storm_1989, sum))

head(damage[order(-damage$total_damage), ], 10)

##            event_type property_damage crop_damage total_damage
## 146             FLOOD    144657709800  5661968450 150319678250
## 364 HURRICANE/TYPHOON     69305840000  2607872800  71913712800
## 591       STORM SURGE     43323536000        5000  43323541000
## 750           TORNADO     32218566360   414953110  32633519470
## 204              HAIL     15732267220  3025954450  18758221670
## 130       FLASH FLOOD     16140861510  1421317100  17562178610
## 76            DROUGHT      1046106000 13972566000  15018672000
## 355         HURRICANE     11868319010  2741910000  14610229010
## 521       RIVER FLOOD      5118945500  5029459000  10148404500
## 379         ICE STORM      3944927810  5022113500   8967041310

Now we will create a stacked bar chart to visualize property and crop damage for the top 10 event types.

damage_top <- subset(damage, total_damage >= 8967041310)
damage_melt <- melt(damage_top, id.vars="event_type")
damage_melt_sub <- subset(damage_melt, variable != "total_damage")

png("figure/property-crop-damage.png", width = 480, height = 480)

ggplot(damage_melt_sub, aes(factor(event_type), value, fill=variable)) + 
        geom_bar(stat="identity", position="stack") + 
        scale_fill_brewer(palette="Set2") + 
        ggtitle("Property and Crop Damage by Top 10 Event Types") +
        xlab("Event Type") +
        ylab("Total Damage in $USD") +
        coord_flip()

dev.off()

## quartz_off_screen 
##                 2

damage plot

As seen in the plot, floods, hurricanes/typhoons, storm surge, and tornadoes account for the greatest economic losses. Also of note, property damage accounts for the vast majority of damage for most events, except for drought (which makes sense, since crops would be most affected by that). Storm surge results in very little crop damage, which also make sense, since it's usually cities by coasts, rather than farmland. Also note that while tornadoes account for many more injuries and deaths than any other event type, they do not account for the most economic damage.

The differences between property and crop damage is vast, as seen in this staked bar chart of damage by year.

total_property_damage <- sum(storm_1989$property_damage)
total_crop_damage <- sum(storm_1989$crop_damage)
total_damage <- total_property_damage + total_crop_damage
percent_prop_damage <- round(100*(total_property_damage/total_damage), 0)
percent_crop_damage <- 100-percent_prop_damage

damage_by_year <- data.frame(aggregate(cbind(property_damage, crop_damage) ~ begin_date_year, data = storm_1989, sum))

damage_by_year_melt <- melt(damage_by_year, id.vars="begin_date_year")

png("figure/damage-by-year.png", width = 800, height = 480)

ggplot(damage_by_year_melt, aes(factor(begin_date_year), value, fill=variable)) + 
        geom_bar(stat="identity", position="stack") + 
        scale_fill_brewer(palette="Set2") + 
        ggtitle("Property and Crop Damage by Year, 1989-2011") +
        xlab("Event Type") +
        ylab("Total Damage in $USD")

dev.off()

## quartz_off_screen 
##                 2

damage plot

In total, there was 4.0260005 × 10¹¹ in property damage and 4.9104192 × 10¹⁰ in crop damage. It is also interesting to note that of the total 4.5170424 × 10¹¹ in damage, a whopping 89 percent of of damage is property damage, while only 11 percent is crop damage.

As seen in the chart, there was no crop damage until 1993, so that data likely wasn't collected until that year. Clearly visible is the spike in 2005 and 2006, almost certainly due to Hurricane Katrina. However, since that event and subsequent flooding occurred only in 2005, and 2006 sees even more damage, it would be a safe assumption that the quality of this data (specifically begin date) is questionable.