Synopsis

Weather events can harm public health and cause significant financial loss. This report analyzes data from the National Oceanic and Atmospheric Administration’s Storm Database to answer two fundamental questions:

  1. Across the United States, which types of weather events are most harmful with respect to human health?
  2. Across the United States, which types of weather events have the greatest economic consequences?

Different weather event types were recorded over different timeframes. Therefore, to allow for relatively consistent comparisons across weather events, the period of analysis has been restricted to the years 1999 through 2011.

From 1999 through 2011, the most harmful events in terms of injuries and fatalities were: Tornados, Excessive Heat, TSTM Wind, Flood, and Lightning.

Over the same timeframe, the events with the greatest economic impact in terms of property damage and crop damage were: Floods, Storm Surges, Tornados, Hail, and Flash Floods.

Data Processing

From the National Oceanic and Atmospheric Administration’s Storm Database, we obtained data on weather-related events from 1950 through 2011.

Reading in the Data

The storm database is provided as a bzip2-compressed *.csv file with headers, so download and import are straightforward. The file is downloaded to the “data” subdirectory of the current working directory, and read.csv() decompresses it as it reads.

# Load libraries used in this script, installing if necessary
if(!("pacman" %in% rownames(installed.packages()))) {
  install.packages("pacman", repos = "http://cran.us.r-project.org")
}
library(pacman)
p_load(dplyr, ggplot2, ggpubr, lubridate, tidyr)

# Set up the destination location for the data file if necessary
if (!file.exists("./data")) {
    dir.create("./data")
}

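fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destFile <- "./data/StormData.csv.bz2"

# Download only if the file is not already present; mode = "wb" keeps the
# compressed file intact on Windows
if (!file.exists(destFile)) {
    download.file(fileURL, destfile = destFile, mode = "wb")
}

# read.csv() decompresses the bz2 file as it reads
stormData <- read.csv(destFile)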

After importing the data, we can check the first few rows and columns of the dataset.

head(stormData[ , 1:8])
##   STATE__           BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE  EVTYPE
## 1       1  4/18/1950 0:00:00     0130       CST     97     MOBILE    AL TORNADO
## 2       1  4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL TORNADO
## 3       1  2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL TORNADO
## 4       1   6/8/1951 0:00:00     0900       CST     89    MADISON    AL TORNADO
## 5       1 11/15/1951 0:00:00     1500       CST     43    CULLMAN    AL TORNADO
## 6       1 11/15/1951 0:00:00     2000       CST     77 LAUDERDALE    AL TORNADO

The first step in data manipulation is to set the variable “EVTYPE” as a factor. This variable identifies the weather event type used in the analysis. In addition, the year in which the event began is also extracted to support the analysis.

stormData$EVTYPE <- as.factor(stormData$EVTYPE)
stormData$date <- as.Date(stormData$BGN_DATE, "%m/%d/%Y %H:%M:%S")
stormData$year <- as.numeric(format(stormData$date, "%Y"))
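
As a quick illustration of the date handling above, the following sketch parses two BGN_DATE strings copied from the rows printed earlier. The variable names bgn, parsed, and yr are hypothetical and are not part of the analysis.

```r
# Two BGN_DATE values in the database's "month/day/year hour:min:sec" format
bgn <- c("4/18/1950 0:00:00", "11/15/1951 0:00:00")

# Parse with the same format string used for stormData$date
parsed <- as.Date(bgn, format = "%m/%d/%Y %H:%M:%S")

# Extract the four-digit year as a number, as done for stormData$year
yr <- as.numeric(format(parsed, "%Y"))

print(parsed)
print(yr)
```

Strings that do not match the format string come back as NA, which is why the later summaries use na.rm = TRUE.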

The database provides information on property damage and crop damage resulting from each event. However, this data is spread across multiple variables. For property damage, the variable “PROPDMG” provides a cost number and the variable “PROPDMGEXP” indicates the exponent (K = thousands, M = millions, B = billions). The same is true for crop damage based on “CROPDMG” and “CROPDMGEXP”.

New variables are created to identify the cost in dollars for property damage and crop damage. Values with any other exponent code are left unscaled.

stormData <- stormData %>% 
    mutate(propDmgTot = case_when(
        PROPDMGEXP == "K" ~ 1000 * PROPDMG,
        PROPDMGEXP == "M" ~ 1000000 * PROPDMG,
        PROPDMGEXP == "B" ~ 1000000000 * PROPDMG,
        TRUE ~ PROPDMG
    ))

stormData <- stormData %>% 
    mutate(cropDmgTot = case_when(
        CROPDMGEXP == "K" ~ 1000 * CROPDMG,
        CROPDMGEXP == "M" ~ 1000000 * CROPDMG,
        CROPDMGEXP == "B" ~ 1000000000 * CROPDMG,
        TRUE ~ CROPDMG
    ))
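
The exponent conversion can also be written as a lookup table. The sketch below is a minimal base R version under two assumptions that go beyond the analysis above: the helper name dmg_to_dollars is hypothetical, and lowercase codes such as "m" are treated the same as "M", which the case_when() calls do not do. The input values are made up for illustration.

```r
# Hypothetical helper, not part of the original analysis
dmg_to_dollars <- function(value, exponent) {
  # Multipliers for the three documented exponent codes
  multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

  # Assumption: match codes case-insensitively
  m <- multipliers[toupper(as.character(exponent))]

  # Any unrecognized or missing code leaves the value unscaled
  m[is.na(m)] <- 1
  unname(value * m)
}

# Illustrative inputs only
dollars <- dmg_to_dollars(c(25, 2.5, 1.2, 10), c("K", "m", "B", ""))
print(dollars)
```

Whether lowercase codes should be scaled is a judgment call about the data; applying this helper instead of the case_when() version could change the totals if such codes occur in the selected period.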

Restricting the Timeframe for Analysis

While the storm database contains data from 1950 to 2011, data collection for most weather-related event types did not start until much later. The histogram below shows how many event types first appear in the database in each year.

dataColStart <- stormData %>% 
    group_by(EVTYPE) %>% 
    summarize(startYear = min(year, na.rm = TRUE))

ggplot(dataColStart, aes(x = startYear)) +
    geom_histogram(binwidth = 1) +
    ggtitle("When Data Collection Started for Each Event Type") +
    xlab("Year") +
    ylab("Count of New Event Types") +
    theme(plot.title = element_text(hjust = 0.5))

From this graph, it is apparent that data collection for the majority of weather event types did not begin until 1993 or later. It is therefore not appropriate to compare total costs across event types whose data span significantly different timeframes. Data collection for 90% of the weather event types identified in the database began during or before the year 1999. Therefore, to compare data over a relatively consistent timeframe, the analysis is restricted to weather events occurring during the years 1999 through 2011. Further, event types for which data collection began after 1999 are excluded.

dataColStart <- dataColStart %>% 
  filter(startYear <= 1999)

stormData <- stormData %>% 
  filter(year >= 1999 & EVTYPE %in% dataColStart$EVTYPE)
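
The share of event types covered by a given cutoff year can be checked directly from the first-collection years. The sketch below uses made-up start years, not values from the Storm Database, and the names startYear, cutoff, and share_covered are chosen for illustration. With the real data, the same expression would be applied to dataColStart$startYear before it is filtered.

```r
# Hypothetical first-collection years for ten event types (illustrative only)
startYear <- c(1950, 1955, 1993, 1993, 1994, 1995, 1996, 1997, 1999, 2003)

# Candidate cutoff year for the start of the analysis window
cutoff <- 1999

# Fraction of event types whose data collection began in or before the cutoff
share_covered <- mean(startYear <= cutoff)
print(share_covered)
```

Trying several cutoff values this way shows the trade-off between keeping more event types and keeping a longer analysis window.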

Results

Analyzing Weather Event Harm to Human Health

The provided data includes fatalities and injuries attributed to each weather event. To determine which weather events were most harmful to human health, we will consider the total number of fatalities and injuries over the time period attributed to each type of weather event. We will then identify the top five types of weather events with respect to total deaths, along with the top five types of weather events with respect to total injuries. The union of these two sets of events will be used to explore the harm to human health resulting from these events.

The following stacked bar chart shows the top weather events in terms of impact on human health.

stormLoss <- stormData %>% 
    group_by(EVTYPE) %>% 
    summarize(sDeaths = sum(FATALITIES, na.rm = TRUE),
              sInjuries = sum(INJURIES, na.rm = TRUE))

topFatalEvents <- stormLoss %>% 
    arrange(desc(sDeaths)) %>% 
    slice(1:5) %>% 
    select(EVTYPE)

topInjuryEvents <- stormLoss %>% 
    arrange(desc(sInjuries)) %>% 
    slice(1:5) %>% 
    select(EVTYPE)

topHHEvents <- union(topFatalEvents$EVTYPE, topInjuryEvents$EVTYPE)

HHCosts <- stormLoss %>% 
    filter(EVTYPE %in% topHHEvents) %>% 
    gather("harm", "cost", -EVTYPE) %>% 
    ggplot(aes(x = reorder(EVTYPE, -cost), y = cost, fill = harm)) +
        geom_bar(position = "stack", stat = "identity") +
        ggtitle("Weather Event Related Impact to Human Health (1999 - 2011)") +
        xlab("Event Type") +
        ylab("Count of Injuries and Fatalities") +
        theme(plot.title = element_text(hjust = 0.5),
              axis.text.x = element_text(angle = 45, hjust = 1)) +
        scale_fill_discrete(name = "Human Impact", labels = c("Fatalities", "Injuries"))

HHCosts
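
The selection logic used for the chart (top events by fatalities, top events by injuries, then the union of the two lists) can be illustrated on a small table. This is a base R sketch with placeholder event names and made-up counts; the helper top_n_by is hypothetical, and it takes the top two rather than the top five so that the toy table stays small.

```r
# Illustrative totals only; these are not figures from the Storm Database
totals <- data.frame(
  EVTYPE   = c("EVENT_A", "EVENT_B", "EVENT_C", "EVENT_D"),
  deaths   = c(50, 40, 5, 1),
  injuries = c(10, 300, 200, 2)
)

# Hypothetical helper: names of the n event types with the largest value in `column`
top_n_by <- function(df, column, n) {
  ranked <- df[order(df[[column]], decreasing = TRUE), ]
  as.character(head(ranked$EVTYPE, n))
}

top_fatal  <- top_n_by(totals, "deaths", 2)
top_injury <- top_n_by(totals, "injuries", 2)

# union() keeps each event type once, even if it appears in both lists
top_events <- union(top_fatal, top_injury)
print(top_events)
```

Because an event type can rank highly on both measures, the union may contain fewer entries than the two lists combined.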

Analyzing Weather Event Economic Impact

The provided data includes property damage and crop damage cost estimates attributed to each weather event. To determine which weather events had the most significant economic impacts, we will consider the total cost of property damage and crop damage over the given time period attributed to each type of weather event. We will then identify the top five types of weather events with respect to total property damage, along with the top five with respect to total crop damage. The union of these two sets of events will be used to explore the economic impact.

The following stacked bar chart identifies the top weather events in terms of economic impact.

stormLoss <- stormData %>% 
    group_by(EVTYPE) %>% 
    summarize(sPropDamage = sum(propDmgTot, na.rm = TRUE),
              sCropDamage = sum(cropDmgTot, na.rm = TRUE))

topPropDmgEvents <- stormLoss %>% 
    arrange(desc(sPropDamage)) %>% 
    slice(1:5) %>% 
    select(EVTYPE)

topCropDmgEvents <- stormLoss %>% 
    arrange(desc(sCropDamage)) %>% 
    slice(1:5) %>% 
    select(EVTYPE)

topEconEvents <- union(topPropDmgEvents$EVTYPE, topCropDmgEvents$EVTYPE)

EconCosts <- stormLoss %>% 
    filter(EVTYPE %in% topEconEvents) %>% 
    gather("damage", "cost", -EVTYPE) %>% 
    ggplot(aes(x = reorder(EVTYPE, -cost), y = cost, fill = damage)) +
        geom_bar(position = "stack", stat = "identity") +
        ggtitle("Weather Event Related Economic Impact (1999 - 2011)") +
        xlab("Event Type") +
        ylab("Estimated Financial Impact (USD)") +
        theme(plot.title = element_text(hjust = 0.5),
              axis.text.x = element_text(angle = 45, hjust = 1)) +
        scale_fill_discrete(name = "Economic Impact", labels = c("Crop", "Property"))

EconCosts
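
The height of each stacked bar is the sum of property and crop damage for that event type. The following base R sketch shows one way to compute and rank that combined total, using placeholder event names and made-up dollar amounts; the names events and totals are chosen for illustration and do not appear in the analysis.

```r
# Illustrative per-event damage in dollars; not figures from the Storm Database
events <- data.frame(
  EVTYPE     = c("EVENT_A", "EVENT_B", "EVENT_A", "EVENT_B", "EVENT_C"),
  propDmgTot = c(100, 20, 50, 30, 0),
  cropDmgTot = c(10, 200, 0, 100, 500)
)

# Sum both damage columns within each event type
totals <- aggregate(cbind(propDmgTot, cropDmgTot) ~ EVTYPE,
                    data = events, FUN = sum)

# Combined economic cost, then rank from largest to smallest
totals$total <- totals$propDmgTot + totals$cropDmgTot
totals <- totals[order(totals$total, decreasing = TRUE), ]
print(totals)
```

In this toy table the event type with no property damage ranks first on the combined total, which is why the analysis considers the two damage types separately before taking the union.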

Summary

The storm data provided by NOAA was assessed to determine the human health and financial impacts due to weather events. The analysis indicated that tornados caused the greatest human health impact in terms of fatalities and injuries from 1999 to 2011. The greatest economic impact over the same period of time was the result of floods.