Synopsis

This report analyzes the NOAA Storm Database to answer two questions:

  1. Across the United States, which types of events are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

To conduct this analysis, we first process and tidy the raw data, which covers major storms and weather events from 1950 through November 2011. The data record when and where each event occurred, as well as estimates of any fatalities, injuries, and property damage.


Setup

library(dplyr)
library(knitr)
library(tools)
library(ggplot2)
library(stringdist)

Data Processing

For the data processing step, we load the data using the read.csv function and convert the resulting data frame into a dplyr tibble. Only the columns needed for the analysis are kept. The EVTYPE column is converted to title case and cleaned of digits and stray whitespace, and near-duplicate event names (those within one edit of each other) are merged.
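The event-name cleanup can be sketched on a toy vector. This is a minimal illustration using base R only; the input names are hypothetical and the steps are applied in a slightly different order than in the report's own code.

```r
# Hypothetical raw event names, not taken from the storm data
rawEvents <- c("  TSTM WIND ", "high  wind 48", "Flash Flood")

# Lower-case, drop digits, collapse runs of whitespace, then trim both ends
trimmed <- tolower(rawEvents)
trimmed <- gsub("[[:digit:]]+", "", trimmed)
trimmed <- gsub("\\s+", " ", trimmed)
trimmed <- trimws(trimmed)

# Convert to title case with tools::toTitleCase, the function the report uses
cleanEvents <- tools::toTitleCase(trimmed)
print(cleanEvents)
```

After this step, variants that differ only in case, spacing, or trailing numbers collapse to the same label, which matters because all later summaries group by EVTYPE.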

To assess the economic damage, we use four columns provided in stormData, namely PROPDMG, PROPDMGEXP, CROPDMG and CROPDMGEXP. The DMG columns hold the numeric part of each dollar amount, while the corresponding DMGEXP columns hold a character code for its magnitude (for example, K for thousands). To make the figures comparable across rows, we replace each character code with its numeric multiplier so that the actual dollar amount can be computed.

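Some of the older records contain improperly coded exponents, which leads to inaccurate damage figures. After converting the codes to upper case, we keep only the rows where both DMGEXP columns are either empty or in the set {‘H’, ‘K’, ‘M’, ‘B’}. An empty code is treated as a multiplier of 1.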
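As a worked illustration of the exponent conversion, the sketch below maps codes to multipliers with a named lookup vector. The toy records and the names `multipliers`, `MULT` and `PROPDMG_USD` are assumptions made for this example, and the lookup is an alternative to the repeated replace calls used in the report, not the report's own code.

```r
# Hypothetical toy records, not rows from the storm data
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 10, 3),
  PROPDMGEXP = c("K", "m", "", "B"),
  stringsAsFactors = FALSE
)

# Lookup from exponent code to multiplier; the empty code is handled separately
multipliers <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

code <- toupper(toy$PROPDMGEXP)
toy$MULT <- ifelse(code == "", 1, unname(multipliers[code]))

# Actual dollar amount = numeric part * multiplier
toy$PROPDMG_USD <- toy$PROPDMG * toy$MULT
print(toy)
```

Here 25 with code K becomes 25,000 dollars and 2.5 with code m becomes 2,500,000 dollars, while the row with an empty code keeps its value of 10.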

stormData <- read.csv("repdata%2Fdata%2FStormData.csv.bz2", sep = ",", 
                      header = TRUE, na.strings = "NA")
stormData <- as_tibble(stormData)  # tbl_df() is deprecated in current dplyr

# Remove unused columns from the data frame

stormData <- stormData %>% select(EVTYPE, FATALITIES, INJURIES, 
                                  PROPDMG, PROPDMGEXP, CROPDMG,
                                  CROPDMGEXP)

# Clean up the EVTYPE data for excess white spaces, spelling errors, etc.

stormData$EVTYPE <- as.factor(toTitleCase(tolower(as.character(stormData$EVTYPE))))

# Remove extra whitespaces and numeric values from event names

trim <- function (x) gsub("^\\s+|\\s+$", "", x)
rmwhite <- function (x) gsub("\\s+", " ", x)
rmdigits <- function (x) gsub('[[:digit:]]+', '', x)

stormData$EVTYPE <- trim(rmwhite(rmdigits(stormData$EVTYPE)))

# Remove rows where EVTYPE is a Summary entry

stormData <- filter(stormData, !grepl("^Summary", EVTYPE))

# Create a distance matrix of names to find similar strings (misspelled, etc.)
# Repeat until no pair of names with edit distance <= 1 remains

repeat {
    
    uniqueEvents <- unique(rmwhite(trim(stormData$EVTYPE)))
    distMatrix <- adist(uniqueEvents, uniqueEvents)
    distMatrix <- ifelse(distMatrix > 1, 0, distMatrix)
    distMatrix[lower.tri(distMatrix, diag = TRUE)] <- 0 
    matchingPairs <- which(distMatrix != 0, arr.ind = TRUE)
    similarStrings <- as_tibble(data.frame(uniqueEvents[matchingPairs[,1]],
                                           uniqueEvents[matchingPairs[,2]]))
    
    if(nrow(similarStrings) == 0) { # No more similar strings found
        break
    }
    
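    names(similarStrings) <- c("Match1", "Match2")
    similarStrings <- arrange(similarStrings, Match1)
    similarStrings$Match1 <- as.character(similarStrings$Match1)
    similarStrings$Match2 <- as.character(similarStrings$Match2)
    # Keep one replacement per name so the join below cannot duplicate rows
    similarStrings <- distinct(similarStrings, Match2, .keep_all = TRUE)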
    
    stormData <- left_join(stormData, similarStrings, by = c("EVTYPE" = "Match2"))
    stormData[!is.na(stormData$Match1), "EVTYPE"] <- stormData[!is.na(stormData$Match1), "Match1"]
    stormData <- stormData %>% select(-Match1)
    
}

# Remove improper data and replace the DMGEXP string values with respective numeric values 

stormData$PROPDMGEXP <- toupper(as.character(stormData$PROPDMGEXP))
stormData$CROPDMGEXP <- toupper(as.character(stormData$CROPDMGEXP))

stormData <- stormData %>% 
    filter(PROPDMGEXP %in% c("H", "K", "M", "B", "")) %>% 
    filter(CROPDMGEXP %in% c("H", "K", "M", "B", ""))

stormData$PROPDMGEXP <- replace(stormData$PROPDMGEXP, stormData$PROPDMGEXP == 'H', 1e+02)
stormData$PROPDMGEXP <- replace(stormData$PROPDMGEXP, stormData$PROPDMGEXP == 'K', 1e+03)
stormData$PROPDMGEXP <- replace(stormData$PROPDMGEXP, stormData$PROPDMGEXP == 'M', 1e+06)
stormData$PROPDMGEXP <- replace(stormData$PROPDMGEXP, stormData$PROPDMGEXP == 'B', 1e+09)
stormData$PROPDMGEXP <- replace(stormData$PROPDMGEXP, stormData$PROPDMGEXP == '', 1)
stormData$PROPDMGEXP <- as.numeric(stormData$PROPDMGEXP)

stormData$CROPDMGEXP <- replace(stormData$CROPDMGEXP, stormData$CROPDMGEXP == 'H', 1e+02)
stormData$CROPDMGEXP <- replace(stormData$CROPDMGEXP, stormData$CROPDMGEXP == 'K', 1e+03)
stormData$CROPDMGEXP <- replace(stormData$CROPDMGEXP, stormData$CROPDMGEXP == 'M', 1e+06)
stormData$CROPDMGEXP <- replace(stormData$CROPDMGEXP, stormData$CROPDMGEXP == 'B', 1e+09)
stormData$CROPDMGEXP <- replace(stormData$CROPDMGEXP, stormData$CROPDMGEXP == '', 1)
stormData$CROPDMGEXP <- as.numeric(stormData$CROPDMGEXP)
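The near-duplicate merging performed by the repeat loop earlier in this section relies on edit distance. A minimal sketch of a single pass on hypothetical event names, using utils::adist from base R, might look like this:

```r
# Hypothetical event names containing near-duplicates
events <- c("Thunderstorm Wind", "Thunderstorm Winds",
            "Flash Flood", "Flash Floods", "Hail")

# Pairwise generalized Levenshtein distances between all names
d <- adist(events)

# Keep each pair once (upper triangle) where the names differ by one edit
pairs <- which(d == 1 & upper.tri(d), arr.ind = TRUE)

merged <- data.frame(keep    = events[pairs[, "row"]],
                     replace = events[pairs[, "col"]],
                     stringsAsFactors = FALSE)
print(merged)
```

A one-edit threshold is a blunt rule: it catches plurals and single-letter typos, but it can also merge names that are genuinely different, so the resulting pairs are worth inspecting before they are applied.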

Assessing Human Damage

To assess the damage to human health, we use two values provided in the database, namely human fatalities (FATALITIES) and human injuries (INJURIES). First, we assess whether these two measures of human health damage are correlated with each other across types of events.

humanDamage <- stormData %>% select(EVTYPE, FATALITIES, INJURIES) %>% 
    group_by(EVTYPE) %>% 
    summarize(AVG_FATALITIES = mean(FATALITIES), 
              AVG_INJURIES = mean(INJURIES)) 

# Find events where average fatalities or injuries are above the 98th percentile

maxHumanDamage <- humanDamage %>% 
    filter(AVG_FATALITIES > quantile(humanDamage$AVG_FATALITIES, 0.98) 
           | AVG_INJURIES > quantile(humanDamage$AVG_INJURIES, 0.98)) %>%
    arrange(desc(AVG_FATALITIES + AVG_INJURIES))

orderedEVTYPE <- with(maxHumanDamage, unique(EVTYPE[order(desc(AVG_FATALITIES + AVG_INJURIES))]))
maxHumanDamage$EVTYPE <- with(maxHumanDamage, factor(EVTYPE, levels = orderedEVTYPE, ordered = TRUE))

ggplot(maxHumanDamage, aes(AVG_INJURIES, AVG_FATALITIES)) +
    geom_point(shape = 16, size = 5, aes(col = EVTYPE), alpha = 0.7, show.legend = TRUE) + 
    geom_smooth(method = "loess", se = FALSE) + theme_minimal() + 
    theme(legend.position = "bottom", legend.box.background = element_rect(), 
          plot.title = element_text(hjust = 0.5)) +
    labs(title = "Correlation between Fatalities & Injuries Across Weather Events",
         x = "Average Injuries", y = "Average Fatalities", col = "Most Damaging Events")

The plot above shows no clear correlation between fatalities and injuries across the different types of major storms and weather events. Certain destructive events cause mostly injuries, such as Wild Fires and Thunderstorms. Conversely, certain events cause high fatalities but relatively few injuries, such as Tornadoes, Hail, and Cold and Snow.
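A visual reading of a scatter plot can be complemented with a correlation coefficient. The sketch below uses hypothetical per-event averages (not values from the storm data) and base R's cor function; the Spearman variant is included because a few extreme event types can dominate the Pearson coefficient.

```r
# Hypothetical per-event averages, for illustration only
avgDamage <- data.frame(
  EVTYPE         = c("Event A", "Event B", "Event C", "Event D", "Event E"),
  AVG_FATALITIES = c(0.10, 2.00, 0.50, 1.20, 0.05),
  AVG_INJURIES   = c(5.0, 1.0, 0.2, 8.0, 0.3)
)

# Linear (Pearson) and rank-based (Spearman) correlation across event types
pearson  <- cor(avgDamage$AVG_FATALITIES, avgDamage$AVG_INJURIES)
spearman <- cor(avgDamage$AVG_FATALITIES, avgDamage$AVG_INJURIES,
                method = "spearman")

print(c(pearson = pearson, spearman = spearman))
```

Values near zero would be consistent with the absence of a clear relationship, while values near 1 or -1 would contradict it.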

To assess human health damage in a way that accounts for both fatalities and injuries, we need a single metric that combines the two values. For this purpose, we create a new metric called Health Damage. Because a fatality is far more severe than an injury, the metric weights each fatality ten times as heavily as an injury:

Health Damage = 10 x Fatalities + Injuries
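To make the effect of the weighting concrete, here is a small worked example on hypothetical averages. The event names and the variable `fatalityWeight` are assumptions for illustration; the weight of 10 is the one stated in the formula above.

```r
# Hypothetical per-event averages, for illustration only
toy <- data.frame(
  EVTYPE         = c("Event A", "Event B", "Event C"),
  AVG_FATALITIES = c(0.5, 0.1, 0.0),
  AVG_INJURIES   = c(1, 6, 4),
  stringsAsFactors = FALSE
)

fatalityWeight <- 10  # weight from the Health Damage formula

# Health Damage = 10 x Fatalities + Injuries
toy$HEALTH_DAMAGE <- fatalityWeight * toy$AVG_FATALITIES + toy$AVG_INJURIES

# Rank events from most to least damaging
toy <- toy[order(-toy$HEALTH_DAMAGE), ]
print(toy)
```

Event A has the fewest injuries but ranks second (score 6) because its fatalities are weighted heavily, just behind Event B (score 7) and ahead of Event C (score 4). The ranking is sensitive to the chosen weight, so a different weight could reorder the events.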

# Take the top 10 most damaging events for human health 

maxHumanDamage <- maxHumanDamage %>% 
    mutate(AVG_HEALTH_DAMAGE = 10 * AVG_FATALITIES + AVG_INJURIES) %>%
    arrange(desc(AVG_HEALTH_DAMAGE)) %>%
    slice(1:10)

According to our analysis, the 10 most damaging types of weather events, in descending order of human health damage, are listed below.


Assessing Economic Damage

Economic damage for each event can be assessed as the sum of the dollar values of its property damage and crop damage. We calculate this total for each event and store it in a separate column called ECONDMG. We then select the event types with the highest average damage to complete the analysis.
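The calculation can be sketched on a few hypothetical records using base R. The rows below are invented for illustration, and the EXP columns are assumed to already hold numeric multipliers, as they do after the data processing step.

```r
# Hypothetical records with numeric multipliers already in the EXP columns
toy <- data.frame(
  EVTYPE     = c("Flood", "Flood", "Hail"),
  PROPDMG    = c(2, 4, 10),
  PROPDMGEXP = c(1e6, 1e6, 1e3),
  CROPDMG    = c(1, 0, 5),
  CROPDMGEXP = c(1e6, 1, 1e3),
  stringsAsFactors = FALSE
)

# Total economic damage per record: property plus crop damage in dollars
toy$ECONDMG <- with(toy, PROPDMG * PROPDMGEXP + CROPDMG * CROPDMGEXP)

# Average damage per event type, sorted from highest to lowest
avgDamage <- aggregate(ECONDMG ~ EVTYPE, data = toy, FUN = mean)
avgDamage <- avgDamage[order(-avgDamage$ECONDMG), ]
print(avgDamage)
```

In this toy data the two Flood records total 3 and 4 million dollars, giving an average of 3.5 million, while the single Hail record averages 15,000 dollars.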

econDamage <- stormData %>% select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)

# Compute total economic damage for each event and find the event-wise average
# Select the top 10 most damaging events

maxEconDamage <- econDamage %>% 
    mutate(ECONDMG = PROPDMG * PROPDMGEXP + CROPDMG * CROPDMGEXP) %>%
    select(EVTYPE, ECONDMG) %>%
    group_by(EVTYPE) %>%
    summarize(AVG_DAMAGE = mean(ECONDMG)) %>%
    arrange(desc(AVG_DAMAGE)) %>%
    slice(1:10)

orderedEVTYPE <- with(maxEconDamage, unique(EVTYPE[order(desc(AVG_DAMAGE))]))
maxEconDamage$EVTYPE <- with(maxEconDamage, factor(EVTYPE, levels = orderedEVTYPE, ordered = TRUE))

ggplot(maxEconDamage, aes(EVTYPE, AVG_DAMAGE / 1e+06)) +
    geom_bar(stat = "identity", aes(fill = EVTYPE), alpha = 0.7) + 
    theme_minimal() + 
    theme(legend.position = "bottom", 
          legend.box.background = element_rect(), 
          plot.title = element_text(hjust = 0.5), 
          axis.title.x=element_blank(),
          axis.text.x=element_blank(), 
          axis.ticks.x=element_blank()) +
    labs(title = "Highest Average Economic Damage Events",
         x = "", y = "Average Damage (Million $)", fill = "Most Damaging Events")

According to our analysis, the 10 most damaging types of weather events, in descending order of average economic damage, are listed below.


Results

According to our analysis, the events that are most damaging to human health are largely distinct from those with the greatest economic impact. The former set comprises a number of extreme temperature events such as Cold and Snow, Record/Excessive Heat, Heat Wave Drought and Extreme Heat. The latter set comprises extreme weather events such as Heavy Rain/Severe Weather, Hurricane/Typhoon, Storm Surge and Severe Thunderstorm.

Finally, there are two events that are highly destructive in terms of both human health and economic damage. These two types of weather events are listed below.
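Events that appear in both rankings can be found with a set intersection. The sketch below uses placeholder event names, since it is only meant to show the operation; in the report the two inputs would be the EVTYPE columns of the two top-10 tables.

```r
# Placeholder top lists, not the report's actual results
healthTop <- c("Event A", "Event B", "Event C", "Event D")
econTop   <- c("Event C", "Event E", "Event A", "Event F")

# Events present in both lists, in the order they appear in healthTop
both <- intersect(healthTop, econTop)
print(both)
```

If EVTYPE is stored as a factor, converting it with as.character before calling intersect avoids comparing the underlying integer codes.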