Exploring 60 years of NOAA Storm Data (1950-2011)

Date: July 15, 2021

Author: Nikhil D.

This analysis explores the Storm Event Database from NOAA in an effort to answer two main questions:

1) Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

2) Across the United States, which types of events have the greatest economic consequences?

Synopsis

This data was obtained from the NWS NOAA Storm Event Database via Coursera links. Consisting of nearly 1 million observations across 37 variables, this data was further subset to first evaluate the fatality and injury statistics over the years. To answer the first question, an arbitrary ‘harm index’ was created to combine the effects of injury and death into a single measure, by weighting an injury 25% as ‘harmful’ as a fatality.

For efficiency and at the expense of overall accuracy, the event types were left as-is (variable EVTYPE), but due to the inconsistent data entry formats over the years, data consolidation in EVTYPE should be considered for future studies. For example, some entries combine events, misspell, abbreviate, and pluralize inconsistently (ie: “TORNADOES, TSTM WIND, HAIL”).

For the second question, the “PROPDMG/EXP,CROPDMG/EXP” columns were transformed into numerical data, such that a total damage cost could be calculated across property and crops, and summarized across event types.

And finally for both questions, results were plotted showing the 10 most ‘harmful’ and economically consequential storm events. From these results, the “TORNADO” and “FLOOD” event types were the most harmful and highest economically consequential events, respectively.

For future studies, it may be too time consuming to split EVTYPE data for combined entries and determine the true event contribution breakdown in the data, however, one could consolidate event data based on the presence of a single string, ie: “FLOOD” or “TORNADO” in any part of the entry. This data would be double counted in some instances, but might better represent true extent of the damage.

Data Processing

Loading the data

library(dplyr)

## Warning: package 'dplyr' was built under R version 4.0.5

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)

## Warning: package 'tidyr' was built under R version 4.0.5

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 4.0.5

library(lubridate)

## Warning: package 'lubridate' was built under R version 4.0.5

## 
## Attaching package: 'lubridate'

## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union

library(ggpubr)

## Warning: package 'ggpubr' was built under R version 4.0.5

# Set setwd() to working directory containing data
setwd("C:/Users/nkd29/Desktop/Nikhils_Stuff/01_Career/03_Courses/Coursera_R_Data_Science_Johns_Hopkins/05_Repoducible_Research/Project_2")

data <- tbl_df(read.csv('repdata_data_StormData.csv.bz2'))

## Warning: `tbl_df()` was deprecated in dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.

Preprocessing the data

Question 1: Across the US, which types of events (based on EVTYPE variable) are most harmful with respect to population health?

# data subset and summed for each event type
df_harm <- data %>% 
  group_by(EVTYPE) %>% 
  select(EVTYPE,PROPDMG,FATALITIES,INJURIES,STATE,COUNTYNAME) %>% 
  summarise(sum(FATALITIES),sum(INJURIES))# %>% 
  # arrange(desc(`sum(FATALITIES)`))

# check the max values for death and injury
max(df_harm$tot_deaths)

## Warning: Unknown or uninitialised column: `tot_deaths`.

## Warning in max(df_harm$tot_deaths): no non-missing arguments to max; returning -
## Inf

## [1] -Inf

max(df_harm$tot_inj)

## Warning: Unknown or uninitialised column: `tot_inj`.

## Warning in max(df_harm$tot_inj): no non-missing arguments to max; returning -Inf

## [1] -Inf

# columns variable
cols <- names(df_harm)

# rename cols and make an (arbitrary) harm index (injuries weighted as 25% of deaths)
df_harm <- df_harm %>% 
  rename(tot_deaths = cols[2], tot_inj = cols[3]) %>% 
  mutate(harm_index = (tot_deaths + 0.25*tot_inj)/3000)     

# df_harm <- head(df_harm,10)

Question 2: Across the US, which types of events have the greatest economic consequences?

# group data and subset
df_dmg <- data %>% 
  group_by(EVTYPE) %>% 
  select(EVTYPE,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP,FATALITIES,INJURIES,STATE,COUNTYNAME) 

# Cleanup: recode the abbreviations for PROP/CROP damage to actual values to calculate economic costs totals
# The symbols -+? or blanks are assumed to have EXP = 1 (exponent)
df_dmg$PROPDMGEXP <- dplyr::recode(df_dmg$PROPDMGEXP,
                                            "H"=1e2,"h"=1e2,
                                            "K"=1e3,"k"=1e3,
                                            "M"=1e6,"m"=1e6,
                                            "B"=1e9,"b"=1e9,
                                            "0"=1e0,"1"=1e1,"2"=1e2,"3"=1e3,"4"=1e4,
                                            "5"=1e5,"6"=1e6,"7"=1e7,"8"=1e8,"9"=1e9, .default=1e0) #,
                                            # ''=1e0,'-'=1e0,'+'=1e0,'?'=1e0)

df_dmg$CROPDMGEXP <- dplyr::recode(df_dmg$CROPDMGEXP,
                                            "H"=1e2,"h"=1e2,
                                            "K"=1e3,"k"=1e3,
                                            "M"=1e6,"m"=1e6,
                                            "B"=1e9,"b"=1e9,
                                            "0"=1e0,"1"=1e1,"2"=1e2,"3"=1e3,"4"=1e4,
                                            "5"=1e5,"6"=1e6,"7"=1e7,"8"=1e8,"9"=1e9, .default=1e0)


# calculate total economic damage in new col
df_dmg$tot_dmg_usd <- df_dmg$PROPDMG*df_dmg$PROPDMGEXP+df_dmg$CROPDMG*df_dmg$CROPDMGEXP



# summarise with new df, the total economic cost by event type
df_summ <- df_dmg %>% summarise(sum(tot_dmg_usd)/1e9) 
df_summ <- df_summ %>% rename(Total_Cost_USD_Billions = names(df_summ)[2])

Results

Plot for Q1

# plot total deaths
p1 <- ggplot(data = head(arrange(df_harm,desc(tot_deaths)),10)) +
  geom_col(aes(x = reorder(EVTYPE,-tot_deaths), y = tot_deaths)) +
  ggtitle("Top 10 Causes of Death from 1951-2001") +
  xlab("Event Type") +
  ylab("Total Deaths") +
  scale_y_continuous(breaks = seq(0,6000,1000)) +
  theme(axis.text.x = element_text(angle = 45, hjust=1),
        plot.title = element_text(hjust=0.5,size=12))
  
p2 <- ggplot(data = head(arrange(df_harm,desc(tot_inj)),10)) +
  geom_col(aes(x = reorder(EVTYPE,-tot_inj), y = tot_inj)) +
  ggtitle("Top 10 Causes of Injury from 1951-2001") +
  xlab("Event Type") +
  ylab("Total Injuries") +
  scale_y_continuous(breaks = seq(0,100000,10000)) +
  theme(axis.text.x = element_text(angle = 45, hjust=1),
        plot.title = element_text(hjust=0.5,size=12))

ggarrange(p1, p2, nrow = 1)

p3 <- ggplot(data = head(arrange(df_harm,desc(harm_index)),10)) +
  geom_col(aes(x = reorder(EVTYPE,-harm_index), y = harm_index)) +
  ggtitle("Top 10 Causes of Injury, by Harm Index (1951-2001)") +
  xlab("Event Type") +
  ylab("Harm Index 1-10\n(Death & Injury)") +
  scale_y_continuous(breaks = seq(0,10,1)) +
  theme(axis.text.x = element_text(angle = 45, hjust=1),
        axis.title.y = element_text(angle=0,vjust=0.5),
        plot.title = element_text(hjust=0.5,size=12))
p3

Plot for Q2

p4 <- ggplot(data = head(arrange(df_summ,desc(Total_Cost_USD_Billions)),10)) +
  geom_col(aes(x = reorder(EVTYPE,-Total_Cost_USD_Billions), y = Total_Cost_USD_Billions)) +
  ggtitle("Top 10 Storm-Events with Greatest Economic Impact (NOAA 1951-2001)") +
  xlab("Event Type") +
  ylab("Cost\n(Billions $ USD)") +
  theme(axis.text.x = element_text(angle = 45, hjust=1),
        axis.title.y = element_text(angle=0,vjust=0.5),
        plot.title = element_text(hjust=0.5,size=12))

p4

Conclusion

Based on the results of plots for Q1, it is clear that “TORNADO” is the most harmful event type, but as mentioned earlier, the EVTYPE data is full of errors and inconsistencies that can be cleaned up further.

Based on the plots for Q2, the “FLOOD” EVTYPE was clearly the most economically consequenc\tial, but cleaning up the EVTYPE may also help to increase accuracy of the damage cost estimates in the results.