Introductory Analysis of The U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database

Reproducible Research Peer Assessment 2

Introduction

This report uses the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, available online as stormdata.csv.bz2. The raw dataset is large and unwieldy, so the transformation steps and analysis are all documented below.

Objective

With the dataset provided earlier, two main questions are to be answered.

1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
2. Across the United States, which types of events have the greatest economic consequences?

Additionally, the dataset required many transformations to filter its large number of variables and records down to the data actually needed to answer these questions. Ultimately, two figures are provided to support the results.

Data Cleaning & Processing

Downloading and Importing:

The dataset is available as stormdata.csv.bz2, as noted in the Introduction section of this document. The code below downloads the file from its URL into your current working directory under the name stormdata.bz2, and the rest of the script reads it from there.

The file is downloaded in compressed form. The read.csv call below wraps the file in bzfile to decompress it. This is redundant, since read.csv can read bzip2-compressed files directly when given the file name, but I include the call to bzfile to make the decompression explicit. The script works either way.

Note: Downloading and importing the data takes some time, so please be patient. The raw dataset is large.

#Attach Necessary Libraries:
library(dplyr); library(knitr); library(rmarkdown); library(ggplot2); library(gridExtra); library(lubridate); library(stringr)

#Download File & Unzip/Import Data:
fileURL <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileURL, "stormdata.bz2") 
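#stringsAsFactors = TRUE keeps text columns as factors (the default before R 4.0), which the levels() calls below rely on:
stormdata <- read.csv(bzfile("stormdata.bz2"), stringsAsFactors = TRUE)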
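
As a minimal sketch of the point about bzfile being optional, the snippet below writes a tiny bzip2-compressed CSV to a temporary file and reads it back both ways. The sample rows and the temporary file are made up for illustration and are not part of the storm data.

```r
# Write a tiny CSV through a bzip2 connection (hypothetical sample rows).
tmp <- tempfile(fileext = ".csv.bz2")
sample_rows <- data.frame(EVTYPE = c("TORNADO", "HAIL"), FATALITIES = c(0, 1))
con <- bzfile(tmp, open = "w")
write.csv(sample_rows, con, row.names = FALSE)
close(con)

# Read it back with an explicit bzfile() connection, as in the script above,
# and with the bare file name, letting R detect the compression itself.
via_bzfile <- read.csv(bzfile(tmp))
via_path   <- read.csv(tmp)

# Both routes should yield the same data frame.
identical(via_bzfile, via_path)
```

Either form should be fine for the real file; the explicit connection simply documents that the input is compressed.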

Creating a Tidy Dataset:

You will notice the size of the stormdata data frame and may be a little wary. Don’t stress: the following few lines create filtered data frames that are much more usable and include only the needed data.

First, the data is filtered to include only the columns needed for the analysis: population health, economic damage, event types and dates.

stormData2 <- subset(stormdata, select = c(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
stormData2$BGN_DATE <- year(as.Date(stormData2$BGN_DATE, format = "%m/%d/%Y"))
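
The date conversion above can be sketched on a few sample strings. This assumes the raw BGN_DATE values look like "4/18/1950 0:00:00" (month/day/year followed by a constant time); the three strings below are invented for illustration. The sketch uses base R's format() in place of lubridate's year() so that it runs without extra packages.

```r
# Hypothetical raw values in the assumed month/day/year + time layout.
raw_dates <- c("4/18/1950 0:00:00", "11/15/1965 0:00:00", "7/4/2011 0:00:00")

# The format string consumes month/day/year; the trailing time text is ignored.
parsed <- as.Date(raw_dates, format = "%m/%d/%Y")

# Base-R stand-in for lubridate::year(): format the date as a four-digit year.
years <- as.integer(format(parsed, "%Y"))
years
```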

Next, I created a cutoff year that limits the number of records included in the analysis. The earliest years of the records tend to be incomplete and sparse, so they could introduce noise into the analysis; I therefore chose 1965 as the cutoff. 1965 is the 80% mark of the years included in the records. This date is arbitrary and can be changed as needed.

#Let's consider only the more recent years of the records to avoid the early records that may skew our data, I used 1965 as an arbitrary cutoff date for the records. This can be changed:
cutoffdate <- 1965
stormDataAdjusted <- filter(stormData2, BGN_DATE >= cutoffdate)
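
One way to judge a cutoff is to check how many records it keeps and how thinly the early years are populated. The sketch below does this on a made-up vector of years; the counts are purely hypothetical and do not describe the storm data.

```r
# Hypothetical year column: few early records, many recent ones.
years <- c(rep(1950, 2), rep(1964, 3), rep(1965, 5), rep(1990, 20), rep(2011, 20))
cutoffdate <- 1965

# Fraction of records that survive the cutoff.
share_kept <- mean(years >= cutoffdate)

# Record counts per year, to see how sparse the early years are.
records_per_year <- table(years)

share_kept
records_per_year
```

Running the same two lines on stormData2$BGN_DATE would show the effect of any candidate cutoff before committing to it.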

The resulting data frame, stormDataAdjusted, is used for the analysis, and the aggregated datasets are built from it. A sample of the data is provided below.

head(stormDataAdjusted, 3)

##   BGN_DATE    EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 1     1965   TORNADO          0        0      25          K       0
## 2     1965 TSTM WIND          0        0       0                  0
## 3     1965   TORNADO          0       18     250          K       0
##   CROPDMGEXP
## 1           
## 2           
## 3

Consolidating Event Types:

At this stage of the data cleaning process, there are still a vast number of event types listed under the EVTYPE variable in our data frame.

length(levels(stormDataAdjusted$EVTYPE))

## [1] 985
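
A caveat on this count: levels() reports every level a factor was created with, whether or not any rows still use it, so the figure may include event types that only occur before the cutoff year. The toy factor below is hypothetical and only illustrates the difference between declared and present levels.

```r
# Toy factor standing in for EVTYPE.
evtype <- factor(c("TORNADO", "HAIL", "TORNADO", "FLOOD"))

# Subsetting keeps every original level, even ones with no rows left.
recent <- evtype[evtype != "FLOOD"]

n_declared <- length(levels(recent))              # still counts "FLOOD"
n_present  <- length(levels(droplevels(recent)))  # only levels that occur

c(declared = n_declared, present = n_present)
```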

Many of the 985 levels can be collapsed into common groups. The following code uses keywords to sort the event types into 8 main categories: Precipitation & Flood, Heatwave & Drought, Snow/Ice & Avalanche, Volcanic, Oceanic, Storm & Wind, Landslide and Fire. Each assignment overwrites the earlier ones, so an event type that matches several keyword lists ends up in the last matching category.

#Create new column to categorize the events:

stormDataAdjusted <- mutate(stormDataAdjusted, Event = NA)

stormDataAdjusted[grep("rain|fog|drizzle|flood|flash flood|rainfall|hail|shower|cloud", stormDataAdjusted$EVTYPE, ignore.case = TRUE), "Event"] <- "Precipitation & Flood"

stormDataAdjusted[grep("extreme heat|heat|heatwave|drought|dry|hot|warm|warmth|unusually warm|hyperthermia", stormDataAdjusted$EVTYPE, ignore.case = TRUE), "Event"] <- "Heatwave & Drought"

stormDataAdjusted[grep("snow|ice|avalanche|icy|hypothermia|blizzard|heavy snow|glaze|frost|freeze", stormDataAdjusted$EVTYPE, ignore.case = TRUE), "Event"] <- "Snow/Ice & Avalanche"

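#"Event" must be named as the target column, otherwise every column of the matched rows is overwritten; "\\bash" keeps "ash" from matching inside "FLASH FLOOD":
stormDataAdjusted[grep("volcano|erupt|eruption|\\bash|volcanic|ashfall|plume|vog", stormDataAdjusted$EVTYPE, ignore.case = TRUE), "Event"] <- "Volcanic"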

stormDataAdjusted[grep("tidal|ocean|waves|tidal flooding|surf|high surf|coast|coastal|tsunami|beach|erosion|cstl", stormDataAdjusted$EVTYPE, ignore.case = TRUE), "Event"] <- "Oceanic"

stormDataAdjusted[grep("tornado|wind|wnd|windy|hurricane|tstm|high winds|whirlwind|burst|waterspout|blowing", stormDataAdjusted$EVTYPE, ignore.case = TRUE), "Event"] <- "Storm & Wind"

stormDataAdjusted[grep("landslide|land|mud|mudslide|dam|dam break", stormDataAdjusted$EVTYPE, ignore.case = TRUE), "Event"] <- "Landslide"

stormDataAdjusted[grep("fire|burn|forest fire|fires|forest|red|red flag|smoke", stormDataAdjusted$EVTYPE, ignore.case = TRUE), "Event"] <- "Fire"

#Get rid of the remaining NA values:
stormDataAdjusted <- stormDataAdjusted[complete.cases(stormDataAdjusted[ ,"Event"]), ]
stormDataAdjusted <- select(stormDataAdjusted, -EVTYPE)
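
Keyword matching with grep works on substrings, so a short keyword can match inside a longer, unrelated word, and the order of the assignments decides which category wins. The sketch below illustrates both effects on three sample event names; the vector and the shortened keyword lists are chosen for illustration only.

```r
evtype <- c("FLASH FLOOD", "VOLCANIC ASH", "THUNDERSTORM WIND/HAIL")

# A bare keyword matches anywhere in the string, including inside "FLASH".
loose <- grepl("ash", evtype, ignore.case = TRUE)

# Anchoring the keyword at a word boundary avoids the accidental match.
strict <- grepl("\\bash", evtype, ignore.case = TRUE)

# Later assignments overwrite earlier ones, so pattern order sets precedence.
event <- rep(NA_character_, length(evtype))
event[grepl("flood|hail", evtype, ignore.case = TRUE)] <- "Precipitation & Flood"
event[grepl("wind", evtype, ignore.case = TRUE)] <- "Storm & Wind"

data.frame(evtype, loose, strict, event)
```

Tabulating EVTYPE against the assigned Event on the real data is a quick way to spot similar accidental matches for other short keywords.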

The stormDataAdjusted data frame now has a new column with the consolidated factors:

head(stormDataAdjusted, 3)

##   BGN_DATE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## 1     1965          0        0      25          K       0           
## 2     1965          0        0       0                  0           
## 3     1965          0       18     250          K       0           
##          Event
## 1 Storm & Wind
## 2 Storm & Wind
## 3 Storm & Wind

Creating Exponent Variables and Applying:

For the property and crop damage data, two columns store the exponent that belongs to the value in the corresponding damage column. This exponent tells us whether the damage is in hundreds (H), thousands (K), millions (M) or billions (B) of dollars. The issue with these columns is that they contain many levels, some of which are uninterpretable. The code below therefore replaces the letter codes with numeric exponents and drops rows whose exponent is a meaningless symbol (such as ?, -, +). Blank exponents are intended to count as 0 (a multiplier of 1), although as.numeric() turns a blank string into NA, so such rows need explicit handling to survive the aggregation step.

#Fix the exponents
factors <- levels(stormDataAdjusted$PROPDMGEXP)

##FIX PROPDMGEXP - Replace letters with associated numbers
##(indices assume the factor-level order of the original run: B -> 9, h -> 2, K -> 3, m -> 6, matched case-insensitively)
stormDataAdjusted$PROPDMGEXP <- gsub(factors[14], 9, stormDataAdjusted$PROPDMGEXP, ignore.case = TRUE)
stormDataAdjusted$PROPDMGEXP <- gsub(factors[15], 2, stormDataAdjusted$PROPDMGEXP, ignore.case = TRUE)
stormDataAdjusted$PROPDMGEXP <- gsub(factors[17], 3, stormDataAdjusted$PROPDMGEXP, ignore.case = TRUE)
stormDataAdjusted$PROPDMGEXP <- gsub(factors[18], 6, stormDataAdjusted$PROPDMGEXP, ignore.case = TRUE)

##FIX PROPDMGEXP - Filter out uninterpretable characters
stormDataAdjusted <- filter(stormDataAdjusted, !grepl(factors[2], PROPDMGEXP, fixed = TRUE))
stormDataAdjusted <- filter(stormDataAdjusted, !grepl(factors[3], PROPDMGEXP, fixed = TRUE))
stormDataAdjusted <- filter(stormDataAdjusted, !grepl(factors[4], PROPDMGEXP, fixed = TRUE))
stormDataAdjusted$PROPDMGEXP <- as.character(stormDataAdjusted$PROPDMGEXP)
stormDataAdjusted$PROPDMGEXP <- as.numeric(stormDataAdjusted$PROPDMGEXP)

##FIX CROPDMGEXP - Replace letters with associated numbers
stormDataAdjusted$CROPDMGEXP <- gsub(factors[14], 9, stormDataAdjusted$CROPDMGEXP, ignore.case = T)
stormDataAdjusted$CROPDMGEXP <- gsub(factors[15], 2, stormDataAdjusted$CROPDMGEXP, ignore.case = T)
stormDataAdjusted$CROPDMGEXP <- gsub(factors[17], 3, stormDataAdjusted$CROPDMGEXP, ignore.case = T)
stormDataAdjusted$CROPDMGEXP <- gsub(factors[18], 6, stormDataAdjusted$CROPDMGEXP, ignore.case = T)

##FIX CROPDMGEXP - Filter out uninterpretable characters
stormDataAdjusted <- filter(stormDataAdjusted, !grepl(factors[3], CROPDMGEXP, fixed = TRUE))
stormDataAdjusted$CROPDMGEXP <- as.character(stormDataAdjusted$CROPDMGEXP)
stormDataAdjusted$CROPDMGEXP <- as.numeric(stormDataAdjusted$CROPDMGEXP) 

#Handle the blank values (Had some trouble identifying all blanks as "" or "[:blank:]")
stormDataFiltered <- stormDataAdjusted
stormDataFiltered$PROPDMG <- as.numeric(stormDataFiltered$PROPDMG)
stormDataFiltered$CROPDMG <- as.numeric(stormDataFiltered$CROPDMG)
stormDataFiltered <- filter(stormDataFiltered, !is.na(stormDataFiltered$PROPDMG) & stormDataFiltered$PROPDMG !=0 & !is.na(stormDataFiltered$CROPDMG) & stormDataFiltered$CROPDMG != 0)

#Exponents function to apply the exponents to values in the prop/crop columns
exponents <- function(dmg, exponent){
    dmg * (10^exponent)
}

##Apply function
stormDataFiltered$PROPDMG <- mapply(exponents, stormDataFiltered$PROPDMG, stormDataFiltered$PROPDMGEXP)
stormDataFiltered$CROPDMG <- mapply(exponents, stormDataFiltered$CROPDMG, stormDataFiltered$CROPDMGEXP) 

##Make any character values numerics for processing later 
stormDataAdjusted$INJURIES <- as.numeric(stormDataAdjusted$INJURIES) 
stormDataAdjusted$FATALITIES <- as.numeric(stormDataAdjusted$FATALITIES) 
stormDataAdjusted <- filter(stormDataAdjusted, !is.na(stormDataAdjusted$INJURIES) & stormDataAdjusted$INJURIES != 0 & !is.na(stormDataAdjusted$FATALITIES) & stormDataAdjusted$FATALITIES != 0)

After all of this work, we have interpretable crop and property damage values, each computed as \(x \times 10^{y}\), where \(x\) is the recorded damage and \(y\) is its exponent.
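
The exponent mapping can also be sketched with a named lookup vector instead of positional factor levels. This is an alternative under stated assumptions, not the method used above: the helper name exponent_of is hypothetical, the codes H, K, M and B are assumed to mean 2, 3, 6 and 9, digit codes are kept as they are, blanks are treated as exponent 0, and anything else becomes NA.

```r
# Hypothetical helper: translate an exponent code into a power of ten.
exponent_of <- function(code) {
  code <- toupper(trimws(as.character(code)))
  # Assumed meanings of the letter codes; digits 0-8 map to themselves.
  lookup <- c(H = 2, K = 3, M = 6, B = 9, setNames(0:8, 0:8))
  out <- unname(lookup[code])   # NA for codes missing from the lookup
  out[code == ""] <- 0          # blank exponent: multiply by 10^0 = 1
  out
}

# Made-up damage values and codes, including a blank and a stray symbol.
dmg  <- c(25, 2.5, 10, 1.5, 7)
code <- c("K", "M", "", "b", "?")

total <- dmg * 10^exponent_of(code)
total
```

Because the lookup is keyed by name, it does not depend on the order in which the factor levels happen to be sorted.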

Creating Aggregated Datasets For Analysis:

Using our transformed data frames stormDataAdjusted and stormDataFiltered, the next step is to create aggregated datasets for analysis. The following code processes the data in two ways. First, the injuries and fatalities are summed by event type into one data frame, totalHealth. Next, the property and crop damage columns are summed by event type into a second data frame, totalDamage.

#Drop Exponent Columns (clean data set just for ease of use)
stormDataFiltered <- select(stormDataFiltered, -PROPDMGEXP, -CROPDMGEXP)

#Aggregate Economic & Health (formula passed as the first, unnamed argument)
totalDamage <- aggregate(cbind(CROPDMG, PROPDMG) ~ Event, data = stormDataFiltered, FUN = sum)
totalDamage <- rename(totalDamage, Crops = CROPDMG, Property = PROPDMG)
totalHealth <- aggregate(cbind(INJURIES, FATALITIES) ~ Event, data = stormDataAdjusted, FUN = sum)
totalHealth <- rename(totalHealth, Injuries = INJURIES, Fatalities = FATALITIES)
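
The aggregation step can be sketched on a toy data frame. The rows below are invented; the point is that cbind() on the left-hand side of the formula sums several columns per group in one call, and that the formula interface of aggregate() drops any row containing an NA before summing.

```r
# Hypothetical rows; the last one has a missing fatality count.
toy <- data.frame(
  Event      = c("Fire", "Fire", "Oceanic", "Oceanic", "Fire"),
  INJURIES   = c(2, 3, 1, 6, 10),
  FATALITIES = c(1, 1, 2, 1, NA)
)

# One call sums both columns by Event; the row with NA is omitted entirely,
# so its 10 injuries do not appear in the Fire total.
totals <- aggregate(cbind(INJURIES, FATALITIES) ~ Event, data = toy, FUN = sum)
totals
```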

Results

Finally, it is time to see some results. Below are the resulting graphs and data for the analysis.

Question 1: Across the United States, which types of events are most harmful with respect to population health?

#Fatalities & Injuries
p3 <- ggplot(totalHealth, aes(x = Event, y = Injuries)) + geom_bar(stat = "identity") + ggtitle(paste("Total Injury Occurrence By Event Type from", cutoffdate, "to 2011")) + xlab("Event Type") + ylab("Injuries") + scale_x_discrete(labels = function(x) str_wrap(x, width = 10))
p4 <- ggplot(totalHealth, aes(x = Event, y = Fatalities)) + geom_bar(stat = "identity") + ggtitle(paste("Total Fatality Occurrence By Event Type from", cutoffdate, "to 2011")) + xlab("Event Type") + ylab("Fatalities") + scale_x_discrete(labels = function(x) str_wrap(x, width = 10))
grid.arrange(p3,p4)

#Print Data
arrange(totalHealth, desc(Injuries))

##                   Event Injuries Fatalities
## 1          Storm & Wind    49734       4162
## 2    Heatwave & Drought     6495        501
## 3 Precipitation & Flood     3232        207
## 4  Snow/Ice & Avalanche     2800        211
## 5                  Fire      427         65
## 6               Oceanic      184         73
## 7             Landslide       28         22
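
To put the table in proportion, each category's share of the totals can be computed directly. The sketch below re-enters the printed values by hand so that it runs on its own; the column names InjuryShare and FatalityShare are introduced here for illustration.

```r
# Values copied from the table printed above.
totalHealth <- data.frame(
  Event = c("Storm & Wind", "Heatwave & Drought", "Precipitation & Flood",
            "Snow/Ice & Avalanche", "Fire", "Oceanic", "Landslide"),
  Injuries   = c(49734, 6495, 3232, 2800, 427, 184, 28),
  Fatalities = c(4162, 501, 207, 211, 65, 73, 22)
)

# Percentage of all recorded injuries and fatalities per category.
totalHealth$InjuryShare   <- round(100 * totalHealth$Injuries / sum(totalHealth$Injuries), 1)
totalHealth$FatalityShare <- round(100 * totalHealth$Fatalities / sum(totalHealth$Fatalities), 1)

totalHealth[order(-totalHealth$Injuries), ]
```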

Question 2: Across the United States, which types of events have the greatest economic consequences?

#Damage Crops & Property
p1 <- ggplot(totalDamage, aes(x = Event, y = Crops/10^6)) + geom_bar(stat = "identity") + ggtitle(paste("Total Crop Damage from", cutoffdate, "to 2011")) + xlab("Event Type") + ylab("Damage (in Millions of USD)") + scale_x_discrete(labels = function(x) str_wrap(x, width = 10))
p2 <- ggplot(totalDamage, aes(x = Event, y = Property/10^6)) + geom_bar(stat = "identity") + ggtitle(paste("Total Property Damage from", cutoffdate, "to 2011")) + xlab("Event Type") + ylab("Damage (in Millions of USD)") + scale_x_discrete(labels = function(x) str_wrap(x, width = 10))
grid.arrange(p1,p2)

#Print Data
arrange(totalDamage, desc(Crops))

##                   Event       Crops     Property
## 1 Precipitation & Flood 11111350850 129361980990
## 2          Storm & Wind  7517223790  44076751703
## 3  Snow/Ice & Avalanche  5389501500    258250050
## 4    Heatwave & Drought  1724990000    233407000
## 5                  Fire   202643100   1479543000
## 6             Landslide    20022000     14424000
## 7               Oceanic     1576000    119800000
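
Since the table is sorted by crop damage only, a combined ranking may be easier to read. The sketch below re-enters the printed values by hand and ranks the categories by crop plus property damage; the columns Total and TotalBillions are introduced here for illustration.

```r
# Values copied from the table printed above (US dollars).
totalDamage <- data.frame(
  Event = c("Precipitation & Flood", "Storm & Wind", "Snow/Ice & Avalanche",
            "Heatwave & Drought", "Fire", "Landslide", "Oceanic"),
  Crops    = c(11111350850, 7517223790, 5389501500, 1724990000,
               202643100, 20022000, 1576000),
  Property = c(129361980990, 44076751703, 258250050, 233407000,
               1479543000, 14424000, 119800000)
)

# Combined damage per category, ranked from largest to smallest.
totalDamage$Total <- totalDamage$Crops + totalDamage$Property
ranked <- totalDamage[order(-totalDamage$Total), ]
ranked$TotalBillions <- round(ranked$Total / 1e9, 2)

ranked[, c("Event", "TotalBillions")]
```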

Conclusion

From the results, there are some obvious conclusions to be made.

Addressing question one, the most hazardous event types for human health are by far wind- and storm-related events. This category includes hurricanes, strong winds, tornados, thunderstorm winds and other major storm types. Heatwave and drought events also cause a surprising number of injuries and fatalities. Keep in mind that these numbers are totals recorded over the years 1965 to 2011, and that the filter applied earlier keeps only events reporting both at least one injury and at least one fatality. These results should not seem too surprising. Throughout the date range, a number of hurricanes have wreaked havoc on the Southern and Eastern coastlines, many of them within the last 10 to 15 years. Additionally, with tornados included in the storm category, many of the large tornado strikes throughout the Midwest have added to the total number of injuries and fatalities.

Question two also has some interesting results. The majority of crop damage is caused by events involving flood, wind and ice. Drought also causes a large amount of crop damage, but it is still only a fraction of the total caused by flood, wind and ice. Property damage is heavily associated with floods and storms. Fire also has some economic impact, but as far as catastrophic economic consequences go, the majority of the property damage is due to floods and major storms or tornados.

In summary, the majority of the harm to population health and of the economic damage stems from storms involving hurricanes, tornados, snow and ice, and excessive winds. We’ve seen many of these horrible instances in the past decade with storms like Hurricane Katrina (2005) and Hurricane Dennis (2005). Additionally, tornados have caused severe distress throughout the Midwest over the past few decades.

Further research could provide insight into whether the stark consequences of storms are linked to climate change and global warming. Using the NOAA dataset, I believe that digging into individual event types, or analyzing the storm-related events over time, may reveal trends relating to the rise in global warming over the past three decades indicated by many researchers today.

Storm Data Analysis

Sean Krinik