Synopsis

This document presents an analysis of the NOAA Storm Database, which records information about severe weather events across the United States from 1950 to 2011. It addresses two specific questions:

  1. Which types of events are most harmful with respect to population health?
  2. Which types of events have the greatest economic consequences?

The analysis focuses on the investigation of a particular subset of variables from the provided original data set. Below is a list of the topics and variables of importance to the analysis followed by the results of the analysis.

Variables
  • Events
    • “EVTYPE”
  • Population Health
    • “FATALITIES” & “INJURIES”
  • Economic Impact
    • “CROPDMG” & “CROPDMGEXP”
    • “PROPDMG” & “PROPDMGEXP”
Results
  1. Tornadoes are the most harmful events with regard to the health of the human population.
    • Total Recorded Injuries: 91346
    • Total Recorded Fatalities: 5633
  2. Floods are the most damaging events with regard to total damage.
    • Total Damage to Crops: approximately $5.66 Billion USD
    • Total Damage to Property: approximately $14.65 Billion USD

Supporting Packages

library(R.utils) # bunzip2
library(dplyr)   # arrange, desc, mutate
library(ggplot2) # ggplot

Data Processing

Extraction

Starting with the raw .bz2 compressed dataset, use the “bunzip2” function from the “R.utils” package to extract the .csv file. Then store the raw data in a data.frame called “storms”.

setwd("~/R/StormData/")

#   File Names to be passed to bunzip2()
file_bz <- "repdata-data-StormData.csv.bz2"
file_csv <- "repdata-data-StormData.csv"

#   Unzip .bz2 using R.utils::bunzip2
bunzip2(file_bz, file_csv, remove = FALSE, skip = TRUE)
## [1] "repdata-data-StormData.csv"
## attr(,"temporary")
## [1] FALSE
#   Read Data from extracted .csv; Initialize "storms" data.frame
storms <- read.csv(file_csv, header = TRUE, stringsAsFactors = FALSE, na.strings = "")
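
As a possible alternative to the explicit extraction step, base R's read.csv can usually read a bzip2-compressed file directly, because file connections detect the compression on their own. The sketch below demonstrates this on a small temporary file. The toy data frame and the temporary path are made up for illustration and are not part of the original analysis.

```r
# Hypothetical toy data standing in for the storm file
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"),
                  FATALITIES = c(3, 0),
                  stringsAsFactors = FALSE)

# Write the toy data to a temporary bzip2-compressed .csv
tmp_bz <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp_bz, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv opens the compressed file directly; no bunzip2 call is needed
toy_read <- read.csv(tmp_bz, header = TRUE, stringsAsFactors = FALSE,
                     na.strings = "")
print(toy_read)
```

Extracting the file first, as done above, still has the advantage of leaving a plain .csv on disk for inspection with other tools.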

Exploratory Analysis

Before analyzing any data, it is good practice to perform a preliminary visual inspection of the raw data. In this case, the inspection helped to:

  1. Become familiar with what variables exist within the dataset.
  2. Determine which variables are key to answering the research questions.
  3. Prescribe methods of data cleaning for the chosen variables.
#   View the names of the variables in storms
names(storms)
##  [1] "STATE__"    "BGN_DATE"   "BGN_TIME"   "TIME_ZONE"  "COUNTY"    
##  [6] "COUNTYNAME" "STATE"      "EVTYPE"     "BGN_RANGE"  "BGN_AZI"   
## [11] "BGN_LOCATI" "END_DATE"   "END_TIME"   "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE"  "END_AZI"    "END_LOCATI" "LENGTH"     "WIDTH"     
## [21] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [26] "PROPDMGEXP" "CROPDMG"    "CROPDMGEXP" "WFO"        "STATEOFFIC"
## [31] "ZONENAMES"  "LATITUDE"   "LONGITUDE"  "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS"    "REFNUM"
#   View EXP variables for DMG multipliers
sort(unique(storms$CROPDMGEXP))
## [1] "?" "0" "2" "B" "k" "K" "m" "M"
sort(unique(storms$PROPDMGEXP))
##  [1] "-" "?" "+" "0" "1" "2" "3" "4" "5" "6" "7" "8" "B" "h" "H" "K" "m"
## [18] "M"
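
Listing the unique codes shows which values exist, but not how many rows each one affects. A frequency table can help decide which codes matter in practice. The sketch below uses a made-up vector of codes rather than the real storms columns.

```r
# Hypothetical exponent codes, including mixed case and a missing value
codes <- c("K", "k", "M", NA, "?", "K")

# Count rows per code; useNA = "ifany" keeps missing values visible
code_counts <- table(codes, useNA = "ifany")
print(code_counts)
```

On the real data the same call would be applied to storms$CROPDMGEXP and storms$PROPDMGEXP.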

Data Reconciliation

Before using storms to draw any conclusions, the non-numeric variables are examined further. Case differences are reconciled first with toupper(), which makes later comparison of character values more succinct.
No actual missing values exist in the EVTYPE variable, but a placeholder value (“?”) appears, so the affected rows are removed.

#   CASE Recon
storms$EVTYPE <- toupper(storms$EVTYPE)

#   Missing Value Recon
storms <- storms[storms$EVTYPE != "?",]
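
To illustrate the effect of the case reconciliation, the sketch below counts distinct labels before and after cleaning a made-up vector. The trimws() call is an extra step assumed here for illustration; the analysis above applies toupper() only.

```r
# Hypothetical event labels with case and whitespace differences
labels <- c("Flood", "FLOOD", " flood", "Tornado")

# Distinct labels before cleaning
n_before <- length(unique(labels))

# Upper-case the labels; trimws() (an assumption, not used above)
# also strips leading and trailing spaces
cleaned <- toupper(trimws(labels))
n_after <- length(unique(cleaned))

print(c(before = n_before, after = n_after))
```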

In the case of the damage variables, the DMG and DMGEXP components need to be combined to properly estimate the cost of each event. This is done by redefining the DMGEXP codes as numeric cost multipliers. These multipliers are then used to recalculate the DMG variables for crops and property.

#   toupper() for ease of parsing
storms$CROPDMGEXP <- toupper(storms$CROPDMGEXP)
storms$PROPDMGEXP <- toupper(storms$PROPDMGEXP)

#   Redefine DMGEXP
storms$CROPDMGEXP[is.na(storms$CROPDMGEXP) | storms$CROPDMGEXP == "?"] <- 0
storms$CROPDMGEXP[storms$CROPDMGEXP == "2"] <- 10^2
storms$CROPDMGEXP[storms$CROPDMGEXP == "K"] <- 10^3
storms$CROPDMGEXP[storms$CROPDMGEXP == "M"] <- 10^6
storms$CROPDMGEXP[storms$CROPDMGEXP == "B"] <- 10^9

storms$PROPDMGEXP[is.na(storms$PROPDMGEXP) | storms$PROPDMGEXP == "-" | storms$PROPDMGEXP == "+" | storms$PROPDMGEXP == "?"] <- 0
storms$PROPDMGEXP[storms$PROPDMGEXP == "1" ] <- 10
storms$PROPDMGEXP[storms$PROPDMGEXP == "2" | storms$PROPDMGEXP == "H"] <- 10^2
storms$PROPDMGEXP[storms$PROPDMGEXP == "3" | storms$PROPDMGEXP == "K"] <- 10^3
storms$PROPDMGEXP[storms$PROPDMGEXP == "4"] <- 10^4
storms$PROPDMGEXP[storms$PROPDMGEXP == "5"] <- 10^5
storms$PROPDMGEXP[storms$PROPDMGEXP == "6" | storms$PROPDMGEXP == "M"] <- 10^6
storms$PROPDMGEXP[storms$PROPDMGEXP == "7"] <- 10^7
storms$PROPDMGEXP[storms$PROPDMGEXP == "8"] <- 10^8
storms$PROPDMGEXP[storms$PROPDMGEXP == "B"] <- 10^9

#   Convert DMGEXP variables to Number
storms$CROPDMGEXP <- as.numeric(storms$CROPDMGEXP)
storms$PROPDMGEXP <- as.numeric(storms$PROPDMGEXP)

#   View EXP variables for DMG multipliers
sort(unique(storms$CROPDMGEXP))
## [1] 0e+00 1e+02 1e+03 1e+06 1e+09
sort(unique(storms$PROPDMGEXP))
##  [1] 0e+00 1e+01 1e+02 1e+03 1e+04 1e+05 1e+06 1e+07 1e+08 1e+09
#   Recalculate DMG
storms$CROPDMG <- storms$CROPDMG * storms$CROPDMGEXP
storms$PROPDMG <- storms$PROPDMG * storms$PROPDMGEXP
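
The same recoding can also be sketched as a single lookup, which keeps the mapping in one place. The helper name exp_to_multiplier is hypothetical, and the function is only meant to mirror the rules used above: letters map to powers of ten, digits 1 through 8 map to 10^digit, and everything else (including "0", "?", "-", "+" and missing values) maps to 0.

```r
# Hypothetical helper mirroring the recoding rules used above
exp_to_multiplier <- function(code) {
  # Named lookup: letter codes plus digits "1".."8" as powers of ten
  lookup <- c("H" = 1e2, "K" = 1e3, "M" = 1e6, "B" = 1e9,
              setNames(10^(1:8), as.character(1:8)))
  mult <- unname(lookup[toupper(code)])
  # Codes not in the lookup ("0", "?", "-", "+", NA) get a multiplier of 0
  mult[is.na(mult)] <- 0
  mult
}

# Made-up damage figures and codes for illustration
dmg  <- c(25, 2.5, 1)
code <- c("K", "m", "?")
print(dmg * exp_to_multiplier(code))
```

Treating unknown codes as 0 discards the damage recorded on those rows; that choice follows the analysis above and is not the only reasonable one.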

Subset & Aggregate

Question one involves human harm and therefore requires two variables, “INJURIES” & “FATALITIES”. Question two involves the monetary damage to crops and property and requires four variables, “CROPDMG”, “CROPDMGEXP”, “PROPDMG” & “PROPDMGEXP”. Since the analysis looks for the event types with the largest totals of these variables, the sub_storms dataset is created by subsetting the original storms dataset for cases in which the value of any of these numeric variables is greater than 0.

sub_storms <- storms[storms$FATALITIES > 0 | storms$INJURIES > 0 | 
                         storms$CROPDMG > 0 | storms$PROPDMG > 0 ,]

To answer the first research question, the aggregate function is used to sum the total FATALITIES and INJURIES per EVTYPE.

HARMFUL <- mutate(sub_storms, TOTAL_HARM = FATALITIES + INJURIES)
HARMFUL <- aggregate(TOTAL_HARM ~ EVTYPE, HARMFUL, sum)

#   Sort Descending on Total
HARMFUL <- arrange(HARMFUL, desc(TOTAL_HARM))

#   Top 10 Combined Harmful EVTYPE
HARMFUL_10 <- head(HARMFUL, 10)
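
The Results summary reports injuries and fatalities separately, while the aggregation above only keeps their sum. A minimal sketch of one way to keep both columns, using base R's aggregate with cbind on made-up data, might look like this (the toy values are hypothetical and do not come from the storm database):

```r
# Hypothetical toy data standing in for sub_storms
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT"),
  FATALITIES = c(2, 1, 0, 4),
  INJURIES   = c(10, 0, 5, 1),
  stringsAsFactors = FALSE
)

# Sum both harm variables per event type in one call
harm <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)

# Add the combined total and sort descending on it
harm$TOTAL_HARM <- harm$FATALITIES + harm$INJURIES
harm <- harm[order(-harm$TOTAL_HARM), ]
print(harm)
```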

To answer the second research question, the aggregate function is used again, this time to sum the total CROPDMG and PROPDMG per EVTYPE.

DAMAGE <- mutate(sub_storms, TOTAL_DMG = CROPDMG + PROPDMG)
DAMAGE <- aggregate(TOTAL_DMG ~ EVTYPE, DAMAGE, sum)

#   Sort Descending on Total
DAMAGE <- arrange(DAMAGE, desc(TOTAL_DMG))

# Top 10 TOTAL_DMG EVTYPE
DAMAGE_10 <- head(DAMAGE, 10)

Plot

Finally, the ggplot function from the ggplot2 package is used to display the answers visually.

# Plot Total Harm
plot_HARM <- ggplot(HARMFUL_10, aes(x= EVTYPE, y=TOTAL_HARM/ 1000)) + 
    geom_bar(stat = "identity") +
    theme(axis.text.x = element_text(angle = 90, 
                                      hjust = 1,
                                      colour = "black",
                                      size = 10,
                                      face="bold")) + 
    labs(title = "Most Harmful Storm Events (1950-2011)") +
    xlab("Event Type") +
    ylab("Fatalities & Injuries (1000's of Incidents)")

# Plot Total Damage
plot_DAMAGE <- ggplot(DAMAGE_10, aes(x= EVTYPE, y=TOTAL_DMG / 10^9)) + 
    geom_bar(stat = "identity") + 
    theme(axis.text.x = element_text(angle = 90, 
                                      hjust = 1,
                                      colour = "black",
                                      size = 10,
                                      face="bold")) +
    labs(title = "Most Costly Storm Events (1950-2011)") +
    xlab("Event Type") +
    ylab("Damage to Crops & Property (Billions: USD)")
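
With a character x variable, ggplot draws the bars in alphabetical order rather than by size. One possible refinement, sketched here on made-up data, is to turn the event type into a factor ordered by its total with reorder(). The column name EVTYPE_ORDERED and the toy values are assumptions for illustration, and the plotting part only runs if ggplot2 is installed.

```r
# Hypothetical toy totals standing in for HARMFUL_10
toy <- data.frame(EVTYPE = c("FLOOD", "TORNADO", "HEAT"),
                  TOTAL_HARM = c(10, 100, 50),
                  stringsAsFactors = FALSE)

# Order factor levels by descending total so the largest bar comes first
toy$EVTYPE_ORDERED <- reorder(toy$EVTYPE, -toy$TOTAL_HARM)
print(levels(toy$EVTYPE_ORDERED))

# Build the bar chart only when ggplot2 is available
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  plot_toy <- ggplot(toy, aes(x = EVTYPE_ORDERED, y = TOTAL_HARM)) +
    geom_bar(stat = "identity") +
    xlab("Event Type")
}
```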

Results

plot_HARM

plot_DAMAGE