Synopsis

The goal of this analysis is, referencing the US NOAA storm database, determine the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

This analysis used data from the US NOAA storm database and narrowed the analysis down to the latest ten years (Nov 2001 through Nov 2011) and discovered the following:

Given these findings, government officials and managers should focus their preparedness on the storm events listed in a prioritized fashion based upon their geographic history.

Data Processing

Load r packages

library(ggplot2)
library(plyr)
library(dplyr)
library(data.table)
library(grid)

First, read in the US NOAA storm database located here: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2

Then unzip the .csv into the working /data directory and read into the storms data frame.

# set working & create data directory
setwd("C:\\Users\\dabradford\\Desktop\\Coursera\\DataSci\\Reproducible\\RepData_PeerAssessment2")
if(!file.exists("./data")){dir.create("./data")}

# download storm data
get.data.project <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(get.data.project,destfile="./data/storm_data.csv.bz2",method="auto")

zipfile.data = "storm_data.csv.bz2"

# make sure the data is in the working directory if not download the zip file into the to zipfile.data and unzip it
if(!file.exists(zipfile.data)) {        
    unzip(zipfile="./data/storm_data.csv.bz2",exdir="./data")
} 

path_act <- file.path("./data" , "./data/storm_data")
files<-list.files(path_act, recursive=TRUE)

# Read data file > THIS WILL TAKE A WHILE
storms_raw <- read.csv("./data/storm_data.csv.bz2")

# inspect data file
#str(storms_raw)
#summary(storms_raw)

This analysis specifically asks for the effects of storm events across the United States. For this analysis we will not be considering geographical boundaries, such as county or state borders, and will include all geolocations included in the dataset. Additionally, we will only be looking at the last ten years of data for our analysis to account for more recent trends and to ensure we are working with more accurate data points.

Filter data to only include events from November 2001 through November 2011 based on the BGN_DATE variable.

# First convert BGN_DATE to date 
storms_raw$BGN_DATE <- as.Date(storms_raw$BGN_DATE, "%m/%d/%Y")

# Then parse only those events between Nov 2001 and Nov 2011 to get the last ten years of data.  Also only include the variables that will help us understand consequences to population health (FATALITIES & INJURIES) and/or economics (PROPDMG &  CROPDMG)

storms_10yr  <- storms_raw %>%
    
    filter(BGN_DATE >= "2001-11-01" & BGN_DATE <= "2011-11-30")
## Warning: package 'bindrcpp' was built under R version 3.4.4

We need to understand the dataset better since there isn’t clear documentation of what some of the variable contents signify. Let’s view a few events in Georgia:

# view one event to determine variable meanings
ga_events  <- storms_raw %>%
    select(BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
    filter(BGN_DATE == "2007-03-01" & EVTYPE == "TORNADO" & STATE == "GA")

head(ga_events)
##     BGN_DATE STATE  EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 1 2007-03-01    GA TORNADO          0        0     250          K       0
## 2 2007-03-01    GA TORNADO          0        0       1          K       0
## 3 2007-03-01    GA TORNADO          0        0     400          K       0
## 4 2007-03-01    GA TORNADO          1        4     500          K       0
## 5 2007-03-01    GA TORNADO          0        9     500          K       0
## 6 2007-03-01    GA TORNADO          0        0      25          K       0
##   CROPDMGEXP
## 1          K
## 2          K
## 3          K
## 4          K
## 5          K
## 6          K

Based on this analysis and comparing to the NOAA Storm Events database search page at the link below we understand the following:

FATALITIES - Number of people killed as a direct result of the event INJURIES - Number of people injured as a direct result of the event

Property damage is found by combining PROPDMG and PROPDMGEXP variables. Possible values for PROPDMGEXP include:

- no damage 0 - no damage B - damage in billions USD. For example, if PROPDMG = 5, and PROPDMG = ‘B’ then property damage is $5,000,000,000 K - damage in thousands USD M - damage in millions USD

Crop damage is calculated in the same way.

https://www.ncdc.noaa.gov/stormevents/choosedates.jsp?statefips=-999%2CALL

count(storms_10yr, 'PROPDMGEXP')
## # A tibble: 1 x 2
##   `"PROPDMGEXP"`      n
##   <chr>           <int>
## 1 PROPDMGEXP     455585
count(storms_10yr, 'CROPDMGEXP')
## # A tibble: 1 x 2
##   `"CROPDMGEXP"`      n
##   <chr>           <int>
## 1 CROPDMGEXP     455585

Therefore we need to add to new variables to account for property and crop damage.

# Property damage
storms_10yr <- mutate(storms_10yr,
                      prop.dmg = ifelse(PROPDMGEXP == "K", PROPDMG * 1000,
                                        ifelse(PROPDMGEXP == "M", PROPDMG * 1000000,
                                               ifelse(PROPDMGEXP == "B", PROPDMG * 1000000000, 0))))
# Crop damage
storms_10yr <- mutate(storms_10yr,
                      crop.dmg = ifelse(CROPDMGEXP == "K", CROPDMG * 1000,
                                        ifelse(CROPDMGEXP == "M", CROPDMG * 1000000,
                                               ifelse(CROPDMGEXP == "B", CROPDMG * 1000000000, 0))))

Since there are 128 event types we wanted to plot the top 20 and categorize all other events as “OTHER”.

# sort by top 20 and categorize the remaining events in "Other"
top20fatalities <- data.table(storms_10yr)
top20fatalities <- top20fatalities[FATALITIES != 0 & !is.na(FATALITIES), sum(FATALITIES), by = EVTYPE]

top20fatalities <- top20fatalities[order(-rank(V1), EVTYPE)[21:nrow(top20fatalities)],EVTYPE:="OTHER*"]
top20fatalities <- top20fatalities[, sum(V1), by = EVTYPE]
top20fatalities$EVTYPE <- factor(top20fatalities$EVTYPE, levels = top20fatalities$EVTYPE[order(top20fatalities$V1)])


top20injuries <- data.table(storms_10yr)
top20injuries <- top20injuries[INJURIES != 0 & !is.na(INJURIES), sum(INJURIES), by = EVTYPE]

top20injuries <- top20injuries[order(-rank(V1), EVTYPE)[21:nrow(top20injuries)],EVTYPE:="OTHER*"]
top20injuries <- top20injuries[, sum(V1), by = EVTYPE]
top20injuries$EVTYPE <- factor(top20injuries$EVTYPE, levels = top20injuries$EVTYPE[order(top20injuries$V1)])

Results

Q1 - Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

plot1 <- ggplot(top20fatalities, aes(x=EVTYPE, y=V1)) + 
    geom_bar(stat="identity", fill="blue") +
    labs(x="Event Type", y="# Fatalities", title="                             # Fatalities by Event Type (Top 20)", subtitle = "                                           November 2001 through November 2011", size = 5) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, size=8), plot.margin = unit(c(.2,.2,.2,.2), "cm")) +
    geom_text(aes(label=V1), size=2.5, position=position_dodge(width=1), vjust=-.35) +
    scale_y_continuous(breaks = seq(min(0), max(1500), by = 300)) 

plot1

plot2 <- ggplot(top20injuries, aes(x=EVTYPE, y=V1)) + 
    geom_bar(stat="identity", fill="blue") +
    labs(x="Event Type", y="# Injuries", title="                                   # Injuries by Event Type (Top 20)", subtitle = "                                                November 2001 through November 2011", size = 5) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, size=8), plot.margin = unit(c(.2,.2,.2,.2), "cm")) +
    geom_text(aes(label=V1), size=2.5, position=position_dodge(width=1), vjust=-.35) +
    scale_y_continuous(breaks = seq(min(0), max(14000), by = 2000)) 

plot2

Q2 - Across the United States, which types of events have the greatest economic consequences?

# sum prop.dmg and crop.dmg by EVTTYPE
damage <- data.table(storms_10yr)

damage <- damage %>% 
    group_by(EVTYPE) %>% 
    summarise(tot.prop.dmg = sum(prop.dmg), tot.crop.dmg = sum(crop.dmg))

damage <- damage %>% 
    mutate(tot.dmg = tot.prop.dmg + tot.crop.dmg) 
# sort by top 20 and categorize the remaining events in "Other"
options(scipen=999)
top20damage <- data.table(damage)
top20damage <- top20damage[tot.dmg != 0 & !is.na(tot.dmg), sum(tot.dmg), by = EVTYPE]

top20damage <- top20damage[order(-rank(V1), EVTYPE)[21:nrow(top20damage)],EVTYPE:="OTHER*"]
top20damage <- top20damage[, sum(V1), by = EVTYPE]
top20damage$EVTYPE <- factor(top20damage$EVTYPE, levels = top20damage$EVTYPE[order(top20damage$V1)])

# then convert to millions for chart readability
top20damage$V1 <- top20damage$V1*0.00000001
plot3 <- ggplot(top20damage, aes(x=EVTYPE, y=V1)) + 
    geom_bar(stat="identity", fill="blue") +
    labs(x="Event Type", y="Economic Impact", title="                       Economic Impact (USD Millions) by Event Type", subtitle = "                                            November 2001 through November 2011", size = 5) +
    theme(axis.text.x = element_text(angle = 45, hjust = 1, size=8), plot.margin = unit(c(.2,.2,.2,.2), "cm")) +
    geom_text(aes(label=sprintf("%0.0f", V1)), size=3, position=position_dodge(width=1), vjust=-.35) +
    scale_y_continuous(breaks = seq(min(0), max(1400), by = 200)) 

plot3