The goal of this analysis is, referencing the US NOAA storm database, determine the following questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
This analysis used data from the US NOAA storm database and narrowed the analysis down to the latest ten years (Nov 2001 through Nov 2011) and discovered the following:
Given these findings, government officials and managers should focus their preparedness on the storm events listed in a prioritized fashion based upon their geographic history.
Load r packages
library(ggplot2)
library(plyr)
library(dplyr)
library(data.table)
library(grid)
First, read in the US NOAA storm database located here: https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2
Then unzip the .csv into the working /data directory and read into the storms data frame.
# set working & create data directory
setwd("C:\\Users\\dabradford\\Desktop\\Coursera\\DataSci\\Reproducible\\RepData_PeerAssessment2")
if(!file.exists("./data")){dir.create("./data")}
# download storm data
get.data.project <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(get.data.project,destfile="./data/storm_data.csv.bz2",method="auto")
zipfile.data = "storm_data.csv.bz2"
# make sure the data is in the working directory if not download the zip file into the to zipfile.data and unzip it
if(!file.exists(zipfile.data)) {
unzip(zipfile="./data/storm_data.csv.bz2",exdir="./data")
}
path_act <- file.path("./data" , "./data/storm_data")
files<-list.files(path_act, recursive=TRUE)
# Read data file > THIS WILL TAKE A WHILE
storms_raw <- read.csv("./data/storm_data.csv.bz2")
# inspect data file
#str(storms_raw)
#summary(storms_raw)
This analysis specifically asks for the effects of storm events across the United States. For this analysis we will not be considering geographical boundaries, such as county or state borders, and will include all geolocations included in the dataset. Additionally, we will only be looking at the last ten years of data for our analysis to account for more recent trends and to ensure we are working with more accurate data points.
Filter data to only include events from November 2001 through November 2011 based on the BGN_DATE variable.
# First convert BGN_DATE to date
storms_raw$BGN_DATE <- as.Date(storms_raw$BGN_DATE, "%m/%d/%Y")
# Then parse only those events between Nov 2001 and Nov 2011 to get the last ten years of data. Also only include the variables that will help us understand consequences to population health (FATALITIES & INJURIES) and/or economics (PROPDMG & CROPDMG)
storms_10yr <- storms_raw %>%
filter(BGN_DATE >= "2001-11-01" & BGN_DATE <= "2011-11-30")
## Warning: package 'bindrcpp' was built under R version 3.4.4
We need to understand the dataset better since there isn’t clear documentation of what some of the variable contents signify. Let’s view a few events in Georgia:
# view one event to determine variable meanings
ga_events <- storms_raw %>%
select(BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
filter(BGN_DATE == "2007-03-01" & EVTYPE == "TORNADO" & STATE == "GA")
head(ga_events)
## BGN_DATE STATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 1 2007-03-01 GA TORNADO 0 0 250 K 0
## 2 2007-03-01 GA TORNADO 0 0 1 K 0
## 3 2007-03-01 GA TORNADO 0 0 400 K 0
## 4 2007-03-01 GA TORNADO 1 4 500 K 0
## 5 2007-03-01 GA TORNADO 0 9 500 K 0
## 6 2007-03-01 GA TORNADO 0 0 25 K 0
## CROPDMGEXP
## 1 K
## 2 K
## 3 K
## 4 K
## 5 K
## 6 K
Based on this analysis and comparing to the NOAA Storm Events database search page at the link below we understand the following:
FATALITIES - Number of people killed as a direct result of the event INJURIES - Number of people injured as a direct result of the event
Property damage is found by combining PROPDMG and PROPDMGEXP variables. Possible values for PROPDMGEXP include:
Crop damage is calculated in the same way.
https://www.ncdc.noaa.gov/stormevents/choosedates.jsp?statefips=-999%2CALL
count(storms_10yr, 'PROPDMGEXP')
## # A tibble: 1 x 2
## `"PROPDMGEXP"` n
## <chr> <int>
## 1 PROPDMGEXP 455585
count(storms_10yr, 'CROPDMGEXP')
## # A tibble: 1 x 2
## `"CROPDMGEXP"` n
## <chr> <int>
## 1 CROPDMGEXP 455585
Therefore we need to add to new variables to account for property and crop damage.
# Property damage
storms_10yr <- mutate(storms_10yr,
prop.dmg = ifelse(PROPDMGEXP == "K", PROPDMG * 1000,
ifelse(PROPDMGEXP == "M", PROPDMG * 1000000,
ifelse(PROPDMGEXP == "B", PROPDMG * 1000000000, 0))))
# Crop damage
storms_10yr <- mutate(storms_10yr,
crop.dmg = ifelse(CROPDMGEXP == "K", CROPDMG * 1000,
ifelse(CROPDMGEXP == "M", CROPDMG * 1000000,
ifelse(CROPDMGEXP == "B", CROPDMG * 1000000000, 0))))
Since there are 128 event types we wanted to plot the top 20 and categorize all other events as “OTHER”.
# sort by top 20 and categorize the remaining events in "Other"
top20fatalities <- data.table(storms_10yr)
top20fatalities <- top20fatalities[FATALITIES != 0 & !is.na(FATALITIES), sum(FATALITIES), by = EVTYPE]
top20fatalities <- top20fatalities[order(-rank(V1), EVTYPE)[21:nrow(top20fatalities)],EVTYPE:="OTHER*"]
top20fatalities <- top20fatalities[, sum(V1), by = EVTYPE]
top20fatalities$EVTYPE <- factor(top20fatalities$EVTYPE, levels = top20fatalities$EVTYPE[order(top20fatalities$V1)])
top20injuries <- data.table(storms_10yr)
top20injuries <- top20injuries[INJURIES != 0 & !is.na(INJURIES), sum(INJURIES), by = EVTYPE]
top20injuries <- top20injuries[order(-rank(V1), EVTYPE)[21:nrow(top20injuries)],EVTYPE:="OTHER*"]
top20injuries <- top20injuries[, sum(V1), by = EVTYPE]
top20injuries$EVTYPE <- factor(top20injuries$EVTYPE, levels = top20injuries$EVTYPE[order(top20injuries$V1)])
plot1 <- ggplot(top20fatalities, aes(x=EVTYPE, y=V1)) +
geom_bar(stat="identity", fill="blue") +
labs(x="Event Type", y="# Fatalities", title=" # Fatalities by Event Type (Top 20)", subtitle = " November 2001 through November 2011", size = 5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size=8), plot.margin = unit(c(.2,.2,.2,.2), "cm")) +
geom_text(aes(label=V1), size=2.5, position=position_dodge(width=1), vjust=-.35) +
scale_y_continuous(breaks = seq(min(0), max(1500), by = 300))
plot1
plot2 <- ggplot(top20injuries, aes(x=EVTYPE, y=V1)) +
geom_bar(stat="identity", fill="blue") +
labs(x="Event Type", y="# Injuries", title=" # Injuries by Event Type (Top 20)", subtitle = " November 2001 through November 2011", size = 5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size=8), plot.margin = unit(c(.2,.2,.2,.2), "cm")) +
geom_text(aes(label=V1), size=2.5, position=position_dodge(width=1), vjust=-.35) +
scale_y_continuous(breaks = seq(min(0), max(14000), by = 2000))
plot2
# sum prop.dmg and crop.dmg by EVTTYPE
damage <- data.table(storms_10yr)
damage <- damage %>%
group_by(EVTYPE) %>%
summarise(tot.prop.dmg = sum(prop.dmg), tot.crop.dmg = sum(crop.dmg))
damage <- damage %>%
mutate(tot.dmg = tot.prop.dmg + tot.crop.dmg)
# sort by top 20 and categorize the remaining events in "Other"
options(scipen=999)
top20damage <- data.table(damage)
top20damage <- top20damage[tot.dmg != 0 & !is.na(tot.dmg), sum(tot.dmg), by = EVTYPE]
top20damage <- top20damage[order(-rank(V1), EVTYPE)[21:nrow(top20damage)],EVTYPE:="OTHER*"]
top20damage <- top20damage[, sum(V1), by = EVTYPE]
top20damage$EVTYPE <- factor(top20damage$EVTYPE, levels = top20damage$EVTYPE[order(top20damage$V1)])
# then convert to millions for chart readability
top20damage$V1 <- top20damage$V1*0.00000001
plot3 <- ggplot(top20damage, aes(x=EVTYPE, y=V1)) +
geom_bar(stat="identity", fill="blue") +
labs(x="Event Type", y="Economic Impact", title=" Economic Impact (USD Millions) by Event Type", subtitle = " November 2001 through November 2011", size = 5) +
theme(axis.text.x = element_text(angle = 45, hjust = 1, size=8), plot.margin = unit(c(.2,.2,.2,.2), "cm")) +
geom_text(aes(label=sprintf("%0.0f", V1)), size=3, position=position_dodge(width=1), vjust=-.35) +
scale_y_continuous(breaks = seq(min(0), max(1400), by = 200))
plot3