There are many weather events that impact the United States. Based on data from the U.S. National Oceanic and Atmospheric Administration's (NOAA) storm database, we found that tornadoes had the highest impact on population health and floods had the highest economic cost. For this analysis we used the bzip2-compressed CSV file downloaded below. The data covers weather events in the U.S. from 1950 through November 2011, with more recent years having more complete records.
We begin by downloading the file and loading needed packages for data processing:
# download bz2 file
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile = "./StormData.csv.bz2", method = "curl")
# read file
df <- read.csv(bzfile("StormData.csv.bz2"), stringsAsFactors = FALSE, strip.white = TRUE)
# load required packages; note the %>% operator requires dplyr 0.2
# or later
require("dplyr")
## Loading required package: dplyr
##
## Attaching package: 'dplyr'
##
## The following objects are masked from 'package:stats':
##
## filter, lag
##
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
require("stringr")
## Loading required package: stringr
Since we're only looking at events that resulted in property damage, crop damage, injuries or fatalities, we can filter the dataset to observations where at least one of those is attributed to an event.
# subset dataset to just those observations where an injury OR a fatality OR
# property damage OR crop damage was attributed to an event
df <- subset(df, PROPDMG > 0 | FATALITIES > 0 | INJURIES > 0 | CROPDMG > 0)
We can further filter the dataset to only the variables needed for our analysis: BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP
# select only columns needed
df <- select(df, BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP,
CROPDMG, CROPDMGEXP)
Convert the date to a date-time class:
# convert BGN_DATE to a date-time (class POSIXct)
df$BGN_DATE <- as.POSIXct(strptime(df$BGN_DATE, "%m/%d/%Y %H:%M:%S"))
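As a quick check of the format string, the sketch below parses a single timestamp. The sample string is an assumption about how the raw BGN_DATE values look (month/day/year followed by a time), and the time zone is fixed to UTC only to make the result reproducible.

```r
# Assumed sample value in the "%m/%d/%Y %H:%M:%S" layout
sample_date <- "4/18/1950 0:00:00"
parsed <- as.POSIXct(strptime(sample_date, "%m/%d/%Y %H:%M:%S", tz = "UTC"))

# The result is a POSIXct date-time, not an object of class Date
print(class(parsed))
print(format(parsed, "%Y-%m-%d"))

# A value that does not match the format becomes NA rather than raising an error
mismatch <- strptime("1950-04-18", "%m/%d/%Y %H:%M:%S", tz = "UTC")
print(is.na(mismatch))
```

Because mismatched values turn into NA silently, counting the NA values in df$BGN_DATE after the conversion is a cheap way to confirm that the format string fits the data.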
Now we need to convert the property damage and crop damage values to dollar amounts. For example, PROPDMG and PROPDMGEXP need to be used together to calculate total property damage. Using the first observation as an example:
head(df, 1)
## BGN_DATE STATE EVTYPE FATALITIES INJURIES PROPDMG PROPDMGEXP CROPDMG
## 1 1950-04-18 AL TORNADO 0 15 25 K 0
## CROPDMGEXP
## 1
The property damage is listed as 25.0 and the PROPDMGEXP is K, which according to the documentation means the 25.0 is in thousands, i.e. $25,000. We'll perform similar calculations for h/H (hundreds), m/M (millions) and b/B (billions).
# convert PROPDMGEXP to lowercase
df$PROPDMGEXP <- tolower(df$PROPDMGEXP)
# convert each based on PROPDMGEXP
df$PROPDMG[df$PROPDMGEXP == "h"] <- df$PROPDMG[df$PROPDMGEXP == "h"] * 100
df$PROPDMG[df$PROPDMGEXP == "k"] <- df$PROPDMG[df$PROPDMGEXP == "k"] * 1000
df$PROPDMG[df$PROPDMGEXP == "m"] <- df$PROPDMG[df$PROPDMGEXP == "m"] * 1e+06
df$PROPDMG[df$PROPDMGEXP == "b"] <- df$PROPDMG[df$PROPDMGEXP == "b"] * 1e+09
Now we do the same for crop damage:
# convert CROPDMGEXP to lowercase
df$CROPDMGEXP <- tolower(df$CROPDMGEXP)
# convert each based on CROPDMGEXP
df$CROPDMG[df$CROPDMGEXP == "h"] <- df$CROPDMG[df$CROPDMGEXP == "h"] * 100
df$CROPDMG[df$CROPDMGEXP == "k"] <- df$CROPDMG[df$CROPDMGEXP == "k"] * 1000
df$CROPDMG[df$CROPDMGEXP == "m"] <- df$CROPDMG[df$CROPDMGEXP == "m"] * 1e+06
df$CROPDMG[df$CROPDMGEXP == "b"] <- df$CROPDMG[df$CROPDMGEXP == "b"] * 1e+09
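The property and crop conversions follow the same pattern, so they can also be expressed with a single lookup table. The following is a minimal sketch on a toy data frame, not the code used for this analysis; the names `multipliers`, `to_dollars` and `toy` are hypothetical. It assumes, as the steps above effectively do, that any exponent code other than h, k, m or b leaves the value unchanged.

```r
# Hypothetical lookup table: exponent code -> multiplier
multipliers <- c(h = 1e2, k = 1e3, m = 1e6, b = 1e9)

# Hypothetical helper: scale a damage value by its exponent code.
# Codes missing from the table (e.g. "" or "0") get a multiplier of 1,
# which mirrors leaving those rows untouched.
to_dollars <- function(value, exp_code) {
  mult <- multipliers[tolower(exp_code)]
  mult[is.na(mult)] <- 1
  unname(value * mult)
}

# Made-up rows, for illustration only
toy <- data.frame(PROPDMG = c(25, 2.5, 1, 3),
                  PROPDMGEXP = c("K", "m", "B", ""),
                  stringsAsFactors = FALSE)
toy$PROPDMG <- to_dollars(toy$PROPDMG, toy$PROPDMGEXP)
print(toy$PROPDMG)
```

The first toy row reproduces the worked example: 25 with exponent K becomes 25,000. A lookup table keeps the multipliers in one place, which makes it harder for the property and crop conversions to drift apart.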
To calculate the total economic impact, we'll combine the property and crop damage:
df$econ_impact <- df$PROPDMG + df$CROPDMG
To calculate the total population health impact, we'll combine the injuries and fatalities:
df$pop_impact <- df$INJURIES + df$FATALITIES
For plotting, we'll look at trends of both injuries/fatalities and economic impact by year:
require("lubridate")
## Loading required package: lubridate
df$year <- year(df$BGN_DATE)
For this analysis we'll focus on the event types that are most harmful from a population health and economic standpoint, so we'll limit to just the top ten events for each.
# create top ten groups
pop10 <- head(df %>% group_by(EVTYPE) %>% summarise(pop_impact = sum(pop_impact)) %>%
arrange(-pop_impact), 10)
econ10 <- head(df %>% group_by(EVTYPE) %>% summarise(econ_impact = sum(econ_impact)) %>%
arrange(-econ_impact), 10)
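To make the grouping step concrete, here is a small base-R sketch of the same idea (sum by event type, sort in descending order, keep the top rows) on made-up numbers. The data frame `toy` and its values are purely illustrative, and `aggregate()` stands in for the dplyr pipeline used above.

```r
# Made-up observations, for illustration only
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD"),
                  pop_impact = c(15, 2, 7, 4, 1),
                  stringsAsFactors = FALSE)

# Sum the impact per event type
totals <- aggregate(pop_impact ~ EVTYPE, data = toy, FUN = sum)

# Sort from largest to smallest and keep the top two event types
totals <- totals[order(-totals$pop_impact), ]
top2 <- head(totals, 2)
print(top2)
```

Here TORNADO (22) and HEAT (4) rank ahead of FLOOD (3), so only the first two survive the cut, in the same way that `head(..., 10)` keeps the ten largest groups in the full dataset.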
For the summary plot by year, we create a data frame with the year and the summed health and economic impacts:
summary_df <- df %>% group_by(year) %>% summarise(econ_impact = sum(econ_impact),
pop_impact = sum(pop_impact))
Here are the result plots:
require("ggplot2")
## Loading required package: ggplot2
require("scales")
## Loading required package: scales
# population health plot
ggplot(pop10, aes(EVTYPE, pop_impact)) + geom_bar(stat = "identity") + coord_flip() +
xlab("Event") + ylab("Injuries + Deaths") + ggtitle("Weather Event Impact on U.S. Population Health \n1950 - 2011")
# economic impact plot
ggplot(econ10, aes(EVTYPE, econ_impact/1e+09)) + geom_bar(stat = "identity") +
coord_flip() + xlab("Event") + ylab("Property + Crop Damage ($ billions)") +
ggtitle("Weather Event Impact on U.S. Economy \n1950 - 2011")
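Bars in these plots are drawn in the order of the levels of EVTYPE, which for a character column is alphabetical. One optional refinement, not part of the original analysis, is to order the levels by impact with `reorder()` from the stats package so that the largest bar appears at the top after `coord_flip()`. The sketch below uses made-up totals; `toy` is hypothetical.

```r
# Made-up totals, for illustration only
toy <- data.frame(EVTYPE = c("FLOOD", "HEAT", "TORNADO"),
                  pop_impact = c(30, 4, 22),
                  stringsAsFactors = FALSE)

# reorder() returns a factor whose levels are sorted by the second argument
toy$EVTYPE <- reorder(toy$EVTYPE, toy$pop_impact)
print(levels(toy$EVTYPE))
```

In the plots above, the equivalent change would be to write `aes(reorder(EVTYPE, pop_impact), pop_impact)` in place of `aes(EVTYPE, pop_impact)`, and likewise with `econ_impact` for the economic plot.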