Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in injuries, property damage and even loss of life.
This project explorins the US National Oceanic and Atmospheric Administration’s (NOAA) storm database. The analysis will include two main parts:
1. Analyzing finacial loss incliding property and crop damage.
2. Analyzing the effects of storms on personal loss including injury and death.
1. Gather and Load required libraries
setwd("~/Documents/R/rep_research/peer_assignment2")
# Make list of packages: Loop to makes sure they are all installed and load quietly. Thanks Hadley!
pkgs <-c('knitr','scales','quantmod','dplyr','ggplot2','reshape2','magrittr','devtools')
for(p in pkgs) if(p %in% rownames(installed.packages()) == FALSE) {install.packages(p)}
for(p in pkgs) suppressPackageStartupMessages(library(p, quietly=TRUE, character.only=TRUE))
options("getSymbols.warning4.0"=FALSE)
if("plotly" %in% rownames(installed.packages()) == FALSE){
require(devtools)
devtools::install_github("ropensci/plotly")
}
suppressPackageStartupMessages(library("plotly", quietly=TRUE, character.only=TRUE))
opts_chunk$set(fig.path='fig/')
2. Load Data
if(!file.exists("./data")){dir.create("./data")}
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile="./data/repdata_data_StormData.csv", method="curl")
# Load data and use na.strings and strip.white to clean up data
if (!"repdata_data_StormData.csv" %in% ls()) {
dat <- read.csv("./data/repdata_data_StormData.csv", sep = ",",
stringsAsFactors = FALSE, na.strings=c(""," "), strip.white = TRUE)
}
3. Take a look at the loaded data.
head(dat, n=3)
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1 1 4/18/1950 0:00:00 0130 CST 97 MOBILE AL
## 2 1 4/18/1950 0:00:00 0145 CST 3 BALDWIN AL
## 3 1 2/20/1951 0:00:00 1600 CST 57 FAYETTE AL
## EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO 0 <NA> <NA> <NA> <NA> 0
## 2 TORNADO 0 <NA> <NA> <NA> <NA> 0
## 3 TORNADO 0 <NA> <NA> <NA> <NA> 0
## COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1 NA 0 <NA> <NA> 14.0 100 3 0 0
## 2 NA 0 <NA> <NA> 2.0 150 2 0 0
## 3 NA 0 <NA> <NA> 0.1 123 2 0 0
## INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP WFO STATEOFFIC ZONENAMES
## 1 15 25.0 K 0 <NA> <NA> <NA> <NA>
## 2 0 2.5 K 0 <NA> <NA> <NA> <NA>
## 3 2 25.0 K 0 <NA> <NA> <NA> <NA>
## LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1 3040 8812 3051 8806 <NA> 1
## 2 3042 8755 0 0 <NA> 2
## 3 3340 8742 0 0 <NA> 3
There are 902297 rows and 37 columns in total. The events in the database start in the year 1950 and end in November 2011.
NOAA Storm Event Database reveals that prior to 1996 the database did not contain all types of events and may show a bias towards events that have been recorded the longest. We will subset the data to include events after 1996-01-01.
# Get rid of caps in header.
names(dat) <- tolower(names(dat))
# Change date formats to get year only.
dat$year_dt <- as.numeric(format(as.Date(dat$bgn_date, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
# Cut data to only the variables and years we will be working with.
stormdf <- dat[dat$year >= 1996,]
stormdf <- stormdf[,c(38,8,23:28)]
These data are very messy. There are misspellings, inconsistencies and white space through out the data. Before any meaningful analysis takes place, we have to clean the data into a useable format.
1. Turn zeros into NA.
Normally one wouldn’t do this but there are several columns that are all NA with zero values in numerical fields. In order to cut these columns, we will set zeros to NA and later convert them back to numerical “0”.
# Make true zeros NA.
stormdf[stormdf==0] <- NA
2. Correct misspellings.
# Clean up the many many misspellings in Evtype column.
stormdf$evtype <- gsub("/", " ", stormdf$evtype)
stormdf$evtype <- gsub("^\\s+|\\s+$", "", stormdf$evtype)
stormdf$evtype <- gsub("TSTM", "THUNDERSTORM", stormdf$evtype)
stormdf$evtype <- gsub("THUNDEERSTORM", "THUNDERSTORM", stormdf$evtype)
stormdf$evtype <- gsub("STORMS", "STORM", stormdf$evtype)
stormdf$evtype <- gsub("FLOODING", "FLOOD", stormdf$evtype)
stormdf$evtype <- gsub("WND", "WIND", stormdf$evtype)
stormdf$evtype <- gsub("W INDS", "WIND", stormdf$evtype)
stormdf$evtype <- gsub("WINDS", "WIND", stormdf$evtype)
stormdf$evtype <- gsub("SML", "SMALL", stormdf$evtype)
stormdf$evtype <- gsub("HVY", "HEAVY", stormdf$evtype)
stormdf$evtype <- gsub("HAIL\\s.+$", "HAIL", stormdf$evtype)
3. Delete rows that are all NA. Since we’ve turned numerial zeros to NA, we can now get rid of any rows which are all NA.
tidy <- stormdf[rowSums(is.na(stormdf))!=6,]
4. Revert NAs in numerical columns to zeros.
We will be performing some mathematical analysis on these columns later, so we need to revert all of the NA we changed from zero back to a numerical state.
# Change NAs back to zeros so R won't get mad when we do math.
tidy$propdmgexp[is.na(tidy$propdmgexp)] <- 0
tidy$cropdmgexp[is.na(tidy$cropdmgexp)] <- 0
tidy$propdmg[is.na(tidy$propdmg)] <- 0
tidy$cropdmg[is.na(tidy$cropdmg)] <- 0
tidy$fatalities[is.na(tidy$fatalities)] <- 0
tidy$injuries[is.na(tidy$injuries)] <- 0
Now that we’ve achieved a tidy data set that we can work with, the analysis can begin. The two main parts of the analysis are: 1. Financial Loss 2. Personal Injury of Death
The data set includes monetery amounts for property and crop damage along with exponent columns that show how many thousand, million or billion of US dollars of loss.
Our method will be to use the multiplyer to convert the amount to USD and convert the USD amount to the value of a 2011 dollar. For dollar conversion, we will leverage the quantmod R package to convert all amounts to a 2011 dollar. Note: Results have been varified with the calculator profived by the Bureau of Labor Statistics.
1. Add two new columns to calculate dollar amount of property and crop damange using the exponents.
# Add columns to calculate crop/property damages with modifiers.
tidy$cropdmg <- tidy$cropdmg * 10^strtoi(chartr("KMB", "369", tidy$cropdmgexp))
tidy$propdmg <- tidy$propdmg * 10^strtoi(chartr("KMB", "369", tidy$propdmgexp))
tidy$damage <- tidy$cropdmg + tidy$propdmg
2. Convert all dollar amounts to 2011 USD.
# Grab the Consumer Price Index for All Urban Consumers: All Items
getSymbols("CPIAUCSL", src='FRED')
## [1] "CPIAUCSL"
# Get annual average Consumer Price Index.
avg.cpi <- apply.yearly(CPIAUCSL, mean)
# Convert by dividing by base price. Year of base price set to 2011.
cf <- avg.cpi/as.numeric(avg.cpi['2011'])
cf <- data.frame(date=index(cf), coredata(cf))
cf$date <- as.numeric(format(as.Date(cf$date, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
3. Perform a left join by year on both data sets and calculate the result.
# Join CPI inflation adjustment with main dataframe
cpi_dat <- left_join(tidy, cf, by=c("year_dt" = "date"))
# Convert damages into 2011 dollars.
cpi_dat$cropdmg_adj <- cpi_dat$cropdmg * cpi_dat$CPIAUCSL
cpi_dat$propdmg_adj <- cpi_dat$propdmg * cpi_dat$CPIAUCSL
cpi_dat$damage_adj <- cpi_dat$damage * cpi_dat$CPIAUCSL
4. Subset the adjusted damage values.
eco_plot <- cpi_dat %>%
filter(damage_adj>0) %>%
select(evtype, damage_adj, propdmg_adj, cropdmg_adj) %>%
group_by(evtype) %>%
summarise(damage_adj = sum(damage_adj), property = sum(propdmg_adj), crop = sum(cropdmg_adj))
head(arrange(eco_plot, desc(damage_adj)), n = 5)
## Source: local data frame [5 x 4]
##
## evtype damage_adj property crop
## (chr) (dbl) (dbl) (dbl)
## 1 FLOOD 132582494372 128204789247 4.377705e+09
## 2 HURRICANE TYPHOON 61832138041 59585648919 2.246489e+09
## 3 STORM SURGE 37488941819 37488938332 3.486831e+03
## 4 TORNADO 22360717182 22113686558 2.470306e+08
## 5 HAIL 14681460485 12609095765 2.072365e+09
5. Plot results of damage.
Stacked bars using the ggplot2 package weren’t as informative as we would like due to the samll marks on the y-axis. To make the numbers a bit more verbose, we add hover-over functionality from the plotly package.
eco_plot <- melt(eco_plot, id.vars = c("evtype", "damage_adj"), variable.name = "Damage_Type")
eco_plot <- transform(eco_plot, Damage_Type = factor(Damage_Type))
eco_plot <- arrange(eco_plot, desc(damage_adj))
# Only graph the top 10 events
ep <- eco_plot[1:(2*10),] %>%
ggplot(aes(x = reorder(evtype, -damage_adj), y = value/10^6, fill = Damage_Type)) +
geom_bar(stat = "identity", position = "stack", color="black") +
scale_fill_manual(values = c("#000066", "#FF3300")) +
theme_classic(base_size = 9) +
scale_x_discrete(name = "Event Type") +
scale_y_continuous(name = "Damages (converted to the value of 2011 US dollars)") +
ggtitle("Economic Damages from Storm Weather (1996-2011)") +
theme(axis.title.y = element_text(face="bold", colour="#000000", size=10),
axis.title.x = element_text(face="bold", colour="#000000", size=8),
axis.text.x = element_text(angle=16, vjust=0, size=8))
ggplotly(ep)
The weather events with the greatest economic impact over the period of this study. Between the years 1996 and 2011, floods accounted for the most economic damage by far.
1. Subset the injury and death values.
cpi_dat$victims <- cpi_dat$fatalities + cpi_dat$injuries
health_plot <- cpi_dat %>%
filter(victims > 0) %>%
select(evtype, victims, fatalities, injuries) %>%
group_by(evtype) %>%
summarise(victims = sum(victims), fatalities = sum(fatalities), injuries = sum(injuries))
2. Plot results of injuries and deaths.
health_plot <- melt(health_plot, id.vars = c("evtype", "victims"), variable.name = "Loss_Type")
health_plot <- transform(health_plot, Loss_Type = factor(Loss_Type))
health_plot <- arrange(health_plot, desc(victims))
# Only graph the top 10 events
hp <- health_plot[1:(2*10),] %>%
ggplot(aes(x = reorder(evtype, -victims), y = value, fill = Loss_Type)) +
geom_bar(stat = "identity", position = "stack", color="black") +
scale_fill_manual(values = c("#000066", "#FF3300")) +
theme_classic(base_size = 9) +
scale_x_discrete(name = "Event Type") +
scale_y_continuous(name = "Total Persons Injured or Killed") +
ggtitle("Health Impact from Storm Weather (1996-2011)") +
theme(axis.title.y = element_text(face="bold", colour="#000000", size=12),
axis.title.x = element_text(face="bold", colour="#000000", size=8),
axis.text.x = element_text(angle=16, vjust=0, size=8))
ggplotly(hp)
As we see in the figure above, tornados have casued the most injuries by a significant amount. However, heat and excessive heat have caused the most deaths in the selected time period.
devtools::session_info()
## Session info --------------------------------------------------------------
## setting value
## version R version 3.2.2 (2015-08-14)
## system x86_64, darwin13.4.0
## ui X11
## language (EN)
## collate en_US.UTF-8
## tz America/New_York
## date 2015-11-18
## Packages ------------------------------------------------------------------
## package * version date source
## assertthat 0.1 2013-12-06 CRAN (R 3.2.0)
## base64enc 0.1-3 2015-07-28 CRAN (R 3.2.0)
## colorspace 1.2-6 2015-03-11 CRAN (R 3.2.0)
## DBI 0.3.1 2014-09-24 CRAN (R 3.2.0)
## devtools * 1.9.1 2015-09-11 CRAN (R 3.2.0)
## digest 0.6.8 2014-12-31 CRAN (R 3.2.0)
## dplyr * 0.4.3 2015-09-01 CRAN (R 3.2.0)
## evaluate 0.8 2015-09-18 CRAN (R 3.2.0)
## formatR 1.2.1 2015-09-18 CRAN (R 3.2.0)
## ggplot2 * 1.0.1 2015-03-17 CRAN (R 3.2.0)
## gridExtra 2.0.0 2015-07-14 CRAN (R 3.2.0)
## gtable 0.1.2 2012-12-05 CRAN (R 3.2.0)
## htmltools 0.2.6 2014-09-08 CRAN (R 3.2.0)
## htmlwidgets 0.5 2015-06-21 CRAN (R 3.2.0)
## httr 1.0.0 2015-06-25 CRAN (R 3.2.0)
## jsonlite 0.9.17 2015-09-06 CRAN (R 3.2.0)
## knitr * 1.11 2015-08-14 CRAN (R 3.2.1)
## labeling 0.3 2014-08-23 CRAN (R 3.2.0)
## lattice 0.20-33 2015-07-14 CRAN (R 3.2.2)
## lazyeval 0.1.10.9000 2015-08-14 Github (hadley/lazyeval@ecb8dc0)
## magrittr * 1.5 2014-11-22 CRAN (R 3.2.0)
## MASS 7.3-44 2015-08-30 CRAN (R 3.2.0)
## memoise 0.2.1 2014-04-22 CRAN (R 3.2.0)
## munsell 0.4.2 2013-07-11 CRAN (R 3.2.0)
## plotly * 2.0.2 2015-11-17 Github (ropensci/plotly@23c3945)
## plyr 1.8.3 2015-06-12 CRAN (R 3.2.0)
## proto 0.3-10 2012-12-22 CRAN (R 3.2.0)
## quantmod * 0.4-5 2015-07-24 CRAN (R 3.2.0)
## R6 2.1.1 2015-08-19 CRAN (R 3.2.1)
## Rcpp 0.12.2 2015-11-15 CRAN (R 3.2.2)
## reshape2 * 1.4.1 2014-12-06 CRAN (R 3.2.0)
## rmarkdown 0.8.1 2015-10-10 CRAN (R 3.2.1)
## rstudioapi 0.3.1 2015-04-07 CRAN (R 3.2.0)
## scales * 0.3.0 2015-08-25 CRAN (R 3.2.0)
## stringi 1.0-1 2015-10-22 CRAN (R 3.2.0)
## stringr 1.0.0 2015-04-30 CRAN (R 3.2.0)
## TTR * 0.23-0 2015-07-10 CRAN (R 3.2.0)
## viridis 0.3.1 2015-10-11 CRAN (R 3.2.0)
## xts * 0.9-7 2014-01-02 CRAN (R 3.2.0)
## yaml 2.1.13 2014-06-12 CRAN (R 3.2.0)
## zoo * 1.7-12 2015-03-16 CRAN (R 3.2.0)