Synopsis

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in injuries, property damage and even loss of life.

This project explorins the US National Oceanic and Atmospheric Administration’s (NOAA) storm database. The analysis will include two main parts:

1. Analyzing finacial loss incliding property and crop damage.
2. Analyzing the effects of storms on personal loss including injury and death.

Data Processing

1. Gather and Load required libraries

setwd("~/Documents/R/rep_research/peer_assignment2")
# Make list of packages: Loop to makes sure they are all installed and load quietly. Thanks Hadley!
pkgs <-c('knitr','scales','quantmod','dplyr','ggplot2','reshape2','magrittr','devtools')
for(p in pkgs) if(p %in% rownames(installed.packages()) == FALSE) {install.packages(p)}
for(p in pkgs) suppressPackageStartupMessages(library(p, quietly=TRUE, character.only=TRUE))
options("getSymbols.warning4.0"=FALSE)
if("plotly" %in% rownames(installed.packages()) == FALSE){
    require(devtools)
    devtools::install_github("ropensci/plotly")
}
suppressPackageStartupMessages(library("plotly", quietly=TRUE, character.only=TRUE))
opts_chunk$set(fig.path='fig/')

2. Load Data

if(!file.exists("./data")){dir.create("./data")}
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile="./data/repdata_data_StormData.csv", method="curl")

# Load data and use na.strings and strip.white to clean up data
if (!"repdata_data_StormData.csv" %in% ls()) {
    dat <- read.csv("./data/repdata_data_StormData.csv", sep = ",", 
                    stringsAsFactors = FALSE, na.strings=c(""," "), strip.white = TRUE)
}

3. Take a look at the loaded data.

head(dat, n=3)
##   STATE__          BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
## 1       1 4/18/1950 0:00:00     0130       CST     97     MOBILE    AL
## 2       1 4/18/1950 0:00:00     0145       CST      3    BALDWIN    AL
## 3       1 2/20/1951 0:00:00     1600       CST     57    FAYETTE    AL
##    EVTYPE BGN_RANGE BGN_AZI BGN_LOCATI END_DATE END_TIME COUNTY_END
## 1 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
## 2 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
## 3 TORNADO         0    <NA>       <NA>     <NA>     <NA>          0
##   COUNTYENDN END_RANGE END_AZI END_LOCATI LENGTH WIDTH F MAG FATALITIES
## 1         NA         0    <NA>       <NA>   14.0   100 3   0          0
## 2         NA         0    <NA>       <NA>    2.0   150 2   0          0
## 3         NA         0    <NA>       <NA>    0.1   123 2   0          0
##   INJURIES PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP  WFO STATEOFFIC ZONENAMES
## 1       15    25.0          K       0       <NA> <NA>       <NA>      <NA>
## 2        0     2.5          K       0       <NA> <NA>       <NA>      <NA>
## 3        2    25.0          K       0       <NA> <NA>       <NA>      <NA>
##   LATITUDE LONGITUDE LATITUDE_E LONGITUDE_ REMARKS REFNUM
## 1     3040      8812       3051       8806    <NA>      1
## 2     3042      8755          0          0    <NA>      2
## 3     3340      8742          0          0    <NA>      3

There are 902297 rows and 37 columns in total. The events in the database start in the year 1950 and end in November 2011.

Subsetting Relevant Data

NOAA Storm Event Database reveals that prior to 1996 the database did not contain all types of events and may show a bias towards events that have been recorded the longest. We will subset the data to include events after 1996-01-01.

# Get rid of caps in header.
names(dat) <- tolower(names(dat))
# Change date formats to get year only.
dat$year_dt <- as.numeric(format(as.Date(dat$bgn_date, format = "%m/%d/%Y %H:%M:%S"), "%Y"))
# Cut data to only the variables and years we will be working with.
stormdf <- dat[dat$year >= 1996,]
stormdf <- stormdf[,c(38,8,23:28)]

Data Cleaning

These data are very messy. There are misspellings, inconsistencies and white space through out the data. Before any meaningful analysis takes place, we have to clean the data into a useable format.

1. Turn zeros into NA.

Normally one wouldn’t do this but there are several columns that are all NA with zero values in numerical fields. In order to cut these columns, we will set zeros to NA and later convert them back to numerical “0”.

# Make true zeros NA.
stormdf[stormdf==0] <- NA

2. Correct misspellings.

# Clean up the many many misspellings in Evtype column.
stormdf$evtype <- gsub("/", " ", stormdf$evtype)
stormdf$evtype <- gsub("^\\s+|\\s+$", "", stormdf$evtype)
stormdf$evtype <- gsub("TSTM", "THUNDERSTORM", stormdf$evtype)
stormdf$evtype <- gsub("THUNDEERSTORM", "THUNDERSTORM", stormdf$evtype)
stormdf$evtype <- gsub("STORMS", "STORM", stormdf$evtype)
stormdf$evtype <- gsub("FLOODING", "FLOOD", stormdf$evtype)
stormdf$evtype <- gsub("WND", "WIND", stormdf$evtype)
stormdf$evtype <- gsub("W INDS", "WIND", stormdf$evtype)
stormdf$evtype <- gsub("WINDS", "WIND", stormdf$evtype)
stormdf$evtype <- gsub("SML", "SMALL", stormdf$evtype)
stormdf$evtype <- gsub("HVY", "HEAVY", stormdf$evtype)
stormdf$evtype <- gsub("HAIL\\s.+$", "HAIL", stormdf$evtype)

3. Delete rows that are all NA. Since we’ve turned numerial zeros to NA, we can now get rid of any rows which are all NA.

tidy <- stormdf[rowSums(is.na(stormdf))!=6,]

4. Revert NAs in numerical columns to zeros.

We will be performing some mathematical analysis on these columns later, so we need to revert all of the NA we changed from zero back to a numerical state.

# Change NAs back to zeros so R won't get mad when we do math.
tidy$propdmgexp[is.na(tidy$propdmgexp)] <- 0
tidy$cropdmgexp[is.na(tidy$cropdmgexp)] <- 0
tidy$propdmg[is.na(tidy$propdmg)] <- 0
tidy$cropdmg[is.na(tidy$cropdmg)] <- 0
tidy$fatalities[is.na(tidy$fatalities)] <- 0
tidy$injuries[is.na(tidy$injuries)] <- 0

Results

Now that we’ve achieved a tidy data set that we can work with, the analysis can begin. The two main parts of the analysis are: 1. Financial Loss 2. Personal Injury of Death

Financial Loss

The data set includes monetery amounts for property and crop damage along with exponent columns that show how many thousand, million or billion of US dollars of loss.

Our method will be to use the multiplyer to convert the amount to USD and convert the USD amount to the value of a 2011 dollar. For dollar conversion, we will leverage the quantmod R package to convert all amounts to a 2011 dollar. Note: Results have been varified with the calculator profived by the Bureau of Labor Statistics.

1. Add two new columns to calculate dollar amount of property and crop damange using the exponents.

# Add columns to calculate crop/property damages with modifiers.
tidy$cropdmg <- tidy$cropdmg * 10^strtoi(chartr("KMB", "369", tidy$cropdmgexp))
tidy$propdmg <- tidy$propdmg * 10^strtoi(chartr("KMB", "369", tidy$propdmgexp))
tidy$damage <- tidy$cropdmg + tidy$propdmg

2. Convert all dollar amounts to 2011 USD.

# Grab the Consumer Price Index for All Urban Consumers: All Items
getSymbols("CPIAUCSL", src='FRED')
## [1] "CPIAUCSL"
# Get annual average Consumer Price Index.
avg.cpi <- apply.yearly(CPIAUCSL, mean)
# Convert by dividing by base price. Year of base price set to 2011.
cf <- avg.cpi/as.numeric(avg.cpi['2011'])
cf <- data.frame(date=index(cf), coredata(cf))
cf$date <- as.numeric(format(as.Date(cf$date, format = "%m/%d/%Y %H:%M:%S"), "%Y"))

3. Perform a left join by year on both data sets and calculate the result.

# Join CPI inflation adjustment with main dataframe
cpi_dat <- left_join(tidy, cf, by=c("year_dt" = "date"))
# Convert damages into 2011 dollars.
cpi_dat$cropdmg_adj <- cpi_dat$cropdmg * cpi_dat$CPIAUCSL
cpi_dat$propdmg_adj <- cpi_dat$propdmg * cpi_dat$CPIAUCSL
cpi_dat$damage_adj <- cpi_dat$damage * cpi_dat$CPIAUCSL

4. Subset the adjusted damage values.

eco_plot <- cpi_dat %>%
filter(damage_adj>0) %>%
select(evtype, damage_adj, propdmg_adj, cropdmg_adj) %>%
group_by(evtype) %>%
summarise(damage_adj = sum(damage_adj), property = sum(propdmg_adj), crop = sum(cropdmg_adj))
head(arrange(eco_plot, desc(damage_adj)), n = 5)
## Source: local data frame [5 x 4]
## 
##              evtype   damage_adj     property         crop
##               (chr)        (dbl)        (dbl)        (dbl)
## 1             FLOOD 132582494372 128204789247 4.377705e+09
## 2 HURRICANE TYPHOON  61832138041  59585648919 2.246489e+09
## 3       STORM SURGE  37488941819  37488938332 3.486831e+03
## 4           TORNADO  22360717182  22113686558 2.470306e+08
## 5              HAIL  14681460485  12609095765 2.072365e+09

5. Plot results of damage.

Stacked bars using the ggplot2 package weren’t as informative as we would like due to the samll marks on the y-axis. To make the numbers a bit more verbose, we add hover-over functionality from the plotly package.

eco_plot <- melt(eco_plot, id.vars = c("evtype", "damage_adj"), variable.name = "Damage_Type")
eco_plot <- transform(eco_plot, Damage_Type = factor(Damage_Type))
eco_plot <- arrange(eco_plot, desc(damage_adj))

# Only graph the top 10 events
ep <- eco_plot[1:(2*10),] %>%
        ggplot(aes(x = reorder(evtype, -damage_adj), y = value/10^6, fill = Damage_Type)) +
        geom_bar(stat = "identity", position = "stack", color="black") +
        scale_fill_manual(values = c("#000066", "#FF3300")) +
        theme_classic(base_size = 9) +
        scale_x_discrete(name = "Event Type") +
        scale_y_continuous(name = "Damages (converted to the value of 2011 US dollars)") +
        ggtitle("Economic Damages from Storm Weather (1996-2011)") +
        theme(axis.title.y = element_text(face="bold", colour="#000000", size=10),
              axis.title.x = element_text(face="bold", colour="#000000", size=8),
              axis.text.x  = element_text(angle=16, vjust=0, size=8))
ggplotly(ep)

The weather events with the greatest economic impact over the period of this study. Between the years 1996 and 2011, floods accounted for the most economic damage by far.

Personal Injury / Death

1. Subset the injury and death values.

cpi_dat$victims <- cpi_dat$fatalities + cpi_dat$injuries

health_plot <- cpi_dat %>%
    filter(victims > 0) %>%
    select(evtype, victims, fatalities, injuries) %>%
    group_by(evtype) %>%
    summarise(victims = sum(victims), fatalities = sum(fatalities), injuries = sum(injuries))

2. Plot results of injuries and deaths.

health_plot <- melt(health_plot, id.vars = c("evtype", "victims"), variable.name = "Loss_Type")
health_plot <- transform(health_plot, Loss_Type = factor(Loss_Type))
health_plot <- arrange(health_plot, desc(victims))

# Only graph the top 10 events
hp <- health_plot[1:(2*10),] %>%
        ggplot(aes(x = reorder(evtype, -victims), y = value, fill = Loss_Type)) +
        geom_bar(stat = "identity", position = "stack", color="black") +
        scale_fill_manual(values = c("#000066", "#FF3300")) +
        theme_classic(base_size = 9) +
        scale_x_discrete(name = "Event Type") +
        scale_y_continuous(name = "Total Persons Injured or Killed") +
        ggtitle("Health Impact from Storm Weather (1996-2011)") +
        theme(axis.title.y = element_text(face="bold", colour="#000000", size=12),
              axis.title.x = element_text(face="bold", colour="#000000", size=8),
              axis.text.x  = element_text(angle=16, vjust=0, size=8))
ggplotly(hp)

As we see in the figure above, tornados have casued the most injuries by a significant amount. However, heat and excessive heat have caused the most deaths in the selected time period.

System and R Session Information

devtools::session_info()
## Session info --------------------------------------------------------------
##  setting  value                       
##  version  R version 3.2.2 (2015-08-14)
##  system   x86_64, darwin13.4.0        
##  ui       X11                         
##  language (EN)                        
##  collate  en_US.UTF-8                 
##  tz       America/New_York            
##  date     2015-11-18
## Packages ------------------------------------------------------------------
##  package     * version     date       source                          
##  assertthat    0.1         2013-12-06 CRAN (R 3.2.0)                  
##  base64enc     0.1-3       2015-07-28 CRAN (R 3.2.0)                  
##  colorspace    1.2-6       2015-03-11 CRAN (R 3.2.0)                  
##  DBI           0.3.1       2014-09-24 CRAN (R 3.2.0)                  
##  devtools    * 1.9.1       2015-09-11 CRAN (R 3.2.0)                  
##  digest        0.6.8       2014-12-31 CRAN (R 3.2.0)                  
##  dplyr       * 0.4.3       2015-09-01 CRAN (R 3.2.0)                  
##  evaluate      0.8         2015-09-18 CRAN (R 3.2.0)                  
##  formatR       1.2.1       2015-09-18 CRAN (R 3.2.0)                  
##  ggplot2     * 1.0.1       2015-03-17 CRAN (R 3.2.0)                  
##  gridExtra     2.0.0       2015-07-14 CRAN (R 3.2.0)                  
##  gtable        0.1.2       2012-12-05 CRAN (R 3.2.0)                  
##  htmltools     0.2.6       2014-09-08 CRAN (R 3.2.0)                  
##  htmlwidgets   0.5         2015-06-21 CRAN (R 3.2.0)                  
##  httr          1.0.0       2015-06-25 CRAN (R 3.2.0)                  
##  jsonlite      0.9.17      2015-09-06 CRAN (R 3.2.0)                  
##  knitr       * 1.11        2015-08-14 CRAN (R 3.2.1)                  
##  labeling      0.3         2014-08-23 CRAN (R 3.2.0)                  
##  lattice       0.20-33     2015-07-14 CRAN (R 3.2.2)                  
##  lazyeval      0.1.10.9000 2015-08-14 Github (hadley/lazyeval@ecb8dc0)
##  magrittr    * 1.5         2014-11-22 CRAN (R 3.2.0)                  
##  MASS          7.3-44      2015-08-30 CRAN (R 3.2.0)                  
##  memoise       0.2.1       2014-04-22 CRAN (R 3.2.0)                  
##  munsell       0.4.2       2013-07-11 CRAN (R 3.2.0)                  
##  plotly      * 2.0.2       2015-11-17 Github (ropensci/plotly@23c3945)
##  plyr          1.8.3       2015-06-12 CRAN (R 3.2.0)                  
##  proto         0.3-10      2012-12-22 CRAN (R 3.2.0)                  
##  quantmod    * 0.4-5       2015-07-24 CRAN (R 3.2.0)                  
##  R6            2.1.1       2015-08-19 CRAN (R 3.2.1)                  
##  Rcpp          0.12.2      2015-11-15 CRAN (R 3.2.2)                  
##  reshape2    * 1.4.1       2014-12-06 CRAN (R 3.2.0)                  
##  rmarkdown     0.8.1       2015-10-10 CRAN (R 3.2.1)                  
##  rstudioapi    0.3.1       2015-04-07 CRAN (R 3.2.0)                  
##  scales      * 0.3.0       2015-08-25 CRAN (R 3.2.0)                  
##  stringi       1.0-1       2015-10-22 CRAN (R 3.2.0)                  
##  stringr       1.0.0       2015-04-30 CRAN (R 3.2.0)                  
##  TTR         * 0.23-0      2015-07-10 CRAN (R 3.2.0)                  
##  viridis       0.3.1       2015-10-11 CRAN (R 3.2.0)                  
##  xts         * 0.9-7       2014-01-02 CRAN (R 3.2.0)                  
##  yaml          2.1.13      2014-06-12 CRAN (R 3.2.0)                  
##  zoo         * 1.7-12      2015-03-16 CRAN (R 3.2.0)