Note: The analysis depends on the following R packages.

library(dplyr)
library(R.utils)
library(ggplot2)
library(scales)
library(lubridate)

Synopsis

The following analysis quantifies the damage to property, crops, and human life (fatalities and injuries) that results from severe weather. It addresses two questions: which types of events are most harmful to population health, and which have the greatest economic consequences.

Population health is quantified based on injuries and fatalities for a given event. Economic consequences are based on combined crop and property damage estimates. The analysis concludes that hurricanes have caused the most economic damage, while tornadoes resulted in the most fatalities as well as the most injuries.

Data Processing

The following code is used to download, decompress, extract and read the data. In addition, the date range of the data set is determined by extracting the year of the first and last beginning date in the analysis.

# Source archive and local file names
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
csvFile <- 'repdata-data-StormData.csv'
zipFile <- 'repdata-data-StormData.csv.bz2'
# Download the bzip2 archive in binary mode, then decompress it to csvFile
download.file(url, zipFile, method='curl', mode = "wb")
bunzip2(zipFile)
df <- read.csv(csvFile, stringsAsFactors=FALSE)
# Parse the beginning date (strptime returns POSIXlt) and find the year range
df$BGN_DATE <- strptime(df$BGN_DATE, '%m/%d/%Y')
earliestYear <- year(min(df$BGN_DATE))
latestYear <- year(max(df$BGN_DATE))
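
As a side note on the date handling, the sketch below shows what strptime() returns and how as.Date() could give the same year range with a simpler class. The two sample strings are illustrative values in the month/day/year layout assumed for BGN_DATE, not rows quoted from the data set, and the variable names are hypothetical.

```r
# Illustrative values in the "month/day/year time" layout assumed for BGN_DATE
sample_dates <- c("4/18/1950 0:00:00", "11/28/2011 0:00:00")

# strptime() returns POSIXlt, a list-based date-time class
parsed_lt <- strptime(sample_dates, "%m/%d/%Y")
class(parsed_lt)

# as.Date() with the same format yields a plain Date vector;
# the trailing time portion is not consumed by the format and is ignored
parsed_date <- as.Date(sample_dates, format = "%m/%d/%Y")

# Year range, analogous to earliestYear and latestYear
year_range <- range(as.integer(format(parsed_date, "%Y")))
year_range
```

A Date column of this kind would not need to be dropped before grouping, though whether that is preferable depends on how the dates are used later.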

The dataset was compiled by the National Oceanic and Atmospheric Administration (NOAA). It describes damage in the United States between 1950 and 2011.

Data Description

The relevant columns in the raw data are as follows:

Column      Description
EVTYPE      Identifies the type of event.
CROPDMG     A number indicating the amount of crop damage. The total dollar amount is determined by multiplying it by a value derived from the CROPDMGEXP column.
CROPDMGEXP  The crop damage multiplier.
PROPDMG     A number indicating the amount of property damage. The total dollar amount is determined by multiplying it by a value derived from the PROPDMGEXP column.
PROPDMGEXP  The property damage multiplier.

Valid multipliers found in CROPDMGEXP and PROPDMGEXP can have the following values:

Value Description
B Billion
M Million
K Thousand
H Hundred

Any other value in these columns is not interpreted as a multiplier; for such records the damage number is used as is (a multiplier of 1).

Functions used in Data Processing

The multiplier() function takes a value from the CROPDMGEXP or PROPDMGEXP column and returns the number to be used as a multiplier.

Due to various problems with the data in EVTYPE (e.g. misspellings, extra spaces, capitalization inconsistencies, truncated data, events spread across multiple records, records associated with multiple weather phenomena) the data is normalized to a set of categories in a derived column named CAT. The categorize() function maps each EVTYPE to a category.

##
## Input:  letter is an arbitrary character which SHOULD be a letter.
## Output: a multiplier
## 
## CROPDMGEXP marked with ?, 0, 2 had minor damage.   
## PROPDMGEXP -, ?, + ,0 , 1, 2, 3, 4, 5, 6, 7, 8 are difficult to discern in some cases.  
## Following the strict definition of this column, 
## B, H, K or M only will be used as multipliers
##
multiplier <- function(letter){
  x<-1
  if (letter %in% c('B','b')){ x<-1000000000}
  else if(letter %in% c('M','m')){x<-1000000}
  else if(letter %in% c('K','k')){x<-1000}  
  else if(letter %in% c('H','h')){x<-100}
  return(x)    
}
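
Because multiplier() handles one value at a time, it has to be called row by row. A minimal sketch of a vectorized alternative follows; the name multiplier_vec and the lookup-table approach are assumptions for illustration, not part of the original analysis. It applies the same rule: B, M, K, and H (in either case) map to their powers of ten, and everything else maps to 1.

```r
# Hypothetical vectorized variant of multiplier(); not used in the analysis
multiplier_vec <- function(codes) {
  lookup <- c(B = 1e9, M = 1e6, K = 1e3, H = 1e2)
  # Index the named vector by upper-cased code; unknown codes come back as NA
  out <- unname(lookup[toupper(codes)])
  # Anything that is not B, M, K or H falls back to a multiplier of 1
  out[is.na(out)] <- 1
  out
}

multiplier_vec(c("K", "m", "", "?", "B"))
```

With a function of this shape, the cost calculation could operate on whole columns at once instead of row by row.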

##
## Function to group related EVTYPES
##
categorize <- function(eventType){
  x<-'NA'
  if (grepl('(?i)AVALANC|SLIDE|LANDSLUMP', eventType)){                                x<-'AVALANCHE'}
  else if (grepl('(?i)SURF|TIDE|SEA|SWELL|WATER|WAVE|CURRENT|SEICHE', eventType)){     x<-'TIDE SURF'}
  else if (grepl('(?i)TORNADO|WALL CLOUD|SPOUT|FUNNEL|TORNDAO', eventType)){           x<-'TORNADO'}
  else if (grepl('(?i)TSUNAMI', eventType)){                                           x<-'TSUNAMI'}
  else if (grepl('(?i)VOLCAN', eventType)){                                            x<-'VOLCANO'}
  else if (grepl('(?i)HURRICANE|FLOYD', eventType)){                                   x<-'HURRICANE'}
  else if (grepl('(?i)TYPHOON', eventType)){                                           x<-'TYPHOON'}
  else if (grepl('(?i)DOWNBURST', eventType)){                                         x<-'DOWNBURST'}
  else if (grepl('(?i)DROUGHT|DRY|DRIEST', eventType)){                                x<-'DROUGHT'}
  else if (grepl('(?i)DUST', eventType)){                                              x<-'DUST'}
  else if (grepl('(?i)LIGHTNING|LIGNTNING|LIGHTING', eventType)){                      x<-'LIGHTNING'}
  else if (grepl('(?i)MICROBURST', eventType)){                                        x<-'MICROBURST'}
  else if (grepl('(?i)FIRE|RED FLAG|SMOKE', eventType)){                               x<-'FIRE'}
  else if (grepl('(?i)URBAN|FLOO|STREAM', eventType)){                                 x<-'FLOOD'}
  else if (grepl('(?i)FOG', eventType)){                                               x<-'FOG'}
  else if (grepl('(?i)SLEET', eventType)){                                             x<-'SLEET'}
  else if (grepl('(?i)SNOW|BLIZZ', eventType)){                                        x<-'SNOW'}
  else if (grepl('(?i)ICE|FREEZ|GLAZE|ICY', eventType)){                               x<-'ICE'}    
  else if (grepl('(?i)HAIL', eventType)){                                              x<-'HAIL'}
  else if (grepl('(?i)TROPICAL|STORM|TSTM', eventType)){                               x<-'STORM'}
  else if (grepl('(?i)RAIN|WET|DRIZZLE|SHOWER', eventType)){                           x<-'RAIN'}
  else if (grepl('(?i)PRECIP|MIX', eventType)){                                        x<-'PRECIPITATION'}
  else if (grepl('(?i)EROSI', eventType)){                                             x<-'EROSION'}
  else if (grepl('(?i)HEAT|HOT|HIGH TEMP|WARM|RECORD HIGH', eventType)){               x<-'HEAT'}
  else if (grepl('(?i)COLD|EXPOSUR|LOW TEMP|COOL|WINT|RECORD LOW', eventType)){        x<-'COLD'}
  else if (grepl('(?i)FROST', eventType)){                                             x<-'FROST'}
  else if (grepl('(?i)RECORD TEMPERATURE', eventType)){                                x<-'RECORD TEMPERATURE'}
  else if (grepl('(?i)MILD', eventType)){                                              x<-'MILD'}
  else if (grepl('(?i)WIND|GUST|TURBULENCE|WND', eventType)){                          x<-'WIND'}
  else if (grepl('(?i)DAM', eventType)){                                               x<-'DAM FAILURE'}
  else if (grepl('(?i)SUMMARY', eventType)){                                           x<-'NA'}
  return(x)         
}
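
Since categorize() returns the first pattern that matches, the order of the rules matters. The sketch below illustrates this with a table-driven helper and three of the patterns from the function above; the helper name first_match, the rules vector, and the sample event strings are assumptions chosen for illustration. It suggests that substrings such as WAVE or WATER can capture events one might expect in a later category, which may be worth checking against the actual EVTYPE values.

```r
# Hypothetical table-driven version using a subset of the rules, in the same order
rules <- c(
  "TIDE SURF" = "SURF|TIDE|SEA|SWELL|WATER|WAVE|CURRENT|SEICHE",
  "TORNADO"   = "TORNADO|WALL CLOUD|SPOUT|FUNNEL|TORNDAO",
  "HEAT"      = "HEAT|HOT|HIGH TEMP|WARM|RECORD HIGH"
)

# Return the name of the first rule whose pattern matches, else the string 'NA'
first_match <- function(event_type, rules) {
  for (category in names(rules)) {
    if (grepl(rules[[category]], event_type, ignore.case = TRUE)) {
      return(category)
    }
  }
  "NA"
}

heat_wave  <- first_match("HEAT WAVE", rules)       # WAVE is tested before HEAT
waterspout <- first_match("WATERSPOUT", rules)      # WATER is tested before SPOUT
plain_heat <- first_match("EXCESSIVE HEAT", rules)
c(heat_wave, waterspout, plain_heat)
```

Reordering the rules or anchoring patterns on word boundaries would change such assignments, and with them the category totals.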

Cleanup Data

An egregious data error was noted for the record with REFNUM 605943 concerning flooding in Napa Valley in 2006. The PROPDMGEXP was set to B (for billion) which would suggest damage on par with Katrina. See the REMARKS column for this record and this article for a rationale for why 115 million rather than 115 billion is an appropriate value.

TOTAL_COST is a derived column calculated as

TOTAL_COST = (CROPDMG * multiplier(CROPDMGEXP)) + (PROPDMG * multiplier(PROPDMGEXP))

df$PROPDMGEXP[df$REFNUM == 605943] <- 'M'
df2<-tbl_df(df) %>% rowwise() %>% mutate (TOTAL_COST=(CROPDMG * multiplier(CROPDMGEXP)) + (PROPDMG * multiplier(PROPDMGEXP)))
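
To make the size of the correction concrete, here is a small worked example of the cost formula. Only the property damage figure of 115 and the change from B to M come from the text; the crop damage value and all variable names are invented for illustration.

```r
# Multiplier codes as defined in the table of valid values
exp_value <- c(B = 1e9, M = 1e6, K = 1e3, H = 1e2)

prop_dmg <- 115   # from the text: 115 recorded with exponent B
crop_dmg <- 2     # hypothetical crop damage, recorded with exponent M

# Total cost with the exponent as recorded (B) and as corrected (M)
total_as_recorded  <- crop_dmg * exp_value[["M"]] + prop_dmg * exp_value[["B"]]
total_as_corrected <- crop_dmg * exp_value[["M"]] + prop_dmg * exp_value[["M"]]

c(recorded = total_as_recorded, corrected = total_as_corrected)
```

A single mistyped exponent shifts the property component by a factor of 1000, which is enough to distort a category total.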

## Set Category based on function above
df2$CAT<-sapply(df2$EVTYPE, function(eventType)categorize(eventType))

Data summarization consists of grouping by the cleaned-up event categories (the CAT column) and computing the total cost of damage, the total number of fatalities, and the total number of injuries for each category. BGN_DATE was converted to POSIXlt above, a type that dplyr does not support, so the column is removed first; it is not used in the rest of the analysis.

df2<-df2[,!colnames(df2) %in% 'BGN_DATE']

catGroup <- tbl_df(df2) %>% group_by(CAT)
highest_total_cost_cat <- catGroup %>% summarize(sum_total_cost=sum(TOTAL_COST)) %>%arrange(desc(sum_total_cost)) 
populate_health_fatalities_sum_cat <- catGroup %>% summarize(sum_fatalities=sum(FATALITIES)) %>% arrange(desc(sum_fatalities))  
populate_health_injuries_sum_cat <- catGroup %>% summarize(sum_injuries=sum(INJURIES)) %>% arrange(desc(sum_injuries))  
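
The grouped totals can be cross-checked without dplyr. The following is a minimal base R sketch of the same group-sum-sort pattern on invented rows; the toy data and variable names are assumptions, and the numbers carry no meaning beyond the illustration.

```r
# Invented rows standing in for the categorized data
toy <- data.frame(
  CAT        = c("TORNADO", "FLOOD", "TORNADO", "HAIL"),
  FATALITIES = c(3, 1, 2, 0),
  stringsAsFactors = FALSE
)

# Sum fatalities per category, then sort in descending order
fatalities_by_cat <- sort(tapply(toy$FATALITIES, toy$CAT, sum), decreasing = TRUE)
fatalities_by_cat
```

If the real columns contained missing values, sum() would return NA for the affected category unless na.rm = TRUE is passed.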

Results

Over the period analyzed, hurricanes caused over 80 billion dollars in damage. Related hurricane effects, including tropical storms, flooding, and tornadoes, also cause significant losses to property and crops. However, tornadoes result in thousands more injuries and deaths than any other category.

ggplot(head(highest_total_cost_cat), aes(x=CAT, y=sum_total_cost / 1000000)) + 
  geom_bar(stat='identity') + 
  xlab("Cat Type") + ylab("Economic Damage (in millions of $)") +
  ggtitle("NOAA Costliest Weather Events (Property + Crop), 1950-2011") + scale_y_continuous(labels=comma) +
  coord_flip()


ggplot(head(populate_health_fatalities_sum_cat), aes(x=CAT, y=sum_fatalities)) + geom_bar(stat='identity') + coord_flip()


ggplot(head(populate_health_injuries_sum_cat), aes(x=CAT, y=sum_injuries)) + geom_bar(stat='identity') + coord_flip()
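
One presentational detail: with a character column on the x axis, ggplot2 orders the bars alphabetically rather than by value. A possible refinement, sketched here with invented numbers, is to turn CAT into a factor ordered by the plotted quantity using reorder(). The data frame top_costs and its values are hypothetical.

```r
# Invented totals; only the column names mirror the summary tables above
top_costs <- data.frame(
  CAT            = c("FLOOD", "HURRICANE", "TORNADO"),
  sum_total_cost = c(2, 9, 5),
  stringsAsFactors = FALSE
)

# reorder() builds a factor whose levels are sorted by the associated value
top_costs$CAT <- reorder(top_costs$CAT, top_costs$sum_total_cost)
levels(top_costs$CAT)

# Build the plot only if ggplot2 is installed; after coord_flip() the
# largest category ends up at the top of the chart
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(top_costs, aes(x = CAT, y = sum_total_cost)) +
    geom_col() +
    coord_flip()
}
```

The same idea could be applied inline, for example with aes(x = reorder(CAT, sum_fatalities), y = sum_fatalities).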
