Coursera - Reproducible Research - Project 2 - NOAA Storm Database Analysis 1950-2011

Purpose

This report uses the NOAA Storm Database, which covers events from the earliest records in 1950 through November 2011, to answer two questions:

Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Synopsis of Findings

The analysis shows a Pareto-like (“80/20” rule) distribution of human impact (fatalities and injuries) and economic impact by storm event type: the vast majority of both is concentrated in a small subset of the event types.

The most significant impacts on humans come from the following event types, in this order and magnitude:

Event         Injuries and Fatalities   Percent of Total

TORNADO                        96,997              62.3%
WIND                           12,555               8.1%
HEAT                           12,362               7.9%
FLOOD                          10,125               6.5%
LIGHTNING                       6,049               3.9%

The most significant economic impacts come from the following storm events in this order and magnitude:

Event         Economic Impact           Percent of Total

FLOOD         $179,734,099,570                     37.7%
HURRICANE      $90,271,397,810                     18.9%
TORNADO        $57,357,351,690                     12.0%
STORM SURGE    $43,323,541,000                      9.1%
HAIL           $20,734,573,770                      4.4%

It is important to remember that this analysis covers a very long period (1950-2011), and both the coverage and the quality of the reporting have improved dramatically over that time.

Recommended further studies:

Review of findings over the last 5, 10, 20, and 30 years to assess any differences in the trends. Do more recent data show changes that should affect government policy decisions and emergency management training and budgeting?
Review of how the volume of injuries, fatalities, property damage, and crop damage is distributed across individual storm events, to distinguish persistent yearly threats from rare, long-period catastrophes. This analysis would help government and business disaster planners identify potential peak demand for emergency response over a given time period.
Analysis of fatalities as a percentage of total human impact, and of the relative cost of property damage vs. crop damage. This deeper analysis may be useful for government policy formulation.
Analysis of events vs. years and also during various seasons of the year.
Regional and locational analysis using the additional location attributes in the data.
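
As a rough illustration of the first recommendation, the sketch below ranks event types within trailing windows of 5, 10, 20, and 30 years. It runs on a small invented data frame, not the NOAA data; the column name HARMED and the helper harmInWindow are hypothetical names introduced only for this example.

```r
# Toy stand-in for the cleaned storm data; all values are invented for illustration.
toy <- data.frame(
  YEAR = c(1955, 1975, 1995, 2005, 2008, 2010),
  EVTYPECLEAN = c("TORNADO", "TORNADO", "HEAT", "FLOOD", "HEAT", "TORNADO"),
  HARMED = c(120, 80, 60, 30, 45, 25)   # hypothetical fatalities + injuries
)

# Total harm per event type over the last `window` years ending at `endYear`,
# sorted from most to least harmful.
harmInWindow <- function(df, window, endYear = 2011) {
  recent <- df[df$YEAR > endYear - window, ]
  totals <- tapply(recent$HARMED, recent$EVTYPECLEAN, sum)
  sort(totals, decreasing = TRUE)
}

windows <- c(5, 10, 20, 30)
rankings <- lapply(windows, function(w) harmInWindow(toy, w))
names(rankings) <- paste0("last", windows)
print(rankings)
```

Comparing the order of event types across the list elements would show whether the ranking is stable or shifts in more recent windows.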

Data Processing

Data Sources:

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

Storm Data (47Mb)

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation

National Climatic Data Center Storm Events FAQ

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
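
One way to see how reporting volume changes over time is to count records per year. The sketch below does this on a handful of made-up date strings that are assumed to follow the same month/day/year layout as the BGN_DATE field; the vector bgnDate and its values are hypothetical.

```r
# Hypothetical BGN_DATE values in a "m/d/Y H:M:S" layout; not real records.
bgnDate <- c("4/18/1950 0:00:00", "6/8/1951 0:00:00",
             "5/3/1999 0:00:00", "5/3/1999 0:00:00", "4/27/2011 0:00:00")

# as.Date() parses the leading date portion and ignores the trailing time.
eventYear <- format(as.Date(bgnDate, format = "%m/%d/%Y"), "%Y")

# Count of records per year; sparse early years hint at incomplete reporting.
eventsPerYear <- table(eventYear)
print(eventsPerYear)
```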

Download source data from NOAA…

#  chunk caching is NOT enabled; it caused strange errors during testing
sourceFile <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
destFile <- "NOAAStormData.csv.bz2"
# next line commented out to avoid duplication
# download.file(sourceFile, destFile, method="curl")

# confirm that the compressed file is present in the working directory
if (file.exists(destFile)) print("Successful download") else print("Download problem")

## [1] "Successful download"

stormData <- read.csv(destFile)

require(data.table) # this will speed processing of this large data frame

## Loading required package: data.table

require(plyr)  # this will make it easier to query this large file and aggregate results

## Loading required package: plyr

stormDataDT <- data.table(stormData)  # converting to a data.table will speed queries

EVTypes <- sort(unique(stormDataDT$EVTYPE))

Clean up Weather/Storm Event Types…

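#  the raw event type classification includes typos and a level of granularity too fine for this report
#  clean up the event classifications in a new column, EVTYPECLEAN, leaving EVTYPE unchanged
#  note: the patterns below are applied in order, so a later match overrides an earlier one
#  (for example, "TSTM WIND" is first set to "THUNDERSTORM" and then to "WIND")

stormDataDT$EVTYPECLEAN <- as.character(stormDataDT$EVTYPE)  # character, so new labels can be assigned freely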

stormDataDT[grep("THUN", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "THUNDERSTORM"
stormDataDT[grep("THUD", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "THUNDERSTORM"
stormDataDT[grep("TSTM", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "THUNDERSTORM"
stormDataDT[grep("TROPICAL", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "TROPICAL STORM"
stormDataDT[grep("TORNADO", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "TORNADO"
stormDataDT[grep("FLOOD", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "FLOOD"
stormDataDT[grep("WATER", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "WATERSPOUT"
stormDataDT[grep("WINTER", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "WINTER STORM"
stormDataDT[grep("WINTERY", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "WINTERY MIX"
stormDataDT[grep("WINTRY", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "WINTERY MIX"
stormDataDT[grep("SNOW", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "SNOW"
stormDataDT[grep("RECORD", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "RECORD SETTING EVENT"
stormDataDT[grep("ICE", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "ICE"
stormDataDT[grep("WIND", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "WIND"
stormDataDT[grep("RAIN", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "RAIN"
stormDataDT[grep("PRECIPITATION", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "RAIN"
stormDataDT[grep("SHOWER", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "RAIN"
stormDataDT[grep("HAIL", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "HAIL"
stormDataDT[grep("HEAT", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "HEAT"
stormDataDT[grep("HOT", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "HEAT"
stormDataDT[grep("COOL", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "COOL"
stormDataDT[grep("WARM", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "WARM"
stormDataDT[grep("COLD", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "COLD"
stormDataDT[grep("DUST", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "DUST"
stormDataDT[grep("WET", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "WET"
stormDataDT[grep("DRY", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "DRY"
stormDataDT[grep("BLIZZARD", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "BLIZZARD"
stormDataDT[grep("FREEZ", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "FREEZE"
stormDataDT[grep("FROST", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "FREEZE"
stormDataDT[grep("SURF", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "SURF"
stormDataDT[grep("HURRICANE", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "HURRICANE"
stormDataDT[grep("FIRE", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "FIRE"
stormDataDT[grep("FUNNEL", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "FUNNEL"
stormDataDT[grep("TIDE", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "TIDE"
stormDataDT[grep("LIGHTNING", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "LIGHTNING"
stormDataDT[grep("SUMMARY", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "IGNORE"
stormDataDT[grep("MUD", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "MUDSLIDE"
stormDataDT[grep("VOLCANIC", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "VOLCANIC"
stormDataDT[grep("UNSEASON", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "UNSEASONABLE"
stormDataDT[grep("LANDSLIDE", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "LANDSLIDE"
stormDataDT[grep("SWELLS", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "WAVES"
stormDataDT[grep("URBAN", stormDataDT$EVTYPE, ignore.case = TRUE), ]$EVTYPECLEAN <- "URBAN EVENT"
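
The same cleanup can be expressed as a table of rules applied in order. The sketch below is a minimal base R illustration using only a few of the patterns above on invented sample values; the names rules and cleanEventType are hypothetical and not part of the original analysis.

```r
# A few (pattern, label) pairs in application order; later rows override earlier ones.
rules <- data.frame(
  pattern = c("TSTM", "FLOOD", "WINTER", "WINTRY", "WIND"),
  label   = c("THUNDERSTORM", "FLOOD", "WINTER STORM", "WINTERY MIX", "WIND"),
  stringsAsFactors = FALSE
)

cleanEventType <- function(evtype, rules) {
  cleaned <- as.character(evtype)
  for (i in seq_len(nrow(rules))) {
    # Match against the raw value, as the statements above do
    hit <- grepl(rules$pattern[i], evtype, ignore.case = TRUE)
    cleaned[hit] <- rules$label[i]
  }
  cleaned
}

# Invented sample of raw event types
raw <- c("TSTM WIND", "Flash Flood", "WINTRY MIX", "Winter Storm", "HIGH WIND", "AVALANCHE")
cleaned <- cleanEventType(raw, rules)
print(cleaned)
```

Keeping the rules in one table makes the ordering explicit: "TSTM WIND" ends up as "WIND" because the WIND rule comes after the TSTM rule, and unmatched values such as "AVALANCHE" pass through unchanged.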

Start Data Analysis and summarize by Human Impact and Economic Impact…

# Begin analysis of the data
# Analysis on Human Impacts
# use EVTYPECLEAN to get the aggregated event types; .N counts the records in each group
stormDataHuman <- stormDataDT[, list(numEVTYPE = .N,
                                     Fatalmean = mean(FATALITIES, na.rm=TRUE),
                                     Fatalsd = sd(FATALITIES, na.rm=TRUE),
                                     Fatalsum = sum(FATALITIES, na.rm=TRUE),
                                     InjuryMean = mean(INJURIES, na.rm=TRUE),
                                     Injurysd = sd(INJURIES, na.rm=TRUE),
                                     Injurysum = sum(INJURIES, na.rm=TRUE)),
                              by=EVTYPECLEAN]

stormDataHuman$FatalandInjury <- stormDataHuman$Fatalsum + stormDataHuman$Injurysum
stormDataHumanSortFatal <- stormDataHuman[order(-Fatalsum)]  # sort by fatalities in descending order
stormDataHumanSortInjury <- stormDataHuman[order(-Injurysum)] # sort by injuries in descending order
stormDataHumanSortFatalandInjury <- stormDataHuman[order(-FatalandInjury)]

# Analysis on Economic Impacts
#   The economic data is captured in two columns: PROPDMG "Property Damage" and CROPDMG "Crop Damage"
#   The economic data needs to be converted into a common numeric format.  Fields PROPDMGEXP and CROPDMGEXP use K for thousands of dollars, M for millions, and B for billions, so the values are converted via a function

# not necessarily the fastest approach but it works.  Handling of numeric characters and the +, -, and ? symbols (per the NOAA code book) is still missing.
convert2dollars <- function(amount, unit) {
  unit <- tolower(unit)
  damageCost <- amount * 1000 # (to be confirmed) assumes the figures are in k unless otherwise noted. +, - symbols are unclear
  if(unit =="h") {return(amount * 1e2)}
  if(unit =="k") {return(amount * 1e3)}
  if(unit =="m") {return(amount * 1e6)}
  if(unit =="b") {return(amount * 1e9)}
  return(damageCost)
}

# this routine is slow relative to other approaches but it works.  Replace in a future upgrade.
# Process the Property Damage numbers to be in whole dollars
stormDataDT$PROPDMGDOLLARS <- mapply(convert2dollars, as.numeric(stormDataDT$PROPDMG), as.character(stormDataDT$PROPDMGEXP))
# Process the Crop Damage numbers to be in whole dollars
stormDataDT$CROPDMGDOLLARS <- mapply(convert2dollars, as.numeric(stormDataDT$CROPDMG), as.character(stormDataDT$CROPDMGEXP))
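
A vectorized alternative to the mapply() calls is sketched below. It is not the code used to produce this report; convert2dollarsVec is a hypothetical name, and the fallback of 1e3 for unrecognized units simply mirrors the assumption made in convert2dollars above.

```r
# Look up a multiplier for every element at once instead of calling a
# function row by row. Unrecognized or empty units fall back to 1e3.
convert2dollarsVec <- function(amount, unit) {
  multipliers <- c(h = 1e2, k = 1e3, m = 1e6, b = 1e9)
  factorBy <- multipliers[tolower(as.character(unit))]
  factorBy[is.na(factorBy)] <- 1e3   # same default as convert2dollars
  amount * unname(factorBy)
}

# Invented amounts and units for illustration
dollars <- convert2dollarsVec(c(25, 2.5, 1.2, 3, 10), c("K", "M", "B", "h", ""))
print(dollars)
```

Because the lookup works on whole vectors, it could be applied directly to the PROPDMG/PROPDMGEXP and CROPDMG/CROPDMGEXP column pairs.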

# order Economic Data
stormDataDT$YEAR <- format(as.Date(stormDataDT$BGN_DATE, format="%m/%d/%Y"),"%Y")  # this is to simplify later processing on year of event

stormDataEconomicExtract <- data.frame(YEAR=stormDataDT$YEAR, EVTYPECLEAN= stormDataDT$EVTYPECLEAN, PROPERTYDMGDOLLARS=stormDataDT$PROPDMGDOLLARS, CROPDMGDOLLARS=stormDataDT$CROPDMGDOLLARS )

stormDataEconomicExtractDT <- data.table(stormDataEconomicExtract)

stormDataEconomicExtractDT$TOTALECONOMIC <- stormDataEconomicExtractDT$PROPERTYDMGDOLLARS + stormDataEconomicExtractDT$CROPDMGDOLLARS

stormDataEconomicProperty <- stormDataEconomicExtractDT[order(-PROPERTYDMGDOLLARS)]
stormDataEconomicCrop <- stormDataEconomicExtractDT[order(-CROPDMGDOLLARS)]
stormDataEconomicTotal <- stormDataEconomicExtractDT[order(-TOTALECONOMIC)]

setkey(stormDataEconomicTotal, EVTYPECLEAN)  # note use of Event Type Clean version 

# note use of EVTYPECLEAN to group by the cleaned-up event types
stormDataEconomicSummary <- as.data.frame(stormDataEconomicTotal[, sum(TOTALECONOMIC, na.rm=TRUE), by = EVTYPECLEAN])
colnames(stormDataEconomicSummary)[2] <- "TOTALECONOMIC"
# sort the data by total economic damage by event type
stormDataEconomicSummarySorted <- stormDataEconomicSummary[order(-stormDataEconomicSummary$TOTALECONOMIC),]
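
The "Percent of Total" figures in the synopsis and the Pareto-like concentration can be derived from a sorted summary like the one above. The sketch below shows the arithmetic on invented totals; summaryToy, PCT, CUMPCT, and nFor80 are hypothetical names used only here.

```r
# Invented totals, already sorted in descending order; not the report's data.
summaryToy <- data.frame(
  EVTYPECLEAN = c("A", "B", "C", "D", "E"),
  TOTAL = c(600, 200, 100, 60, 40)
)

# Share of the grand total for each event type, in percent
summaryToy$PCT <- round(100 * summaryToy$TOTAL / sum(summaryToy$TOTAL), 1)

# Running share of the grand total, in percent
summaryToy$CUMPCT <- 100 * cumsum(summaryToy$TOTAL) / sum(summaryToy$TOTAL)

# Number of leading event types needed to reach 80% of the total
nFor80 <- which(summaryToy$CUMPCT >= 80)[1]
print(summaryToy)
print(nFor80)
```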

Results

Construct Plots…

par(las=2) # make label text perpendicular to axis
par(mar=c(5,8,4,2)) # increase y-axis margin.
barplot(stormDataHumanSortFatalandInjury[1:10]$FatalandInjury/1e3, names.arg=stormDataHumanSortFatalandInjury[1:10,]$EVTYPECLEAN, main="Total Human Impact (Fatalities & Injuries)\n in thousands by Storm Event Type\n U.S.1950 to Nov 2011 Cumulative", horiz=TRUE, cex.names=0.8, axis.lty = 1, xlim=c(0,100))
grid(col="blue", lwd=2)

[Figure: Total Human Impact (Fatalities & Injuries) in thousands by Storm Event Type, U.S. 1950 to Nov 2011, cumulative]

# Economic impact plot
par(las=2) # make label text perpendicular to axis
par(mar=c(5,8,4,2)) # increase y-axis margin.
barplot(stormDataEconomicSummarySorted[1:10,]$TOTALECONOMIC/1e9, names.arg=stormDataEconomicSummarySorted[1:10,]$EVTYPECLEAN, main="Total Economic Impact in $B's Cumulative\n by Storm Event Type, U.S.\n 1950 to Nov 2011", horiz=TRUE, cex.names=0.8, axis.lty = 1, xlim=c(0,200))
grid(col="blue", lwd=2)