Reproducible Research: Peer Assessment 2

Most Harmful Weather Events

The purpose of the analysis is to determine which types of events are most harmful with respect to population health in the United States by using the NOAA Storm Database. The paper attempts to answer two basic questions about severe weather events:

Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

Synopsis

My analysis reveals that tornadoes and excessive heat are the most harmful weather event to human health in the United States. These conclusions are based on the total fatalities as well as injuries recorded for each type of event. Floods are the most economically damaging weather event, costing the economy approximately USD 144 billion per year.

Data Processing

I begin the analysis by loading libraries and setting a few global parameters:

  ## load needed libraries, set global options, and working directory
  library(knitr); library(plyr)
  opts_chunk$set(echo=TRUE)       
  setwd("~/Documents/Courses/datasciencecoursera/RepResProj2/")

I then check that the data file exists, download it (if needed), and unzip it:

  #Download file if it does not exist
  if (!file.exists("repdata-data-StormData.csv.bz2")) {
      fileURL <- "http://bit.ly/1uNSAQY"
      zipfile = "repdata-data-StormData.csv.bz2"
      download.file(fileURL, destfile=zipfile, method="curl")
  }

The data is then read into R. As it is a large file, data is first read as character strings to improve performance and speed:

  # Load the data and assign it to a variable
  file   = "repdata-data-StormData.csv.bz2"
  raw    =  read.csv(file, stringsAsFactors = FALSE) # FALSE to optimize read speed

Information about the data can be summarized using the str command:

str(raw)

## 'data.frame':    902297 obs. of  37 variables:
##  $ STATE__   : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr  "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr  "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr  "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num  97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr  "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr  "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr  "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr  "" "" "" "" ...
##  $ BGN_LOCATI: chr  "" "" "" "" ...
##  $ END_DATE  : chr  "" "" "" "" ...
##  $ END_TIME  : chr  "" "" "" "" ...
##  $ COUNTY_END: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi  NA NA NA NA NA NA ...
##  $ END_RANGE : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr  "" "" "" "" ...
##  $ END_LOCATI: chr  "" "" "" "" ...
##  $ LENGTH    : num  14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num  100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int  3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num  15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num  25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr  "K" "K" "K" "K" ...
##  $ CROPDMG   : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr  "" "" "" "" ...
##  $ WFO       : chr  "" "" "" "" ...
##  $ STATEOFFIC: chr  "" "" "" "" ...
##  $ ZONENAMES : chr  "" "" "" "" ...
##  $ LATITUDE  : num  3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num  8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num  3051 0 0 0 0 ...
##  $ LONGITUDE_: num  8806 0 0 0 0 ...
##  $ REMARKS   : chr  "" "" "" "" ...
##  $ REFNUM    : num  1 2 3 4 5 6 7 8 9 10 ...

Looking at the summary above, variables can be identified that match to the question of interest. In this analysis, I will use EVTYPE (the event type), FATALITIES, INJURIES, PROPDMG (monetary estimate of property damage), and PROPDMGEXP (unit used for the damage estimate).

Many of these variables need to be converted or manipulated into more workable formats. I perform a few data transformations below:

  # reformat data type of key variables
  raw$EVTYPE = as.factor(raw$EVTYPE)
  raw$BGN_DATE = as.POSIXlt(strptime(raw$BGN_DATE,format="%m/%d/%Y %H:%M:%S"))
  raw$DMG = ""
  raw$DMG = mapvalues(raw$PROPDMGEXP, c("B","M","m","K","H","h","0"),
                                      c(1e9,1e6,1e6,1e3,1e2,1e2,1))
  raw$DMG = as.numeric(raw$DMG) * raw$PROPDMG

## Warning: NAs introduced by coercion

  raw$DMG = as.numeric(raw$DMG)

A new variable DMG is created to capture the monetary estimate of damages from weather events in a universal unit of measure. Although there are certain uncaught response types that cause NAs to be coerced, these cases are ambigous to interpret (even after reading the data help files). As they are few enough, I find it justifiable to leave them out of the exploratory analysis without threatening the validity of the conclusions in a major way.

By plotting the number of unique types of weather events per year below, it is apparent see that the initial period of data (~1950 to 1995) has few categorizations. I find it more likely that this absence of data is a result of lack of collection systems/standards, rather than an absence of particular types of events. I deem it more probable than not that including this initial period would bias the analysis away from type of events that only started being tracked recently.

  t = table(format(raw$BGN_DATE,"%Y"))
  u  = as.numeric(tapply(raw$EVTYPE,raw$BGN_DATE[[6]], function(x) length(unique(x))))

  plot(t, type = "l", main = "Number of Records", ylab = "", col = "blue")
  par(new=T); plot(u, type='l', col = "red", lwd = 2, axes=F, xlab=NA, ylab=NA)
  axis(4, col = "red", lwd = 2)
  legend("topleft", c("Total","Unique"), lwd=c(2.5,2.5), col=c("blue","red"))

The number of unique weather events jumps sharply in 1995 (up to 387 records). Accordingly, I will work only with the subset of data from this date forward, as it is more representative, and will not bias the data towards type of weather events that were tracked earlier on in history.

  df = raw[raw$BGN_DATE >= "1995-01-01 00:00:00",]

This subset nonetheless captures 75.6% of the observations in the raw data.

Results

To answer the research questions, I will analyse which weather events cause the greatest number of fatalities, injuries, and economic damage.

To facilitate this, I have written a re-usable function that generate a summary of the absolute and relative impact of a particular parameter, by event type:

  # a function that makes a summary table of a given parameter and calculates
  # relative impact (per weather event)
  mktab = function(variable){
    v = df[[variable]]
    dfa = aggregate(v ~ EVTYPE, data = df, sum)
    dfb = aggregate(v ~ EVTYPE, data = df, length)
    names(dfb)[2] = "OCCURENCES"; names(dfa)[2] = variable
    out = merge(dfa, dfb)
    out$RELIMPACT = out[[variable]] / out$OCCURENCES
    out = out[order(out[[variable]], decreasing = T),]
    out
  }

I use this function to explore the most harmful weather events, both on the basis of total fatalities and total injuries:

  df1 = mktab("FATALITIES"); head(df1)

##             EVTYPE FATALITIES OCCURENCES  RELIMPACT
## 112 EXCESSIVE HEAT       1903       1673 1.13747759
## 665        TORNADO       1545      24251 0.06370871
## 134    FLASH FLOOD        930      52449 0.01773151
## 231           HEAT        924        755 1.22384106
## 357      LIGHTNING        725      14258 0.05084865
## 144          FLOOD        423      24473 0.01728435

  df2 = mktab("INJURIES"); head(df2)

##             EVTYPE INJURIES OCCURENCES  RELIMPACT
## 665        TORNADO    21757      24251 0.89715888
## 144          FLOOD     6769      24473 0.27659053
## 112 EXCESSIVE HEAT     6525       1673 3.90017932
## 357      LIGHTNING     4627      14258 0.32451957
## 682      TSTM WIND     3625     128560 0.02819695
## 231           HEAT     2030        755 2.68874172

With respect to the second research question, I will use the DMG parameter. These property damage estimates were computed in the data processing stage using the PROPDMG and PROPDMGEXP variables. Using the same analysis as before, we obtain a table of the most costly events:

  df3 = mktab("DMG"); head(df3)

##                EVTYPE          DMG OCCURENCES   RELIMPACT
## 49              FLOOD 143986183550      16830   8555328.8
## 129 HURRICANE/TYPHOON  69305840000         70 990083428.6
## 202       STORM SURGE  43193536000        167 258643928.1
## 243           TORNADO  24907463396      16081   1548875.3
## 43        FLASH FLOOD  15349730441      31719    483928.6
## 83               HAIL  14989598191      89714    167082.0

  barplot(head(df3$DMG), main = "Weather Events Causing the Greatest Economic Damage, 1995-2011", ylab = "Estimated Cost (USD)")
  axis(1, at = 1:6, labels = head(df3$EVTYPE))

The above chart shows that on an absolute basis, floods have been the most costly weather event to Americans (USD 144 billion). This is followed by hurricanes/typhoons and storm surges.

However, by looking at the relative impact, we see that heaviest costs per event comes from HEAVY RAIN/SEVERE WEATHER (USD 2.5 billion).

  df3[which.max(df3$RELIMPACT),]

##                       EVTYPE     DMG OCCURENCES RELIMPACT
## 99 HEAVY RAIN/SEVERE WEATHER 2.5e+09          1   2.5e+09

With only one observation above, this result appears suspect/mistaken. However, a review of the REMARKS variable for this entry reveals that this weather event was indeed substantial and costly:

  df[df$EVTYPE %in% "HEAVY RAIN/SEVERE WEATHER" & df$PROPDMG == "2.5", "REMARKS"]

## [1] "A potent weather system stalled over southeast Louisiana from the evening of May 8 through mid day of May 10 producing two bouts of heavy rain and severe thunderstorms.  The first event occurred the evening of May 8 and early morning of May 9 producing several tornadoes and widespread heavy rain of 8 to 15 inches across the greater New Orleans metro area into southeast St. Tammany Parish.  The second bout of severe weather struck during the evening of May 9 and continued into the morning of May 10, producing rainfall of 10 to 15 inches primarily from St. Tammany Parish into south Mississippi.  Drainage capacity was overwhelmed by the torrential rainfall on each of these nights and water flooded tens of thousands of homes in southeast Louisiana.  By early June, the Federal Emergency Management Agency (FEMA) reported approximately 60,000 addresses representing single family units,  multifamily units, and businesses had filed for assistance due to weather damage.  In late May, the Red Cross estimated 36,000 homes in southeast Louisiana were affected by flood water.  Newspaper accounts indicated weather related damage to reach the $2.5 to $3.0 billion range.  Parish by parish detail follows. "

Session Info

  sessionInfo()

## R version 3.1.2 (2014-10-31)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## 
## locale:
## [1] en_CA.UTF-8/en_CA.UTF-8/en_CA.UTF-8/C/en_CA.UTF-8/en_CA.UTF-8
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## loaded via a namespace (and not attached):
## [1] digest_0.6.6     evaluate_0.5.5   formatR_1.0      htmltools_0.2.6 
## [5] knitr_1.8        rmarkdown_0.3.10 stringr_0.6.2    tools_3.1.2     
## [9] yaml_2.1.13