Synopsis

This report summarises the data found in the NOAA Storm Database, particularly in relation to the following questions:

Which severe weather events are most harmful with respect to population health?
Which types of events have the greatest economic consequences?

The report presents only preliminary findings. It comprises (1) a section describing the transformations which were made to the original dataset (notably selecting relevant years to analyze, converting coded amounts to reflect their economic value, and doing some minor variable cleaing), and (2) a section presenting basic tables and figures related to the two questions stated above.

The data used in this report covers the period from 01 January 1996 to 30 November 2011.

Excessive heat, tornado, and flash flood were found to have the most fatalities; tornado, flood, and excessive heat were found to have the most injuries. It is likely that flood and flash flood could be combined, but they have been left as is for this report.
Hurricane/typhoon, storm surge, and flood caused the most property damage, while drought, flood, and hurricane caused the most crop damage.

Data Processing

First we read in the data using R. The source file is a bz2 file, so we can use bzcat to directly read the data in using fread from the “data.table” package.

According to the details available¹, complete storm data for all event types are available only from 1996 on-wards, so we first convert the event date to an actual date format, and subset on the relevant rows. There are also several columns that we are not interested in, so we will drop those columns and re-save the Rdata file. This should speed up loading the data on subsequent runs.

## -------------------------------------------------------------------
## 01 - SETTING THE WORKING DIRECTORY --------------------------------
## -------------------------------------------------------------------

## I work on both Windows and Linux, and keep the files in Dropbox.
## This lets me load the relevant data from either system.
OS <- Sys.info()["sysname"]

if (OS == "Windows") {
  setwd("C:/Users/Ananda PC/Dropbox/Coursera/reproducible_research")
} else if (OS == "Linux") {
  setwd("~/Dropbox/Coursera/reproducible_research/")
}

## -------------------------------------------------------------------
## 02 - LOADING REQUIRED PACKAGES ------------------------------------
## -------------------------------------------------------------------

## Load "data.table"
suppressPackageStartupMessages(library(data.table))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(lattice))

## -------------------------------------------------------------------
## 03 - DECIDING WHICH VERSION OF THE DATA TO LOAD -------------------
## -------------------------------------------------------------------

## Check to see if the data exists as a rdata file.
## If not, read it in and save as rdata file

## -------------------------------------------------------------------
## 03A - LOADING THE RAW DATA ----------------------------------------
## -------------------------------------------------------------------

if (!file.exists("storm.Rdata")) {
  storm <- fread("bzcat repdata-data-StormData.csv.bz2")
  save(storm, file = "storm.Rdata")
} 

## -------------------------------------------------------------------
## 03B - LOADING THE RAW DATA PROCESSED AS AN RDATA FILE -------------
##     - PROCESS THE FILE AND RE-SAVE AS A NEW DATA SUBSET -----------
## -------------------------------------------------------------------

if (!file.exists("storm_final.Rdata")) {
  cat("Loading from saved Rdata file.")
  load("storm.Rdata")

  ## -----------------------------------------------------------------
  ## 03B1 - Keep only 1996 onwards since that's when 
  ##        full data is available
  ##      - First, convert to proper dates, 
  ##        then select the relevant rows
  ## -----------------------------------------------------------------
  
  storm[, BGN_DATE := as.Date(BGN_DATE, format = "%m/%d/%Y")]
  storm[, END_DATE := as.Date(END_DATE, format = "%m/%d/%Y")]
  storm <- storm[BGN_DATE >= "1996-01-01"]

  ## -----------------------------------------------------------------
  ## 03B2 - Make the dataset a little bit more manageable
  ##        by only keeping selected columns
  ## -----------------------------------------------------------------

  keep <- c("BGN_DATE", "TIME_ZONE", "COUNTY", "STATE", "EVTYPE", "F", 
            "MAG", "FATALITIES", "INJURIES", "PROPDMG", "PROPDMGEXP", 
            "CROPDMG", "CROPDMGEXP", "WFO", "LATITUDE", "LONGITUDE")
  storm <- storm[, keep, with = FALSE]
  
  ## -----------------------------------------------------------------
  ## 03B3 - Tidy up the EVTYPE column
  ##      - Process the CROPDMGEXP and PROPDMGEXP columns
  ##      - Save to a smaller and more manageable dataset
  ## -----------------------------------------------------------------

  storm[, EVTYPE := trimws(gsub("[^A-Z]", "", toupper(EVTYPE)))]
  
  ## Write a basic function to process the multiplication.
  ## Input  = The character column.
  ## Output = The converted value as numeric.
  EXP <- c("K", "M", "B")
  EXPV <- c("1000", "1e+06", "1e+09")
  
  EXPF <- function(invec) {
    as.numeric(
      as.character(
        factor(replace(invec, !invec %in% EXP, NA), EXP, EXPV)))}
  
  ## Replace the PROPDMG and CROPDMG columns.
  ## Drop the PROPDMGEXP and CROPDMGEXP columns.
  storm[, PROPDMG := PROPDMG * EXPF(PROPDMGEXP)][, PROPDMGEXP := NULL]
  storm[, CROPDMG := CROPDMG * EXPF(CROPDMGEXP)][, CROPDMGEXP := NULL]
  
  ## Resave and move on.
  save(storm, file = "storm_final.Rdata")
} else {

## -------------------------------------------------------------------
## 03C - LOADING THE PROCESSED DATA AND PROCEED WITH ANALYSIS --------
## -------------------------------------------------------------------  
  
  cat("Loading processed storm data file.")
  load("storm_final.Rdata")
}

## Loading processed storm data file.

names(storm)

##  [1] "BGN_DATE"   "TIME_ZONE"  "COUNTY"     "STATE"      "EVTYPE"    
##  [6] "F"          "MAG"        "FATALITIES" "INJURIES"   "PROPDMG"   
## [11] "CROPDMG"    "WFO"        "LATITUDE"   "LONGITUDE"

Results

Most harmful events to population health

The raw data includes many values in EVTYPE, too many to present meaningfully in a summary table. When looking at the events most harmful to a population’s head, the following steps were taken:

Group the data by EVTYPE.
Take the sum of fatalities or injuries for each EVTYPE.
Sort the data according to the sum (descending).
Select just the top 15.

The following table lists the top 15 events that have caused fatalities.

Fatalities <- storm %>%
  group_by(EVTYPE) %>%
  summarise(FATALITIES = sum(FATALITIES, na.rm = TRUE)) %>%
  arrange(desc(FATALITIES)) %>%
  head(15)

knitr::kable(Fatalities)

EVTYPE	FATALITIES
EXCESSIVEHEAT	1797
TORNADO	1511
FLASHFLOOD	887
LIGHTNING	651
FLOOD	414
RIPCURRENT	340
TSTMWIND	241
HEAT	237
HIGHWIND	235
AVALANCHE	223
RIPCURRENTS	202
WINTERSTORM	191
THUNDERSTORMWIND	130
EXTREMECOLDWINDCHILL	125
EXTREMECOLD	115

The same process was used to calculate the top 15 events that have cause injuries, as displayed in the following table.

Injuries <- storm %>%
  group_by(EVTYPE) %>%
  summarise(INJURIES = sum(INJURIES, na.rm = TRUE)) %>%
  arrange(desc(INJURIES)) %>%
  head(15)

knitr::kable(Injuries)

EVTYPE	INJURIES
TORNADO	20667
FLOOD	6758
EXCESSIVEHEAT	6391
LIGHTNING	4141
TSTMWIND	3629
FLASHFLOOD	1674
THUNDERSTORMWIND	1400
WINTERSTORM	1292
HURRICANETYPHOON	1275
HEAT	1222
HIGHWIND	1083
WILDFIRE	911
HAIL	713
FOG	712
HEAVYSNOW	698

There are several events which intersect:

matchedEvents <- sprintf("* %s\n", Injuries$EVTYPE[Injuries$EVTYPE %in% Fatalities$EVTYPE])
cat("", matchedEvents)

TORNADO
FLOOD
EXCESSIVEHEAT
LIGHTNING
TSTMWIND
FLASHFLOOD
THUNDERSTORMWIND
WINTERSTORM
HEAT
HIGHWIND

Events with the greatest economic consequences

A similar approach was taken to identify the events which had the greatest economic consequences. To understand this impact, totals for property damage (PROPDMG) and crop damage (CROPDMG) were calculated grouped by EVTYPE.

As in the previous section, only the top 15 values were identified.

When calculating the PROPDMG values, the totals for EVTYPE == "FLOOD" seemed to be inaccurate. The anomalous value was cross referenced with the original raw data, which includes a column titled REMARKS. In this column, the following was noted:

Major flooding continued into the early hours of January 1st, before the Napa River finally fell below flood stage and the water receeded. Flooding was severe in Downtown Napa from the Napa Creek and the City and Parks Department was hit with $6 million in damage alone. The City of Napa had 600 homes with moderate damage, 150 damaged businesses with costs of at least $70 million.

The assumption that can be made is that instead of a billion multiplier, the multiplier should have been million.

## Initial run -- but numbers seem off...
PD_SUM <- storm %>%
  group_by(EVTYPE) %>%
  summarise(PD_SUM = sum(PROPDMG, na.rm = TRUE)) %>%
  arrange(desc(PD_SUM)) %>%
  head(15)

## Identify incorrect FLOOD value and replace it with a more reasonable value
fixme <- storm[EVTYPE == "FLOOD", which.max(PROPDMG)]
storm[EVTYPE == "FLOOD"][fixme]$PROPDMG <- 
  storm[EVTYPE == "FLOOD"][fixme]$PROPDMG/1000

## Recalculate property damage totals
PD_SUM <- storm %>%
  group_by(EVTYPE) %>%
  summarise(PD_SUM = sum(PROPDMG, na.rm = TRUE)) %>%
  arrange(desc(PD_SUM)) %>%
  mutate(EVTYPE = as.character(EVTYPE)) %>%
  head(15)

## Required for proper sorting in the plot
PD_SUM$EVTYPE <- factor(PD_SUM$EVTYPE, PD_SUM$EVTYPE)

## Generate table and chart
PropertyDamage <- barchart(PD_SUM/1e+09 ~ EVTYPE, PD_SUM, 
         scales = list(x = list(rot = 45)),
         xlab = "Event Type",
         ylab = "Damage Cost (in USD Billion)",
         main = "Property Damage Cost of Top 15 Event Types")
knitr::kable(PD_SUM %>% 
               mutate("Property Damage (in Billions)" = 
                        sprintf("$ %6.2f", PD_SUM/1e+9)) %>% 
               select(-PD_SUM))

EVTYPE	Property Damage (in Billions)
HURRICANETYPHOON	$ 69.31
STORMSURGE	$ 43.19
FLOOD	$ 29.06
TORNADO	$ 24.62
FLASHFLOOD	$ 15.22
HAIL	$ 14.60
HURRICANE	$ 11.81
TROPICALSTORM	$ 7.64
HIGHWIND	$ 5.25
WILDFIRE	$ 4.76
STORMSURGETIDE	$ 4.64
TSTMWIND	$ 4.49
ICESTORM	$ 3.64
THUNDERSTORMWIND	$ 3.38
WILDFORESTFIRE	$ 3.00

PropertyDamage

The same approach was taken for calculating crop damage.

## Calculate crop damage totals
CROP_SUM <- storm %>%
  group_by(EVTYPE) %>%
  summarise(CROP_SUM = sum(CROPDMG, na.rm = TRUE)) %>%
  arrange(desc(CROP_SUM)) %>%
  head(15)

## Required for proper sorting in the plot
CROP_SUM$EVTYPE <- factor(CROP_SUM$EVTYPE, CROP_SUM$EVTYPE)

## Generate table and chart
CropDamage <- barchart(CROP_SUM/1e+09 ~ EVTYPE, CROP_SUM, 
         scales = list(x = list(rot = 45)),
         xlab = "Event Type",
         ylab = "Damage Cost (in USD Billion)",
         main = "Crop Damage Cost of Top 15 Event Types")
knitr::kable(CROP_SUM %>% 
               mutate("Crop Damage (in Billions)" = 
                        sprintf("$ %6.2f", CROP_SUM/1e+9)) %>% 
               select(-CROP_SUM))

EVTYPE	Crop Damage (in Billions)
DROUGHT	$ 13.37
FLOOD	$ 4.97
HURRICANE	$ 2.74
HURRICANETYPHOON	$ 2.61
HAIL	$ 2.48
FLASHFLOOD	$ 1.33
EXTREMECOLD	$ 1.31
FROSTFREEZE	$ 1.09
HEAVYRAIN	$ 0.73
TROPICALSTORM	$ 0.68
HIGHWIND	$ 0.63
TSTMWIND	$ 0.55
EXCESSIVEHEAT	$ 0.49
THUNDERSTORMWIND	$ 0.40
WILDFIRE	$ 0.30

CropDamage

The overall damage can be calculated as follows:

OVERALL <- storm %>% 
  mutate(PROPDMG = replace(PROPDMG, is.na(PROPDMG), 0), 
         CROPDMG = replace(CROPDMG, is.na(CROPDMG), 0), 
         DMG = PROPDMG + CROPDMG) %>% 
  group_by(EVTYPE) %>% 
  summarise(DMG = sum(DMG)) %>% 
  arrange(desc(DMG)) %>% 
  head(15)

## Required for proper sorting in the plot
OVERALL$EVTYPE <- factor(OVERALL$EVTYPE, OVERALL$EVTYPE)

## Generate table and chart
OverallDamage <- barchart(DMG/1e+09 ~ EVTYPE, OVERALL, 
         scales = list(x = list(rot = 45)),
         xlab = "Event Type",
         ylab = "Damage Cost (in USD Billion)",
         main = "Overall Damage Cost of Top 15 Event Types")
knitr::kable(OVERALL %>% 
               mutate("Overall Damage (in Billions)" = 
                        sprintf("$ %6.2f", DMG/1e+9)) %>% 
               select(-DMG))

EVTYPE	Overall Damage (in Billions)
HURRICANETYPHOON	$ 71.91
STORMSURGE	$ 43.19
FLOOD	$ 34.03
TORNADO	$ 24.90
HAIL	$ 17.07
FLASHFLOOD	$ 16.56
HURRICANE	$ 14.55
DROUGHT	$ 14.41
TROPICALSTORM	$ 8.32
HIGHWIND	$ 5.88
WILDFIRE	$ 5.05
TSTMWIND	$ 5.04
STORMSURGETIDE	$ 4.64
THUNDERSTORMWIND	$ 3.78
ICESTORM	$ 3.66

OverallDamage

Limitations of analysis

This should only be considered a very preliminary analysis. As is evident in the data summaries provided, there is still a fair amount of data cleaning still necessary. For instance, among the most economically impactful events were TSTMWIND and THUNDERSTORMWIND. It is likely that these are different terms for the same EVTYPE and should thus be added to gether. Similarly, there are different degrees mentioned for certain event types; these can also be treated as a common cluster in subsequent analyses.

See: http://www.ncdc.noaa.gov/stormevents/details.jsp ↩

Harmful and Costly Severe Weather Events

Ananda

23 October 2015