9/29/2018

Summary
This document shows the work done to clean up the NOAA storm database and identify the storm event types that (1) are the most harmful to human health and (2) have the greatest economic consequences.

Synopsis

The document is separated into a Data Processing section and a Results section. The Data Processing section shows the process of cleaning the raw NOAA storm database. The main goal of the data processing was to narrow down the list of event types stored in the EVTYPE variable of the database. The Results section shows how we identified the event types with the worst human health consequences (as measured by injuries and fatalities) and the event types with the worst economic consequences (as measured by property damage and crop damage).

DATA PROCESSING

First we need to load the libraries we will need:

library(lubridate)
library(dplyr)
library(ggplot2) 

Load the data into R.

fileURL <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileURL, destfile="project2.csv", mode="wb")  # binary mode: the file is bzip2-compressed
dat <- read.csv("project2.csv")

The variables of interest are EVTYPE, FATALITIES, INJURIES, PROPDMG, and CROPDMG, along with PROPDMGEXP and CROPDMGEXP, which hold the magnitude codes for the two damage variables. The other variables are not needed for the purposes of this project. Therefore, we will create a new data frame that includes only the variables of interest. We will also keep BGN_DATE and STATE in case we want to do some further analysis.

dat <- with(dat, data.frame(BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))

It also helps to convert the BGN_DATE variable into a Date format and create a YEAR column. We can then remove all occurrences before 1996, since that’s when NOAA started recording all event types. We can also remove all observations for which FATALITIES, INJURIES, PROPDMG, and CROPDMG are all 0.

dat$BGN_DATE <- as.Date(as.character(dat$BGN_DATE), "%m/%d/%Y %H:%M:%S") 
dat$YEAR <- year(dat$BGN_DATE)
dat <- dat[dat$YEAR>=1996,]
nodamage <- dat$FATALITIES==0 & dat$INJURIES==0 & dat$PROPDMG==0 & dat$CROPDMG==0
dat <- dat[!nodamage,]
dim(dat)
## [1] 201318     10

Doing this decreases the number of observations from 902,297 to 201,318.
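As a small illustration of the filtering step above, here is a sketch on a made-up three-row data frame. The `toy` and `kept` names and all values are assumptions for illustration only, and base R's `format()` stands in for `lubridate::year()` so the snippet has no package dependencies.

```r
# Toy rows in the same "m/d/Y H:M:S" layout as BGN_DATE (values are made up).
toy <- data.frame(
  BGN_DATE   = c("4/18/1950 0:00:00", "1/6/1996 0:00:00", "7/4/2001 0:00:00"),
  FATALITIES = c(0, 0, 0),
  INJURIES   = c(15, 0, 0),
  PROPDMG    = c(25, 380, 0),
  CROPDMG    = c(0, 38, 0),
  stringsAsFactors = FALSE
)

# Parse the date and derive the year (base R equivalent of lubridate::year).
toy$BGN_DATE <- as.Date(toy$BGN_DATE, format = "%m/%d/%Y %H:%M:%S")
toy$YEAR <- as.integer(format(toy$BGN_DATE, "%Y"))

# Drop rows before 1996, then rows where all four impact columns are zero.
toy <- toy[toy$YEAR >= 1996, ]
nodamage <- toy$FATALITIES == 0 & toy$INJURIES == 0 &
  toy$PROPDMG == 0 & toy$CROPDMG == 0
kept <- toy[!nodamage, ]
```

Only the 1996 row survives: the 1950 row is too early and the 2001 row records no harm or damage.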

Exploring the Data Set

str(dat) 
## 'data.frame':    201318 obs. of  10 variables:
##  $ BGN_DATE  : Date, format: "1996-01-06" "1996-01-11" ...
##  $ STATE     : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ EVTYPE    : Factor w/ 985 levels "   HIGH SURF ADVISORY",..: 972 834 856 856 856 359 856 856 856 153 ...
##  $ FATALITIES: num  0 0 0 0 0 0 0 0 0 0 ...
##  $ INJURIES  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ PROPDMG   : num  380 100 3 5 2 400 12 8 12 75 ...
##  $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
##  $ CROPDMG   : num  38 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 7 1 1 1 1 1 1 1 1 1 ...
##  $ YEAR      : num  1996 1996 1996 1996 1996 ...

Let’s take a closer look at the EVTYPE variable.

head(sort(unique(dat$EVTYPE)),30)
##  [1]    HIGH SURF ADVISORY      FLASH FLOOD             
##  [3]  TSTM WIND                 TSTM WIND (G45)         
##  [5] AGRICULTURAL FREEZE       ASTRONOMICAL HIGH TIDE   
##  [7] ASTRONOMICAL LOW TIDE     AVALANCHE                
##  [9] Beach Erosion             BLACK ICE                
## [11] BLIZZARD                  BLOWING DUST             
## [13] blowing snow              BRUSH FIRE               
## [15] COASTAL  FLOODING/EROSION COASTAL EROSION          
## [17] Coastal Flood             COASTAL FLOOD            
## [19] Coastal Flooding          COASTAL FLOODING         
## [21] COASTAL FLOODING/EROSION  Coastal Storm            
## [23] COASTAL STORM             COASTALSTORM             
## [25] Cold                      COLD                     
## [27] COLD AND SNOW             Cold Temperature         
## [29] COLD WEATHER              COLD/WIND CHILL          
## 985 Levels:    HIGH SURF ADVISORY  COASTAL FLOOD ... WND
length(unique(dat$EVTYPE))
## [1] 222

There are 222 unique event types. Looking at the first 30 sorted unique EVTYPE items, we see that some data processing is required. For example, “Coastal Flood” and “COASTAL FLOOD” differ only in case and can be grouped into one, as can “COASTAL STORM” and “COASTALSTORM”. Some type names also have leading spaces.

Cleaning up the EVTYPE variable

First we’ll make all the characters lower case and remove the white spaces.

dat$EVTYPE <- tolower(dat$EVTYPE)
dat$EVTYPE <- gsub(" ","", dat$EVTYPE)
length(unique(dat$EVTYPE))
## [1] 179

This already decreases the number of unique event types to 179.

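Looking further into the names, we can see that some are singular and some are plural. Assuming that “thunderstorm” and “thunderstorms” mean the same event, we can remove the “s” at the end of each entry. We also strip a trailing “ing” so that, for example, “flooding” and “flood” are merged. A side effect is that “lightning” is shortened to “lightn”.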

dat$EVTYPE <- gsub("s$", "", dat$EVTYPE)
dat$EVTYPE <- gsub("ing$", "", dat$EVTYPE)
length(unique(dat$EVTYPE)) 
## [1] 169

Doing this reduces the number of unique types to 169. This is still very far from the 48 official event types.
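To make the effect of these substitutions concrete, here is a minimal sketch that applies the same four steps to a handful of names. The vector `x` is a hypothetical sample, not an extract from the database.

```r
# Hypothetical sample of raw event names (mixed case, spaces, plurals).
x <- c(" TSTM WIND", "Coastal Flooding", "COASTAL FLOOD",
       "Thunderstorms", "HIGH WINDS", "LIGHTNING")

clean <- tolower(x)               # unify case
clean <- gsub(" ", "", clean)     # drop all spaces, including leading ones
clean <- gsub("s$", "", clean)    # strip a trailing "s"
clean <- gsub("ing$", "", clean)  # strip a trailing "ing"

clean
# "tstmwind" "coastalflood" "coastalflood" "thunderstorm" "highwind" "lightn"
```

The two coastal flood spellings collapse into one name, while "lightning" is truncated to "lightn", which is why that spelling shows up later in the results.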

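We can further reduce the number of types by combining similar names, using approximate string matching with agrep.

# Collapse the list of unique names: each name absorbs every name that agrep matches with max.distance=1
evtype <- unique(dat$EVTYPE)
for (i in seq_along(evtype)){
  evtype[agrep(evtype[i], evtype, max.distance=1)] <- evtype[i]
}
# Apply the same approximate matching to the full column, stored in a new fTYPE variable
dat$fTYPE <- dat$EVTYPE
for (i in seq_along(evtype)){
  dat$fTYPE[agrep(evtype[i], dat$fTYPE, max.distance=1)] <- evtype[i]
}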
length(unique(dat$fTYPE))
## [1] 58

We now have 58 unique event types. While these don’t exactly match the official list of 48 event types, it is good enough for the data processing part of this analysis. We will stop at the 58 unique event types.
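The large drop in the number of types comes from how `agrep` matches: it looks for the pattern approximately anywhere inside each string, so a short name such as "flood" also matches longer names that contain it. The sketch below shows this on a hypothetical four-element vector; `types`, `hits`, and `merged` are illustrative names only.

```r
# Hypothetical cleaned names; "flood" appears inside two longer names.
types <- c("flood", "flashflood", "coastalflood", "tornado")

# Indices of elements that contain an approximate match for "flood".
hits <- agrep("flood", types, max.distance = 1)

# Overwrite every match with the pattern, as the loops above do.
merged <- types
merged[hits] <- "flood"
unique(merged)
```

Here all three flood variants end up as "flood". This is convenient for grouping, but it also means distinct official categories (for example flash floods and coastal floods) are merged, which is worth keeping in mind when reading the results.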

Converting the PROPDMG and CROPDMG values using PROPDMGEXP and CROPDMGEXP

PROPDMG and CROPDMG come with corresponding PROPDMGEXP and CROPDMGEXP values.

summary(dat$PROPDMGEXP)
##             -      ?      +      0      1      2      3      4      5 
##   8448      0      0      0      0      0      0      0      0      0 
##      6      7      8      B      h      H      K      m      M 
##      0      0      0     32      0      0 185474      0   7364
summary(dat$CROPDMGEXP)
##             ?      0      2      B      k      K      m      M 
## 102767      0      0      0      2      0  96787      0   1762

From the summaries above, we can see that the only exponent codes used are B, K, and M (billions, thousands, and millions). The numeric codes and the other character codes do not appear in this subset, and rows with a blank code are left unchanged. Therefore, we only need to convert the values whose exponent code is B, K, or M. The code here is a little long but runs faster than a for loop.

propK <- dat$PROPDMGEXP=="K"
propM <- dat$PROPDMGEXP=="M"
propB <- dat$PROPDMGEXP=="B"
dat$fPROP <- dat$PROPDMG
dat$fPROP[propK] <- dat$PROPDMG[propK]*1000
dat$fPROP[propM] <- dat$PROPDMG[propM]*1000000
dat$fPROP[propB] <- dat$PROPDMG[propB]*1000000000

cropK <- dat$CROPDMGEXP=="K"
cropM <- dat$CROPDMGEXP=="M"
cropB <- dat$CROPDMGEXP=="B"
dat$fCROP <- dat$CROPDMG
dat$fCROP[cropK] <- dat$CROPDMG[cropK]*1000
dat$fCROP[cropM] <- dat$CROPDMG[cropM]*1000000
dat$fCROP[cropB] <- dat$CROPDMG[cropB]*1000000000
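The same conversion can also be written with a named lookup vector, which avoids repeating one line per code. This is only a sketch of an alternative, not the method used above; `mult`, `toy`, and `value` are hypothetical names, and the numbers are made up. With the factor columns in `dat`, the codes would need `as.character()` first.

```r
# Multiplier for each exponent code that appears in the filtered data.
mult <- c(K = 1e3, M = 1e6, B = 1e9)

# Made-up damage amounts with their codes; "" means no code was recorded.
toy <- data.frame(
  DMG = c(25, 2.5, 1, 10),
  EXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Look up each code by name; codes not in the table (here "") give NA.
m <- unname(mult[toy$EXP])
m[is.na(m)] <- 1   # leave amounts without a recognized code unchanged

toy$value <- toy$DMG * m
toy$value
# 25000 2500000 1000000000 10
```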

Summarizing the results by type

We can reformat the data frame so that the values for INJURIES, FATALITIES, and the converted property and crop damages (fPROP and fCROP) are summed for each combination of YEAR and cleaned event type (fTYPE).

bytype <- dat %>% group_by(YEAR, fTYPE) %>% summarize(injuries = sum(INJURIES), fatal = sum(FATALITIES), prop= sum(fPROP), crop = sum(fCROP))
bytype <- as.data.frame(bytype)

bytype is the final processed data set we will work with.

RESULTS

Types of events most harmful to population health?

To determine what types of events are most harmful with respect to population health, it might be useful to create a new variable that combines injuries and fatalities. Since, from the perspective of health, a fatality is much worse than an injury, fatalities will weigh twice as much as injuries.

bytype$health <- bytype$injuries + (2*bytype$fatal)

To identify the event types with the highest health impacts, we can take the subset of year and event-type combinations whose health score is higher than the average.

healthsub <- bytype[bytype$health > mean(bytype$health),]
healthsub <- arrange(healthsub, desc(health), desc(injuries), desc(fatal))
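As a quick worked example of the score and the above-average filter, consider three made-up rows. The `toy` and `above` names and the counts are assumptions for illustration and are not taken from the storm data.

```r
# Made-up yearly totals for three event types.
toy <- data.frame(
  fTYPE    = c("tornado", "flood", "hail"),
  injuries = c(100, 20, 5),
  fatal    = c(10, 15, 0),
  stringsAsFactors = FALSE
)

# Health score: injuries plus twice the fatalities.
toy$health <- toy$injuries + 2 * toy$fatal   # 120, 50, 5

# Keep only rows whose score is above the mean score (175 / 3, about 58.3).
above <- toy[toy$health > mean(toy$health), ]
above$fTYPE
# "tornado"
```

Because the mean is pulled up by the largest scores, this filter keeps relatively few rows, which is the intent: it isolates the year and type combinations that dominate the totals.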

We can visualize the data subset with a simple bar chart.

g <- ggplot(healthsub, aes(YEAR, health))+
  geom_bar(stat="identity", aes(fill=fTYPE)) 
g

This plot shows the event types most harmful to human health separated by year. Tornadoes and floods seem to be the most harmful event types.

mostharmful <- healthsub %>% group_by(fTYPE) %>% summarize(health=sum(health)) %>% arrange(desc(health))
worsthealth <- mostharmful$fTYPE

The event types most harmful to human health are tornado, heat, wind, flood, lightn, hurricane, blizzard, tropicalstorm, glaze, tsunami, fog, ripcurrent. (Here “lightn” is “lightning” after the trailing “ing” was stripped during cleaning.)

Event types with the greatest economic consequences

In a similar manner as done above, we can create a new variable that aggregates the economic impacts. We’ll call this variable econ. We can then create a subset of only the events that have an econ value higher than the average.

bytype$econ <- bytype$prop + bytype$crop
econsub <- bytype[bytype$econ > mean(bytype$econ), ]
econsub <- arrange(econsub, desc(econ), desc(prop), desc(crop))

We can visualize this in the same way we visualized the human health outcomes.

g <- ggplot(econsub, aes(YEAR, econ)) + 
  geom_bar(stat="identity", aes(fill=fTYPE)) 
g

This plot shows the event types most damaging to the economy separated by year. Floods and hurricanes seem to be the most costly event types.

mostcostly <- econsub %>% group_by(fTYPE) %>% summarize(econ=sum(econ)) %>% arrange(desc(econ)) 
worstecon <- mostcostly$fTYPE

The event types most costly to the economy are flood, hurricane, stormsurge, wind, tornado, drought, hail, tropicalstorm, icestorm.