9/29/2018
Summary
This document shows the work done to clean up the NOAA storm database and identify the storm event types that (1) are the most harmful to human health and (2) have the greatest economic consequences.
The document is separated into a Data Processing section and a Results section. The Data Processing section shows the process of cleaning the raw NOAA storm database. The main goal of the data processing was to narrow down the list of event types that are stored in the EVTYPE variable of the database. The Results section identifies the event types with the worst human health consequences (as measured by injuries and fatalities) and the event types with the worst economic consequences (as measured by property damage and crop damage).
First we load the libraries we will need:
library(lubridate)
library(dplyr)
library(ggplot2)
Load the data into R.
# The file is bzip2-compressed; read.csv() decompresses it transparently.
fileURL <- "http://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileURL, destfile="StormData.csv.bz2", mode="wb")
dat <- read.csv("StormData.csv.bz2")
The variables of interest are EVTYPE, FATALITIES, INJURIES, PROPDMG, and CROPDMG, together with PROPDMGEXP and CROPDMGEXP, which hold the magnitude codes for the two damage variables. The other variables are not needed for the purposes of this project. Therefore, we will create a new data table including only the variables of interest. We will also keep BGN_DATE and STATE in case we want to do some further analysis.
dat <- with(dat, data.frame(BGN_DATE, STATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP))
It also helps to convert the BGN_DATE variable into a Date format and create a YEAR column. We can then remove all occurrences before 1996, since that is when NOAA started recording all event types. We can also remove all observations for which FATALITIES, INJURIES, PROPDMG, and CROPDMG are all 0.
dat$BGN_DATE <- as.Date(as.character(dat$BGN_DATE), "%m/%d/%Y %H:%M:%S")
dat$YEAR <- year(dat$BGN_DATE)
dat <- dat[dat$YEAR>=1996,]
nodamage <- dat$FATALITIES==0 & dat$INJURIES==0 & dat$PROPDMG==0 & dat$CROPDMG==0
dat <- dat[!nodamage,]
dim(dat)
## [1] 201318 10
Doing this decreases the number of observations from 902,297 to 201,318.
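The date conversion and year filter above can be checked on a single value. This is a minimal base-R sketch; the sample timestamp is an illustrative assumption about how raw BGN_DATE entries look, and `format(d, "%Y")` stands in for `lubridate::year()` so the snippet needs no extra packages.

```r
# Illustrative raw timestamp (assumed layout: month/day/year hour:minute:second)
raw <- "1/6/1996 0:00:00"

# Same format string as in the analysis; the time of day is parsed, then dropped
d <- as.Date(raw, format = "%m/%d/%Y %H:%M:%S")

# Base-R way to get the year as a number
yr <- as.integer(format(d, "%Y"))

# The record is kept only if it falls in 1996 or later
keep <- yr >= 1996

print(d)
print(yr)
print(keep)
```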
str(dat)
## 'data.frame': 201318 obs. of 10 variables:
## $ BGN_DATE : Date, format: "1996-01-06" "1996-01-11" ...
## $ STATE : Factor w/ 72 levels "AK","AL","AM",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ EVTYPE : Factor w/ 985 levels " HIGH SURF ADVISORY",..: 972 834 856 856 856 359 856 856 856 153 ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 0 0 ...
## $ INJURIES : num 0 0 0 0 0 0 0 0 0 0 ...
## $ PROPDMG : num 380 100 3 5 2 400 12 8 12 75 ...
## $ PROPDMGEXP: Factor w/ 19 levels "","-","?","+",..: 17 17 17 17 17 17 17 17 17 17 ...
## $ CROPDMG : num 38 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: Factor w/ 9 levels "","?","0","2",..: 7 1 1 1 1 1 1 1 1 1 ...
## $ YEAR : num 1996 1996 1996 1996 1996 ...
Let’s take a closer look at the EVTYPE variable.
head(sort(unique(dat$EVTYPE)),30)
## [1] HIGH SURF ADVISORY FLASH FLOOD
## [3] TSTM WIND TSTM WIND (G45)
## [5] AGRICULTURAL FREEZE ASTRONOMICAL HIGH TIDE
## [7] ASTRONOMICAL LOW TIDE AVALANCHE
## [9] Beach Erosion BLACK ICE
## [11] BLIZZARD BLOWING DUST
## [13] blowing snow BRUSH FIRE
## [15] COASTAL FLOODING/EROSION COASTAL EROSION
## [17] Coastal Flood COASTAL FLOOD
## [19] Coastal Flooding COASTAL FLOODING
## [21] COASTAL FLOODING/EROSION Coastal Storm
## [23] COASTAL STORM COASTALSTORM
## [25] Cold COLD
## [27] COLD AND SNOW Cold Temperature
## [29] COLD WEATHER COLD/WIND CHILL
## 985 Levels: HIGH SURF ADVISORY COASTAL FLOOD ... WND
length(unique(dat$EVTYPE))
## [1] 222
There are 222 unique event types. Looking at the first 30 sorted unique EVTYPE items, we see that some data processing is required. For example, “Coastal Flood” and “COASTAL FLOOD” differ only in case, and “COASTAL STORM” and “COASTALSTORM” differ only by a space. Some type names also have leading spaces, such as “ HIGH SURF ADVISORY”.
First we’ll make all the characters lower case and remove the white spaces.
dat$EVTYPE <- tolower(dat$EVTYPE)
dat$EVTYPE <- gsub(" ","", dat$EVTYPE)
length(unique(dat$EVTYPE))
## [1] 179
This already decreases the number of unique event types to 179.
Looking further into the names, we can see that some are singular and some are plural, and that some end in “ing” while others do not (“coastalflood” and “coastalflooding”). Assuming that “thunderstorm” and “thunderstorms” mean the same event, we can remove a trailing “s” and then a trailing “ing” from each entry.
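To see why this step merges names, here is a small sketch on a handful of the raw names from the listing above (a toy vector, not the full data set):

```r
# A few raw EVTYPE spellings taken from the listing above
raw_types <- c(" HIGH SURF ADVISORY", "Coastal Flood", "COASTAL FLOOD",
               "COASTAL STORM", "COASTALSTORM", "Cold", "COLD")

# Lower-case everything, then delete every space (leading and interior)
clean <- gsub(" ", "", tolower(raw_types))

# Spellings that differed only by case or spacing now collapse together
print(unique(clean))
print(length(unique(clean)))
```

Note that `gsub(" ", "", ...)` removes interior spaces too, which is what lets “COASTAL STORM” and “COASTALSTORM” collapse into one name.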
dat$EVTYPE <- gsub("s$", "", dat$EVTYPE)
dat$EVTYPE <- gsub("ing$", "", dat$EVTYPE)
length(unique(dat$EVTYPE))
## [1] 169
Doing this reduces the number of unique types to 169. This is still very far from the 48 official event types.
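The suffix removal is blunt, which is worth seeing on a toy vector. The sketch below applies the same two substitutions to a few already-normalized names. It shows the intended merges, but also that “lightning” loses its ending, which is why “lightn” appears in the results later.

```r
# Normalized names (lower case, no spaces), chosen for illustration
types <- c("thunderstormwinds", "thunderstormwind",
           "coastalflooding", "coastalflood", "lightning")

# Same order as in the analysis: drop a trailing "s", then a trailing "ing"
stem <- gsub("ing$", "", gsub("s$", "", types))

print(stem)
print(unique(stem))
```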
We can further reduce the number of types by combining similar names.
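# Collapse near-duplicate names within the list of unique types:
# every name that approximately matches evtype[i] is overwritten by it.
evtype <- unique(dat$EVTYPE)
for (i in seq_along(evtype)){
evtype[agrep(evtype[i], evtype, max.distance=1)] <- evtype[i]
}
# Apply the collapsed names to the data, keeping EVTYPE intact.
dat$fTYPE <- dat$EVTYPE
for (i in seq_along(evtype)){
dat$fTYPE[agrep(evtype[i], dat$fTYPE, max.distance=1)] <- evtype[i]
}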
length(unique(dat$fTYPE))
## [1] 58
We now have 58 unique event types. While these don’t exactly match the official list of 48 event types, the list is close enough for the data processing part of this analysis, so we will stop here.
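Much of this reduction likely comes from the fact that `agrep()` looks for an approximate match to the pattern anywhere within each string, so a short name such as “flood” or “wind” also matches longer names that contain it. The sketch below illustrates that behaviour on a toy vector; the names are illustrative and the grouping in the real data may differ.

```r
# Toy vector of already-cleaned names (illustrative, not the full list)
types <- c("flood", "flashflood", "coastalflood",
           "tornado", "wind", "thunderstormwind", "highwind")

# value = TRUE returns the matching strings instead of their positions
flood_like <- agrep("flood", types, max.distance = 1, value = TRUE)
wind_like  <- agrep("wind",  types, max.distance = 1, value = TRUE)

print(flood_like)
print(wind_like)
```

This is why distinct official categories such as flash flood and coastal flood can end up under a single “flood” label in fTYPE.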
PROPDMG and CROPDMG come with corresponding PROPDMGEXP and CROPDMGEXP values.
summary(dat$PROPDMGEXP)
## - ? + 0 1 2 3 4 5
## 8448 0 0 0 0 0 0 0 0 0
## 6 7 8 B h H K m M
## 0 0 0 32 0 0 185474 0 7364
summary(dat$CROPDMGEXP)
## ? 0 2 B k K m M
## 102767 0 0 0 2 0 96787 0 1762
From the summaries above, we can see that, apart from blank entries, the only exponent codes actually used are B, K, and M (billions, thousands, and millions). The numeric codes and the other character codes do not occur in this subset. Therefore, we only need to scale the values whose exponent code is B, K, or M; values with a blank code are left unchanged. The code here is a little long but runs faster than a for loop.
propK <- dat$PROPDMGEXP=="K"
propM <- dat$PROPDMGEXP=="M"
propB <- dat$PROPDMGEXP=="B"
dat$fPROP <- dat$PROPDMG
dat$fPROP[propK] <- dat$PROPDMG[propK]*1000
dat$fPROP[propM] <- dat$PROPDMG[propM]*1000000
dat$fPROP[propB] <- dat$PROPDMG[propB]*1000000000
cropK <- dat$CROPDMGEXP=="K"
cropM <- dat$CROPDMGEXP=="M"
cropB <- dat$CROPDMGEXP=="B"
dat$fCROP <- dat$CROPDMG
dat$fCROP[cropK] <- dat$CROPDMG[cropK]*1000
dat$fCROP[cropM] <- dat$CROPDMG[cropM]*1000000
dat$fCROP[cropB] <- dat$CROPDMG[cropB]*1000000000
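The same scaling can also be written as a lookup on a named vector. This is only an alternative sketch, not the code used in the analysis; `exp_multiplier` and `scale_damage` are hypothetical names introduced here, and any code other than K, M, or B is assumed to mean "no scaling".

```r
# Hypothetical lookup table: exponent code -> multiplier
exp_multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: scale damage values by their exponent code
scale_damage <- function(value, exp_code) {
  # as.character() so factor columns are looked up by label, not by level index
  mult <- exp_multiplier[as.character(exp_code)]
  # Codes not in the table (including blank) come back as NA: leave unscaled
  mult[is.na(mult)] <- 1
  unname(value * mult)
}

# Made-up values to exercise each code
scaled <- scale_damage(c(380, 2.5, 10, 7), c("K", "M", "B", ""))
print(scaled)
```

With this helper, the property column would be computed as `scale_damage(dat$PROPDMG, dat$PROPDMGEXP)`, and likewise for crops.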
We can now aggregate the data so that INJURIES, FATALITIES, and the scaled damage values (fPROP and fCROP) are summed by YEAR and event type (fTYPE).
bytype <- dat %>% group_by(YEAR, fTYPE) %>% summarize(injuries = sum(INJURIES), fatal = sum(FATALITIES), prop= sum(fPROP), crop = sum(fCROP))
bytype<-(as.data.frame(bytype))
bytype is the final processed data set we will work with.
To determine what types of events are most harmful with respect to population health, it is useful to create a new variable that combines injuries and fatalities. Since, from the perspective of health, a fatality is much worse than an injury, fatalities are weighted twice as heavily as injuries.
bytype$health <- bytype$injuries + (2*bytype$fatal)
We can get a subset of the data for all events that have health scores higher than the average to identify the event types with the highest health impacts.
healthsub <- bytype[bytype$health > mean(bytype$health),]
healthsub <- arrange(healthsub, desc(health), desc(injuries), desc(fatal))
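A toy example makes the score and the above-average filter concrete. The numbers below are made up for illustration and do not come from the storm data.

```r
# Made-up yearly totals for three event types
toy <- data.frame(fTYPE    = c("tornado", "heat", "hail"),
                  injuries = c(100, 20, 1),
                  fatal    = c(10, 15, 0))

# Health score: injuries plus twice the fatalities
toy$health <- toy$injuries + 2 * toy$fatal

# Keep only rows whose score exceeds the mean score
above <- toy[toy$health > mean(toy$health), ]

print(toy)
print(above)
```

Because a few very large totals pull the mean upward, a mean-based cutoff tends to keep only a small number of rows; a median or a top-n cutoff would give a different subset.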
We can visualize the data subset with a simple bar chart.
g <- ggplot(healthsub, aes(YEAR, health))+
geom_bar(stat="identity", aes(fill=fTYPE))
g
This plot shows the event types most harmful to human health separated by year. Tornadoes and floods seem to be the most harmful event types.
mostharmful <- healthsub %>% group_by(fTYPE) %>% summarize(health=sum(health)) %>% arrange(desc(health))
worsthealth <- mostharmful$fTYPE
The event types most harmful to human health are tornado, heat, wind, flood, lightn, hurricane, blizzard, tropicalstorm, glaze, tsunami, fog, ripcurrent. (Here “lightn” is lightning with its “ing” suffix removed during cleaning.)
In a similar manner as done above, we can create a new variable that aggregates the economic impacts. We’ll call this variable econ. We can then create a subset of only the events that have an econ value higher than the average.
bytype$econ <- bytype$prop + bytype$crop
econsub <- bytype[bytype$econ > mean(bytype$econ), ]
econsub <- arrange(econsub, desc(econ), desc(prop), desc(crop))
We can visualize this in the same way we visualized the human health outcomes.
g <- ggplot(econsub, aes(YEAR, econ)) +
geom_bar(stat="identity", aes(fill=fTYPE))
g
This plot shows the event types most damaging to the economy, separated by year. Summed over all years, floods and hurricanes are the most detrimental event types.
mostcostly <- econsub %>% group_by(fTYPE) %>% summarize(econ=sum(econ)) %>% arrange(desc(econ))
worstecon <- mostcostly$fTYPE
The event types most costly to the economy are flood, hurricane, stormsurge, wind, tornado, drought, hail, tropicalstorm, icestorm.