Synopsis

This report analyzes the NOAA Storm Database to assess which types of events are most harmful to population health and which have the greatest economic consequences. For both, the data is limited to the United States and cover the period 1950 through November of 2011 and considers 48 different event types. In order to complete the analysis, significant data cleaning was required and is outlined in the following sections.

Overall, analysis shows that tornados have the most significant impact on population health while floods pose the greatest threat to local economies.

Data Processing

The data used in this analysis can be downloaded here

Load required libraries
This program utilizes functions from the data.table, lubridate, tidyverse, reshape2, and ggplot2 libraries.

File Download

data_in <- fread("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2")

Check data to ensure loaded correctly:
We expect there to be 902,297 rows and 37 columns of data

dim(data_in)
## [1] 902297     37

Select only the fields that will be used in this analysis

df <- select(data_in, EVTYPE, FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP,BGN_DATE)

Add month and year fields (for optional later analysis)

df$EVT_YEAR <- year(as.Date(df$BGN_DATE, "%m/%d/%Y"))
df$EVT_MTH <- month(as.Date(df$BGN_DATE, "%m/%d/%Y"))

Compute actual property and crop damages
As given, damages are split between two columns: df$PROPDMG and df$PROPDMGEXP for property damages, and df$CROPDMG and df$CROPDMGEXP for crop damages. The actual damage can be computed by muliplying the df$PROPDMG/df$CROPDMG values by a mapped value in the df$PROPDAMEXP/df$CROPDAMEXP field as shown here

givenEXP = c("","-","?","+",0,1,2,3,4,5,6,7,8,"h","H","k","K","m","M","b","B")
actualMULT = c(0,0,0,1,10,10,10,10,10,10,10,10,10,100,100,1000,1000,1000000,1000000,1000000000,1000000000)
EXPtable <- data.frame(givenEXP,actualMULT)

df$ACTUALPROPDAMAGE <- df$PROPDMG*EXPtable$actualMULT[match(df$PROPDMGEXP, EXPtable$givenEXP)]
df$ACTUALCROPDAMAGE <- df$CROPDMG*EXPtable$actualMULT[match(df$CROPDMGEXP, EXPtable$givenEXP)]

Reroder the columns and inspect first 6 rows of data

df <- df[,c(1,2,3,11,12,9,10)]
head(df)
##     EVTYPE FATALITIES INJURIES ACTUALPROPDAMAGE ACTUALCROPDAMAGE EVT_YEAR
## 1: TORNADO          0       15            25000                0     1950
## 2: TORNADO          0        0             2500                0     1950
## 3: TORNADO          0        2            25000                0     1951
## 4: TORNADO          0        2             2500                0     1951
## 5: TORNADO          0        2             2500                0     1951
## 6: TORNADO          0        6             2500                0     1951
##    EVT_MTH
## 1:       4
## 2:       4
## 3:       2
## 4:       6
## 5:      11
## 6:      11

Quick exploratory analysis of EVTYPE
From NOAA Documentation *(page 6, Table 2.1.1) there are supposed to be 48 official EVTYPEs in the data set.

EventTypes <- c("ASTRONOMICAL LOW TIDE","AVALANCHE","BLIZZARD","COASTAL FLOOD", 
                "COLD/WIND CHILL","DEBRIS FLOW","DENSE FOG","DENSE SMOKE","DROUGHT",
                "DUST DEVIL","DUST STORM","EXCESSIVE HEAT","EXTREME COLD/WIND CHILL",   
                "FLASH FLOOD","FLOOD","FREEZING FOG","FROST/FREEZE","FUNNEL CLOUD","HAIL",
                "HEAT","HEAVY RAIN","HEAVY SNOW","HIGH SURF","HIGH WIND","HURRICANE (TYPHOON)",
                "ICE STORM","LAKE-EFFECT SNOW","LAKESHORE FLOOD","LIGHTNING","MARINE HAIL",
                "MARINE HIGH WIND","MARINE STRONG WIND","MARINE THUNDERSTORM WIND",
                "RIP CURRENT","SEICHE","SLEET","STORM SURGE/TIDE","STRONG WIND",
                "THUNDERSTORM WIND","TORNADO","TROPICAL DEPRESSION","TROPICAL STORM","TSUNAMI",
                "VOLCANIC ASH","WATERSPOUT","WILDFIRE","WINTER STORM","WINTER WEATHER")

However, a quick query shows there are many more:

length(unique(df$EVTYPE))
## [1] 985

Further investigation revealed several issues with entries in the df$EVTYPE field; each will be addressed as appropriate.

Cleaning up EVTYPE field
First we’ll try and match each df$EVTYPE entry to one of the 48 official event types EventTypes

#Make sure all values for matching in df are uppercase
df$EVTYPE <- toupper(df$EVTYPE)

#Create an empty dataframe 'DummyEvents' with Columns from official Storm Data Event Table
DummyEvents <- data.frame(matrix(0, nrow = nrow(df),ncol = length(EventTypes)))
colnames(DummyEvents) <- gsub(" ", "_", EventTypes, fixed = TRUE)
colnames(DummyEvents) <- gsub("/", "_", colnames(DummyEvents), fixed = TRUE)
colnames(DummyEvents) <- gsub("(", "", colnames(DummyEvents), fixed = TRUE)
colnames(DummyEvents) <- gsub(")", "", colnames(DummyEvents), fixed = TRUE)

#Create a dataframe for keeping track of the count for each event type
InputEvents <- data.frame(matrix(0, nrow = length(EventTypes),ncol = 2))
colnames(InputEvents) <- c("EVTYPE","COUNT")

#Find records that match official event types using grepl
for(i in 1:length(EventTypes)){
  rows <- which(rowSums(`dim<-`(grepl(EventTypes[i], as.matrix(df), fixed=TRUE), dim(df))) > 0)
  DummyEvents[rows,i] <- 1
  InputEvents$EVTYPE[i] <- EventTypes[i]
  InputEvents$COUNT[i] <- sum(DummyEvents[,i])
}

Of the 902297 events in the initial dateset, this process matched 7.3007610^{5}, or 0.8091305 % to one of the 48 official types.

Next, we’ll address the 0.1908695% that weren’t initially matched:

Issue Approach
Abbreviations or alternate names are used for some events (e.g. TSTM WIND instead of THUNDERSTORM WIND) Map abbreviated or alternate names when clear and appropriate
Some events have sub-categories (e.g. there are 46 EVTYPEs that include “HAIL”) Roll-up sub-category names when clear and appropriate
Some events could be mapped into more than one fo the 48 official event types (e.g. THUNDERSTORM WINDS HAIL) Since we can’t determine which caused damage, drop from analysis
Some events do not map cleanly into any of the 48 official event types (e.g. BEACH EROSIN) Drop from analysis
Some entries contain typos (e.g. AVALANCHE vs.AVALANCE) Capture what we can with simple techniques; a more complete approach would be to create a mapping for each of the 898 EVTYPEs
NewEventTypes <- c("TSTM WIND","MARINE TSTM WIND", "LANDSLIDE", "HURRICANE", "TYPHOON")

NewDummyEvents <- data.frame(matrix(0, nrow = nrow(df),ncol = length(NewEventTypes)))
colnames(NewDummyEvents) <- gsub(" ", "_", NewEventTypes, fixed = TRUE)

NewInputEvents <- data.frame(matrix(0, nrow = length(NewEventTypes),ncol = 2))
colnames(NewInputEvents) <- c("EVTYPE","COUNT")

#Find records that contain new event types 
for(i in 1:length(NewEventTypes)){
  rows <- which(rowSums(`dim<-`(grepl(NewEventTypes[i], as.matrix(df), fixed=TRUE), dim(df))) > 0)
  NewDummyEvents[rows,i] <- 1
  NewInputEvents$EVTYPE[i] <- NewEventTypes[i]
  NewInputEvents$COUNT[i] <- sum(NewDummyEvents[,i])
}

This results in 2.3440510^{5} new matched entries, but contains some double counting of events initially captured. For instance, FLOOD is captured both as its own event but also counted in FLASH FLOOD. In order to combine these additional matches with the initial matches, we need to dedupe.

#Dedupe instances 
DummyEvents$HURRICANE_TYPHOON  <- DummyEvents$HURRICANE_TYPHOON +
                                  NewDummyEvents$HURRICANE +
                                  NewDummyEvents$TYPHOON

DummyEvents$THUNDERSTORM_WIND  <- DummyEvents$THUNDERSTORM_WIND -
                                  DummyEvents$MARINE_THUNDERSTORM_WIND +
                                  NewDummyEvents$TSTM_WIND - 
                                  NewDummyEvents$MARINE_TSTM_WIND 

DummyEvents$FLOOD              <- DummyEvents$FLOOD - 
                                  DummyEvents$FLASH_FLOOD - 
                                  DummyEvents$COASTAL_FLOOD - 
                                  DummyEvents$LAKESHORE_FLOOD

DummyEvents$HEAT               <- DummyEvents$HEAT - 
                                  DummyEvents$EXCESSIVE_HEAT

DummyEvents$HIGH_WIND          <- DummyEvents$HIGH_WIND - 
                                  DummyEvents$MARINE_HIGH_WIND

DummyEvents$STRONG_WIND        <- DummyEvents$STRONG_WIND - 
                                  DummyEvents$MARINE_STRONG_WIND

DummyEvents$HAIL               <- DummyEvents$HAIL - 
                                  DummyEvents$MARINE_HAIL

DummyEvents$COLD_WIND_CHILL    <- DummyEvents$COLD_WIND_CHILL - 
                                  DummyEvents$EXTREME_COLD_WIND_CHILL

DummyEvents$DEBRIS_FLOW        <- DummyEvents$DEBRIS_FLOW + 
                                  NewDummyEvents$LANDSLIDE
#Combine into single column of final Events; note which rows were not matched
DummyEvents$NOTMATCHED <- 0
Event <- max.col(DummyEvents, ties.method = "last" )
Event <- colnames(DummyEvents)[Event]

Finally, we’re ready to create our tidy dataframes for analysis:

dftidy <- cbind(Event,df)
dftidy <- dftidy[,c(1,3,4,5,6)]
dftidy <- dftidy[!dftidy$Event == "NOTMATCHED", ] # Remove rows that were unable to be matched to one of the 48 official event types

Aggreate the tidydf by Event to get the sums for fatalities, injuries, property damage, and crop damage:

dfsums <- aggregate(.~Event, dftidy, sum)
colnames(dfsums) <- c("Event_Type","FATALITIES","INJURIES","PROPDAMAGE","CROPDAMAGE")
dfsums
##                  Event_Type FATALITIES INJURIES   PROPDAMAGE  CROPDAMAGE
## 1     ASTRONOMICAL_LOW_TIDE          0        0       320000           0
## 2                 AVALANCHE        224      170      3721800           0
## 3                  BLIZZARD        101      805    659313950   112060000
## 4             COASTAL_FLOOD          6        7    412616060       56000
## 5           COLD_WIND_CHILL         95       12      1990000      600000
## 6               DEBRIS_FLOW         39       53    324701000    20017000
## 7                 DENSE_FOG         18      342      9674000           0
## 8               DENSE_SMOKE          0        0       100000           0
## 9                   DROUGHT          0        4   1046106000 13972566000
## 10               DUST_DEVIL          2       43       718630           0
## 11               DUST_STORM         22      440      5549000     3100000
## 12           EXCESSIVE_HEAT       1922     6525      7753700   492407780
## 13  EXTREME_COLD_WIND_CHILL        125       24      8648000       50000
## 14              FLASH_FLOOD       1035     1800  16906899203  1532197150
## 15                    FLOOD        483     6795 150180691730 10847452950
## 16             FREEZING_FOG          0        0      2182000           0
## 17             FROST_FREEZE          0        0     10480000  1094186000
## 18             FUNNEL_CLOUD          0        3       194600           0
## 19                     HAIL         15     1371  15974567877  3046937800
## 20                     HEAT       1216     2699     12572050   412061500
## 21               HEAVY_RAIN         99      255   3222510142   794252800
## 22               HEAVY_SNOW        127     1034    949827157   134653100
## 23                HIGH_SURF        146      204    113525000     1500000
## 24                HIGH_WIND        295     1507   6005003745   686821900
## 25        HURRICANE_TYPHOON        135     1333  85356410010  5516117800
## 26                ICE_STORM         89     1992   3945528310  5022113500
## 27         LAKE-EFFECT_SNOW          0        0     40115000           0
## 28          LAKESHORE_FLOOD          0        0      7540000           0
## 29                LIGHTNING        817     5232    938723917    12092090
## 30              MARINE_HAIL          0        0         4000           0
## 31         MARINE_HIGH_WIND          1        1      1297010           0
## 32       MARINE_STRONG_WIND         14       22       418330           0
## 33 MARINE_THUNDERSTORM_WIND         10       26       436400       50000
## 34              RIP_CURRENT        577      529       163000           0
## 35                   SEICHE          0        0       980000           0
## 36                    SLEET          2        0      2000000           0
## 37         STORM_SURGE_TIDE         11        5   4641188000      850000
## 38              STRONG_WIND        111      301    182874240    69953500
## 39        THUNDERSTORM_WIND        709     9459   9760798222  1253398700
## 40                  TORNADO       5658    91364  58541934147   417462960
## 41      TROPICAL_DEPRESSION          0        0      1737000           0
## 42           TROPICAL_STORM         66      383   7714390550   694896000
## 43                  TSUNAMI         33      129    144062000       20000
## 44             VOLCANIC_ASH          0        0       500000           0
## 45               WATERSPOUT          6       72     60730700           0
## 46                 WILDFIRE         75      911   4865614000   295972800
## 47             WINTER_STORM        217     1353   6749997260    32444000
## 48           WINTER_WEATHER         61      538     27298000    15000000

Results

Impact on Population Health
In order to determine which events had the greatest impact on population health, we combine injuries and fatalities and sort. The data shows that tornados have by far the greatest impact on population health.

dfhealth <- dfsums[,c(1,2,3)]
dfhealth <- dfhealth[order(dfsums$FATALITIES + dfsums$INJURIES, decreasing = TRUE),]  
colnames(dfhealth) <- c("Event_Type","Total_Fatalities","Total_Injuries")

dfhealth.m <- melt(head(dfhealth,10),id.vars = "Event_Type")

ggplot(dfhealth.m, aes(x = Event_Type, y = value,fill=variable)) +
       geom_bar(stat='identity') + 
       theme(axis.text.x = element_text(angle=65,vjust=0.6))+
       labs(title="Impact of Top 10 Event Types on Population Health", 
       subtitle="Total Injuries & Fatalities 1950-2001",
       y="Injuries + Fatalities",
       x="Event Type")

Impact on Local Economies
In order to determine which events had the greatest impact on local economies, we combine total property and crop damages and sort. The data shows that floods have the greatest impact on local economies, followed by Hurricanes / Typhoons and tornados.

dfeconomic <- dfsums[,c(1,4,5)]
dfeconomic <- dfeconomic[order(dfsums$PROPDAMAGE + dfsums$CROPDAMAGE, decreasing = TRUE),]  
colnames(dfeconomic) <- c("Event_Type","Total_Property_Damage","Total_Crop_Damage")

dfeconomic.m <- melt(head(dfeconomic,10),id.vars = "Event_Type")

ggplot(dfeconomic.m, aes(x = Event_Type, y = value,fill=variable)) +
       geom_bar(stat='identity') + 
       theme(axis.text.x = element_text(angle=65,vjust=0.6))+
       labs(title="Impact of Top 10 Event Types on Local Economies", 
       subtitle="Total Crop & Property Damage 1950-2001",
       y="Total Damage ($)",
       x="Event Type")