This report analyzes the NOAA Storm Database to assess which types of events are most harmful to population health and which have the greatest economic consequences. For both, the data is limited to the United States and cover the period 1950 through November of 2011 and considers 48 different event types. In order to complete the analysis, significant data cleaning was required and is outlined in the following sections.
Overall, analysis shows that tornados have the most significant impact on population health while floods pose the greatest threat to local economies.
The data used in this analysis can be downloaded here
Load required libraries
This program utilizes functions from the data.table, lubridate, tidyverse, reshape2, and ggplot2 libraries.
File Download
data_in <- fread("https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2")
Check data to ensure loaded correctly:
We expect there to be 902,297 rows and 37 columns of data
dim(data_in)
## [1] 902297 37
Select only the fields that will be used in this analysis
df <- select(data_in, EVTYPE, FATALITIES,INJURIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP,BGN_DATE)
Add month and year fields (for optional later analysis)
df$EVT_YEAR <- year(as.Date(df$BGN_DATE, "%m/%d/%Y"))
df$EVT_MTH <- month(as.Date(df$BGN_DATE, "%m/%d/%Y"))
Compute actual property and crop damages
As given, damages are split between two columns: df$PROPDMG and df$PROPDMGEXP for property damages, and df$CROPDMG and df$CROPDMGEXP for crop damages. The actual damage can be computed by muliplying the df$PROPDMG/df$CROPDMG values by a mapped value in the df$PROPDAMEXP/df$CROPDAMEXP field as shown here
givenEXP = c("","-","?","+",0,1,2,3,4,5,6,7,8,"h","H","k","K","m","M","b","B")
actualMULT = c(0,0,0,1,10,10,10,10,10,10,10,10,10,100,100,1000,1000,1000000,1000000,1000000000,1000000000)
EXPtable <- data.frame(givenEXP,actualMULT)
df$ACTUALPROPDAMAGE <- df$PROPDMG*EXPtable$actualMULT[match(df$PROPDMGEXP, EXPtable$givenEXP)]
df$ACTUALCROPDAMAGE <- df$CROPDMG*EXPtable$actualMULT[match(df$CROPDMGEXP, EXPtable$givenEXP)]
Reroder the columns and inspect first 6 rows of data
df <- df[,c(1,2,3,11,12,9,10)]
head(df)
## EVTYPE FATALITIES INJURIES ACTUALPROPDAMAGE ACTUALCROPDAMAGE EVT_YEAR
## 1: TORNADO 0 15 25000 0 1950
## 2: TORNADO 0 0 2500 0 1950
## 3: TORNADO 0 2 25000 0 1951
## 4: TORNADO 0 2 2500 0 1951
## 5: TORNADO 0 2 2500 0 1951
## 6: TORNADO 0 6 2500 0 1951
## EVT_MTH
## 1: 4
## 2: 4
## 3: 2
## 4: 6
## 5: 11
## 6: 11
Quick exploratory analysis of EVTYPE
From NOAA Documentation *(page 6, Table 2.1.1) there are supposed to be 48 official EVTYPEs in the data set.
EventTypes <- c("ASTRONOMICAL LOW TIDE","AVALANCHE","BLIZZARD","COASTAL FLOOD",
"COLD/WIND CHILL","DEBRIS FLOW","DENSE FOG","DENSE SMOKE","DROUGHT",
"DUST DEVIL","DUST STORM","EXCESSIVE HEAT","EXTREME COLD/WIND CHILL",
"FLASH FLOOD","FLOOD","FREEZING FOG","FROST/FREEZE","FUNNEL CLOUD","HAIL",
"HEAT","HEAVY RAIN","HEAVY SNOW","HIGH SURF","HIGH WIND","HURRICANE (TYPHOON)",
"ICE STORM","LAKE-EFFECT SNOW","LAKESHORE FLOOD","LIGHTNING","MARINE HAIL",
"MARINE HIGH WIND","MARINE STRONG WIND","MARINE THUNDERSTORM WIND",
"RIP CURRENT","SEICHE","SLEET","STORM SURGE/TIDE","STRONG WIND",
"THUNDERSTORM WIND","TORNADO","TROPICAL DEPRESSION","TROPICAL STORM","TSUNAMI",
"VOLCANIC ASH","WATERSPOUT","WILDFIRE","WINTER STORM","WINTER WEATHER")
However, a quick query shows there are many more:
length(unique(df$EVTYPE))
## [1] 985
Further investigation revealed several issues with entries in the df$EVTYPE field; each will be addressed as appropriate.
Cleaning up EVTYPE field
First we’ll try and match each df$EVTYPE entry to one of the 48 official event types EventTypes
#Make sure all values for matching in df are uppercase
df$EVTYPE <- toupper(df$EVTYPE)
#Create an empty dataframe 'DummyEvents' with Columns from official Storm Data Event Table
DummyEvents <- data.frame(matrix(0, nrow = nrow(df),ncol = length(EventTypes)))
colnames(DummyEvents) <- gsub(" ", "_", EventTypes, fixed = TRUE)
colnames(DummyEvents) <- gsub("/", "_", colnames(DummyEvents), fixed = TRUE)
colnames(DummyEvents) <- gsub("(", "", colnames(DummyEvents), fixed = TRUE)
colnames(DummyEvents) <- gsub(")", "", colnames(DummyEvents), fixed = TRUE)
#Create a dataframe for keeping track of the count for each event type
InputEvents <- data.frame(matrix(0, nrow = length(EventTypes),ncol = 2))
colnames(InputEvents) <- c("EVTYPE","COUNT")
#Find records that match official event types using grepl
for(i in 1:length(EventTypes)){
rows <- which(rowSums(`dim<-`(grepl(EventTypes[i], as.matrix(df), fixed=TRUE), dim(df))) > 0)
DummyEvents[rows,i] <- 1
InputEvents$EVTYPE[i] <- EventTypes[i]
InputEvents$COUNT[i] <- sum(DummyEvents[,i])
}
Of the 902297 events in the initial dateset, this process matched 7.3007610^{5}, or 0.8091305 % to one of the 48 official types.
Next, we’ll address the 0.1908695% that weren’t initially matched:
| Issue | Approach |
|---|---|
| Abbreviations or alternate names are used for some events (e.g. TSTM WIND instead of THUNDERSTORM WIND) | Map abbreviated or alternate names when clear and appropriate |
| Some events have sub-categories (e.g. there are 46 EVTYPEs that include “HAIL”) | Roll-up sub-category names when clear and appropriate |
| Some events could be mapped into more than one fo the 48 official event types (e.g. THUNDERSTORM WINDS HAIL) | Since we can’t determine which caused damage, drop from analysis |
| Some events do not map cleanly into any of the 48 official event types (e.g. BEACH EROSIN) | Drop from analysis |
| Some entries contain typos (e.g. AVALANCHE vs.AVALANCE) | Capture what we can with simple techniques; a more complete approach would be to create a mapping for each of the 898 EVTYPEs |
NewEventTypes <- c("TSTM WIND","MARINE TSTM WIND", "LANDSLIDE", "HURRICANE", "TYPHOON")
NewDummyEvents <- data.frame(matrix(0, nrow = nrow(df),ncol = length(NewEventTypes)))
colnames(NewDummyEvents) <- gsub(" ", "_", NewEventTypes, fixed = TRUE)
NewInputEvents <- data.frame(matrix(0, nrow = length(NewEventTypes),ncol = 2))
colnames(NewInputEvents) <- c("EVTYPE","COUNT")
#Find records that contain new event types
for(i in 1:length(NewEventTypes)){
rows <- which(rowSums(`dim<-`(grepl(NewEventTypes[i], as.matrix(df), fixed=TRUE), dim(df))) > 0)
NewDummyEvents[rows,i] <- 1
NewInputEvents$EVTYPE[i] <- NewEventTypes[i]
NewInputEvents$COUNT[i] <- sum(NewDummyEvents[,i])
}
This results in 2.3440510^{5} new matched entries, but contains some double counting of events initially captured. For instance, FLOOD is captured both as its own event but also counted in FLASH FLOOD. In order to combine these additional matches with the initial matches, we need to dedupe.
#Dedupe instances
DummyEvents$HURRICANE_TYPHOON <- DummyEvents$HURRICANE_TYPHOON +
NewDummyEvents$HURRICANE +
NewDummyEvents$TYPHOON
DummyEvents$THUNDERSTORM_WIND <- DummyEvents$THUNDERSTORM_WIND -
DummyEvents$MARINE_THUNDERSTORM_WIND +
NewDummyEvents$TSTM_WIND -
NewDummyEvents$MARINE_TSTM_WIND
DummyEvents$FLOOD <- DummyEvents$FLOOD -
DummyEvents$FLASH_FLOOD -
DummyEvents$COASTAL_FLOOD -
DummyEvents$LAKESHORE_FLOOD
DummyEvents$HEAT <- DummyEvents$HEAT -
DummyEvents$EXCESSIVE_HEAT
DummyEvents$HIGH_WIND <- DummyEvents$HIGH_WIND -
DummyEvents$MARINE_HIGH_WIND
DummyEvents$STRONG_WIND <- DummyEvents$STRONG_WIND -
DummyEvents$MARINE_STRONG_WIND
DummyEvents$HAIL <- DummyEvents$HAIL -
DummyEvents$MARINE_HAIL
DummyEvents$COLD_WIND_CHILL <- DummyEvents$COLD_WIND_CHILL -
DummyEvents$EXTREME_COLD_WIND_CHILL
DummyEvents$DEBRIS_FLOW <- DummyEvents$DEBRIS_FLOW +
NewDummyEvents$LANDSLIDE
#Combine into single column of final Events; note which rows were not matched
DummyEvents$NOTMATCHED <- 0
Event <- max.col(DummyEvents, ties.method = "last" )
Event <- colnames(DummyEvents)[Event]
Finally, we’re ready to create our tidy dataframes for analysis:
dftidy <- cbind(Event,df)
dftidy <- dftidy[,c(1,3,4,5,6)]
dftidy <- dftidy[!dftidy$Event == "NOTMATCHED", ] # Remove rows that were unable to be matched to one of the 48 official event types
Aggreate the tidydf by Event to get the sums for fatalities, injuries, property damage, and crop damage:
dfsums <- aggregate(.~Event, dftidy, sum)
colnames(dfsums) <- c("Event_Type","FATALITIES","INJURIES","PROPDAMAGE","CROPDAMAGE")
dfsums
## Event_Type FATALITIES INJURIES PROPDAMAGE CROPDAMAGE
## 1 ASTRONOMICAL_LOW_TIDE 0 0 320000 0
## 2 AVALANCHE 224 170 3721800 0
## 3 BLIZZARD 101 805 659313950 112060000
## 4 COASTAL_FLOOD 6 7 412616060 56000
## 5 COLD_WIND_CHILL 95 12 1990000 600000
## 6 DEBRIS_FLOW 39 53 324701000 20017000
## 7 DENSE_FOG 18 342 9674000 0
## 8 DENSE_SMOKE 0 0 100000 0
## 9 DROUGHT 0 4 1046106000 13972566000
## 10 DUST_DEVIL 2 43 718630 0
## 11 DUST_STORM 22 440 5549000 3100000
## 12 EXCESSIVE_HEAT 1922 6525 7753700 492407780
## 13 EXTREME_COLD_WIND_CHILL 125 24 8648000 50000
## 14 FLASH_FLOOD 1035 1800 16906899203 1532197150
## 15 FLOOD 483 6795 150180691730 10847452950
## 16 FREEZING_FOG 0 0 2182000 0
## 17 FROST_FREEZE 0 0 10480000 1094186000
## 18 FUNNEL_CLOUD 0 3 194600 0
## 19 HAIL 15 1371 15974567877 3046937800
## 20 HEAT 1216 2699 12572050 412061500
## 21 HEAVY_RAIN 99 255 3222510142 794252800
## 22 HEAVY_SNOW 127 1034 949827157 134653100
## 23 HIGH_SURF 146 204 113525000 1500000
## 24 HIGH_WIND 295 1507 6005003745 686821900
## 25 HURRICANE_TYPHOON 135 1333 85356410010 5516117800
## 26 ICE_STORM 89 1992 3945528310 5022113500
## 27 LAKE-EFFECT_SNOW 0 0 40115000 0
## 28 LAKESHORE_FLOOD 0 0 7540000 0
## 29 LIGHTNING 817 5232 938723917 12092090
## 30 MARINE_HAIL 0 0 4000 0
## 31 MARINE_HIGH_WIND 1 1 1297010 0
## 32 MARINE_STRONG_WIND 14 22 418330 0
## 33 MARINE_THUNDERSTORM_WIND 10 26 436400 50000
## 34 RIP_CURRENT 577 529 163000 0
## 35 SEICHE 0 0 980000 0
## 36 SLEET 2 0 2000000 0
## 37 STORM_SURGE_TIDE 11 5 4641188000 850000
## 38 STRONG_WIND 111 301 182874240 69953500
## 39 THUNDERSTORM_WIND 709 9459 9760798222 1253398700
## 40 TORNADO 5658 91364 58541934147 417462960
## 41 TROPICAL_DEPRESSION 0 0 1737000 0
## 42 TROPICAL_STORM 66 383 7714390550 694896000
## 43 TSUNAMI 33 129 144062000 20000
## 44 VOLCANIC_ASH 0 0 500000 0
## 45 WATERSPOUT 6 72 60730700 0
## 46 WILDFIRE 75 911 4865614000 295972800
## 47 WINTER_STORM 217 1353 6749997260 32444000
## 48 WINTER_WEATHER 61 538 27298000 15000000
Impact on Population Health
In order to determine which events had the greatest impact on population health, we combine injuries and fatalities and sort. The data shows that tornados have by far the greatest impact on population health.
dfhealth <- dfsums[,c(1,2,3)]
dfhealth <- dfhealth[order(dfsums$FATALITIES + dfsums$INJURIES, decreasing = TRUE),]
colnames(dfhealth) <- c("Event_Type","Total_Fatalities","Total_Injuries")
dfhealth.m <- melt(head(dfhealth,10),id.vars = "Event_Type")
ggplot(dfhealth.m, aes(x = Event_Type, y = value,fill=variable)) +
geom_bar(stat='identity') +
theme(axis.text.x = element_text(angle=65,vjust=0.6))+
labs(title="Impact of Top 10 Event Types on Population Health",
subtitle="Total Injuries & Fatalities 1950-2001",
y="Injuries + Fatalities",
x="Event Type")
Impact on Local Economies
In order to determine which events had the greatest impact on local economies, we combine total property and crop damages and sort. The data shows that floods have the greatest impact on local economies, followed by Hurricanes / Typhoons and tornados.
dfeconomic <- dfsums[,c(1,4,5)]
dfeconomic <- dfeconomic[order(dfsums$PROPDAMAGE + dfsums$CROPDAMAGE, decreasing = TRUE),]
colnames(dfeconomic) <- c("Event_Type","Total_Property_Damage","Total_Crop_Damage")
dfeconomic.m <- melt(head(dfeconomic,10),id.vars = "Event_Type")
ggplot(dfeconomic.m, aes(x = Event_Type, y = value,fill=variable)) +
geom_bar(stat='identity') +
theme(axis.text.x = element_text(angle=65,vjust=0.6))+
labs(title="Impact of Top 10 Event Types on Local Economies",
subtitle="Total Crop & Property Damage 1950-2001",
y="Total Damage ($)",
x="Event Type")