In this report, we will try to answer the following questions about weather events in the United States:
1. Which types of events are most harmful with respect to population health?
2. Which types of events have the greatest economic consequences?
To answer these questions, we will explore the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, which includes data about health and economic consequences for most weather events between 1950 and 2011.
This document assumes you have already downloaded the appropriate file and unzipped it into a directory named ‘data’, which is a subdirectory of your current working directory. The following code loads the raw CSV file into a data frame and shows its dimensions:
data <- read.csv("data/repdata-data-StormData.csv")
dim(data)
## [1] 902297 37
There are 902297 observations of 37 variables. We first look at the variable names to see which ones are relevant to our study.
names(data)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
For this study, we are only interested in variables related to the event type, population health, and economic consequences. After looking at the documentation, we determine that the event type is in the “EVTYPE” variable (column 8), health data is in the “FATALITIES” and “INJURIES” variables (columns 23 and 24), and economic consequences are in the four variables containing “DMG” (columns 25-28). We create a data frame with only these variables for our analysis and view the summary statistics.
storm <- data[,c(8,23:28)]
summary(storm)
## EVTYPE FATALITIES INJURIES
## HAIL :288661 Min. : 0.0000 Min. : 0.0000
## TSTM WIND :219940 1st Qu.: 0.0000 1st Qu.: 0.0000
## THUNDERSTORM WIND: 82563 Median : 0.0000 Median : 0.0000
## TORNADO : 60652 Mean : 0.0168 Mean : 0.1557
## FLASH FLOOD : 54277 3rd Qu.: 0.0000 3rd Qu.: 0.0000
## FLOOD : 25326 Max. :583.0000 Max. :1700.0000
## (Other) :170878
## PROPDMG PROPDMGEXP CROPDMG CROPDMGEXP
## Min. : 0.00 :465934 Min. : 0.000 :618413
## 1st Qu.: 0.00 K :424665 1st Qu.: 0.000 K :281832
## Median : 0.00 M : 11330 Median : 0.000 M : 1994
## Mean : 12.06 0 : 216 Mean : 1.527 k : 21
## 3rd Qu.: 0.50 B : 40 3rd Qu.: 0.000 0 : 19
## Max. :5000.00 5 : 28 Max. :990.000 B : 9
## (Other): 84 (Other): 9
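Selecting columns by position works, but it silently picks the wrong columns if the file layout ever changes. As a minimal sketch on a made-up two-row data frame (the name `toy` and its values are illustrative, not taken from the storm database), the same subset can be taken by name, with a check that every expected column is present:

```r
# Toy stand-in for the raw data; the real table has 37 columns.
toy <- data.frame(
  STATE      = c("AL", "TX"),
  EVTYPE     = c("TORNADO", "HAIL"),
  FATALITIES = c(0, 1),
  INJURIES   = c(15, 0),
  PROPDMG    = c(25, 2.5),
  PROPDMGEXP = c("K", "M"),
  CROPDMG    = c(0, 0),
  CROPDMGEXP = c("", ""),
  stringsAsFactors = TRUE  # factors make summary() print per-level counts
)

# Columns of interest, listed by name instead of by position
keep <- c("EVTYPE", "FATALITIES", "INJURIES",
          "PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")

# Fail early if any expected column is missing
stopifnot(all(keep %in% names(toy)))

toy_storm <- toy[, keep]
names(toy_storm)
```

The explicit `stringsAsFactors = TRUE` matters on R 4.0.0 and later, where text columns are no longer converted to factors by default and `summary()` would not show counts per level.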
The documentation suggests that the “PROPDMGEXP” and “CROPDMGEXP” variables should only have the values ‘’, ‘K’, ‘M’, or ‘B’ to indicate what factor the damage should be multiplied by. Our summary shows that a small number of rows do not have one of these values (328 values of PROPDMGEXP and 49 values of CROPDMGEXP, so at most 377 of the 902297 rows), so we will filter out those rows.
library(dplyr)
exp <- c('','K','M','B')
df <- storm %>% filter(PROPDMGEXP %in% exp, CROPDMGEXP %in% exp)
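Before discarding rows it can help to count how many are affected. The following is a minimal base-R sketch on made-up values (`toy`, `valid_exp`, and the other names are illustrative). Note that the check is case-sensitive, so a lowercase ‘k’, which the summary lists 21 times in CROPDMGEXP, is treated as invalid just like the digits:

```r
valid_exp <- c("", "K", "M", "B")

# Made-up exponent codes, including some invalid ones ("5", "0", "k")
toy <- data.frame(
  PROPDMGEXP = c("K", "M", "", "5", "B", "k"),
  CROPDMGEXP = c("", "K", "0", "", "M", ""),
  stringsAsFactors = FALSE
)

# TRUE only where both codes are among the documented values
ok <- toy$PROPDMGEXP %in% valid_exp & toy$CROPDMGEXP %in% valid_exp

n_dropped     <- sum(!ok)   # number of rows that would be removed
share_dropped <- mean(!ok)  # the same as a fraction of all rows

toy_clean <- toy[ok, ]
```

On the real data, the same two lines computing `n_dropped` and `share_dropped` would quantify the “small number” of discarded rows.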
We now do some processing to multiply the damage variables by their respective ‘EXP’ variables to create 2 new columns indicating the total amount of property damage and crop damage.
First create 2 tables mapping the value of each “EXP” variable to the appropriate multiplication factor.
propmult <- data.frame(PROPDMGEXP = exp, pmult = c(1, 1e3, 1e6, 1e9))
cropmult <- data.frame(CROPDMGEXP = exp, cmult = c(1, 1e3, 1e6, 1e9))
propmult
## PROPDMGEXP pmult
## 1 1e+00
## 2 K 1e+03
## 3 M 1e+06
## 4 B 1e+09
Next merge these tables with our main data frame to add the “pmult” and “cmult” columns.
df2 <- merge(df, propmult)
df3 <- merge(df2, cropmult)
Finally, create the PROPTOTAL and CROPTOTAL columns (total amount of property/crop damage) by multiplying PROPDMG and CROPDMG by pmult and cmult, respectively.
df3$PROPTOTAL <- df3$PROPDMG * df3$pmult
df3$CROPTOTAL <- df3$CROPDMG * df3$cmult
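As an alternative to the two merges, the multiplier can be looked up directly with `match()`. This is only a sketch on made-up values, not the approach used in this report; the names `codes`, `factors`, and `toy` are illustrative. One difference is that `merge()` reorders rows by the key column by default, while a lookup keeps the original row order:

```r
codes   <- c("", "K", "M", "B")
factors <- c(1, 1e3, 1e6, 1e9)

# Made-up damage figures with their exponent codes
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 10, 1.5),
  PROPDMGEXP = c("K", "M", "", "B"),
  stringsAsFactors = FALSE
)

# match() returns the position of each code in `codes`;
# indexing `factors` by that position yields the multiplier.
toy$pmult     <- factors[match(toy$PROPDMGEXP, codes)]
toy$PROPTOTAL <- toy$PROPDMG * toy$pmult
toy
```

`match()` is used here instead of a named vector because indexing a vector by the empty string never matches a name in R, so the ‘’ code could not be looked up that way.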
We can now get rid of the PROPDMG, CROPDMG, PROPDMGEXP, CROPDMGEXP, pmult, and cmult columns. This leaves 5 columns indicating the event type, fatalities, injuries, total property damage, and total crop damage.
final <- df3 %>% select(EVTYPE, FATALITIES, INJURIES, PROPTOTAL, CROPTOTAL)
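In our final table, there are 985 different event types, which is too many for a meaningful visual comparison, so we add one more column (EVENT) that places each event into one of 12 categories (11 named categories plus ‘other’) using regular expressions. The assignments run in sequence, so an event type that matches several patterns ends up in the last category whose pattern it matches.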
final$EVENT <- 'other'
final[grep('HEAT|WARM|DRY|DROUGHT|dry|DUST', final$EVTYPE), 'EVENT'] <- 'heat/drought'
final[grep('WIND|Wind|wind', final$EVTYPE), 'EVENT'] <- 'wind'
final[grep('RAIN', final$EVTYPE), 'EVENT'] <- 'rain'
final[grep('TSTM|LIGHTNING|THUNDERSTORM|STORM SURGE', final$EVTYPE),
      'EVENT'] <- 'thunderstorm'
final[grep('CHILL|COLD|LOW TEMPERATURE|Cold|EXPOS|Expos|HYPO|FROST|FREEZE',
           final$EVTYPE), 'EVENT'] <- 'cold'
final[grep('WINTER|SNOW|ICE|FREEZING|BLIZZ|SLEET|ICY|MIX|WINTRY|snow|Snow',
           final$EVTYPE), 'EVENT'] <- 'snow/ice'
final[grep('HAIL', final$EVTYPE), 'EVENT'] <- 'hail'
final[grep('TORNADO|WATERSPOUT|FUNNEL', final$EVTYPE), 'EVENT'] <- 'tornado'
# paste0() joins the two halves so that no line break ends up inside the pattern
final[grep(paste0('CURRENT|SURF|MARINE|HIGH WAVES|SEAS|SWELLS|Surf|TROPICAL|COASTAL|HURRICANE|',
                  'surf|Marine|TIDE|TSUNAMI'), final$EVTYPE), 'EVENT'] <- 'tropical'
final[grep('FLOOD|FLD|RISING|HIGH WATER|Flood', final$EVTYPE), 'EVENT'] <- 'flood'
final[grep('FIRE', final$EVTYPE), 'EVENT'] <- 'wildfire'
table(final$EVENT)
##
## cold flood hail heat/drought other
## 4191 86096 289896 6111 3507
## rain snow/ice thunderstorm tornado tropical
## 11829 42705 339465 71505 16253
## wildfire wind
## 4240 26123
The table above shows that nearly all the events fit into these categories, with only 3507 of the 901921 remaining events (about 0.4%) in the ‘other’ category.
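The sequential assignment can be illustrated on a handful of made-up event names. The sketch below is a simplification, not the code used for this report: it uses a shortened, hypothetical rule list and `ignore.case = TRUE` in place of spelling out upper- and lower-case variants, which can match more event types than the case-sensitive patterns and would therefore change the counts on the real data.

```r
# Made-up event names for illustration
evtype <- c("THUNDERSTORM WIND", "High Wind", "Coastal Flood",
            "HEAVY RAIN", "VOLCANIC ASH")

# Hypothetical, shortened rule list; later entries override earlier ones
rules <- list(
  wind         = "wind",
  rain         = "rain",
  thunderstorm = "tstm|lightning|thunderstorm",
  tropical     = "coastal|surf|hurricane",
  flood        = "flood|fld"
)

event <- rep("other", length(evtype))
for (category in names(rules)) {
  hit <- grepl(rules[[category]], evtype, ignore.case = TRUE)
  event[hit] <- category  # overwrites any earlier assignment
}

data.frame(evtype, event)
```

Here “THUNDERSTORM WIND” is first labelled as wind and then relabelled as thunderstorm, and “Coastal Flood” ends up as flood rather than tropical, which mirrors how the order of the statements decides ties.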
To answer the first question, which types of events are most harmful to population health, we first compute the total number of fatalities for each event category, sort the totals in decreasing order, and create a bar plot of the five largest.
fat_by_event <- aggregate(FATALITIES ~ EVENT, data = final, sum)
fat_by_event <- fat_by_event[order(fat_by_event$FATALITIES, decreasing=TRUE),]
barplot(fat_by_event[1:5, 'FATALITIES'], names.arg=fat_by_event[1:5, 'EVENT'],
        ylim = c(0,6000), main = 'Total Number of Fatalities', xlab='Event Type')
The plot shows the top five causes of death among the different event categories. Tornadoes have caused nearly twice as many deaths as the second-ranked cause, which is heat/drought. Now we can do the same comparison with injuries.
inj_by_event <- aggregate(INJURIES ~ EVENT, data = final, sum)
inj_by_event <- inj_by_event[order(inj_by_event$INJURIES, decreasing=TRUE),]
barplot(inj_by_event[1:5, 'INJURIES'], names.arg=inj_by_event[1:5, 'EVENT'],
        ylim = c(0,80000), main = 'Total Number of Injuries', xlab='Event Type')
This plot reinforces our belief that tornadoes are the most dangerous weather event, causing more than six times as many injuries as the next closest category, which is thunderstorms. We would rank heat/drought as the second-most dangerous category. Although it has caused fewer injuries than thunderstorms, it has caused far more fatalities.
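Comparing fatalities and injuries is easier when both totals sit in one table. As a rough sketch on made-up numbers (the values in `toy` are illustrative and not taken from the storm database), `aggregate()` can sum several columns at once when they are wrapped in `cbind()`:

```r
# Made-up casualty counts per event record
toy <- data.frame(
  EVENT      = c("tornado", "tornado", "heat/drought",
                 "thunderstorm", "thunderstorm"),
  FATALITIES = c(3, 2, 4, 1, 0),
  INJURIES   = c(40, 10, 5, 6, 2),
  stringsAsFactors = FALSE
)

# One row per category, with both totals side by side
health <- aggregate(cbind(FATALITIES, INJURIES) ~ EVENT, data = toy, sum)

# Rank categories by fatalities, largest first
health <- health[order(health$FATALITIES, decreasing = TRUE), ]
health
```

In this toy table heat/drought ranks above thunderstorm by fatalities but below it by injuries, the same kind of trade-off discussed above.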
To answer the second question, which types of events have the greatest economic consequences, we look at the combined total of property damage and crop damage caused by each event category. Then we use a bar graph to compare the five largest totals visually.
prop_by_event <- aggregate(PROPTOTAL ~ EVENT, data = final, sum)
crop_by_event <- aggregate(CROPTOTAL ~ EVENT, data = final, sum)
total_damage <- merge(prop_by_event, crop_by_event)
total_damage$TOTAL <- (total_damage$PROPTOTAL + total_damage$CROPTOTAL) / 1e9
total_damage <- arrange(total_damage, desc(TOTAL))
barplot(total_damage[1:5, 'TOTAL'], names.arg=total_damage[1:5, 'EVENT'],
        xlab='Event Type', ylab = 'Damage in billions of dollars',
        main='Economic Damage Caused By Weather Events')
Floods are the leading cause of damage at around $180 billion, nearly twice as much as the ‘tropical’ category, which came in second.