This document analyses which types of weather events in the United States are (a) most harmful to the health of the general population and (b) most economically damaging. The analysis is based on publicly available data from the NOAA Storm Database.
We begin data processing by downloading the publicly available storm database of the U.S. National Oceanic and Atmospheric Administration (NOAA). The data set contains events from 1950 through November 2011 and forms the base data for our analysis.
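The download step itself is not shown in this report. A minimal sketch of how it might be scripted follows, assuming the reader supplies the source URL; the helper name `fetch_if_missing` and its arguments are hypothetical and not part of the original analysis.

```r
# Hypothetical helper (an assumption, not from the original analysis):
# download a file only when no local copy exists yet.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    # mode = "wb" keeps binary or compressed files intact on Windows.
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

# Usage, with `storm_url` standing in for the address of the data file:
# fetch_if_missing(storm_url, "repdata-data-StormData.csv")
```

Skipping the download when the file is already present keeps repeated runs of the report fast and avoids fetching a file of roughly half a gigabyte more than once.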
We then read in the data:
library(data.table)
stormdata <- fread("repdata-data-StormData.csv")
## Read 902297 rows and 37 (of 37) columns from 0.523 GB file in 00:00:09
## Warning in fread("repdata-data-StormData.csv"): Read less rows (902297)
## than were allocated (967216). Run again with verbose=TRUE and please
## report.
We now confirm that all 902,297 rows are present and inspect a few columns of the first rows:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
##
## between, last
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
dim(stormdata)
## [1] 902297 37
select(head(stormdata), BGN_DATE, EVTYPE, INJURIES, FATALITIES)
## BGN_DATE EVTYPE INJURIES FATALITIES
## 1: 4/18/1950 0:00:00 TORNADO 15 0
## 2: 4/18/1950 0:00:00 TORNADO 0 0
## 3: 2/20/1951 0:00:00 TORNADO 2 0
## 4: 6/8/1951 0:00:00 TORNADO 2 0
## 5: 11/15/1951 0:00:00 TORNADO 2 0
## 6: 11/15/1951 0:00:00 TORNADO 6 0
This is a very wide data set (37 columns), so we select only the columns relevant to our study.
impacts <- select(stormdata, BGN_DATE, EVTYPE, INJURIES, FATALITIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP)
head(impacts)
## BGN_DATE EVTYPE INJURIES FATALITIES PROPDMG PROPDMGEXP
## 1: 4/18/1950 0:00:00 TORNADO 15 0 25.0 K
## 2: 4/18/1950 0:00:00 TORNADO 0 0 2.5 K
## 3: 2/20/1951 0:00:00 TORNADO 2 0 25.0 K
## 4: 6/8/1951 0:00:00 TORNADO 2 0 2.5 K
## 5: 11/15/1951 0:00:00 TORNADO 2 0 2.5 K
## 6: 11/15/1951 0:00:00 TORNADO 6 0 2.5 K
## CROPDMG CROPDMGEXP
## 1: 0
## 2: 0
## 3: 0
## 4: 0
## 5: 0
## 6: 0
We confirm that the variables we will use in our study do not have a high proportion of missing values, which could skew our analysis.
mean(is.na(impacts$PROPDMG))
## [1] 0
mean(is.na(impacts$CROPDMG))
## [1] 0
mean(is.na(impacts$INJURIES))
## [1] 0
mean(is.na(impacts$FATALITIES))
## [1] 0
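The four checks above can also be run in a single call. The sketch below uses a small made-up data frame as a stand-in for `impacts` (the values, including the deliberate `NA`, are illustrative assumptions) and computes the share of missing values per column with `colMeans()`.

```r
# Toy stand-in for the `impacts` table; the values are made up for illustration.
toy <- data.frame(
  PROPDMG    = c(25, 2.5, NA),
  CROPDMG    = c(0, 0, 0),
  INJURIES   = c(15, 0, 2),
  FATALITIES = c(0, 0, 0)
)

# is.na() yields a logical matrix; colMeans() gives the share of NAs per column.
na_share <- colMeans(is.na(toy))
print(na_share)
```

A share of 0 in every column, as in the real data, means no rows need to be dropped or imputed before aggregating.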
To determine the events that do the most harm to population health, we look into the subset of events that resulted in injuries or fatalities.
health <- filter(impacts, INJURIES > 0 | FATALITIES > 0)
dim(health)
## [1] 21929 8
To identify the most harmful events, we total the injuries and fatalities for each event type and sort the event types by total casualties in descending order.
health_by_event <-
  group_by(health, EVTYPE) %>%
  summarize(injuries = sum(INJURIES),
            fatalities = sum(FATALITIES)) %>%
  mutate(casualties = injuries + fatalities)
head(arrange(health_by_event, desc(casualties)))
## EVTYPE injuries fatalities casualties
## 1 TORNADO 91346 5633 96979
## 2 EXCESSIVE HEAT 6525 1903 8428
## 3 TSTM WIND 6957 504 7461
## 4 FLOOD 6789 470 7259
## 5 LIGHTNING 5230 816 6046
## 6 HEAT 2100 937 3037
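We now plot the four event types with the most casualties.
library(ggplot2)
# head() selects the first four rows regardless of the table class.
h <- head(arrange(health_by_event, desc(casualties)), 4)
# geom_col() draws bars whose heights are the values in the data.
ggplot(h, aes(x = EVTYPE, y = casualties)) + geom_col()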
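By default the bars in such a chart appear in alphabetical order of `EVTYPE`, not in order of impact. One possible way to sort them, sketched here on a toy copy of the top-four totals reported above (the name `h_toy` is an assumption for illustration), is to reorder the factor levels with `reorder()` before plotting.

```r
# Toy copy of the top-four casualty totals reported above.
h_toy <- data.frame(
  EVTYPE     = c("TORNADO", "EXCESSIVE HEAT", "TSTM WIND", "FLOOD"),
  casualties = c(96979, 8428, 7461, 7259),
  stringsAsFactors = FALSE
)

# reorder() returns a factor whose levels are sorted by the second argument;
# negating casualties puts the largest total first.
h_toy$EVTYPE <- reorder(h_toy$EVTYPE, -h_toy$casualties)
print(levels(h_toy$EVTYPE))

# Build the chart only if ggplot2 is installed; print(p) renders it.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(h_toy, ggplot2::aes(x = EVTYPE, y = casualties)) +
    ggplot2::geom_col()
}
```

Because ggplot2 places discrete values on the axis in factor-level order, the bars then run from the most to the least harmful event type.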
Tornadoes have harmed population health far more than any other event type: their 96,979 casualties are more than eleven times the 8,428 attributed to the next most harmful event type, excessive heat.
For the economic impact we pick the events that caused monetary damage of some kind:
economy <- filter(impacts, PROPDMG > 0 | CROPDMG > 0)
dim(economy)
## [1] 245031 8
To calculate the total economic impact of an event correctly, we convert the exponent columns (PROPDMGEXP and CROPDMGEXP) into numeric multipliers and apply them to the recorded damage figures. For example, a PROPDMG of 25.0 with a PROPDMGEXP of K represents 25,000 dollars.
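conv.magnitude <- function(mag) {
  # Map each exponent code to its multiplier: K = 1e3, M = 1e6, B = 1e9.
  # Any other value (blank, digit, symbol) falls back to a multiplier of 1.
  sapply(mag, function(magnitude) {
    r <- 1
    if (grepl("k", magnitude, ignore.case = TRUE)) {
      r <- 1000
    } else if (grepl("m", magnitude, ignore.case = TRUE)) {
      r <- 1000000
    } else if (grepl("b", magnitude, ignore.case = TRUE)) {
      r <- 1000000000
    }
    r
  })
}
# Attach the numeric multipliers for property and crop damage.
economy <- mutate(economy,
                  prop_mag = conv.magnitude(PROPDMGEXP),
                  crop_mag = conv.magnitude(CROPDMGEXP))
# Total damage per event is the sum of scaled property and crop damage.
economy <- mutate(economy,
                  damage = (PROPDMG * prop_mag) + (CROPDMG * crop_mag))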
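The same conversion can be sketched as a vectorised lookup, which avoids calling `grepl()` once per row. The helper `exp_to_multiplier` and the toy data are assumptions for illustration, not the original method; the sketch also assumes each exponent code is a single character, so it matches exact codes and not substrings.

```r
# Hypothetical helper (not from the original analysis): vectorised lookup
# of damage multipliers. Assumes each code is a single character.
exp_to_multiplier <- function(mag) {
  lookup <- c(K = 1e3, M = 1e6, B = 1e9)
  # Indexing by name returns NA for codes that are not in the lookup.
  mult <- unname(lookup[toupper(mag)])
  # Blank or unrecognised codes count as a multiplier of 1.
  mult[is.na(mult)] <- 1
  mult
}

# Toy rows: the first two mirror the 25.0 K and 2.5 K property-damage
# values shown earlier; the third is made up.
toy <- data.frame(
  PROPDMG    = c(25.0, 2.5, 1.2),
  PROPDMGEXP = c("K", "K", "m"),
  stringsAsFactors = FALSE
)
toy$prop_dollars <- toy$PROPDMG * exp_to_multiplier(toy$PROPDMGEXP)
print(toy)
```

On a table with several hundred thousand rows, a single vectorised lookup of this kind would be expected to run noticeably faster than a per-element `sapply()`.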
We can then total up the economic damage caused by each event type.
economy_by_event <-
  group_by(economy, EVTYPE) %>%
  summarize(total_damage = sum(damage)) %>%
  arrange(desc(total_damage))
head(economy_by_event)
## EVTYPE total_damage
## 1 FLOOD 150319678257
## 2 HURRICANE/TYPHOON 71913712800
## 3 TORNADO 57352114049
## 4 STORM SURGE 43323541000
## 5 HAIL 18758221521
## 6 FLASH FLOOD 17562129167
We can now plot the four most economically damaging event types.
# head() selects the first four rows regardless of the table class.
e <- head(economy_by_event, 4)
ggplot(e, aes(x = EVTYPE, y = total_damage)) + geom_col()
Here we see that floods cause the most damage, roughly twice as much as the next-worst event type, hurricanes and typhoons.
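The raw totals are easier to compare when expressed in billions of dollars. The short sketch below copies the four largest totals from the summary table above and derives both the rounded figures and the flood-to-hurricane ratio; the variable names are chosen for illustration.

```r
# Totals copied from the summary table above (raw dollar amounts).
total_damage <- c(
  "FLOOD"             = 150319678257,
  "HURRICANE/TYPHOON" = 71913712800,
  "TORNADO"           = 57352114049,
  "STORM SURGE"       = 43323541000
)

# Express each total in billions of dollars, rounded to one decimal place.
damage_bn <- round(total_damage / 1e9, 1)
print(damage_bn)

# Ratio of the largest total to the runner-up.
flood_ratio <- total_damage[["FLOOD"]] / total_damage[["HURRICANE/TYPHOON"]]
print(round(flood_ratio, 2))
```

The ratio comes out at about 2.09, which is consistent with the "roughly twice" reading of the chart.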