We analyse the U.S. NOAA Storm Database, looking specifically at the harm caused to both populations and property. We consider the entire data set (1950-2011). To assist with setting resource priorities, we show the top 10 harm-causing weather events in each case.
Required libraries:
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(reshape2)
We retrieve the US NOAA Storm database, which is a BZipped CSV file. We also download the supplemental PDF documentation. This analysis assumes R has the working directory set to analysis directory already (i.e. ‘.’ is the analysis directory).
get_file <- function(url,datafile) {
if (!file.exists(datafile)) {
download.file(url,datafile)
}
}
get_file('https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2','./noaa-storm-db.csv.bz2')
get_file('https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2Fpd01016005curr.pdf','./noaa-storm-data-docs.pdf')
get_file('https://d396qusza40orc.cloudfront.net/repdata%2Fpeer2_doc%2FNCDC%20Storm%20Events-FAQ%20Page.pdf','./noaa-storm-data-faq.pdf')
storm_db <- as_tibble(read.csv('./noaa-storm-db.csv.bz2'))
A quick overview of the dataset.
names(storm_db)
## [1] "STATE__" "BGN_DATE" "BGN_TIME" "TIME_ZONE" "COUNTY"
## [6] "COUNTYNAME" "STATE" "EVTYPE" "BGN_RANGE" "BGN_AZI"
## [11] "BGN_LOCATI" "END_DATE" "END_TIME" "COUNTY_END" "COUNTYENDN"
## [16] "END_RANGE" "END_AZI" "END_LOCATI" "LENGTH" "WIDTH"
## [21] "F" "MAG" "FATALITIES" "INJURIES" "PROPDMG"
## [26] "PROPDMGEXP" "CROPDMG" "CROPDMGEXP" "WFO" "STATEOFFIC"
## [31] "ZONENAMES" "LATITUDE" "LONGITUDE" "LATITUDE_E" "LONGITUDE_"
## [36] "REMARKS" "REFNUM"
head(storm_db)
## # A tibble: 6 x 37
## STATE… BGN_… BGN_… TIME… COUN… COUN… STATE EVTY… BGN_… BGN_… BGN_… END_…
## <dbl> <fct> <fct> <fct> <dbl> <fct> <fct> <fct> <dbl> <fct> <fct> <fct>
## 1 1.00 4/18… 0130 CST 97.0 MOBI… AL TORN… 0 "" "" ""
## 2 1.00 4/18… 0145 CST 3.00 BALD… AL TORN… 0 "" "" ""
## 3 1.00 2/20… 1600 CST 57.0 FAYE… AL TORN… 0 "" "" ""
## 4 1.00 6/8/… 0900 CST 89.0 MADI… AL TORN… 0 "" "" ""
## 5 1.00 11/1… 1500 CST 43.0 CULL… AL TORN… 0 "" "" ""
## 6 1.00 11/1… 2000 CST 77.0 LAUD… AL TORN… 0 "" "" ""
## # ... with 25 more variables: END_TIME <fctr>, COUNTY_END <dbl>,
## # COUNTYENDN <lgl>, END_RANGE <dbl>, END_AZI <fctr>, END_LOCATI <fctr>,
## # LENGTH <dbl>, WIDTH <dbl>, F <int>, MAG <dbl>, FATALITIES <dbl>,
## # INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <fctr>, CROPDMG <dbl>,
## # CROPDMGEXP <fctr>, WFO <fctr>, STATEOFFIC <fctr>, ZONENAMES <fctr>,
## # LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>, LONGITUDE_ <dbl>,
## # REMARKS <fctr>, REFNUM <dbl>
We can see that the EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, and CROPDMGEXP columns will be the focus of this analysis.
For the first question, the impact on public health, we examine the FATALITIES and INJURIES columns, grouped by EVTYPE. This will tell us the most impactful weather event types, and for interest we keep both the fatalities and the injuries, as well as the sum.
health <- storm_db %>%
select(EVTYPE, FATALITIES, INJURIES) %>%
group_by(EVTYPE) %>%
mutate(IMPACT = FATALITIES + INJURIES) %>%
summarize_all(sum) %>%
arrange(desc(IMPACT))
For the second question, the economic impact, we first need to construct a cost estimate. We note there are four relevant fields - PROPDMG, PROPDMGEXP, CROPDMG, & CROPDMGEXP. The documents do not show the usage of these columns, but we assume these form pairs of the form $PROPDMG x10 ^ $PROPDMGEXP.
Looking at PROPDMGEXP, we note several types of usage.
summary(storm_db$PROPDMGEXP)
## - ? + 0 1 2 3 4 5
## 465934 1 8 5 216 25 13 4 4 28
## 6 7 8 B h H K m M
## 4 5 1 40 1 6 424665 7 11330
There are numbers, letters, and other characters here. With no clear index of what these mean, we make the following assumptions:
To do this, we use mutate with a case statement to convert the character fields to the equivalent numerical exponent, and then multiply the raw values by the exponent to get a final value of the damage.
raw_costs <- storm_db %>% select(EVTYPE, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
mutate(PROPDMGEXP = case_when(PROPDMGEXP == 'K' ~ '3', #thousands
PROPDMGEXP == 'm' ~ '6', # millions
PROPDMGEXP == 'M' ~ '6',
PROPDMGEXP == 'B' ~ '9', #billions
PROPDMGEXP == '' ~ '0', #empty value
TRUE ~ as.character(PROPDMGEXP))) %>%
mutate(CROPDMGEXP = case_when(CROPDMGEXP == 'k' ~ '3', #thousands
CROPDMGEXP == 'K' ~ '3',
CROPDMGEXP == 'm' ~ '6', # millions
CROPDMGEXP == 'M' ~ '6',
CROPDMGEXP == 'B' ~ '9', #billions
CROPDMGEXP == '' ~ '0', #empty value
TRUE ~ as.character(CROPDMGEXP))) %>%
filter(PROPDMGEXP %in% as.character(seq(0,9))) %>%
filter(CROPDMGEXP %in% as.character(seq(0,9)))
costs <- raw_costs %>%
mutate(PROPCOST = PROPDMG * 10 ^ as.numeric(PROPDMGEXP)) %>%
mutate(CROPCOST = CROPDMG * 10 ^ as.numeric(PROPDMGEXP)) %>%
mutate(COST = PROPCOST + CROPCOST) %>%
select(EVTYPE, PROPCOST, CROPCOST, COST) %>%
group_by(EVTYPE) %>%
summarize_all(sum) %>%
arrange(desc(COST))
“Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?”
To answer this, we’ll take the top 10 event types by total of fatalities and injuries, and then plot the resulting data.
health_10 <- head(health, n = 10) %>%
select(EVTYPE,FATALITIES,INJURIES) %>%
melt(id.vars=c("EVTYPE"))
ggplot(health_10, aes(x = reorder(EVTYPE, -value), y = value, fill = variable)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_discrete(name='') +
labs(title = 'Top ten weather events by harm to U.S. population',
x = 'Event Type', y = 'Total Injuries & Fatalities')
We can clearly see that tornadoes represent the vast majority of both the injuries and fatalities to the U.S. population. Excluding that, we see largly similar values for ‘Excessive Heat’, ‘Thunderstorm Wind’, ‘Flood’, and ‘Lightning’.
“Across the United States, which types of events have the greatest economic consequences?”
Again, we’ll take the top 10 event types by total of property damages and crop damages, and then plot the resulting data. We’ll divide the damage totals by 1 billion so that the y-axis is readable.
costs_10 <- head(costs, n = 10) %>%
select(EVTYPE,PROPCOST,CROPCOST) %>%
mutate(PROPCOST = PROPCOST/1000000000, CROPCOST = CROPCOST/1000000000) %>%
melt(id.vars=c("EVTYPE"))
ggplot(costs_10, aes(x = reorder(EVTYPE, -value), y = value, fill = variable)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_fill_discrete(name = '', labels = c('Property', 'Crops')) +
labs(title = 'Top ten weather events by damage to U.S. Economy',
x = 'Event Type', y = 'Damages (billions of dollars)')
The largest damage by far comes from the ‘Hurricane’ and ‘Hurricane/Typhoon’ category, and we also see that this is driven strongly by damage to crops. Looking more closely at property damage, we see that tornadoes and flood-related events (flash floods, storm surges, etc) are also noteworthy.