A brief exploration of the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database, looking for the weather events most harmful to population health and to the economy. The database documentation is here.
We start by making sure that all code is shown in the report and that the needed packages are loaded.
knitr::opts_chunk$set(echo = TRUE)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(gsubfn)
## Loading required package: proto
library(readr)
library(reshape2)
The database can be loaded with a simple read_csv() call. The call is cached because the database is large and rereading it on every run would be slow.
noaa <- read_csv("repdata%2Fdata%2FStormData.csv.bz2")
## Parsed with column specification:
## cols(
##   .default = col_character(),
##   STATE__ = col_double(),
##   COUNTY = col_double(),
##   BGN_RANGE = col_double(),
##   COUNTY_END = col_double(),
##   END_RANGE = col_double(),
##   LENGTH = col_double(),
##   WIDTH = col_double(),
##   F = col_integer(),
##   MAG = col_double(),
##   FATALITIES = col_double(),
##   INJURIES = col_double(),
##   PROPDMG = col_double(),
##   CROPDMG = col_double(),
##   LATITUDE = col_double(),
##   LONGITUDE = col_double(),
##   LATITUDE_E = col_double(),
##   LONGITUDE_ = col_double(),
##   REFNUM = col_double()
## )
## See spec(...) for full column specifications.
head(noaa)
## # A tibble: 6 x 37
##   STATE__ BGN_DATE           BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE
##     <dbl> <chr>              <chr>    <chr>      <dbl> <chr>      <chr>
## 1       1 4/18/1950 0:00:00  0130     CST           97 MOBILE     AL
## 2       1 4/18/1950 0:00:00  0145     CST            3 BALDWIN    AL
## 3       1 2/20/1951 0:00:00  1600     CST           57 FAYETTE    AL
## 4       1 6/8/1951 0:00:00   0900     CST           89 MADISON    AL
## 5       1 11/15/1951 0:00:00 1500     CST           43 CULLMAN    AL
## 6       1 11/15/1951 0:00:00 2000     CST           77 LAUDERDALE AL
## # ... with 30 more variables: EVTYPE <chr>, BGN_RANGE <dbl>,
## # BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>, END_TIME <chr>,
## # COUNTY_END <dbl>, COUNTYENDN <chr>, END_RANGE <dbl>, END_AZI <chr>,
## # END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>, MAG <dbl>,
## # FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>, PROPDMGEXP <chr>,
## # CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>, STATEOFFIC <chr>,
## # ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>, LATITUDE_E <dbl>,
## # LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>
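The report does not show how the read_csv() call is cached. knitr's `cache = TRUE` chunk option is one way to do it. Another is to cache by hand, as in the minimal sketch below. The helper name `read_cached` and the temporary demo file are illustrative assumptions, not part of the original analysis, and base R's `read.csv()` is used so that the sketch runs without extra packages.

```r
# Hypothetical helper: parse a CSV once, then reuse a cached .rds copy.
read_cached <- function(csv_path, cache_path = paste0(csv_path, ".rds")) {
  if (file.exists(cache_path)) {
    # Fast path: the parsed table was saved by an earlier run.
    return(readRDS(cache_path))
  }
  # Slow path: parse the CSV, then store the parsed result for next time.
  data <- read.csv(csv_path, stringsAsFactors = FALSE)
  saveRDS(data, cache_path)
  data
}

# Small demonstration on a temporary file standing in for the real database.
demo_csv <- tempfile(fileext = ".csv")
write.csv(
  data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(2, 0)),
  demo_csv,
  row.names = FALSE
)
first_read  <- read_cached(demo_csv)  # parses the CSV and writes the cache
second_read <- read_cached(demo_csv)  # served from the .rds cache
```

The cache file has to be deleted by hand whenever the source file changes, which is the main drawback of this approach compared with knitr's built-in caching.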
To find the weather events most harmful to population health, we look for the event types with high fatality and injury counts, first in total and then on average per recorded event. Tornado-related events cause the most deaths.
pop_harm_tot <- noaa %>% group_by(EVTYPE) %>%
  summarise(tot_fal = sum(FATALITIES),
            tot_inj = sum(INJURIES)) %>%
  filter(tot_fal >= 500 | tot_inj > 2000) %>%
  arrange(tot_fal) %>%
  melt(id = "EVTYPE")
ggplot(pop_harm_tot, aes(x = EVTYPE, y = value, fill = variable)) +
  geom_col(position = "dodge") +
  theme_bw() +
  labs(x = "Event", y = "", title = "Total Harm to Population Health",
       fill = "Harm") +
  scale_fill_manual(labels = c("Fatality", "Injury"), values = c("blue", "red"))
pop_harm_mean <- noaa %>% group_by(EVTYPE) %>%
  summarise(mean_fal = mean(FATALITIES),
            mean_inj = mean(INJURIES)) %>%
  filter(mean_fal >= 10 | mean_inj >= 40) %>%
  melt(id = "EVTYPE")
ggplot(pop_harm_mean, aes(x = EVTYPE, y = value, fill = variable)) +
  geom_col(position = "dodge") +
  theme_bw() +
  labs(x = "Event", y = "", title = "Mean Harm to Population Health",
       fill = "Harm") +
  scale_fill_manual(labels = c("Fatality", "Injury"), values = c("blue", "red"))
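Totals and means can rank event types differently: a frequent event with few deaths per occurrence can lead the totals, while a rare but severe one leads the means. The sketch below shows this on a toy table. All values are made up for illustration, and base R's `aggregate()` stands in for the dplyr pipelines so that the example runs without extra packages.

```r
# Toy stand-in for the storm table (values are invented for illustration).
toy <- data.frame(
  EVTYPE     = c("TORNADO", "TORNADO", "TORNADO", "TORNADO", "HEAT", "FLOOD"),
  FATALITIES = c(3, 0, 4, 3, 9, 0),
  INJURIES   = c(10, 0, 20, 5, 3, 1),
  stringsAsFactors = FALSE
)

# Total and mean harm per event type.
totals <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = sum)
means  <- aggregate(cbind(FATALITIES, INJURIES) ~ EVTYPE, data = toy, FUN = mean)

# The event type with the highest total need not have the highest mean.
top_total <- as.character(totals$EVTYPE[which.max(totals$FATALITIES)])
top_mean  <- as.character(means$EVTYPE[which.max(means$FATALITIES)])
```

In this toy table the frequent event type leads on total fatalities while the single severe record leads on mean fatalities, which is why both views are worth plotting.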
To find the weather events with the greatest economic consequences, we look for the records with the largest property and crop losses.
# Keep only records whose damage is expressed in billions ("B"),
# then sort them from the largest to the smallest reported amount.
econ_harm_prop <- noaa %>%
  filter(PROPDMGEXP == "B") %>%
  arrange(desc(PROPDMG))
econ_harm_crop <- noaa %>%
  filter(CROPDMGEXP == "B") %>%
  arrange(desc(CROPDMG))
The event with the biggest property loss is a FLOOD, and the biggest crop losses come from RIVER FLOOD and ICE STORM events.
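The filters above look only at individual records reported in billions. A complementary view is to convert every record to dollars and sum by event type. The sketch below does this on a toy table, assuming that the exponent codes K, M and B stand for thousands, millions and billions. Any other code is left as NA instead of being guessed. The helper `damage_usd` and all values are illustrative, not part of the original analysis.

```r
# Assumed multipliers for the exponent codes; anything else maps to NA.
exp_multiplier <- c(K = 1e3, M = 1e6, B = 1e9)

# Hypothetical helper: convert a damage figure and its exponent code to dollars.
damage_usd <- function(amount, exp_code) {
  amount * unname(exp_multiplier[toupper(exp_code)])
}

# Toy rows (invented values) standing in for PROPDMG / PROPDMGEXP.
toy <- data.frame(
  EVTYPE     = c("FLOOD", "TORNADO", "TORNADO", "HAIL"),
  PROPDMG    = c(1.5, 250, 2, 10),
  PROPDMGEXP = c("B", "K", "M", "?"),
  stringsAsFactors = FALSE
)
toy$prop_usd <- damage_usd(toy$PROPDMG, toy$PROPDMGEXP)

# Total property damage per event type; rows with unknown codes (NA) are
# dropped by aggregate()'s default na.action.
prop_by_event <- aggregate(prop_usd ~ EVTYPE, data = toy, FUN = sum)
prop_by_event <- prop_by_event[order(-prop_by_event$prop_usd), ]
```

The same helper could be applied to CROPDMG and CROPDMGEXP. Whether the totals agree with the single-record ranking above would have to be checked on the real data.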