An exploration the NOAA Storm Database regarding the impact severe weather events. Data was inputted and dates were broken out to years to evaluate which events had the largest impact.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
Storm Data There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation National Climatic Data Center Storm Events FAQ The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
library(tidyverse)
## Warning: package 'tidyverse' was built under R version 3.4.4
## -- Attaching packages ------------------------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.4
## v tidyr 0.8.0 v stringr 1.3.0
## v readr 1.1.1 v forcats 0.3.0
## Warning: package 'ggplot2' was built under R version 3.4.4
## Warning: package 'tibble' was built under R version 3.4.4
## Warning: package 'tidyr' was built under R version 3.4.4
## Warning: package 'readr' was built under R version 3.4.4
## Warning: package 'purrr' was built under R version 3.4.4
## Warning: package 'stringr' was built under R version 3.4.4
## Warning: package 'forcats' was built under R version 3.4.4
## -- Conflicts ---------------------------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(data.table)
## Warning: package 'data.table' was built under R version 3.4.4
##
## Attaching package: 'data.table'
## The following objects are masked from 'package:dplyr':
##
## between, first, last
## The following object is masked from 'package:purrr':
##
## transpose
library(lubridate)
## Warning: package 'lubridate' was built under R version 3.4.4
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:data.table':
##
## hour, isoweek, mday, minute, month, quarter, second, wday,
## week, yday, year
## The following object is masked from 'package:base':
##
## date
library(anytime)
## Warning: package 'anytime' was built under R version 3.4.4
library(R.utils)
## Warning: package 'R.utils' was built under R version 3.4.4
## Loading required package: R.oo
## Warning: package 'R.oo' was built under R version 3.4.4
## Loading required package: R.methodsS3
## R.methodsS3 v1.7.1 (2016-02-15) successfully loaded. See ?R.methodsS3 for help.
## R.oo v1.22.0 (2018-04-21) successfully loaded. See ?R.oo for help.
##
## Attaching package: 'R.oo'
## The following objects are masked from 'package:methods':
##
## getClasses, getMethods
## The following objects are masked from 'package:base':
##
## attach, detach, gc, load, save
## R.utils v2.7.0 successfully loaded. See ?R.utils for help.
##
## Attaching package: 'R.utils'
## The following object is masked from 'package:tidyr':
##
## extract
## The following object is masked from 'package:utils':
##
## timestamp
## The following objects are masked from 'package:base':
##
## cat, commandArgs, getOption, inherits, isOpen, parse, warnings
if(!file.exists(".data")){dir.create("./data")}
## Warning in dir.create("./data"): '.\data' already exists
fileUrl <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
download.file(fileUrl, destfile = "./data/stormdata.csv.bz2")
bunzip2("./data/stormdata.csv.bz2", "dataset.csv", remove = FALSE, skip = TRUE)
## [1] "dataset.csv"
## attr(,"temporary")
## [1] FALSE
dataset <- read_csv("./data/dataset.csv")
## Parsed with column specification:
## cols(
## .default = col_character(),
## STATE__ = col_double(),
## COUNTY = col_double(),
## BGN_RANGE = col_double(),
## COUNTY_END = col_double(),
## END_RANGE = col_double(),
## LENGTH = col_double(),
## WIDTH = col_double(),
## F = col_integer(),
## MAG = col_double(),
## FATALITIES = col_double(),
## INJURIES = col_double(),
## PROPDMG = col_double(),
## CROPDMG = col_double(),
## LATITUDE = col_double(),
## LONGITUDE = col_double(),
## LATITUDE_E = col_double(),
## LONGITUDE_ = col_double(),
## REFNUM = col_double()
## )
## See spec(...) for full column specifications.
time_merge <- function(time, date, timezone) {
t <- strptime(time, format = "%H%M")
t <- format(t, format="%H:%M:%S")
t <- paste(t, timezone)
x <- mdy_hms(date)
x <- paste(x, t)
new_date <- as.POSIXlt(x,format="%Y-%m-%d %H:%M:%S")
return(new_date)
}
start_dates <- time_merge(dataset$BGN_TIME, dataset$BGN_DATE, dataset$TIME_ZONE)
end_dates <- time_merge(dataset$END_TIME, dataset$END_DATE, dataset$TIME_ZONE)
dataset.dates <- cbind(start_dates, end_dates, dataset)
remove(start_dates)
remove(end_dates)
dataset.dates <- dataset.dates %>%
mutate(Year = year((dataset.dates$start_dates)))
## Warning: package 'bindrcpp' was built under R version 3.4.4
glimpse(dataset.dates)
## Observations: 902,297
## Variables: 40
## $ start_dates <dttm> 1950-04-18 01:30:00, 1950-04-18 01:45:00, 1951-02...
## $ end_dates <dttm> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ STATE__ <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ BGN_DATE <chr> "4/18/1950 0:00:00", "4/18/1950 0:00:00", "2/20/19...
## $ BGN_TIME <chr> "0130", "0145", "1600", "0900", "1500", "2000", "0...
## $ TIME_ZONE <chr> "CST", "CST", "CST", "CST", "CST", "CST", "CST", "...
## $ COUNTY <dbl> 97, 3, 57, 89, 43, 77, 9, 123, 125, 57, 43, 9, 73,...
## $ COUNTYNAME <chr> "MOBILE", "BALDWIN", "FAYETTE", "MADISON", "CULLMA...
## $ STATE <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "A...
## $ EVTYPE <chr> "TORNADO", "TORNADO", "TORNADO", "TORNADO", "TORNA...
## $ BGN_RANGE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ BGN_AZI <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ BGN_LOCATI <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ END_DATE <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ END_TIME <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ COUNTY_END <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ COUNTYENDN <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ END_RANGE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ END_AZI <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ END_LOCATI <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ LENGTH <dbl> 14.0, 2.0, 0.1, 0.0, 0.0, 1.5, 1.5, 0.0, 3.3, 2.3,...
## $ WIDTH <dbl> 100, 150, 123, 100, 150, 177, 33, 33, 100, 100, 40...
## $ F <int> 3, 2, 2, 2, 2, 2, 2, 1, 3, 3, 1, 1, 3, 3, 3, 4, 1,...
## $ MAG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ FATALITIES <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 4, 0,...
## $ INJURIES <dbl> 15, 0, 2, 2, 2, 6, 1, 0, 14, 0, 3, 3, 26, 12, 6, 5...
## $ PROPDMG <dbl> 25.0, 2.5, 25.0, 2.5, 2.5, 2.5, 2.5, 2.5, 25.0, 25...
## $ PROPDMGEXP <chr> "K", "K", "K", "K", "K", "K", "K", "K", "K", "K", ...
## $ CROPDMG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
## $ CROPDMGEXP <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ WFO <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ STATEOFFIC <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ ZONENAMES <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ LATITUDE <dbl> 3040, 3042, 3340, 3458, 3412, 3450, 3405, 3255, 33...
## $ LONGITUDE <dbl> 8812, 8755, 8742, 8626, 8642, 8748, 8631, 8558, 87...
## $ LATITUDE_E <dbl> 3051, 0, 0, 0, 0, 0, 0, 0, 3336, 3337, 3402, 3404,...
## $ LONGITUDE_ <dbl> 8806, 0, 0, 0, 0, 0, 0, 0, 8738, 8737, 8644, 8640,...
## $ REMARKS <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA...
## $ REFNUM <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,...
## $ Year <dbl> 1950, 1950, 1951, 1951, 1951, 1951, 1951, 1952, 19...
To determine which EVTYPEs were most damaging to population health I explore the impact on population health I explore fatalities and injuries
deaths <- dataset.dates %>%
mutate(EVTYPE = (stringr::str_to_upper(EVTYPE))) %>%
group_by(EVTYPE, Year) %>%
filter(!is.na(Year)) %>%
summarise(Deaths = sum(FATALITIES, na.rm = T), Injuries = sum(INJURIES, na.rm = T))
deaths.final <- deaths %>%
mutate(Cause = case_when(str_detect(EVTYPE, ("HEAT|WARM")) ~ "HEAT",
str_detect(EVTYPE, ("FLOOD|STORM")) ~ "STORM",
str_detect(EVTYPE, ("TORNADO|WIND")) ~ "WIND",
str_detect(EVTYPE, ("COLD"))~ "COLD"))
deaths.final$Cause[is.na(deaths.final$Cause)] <- "OTHER"
deaths.final %>%
group_by(Cause) %>%
summarise(Deaths = sum(Deaths))
## # A tibble: 5 x 2
## Cause Deaths
## <chr> <dbl>
## 1 COLD 78.
## 2 HEAT 1142.
## 3 OTHER 359.
## 4 STORM 349.
## 5 WIND 4485.
deaths_aggr <- aggregate(Deaths ~ Cause, data = deaths.final, mean)
Let’s plot by average deaths in years recorded.
ggplot(deaths_aggr, aes(x = reorder(Cause, -Deaths), y = Deaths)) + geom_bar(stat = "identity") +
labs(x = "Cause", y = "Average Deaths in Years Recorded")
It looks like Heat and Wind are the two biggest contributers to death, is that the same for economics?
To determine which EVTYPEs were most damaging to economic health I look at PROPDMG and CROPDMG.
dmg <- dataset.dates %>%
mutate(EVTYPE = (stringr::str_to_upper(EVTYPE))) %>%
group_by(EVTYPE, Year) %>%
filter(!is.na(Year)) %>%
summarise(Damage = sum(PROPDMG, na.rm = T) + sum(CROPDMG, na.rm = T))
dmg.final <- dmg %>%
mutate(Cause = case_when(str_detect(EVTYPE, ("HEAT|WARM")) ~ "HEAT",
str_detect(EVTYPE, ("FLOOD|STORM")) ~ "STORM",
str_detect(EVTYPE, ("TORNADO|WIND")) ~ "WIND",
str_detect(EVTYPE, ("COLD"))~ "COLD"))
dmg.final$Cause[is.na(dmg.final$Cause)] <- "OTHER"
dmg_aggr <- aggregate(Damage ~ Cause, data = dmg.final, mean)
ggplot(dmg_aggr, aes(x = reorder(Cause, -Damage), y = Damage)) + geom_bar(stat = "identity") +
labs(x = "Cause", y = "Average Damage in Years Recorded")
Once again Wind is major cause, however Heat is now the lowest.
In this quick exploration we see that Wind, usually in the form of tornadoes are the most damaging force when looking at economics, but Heat is actual a bigger source of fatalities. While tornadoes are relatively rare events, heat will be a reoccuring problem that can be planned against. My reccomendation based on this would be to look into the current spend on prevention of Heat related deaths versus other measures to see if it is in line with the level of impact.