This project involves exploring the U.S. NOAA storm dataset which contains information about weather events in the United States during the period from 1950 to 2011. The analysis below is concentrated on estimating which types of weather events have a major impact on the economy and public health.
The dataset can be downloaded from here.
National weather service instruction about dataset preparation.
Questions
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
knitr::opts_chunk$set(echo = T, warning = F, message = F)
## loading required packages
library(kableExtra)
library(lubridate)
library(magrittr)
library(ggplot2)
library(dplyr)
## Some additional functions will be loaded via namespaces (see "Session Info" section)
To reproduce the analysis a working directory has to contain following files:
## Working directory structure
dir_ls <- list.files('..', full.names = T, recursive = T)
dir_ls <- plyr::rbind.fill(
lapply(strsplit(dir_ls, "/"), function(x) as.data.frame(t(x)))
)
dir_ls$pathString <- apply(dir_ls, 1, function(x) paste(trimws(x), collapse="/"))
data.tree::as.Node(data.frame(dir_ls))
levelName
1 ..
2 °--reproducibility
3 ¦--repdata_data_StormData.csv.bz2
4 ¦--reproducibility.Rproj
5 °--storms_report.Rmd
## Reading data
df <- read.csv('repdata_data_StormData.csv.bz2', stringsAsFactors = F)
## Variables selection
df %<>%
select(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
mutate(YEAR = year(mdy_hms(BGN_DATE))) ## year of the start of event
The original dataset consisted of 902297 observations on 9 variables during peiod 1950 from to 2011. To speed up further analysis only variables related to our questions were selected.
As we can see in the dataset numeric variables PROPDMG and CROPDMG have to be multiplied by *EXP 1:
## Preparation of lookup table to replace multipliers codes
lookup <- data.frame(EXP = sort(unique(c(df$PROPDMGEXP, df$CROPDMGEXP))),
VAL = c(0,0,0,1,rep(10, 9),10^9,10^2,10^2,10^3,10^3,10^6,10^6))
Using these rules we can transform numeric values in target variables.
## Transforming values using lookup table from previous step
df %<>%
left_join(lookup, by = c('PROPDMGEXP' = 'EXP')) %>%
mutate(PROPDMGEXP = VAL) %>%
select(-VAL) %>%
left_join(lookup, by = c('CROPDMGEXP' = 'EXP')) %>%
mutate(CROPDMGEXP = VAL) %>%
select(-VAL)
## Grouping observations by type of event and calculating total economic damage
df %>%
mutate(`Type of Event` = factor(EVTYPE),
`Economic Impact` = PROPDMG * PROPDMGEXP + CROPDMG * CROPDMGEXP) %>%
group_by(`Type of Event`) %>%
summarise(`Economic Impact` = sum(`Economic Impact`, na.rm = T)) %>%
arrange(desc(`Economic Impact`)) -> economic
Workcloud of the top 50 types of events with the greatest economic impact
set.seed(48)
## Worcloud building
wordcloud::wordcloud(words = economic$`Type of Event`[1:50],
freq = economic$`Economic Impact`[1:50]/sum(economic$`Economic Impact`),
random.order = T)
The letter sizes on the figure are proportional to the contribution of corresponding events.
Detailed information about total economic damage caused by the top 10 events:
kable(economic[1:10, ], align = "l",
caption = 'Top 10 types of events with their economic impact') %>%
kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)
| Type of Event | Economic Impact |
|---|---|
| FLOOD | 150319678250 |
| HURRICANE/TYPHOON | 71913712800 |
| TORNADO | 57352117607 |
| STORM SURGE | 43323541000 |
| HAIL | 18758224527 |
| FLASH FLOOD | 17562132111 |
| DROUGHT | 15018672000 |
| HURRICANE | 14610229010 |
| RIVER FLOOD | 10148404500 |
| ICE STORM | 8967041810 |
According to results flood has the greatest economic consequences.
## Grouping observations by type of event and calculating total injuries
df %>%
mutate(`Type of Event` = EVTYPE) %>%
group_by(`Type of Event`) %>%
summarise(`Total Injuries` = sum(INJURIES, na.rm = T)) %>%
arrange(desc(`Total Injuries`)) -> injuries
## Grouping observations by type of event and calculating total fatalities
df %>%
mutate(`Type of Event` = EVTYPE) %>%
group_by(`Type of Event`) %>%
summarise(`Total Fatalities` = sum(FATALITIES, na.rm = T)) %>%
arrange(desc(`Total Fatalities`)) -> fatalities
## Custom theme for plots
my_theme <- theme(axis.title = element_text(size = 12, face = 'bold'),
axis.text.y = element_text(size = 8, color = 'black'),
axis.text.x = element_text(angle = 90, vjust = 0.5,
hjust = 1, size = 8, color = 'black'))
## Total injuries barplot
a <- ggplot(injuries[1:10, ], aes(reorder(`Type of Event`, desc(`Total Injuries`)), `Total Injuries`))+
geom_bar(stat = 'identity', fill = 'wheat', color = 'black')+
theme_bw()+
labs(x = 'Type of Event', y = 'Total Injuries',
title = 'Injuries') + my_theme
## Total fatalities barplot
b <- ggplot(fatalities[1:10, ], aes(reorder(`Type of Event`, desc(`Total Fatalities`)), `Total Fatalities`))+
geom_bar(stat = 'identity', fill = 'wheat', color = 'black')+
theme_bw()+
labs(x = 'Type of Event', y = 'Total Fatalities',
title = 'Fatalities') + my_theme
ggpubr::ggarrange(a, b, ncol = 2, labels = c('A', 'B'), align = 'hv')
According to barplots tornado has the greatest public health consequences (both total injuries A and total fatalities B)
## Tornado events selection
df %>%
filter(EVTYPE == 'TORNADO') %>%
group_by(YEAR) %>%
summarise(`Number of Events` = n()) %>%
mutate(`Type of Event` = 'Tornado') -> t
## Flood events selection
df %>%
filter(EVTYPE == 'FLOOD') %>%
group_by(YEAR) %>%
summarise(`Number of Events` = n()) %>%
mutate(`Type of Event` = 'Flood') -> f
nb <- 'NB! In the earlier years there are\ngenerally fewer events recorded,\nmost likely due to\na lack of good records.'
rbind(t, f) %>%
ggplot(aes(x = YEAR, y = `Number of Events`, group = `Type of Event`))+
geom_line(aes(color = `Type of Event`), stat = 'identity')+
labs(x = 'Year', y = 'Number of Events', color = 'Type of Event')+
annotate(geom = 'text', x = 1965, y = 3050, label = nb)+
scale_x_continuous(breaks = seq(min(df$YEAR), max(df$YEAR), 10))+
theme_bw()+my_theme
sessionInfo()
R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=ru_RU.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=ru_RU.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=ru_RU.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=ru_RU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] bindrcpp_0.2.2 dplyr_0.7.7 ggplot2_3.1.0 magrittr_1.5
[5] lubridate_1.7.4 kableExtra_0.9.0
loaded via a namespace (and not attached):
[1] wordcloud_2.6 tidyselect_0.2.5 Rook_1.1-1
[4] purrr_0.2.5 colorspace_1.3-2 htmltools_0.3.6
[7] viridisLite_0.3.0 yaml_2.2.0 XML_3.98-1.16
[10] rlang_0.3.0.1 ggpubr_0.2 pillar_1.3.0
[13] glue_1.3.0 withr_2.1.2 RColorBrewer_1.1-2
[16] bindr_0.1.1 plyr_1.8.4 stringr_1.3.1
[19] munsell_0.5.0 gtable_0.2.0 rvest_0.3.2
[22] data.tree_0.7.8 visNetwork_2.0.5 htmlwidgets_1.3
[25] evaluate_0.12 labeling_0.3 knitr_1.20
[28] DiagrammeR_1.0.0 highr_0.7 Rcpp_0.12.19
[31] readr_1.1.1 scales_1.0.0 backports_1.1.2
[34] jsonlite_1.5 rgexf_0.15.3 gridExtra_2.3
[37] brew_1.0-6 hms_0.4.2 digest_0.6.18
[40] stringi_1.2.4 cowplot_0.9.3 grid_3.4.4
[43] rprojroot_1.3-2 influenceR_0.1.0 tools_3.4.4
[46] lazyeval_0.2.1 tibble_1.4.2 crayon_1.3.4
[49] tidyr_0.8.1 pkgconfig_2.0.2 xml2_1.2.0
[52] downloader_0.4 assertthat_0.2.0 rmarkdown_1.10
[55] httr_1.3.1 rstudioapi_0.8 viridis_0.5.1
[58] R6_2.3.0 igraph_1.2.2 compiler_3.4.4