Exploring the U.S. NOAA weather events dataset

Synopsis

This project involves exploring the U.S. NOAA storm dataset which contains information about weather events in the United States during the period from 1950 to 2011. The analysis below is concentrated on estimating which types of weather events have a major impact on the economy and public health.

The dataset can be downloaded from here.

National weather service instruction about dataset preparation.

Questions

Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

knitr::opts_chunk$set(echo = T, warning = F, message = F)
## loading required packages
library(kableExtra)
library(lubridate)
library(magrittr)
library(ggplot2)
library(dplyr)
## Some additional functions will be loaded via namespaces (see "Session Info" section)

To reproduce the analysis a working directory has to contain following files:

## Working directory structure
dir_ls <- list.files('..', full.names = T, recursive = T)
dir_ls <- plyr::rbind.fill(
    lapply(strsplit(dir_ls, "/"), function(x) as.data.frame(t(x)))
)
dir_ls$pathString <- apply(dir_ls, 1, function(x) paste(trimws(x), collapse="/"))
data.tree::as.Node(data.frame(dir_ls))

                               levelName
1 ..                                    
2  °--reproducibility                   
3      ¦--repdata_data_StormData.csv.bz2
4      ¦--reproducibility.Rproj         
5      °--storms_report.Rmd

Data Processing

## Reading data
df <- read.csv('repdata_data_StormData.csv.bz2', stringsAsFactors = F)

## Variables selection
df %<>% 
    select(BGN_DATE, EVTYPE, FATALITIES, INJURIES, PROPDMG, PROPDMGEXP, CROPDMG, CROPDMGEXP) %>%
    mutate(YEAR = year(mdy_hms(BGN_DATE))) ## year of the start of event

The original dataset consisted of 902297 observations on 9 variables during peiod 1950 from to 2011. To speed up further analysis only variables related to our questions were selected.

As we can see in the dataset numeric variables PROPDMG and CROPDMG have to be multiplied by *EXP ¹:

## Preparation of lookup table to replace multipliers codes
lookup <- data.frame(EXP = sort(unique(c(df$PROPDMGEXP, df$CROPDMGEXP))),
                     VAL = c(0,0,0,1,rep(10, 9),10^9,10^2,10^2,10^3,10^3,10^6,10^6))

Using these rules we can transform numeric values in target variables.

## Transforming values using lookup table from previous step
df %<>% 
    left_join(lookup, by = c('PROPDMGEXP' = 'EXP')) %>% 
    mutate(PROPDMGEXP = VAL) %>% 
    select(-VAL) %>% 
    left_join(lookup, by = c('CROPDMGEXP' = 'EXP')) %>% 
    mutate(CROPDMGEXP = VAL) %>% 
    select(-VAL)

Results

Types of events which have the greatest economic consequences

## Grouping observations by type of event and calculating total economic damage
df %>% 
    mutate(`Type of Event` = factor(EVTYPE),
           `Economic Impact` = PROPDMG * PROPDMGEXP + CROPDMG * CROPDMGEXP) %>% 
    group_by(`Type of Event`) %>% 
    summarise(`Economic Impact` = sum(`Economic Impact`, na.rm = T)) %>% 
    arrange(desc(`Economic Impact`)) -> economic

Workcloud of the top 50 types of events with the greatest economic impact

set.seed(48)
## Worcloud building
wordcloud::wordcloud(words = economic$`Type of Event`[1:50], 
                     freq = economic$`Economic Impact`[1:50]/sum(economic$`Economic Impact`),
                     random.order = T)

The letter sizes on the figure are proportional to the contribution of corresponding events.

Detailed information about total economic damage caused by the top 10 events:

kable(economic[1:10, ], align = "l",
      caption = 'Top 10 types of events with their economic impact') %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = F)

Top 10 types of events with their economic impact
Type of Event	Economic Impact
FLOOD	150319678250
HURRICANE/TYPHOON	71913712800
TORNADO	57352117607
STORM SURGE	43323541000
HAIL	18758224527
FLASH FLOOD	17562132111
DROUGHT	15018672000
HURRICANE	14610229010
RIVER FLOOD	10148404500
ICE STORM	8967041810

According to results flood has the greatest economic consequences.

Types of events which have the greatest public health consequences

## Grouping observations by type of event and calculating total injuries
df %>% 
    mutate(`Type of Event` = EVTYPE) %>% 
    group_by(`Type of Event`) %>% 
    summarise(`Total Injuries` = sum(INJURIES, na.rm = T)) %>% 
    arrange(desc(`Total Injuries`)) -> injuries

## Grouping observations by type of event and calculating total fatalities
df %>% 
    mutate(`Type of Event` = EVTYPE) %>% 
    group_by(`Type of Event`) %>% 
    summarise(`Total Fatalities` = sum(FATALITIES, na.rm = T)) %>% 
    arrange(desc(`Total Fatalities`)) -> fatalities

## Custom theme for plots
my_theme <- theme(axis.title = element_text(size = 12, face = 'bold'), 
                  axis.text.y = element_text(size = 8, color = 'black'),
                  axis.text.x = element_text(angle = 90, vjust = 0.5, 
                                             hjust = 1, size = 8, color = 'black'))
## Total injuries barplot
a <- ggplot(injuries[1:10, ], aes(reorder(`Type of Event`, desc(`Total Injuries`)), `Total Injuries`))+
    geom_bar(stat = 'identity', fill = 'wheat', color = 'black')+
    theme_bw()+
    labs(x = 'Type of Event', y = 'Total Injuries', 
         title = 'Injuries') + my_theme

## Total fatalities barplot
b <- ggplot(fatalities[1:10, ], aes(reorder(`Type of Event`, desc(`Total Fatalities`)), `Total Fatalities`))+
    geom_bar(stat = 'identity', fill = 'wheat', color = 'black')+
    theme_bw()+
    labs(x = 'Type of Event', y = 'Total Fatalities', 
         title = 'Fatalities') + my_theme   

ggpubr::ggarrange(a, b, ncol = 2, labels = c('A', 'B'), align = 'hv')

According to barplots tornado has the greatest public health consequences (both total injuries A and total fatalities B)

Number of the most harmful weather events by year

## Tornado events selection
df %>% 
    filter(EVTYPE == 'TORNADO') %>% 
    group_by(YEAR) %>% 
    summarise(`Number of Events` = n()) %>% 
    mutate(`Type of Event` = 'Tornado') -> t

## Flood events selection
df %>% 
    filter(EVTYPE == 'FLOOD') %>% 
    group_by(YEAR) %>% 
    summarise(`Number of Events` = n()) %>% 
    mutate(`Type of Event` = 'Flood') -> f

nb <- 'NB! In the earlier years there are\ngenerally fewer events recorded,\nmost likely due to\na lack of good records.'

rbind(t, f) %>% 
    ggplot(aes(x = YEAR, y = `Number of Events`, group = `Type of Event`))+
    geom_line(aes(color = `Type of Event`), stat = 'identity')+
    labs(x = 'Year', y = 'Number of Events', color = 'Type of Event')+
    annotate(geom = 'text', x = 1965, y = 3050, label = nb)+
    scale_x_continuous(breaks = seq(min(df$YEAR), max(df$YEAR), 10))+
    theme_bw()+my_theme

Session Info

sessionInfo()

R version 3.4.4 (2018-03-15)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=ru_RU.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=ru_RU.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=ru_RU.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=ru_RU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] bindrcpp_0.2.2   dplyr_0.7.7      ggplot2_3.1.0    magrittr_1.5    
[5] lubridate_1.7.4  kableExtra_0.9.0

loaded via a namespace (and not attached):
 [1] wordcloud_2.6      tidyselect_0.2.5   Rook_1.1-1        
 [4] purrr_0.2.5        colorspace_1.3-2   htmltools_0.3.6   
 [7] viridisLite_0.3.0  yaml_2.2.0         XML_3.98-1.16     
[10] rlang_0.3.0.1      ggpubr_0.2         pillar_1.3.0      
[13] glue_1.3.0         withr_2.1.2        RColorBrewer_1.1-2
[16] bindr_0.1.1        plyr_1.8.4         stringr_1.3.1     
[19] munsell_0.5.0      gtable_0.2.0       rvest_0.3.2       
[22] data.tree_0.7.8    visNetwork_2.0.5   htmlwidgets_1.3   
[25] evaluate_0.12      labeling_0.3       knitr_1.20        
[28] DiagrammeR_1.0.0   highr_0.7          Rcpp_0.12.19      
[31] readr_1.1.1        scales_1.0.0       backports_1.1.2   
[34] jsonlite_1.5       rgexf_0.15.3       gridExtra_2.3     
[37] brew_1.0-6         hms_0.4.2          digest_0.6.18     
[40] stringi_1.2.4      cowplot_0.9.3      grid_3.4.4        
[43] rprojroot_1.3-2    influenceR_0.1.0   tools_3.4.4       
[46] lazyeval_0.2.1     tibble_1.4.2       crayon_1.3.4      
[49] tidyr_0.8.1        pkgconfig_2.0.2    xml2_1.2.0        
[52] downloader_0.4     assertthat_0.2.0   rmarkdown_1.10    
[55] httr_1.3.1         rstudioapi_0.8     viridis_0.5.1     
[58] R6_2.3.0           igraph_1.2.2       compiler_3.4.4

Thanks to gcctang1 for the useful information↩