Impact of Weather Events on Public Health and Economics

Binder

Synopsis

According to the analysis performed to answer questions 1 and 2, the event type with the most harmful consequences is the TORNADO due to its high frequency and high average number of injuries and fatalities. On the other hand, HURRICANE and TSUNAMI do not have the highest number of harmful to the person. Also, both have a low probability of happening. From the given dataset, we have recorded 299 hurricanes and 20 tsunamis. On the other hand, there are more than 60 thousand tornadoes observations. Still, the average number of deaths and injuries is greater than the TORNADO, which we could interpret as both event types being more deadly.

Lastly, FLOOD is the event type with the most significant economic impact in absolute values, which reach more than USD 180 billion (most of it is related to Properties Damages). On the other hand, weather problems such as EXCESSIVE HEAT and EXCESSIVE COLDc also significantly impact economic affairs. For example, the EXCESSIVE HEAT has generated more than USD 15.8 billion in Damages to Crops. Another critical point is HURRICANE does not happen regularly. Still, when it happens, it causes severe Damage to Properties and Crops. Based on the given dataset, hurricanes average generate USD 300 million of Damage for each event.


1. Introduction

Course Project 2 explores the database from U.S. National Oceanic and Atmospheric Administration (NOAA) and will be used to create a Reproducible Research following the instructions and reviews criteria stated by the Course instructor. This reproducible research should be hosted on Rpubs and will be evaluated in a peer-graded assignment. In addition, Course Project 2 also aims to evaluate the economic impacts and the consequences caused by storms, hurricanes, tornadoes, floods, and other environmental disasters.

2. Objectivies

The objective of Course Project 2 aims to answer two questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
  2. Across the United States, which types of events have the greatest economic consequences?

The publication will present the answers to Course Project 2 questions in section 7. Results.

3. Requirements and Settings

Requirements necessary to reproduce this analysis:

# Loading libraries
library(ggplot2)
library(tidyverse)
library(rmarkdown)
library(magrittr)
library(kableExtra)
library(DT)

Following the Course Project 2 instruction, it is necessary to set everything up in English.

# Force the R environment to use English settings
Sys.setlocale("LC_ALL","English")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

Finally, if you want to reproduce it on your computer, I have printed below the sessionInfo(), which will help you to load all packages used to perform this publication.

## R version 4.2.0 (2022-04-22 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22000)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## system code page: 65001
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DT_0.23          kableExtra_1.3.4 magrittr_2.0.3   forcats_0.5.1   
##  [5] stringr_1.4.0    dplyr_1.0.9      purrr_0.3.4      readr_2.1.2     
##  [9] tidyr_1.2.0      tibble_3.1.7     tidyverse_1.3.1  ggplot2_3.3.6   
## [13] rmarkdown_2.14  
## 
## loaded via a namespace (and not attached):
##  [1] svglite_2.1.0     lubridate_1.8.0   assertthat_0.2.1  digest_0.6.29    
##  [5] utf8_1.2.2        R6_2.5.1          cellranger_1.1.0  backports_1.4.1  
##  [9] reprex_2.0.1      evaluate_0.15     httr_1.4.3        pillar_1.7.0     
## [13] rlang_1.0.2       readxl_1.4.0      rstudioapi_0.13   jquerylib_0.1.4  
## [17] webshot_0.5.3     htmlwidgets_1.5.4 munsell_0.5.0     broom_0.8.0      
## [21] compiler_4.2.0    modelr_0.1.8      xfun_0.31         pkgconfig_2.0.3  
## [25] systemfonts_1.0.4 htmltools_0.5.2   tidyselect_1.1.2  codetools_0.2-18 
## [29] fansi_1.0.3       viridisLite_0.4.0 crayon_1.5.1      tzdb_0.3.0       
## [33] dbplyr_2.2.0      withr_2.5.0       grid_4.2.0        jsonlite_1.8.0   
## [37] gtable_0.3.0      lifecycle_1.0.1   DBI_1.1.3         scales_1.2.0     
## [41] cli_3.3.0         stringi_1.7.6     fs_1.5.2          xml2_1.3.3       
## [45] bslib_0.3.1       ellipsis_0.3.2    generics_0.1.2    vctrs_0.4.1      
## [49] tools_4.2.0       glue_1.6.2        hms_1.1.1         fastmap_1.1.0    
## [53] yaml_2.3.5        colorspace_2.0-3  rvest_1.0.2       knitr_1.39.3     
## [57] haven_2.5.0       sass_0.4.1

4. Data Processing

The data source used for this Course Project is from the U.S. National Oceanic and Atmospheric Administration’s (NOAA), this dataset has recorded data about storms, tornados, floods, and other environmental and natural disasters between 1950 and 2011. In addition, the dataset also includes the number of fatalities, injuries, and property damages.

4.1. Downloading

Creating a data folder to store the dataset from NOAA.

# 1. Create a data directory
if(!base::file.exists("data")) {
    base::dir.create("data")
}

# 2. Download files and store it in data directory.
if(!base::file.exists("./data/repdata_data_StormData.csv.bz2")){
    utils::download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                         destfile = "./data/repdata_data_StormData.csv.bz2")
}

4.2. Loading

Although the bz2 is a compressed file, the base system function read.csv can read it without uncompressing it.

# Loading the raw dataset.
raw_data <- utils::read.csv(file = "./data/repdata_data_StormData.csv.bz2")

Due to the facility to manage data frames using the tidyverse package, I have converted the raw_data object into a tibble class.

# Convert the regular data frame into a dplyr table
raw_data <- dplyr::tbl_df(raw_data)

5. Data Assessing

This section presents a few characteristics of the raw dataset, and does not have the intention of performing an exploratory data analysis. Instead, it is just a big picture of the raw dataset.

5.1. Dimensions

The raw dataset loaded from the bz2 file comprises 902.297 observations and 37 columns (variables), as it is possible to confirmed by the dim() function.

# Checking the observations and varibles in raw dataset.
base::dim(raw_data) # [Observations/rows] [variables/columns]
## [1] 902297     37

5.2. Head and Tail

The first 5 lines of the raw data set:

# Printing the first 5 rows.
head(raw_data,5)
## # A tibble: 5 x 37
##   STATE__ BGN_DATE   BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
##     <dbl> <chr>      <chr>    <chr>      <dbl> <chr>      <chr> <chr>      <dbl>
## 1       1 4/18/1950~ 0130     CST           97 MOBILE     AL    TORNA~         0
## 2       1 4/18/1950~ 0145     CST            3 BALDWIN    AL    TORNA~         0
## 3       1 2/20/1951~ 1600     CST           57 FAYETTE    AL    TORNA~         0
## 4       1 6/8/1951 ~ 0900     CST           89 MADISON    AL    TORNA~         0
## 5       1 11/15/195~ 1500     CST           43 CULLMAN    AL    TORNA~         0
## # ... with 28 more variables: BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>,
## #   END_TIME <chr>, COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>,
## #   END_AZI <chr>, END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>,
## #   MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>,
## #   PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>,
## #   STATEOFFIC <chr>, ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
## #   LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>

The last 5 lines of the raw data set:

# Printing the last 5 rows.
tail(raw_data,5)
## # A tibble: 5 x 37
##   STATE__ BGN_DATE   BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
##     <dbl> <chr>      <chr>    <chr>      <dbl> <chr>      <chr> <chr>      <dbl>
## 1      56 11/30/201~ 10:30:0~ MST            7 WYZ007 - ~ WY    HIGH ~         0
## 2      30 11/10/201~ 02:48:0~ MST            9 MTZ009 - ~ MT    HIGH ~         0
## 3       2 11/8/2011~ 02:58:0~ AKS          213 AKZ213     AK    HIGH ~         0
## 4       2 11/9/2011~ 10:21:0~ AKS          202 AKZ202     AK    BLIZZ~         0
## 5       1 11/28/201~ 08:00:0~ CST            6 ALZ006     AL    HEAVY~         0
## # ... with 28 more variables: BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>,
## #   END_TIME <chr>, COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>,
## #   END_AZI <chr>, END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>,
## #   MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>,
## #   PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>,
## #   STATEOFFIC <chr>, ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
## #   LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>

5.3. Variables

The dataset from NOAA has 37 variables, as seen from the str function.

# Printing the variables' names.
utils::str(raw_data)
## tibble [902,297 x 37] (S3: tbl_df/tbl/data.frame)
##  $ STATE__   : num [1:902297] 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr [1:902297] "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr [1:902297] "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr [1:902297] "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num [1:902297] 97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr [1:902297] "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr [1:902297] "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr [1:902297] "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr [1:902297] "" "" "" "" ...
##  $ BGN_LOCATI: chr [1:902297] "" "" "" "" ...
##  $ END_DATE  : chr [1:902297] "" "" "" "" ...
##  $ END_TIME  : chr [1:902297] "" "" "" "" ...
##  $ COUNTY_END: num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi [1:902297] NA NA NA NA NA NA ...
##  $ END_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr [1:902297] "" "" "" "" ...
##  $ END_LOCATI: chr [1:902297] "" "" "" "" ...
##  $ LENGTH    : num [1:902297] 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num [1:902297] 100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int [1:902297] 3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num [1:902297] 0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num [1:902297] 15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num [1:902297] 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr [1:902297] "K" "K" "K" "K" ...
##  $ CROPDMG   : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr [1:902297] "" "" "" "" ...
##  $ WFO       : chr [1:902297] "" "" "" "" ...
##  $ STATEOFFIC: chr [1:902297] "" "" "" "" ...
##  $ ZONENAMES : chr [1:902297] "" "" "" "" ...
##  $ LATITUDE  : num [1:902297] 3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num [1:902297] 8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num [1:902297] 3051 0 0 0 0 ...
##  $ LONGITUDE_: num [1:902297] 8806 0 0 0 0 ...
##  $ REMARKS   : chr [1:902297] "" "" "" "" ...
##  $ REFNUM    : num [1:902297] 1 2 3 4 5 6 7 8 9 10 ...

The read.csv() function did not correctly import the following variables:

  • BGN_DATE, END_DATE, BGN_TIME and END_TIME are Date class variables;

5.3. NA Presence

The variables LATITUDE and LATITUDE_E have the presence of NA observations.

# Printing the summary of LATITUDE.
summary(raw_data$LATITUDE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    2802    3540    2875    4019    9706      47
# Printing the summary of LATITUDE_E.
summary(raw_data$LATITUDE_E)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       0       0    1452    3549    9706      40

All those NA observations are related to the AMERICAN SAMOA territory.

# Filtering the NA observations and finding the origins.
raw_data %>% 
    filter(is.na(LATITUDE) | is.na(LATITUDE_E)) %>% 
    select(STATEOFFIC) %>%
    unique()
## # A tibble: 1 x 1
##   STATEOFFIC    
##   <chr>         
## 1 AMERICAN SAMOA

5.4. Exponential Variable

The PROPDMGEXP and CROPDMGEXP (Properties Exponential and Crop Exponential) variables store the base 10 exponential value of PROPDMG and CROPDMG, respectively. For this reason, it is necessary to combine PROPDMG and PROPDMGEXP into a single variable and do the same to CROPDMG and CROPDMGEXP.

Example

[PROPDMG] * 10^[PROPDMGEXP]

1.5 * 10^[B] = 1.5 billion USD

Therefore, it is essential to understand how the PROPDMGEXP and CROPDMGEXP were recorded.

# Printing the unique values to exponential to Properties.
unique(raw_data$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
# Printing the unique values to exponential to Crops.
unique(raw_data$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

Finally, the notation is not uniform because sometimes they use Capital or Lowercase (e.g., K or k). In other cases, they use a number or insert a question mark (?) or plus/minus signal (e.g., - or +).

  • PROPDMGEXP
    • ?: Has 8 (eight) observations with this notation, none of the rows of PROPDM and CROPDMG has values;
    • +: Has 5 (five) observations with this notation, and;
    • -: Has 1 (one) observation with this notation and 1 (one) fatality. I will consider like “0”.
  • CROPDMGEXP
    • ?: Has 7 (seven) observations with this notation, all those with CROPDMG as zero;

5.5. Event Type

The raw dataset has 985 Event Types, the code below will present the first 10.

raw_data %>%
    group_by(EVTYPE) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    head(10)
## # A tibble: 10 x 2
##    EVTYPE                  n
##    <chr>               <int>
##  1 HAIL               288661
##  2 TSTM WIND          219940
##  3 THUNDERSTORM WIND   82563
##  4 TORNADO             60652
##  5 FLASH FLOOD         54277
##  6 FLOOD               25326
##  7 THUNDERSTORM WINDS  20843
##  8 HIGH WIND           20212
##  9 LIGHTNING           15754
## 10 HEAVY SNOW          15708

Unfortunately, there is no standardization, as seen in TSTM WIND, THUNDERSTORM WIND, and THUNDERSTORM WINDS. For this reason, it is necessary to adjust this variable, consolidating those events using a single notation.

For further understanding of each variable, please visit the codebook.

6. Data Manipulation

Although I have observed problems in BGN_DATE, END_DATE, BGN_TIME and END_TIME for this Course Project 2, those variables will be not used for further analysis. For this reason, I will subset the raw dataset to work with part of it.

6.1. Subsetting the raw data

Based on the questions posted on 2. Objectivies, they are related to the most harmful concerning population health and its economic consequences. Thus, the raw dataset subset will select the following variables:

  • EVTYPE: Event type;
  • INJURIES: Quantity of injuries;
  • FATALITIES: Quantity of fatalities;
  • PROPDMG: Amount in USD in Damages in properties;
  • CROPDMG: Amount in USD in Damages in crops;
  • PROPDMGEXP: The exponential in base 10 used to notate the PROPDMG, and;
  • CROPDMGEXP: The exponential in base 10 used to notate the CROPDMG.
# Subsetting the raw dataset.
df_subset <- select(raw_data,EVTYPE,INJURIES,FATALITIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)

6.2. Merging Damage and Exponential Variables

It is necessary to merge PROPDMG and PROPDMGEXP into a single variable, so I will create the PROPDMGABS using the following rules.

  • All 0, and - will be converted into 1;
  • All 1 will be converted into 10;
  • All H, h and 2 will be converted into 100;
  • All K, k and 3 will be converted into 1000;
  • All 4 will be converted into 10000;
  • All 5 will be converted into 100000;
  • All M, m and 6 will be converted into 1000000;
  • All 7 will be converted into 10000000;
  • All 8 will be converted into 100000000;
  • All B, b and 9 will be converted into 1000000000;

Following the same principles, I will merge CROPDMG and CROPDMG to create CROPDMGABS:

  • All 0, and ? will be converted into 1;
  • All H, h and 2 will be converted into 100;
  • All K, k and 3 will be converted into 1000;
  • All M, m and 6 will be converted into 1000000;
  • All B, b and 9 will be converted into 1000000000;

The PROPDMGABS and CROPDMGABS will be in charge of storing the absolute values of Properties Damages and Crops Damages.

# Merging exponential column and value column.
df_subset_2 <- df_subset %>%
    
    # Creating a new column PROPDMGABS
    mutate(PROPDMGABS = case_when(
        PROPDMGEXP %in% c("", "-",0) ~ PROPDMG * 1,
        PROPDMGEXP %in% c(1)         ~ PROPDMG * 10,
        PROPDMGEXP %in% c("H","h",2) ~ PROPDMG * 100,
        PROPDMGEXP %in% c("K","k",3) ~ PROPDMG * 1000,
        PROPDMGEXP %in% c(4)         ~ PROPDMG * 10000,
        PROPDMGEXP %in% c(5)         ~ PROPDMG * 100000,
        PROPDMGEXP %in% c("M","m",6) ~ PROPDMG * 1000000,
        PROPDMGEXP %in% c(7)         ~ PROPDMG * 10000000,
        PROPDMGEXP %in% c(8)         ~ PROPDMG * 100000000,
        PROPDMGEXP %in% c("B","b",9) ~ PROPDMG * 1000000000)) %>%
    
    # Creating a new column CROPDMGABS
    mutate(CROPDMGABS = case_when(
        CROPDMGEXP %in% c("?","", 0)   ~ CROPDMG * 1,
        CROPDMGEXP %in% c("H", "h", 2) ~ CROPDMG * 100,
        CROPDMGEXP %in% c("K", "k", 3) ~ CROPDMG * 1000,
        CROPDMGEXP %in% c("M", "m", 6) ~ CROPDMG * 1000000,
        CROPDMGEXP %in% c("B", "b", 9) ~ CROPDMG * 1000000000)) 

6.3. Uniforming Event Types

The EVTYPE variable is not tidy because there are a lot of classifications with the same meaning but different notations. Hence, I will standardize this variable following the rules inside of the next chunk.

# Standardizing the EVTYPE variable.
df_subset_3 <- df_subset_2 %>%
    
    # Converting all value to be UPPERCASE.
    mutate(EVTYPE = stringr::str_to_upper(EVTYPE)) %>%
    
    # Rules to aggregate events into a new variable called EVTYPE_2
    mutate(EVTYPE_2 = case_when(
        grepl(pattern = "*HURRICANE*|*TYPHOON*", x = EVTYPE) ~ "HURRICANE",
        grepl(pattern = "*TORNADO*", x = EVTYPE) ~ "TORNADO",
        grepl(pattern = "*THUNDERSTORM WIN*|*TSTM*|*FUNNEL CLOUD*|*FUNNEL*|*THUNDERSTORM*", x = EVTYPE) ~ "THUNDERSTORM WIND",
        grepl(pattern = "*FLOOD*|*FLD*", x = EVTYPE) ~ "FLOOD",
        grepl(pattern = "*HAIL*", x = EVTYPE) ~ "HAIL",
        grepl(pattern = "*HIGH WIND*|*STRONG WIND*|WIND", x = EVTYPE) ~ "HIGH WIND",
        grepl(pattern = "*FOREST FIRE*|*WILD FIRES*|*WILDFIRE*", x = EVTYPE) ~ "FOREST FIRE",
        grepl(pattern = "*LIGHTNING*", x = EVTYPE) ~ "LIGHTNING",
        grepl(pattern = "*HEAT*|*DROUGHT*|*EXCESSIVE HEAT*|*RECORD HIGH*|*WARMTH*|*UNSEASONABLY WARM*|*UNSEASONABLY DRY*|*TEMPERATURE RECORD*|*RECORD TEMPERATURE*|*DRY*", x = EVTYPE) ~ "EXCESSIVE HEAT",
        grepl(pattern = "*WINTER WEATHER*|*FROST/FREEZE*|*EXTREME COLD*|*COLD*|*ICE*|*FREEZE*|*FROST*|*WINTRY MIX*|*LOW TEMPERATURE*", x = EVTYPE) ~ "EXCESSIVE COLD",
        grepl(pattern = "*WINTER STORM*|*SLEET*|*SNOW*|*BLIZZARD*|*FREEZING RAIN*", x = EVTYPE) ~ "ICE STORM",
        grepl(pattern = "*LANDSLIDE*|*MUDSLIDE*|*MUD SLIDE*", x = EVTYPE) ~ "LANDSLIDE",
        grepl(pattern = "*AVALANCHE*", x = EVTYPE) ~ "AVALANCHE",
        grepl(pattern = "*FOG*|*SMOKE*", x = EVTYPE) ~ "FOG",
        grepl(pattern = "*DUST*", x = EVTYPE) ~ "DUST",
        grepl(pattern = "*CURRENT*|*SURF*", x = EVTYPE) ~ "CURRENT",
        grepl(pattern = "*TSUNAMI*", x = EVTYPE) ~ "TSUNAMI",
        grepl(pattern = "*WATERSPOUT*", x = EVTYPE) ~ "WATERSPOUT")) %>%
    
    # Creating a OTHER type of event to aggregate minor events.
    mutate(EVTYPE_2 = case_when(
        is.na(EVTYPE_2) ~ "OTHER",
        !is.na(EVTYPE_2) ~ EVTYPE_2)) %>%
    
    # Dropping EVTYPE because it is not necessary to carry it.
    select(-EVTYPE) %>%
    
    # Renaming the EVTYPE_2 to be the new EVTYPE
    mutate(EVTYPE = EVTYPE_2)

After the standardization, the new EVTYPE has 19 categories.

# Printing the standards of Event Type
unique(df_subset_3$EVTYPE)
##  [1] "TORNADO"           "THUNDERSTORM WIND" "HAIL"             
##  [4] "EXCESSIVE COLD"    "ICE STORM"         "FLOOD"            
##  [7] "HURRICANE"         "EXCESSIVE HEAT"    "LIGHTNING"        
## [10] "FOG"               "CURRENT"           "HIGH WIND"        
## [13] "OTHER"             "WATERSPOUT"        "AVALANCHE"        
## [16] "DUST"              "FOREST FIRE"       "LANDSLIDE"        
## [19] "TSUNAMI"

6.3. Tidy dataset

The tidy dataset will be composite by the EVTYPE, INJURIES, FATALITIES, PROPDMGABS and CROPDMGABS.

# Creating the tidy dataset.
df_tidy <- df_subset_3 %>%
    
    # Dropping the columns: PROPDMGEXP, PROPDMG, CROPDMGEXP, CROPDMG
    select(-c(PROPDMGEXP, PROPDMG, CROPDMGEXP, CROPDMG))

7. Results

Question 1

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Answer

I will define harmful as the summation of INJURIES and FATALITIES of each EVTYPE. Therefore, it is necessary to create a new variable, the so-called HARMFUL, to store these aggregation. I also will create 3 new variables to store the average of injuries (INJURIES_AVG), fatalities (FATALITIES_AVG), and harmful (HARMFUL_AVG) based on the number of observations (QTY) of each EVTYPE.

# This subset will creates averages of each EVTYPE and add the Harmful variable.
df_tidy_q1 <- df_tidy %>%
    
    # Creating the HARMFUL variable.
    mutate(HARMFUL = INJURIES + FATALITIES) %>%
    
    # Aggregate everything using EVTYPE.
    group_by(EVTYPE) %>%
    
    # Summarizes the data using the group by EVTYPE.
    summarise(HARMFUL = sum(HARMFUL),
              QTY = n(),
              FATALITIES = sum(FATALITIES),
              INJURIES = sum(INJURIES)) %>%
    
    # Arrange to be descendant.
    arrange(desc(HARMFUL)) %>%
    
    # Adding averages variables.
    mutate(HARMFUL_AVG = HARMFUL/QTY,
           FATALITIES_AVG = FATALITIES/QTY,
           INJURIES_AVG = INJURIES/QTY)

The adjusted dataset will be shown below.

# Printing the 
head(df_tidy_q1, 19) %>% kableExtra::kbl() %>% kableExtra::kable_styling()
EVTYPE HARMFUL QTY FATALITIES INJURIES HARMFUL_AVG FATALITIES_AVG INJURIES_AVG
TORNADO 97068 60700 5661 91407 1.5991433 0.0932619 1.5058814
EXCESSIVE HEAT 14132 33788 3467 10665 0.4182550 0.1026104 0.3156446
THUNDERSTORM WIND 10276 343796 729 9547 0.0298898 0.0021204 0.0277694
FLOOD 10236 86133 1553 8683 0.1188395 0.0180303 0.1008092
LIGHTNING 6048 15763 817 5231 0.3836833 0.0518302 0.3318531
EXCESSIVE COLD 3972 14341 483 3489 0.2769681 0.0336797 0.2432885
HIGH WIND 2690 28173 696 1994 0.0954815 0.0247045 0.0707770
ICE STORM 2565 15973 335 2230 0.1605835 0.0209729 0.1396106
FOREST FIRE 1696 4232 90 1606 0.4007561 0.0212665 0.3794896
HURRICANE 1468 299 135 1333 4.9096990 0.4515050 4.4581940
CURRENT 1442 1937 712 730 0.7444502 0.3675787 0.3768715
HAIL 1386 289281 15 1371 0.0047912 0.0000519 0.0047393
FOG 1156 1855 80 1076 0.6231806 0.0431267 0.5800539
DUST 507 586 24 483 0.8651877 0.0409556 0.8242321
AVALANCHE 394 386 224 170 1.0207254 0.5803109 0.4404145
OTHER 344 546 44 300 0.6300366 0.0805861 0.5494505
TSUNAMI 162 20 33 129 8.1000000 1.6500000 6.4500000
LANDSLIDE 99 643 44 55 0.1539658 0.0684292 0.0855365
WATERSPOUT 32 3845 3 29 0.0083225 0.0007802 0.0075423

Based on the df_tidy_q1, let’s visualize it in a graph.

# Creating a auxiliary dataset to plot 1.
# It will be necessary to put the dataset in such a way to be helpful to ggplot2.
rbind(
    
    # Gathering the dataset to have absolute data in rows.
    df_tidy_q1 %>%
    select(EVTYPE, FATALITIES, INJURIES) %>%
    mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE)) %>%
    pivot_longer(cols = 2:3, names_to = "HARMFUL", values_to = "VALUES") %>%
    mutate(VALUE_TYPE = factor("ABSOLUTE")),
    
    # Gathering the dataset to average have average data in rows.
    df_tidy_q1 %>%
    arrange(desc(HARMFUL_AVG)) %>%
    select(EVTYPE, FATALITIES_AVG, INJURIES_AVG) %>%
    mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE)) %>%
    pivot_longer(cols = 2:3, names_to = "HARMFUL", values_to = "VALUES")%>%
    mutate(VALUE_TYPE = factor("AVERAGE"))) %>%
    
    # Plotting the graph using GGPLOT2
    ggplot(aes(fill = HARMFUL, y = VALUES, x = EVTYPE)) + 
    
    # Defining a stacked barplot.
    geom_bar(position = "stack", stat = "identity") +
    
    # Defining the graph title.
    ggplot2::labs(title = "Most Harmful Event Type") + 
    
    # Defining the x-axis label
    ggplot2::xlab("Event Type") +
    
    # Defining the y-axis label
    ggplot2::ylab("Health Harmful") +
    
    # Adjusting the x-axis to be 90 degrees.
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
          legend.position = "right",
          plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5)) +
    
    # Removing the legend title.
        ggplot2::guides(fill = ggplot2::guide_legend(title = "")) + 
    
    # Creating facet to show absolute and average values.
    facet_grid(rows = vars(VALUE_TYPE), scales = "free")

In term of absolute value, there are no doubt that TORNADO events are the most harmful to health due to the high frequency of tornadoes. The number of tornadoes observed in the dataset is more than 60 thousand (between 1950 and 2011), which results in 5.661 FATALITIES, 91.407 INJURIES, and 97.068 health HARMFUL.

However, whether you analyze it in terms of average, in that case, the TSUNAMI event type has the highest averages of INJURIES and FATALITIES, resulting in increased health HARMFUL. The meaning of seeing the average is to identify the most deadly event type.

Question 2

Across the United States, which types of events have the greatest economic consequences?

Answer

Following the same principles used in Question 1, I will define as economic consequence the sum of Properties Damages (PROPDMGABS) and Crop Damages (CROPDMGABS). Also, I will compare the average of PROPDMGABS and CROPDMGABS to analyze which one has the higher economic impact.

# This subset will creates averages of each EVTYPE and add the Harmful variable.
df_tidy_q2 <- df_tidy %>%
    
    # Creating the HARMFUL variable.
    mutate(EXPENSES = PROPDMGABS + CROPDMGABS) %>%
    
    # Aggregate everything using EVTYPE.
    group_by(EVTYPE) %>%
    
    # Summarizes the data using the group by EVTYPE.
    summarise(EXPENSES = sum(EXPENSES, na.rm = TRUE),
              QTY = n(),
              PROPDMGABS = sum(PROPDMGABS, na.rm = TRUE),
              CROPDMGABS = sum(CROPDMGABS, na.rm = TRUE)) %>%
    
    # Arrange to be descendant.
    arrange(desc(EXPENSES)) %>%
    
    # Adding averages variables.
    mutate(EXPENSES_AVG = EXPENSES/QTY,
           PROPDMGABS_AVG = PROPDMGABS/QTY,
           CROPDMGABS_AVG = CROPDMGABS/QTY) 

The adjusted dataset will be shown below.

# Printing the 
head(df_tidy_q2, 19) %>% kableExtra::kbl() %>% kableExtra::kable_styling()
EVTYPE EXPENSES QTY PROPDMGABS CROPDMGABS EXPENSES_AVG PROPDMGABS_AVG CROPDMGABS_AVG
FLOOD 180658667663 86133 168270100463 12388567200 2.097438e+06 1.953608e+06 1.438307e+05
HURRICANE 90872527810 299 85356410010 5516117800 3.039215e+08 2.854729e+08 1.844855e+07
TORNADO 59020779387 60700 58603317867 417461520 9.723357e+05 9.654583e+05 6.877455e+03
CURRENT 48056407000 1937 48055552000 855000 2.480971e+07 2.480927e+07 4.414042e+02
EXCESSIVE HEAT 21065038320 33788 5257925140 15807113180 6.234473e+05 1.556152e+05 4.678322e+05
EXCESSIVE COLD 21015496360 14341 11877309560 9138186800 1.465414e+06 8.282065e+05 6.372071e+05
HAIL 19024452136 289281 15977564513 3046887623 6.576461e+04 5.523199e+04 1.053262e+04
THUNDERSTORM WIND 12456657288 343796 11184948300 1271708988 3.623270e+04 3.253368e+04 3.699022e+03
FOREST FIRE 8899845130 4232 8496563500 403281630 2.102988e+06 2.007695e+06 9.529339e+04
ICE STORM 7551932751 15973 7412418751 139514000 4.727936e+05 4.640593e+05 8.734364e+03
HIGH WIND 7048888488 28173 6268408088 780480400 2.502001e+05 2.224970e+05 2.770313e+04
LIGHTNING 952534520 15763 940442430 12092090 6.042850e+04 5.966139e+04 7.671186e+02
LANDSLIDE 346693100 643 326676100 20017000 5.391806e+05 5.080499e+05 3.113064e+04
OTHER 169455250 546 10666300 158788950 3.103576e+05 1.953535e+04 2.908223e+05
TSUNAMI 144082000 20 144062000 20000 7.204100e+06 7.203100e+06 1.000000e+03
FOG 22929500 1855 22929500 0 1.236092e+04 1.236092e+04 0.000000e+00
WATERSPOUT 9564200 3845 9564200 0 2.487438e+03 2.487438e+03 0.000000e+00
DUST 9388130 586 6288130 3100000 1.602070e+04 1.073060e+04 5.290102e+03
AVALANCHE 3721800 386 3721800 0 9.641969e+03 9.641969e+03 0.000000e+00

Based on the df_tidy_q2, let’s visualize it in a graph.

# Creating a auxiliary dataset to plot 1.
# It will be necessary to put the dataset in such a way to be helpful to ggplot2.
rbind(
    
    # Gathering the dataset to have absolute data in rows.
    df_tidy_q2 %>%
    select(EVTYPE, PROPDMGABS, CROPDMGABS) %>%
    mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE),
           PROPDMGABS = PROPDMGABS/1000000,
           CROPDMGABS = CROPDMGABS/1000000) %>%
    pivot_longer(cols = 2:3, names_to = "EXPENSES", values_to = "VALUES") %>%
    mutate(VALUE_TYPE = factor("ABSOLUTE")),
    
    # Gathering the dataset to average have average data in rows.
    df_tidy_q2 %>%
    arrange(desc(EXPENSES_AVG)) %>%
    select(EVTYPE, PROPDMGABS_AVG, CROPDMGABS_AVG) %>%
    mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE),
           PROPDMGABS_AVG = PROPDMGABS_AVG/1000000,
           CROPDMGABS_AVG = CROPDMGABS_AVG/1000000) %>%
    pivot_longer(cols = 2:3, names_to = "EXPENSES", values_to = "VALUES")%>%
    mutate(VALUE_TYPE = factor("AVERAGE"))) %>%
    
    # Plotting the graph using GGPLOT2
    ggplot(aes(fill = EXPENSES, y = VALUES, x = EVTYPE)) + 
    
    # Defining a stacked barplot.
    geom_bar(position = "stack", stat = "identity") +
    
    # Defining the graph title.
    ggplot2::labs(title = "Economic Consequence of Event Types") + 
    
    # Defining the x-axis label
    ggplot2::xlab("Event Type") +
    
    # Defining the y-axis label
    ggplot2::ylab("Economic Impact [Million USD]") +
    
    # Adjusting the x-axis to be 90 degrees.
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
          legend.position = "right",
          plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5)) +
    
    # Removing the legend title.
        ggplot2::guides(fill = ggplot2::guide_legend(title = "")) + 
    
    # Creating facet to show absolute and average values.
    facet_grid(rows = vars(VALUE_TYPE), scales = "free")

FLOOD is the event with the greatest economic consequence (most of it related to Properties Damages). Between 1950 and 2011, it causes Damages to Properties and Crops that reach more than USD 180 billion. However, the HURRICANE is one event type with the highest average in terms of damage. On average, one hurricane’s outcome is around USD 300 million in damages.

Finally, it is also important to point out that EXCESSIVE HEAT is the worst event related to Crops, which is almost USD 15.8 billion in Damages to Crops.