Course Project 2 Reproducible Research

📚 Specialization: Data Science: Foundations using R Specialization
📖 Course: Reproducible Research
- 🧑‍🏫 Instructor: Roger D Peng
📆 Week 4
- 🚦 Start: Friday, 24 June 2022
- 🏁 Finish: Tuesday, 28 June 2022

Impact of Weather Events on Public Health and Economics

Synopsis

According to the analysis performed to answer questions 1 and 2, the event type with the most harmful consequences is the TORNADO due to its high frequency and high average number of injuries and fatalities. On the other hand, HURRICANE and TSUNAMI do not have the highest number of harmful to the person. Also, both have a low probability of happening. From the given dataset, we have recorded 299 hurricanes and 20 tsunamis. On the other hand, there are more than 60 thousand tornadoes observations. Still, the average number of deaths and injuries is greater than the TORNADO, which we could interpret as both event types being more deadly.

Lastly, FLOOD is the event type with the most significant economic impact in absolute values, which reach more than USD 180 billion (most of it is related to Properties Damages). On the other hand, weather problems such as EXCESSIVE HEAT and EXCESSIVE COLDc also significantly impact economic affairs. For example, the EXCESSIVE HEAT has generated more than USD 15.8 billion in Damages to Crops. Another critical point is HURRICANE does not happen regularly. Still, when it happens, it causes severe Damage to Properties and Crops. Based on the given dataset, hurricanes average generate USD 300 million of Damage for each event.

1. Introduction

Course Project 2 explores the database from U.S. National Oceanic and Atmospheric Administration (NOAA) and will be used to create a Reproducible Research following the instructions and reviews criteria stated by the Course instructor. This reproducible research should be hosted on Rpubs and will be evaluated in a peer-graded assignment. In addition, Course Project 2 also aims to evaluate the economic impacts and the consequences caused by storms, hurricanes, tornadoes, floods, and other environmental disasters.

2. Objectivies

The objective of Course Project 2 aims to answer two questions:

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

The publication will present the answers to Course Project 2 questions in section 7. Results.

3. Requirements and Settings

Requirements necessary to reproduce this analysis:

# Loading libraries
library(ggplot2)
library(tidyverse)
library(rmarkdown)
library(magrittr)
library(kableExtra)
library(DT)

Following the Course Project 2 instruction, it is necessary to set everything up in English.

# Force the R environment to use English settings
Sys.setlocale("LC_ALL","English")

## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

Finally, if you want to reproduce it on your computer, I have printed below the sessionInfo(), which will help you to load all packages used to perform this publication.

## R version 4.2.0 (2022-04-22 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22000)
## 
## Matrix products: default
## 
## locale:
## [1] LC_COLLATE=English_United States.1252 
## [2] LC_CTYPE=English_United States.1252   
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C                          
## [5] LC_TIME=English_United States.1252    
## system code page: 65001
## 
## attached base packages:
## [1] stats     graphics  grDevices utils     datasets  methods   base     
## 
## other attached packages:
##  [1] DT_0.23          kableExtra_1.3.4 magrittr_2.0.3   forcats_0.5.1   
##  [5] stringr_1.4.0    dplyr_1.0.9      purrr_0.3.4      readr_2.1.2     
##  [9] tidyr_1.2.0      tibble_3.1.7     tidyverse_1.3.1  ggplot2_3.3.6   
## [13] rmarkdown_2.14  
## 
## loaded via a namespace (and not attached):
##  [1] svglite_2.1.0     lubridate_1.8.0   assertthat_0.2.1  digest_0.6.29    
##  [5] utf8_1.2.2        R6_2.5.1          cellranger_1.1.0  backports_1.4.1  
##  [9] reprex_2.0.1      evaluate_0.15     httr_1.4.3        pillar_1.7.0     
## [13] rlang_1.0.2       readxl_1.4.0      rstudioapi_0.13   jquerylib_0.1.4  
## [17] webshot_0.5.3     htmlwidgets_1.5.4 munsell_0.5.0     broom_0.8.0      
## [21] compiler_4.2.0    modelr_0.1.8      xfun_0.31         pkgconfig_2.0.3  
## [25] systemfonts_1.0.4 htmltools_0.5.2   tidyselect_1.1.2  codetools_0.2-18 
## [29] fansi_1.0.3       viridisLite_0.4.0 crayon_1.5.1      tzdb_0.3.0       
## [33] dbplyr_2.2.0      withr_2.5.0       grid_4.2.0        jsonlite_1.8.0   
## [37] gtable_0.3.0      lifecycle_1.0.1   DBI_1.1.3         scales_1.2.0     
## [41] cli_3.3.0         stringi_1.7.6     fs_1.5.2          xml2_1.3.3       
## [45] bslib_0.3.1       ellipsis_0.3.2    generics_0.1.2    vctrs_0.4.1      
## [49] tools_4.2.0       glue_1.6.2        hms_1.1.1         fastmap_1.1.0    
## [53] yaml_2.3.5        colorspace_2.0-3  rvest_1.0.2       knitr_1.39.3     
## [57] haven_2.5.0       sass_0.4.1

4. Data Processing

The data source used for this Course Project is from the U.S. National Oceanic and Atmospheric Administration’s (NOAA), this dataset has recorded data about storms, tornados, floods, and other environmental and natural disasters between 1950 and 2011. In addition, the dataset also includes the number of fatalities, injuries, and property damages.

4.1. Downloading

Creating a data folder to store the dataset from NOAA.

# 1. Create a data directory
if(!base::file.exists("data")) {
    base::dir.create("data")
}

# 2. Download files and store it in data directory.
if(!base::file.exists("./data/repdata_data_StormData.csv.bz2")){
    utils::download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                         destfile = "./data/repdata_data_StormData.csv.bz2")
}

4.2. Loading

Although the bz2 is a compressed file, the base system function read.csv can read it without uncompressing it.

# Loading the raw dataset.
raw_data <- utils::read.csv(file = "./data/repdata_data_StormData.csv.bz2")

Due to the facility to manage data frames using the tidyverse package, I have converted the raw_data object into a tibble class.

# Convert the regular data frame into a dplyr table
raw_data <- dplyr::tbl_df(raw_data)

5. Data Assessing

This section presents a few characteristics of the raw dataset, and does not have the intention of performing an exploratory data analysis. Instead, it is just a big picture of the raw dataset.

5.1. Dimensions

The raw dataset loaded from the bz2 file comprises 902.297 observations and 37 columns (variables), as it is possible to confirmed by the dim() function.

# Checking the observations and varibles in raw dataset.
base::dim(raw_data) # [Observations/rows] [variables/columns]

## [1] 902297     37

5.2. Head and Tail

The first 5 lines of the raw data set:

# Printing the first 5 rows.
head(raw_data,5)

## # A tibble: 5 x 37
##   STATE__ BGN_DATE   BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
##     <dbl> <chr>      <chr>    <chr>      <dbl> <chr>      <chr> <chr>      <dbl>
## 1       1 4/18/1950~ 0130     CST           97 MOBILE     AL    TORNA~         0
## 2       1 4/18/1950~ 0145     CST            3 BALDWIN    AL    TORNA~         0
## 3       1 2/20/1951~ 1600     CST           57 FAYETTE    AL    TORNA~         0
## 4       1 6/8/1951 ~ 0900     CST           89 MADISON    AL    TORNA~         0
## 5       1 11/15/195~ 1500     CST           43 CULLMAN    AL    TORNA~         0
## # ... with 28 more variables: BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>,
## #   END_TIME <chr>, COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>,
## #   END_AZI <chr>, END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>,
## #   MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>,
## #   PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>,
## #   STATEOFFIC <chr>, ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
## #   LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>

The last 5 lines of the raw data set:

# Printing the last 5 rows.
tail(raw_data,5)

## # A tibble: 5 x 37
##   STATE__ BGN_DATE   BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
##     <dbl> <chr>      <chr>    <chr>      <dbl> <chr>      <chr> <chr>      <dbl>
## 1      56 11/30/201~ 10:30:0~ MST            7 WYZ007 - ~ WY    HIGH ~         0
## 2      30 11/10/201~ 02:48:0~ MST            9 MTZ009 - ~ MT    HIGH ~         0
## 3       2 11/8/2011~ 02:58:0~ AKS          213 AKZ213     AK    HIGH ~         0
## 4       2 11/9/2011~ 10:21:0~ AKS          202 AKZ202     AK    BLIZZ~         0
## 5       1 11/28/201~ 08:00:0~ CST            6 ALZ006     AL    HEAVY~         0
## # ... with 28 more variables: BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>,
## #   END_TIME <chr>, COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>,
## #   END_AZI <chr>, END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>,
## #   MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>,
## #   PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>,
## #   STATEOFFIC <chr>, ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
## #   LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>

5.3. Variables

The dataset from NOAA has 37 variables, as seen from the str function.

# Printing the variables' names.
utils::str(raw_data)

## tibble [902,297 x 37] (S3: tbl_df/tbl/data.frame)
##  $ STATE__   : num [1:902297] 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr [1:902297] "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr [1:902297] "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr [1:902297] "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num [1:902297] 97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr [1:902297] "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr [1:902297] "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr [1:902297] "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr [1:902297] "" "" "" "" ...
##  $ BGN_LOCATI: chr [1:902297] "" "" "" "" ...
##  $ END_DATE  : chr [1:902297] "" "" "" "" ...
##  $ END_TIME  : chr [1:902297] "" "" "" "" ...
##  $ COUNTY_END: num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi [1:902297] NA NA NA NA NA NA ...
##  $ END_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr [1:902297] "" "" "" "" ...
##  $ END_LOCATI: chr [1:902297] "" "" "" "" ...
##  $ LENGTH    : num [1:902297] 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num [1:902297] 100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int [1:902297] 3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num [1:902297] 0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num [1:902297] 15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num [1:902297] 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr [1:902297] "K" "K" "K" "K" ...
##  $ CROPDMG   : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr [1:902297] "" "" "" "" ...
##  $ WFO       : chr [1:902297] "" "" "" "" ...
##  $ STATEOFFIC: chr [1:902297] "" "" "" "" ...
##  $ ZONENAMES : chr [1:902297] "" "" "" "" ...
##  $ LATITUDE  : num [1:902297] 3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num [1:902297] 8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num [1:902297] 3051 0 0 0 0 ...
##  $ LONGITUDE_: num [1:902297] 8806 0 0 0 0 ...
##  $ REMARKS   : chr [1:902297] "" "" "" "" ...
##  $ REFNUM    : num [1:902297] 1 2 3 4 5 6 7 8 9 10 ...

The read.csv() function did not correctly import the following variables:

BGN_DATE, END_DATE, BGN_TIME and END_TIME are Date class variables;

5.3. `NA` Presence

The variables LATITUDE and LATITUDE_E have the presence of NA observations.

# Printing the summary of LATITUDE.
summary(raw_data$LATITUDE)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    2802    3540    2875    4019    9706      47

# Printing the summary of LATITUDE_E.
summary(raw_data$LATITUDE_E)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       0       0    1452    3549    9706      40

All those NA observations are related to the AMERICAN SAMOA territory.

# Filtering the NA observations and finding the origins.
raw_data %>% 
    filter(is.na(LATITUDE) | is.na(LATITUDE_E)) %>% 
    select(STATEOFFIC) %>%
    unique()

## # A tibble: 1 x 1
##   STATEOFFIC    
##   <chr>         
## 1 AMERICAN SAMOA

5.4. Exponential Variable

The PROPDMGEXP and CROPDMGEXP (Properties Exponential and Crop Exponential) variables store the base 10 exponential value of PROPDMG and CROPDMG, respectively. For this reason, it is necessary to combine PROPDMG and PROPDMGEXP into a single variable and do the same to CROPDMG and CROPDMGEXP.

Example

[PROPDMG] * 10^[PROPDMGEXP]

1.5 * 10^[B] = 1.5 billion USD

Therefore, it is essential to understand how the PROPDMGEXP and CROPDMGEXP were recorded.

# Printing the unique values to exponential to Properties.
unique(raw_data$PROPDMGEXP)

##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"

# Printing the unique values to exponential to Crops.
unique(raw_data$CROPDMGEXP)

## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

Finally, the notation is not uniform because sometimes they use Capital or Lowercase (e.g., K or k). In other cases, they use a number or insert a question mark (?) or plus/minus signal (e.g., - or +).

PROPDMGEXP
- ?: Has 8 (eight) observations with this notation, none of the rows of PROPDM and CROPDMG has values;
- +: Has 5 (five) observations with this notation, and;
- -: Has 1 (one) observation with this notation and 1 (one) fatality. I will consider like “0”.
CROPDMGEXP
- ?: Has 7 (seven) observations with this notation, all those with CROPDMG as zero;

5.5. Event Type

The raw dataset has 985 Event Types, the code below will present the first 10.

raw_data %>%
    group_by(EVTYPE) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    head(10)

## # A tibble: 10 x 2
##    EVTYPE                  n
##    <chr>               <int>
##  1 HAIL               288661
##  2 TSTM WIND          219940
##  3 THUNDERSTORM WIND   82563
##  4 TORNADO             60652
##  5 FLASH FLOOD         54277
##  6 FLOOD               25326
##  7 THUNDERSTORM WINDS  20843
##  8 HIGH WIND           20212
##  9 LIGHTNING           15754
## 10 HEAVY SNOW          15708

Unfortunately, there is no standardization, as seen in TSTM WIND, THUNDERSTORM WIND, and THUNDERSTORM WINDS. For this reason, it is necessary to adjust this variable, consolidating those events using a single notation.

For further understanding of each variable, please visit the codebook.

6. Data Manipulation

Although I have observed problems in BGN_DATE, END_DATE, BGN_TIME and END_TIME for this Course Project 2, those variables will be not used for further analysis. For this reason, I will subset the raw dataset to work with part of it.

6.1. Subsetting the raw data

Based on the questions posted on 2. Objectivies, they are related to the most harmful concerning population health and its economic consequences. Thus, the raw dataset subset will select the following variables:

EVTYPE: Event type;
INJURIES: Quantity of injuries;
FATALITIES: Quantity of fatalities;
PROPDMG: Amount in USD in Damages in properties;
CROPDMG: Amount in USD in Damages in crops;
PROPDMGEXP: The exponential in base 10 used to notate the PROPDMG, and;
CROPDMGEXP: The exponential in base 10 used to notate the CROPDMG.

# Subsetting the raw dataset.
df_subset <- select(raw_data,EVTYPE,INJURIES,FATALITIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)

6.2. Merging Damage and Exponential Variables

It is necessary to merge PROPDMG and PROPDMGEXP into a single variable, so I will create the PROPDMGABS using the following rules.

All 0, and - will be converted into 1;
All 1 will be converted into 10;
All H, h and 2 will be converted into 100;
All K, k and 3 will be converted into 1000;
All 4 will be converted into 10000;
All 5 will be converted into 100000;
All M, m and 6 will be converted into 1000000;
All 7 will be converted into 10000000;
All 8 will be converted into 100000000;
All B, b and 9 will be converted into 1000000000;

Following the same principles, I will merge CROPDMG and CROPDMG to create CROPDMGABS:

All 0, and ? will be converted into 1;
All H, h and 2 will be converted into 100;
All K, k and 3 will be converted into 1000;
All M, m and 6 will be converted into 1000000;
All B, b and 9 will be converted into 1000000000;

The PROPDMGABS and CROPDMGABS will be in charge of storing the absolute values of Properties Damages and Crops Damages.

# Merging exponential column and value column.
df_subset_2 <- df_subset %>%
    
    # Creating a new column PROPDMGABS
    mutate(PROPDMGABS = case_when(
        PROPDMGEXP %in% c("", "-",0) ~ PROPDMG * 1,
        PROPDMGEXP %in% c(1)         ~ PROPDMG * 10,
        PROPDMGEXP %in% c("H","h",2) ~ PROPDMG * 100,
        PROPDMGEXP %in% c("K","k",3) ~ PROPDMG * 1000,
        PROPDMGEXP %in% c(4)         ~ PROPDMG * 10000,
        PROPDMGEXP %in% c(5)         ~ PROPDMG * 100000,
        PROPDMGEXP %in% c("M","m",6) ~ PROPDMG * 1000000,
        PROPDMGEXP %in% c(7)         ~ PROPDMG * 10000000,
        PROPDMGEXP %in% c(8)         ~ PROPDMG * 100000000,
        PROPDMGEXP %in% c("B","b",9) ~ PROPDMG * 1000000000)) %>%
    
    # Creating a new column CROPDMGABS
    mutate(CROPDMGABS = case_when(
        CROPDMGEXP %in% c("?","", 0)   ~ CROPDMG * 1,
        CROPDMGEXP %in% c("H", "h", 2) ~ CROPDMG * 100,
        CROPDMGEXP %in% c("K", "k", 3) ~ CROPDMG * 1000,
        CROPDMGEXP %in% c("M", "m", 6) ~ CROPDMG * 1000000,
        CROPDMGEXP %in% c("B", "b", 9) ~ CROPDMG * 1000000000))

6.3. Uniforming Event Types

The EVTYPE variable is not tidy because there are a lot of classifications with the same meaning but different notations. Hence, I will standardize this variable following the rules inside of the next chunk.

# Standardizing the EVTYPE variable.
df_subset_3 <- df_subset_2 %>%
    
    # Converting all value to be UPPERCASE.
    mutate(EVTYPE = stringr::str_to_upper(EVTYPE)) %>%
    
    # Rules to aggregate events into a new variable called EVTYPE_2
    mutate(EVTYPE_2 = case_when(
        grepl(pattern = "*HURRICANE*|*TYPHOON*", x = EVTYPE) ~ "HURRICANE",
        grepl(pattern = "*TORNADO*", x = EVTYPE) ~ "TORNADO",
        grepl(pattern = "*THUNDERSTORM WIN*|*TSTM*|*FUNNEL CLOUD*|*FUNNEL*|*THUNDERSTORM*", x = EVTYPE) ~ "THUNDERSTORM WIND",
        grepl(pattern = "*FLOOD*|*FLD*", x = EVTYPE) ~ "FLOOD",
        grepl(pattern = "*HAIL*", x = EVTYPE) ~ "HAIL",
        grepl(pattern = "*HIGH WIND*|*STRONG WIND*|WIND", x = EVTYPE) ~ "HIGH WIND",
        grepl(pattern = "*FOREST FIRE*|*WILD FIRES*|*WILDFIRE*", x = EVTYPE) ~ "FOREST FIRE",
        grepl(pattern = "*LIGHTNING*", x = EVTYPE) ~ "LIGHTNING",
        grepl(pattern = "*HEAT*|*DROUGHT*|*EXCESSIVE HEAT*|*RECORD HIGH*|*WARMTH*|*UNSEASONABLY WARM*|*UNSEASONABLY DRY*|*TEMPERATURE RECORD*|*RECORD TEMPERATURE*|*DRY*", x = EVTYPE) ~ "EXCESSIVE HEAT",
        grepl(pattern = "*WINTER WEATHER*|*FROST/FREEZE*|*EXTREME COLD*|*COLD*|*ICE*|*FREEZE*|*FROST*|*WINTRY MIX*|*LOW TEMPERATURE*", x = EVTYPE) ~ "EXCESSIVE COLD",
        grepl(pattern = "*WINTER STORM*|*SLEET*|*SNOW*|*BLIZZARD*|*FREEZING RAIN*", x = EVTYPE) ~ "ICE STORM",
        grepl(pattern = "*LANDSLIDE*|*MUDSLIDE*|*MUD SLIDE*", x = EVTYPE) ~ "LANDSLIDE",
        grepl(pattern = "*AVALANCHE*", x = EVTYPE) ~ "AVALANCHE",
        grepl(pattern = "*FOG*|*SMOKE*", x = EVTYPE) ~ "FOG",
        grepl(pattern = "*DUST*", x = EVTYPE) ~ "DUST",
        grepl(pattern = "*CURRENT*|*SURF*", x = EVTYPE) ~ "CURRENT",
        grepl(pattern = "*TSUNAMI*", x = EVTYPE) ~ "TSUNAMI",
        grepl(pattern = "*WATERSPOUT*", x = EVTYPE) ~ "WATERSPOUT")) %>%
    
    # Creating a OTHER type of event to aggregate minor events.
    mutate(EVTYPE_2 = case_when(
        is.na(EVTYPE_2) ~ "OTHER",
        !is.na(EVTYPE_2) ~ EVTYPE_2)) %>%
    
    # Dropping EVTYPE because it is not necessary to carry it.
    select(-EVTYPE) %>%
    
    # Renaming the EVTYPE_2 to be the new EVTYPE
    mutate(EVTYPE = EVTYPE_2)

After the standardization, the new EVTYPE has 19 categories.

# Printing the standards of Event Type
unique(df_subset_3$EVTYPE)

##  [1] "TORNADO"           "THUNDERSTORM WIND" "HAIL"             
##  [4] "EXCESSIVE COLD"    "ICE STORM"         "FLOOD"            
##  [7] "HURRICANE"         "EXCESSIVE HEAT"    "LIGHTNING"        
## [10] "FOG"               "CURRENT"           "HIGH WIND"        
## [13] "OTHER"             "WATERSPOUT"        "AVALANCHE"        
## [16] "DUST"              "FOREST FIRE"       "LANDSLIDE"        
## [19] "TSUNAMI"

6.3. Tidy dataset

The tidy dataset will be composite by the EVTYPE, INJURIES, FATALITIES, PROPDMGABS and CROPDMGABS.

# Creating the tidy dataset.
df_tidy <- df_subset_3 %>%
    
    # Dropping the columns: PROPDMGEXP, PROPDMG, CROPDMGEXP, CROPDMG
    select(-c(PROPDMGEXP, PROPDMG, CROPDMGEXP, CROPDMG))

7. Results

Question 1

Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?

Answer

I will define harmful as the summation of INJURIES and FATALITIES of each EVTYPE. Therefore, it is necessary to create a new variable, the so-called HARMFUL, to store these aggregation. I also will create 3 new variables to store the average of injuries (INJURIES_AVG), fatalities (FATALITIES_AVG), and harmful (HARMFUL_AVG) based on the number of observations (QTY) of each EVTYPE.

# This subset will creates averages of each EVTYPE and add the Harmful variable.
df_tidy_q1 <- df_tidy %>%
    
    # Creating the HARMFUL variable.
    mutate(HARMFUL = INJURIES + FATALITIES) %>%
    
    # Aggregate everything using EVTYPE.
    group_by(EVTYPE) %>%
    
    # Summarizes the data using the group by EVTYPE.
    summarise(HARMFUL = sum(HARMFUL),
              QTY = n(),
              FATALITIES = sum(FATALITIES),
              INJURIES = sum(INJURIES)) %>%
    
    # Arrange to be descendant.
    arrange(desc(HARMFUL)) %>%
    
    # Adding averages variables.
    mutate(HARMFUL_AVG = HARMFUL/QTY,
           FATALITIES_AVG = FATALITIES/QTY,
           INJURIES_AVG = INJURIES/QTY)

The adjusted dataset will be shown below.

# Printing the 
head(df_tidy_q1, 19) %>% kableExtra::kbl() %>% kableExtra::kable_styling()

EVTYPE	HARMFUL	QTY	FATALITIES	INJURIES	HARMFUL_AVG	FATALITIES_AVG	INJURIES_AVG
TORNADO	97068	60700	5661	91407	1.5991433	0.0932619	1.5058814
EXCESSIVE HEAT	14132	33788	3467	10665	0.4182550	0.1026104	0.3156446
THUNDERSTORM WIND	10276	343796	729	9547	0.0298898	0.0021204	0.0277694
FLOOD	10236	86133	1553	8683	0.1188395	0.0180303	0.1008092
LIGHTNING	6048	15763	817	5231	0.3836833	0.0518302	0.3318531
EXCESSIVE COLD	3972	14341	483	3489	0.2769681	0.0336797	0.2432885
HIGH WIND	2690	28173	696	1994	0.0954815	0.0247045	0.0707770
ICE STORM	2565	15973	335	2230	0.1605835	0.0209729	0.1396106
FOREST FIRE	1696	4232	90	1606	0.4007561	0.0212665	0.3794896
HURRICANE	1468	299	135	1333	4.9096990	0.4515050	4.4581940
CURRENT	1442	1937	712	730	0.7444502	0.3675787	0.3768715
HAIL	1386	289281	15	1371	0.0047912	0.0000519	0.0047393
FOG	1156	1855	80	1076	0.6231806	0.0431267	0.5800539
DUST	507	586	24	483	0.8651877	0.0409556	0.8242321
AVALANCHE	394	386	224	170	1.0207254	0.5803109	0.4404145
OTHER	344	546	44	300	0.6300366	0.0805861	0.5494505
TSUNAMI	162	20	33	129	8.1000000	1.6500000	6.4500000
LANDSLIDE	99	643	44	55	0.1539658	0.0684292	0.0855365
WATERSPOUT	32	3845	3	29	0.0083225	0.0007802	0.0075423

Based on the df_tidy_q1, let’s visualize it in a graph.

# Creating a auxiliary dataset to plot 1.
# It will be necessary to put the dataset in such a way to be helpful to ggplot2.
rbind(
    
    # Gathering the dataset to have absolute data in rows.
    df_tidy_q1 %>%
    select(EVTYPE, FATALITIES, INJURIES) %>%
    mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE)) %>%
    pivot_longer(cols = 2:3, names_to = "HARMFUL", values_to = "VALUES") %>%
    mutate(VALUE_TYPE = factor("ABSOLUTE")),
    
    # Gathering the dataset to average have average data in rows.
    df_tidy_q1 %>%
    arrange(desc(HARMFUL_AVG)) %>%
    select(EVTYPE, FATALITIES_AVG, INJURIES_AVG) %>%
    mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE)) %>%
    pivot_longer(cols = 2:3, names_to = "HARMFUL", values_to = "VALUES")%>%
    mutate(VALUE_TYPE = factor("AVERAGE"))) %>%
    
    # Plotting the graph using GGPLOT2
    ggplot(aes(fill = HARMFUL, y = VALUES, x = EVTYPE)) + 
    
    # Defining a stacked barplot.
    geom_bar(position = "stack", stat = "identity") +
    
    # Defining the graph title.
    ggplot2::labs(title = "Most Harmful Event Type") + 
    
    # Defining the x-axis label
    ggplot2::xlab("Event Type") +
    
    # Defining the y-axis label
    ggplot2::ylab("Health Harmful") +
    
    # Adjusting the x-axis to be 90 degrees.
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
          legend.position = "right",
          plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5)) +
    
    # Removing the legend title.
        ggplot2::guides(fill = ggplot2::guide_legend(title = "")) + 
    
    # Creating facet to show absolute and average values.
    facet_grid(rows = vars(VALUE_TYPE), scales = "free")

In term of absolute value, there are no doubt that TORNADO events are the most harmful to health due to the high frequency of tornadoes. The number of tornadoes observed in the dataset is more than 60 thousand (between 1950 and 2011), which results in 5.661 FATALITIES, 91.407 INJURIES, and 97.068 health HARMFUL.

However, whether you analyze it in terms of average, in that case, the TSUNAMI event type has the highest averages of INJURIES and FATALITIES, resulting in increased health HARMFUL. The meaning of seeing the average is to identify the most deadly event type.

Question 2

Across the United States, which types of events have the greatest economic consequences?

Answer

Following the same principles used in Question 1, I will define as economic consequence the sum of Properties Damages (PROPDMGABS) and Crop Damages (CROPDMGABS). Also, I will compare the average of PROPDMGABS and CROPDMGABS to analyze which one has the higher economic impact.

# This subset will creates averages of each EVTYPE and add the Harmful variable.
df_tidy_q2 <- df_tidy %>%
    
    # Creating the HARMFUL variable.
    mutate(EXPENSES = PROPDMGABS + CROPDMGABS) %>%
    
    # Aggregate everything using EVTYPE.
    group_by(EVTYPE) %>%
    
    # Summarizes the data using the group by EVTYPE.
    summarise(EXPENSES = sum(EXPENSES, na.rm = TRUE),
              QTY = n(),
              PROPDMGABS = sum(PROPDMGABS, na.rm = TRUE),
              CROPDMGABS = sum(CROPDMGABS, na.rm = TRUE)) %>%
    
    # Arrange to be descendant.
    arrange(desc(EXPENSES)) %>%
    
    # Adding averages variables.
    mutate(EXPENSES_AVG = EXPENSES/QTY,
           PROPDMGABS_AVG = PROPDMGABS/QTY,
           CROPDMGABS_AVG = CROPDMGABS/QTY)

The adjusted dataset will be shown below.

# Printing the 
head(df_tidy_q2, 19) %>% kableExtra::kbl() %>% kableExtra::kable_styling()

EVTYPE	EXPENSES	QTY	PROPDMGABS	CROPDMGABS	EXPENSES_AVG	PROPDMGABS_AVG	CROPDMGABS_AVG
FLOOD	180658667663	86133	168270100463	12388567200	2.097438e+06	1.953608e+06	1.438307e+05
HURRICANE	90872527810	299	85356410010	5516117800	3.039215e+08	2.854729e+08	1.844855e+07
TORNADO	59020779387	60700	58603317867	417461520	9.723357e+05	9.654583e+05	6.877455e+03
CURRENT	48056407000	1937	48055552000	855000	2.480971e+07	2.480927e+07	4.414042e+02
EXCESSIVE HEAT	21065038320	33788	5257925140	15807113180	6.234473e+05	1.556152e+05	4.678322e+05
EXCESSIVE COLD	21015496360	14341	11877309560	9138186800	1.465414e+06	8.282065e+05	6.372071e+05
HAIL	19024452136	289281	15977564513	3046887623	6.576461e+04	5.523199e+04	1.053262e+04
THUNDERSTORM WIND	12456657288	343796	11184948300	1271708988	3.623270e+04	3.253368e+04	3.699022e+03
FOREST FIRE	8899845130	4232	8496563500	403281630	2.102988e+06	2.007695e+06	9.529339e+04
ICE STORM	7551932751	15973	7412418751	139514000	4.727936e+05	4.640593e+05	8.734364e+03
HIGH WIND	7048888488	28173	6268408088	780480400	2.502001e+05	2.224970e+05	2.770313e+04
LIGHTNING	952534520	15763	940442430	12092090	6.042850e+04	5.966139e+04	7.671186e+02
LANDSLIDE	346693100	643	326676100	20017000	5.391806e+05	5.080499e+05	3.113064e+04
OTHER	169455250	546	10666300	158788950	3.103576e+05	1.953535e+04	2.908223e+05
TSUNAMI	144082000	20	144062000	20000	7.204100e+06	7.203100e+06	1.000000e+03
FOG	22929500	1855	22929500	0	1.236092e+04	1.236092e+04	0.000000e+00
WATERSPOUT	9564200	3845	9564200	0	2.487438e+03	2.487438e+03	0.000000e+00
DUST	9388130	586	6288130	3100000	1.602070e+04	1.073060e+04	5.290102e+03
AVALANCHE	3721800	386	3721800	0	9.641969e+03	9.641969e+03	0.000000e+00

Based on the df_tidy_q2, let’s visualize it in a graph.

# Creating a auxiliary dataset to plot 1.
# It will be necessary to put the dataset in such a way to be helpful to ggplot2.
rbind(
    
    # Gathering the dataset to have absolute data in rows.
    df_tidy_q2 %>%
    select(EVTYPE, PROPDMGABS, CROPDMGABS) %>%
    mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE),
           PROPDMGABS = PROPDMGABS/1000000,
           CROPDMGABS = CROPDMGABS/1000000) %>%
    pivot_longer(cols = 2:3, names_to = "EXPENSES", values_to = "VALUES") %>%
    mutate(VALUE_TYPE = factor("ABSOLUTE")),
    
    # Gathering the dataset to average have average data in rows.
    df_tidy_q2 %>%
    arrange(desc(EXPENSES_AVG)) %>%
    select(EVTYPE, PROPDMGABS_AVG, CROPDMGABS_AVG) %>%
    mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE),
           PROPDMGABS_AVG = PROPDMGABS_AVG/1000000,
           CROPDMGABS_AVG = CROPDMGABS_AVG/1000000) %>%
    pivot_longer(cols = 2:3, names_to = "EXPENSES", values_to = "VALUES")%>%
    mutate(VALUE_TYPE = factor("AVERAGE"))) %>%
    
    # Plotting the graph using GGPLOT2
    ggplot(aes(fill = EXPENSES, y = VALUES, x = EVTYPE)) + 
    
    # Defining a stacked barplot.
    geom_bar(position = "stack", stat = "identity") +
    
    # Defining the graph title.
    ggplot2::labs(title = "Economic Consequence of Event Types") + 
    
    # Defining the x-axis label
    ggplot2::xlab("Event Type") +
    
    # Defining the y-axis label
    ggplot2::ylab("Economic Impact [Million USD]") +
    
    # Adjusting the x-axis to be 90 degrees.
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
          legend.position = "right",
          plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5)) +
    
    # Removing the legend title.
        ggplot2::guides(fill = ggplot2::guide_legend(title = "")) + 
    
    # Creating facet to show absolute and average values.
    facet_grid(rows = vars(VALUE_TYPE), scales = "free")

FLOOD is the event with the greatest economic consequence (most of it related to Properties Damages). Between 1950 and 2011, it causes Damages to Properties and Crops that reach more than USD 180 billion. However, the HURRICANE is one event type with the highest average in terms of damage. On average, one hurricane’s outcome is around USD 300 million in damages.

Finally, it is also important to point out that EXCESSIVE HEAT is the worst event related to Crops, which is almost USD 15.8 billion in Damages to Crops.

`Course Project 2` Reproducible Research

👨🏻‍💻 Anderson H Uyekita

Impact of Weather Events on Public Health and Economics

Synopsis

1. Introduction

2. Objectivies

3. Requirements and Settings

4. Data Processing

4.1. Downloading

4.2. Loading

5. Data Assessing

5.1. Dimensions

5.2. Head and Tail

5.3. Variables

5.3. `NA` Presence

5.4. Exponential Variable

5.5. Event Type

6. Data Manipulation

6.1. Subsetting the raw data

6.2. Merging Damage and Exponential Variables

6.3. Uniforming Event Types

6.3. Tidy dataset

7. Results

Question 1

Question 2

Course Project 2 Reproducible Research

👨🏻‍💻 Anderson H Uyekita

Impact of Weather Events on Public Health and Economics

Synopsis

1. Introduction

2. Objectivies

3. Requirements and Settings

4. Data Processing

4.1. Downloading

4.2. Loading

5. Data Assessing

5.1. Dimensions

5.2. Head and Tail

5.3. Variables

5.3. NA Presence

5.4. Exponential Variable

5.5. Event Type

6. Data Manipulation

6.1. Subsetting the raw data

6.2. Merging Damage and Exponential Variables

6.3. Uniforming Event Types

6.3. Tidy dataset

7. Results

Question 1

Question 2

`Course Project 2` Reproducible Research

5.3. `NA` Presence