Course Project 2 Reproducible
ResearchAccording to the analysis performed to answer questions 1 and 2, the
event type with the most harmful consequences is the
TORNADO due to its high frequency and high average number
of injuries and fatalities. On the other hand, HURRICANE
and TSUNAMI do not have the highest number of harmful to
the person. Also, both have a low probability of happening. From the
given dataset, we have recorded 299 hurricanes and 20 tsunamis. On the
other hand, there are more than 60 thousand tornadoes observations.
Still, the average number of deaths and injuries is greater than the
TORNADO, which we could interpret as both event types being
more deadly.
Lastly, FLOOD is the event type with the most
significant economic impact in absolute values, which reach more than
USD 180 billion (most of it is related to Properties Damages). On the
other hand, weather problems such as EXCESSIVE HEAT and
EXCESSIVE COLDc also significantly impact economic affairs.
For example, the EXCESSIVE HEAT has generated more than USD
15.8 billion in Damages to Crops. Another critical point is HURRICANE
does not happen regularly. Still, when it happens, it causes severe
Damage to Properties and Crops. Based on the given dataset, hurricanes
average generate USD 300 million of Damage for each event.
Course Project 2 explores the database from U.S. National Oceanic and Atmospheric Administration (NOAA) and will be used to create a Reproducible Research following the instructions and reviews criteria stated by the Course instructor. This reproducible research should be hosted on Rpubs and will be evaluated in a peer-graded assignment. In addition, Course Project 2 also aims to evaluate the economic impacts and the consequences caused by storms, hurricanes, tornadoes, floods, and other environmental disasters.
The objective of Course Project 2 aims to answer two questions:
EVTYPE variable) are most harmful with respect to
population health?The publication will present the answers to Course Project 2 questions in section 7. Results.
Requirements necessary to reproduce this analysis:
# Loading libraries
library(ggplot2)
library(tidyverse)
library(rmarkdown)
library(magrittr)
library(kableExtra)
library(DT)
Following the Course Project 2 instruction, it is necessary to set everything up in English.
# Force the R environment to use English settings
Sys.setlocale("LC_ALL","English")
## [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
Finally, if you want to reproduce it on your computer, I have printed
below the sessionInfo(), which will help you to load all
packages used to perform this publication.
## R version 4.2.0 (2022-04-22 ucrt)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 22000)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
## system code page: 65001
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] DT_0.23 kableExtra_1.3.4 magrittr_2.0.3 forcats_0.5.1
## [5] stringr_1.4.0 dplyr_1.0.9 purrr_0.3.4 readr_2.1.2
## [9] tidyr_1.2.0 tibble_3.1.7 tidyverse_1.3.1 ggplot2_3.3.6
## [13] rmarkdown_2.14
##
## loaded via a namespace (and not attached):
## [1] svglite_2.1.0 lubridate_1.8.0 assertthat_0.2.1 digest_0.6.29
## [5] utf8_1.2.2 R6_2.5.1 cellranger_1.1.0 backports_1.4.1
## [9] reprex_2.0.1 evaluate_0.15 httr_1.4.3 pillar_1.7.0
## [13] rlang_1.0.2 readxl_1.4.0 rstudioapi_0.13 jquerylib_0.1.4
## [17] webshot_0.5.3 htmlwidgets_1.5.4 munsell_0.5.0 broom_0.8.0
## [21] compiler_4.2.0 modelr_0.1.8 xfun_0.31 pkgconfig_2.0.3
## [25] systemfonts_1.0.4 htmltools_0.5.2 tidyselect_1.1.2 codetools_0.2-18
## [29] fansi_1.0.3 viridisLite_0.4.0 crayon_1.5.1 tzdb_0.3.0
## [33] dbplyr_2.2.0 withr_2.5.0 grid_4.2.0 jsonlite_1.8.0
## [37] gtable_0.3.0 lifecycle_1.0.1 DBI_1.1.3 scales_1.2.0
## [41] cli_3.3.0 stringi_1.7.6 fs_1.5.2 xml2_1.3.3
## [45] bslib_0.3.1 ellipsis_0.3.2 generics_0.1.2 vctrs_0.4.1
## [49] tools_4.2.0 glue_1.6.2 hms_1.1.1 fastmap_1.1.0
## [53] yaml_2.3.5 colorspace_2.0-3 rvest_1.0.2 knitr_1.39.3
## [57] haven_2.5.0 sass_0.4.1
The data source used for this Course Project is from the U.S. National Oceanic and Atmospheric Administration’s (NOAA), this dataset has recorded data about storms, tornados, floods, and other environmental and natural disasters between 1950 and 2011. In addition, the dataset also includes the number of fatalities, injuries, and property damages.
Creating a data folder to store the dataset from NOAA.
# 1. Create a data directory
if(!base::file.exists("data")) {
base::dir.create("data")
}
# 2. Download files and store it in data directory.
if(!base::file.exists("./data/repdata_data_StormData.csv.bz2")){
utils::download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile = "./data/repdata_data_StormData.csv.bz2")
}
Although the bz2 is a compressed file, the base system function
read.csv can read it without uncompressing it.
# Loading the raw dataset.
raw_data <- utils::read.csv(file = "./data/repdata_data_StormData.csv.bz2")
Due to the facility to manage data frames using the
tidyverse package, I have converted the
raw_data object into a tibble class.
# Convert the regular data frame into a dplyr table
raw_data <- dplyr::tbl_df(raw_data)
This section presents a few characteristics of the raw dataset, and does not have the intention of performing an exploratory data analysis. Instead, it is just a big picture of the raw dataset.
The raw dataset loaded from the bz2 file comprises 902.297
observations and 37 columns (variables), as it is possible to confirmed
by the dim() function.
# Checking the observations and varibles in raw dataset.
base::dim(raw_data) # [Observations/rows] [variables/columns]
## [1] 902297 37
The first 5 lines of the raw data set:
# Printing the first 5 rows.
head(raw_data,5)
## # A tibble: 5 x 37
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 1 4/18/1950~ 0130 CST 97 MOBILE AL TORNA~ 0
## 2 1 4/18/1950~ 0145 CST 3 BALDWIN AL TORNA~ 0
## 3 1 2/20/1951~ 1600 CST 57 FAYETTE AL TORNA~ 0
## 4 1 6/8/1951 ~ 0900 CST 89 MADISON AL TORNA~ 0
## 5 1 11/15/195~ 1500 CST 43 CULLMAN AL TORNA~ 0
## # ... with 28 more variables: BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>,
## # END_TIME <chr>, COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>,
## # END_AZI <chr>, END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>,
## # MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>,
## # PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>,
## # STATEOFFIC <chr>, ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
## # LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>
The last 5 lines of the raw data set:
# Printing the last 5 rows.
tail(raw_data,5)
## # A tibble: 5 x 37
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 56 11/30/201~ 10:30:0~ MST 7 WYZ007 - ~ WY HIGH ~ 0
## 2 30 11/10/201~ 02:48:0~ MST 9 MTZ009 - ~ MT HIGH ~ 0
## 3 2 11/8/2011~ 02:58:0~ AKS 213 AKZ213 AK HIGH ~ 0
## 4 2 11/9/2011~ 10:21:0~ AKS 202 AKZ202 AK BLIZZ~ 0
## 5 1 11/28/201~ 08:00:0~ CST 6 ALZ006 AL HEAVY~ 0
## # ... with 28 more variables: BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>,
## # END_TIME <chr>, COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>,
## # END_AZI <chr>, END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>,
## # MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>,
## # PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>,
## # STATEOFFIC <chr>, ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
## # LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>
The dataset from NOAA has 37 variables, as seen from the
str function.
# Printing the variables' names.
utils::str(raw_data)
## tibble [902,297 x 37] (S3: tbl_df/tbl/data.frame)
## $ STATE__ : num [1:902297] 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr [1:902297] "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr [1:902297] "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr [1:902297] "CST" "CST" "CST" "CST" ...
## $ COUNTY : num [1:902297] 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr [1:902297] "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr [1:902297] "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr [1:902297] "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr [1:902297] "" "" "" "" ...
## $ BGN_LOCATI: chr [1:902297] "" "" "" "" ...
## $ END_DATE : chr [1:902297] "" "" "" "" ...
## $ END_TIME : chr [1:902297] "" "" "" "" ...
## $ COUNTY_END: num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi [1:902297] NA NA NA NA NA NA ...
## $ END_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr [1:902297] "" "" "" "" ...
## $ END_LOCATI: chr [1:902297] "" "" "" "" ...
## $ LENGTH : num [1:902297] 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num [1:902297] 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int [1:902297] 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num [1:902297] 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num [1:902297] 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num [1:902297] 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr [1:902297] "K" "K" "K" "K" ...
## $ CROPDMG : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr [1:902297] "" "" "" "" ...
## $ WFO : chr [1:902297] "" "" "" "" ...
## $ STATEOFFIC: chr [1:902297] "" "" "" "" ...
## $ ZONENAMES : chr [1:902297] "" "" "" "" ...
## $ LATITUDE : num [1:902297] 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num [1:902297] 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num [1:902297] 3051 0 0 0 0 ...
## $ LONGITUDE_: num [1:902297] 8806 0 0 0 0 ...
## $ REMARKS : chr [1:902297] "" "" "" "" ...
## $ REFNUM : num [1:902297] 1 2 3 4 5 6 7 8 9 10 ...
The read.csv() function did not correctly import the
following variables:
BGN_DATE, END_DATE, BGN_TIME
and END_TIME are Date class variables;NA PresenceThe variables LATITUDE and LATITUDE_E have the presence of
NA observations.
# Printing the summary of LATITUDE.
summary(raw_data$LATITUDE)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 2802 3540 2875 4019 9706 47
# Printing the summary of LATITUDE_E.
summary(raw_data$LATITUDE_E)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 0 1452 3549 9706 40
All those NA observations are related to the
AMERICAN SAMOA territory.
# Filtering the NA observations and finding the origins.
raw_data %>%
filter(is.na(LATITUDE) | is.na(LATITUDE_E)) %>%
select(STATEOFFIC) %>%
unique()
## # A tibble: 1 x 1
## STATEOFFIC
## <chr>
## 1 AMERICAN SAMOA
The PROPDMGEXP and CROPDMGEXP (Properties
Exponential and Crop Exponential) variables store the base 10
exponential value of PROPDMG and CROPDMG,
respectively. For this reason, it is necessary to combine
PROPDMG and PROPDMGEXP into a single variable
and do the same to CROPDMG and CROPDMGEXP.
Example
[PROPDMG] * 10^[PROPDMGEXP]
1.5 * 10^[B] = 1.5 billion USD
Therefore, it is essential to understand how the
PROPDMGEXP and CROPDMGEXP were recorded.
# Printing the unique values to exponential to Properties.
unique(raw_data$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
# Printing the unique values to exponential to Crops.
unique(raw_data$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
Finally, the notation is not uniform because sometimes they use
Capital or Lowercase (e.g., K or k). In other
cases, they use a number or insert a question mark (?) or
plus/minus signal (e.g., - or +).
PROPDMGEXP
?: Has 8 (eight) observations with this notation, none
of the rows of PROPDM and CROPDMG has
values;+: Has 5 (five) observations with this notation,
and;-: Has 1 (one) observation with this notation and 1
(one) fatality. I will consider like “0”.CROPDMGEXP
?: Has 7 (seven) observations with this notation, all
those with CROPDMG as zero;The raw dataset has 985 Event Types, the code below will present the first 10.
raw_data %>%
group_by(EVTYPE) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>%
head(10)
## # A tibble: 10 x 2
## EVTYPE n
## <chr> <int>
## 1 HAIL 288661
## 2 TSTM WIND 219940
## 3 THUNDERSTORM WIND 82563
## 4 TORNADO 60652
## 5 FLASH FLOOD 54277
## 6 FLOOD 25326
## 7 THUNDERSTORM WINDS 20843
## 8 HIGH WIND 20212
## 9 LIGHTNING 15754
## 10 HEAVY SNOW 15708
Unfortunately, there is no standardization, as seen in
TSTM WIND, THUNDERSTORM WIND, and
THUNDERSTORM WINDS. For this reason, it is necessary to
adjust this variable, consolidating those events using a single
notation.
For further understanding of each variable, please visit the codebook.
Although I have observed problems in BGN_DATE,
END_DATE, BGN_TIME and END_TIME
for this Course Project 2, those variables will be not used for further
analysis. For this reason, I will subset the raw dataset to work with
part of it.
Based on the questions posted on 2. Objectivies, they are related to the most harmful concerning population health and its economic consequences. Thus, the raw dataset subset will select the following variables:
EVTYPE: Event type;INJURIES: Quantity of injuries;FATALITIES: Quantity of fatalities;PROPDMG: Amount in USD in Damages in properties;CROPDMG: Amount in USD in Damages in crops;PROPDMGEXP: The exponential in base 10 used to notate
the PROPDMG, and;CROPDMGEXP: The exponential in base 10 used to notate
the CROPDMG.# Subsetting the raw dataset.
df_subset <- select(raw_data,EVTYPE,INJURIES,FATALITIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)
It is necessary to merge PROPDMG and
PROPDMGEXP into a single variable, so I will create the
PROPDMGABS using the following rules.
0, and - will be
converted into 1;1 will be converted into 10;H, h and 2 will be
converted into 100;K, k and 3 will be
converted into 1000;4 will be converted into 10000;5 will be converted into 100000;M, m and 6 will be
converted into 1000000;7 will be converted into 10000000;8 will be converted into 100000000;B, b and 9 will be
converted into 1000000000;Following the same principles, I will merge CROPDMG and
CROPDMG to create CROPDMGABS:
0, and ? will be
converted into 1;H, h and 2 will be
converted into 100;K, k and 3 will be
converted into 1000;M, m and 6 will be
converted into 1000000;B, b and 9 will be
converted into 1000000000;The PROPDMGABS and CROPDMGABS will be in
charge of storing the absolute values of Properties Damages and Crops
Damages.
# Merging exponential column and value column.
df_subset_2 <- df_subset %>%
# Creating a new column PROPDMGABS
mutate(PROPDMGABS = case_when(
PROPDMGEXP %in% c("", "-",0) ~ PROPDMG * 1,
PROPDMGEXP %in% c(1) ~ PROPDMG * 10,
PROPDMGEXP %in% c("H","h",2) ~ PROPDMG * 100,
PROPDMGEXP %in% c("K","k",3) ~ PROPDMG * 1000,
PROPDMGEXP %in% c(4) ~ PROPDMG * 10000,
PROPDMGEXP %in% c(5) ~ PROPDMG * 100000,
PROPDMGEXP %in% c("M","m",6) ~ PROPDMG * 1000000,
PROPDMGEXP %in% c(7) ~ PROPDMG * 10000000,
PROPDMGEXP %in% c(8) ~ PROPDMG * 100000000,
PROPDMGEXP %in% c("B","b",9) ~ PROPDMG * 1000000000)) %>%
# Creating a new column CROPDMGABS
mutate(CROPDMGABS = case_when(
CROPDMGEXP %in% c("?","", 0) ~ CROPDMG * 1,
CROPDMGEXP %in% c("H", "h", 2) ~ CROPDMG * 100,
CROPDMGEXP %in% c("K", "k", 3) ~ CROPDMG * 1000,
CROPDMGEXP %in% c("M", "m", 6) ~ CROPDMG * 1000000,
CROPDMGEXP %in% c("B", "b", 9) ~ CROPDMG * 1000000000))
The EVTYPE variable is not tidy because there are a lot of classifications with the same meaning but different notations. Hence, I will standardize this variable following the rules inside of the next chunk.
# Standardizing the EVTYPE variable.
df_subset_3 <- df_subset_2 %>%
# Converting all value to be UPPERCASE.
mutate(EVTYPE = stringr::str_to_upper(EVTYPE)) %>%
# Rules to aggregate events into a new variable called EVTYPE_2
mutate(EVTYPE_2 = case_when(
grepl(pattern = "*HURRICANE*|*TYPHOON*", x = EVTYPE) ~ "HURRICANE",
grepl(pattern = "*TORNADO*", x = EVTYPE) ~ "TORNADO",
grepl(pattern = "*THUNDERSTORM WIN*|*TSTM*|*FUNNEL CLOUD*|*FUNNEL*|*THUNDERSTORM*", x = EVTYPE) ~ "THUNDERSTORM WIND",
grepl(pattern = "*FLOOD*|*FLD*", x = EVTYPE) ~ "FLOOD",
grepl(pattern = "*HAIL*", x = EVTYPE) ~ "HAIL",
grepl(pattern = "*HIGH WIND*|*STRONG WIND*|WIND", x = EVTYPE) ~ "HIGH WIND",
grepl(pattern = "*FOREST FIRE*|*WILD FIRES*|*WILDFIRE*", x = EVTYPE) ~ "FOREST FIRE",
grepl(pattern = "*LIGHTNING*", x = EVTYPE) ~ "LIGHTNING",
grepl(pattern = "*HEAT*|*DROUGHT*|*EXCESSIVE HEAT*|*RECORD HIGH*|*WARMTH*|*UNSEASONABLY WARM*|*UNSEASONABLY DRY*|*TEMPERATURE RECORD*|*RECORD TEMPERATURE*|*DRY*", x = EVTYPE) ~ "EXCESSIVE HEAT",
grepl(pattern = "*WINTER WEATHER*|*FROST/FREEZE*|*EXTREME COLD*|*COLD*|*ICE*|*FREEZE*|*FROST*|*WINTRY MIX*|*LOW TEMPERATURE*", x = EVTYPE) ~ "EXCESSIVE COLD",
grepl(pattern = "*WINTER STORM*|*SLEET*|*SNOW*|*BLIZZARD*|*FREEZING RAIN*", x = EVTYPE) ~ "ICE STORM",
grepl(pattern = "*LANDSLIDE*|*MUDSLIDE*|*MUD SLIDE*", x = EVTYPE) ~ "LANDSLIDE",
grepl(pattern = "*AVALANCHE*", x = EVTYPE) ~ "AVALANCHE",
grepl(pattern = "*FOG*|*SMOKE*", x = EVTYPE) ~ "FOG",
grepl(pattern = "*DUST*", x = EVTYPE) ~ "DUST",
grepl(pattern = "*CURRENT*|*SURF*", x = EVTYPE) ~ "CURRENT",
grepl(pattern = "*TSUNAMI*", x = EVTYPE) ~ "TSUNAMI",
grepl(pattern = "*WATERSPOUT*", x = EVTYPE) ~ "WATERSPOUT")) %>%
# Creating a OTHER type of event to aggregate minor events.
mutate(EVTYPE_2 = case_when(
is.na(EVTYPE_2) ~ "OTHER",
!is.na(EVTYPE_2) ~ EVTYPE_2)) %>%
# Dropping EVTYPE because it is not necessary to carry it.
select(-EVTYPE) %>%
# Renaming the EVTYPE_2 to be the new EVTYPE
mutate(EVTYPE = EVTYPE_2)
After the standardization, the new EVTYPE has 19
categories.
# Printing the standards of Event Type
unique(df_subset_3$EVTYPE)
## [1] "TORNADO" "THUNDERSTORM WIND" "HAIL"
## [4] "EXCESSIVE COLD" "ICE STORM" "FLOOD"
## [7] "HURRICANE" "EXCESSIVE HEAT" "LIGHTNING"
## [10] "FOG" "CURRENT" "HIGH WIND"
## [13] "OTHER" "WATERSPOUT" "AVALANCHE"
## [16] "DUST" "FOREST FIRE" "LANDSLIDE"
## [19] "TSUNAMI"
The tidy dataset will be composite by the EVTYPE,
INJURIES, FATALITIES, PROPDMGABS
and CROPDMGABS.
# Creating the tidy dataset.
df_tidy <- df_subset_3 %>%
# Dropping the columns: PROPDMGEXP, PROPDMG, CROPDMGEXP, CROPDMG
select(-c(PROPDMGEXP, PROPDMG, CROPDMGEXP, CROPDMG))
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Answer
I will define harmful as the summation of
INJURIES and FATALITIES of each
EVTYPE. Therefore, it is necessary to create a new
variable, the so-called HARMFUL, to store these
aggregation. I also will create 3 new variables to store the average of
injuries (INJURIES_AVG), fatalities
(FATALITIES_AVG), and harmful (HARMFUL_AVG)
based on the number of observations (QTY) of each
EVTYPE.
# This subset will creates averages of each EVTYPE and add the Harmful variable.
df_tidy_q1 <- df_tidy %>%
# Creating the HARMFUL variable.
mutate(HARMFUL = INJURIES + FATALITIES) %>%
# Aggregate everything using EVTYPE.
group_by(EVTYPE) %>%
# Summarizes the data using the group by EVTYPE.
summarise(HARMFUL = sum(HARMFUL),
QTY = n(),
FATALITIES = sum(FATALITIES),
INJURIES = sum(INJURIES)) %>%
# Arrange to be descendant.
arrange(desc(HARMFUL)) %>%
# Adding averages variables.
mutate(HARMFUL_AVG = HARMFUL/QTY,
FATALITIES_AVG = FATALITIES/QTY,
INJURIES_AVG = INJURIES/QTY)
The adjusted dataset will be shown below.
# Printing the
head(df_tidy_q1, 19) %>% kableExtra::kbl() %>% kableExtra::kable_styling()
| EVTYPE | HARMFUL | QTY | FATALITIES | INJURIES | HARMFUL_AVG | FATALITIES_AVG | INJURIES_AVG |
|---|---|---|---|---|---|---|---|
| TORNADO | 97068 | 60700 | 5661 | 91407 | 1.5991433 | 0.0932619 | 1.5058814 |
| EXCESSIVE HEAT | 14132 | 33788 | 3467 | 10665 | 0.4182550 | 0.1026104 | 0.3156446 |
| THUNDERSTORM WIND | 10276 | 343796 | 729 | 9547 | 0.0298898 | 0.0021204 | 0.0277694 |
| FLOOD | 10236 | 86133 | 1553 | 8683 | 0.1188395 | 0.0180303 | 0.1008092 |
| LIGHTNING | 6048 | 15763 | 817 | 5231 | 0.3836833 | 0.0518302 | 0.3318531 |
| EXCESSIVE COLD | 3972 | 14341 | 483 | 3489 | 0.2769681 | 0.0336797 | 0.2432885 |
| HIGH WIND | 2690 | 28173 | 696 | 1994 | 0.0954815 | 0.0247045 | 0.0707770 |
| ICE STORM | 2565 | 15973 | 335 | 2230 | 0.1605835 | 0.0209729 | 0.1396106 |
| FOREST FIRE | 1696 | 4232 | 90 | 1606 | 0.4007561 | 0.0212665 | 0.3794896 |
| HURRICANE | 1468 | 299 | 135 | 1333 | 4.9096990 | 0.4515050 | 4.4581940 |
| CURRENT | 1442 | 1937 | 712 | 730 | 0.7444502 | 0.3675787 | 0.3768715 |
| HAIL | 1386 | 289281 | 15 | 1371 | 0.0047912 | 0.0000519 | 0.0047393 |
| FOG | 1156 | 1855 | 80 | 1076 | 0.6231806 | 0.0431267 | 0.5800539 |
| DUST | 507 | 586 | 24 | 483 | 0.8651877 | 0.0409556 | 0.8242321 |
| AVALANCHE | 394 | 386 | 224 | 170 | 1.0207254 | 0.5803109 | 0.4404145 |
| OTHER | 344 | 546 | 44 | 300 | 0.6300366 | 0.0805861 | 0.5494505 |
| TSUNAMI | 162 | 20 | 33 | 129 | 8.1000000 | 1.6500000 | 6.4500000 |
| LANDSLIDE | 99 | 643 | 44 | 55 | 0.1539658 | 0.0684292 | 0.0855365 |
| WATERSPOUT | 32 | 3845 | 3 | 29 | 0.0083225 | 0.0007802 | 0.0075423 |
Based on the df_tidy_q1, let’s visualize it in a
graph.
# Creating a auxiliary dataset to plot 1.
# It will be necessary to put the dataset in such a way to be helpful to ggplot2.
rbind(
# Gathering the dataset to have absolute data in rows.
df_tidy_q1 %>%
select(EVTYPE, FATALITIES, INJURIES) %>%
mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE)) %>%
pivot_longer(cols = 2:3, names_to = "HARMFUL", values_to = "VALUES") %>%
mutate(VALUE_TYPE = factor("ABSOLUTE")),
# Gathering the dataset to average have average data in rows.
df_tidy_q1 %>%
arrange(desc(HARMFUL_AVG)) %>%
select(EVTYPE, FATALITIES_AVG, INJURIES_AVG) %>%
mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE)) %>%
pivot_longer(cols = 2:3, names_to = "HARMFUL", values_to = "VALUES")%>%
mutate(VALUE_TYPE = factor("AVERAGE"))) %>%
# Plotting the graph using GGPLOT2
ggplot(aes(fill = HARMFUL, y = VALUES, x = EVTYPE)) +
# Defining a stacked barplot.
geom_bar(position = "stack", stat = "identity") +
# Defining the graph title.
ggplot2::labs(title = "Most Harmful Event Type") +
# Defining the x-axis label
ggplot2::xlab("Event Type") +
# Defining the y-axis label
ggplot2::ylab("Health Harmful") +
# Adjusting the x-axis to be 90 degrees.
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
legend.position = "right",
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
# Removing the legend title.
ggplot2::guides(fill = ggplot2::guide_legend(title = "")) +
# Creating facet to show absolute and average values.
facet_grid(rows = vars(VALUE_TYPE), scales = "free")
In term of absolute value, there are no doubt that
TORNADO events are the most harmful to health due to the
high frequency of tornadoes. The number of tornadoes observed in the
dataset is more than 60 thousand (between 1950 and 2011), which results
in 5.661 FATALITIES, 91.407 INJURIES, and
97.068 health HARMFUL.
However, whether you analyze it in terms of average,
in that case, the TSUNAMI event type has the highest
averages of INJURIES and FATALITIES, resulting in increased health
HARMFUL. The meaning of seeing the average is to identify the most
deadly event type.
Across the United States, which types of events have the greatest economic consequences?
Answer
Following the same principles used in Question 1, I will define as
economic consequence the sum of Properties Damages
(PROPDMGABS) and Crop Damages (CROPDMGABS).
Also, I will compare the average of PROPDMGABS and
CROPDMGABS to analyze which one has the higher economic
impact.
# This subset will creates averages of each EVTYPE and add the Harmful variable.
df_tidy_q2 <- df_tidy %>%
# Creating the HARMFUL variable.
mutate(EXPENSES = PROPDMGABS + CROPDMGABS) %>%
# Aggregate everything using EVTYPE.
group_by(EVTYPE) %>%
# Summarizes the data using the group by EVTYPE.
summarise(EXPENSES = sum(EXPENSES, na.rm = TRUE),
QTY = n(),
PROPDMGABS = sum(PROPDMGABS, na.rm = TRUE),
CROPDMGABS = sum(CROPDMGABS, na.rm = TRUE)) %>%
# Arrange to be descendant.
arrange(desc(EXPENSES)) %>%
# Adding averages variables.
mutate(EXPENSES_AVG = EXPENSES/QTY,
PROPDMGABS_AVG = PROPDMGABS/QTY,
CROPDMGABS_AVG = CROPDMGABS/QTY)
The adjusted dataset will be shown below.
# Printing the
head(df_tidy_q2, 19) %>% kableExtra::kbl() %>% kableExtra::kable_styling()
| EVTYPE | EXPENSES | QTY | PROPDMGABS | CROPDMGABS | EXPENSES_AVG | PROPDMGABS_AVG | CROPDMGABS_AVG |
|---|---|---|---|---|---|---|---|
| FLOOD | 180658667663 | 86133 | 168270100463 | 12388567200 | 2.097438e+06 | 1.953608e+06 | 1.438307e+05 |
| HURRICANE | 90872527810 | 299 | 85356410010 | 5516117800 | 3.039215e+08 | 2.854729e+08 | 1.844855e+07 |
| TORNADO | 59020779387 | 60700 | 58603317867 | 417461520 | 9.723357e+05 | 9.654583e+05 | 6.877455e+03 |
| CURRENT | 48056407000 | 1937 | 48055552000 | 855000 | 2.480971e+07 | 2.480927e+07 | 4.414042e+02 |
| EXCESSIVE HEAT | 21065038320 | 33788 | 5257925140 | 15807113180 | 6.234473e+05 | 1.556152e+05 | 4.678322e+05 |
| EXCESSIVE COLD | 21015496360 | 14341 | 11877309560 | 9138186800 | 1.465414e+06 | 8.282065e+05 | 6.372071e+05 |
| HAIL | 19024452136 | 289281 | 15977564513 | 3046887623 | 6.576461e+04 | 5.523199e+04 | 1.053262e+04 |
| THUNDERSTORM WIND | 12456657288 | 343796 | 11184948300 | 1271708988 | 3.623270e+04 | 3.253368e+04 | 3.699022e+03 |
| FOREST FIRE | 8899845130 | 4232 | 8496563500 | 403281630 | 2.102988e+06 | 2.007695e+06 | 9.529339e+04 |
| ICE STORM | 7551932751 | 15973 | 7412418751 | 139514000 | 4.727936e+05 | 4.640593e+05 | 8.734364e+03 |
| HIGH WIND | 7048888488 | 28173 | 6268408088 | 780480400 | 2.502001e+05 | 2.224970e+05 | 2.770313e+04 |
| LIGHTNING | 952534520 | 15763 | 940442430 | 12092090 | 6.042850e+04 | 5.966139e+04 | 7.671186e+02 |
| LANDSLIDE | 346693100 | 643 | 326676100 | 20017000 | 5.391806e+05 | 5.080499e+05 | 3.113064e+04 |
| OTHER | 169455250 | 546 | 10666300 | 158788950 | 3.103576e+05 | 1.953535e+04 | 2.908223e+05 |
| TSUNAMI | 144082000 | 20 | 144062000 | 20000 | 7.204100e+06 | 7.203100e+06 | 1.000000e+03 |
| FOG | 22929500 | 1855 | 22929500 | 0 | 1.236092e+04 | 1.236092e+04 | 0.000000e+00 |
| WATERSPOUT | 9564200 | 3845 | 9564200 | 0 | 2.487438e+03 | 2.487438e+03 | 0.000000e+00 |
| DUST | 9388130 | 586 | 6288130 | 3100000 | 1.602070e+04 | 1.073060e+04 | 5.290102e+03 |
| AVALANCHE | 3721800 | 386 | 3721800 | 0 | 9.641969e+03 | 9.641969e+03 | 0.000000e+00 |
Based on the df_tidy_q2, let’s visualize it in a
graph.
# Creating a auxiliary dataset to plot 1.
# It will be necessary to put the dataset in such a way to be helpful to ggplot2.
rbind(
# Gathering the dataset to have absolute data in rows.
df_tidy_q2 %>%
select(EVTYPE, PROPDMGABS, CROPDMGABS) %>%
mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE),
PROPDMGABS = PROPDMGABS/1000000,
CROPDMGABS = CROPDMGABS/1000000) %>%
pivot_longer(cols = 2:3, names_to = "EXPENSES", values_to = "VALUES") %>%
mutate(VALUE_TYPE = factor("ABSOLUTE")),
# Gathering the dataset to average have average data in rows.
df_tidy_q2 %>%
arrange(desc(EXPENSES_AVG)) %>%
select(EVTYPE, PROPDMGABS_AVG, CROPDMGABS_AVG) %>%
mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE),
PROPDMGABS_AVG = PROPDMGABS_AVG/1000000,
CROPDMGABS_AVG = CROPDMGABS_AVG/1000000) %>%
pivot_longer(cols = 2:3, names_to = "EXPENSES", values_to = "VALUES")%>%
mutate(VALUE_TYPE = factor("AVERAGE"))) %>%
# Plotting the graph using GGPLOT2
ggplot(aes(fill = EXPENSES, y = VALUES, x = EVTYPE)) +
# Defining a stacked barplot.
geom_bar(position = "stack", stat = "identity") +
# Defining the graph title.
ggplot2::labs(title = "Economic Consequence of Event Types") +
# Defining the x-axis label
ggplot2::xlab("Event Type") +
# Defining the y-axis label
ggplot2::ylab("Economic Impact [Million USD]") +
# Adjusting the x-axis to be 90 degrees.
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
legend.position = "right",
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
# Removing the legend title.
ggplot2::guides(fill = ggplot2::guide_legend(title = "")) +
# Creating facet to show absolute and average values.
facet_grid(rows = vars(VALUE_TYPE), scales = "free")
FLOOD is the event with the greatest economic
consequence (most of it related to Properties Damages). Between 1950 and
2011, it causes Damages to Properties and Crops that reach more than USD
180 billion. However, the HURRICANE is one event type with
the highest average in terms of damage. On average, one hurricane’s
outcome is around USD 300 million in damages.
Finally, it is also important to point out that
EXCESSIVE HEAT is the worst event related to Crops, which
is almost USD 15.8 billion in Damages to Crops.