1.Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

2.Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

Storm Data

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation

National Climatic Data Center Storm Events

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

3.Questions to be answered

The data analysis must address the following questions:

  1. Across the United States, which types of events (as indicated in the EVTYPE start color red, start verbatim, EVTYPE, end verbatim, end color red variable) are most harmful with respect to population health?

  2. Across the United States, which types of events have the greatest economic consequences?

4.Loading libraries

library(ggplot2)
library(tidyverse)
library(rmarkdown)
library(magrittr)
library(kableExtra)
library(DT)

5.Getting the data

5.1 Creating a data directory

# 1. Create a data directory
if(!base::file.exists("data")) {
    base::dir.create("data")
}

# 2. Download files and store it in data directory.
if(!base::file.exists("./data/repdata_data_StormData.csv.bz2")){
    utils::download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                         destfile = "./data/repdata_data_StormData.csv.bz2")
}

5.2 Loading the data

# 1. the raw data
raw_data <- utils::read.csv(file = "./data/repdata_data_StormData.csv.bz2")
# 2. converting into a data frame 
raw_data <- dplyr::tbl_df(raw_data)

6 Some EDA on the data

This section presents a few remarks in the EDA on the data from NOAA.

6.1 Dimensions

base::dim(raw_data)
## [1] 902297     37

6.2 Head and Tail

Printing the first 5 rows from dataset

# printing the first 5 rows from raw data
head(raw_data,5)
## # A tibble: 5 × 37
##   STATE__ BGN_DATE   BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
##     <dbl> <chr>      <chr>    <chr>      <dbl> <chr>      <chr> <chr>      <dbl>
## 1       1 4/18/1950… 0130     CST           97 MOBILE     AL    TORNA…         0
## 2       1 4/18/1950… 0145     CST            3 BALDWIN    AL    TORNA…         0
## 3       1 2/20/1951… 1600     CST           57 FAYETTE    AL    TORNA…         0
## 4       1 6/8/1951 … 0900     CST           89 MADISON    AL    TORNA…         0
## 5       1 11/15/195… 1500     CST           43 CULLMAN    AL    TORNA…         0
## # ℹ 28 more variables: BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>,
## #   END_TIME <chr>, COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>,
## #   END_AZI <chr>, END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>,
## #   MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>,
## #   PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>,
## #   STATEOFFIC <chr>, ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
## #   LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>

Printing the last 5 rows from dataset

# printing the last 5 rows from raw data
tail(raw_data,5)
## # A tibble: 5 × 37
##   STATE__ BGN_DATE   BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
##     <dbl> <chr>      <chr>    <chr>      <dbl> <chr>      <chr> <chr>      <dbl>
## 1      56 11/30/201… 10:30:0… MST            7 WYZ007 - … WY    HIGH …         0
## 2      30 11/10/201… 02:48:0… MST            9 MTZ009 - … MT    HIGH …         0
## 3       2 11/8/2011… 02:58:0… AKS          213 AKZ213     AK    HIGH …         0
## 4       2 11/9/2011… 10:21:0… AKS          202 AKZ202     AK    BLIZZ…         0
## 5       1 11/28/201… 08:00:0… CST            6 ALZ006     AL    HEAVY…         0
## # ℹ 28 more variables: BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>,
## #   END_TIME <chr>, COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>,
## #   END_AZI <chr>, END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>,
## #   MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>,
## #   PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>,
## #   STATEOFFIC <chr>, ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
## #   LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>

6.3 Variables

The dataset from NOAA has 37 variables, as wee seen using the str function

utils::str(raw_data)
## tibble [902,297 × 37] (S3: tbl_df/tbl/data.frame)
##  $ STATE__   : num [1:902297] 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr [1:902297] "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr [1:902297] "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr [1:902297] "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num [1:902297] 97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr [1:902297] "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr [1:902297] "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr [1:902297] "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr [1:902297] "" "" "" "" ...
##  $ BGN_LOCATI: chr [1:902297] "" "" "" "" ...
##  $ END_DATE  : chr [1:902297] "" "" "" "" ...
##  $ END_TIME  : chr [1:902297] "" "" "" "" ...
##  $ COUNTY_END: num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi [1:902297] NA NA NA NA NA NA ...
##  $ END_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr [1:902297] "" "" "" "" ...
##  $ END_LOCATI: chr [1:902297] "" "" "" "" ...
##  $ LENGTH    : num [1:902297] 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num [1:902297] 100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int [1:902297] 3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num [1:902297] 0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num [1:902297] 15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num [1:902297] 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr [1:902297] "K" "K" "K" "K" ...
##  $ CROPDMG   : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr [1:902297] "" "" "" "" ...
##  $ WFO       : chr [1:902297] "" "" "" "" ...
##  $ STATEOFFIC: chr [1:902297] "" "" "" "" ...
##  $ ZONENAMES : chr [1:902297] "" "" "" "" ...
##  $ LATITUDE  : num [1:902297] 3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num [1:902297] 8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num [1:902297] 3051 0 0 0 0 ...
##  $ LONGITUDE_: num [1:902297] 8806 0 0 0 0 ...
##  $ REMARKS   : chr [1:902297] "" "" "" "" ...
##  $ REFNUM    : num [1:902297] 1 2 3 4 5 6 7 8 9 10 ...

6.3 NA Presence

# Printing the summ of LATITUDE
summary(raw_data$LATITUDE)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    2802    3540    2875    4019    9706      47
# Printing the summ of Latitude_e
summary(raw_data$LATITUDE_E)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       0       0    1452    3549    9706      40

6.4 Exponential variables´s treatment

The PROPDMGEXP and CROPDMGEXP(Properties Exponential and Crop Exponential) variables store the base 10 exponential value of PROPDMG and CROPDMG, respectively. For this reason, it is necessary to combine PROPDMG and PROPDMGEXP into a single variable and do the same to CROPDMG and CROPDMGEXP

Therefore, it is essential to understand how the PROPDMGEXP and CROPDMGEXP were recorded.

# Printing the unique values to exponential to properties
unique(raw_data$PROPDMGEXP)
##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
# Printing the unique values to exponential to Crops.
unique(raw_data$CROPDMGEXP)
## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

Finally, the notation is not uniform because sometimes they use Capital or Lowercase (e.g., K or k). In other cases, they use a number or insert a question mark (?) or plus/minus signal (e.g., - or +).

*PROPDMGEXP

*CROPDMGEXP

6.5 Grouping data by Event type

The raw dataset has 985 Event Types, the code below will present the first 10.

raw_data %>%
    group_by(EVTYPE) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    head(10)
## # A tibble: 10 × 2
##    EVTYPE                  n
##    <chr>               <int>
##  1 HAIL               288661
##  2 TSTM WIND          219940
##  3 THUNDERSTORM WIND   82563
##  4 TORNADO             60652
##  5 FLASH FLOOD         54277
##  6 FLOOD               25326
##  7 THUNDERSTORM WINDS  20843
##  8 HIGH WIND           20212
##  9 LIGHTNING           15754
## 10 HEAVY SNOW          15708

7 Working on the data

There are some troubles working on these date variables: BGN_DATE, END_DATE, and END_TIME, helpfully we are not using these variables as a part of the course project.

Keeping up with the main goal of this project, the dataset will be subsetted as follows:

7.1 Subsseting the raw data

As we move on the project, we are looking for the most harmful events concerning in: population health and the economic consequences (loses in USD american dollars).

In that order, the data that will be subsetted from the raw_data will be:

*Variables:

+EVTYPE: Event type;

+INJURIES: Quantity of injuries;

+FATALITIES: Quantity of fatalities;

+PROPDMG: Amount in USD american dollars in Damages in properties;

+CROPDMG: Amount in USD american dollars in Damages in crops;

+PROPDMGEXP: he exponential in base 10 used to notate the PROPDMG;

+CROPDMGEXP: The exponential in base 10 used to notate the CROPDMG

# Subsetting the dataset
df_subset <- select(raw_data,EVTYPE,INJURIES,FATALITIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)

7.2 Merging damage and exponential variables

It is necessary to merge PROPDMG and PROPDMGEXP into a single variable, so I will create the PROPDMGABS using the following rules.

*Adjusted variables:

+All 0, and - will be converted into 1;

+All 1 will be converted into 10;

+All H, h and 2 will be converted into 100;

+All K, k and 3 will be converted into 1000;

+All 4 will be converted into 10000;

+All 5 will be converted into 100000;

+All M, m and 6 will be converted into 1000000;

+All 7 will be converted into 10000000;

+All 8 will be converted into 100000000;

+All B, b and 9 will be converted into 1000000000;

Following the same principles, I will merge CROPDMG and CROPDMG to create CROPDMGABS:

*Adjusted variables:

+All 0, and ? will be converted into 1;

+All H,h and 2 will be converted into 100;

+All K, k and 3 will be converted into 1000;

+All M, m and 6 will be converted into 1000000;

+All B, b and 9 will be converted into 1000000000

The PROPDMGABS and CROPDMGABS will be in charge of storing the absolute values of Properties Damages and Crops Damages.

# Merging exponential column and value column.
df_subset_2 <- df_subset %>%
  # Creating a new column PROPDMGABS
    mutate(PROPDMGABS = case_when(
        PROPDMGEXP %in% c("", "-",0) ~ PROPDMG * 1,
        PROPDMGEXP %in% c(1)         ~ PROPDMG * 10,
        PROPDMGEXP %in% c("H","h",2) ~ PROPDMG * 100,
        PROPDMGEXP %in% c("K","k",3) ~ PROPDMG * 1000,
        PROPDMGEXP %in% c(4)         ~ PROPDMG * 10000,
        PROPDMGEXP %in% c(5)         ~ PROPDMG * 100000,
        PROPDMGEXP %in% c("M","m",6) ~ PROPDMG * 1000000,
        PROPDMGEXP %in% c(7)         ~ PROPDMG * 10000000,
        PROPDMGEXP %in% c(8)         ~ PROPDMG * 100000000,
        PROPDMGEXP %in% c("B","b",9) ~ PROPDMG * 1000000000)) %>%

# Creating a new column CROPDMGABS
    mutate(CROPDMGABS = case_when(
        CROPDMGEXP %in% c("?","", 0)   ~ CROPDMG * 1,
        CROPDMGEXP %in% c("H", "h", 2) ~ CROPDMG * 100,
        CROPDMGEXP %in% c("K", "k", 3) ~ CROPDMG * 1000,
        CROPDMGEXP %in% c("M", "m", 6) ~ CROPDMG * 1000000,
        CROPDMGEXP %in% c("B", "b", 9) ~ CROPDMG * 1000000000)) 

7.3 Uniform Event type

The EVTYPE variable is not tidy because there are a lot of classifications with the same meaning but different notations. Hence, I will standardize this variable following the rules inside of the next chunk.

# Standardizing the EVTYPE variable.
df_subset_3 <- df_subset_2 %>%
    
    # Converting all value to be UPPERCASE.
    mutate(EVTYPE = stringr::str_to_upper(EVTYPE)) %>%
    
    # Rules to aggregate events into a new variable called EVTYPE_2
    mutate(EVTYPE_2 = case_when(
        grepl(pattern = "*HURRICANE*|*TYPHOON*", x = EVTYPE) ~ "HURRICANE",
        grepl(pattern = "*TORNADO*", x = EVTYPE) ~ "TORNADO",
        grepl(pattern = "*THUNDERSTORM WIN*|*TSTM*|*FUNNEL CLOUD*|*FUNNEL*|*THUNDERSTORM*", x = EVTYPE) ~ "THUNDERSTORM WIND",
        grepl(pattern = "*FLOOD*|*FLD*", x = EVTYPE) ~ "FLOOD",
        grepl(pattern = "*HAIL*", x = EVTYPE) ~ "HAIL",
        grepl(pattern = "*HIGH WIND*|*STRONG WIND*|WIND", x = EVTYPE) ~ "HIGH WIND",
        grepl(pattern = "*FOREST FIRE*|*WILD FIRES*|*WILDFIRE*", x = EVTYPE) ~ "FOREST FIRE",
        grepl(pattern = "*LIGHTNING*", x = EVTYPE) ~ "LIGHTNING",
        grepl(pattern = "*HEAT*|*DROUGHT*|*EXCESSIVE HEAT*|*RECORD HIGH*|*WARMTH*|*UNSEASONABLY WARM*|*UNSEASONABLY DRY*|*TEMPERATURE RECORD*|*RECORD TEMPERATURE*|*DRY*", x = EVTYPE) ~ "EXCESSIVE HEAT",
        grepl(pattern = "*WINTER WEATHER*|*FROST/FREEZE*|*EXTREME COLD*|*COLD*|*ICE*|*FREEZE*|*FROST*|*WINTRY MIX*|*LOW TEMPERATURE*", x = EVTYPE) ~ "EXCESSIVE COLD",
        grepl(pattern = "*WINTER STORM*|*SLEET*|*SNOW*|*BLIZZARD*|*FREEZING RAIN*", x = EVTYPE) ~ "ICE STORM",
        grepl(pattern = "*LANDSLIDE*|*MUDSLIDE*|*MUD SLIDE*", x = EVTYPE) ~ "LANDSLIDE",
        grepl(pattern = "*AVALANCHE*", x = EVTYPE) ~ "AVALANCHE",
        grepl(pattern = "*FOG*|*SMOKE*", x = EVTYPE) ~ "FOG",
        grepl(pattern = "*DUST*", x = EVTYPE) ~ "DUST",
        grepl(pattern = "*CURRENT*|*SURF*", x = EVTYPE) ~ "CURRENT",
        grepl(pattern = "*TSUNAMI*", x = EVTYPE) ~ "TSUNAMI",
        grepl(pattern = "*WATERSPOUT*", x = EVTYPE) ~ "WATERSPOUT")) %>%
    
    # Creating a OTHER type of event to aggregate minor events.
    mutate(EVTYPE_2 = case_when(
        is.na(EVTYPE_2) ~ "OTHER",
        !is.na(EVTYPE_2) ~ EVTYPE_2)) %>%
    
    # Dropping EVTYPE because it is not necessary to carry it.
    select(-EVTYPE) %>%
    
    # Renaming the EVTYPE_2 to be the new EVTYPE
    mutate(EVTYPE = EVTYPE_2)

After the standardization, the new EVTYPE has 19 categories.

# Printing the standards of Event Type
unique(df_subset_3$EVTYPE)
##  [1] "TORNADO"           "THUNDERSTORM WIND" "HAIL"             
##  [4] "EXCESSIVE COLD"    "ICE STORM"         "FLOOD"            
##  [7] "HURRICANE"         "EXCESSIVE HEAT"    "LIGHTNING"        
## [10] "FOG"               "CURRENT"           "HIGH WIND"        
## [13] "OTHER"             "WATERSPOUT"        "AVALANCHE"        
## [16] "DUST"              "FOREST FIRE"       "LANDSLIDE"        
## [19] "TSUNAMI"

7.4 Getting the Tidy Data

The Tidy Data is result of combining: EVTYPE, INJURIES, FATALITIES, PROPDMGABS and CROPDMGABS

# Creating the tidy dataset.
df_tidy <- df_subset_3 %>%
    
    # Dropping the columns: PROPDMGEXP, PROPDMG, CROPDMGEXP, CROPDMG
    select(-c(PROPDMGEXP, PROPDMG, CROPDMGEXP, CROPDMG))

7.5 Getting the Results

#Question 1:

Across the United States, which types of events (as indicated in the EVTYPE start color red, start verbatim, EVTYPE, end verbatim, end color red variable) are most harmful with respect to population health?

As first option, I define harmful as the summation of: INJURIES and FATALITIES for each EVTYPE (as it was defined before)

In that order, the next step was creating the following variables:

*New variables:

+INJURIES_AVG: average of injuries as a proportion of its quantities;

+FATALITIES_AVG: average of fatalities as a proportion of its quantities;

+HARMFUL_AVG: average of harmful as a proportion of its quantities;

# This subset will creates averages of each EVTYPE and add the Harmful variable.
df_tidy_q1 <- df_tidy %>%
    
    # Creating the HARMFUL variable.
    mutate(HARMFUL = INJURIES + FATALITIES) %>%
    
    # Aggregate everything using EVTYPE.
    group_by(EVTYPE) %>%
    
    # Summarizes the data using the group by EVTYPE.
    summarise(HARMFUL = sum(HARMFUL),
              QTY = n(),
              FATALITIES = sum(FATALITIES),
              INJURIES = sum(INJURIES)) %>%
    
    # Arrange to be descendant.
    arrange(desc(HARMFUL)) %>%
    
    # Adding averages variables.
    mutate(HARMFUL_AVG = HARMFUL/QTY,
           FATALITIES_AVG = FATALITIES/QTY,
           INJURIES_AVG = INJURIES/QTY)

The adjusted dataset will be shown below:

# Printing the table
head(df_tidy_q1, 19) %>% kableExtra::kbl() %>% kableExtra::kable_styling()
EVTYPE HARMFUL QTY FATALITIES INJURIES HARMFUL_AVG FATALITIES_AVG INJURIES_AVG
TORNADO 97068 60700 5661 91407 1.5991433 0.0932619 1.5058814
EXCESSIVE HEAT 14132 33788 3467 10665 0.4182550 0.1026104 0.3156446
THUNDERSTORM WIND 10276 343796 729 9547 0.0298898 0.0021204 0.0277694
FLOOD 10236 86133 1553 8683 0.1188395 0.0180303 0.1008092
LIGHTNING 6048 15763 817 5231 0.3836833 0.0518302 0.3318531
EXCESSIVE COLD 3972 14341 483 3489 0.2769681 0.0336797 0.2432885
HIGH WIND 2690 28173 696 1994 0.0954815 0.0247045 0.0707770
ICE STORM 2565 15973 335 2230 0.1605835 0.0209729 0.1396106
FOREST FIRE 1696 4232 90 1606 0.4007561 0.0212665 0.3794896
HURRICANE 1468 299 135 1333 4.9096990 0.4515050 4.4581940
CURRENT 1442 1937 712 730 0.7444502 0.3675787 0.3768715
HAIL 1386 289281 15 1371 0.0047912 0.0000519 0.0047393
FOG 1156 1855 80 1076 0.6231806 0.0431267 0.5800539
DUST 507 586 24 483 0.8651877 0.0409556 0.8242321
AVALANCHE 394 386 224 170 1.0207254 0.5803109 0.4404145
OTHER 344 546 44 300 0.6300366 0.0805861 0.5494505
TSUNAMI 162 20 33 129 8.1000000 1.6500000 6.4500000
LANDSLIDE 99 643 44 55 0.1539658 0.0684292 0.0855365
WATERSPOUT 32 3845 3 29 0.0083225 0.0007802 0.0075423

Once, based on the Tidy Data on the df_tidy_q1 dataframe the plot shows as follows:

# Creating a auxiliary dataset to plot 1.
# It will be necessary to put the dataset in such a way to be helpful to ggplot2.
rbind(
    
    # Gathering the dataset to have absolute data in rows.
    df_tidy_q1 %>%
    select(EVTYPE, FATALITIES, INJURIES) %>%
    mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE)) %>%
    pivot_longer(cols = 2:3, names_to = "HARMFUL", values_to = "VALUES") %>%
    mutate(VALUE_TYPE = factor("ABSOLUTE")),
    
    # Gathering the dataset to average have average data in rows.
    df_tidy_q1 %>%
    arrange(desc(HARMFUL_AVG)) %>%
    select(EVTYPE, FATALITIES_AVG, INJURIES_AVG) %>%
    mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE)) %>%
    pivot_longer(cols = 2:3, names_to = "HARMFUL", values_to = "VALUES")%>%
    mutate(VALUE_TYPE = factor("AVERAGE"))) %>%
    
    # Plotting the graph using GGPLOT2
    ggplot(aes(fill = HARMFUL, y = VALUES, x = EVTYPE)) + 
    
    # Defining a stacked barplot.
    geom_bar(position = "stack", stat = "identity") +
    
    # Defining the graph title.
    ggplot2::labs(title = "Most Harmful Event Type") + 
    
    # Defining the x-axis label
    ggplot2::xlab("Event Type") +
    
    # Defining the y-axis label
    ggplot2::ylab("Health Harmful") +
    
    # Adjusting the x-axis to be 90 degrees.
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
          legend.position = "right",
          plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5)) +
    
    # Removing the legend title.
        ggplot2::guides(fill = ggplot2::guide_legend(title = "")) + 
    
    # Creating facet to show absolute and average values.
    facet_grid(rows = vars(VALUE_TYPE), scales = "free")

Remarks:

In term of absolute value, there are no doubt that TORNADO events are the most harmful to health due to the high frequency of tornadoes. The number of tornadoes observed in the dataset is more than 60 thousand (between 1950 and 2011), which results in 5.661 FATALITIES, 91.407 INJURIES, and 97.068 health HARMFUL.

However, whether you analyze it in terms of average, in that case, the TSUNAMI event type has the highest averages of INJURIES and FATALITIES, resulting in increased health HARMFUL. The meaning of seeing the average is to identify the most deadly event type.

Question 2:

Across the United States, which types of events have the greatest economic consequences?

As was response in question 1, the next step was defining a new variable for economic consequences, as the result of aggregating properties damages: Properties Damages (PROPDMGABS), and Crop Damage (CROPDMGABS).

# This subset will creates averages of each EVTYPE and add the Harmful variable.
df_tidy_q2 <- df_tidy %>%
    
    # Creating the HARMFUL variable.
    mutate(EXPENSES = PROPDMGABS + CROPDMGABS) %>%
    
    # Aggregate everything using EVTYPE.
    group_by(EVTYPE) %>%
    
    # Summarizes the data using the group by EVTYPE.
    summarise(EXPENSES = sum(EXPENSES, na.rm = TRUE),
              QTY = n(),
              PROPDMGABS = sum(PROPDMGABS, na.rm = TRUE),
              CROPDMGABS = sum(CROPDMGABS, na.rm = TRUE)) %>%
    
    # Arrange to be descendant.
    arrange(desc(EXPENSES)) %>%
    
    # Adding averages variables.
    mutate(EXPENSES_AVG = EXPENSES/QTY,
           PROPDMGABS_AVG = PROPDMGABS/QTY,
           CROPDMGABS_AVG = CROPDMGABS/QTY) 

The adjusted dataset will be shown below.

# Printing the table from dataframe 
head(df_tidy_q2, 19) %>% kableExtra::kbl() %>% kableExtra::kable_styling()
EVTYPE EXPENSES QTY PROPDMGABS CROPDMGABS EXPENSES_AVG PROPDMGABS_AVG CROPDMGABS_AVG
FLOOD 180658667663 86133 168270100463 12388567200 2.097438e+06 1.953608e+06 1.438307e+05
HURRICANE 90872527810 299 85356410010 5516117800 3.039215e+08 2.854729e+08 1.844855e+07
TORNADO 59020779387 60700 58603317867 417461520 9.723357e+05 9.654583e+05 6.877455e+03
CURRENT 48056407000 1937 48055552000 855000 2.480971e+07 2.480927e+07 4.414042e+02
EXCESSIVE HEAT 21065038320 33788 5257925140 15807113180 6.234473e+05 1.556152e+05 4.678322e+05
EXCESSIVE COLD 21015496360 14341 11877309560 9138186800 1.465414e+06 8.282065e+05 6.372071e+05
HAIL 19024452136 289281 15977564513 3046887623 6.576461e+04 5.523199e+04 1.053262e+04
THUNDERSTORM WIND 12456657288 343796 11184948300 1271708988 3.623270e+04 3.253368e+04 3.699022e+03
FOREST FIRE 8899845130 4232 8496563500 403281630 2.102988e+06 2.007695e+06 9.529339e+04
ICE STORM 7551932751 15973 7412418751 139514000 4.727936e+05 4.640593e+05 8.734364e+03
HIGH WIND 7048888488 28173 6268408088 780480400 2.502001e+05 2.224970e+05 2.770313e+04
LIGHTNING 952534520 15763 940442430 12092090 6.042850e+04 5.966139e+04 7.671186e+02
LANDSLIDE 346693100 643 326676100 20017000 5.391806e+05 5.080499e+05 3.113064e+04
OTHER 169455250 546 10666300 158788950 3.103576e+05 1.953535e+04 2.908223e+05
TSUNAMI 144082000 20 144062000 20000 7.204100e+06 7.203100e+06 1.000000e+03
FOG 22929500 1855 22929500 0 1.236092e+04 1.236092e+04 0.000000e+00
WATERSPOUT 9564200 3845 9564200 0 2.487438e+03 2.487438e+03 0.000000e+00
DUST 9388130 586 6288130 3100000 1.602070e+04 1.073060e+04 5.290102e+03
AVALANCHE 3721800 386 3721800 0 9.641969e+03 9.641969e+03 0.000000e+00

Based on the df_tidy_q2, let’s visualize it in a graph.

# Creating a auxiliary dataset to plot 1.

rbind(
    
    # Gathering the dataset to have absolute data in rows.
    df_tidy_q2 %>%
    select(EVTYPE, PROPDMGABS, CROPDMGABS) %>%
    mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE),
           PROPDMGABS = PROPDMGABS/1000000,
           CROPDMGABS = CROPDMGABS/1000000) %>%
    pivot_longer(cols = 2:3, names_to = "EXPENSES", values_to = "VALUES") %>%
    mutate(VALUE_TYPE = factor("ABSOLUTE")),
    
    # Gathering the dataset to average have average data in rows.
    df_tidy_q2 %>%
    arrange(desc(EXPENSES_AVG)) %>%
    select(EVTYPE, PROPDMGABS_AVG, CROPDMGABS_AVG) %>%
    mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE),
           PROPDMGABS_AVG = PROPDMGABS_AVG/1000000,
           CROPDMGABS_AVG = CROPDMGABS_AVG/1000000) %>%
    pivot_longer(cols = 2:3, names_to = "EXPENSES", values_to = "VALUES")%>%
    mutate(VALUE_TYPE = factor("AVERAGE"))) %>%
    
    # Plotting the graph using GGPLOT2
    ggplot(aes(fill = EXPENSES, y = VALUES, x = EVTYPE)) + 
    
    # Defining a stacked barplot.
    geom_bar(position = "stack", stat = "identity") +
    
    # Defining the graph title.
    ggplot2::labs(title = "Economic Consequence of Event Types") + 
    
    # Defining the x-axis label
    ggplot2::xlab("Event Type") +
    
    # Defining the y-axis label
    ggplot2::ylab("Economic Impact [Million USD]") +
    
    # Adjusting the x-axis to be 90 degrees.
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
          legend.position = "right",
          plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5)) +
    
    # Removing the legend title.
        ggplot2::guides(fill = ggplot2::guide_legend(title = "")) + 
    
    # Creating facet to show absolute and average values.
    facet_grid(rows = vars(VALUE_TYPE), scales = "free")

Remarks:

FLOOD is the event with the greatest economic consequence (most of it related to Properties Damages). Between 1950 and 2011, it causes Damages to Properties and Crops that reach more than USD 180 billion. However, the HURRICANE is one event type with the highest average in terms of damage. On average, one hurricane’s outcome is around USD 300 million in damages.

Finally, it is also important to point out that EXCESSIVE HEAT is the worst event related to Crops, which is almost USD 15.8 billion in Damages to Crops.