1.Introduction

Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.

This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.

2.Data

The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:

Storm Data

There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.

National Weather Service Storm Data Documentation

National Climatic Data Center Storm Events

The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.

3.Questions to be answered

The data analysis must address the following questions:

Across the United States, which types of events (as indicated in the EVTYPE start color red, start verbatim, EVTYPE, end verbatim, end color red variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?

4.Loading libraries

library(ggplot2)
library(tidyverse)
library(rmarkdown)
library(magrittr)
library(kableExtra)
library(DT)

5.Getting the data

5.1 Creating a data directory

# 1. Create a data directory
if(!base::file.exists("data")) {
    base::dir.create("data")
}

# 2. Download files and store it in data directory.
if(!base::file.exists("./data/repdata_data_StormData.csv.bz2")){
    utils::download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
                         destfile = "./data/repdata_data_StormData.csv.bz2")
}

5.2 Loading the data

# 1. the raw data
raw_data <- utils::read.csv(file = "./data/repdata_data_StormData.csv.bz2")

# 2. converting into a data frame 
raw_data <- dplyr::tbl_df(raw_data)

6 Some EDA on the data

This section presents a few remarks in the EDA on the data from NOAA.

6.1 Dimensions

base::dim(raw_data)

## [1] 902297     37

6.2 Head and Tail

Printing the first 5 rows from dataset

# printing the first 5 rows from raw data
head(raw_data,5)

## # A tibble: 5 × 37
##   STATE__ BGN_DATE   BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
##     <dbl> <chr>      <chr>    <chr>      <dbl> <chr>      <chr> <chr>      <dbl>
## 1       1 4/18/1950… 0130     CST           97 MOBILE     AL    TORNA…         0
## 2       1 4/18/1950… 0145     CST            3 BALDWIN    AL    TORNA…         0
## 3       1 2/20/1951… 1600     CST           57 FAYETTE    AL    TORNA…         0
## 4       1 6/8/1951 … 0900     CST           89 MADISON    AL    TORNA…         0
## 5       1 11/15/195… 1500     CST           43 CULLMAN    AL    TORNA…         0
## # ℹ 28 more variables: BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>,
## #   END_TIME <chr>, COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>,
## #   END_AZI <chr>, END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>,
## #   MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>,
## #   PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>,
## #   STATEOFFIC <chr>, ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
## #   LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>

Printing the last 5 rows from dataset

# printing the last 5 rows from raw data
tail(raw_data,5)

## # A tibble: 5 × 37
##   STATE__ BGN_DATE   BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
##     <dbl> <chr>      <chr>    <chr>      <dbl> <chr>      <chr> <chr>      <dbl>
## 1      56 11/30/201… 10:30:0… MST            7 WYZ007 - … WY    HIGH …         0
## 2      30 11/10/201… 02:48:0… MST            9 MTZ009 - … MT    HIGH …         0
## 3       2 11/8/2011… 02:58:0… AKS          213 AKZ213     AK    HIGH …         0
## 4       2 11/9/2011… 10:21:0… AKS          202 AKZ202     AK    BLIZZ…         0
## 5       1 11/28/201… 08:00:0… CST            6 ALZ006     AL    HEAVY…         0
## # ℹ 28 more variables: BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>,
## #   END_TIME <chr>, COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>,
## #   END_AZI <chr>, END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>,
## #   MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>,
## #   PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>,
## #   STATEOFFIC <chr>, ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
## #   LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>

6.3 Variables

The dataset from NOAA has 37 variables, as wee seen using the str function

utils::str(raw_data)

## tibble [902,297 × 37] (S3: tbl_df/tbl/data.frame)
##  $ STATE__   : num [1:902297] 1 1 1 1 1 1 1 1 1 1 ...
##  $ BGN_DATE  : chr [1:902297] "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
##  $ BGN_TIME  : chr [1:902297] "0130" "0145" "1600" "0900" ...
##  $ TIME_ZONE : chr [1:902297] "CST" "CST" "CST" "CST" ...
##  $ COUNTY    : num [1:902297] 97 3 57 89 43 77 9 123 125 57 ...
##  $ COUNTYNAME: chr [1:902297] "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
##  $ STATE     : chr [1:902297] "AL" "AL" "AL" "AL" ...
##  $ EVTYPE    : chr [1:902297] "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
##  $ BGN_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ BGN_AZI   : chr [1:902297] "" "" "" "" ...
##  $ BGN_LOCATI: chr [1:902297] "" "" "" "" ...
##  $ END_DATE  : chr [1:902297] "" "" "" "" ...
##  $ END_TIME  : chr [1:902297] "" "" "" "" ...
##  $ COUNTY_END: num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ COUNTYENDN: logi [1:902297] NA NA NA NA NA NA ...
##  $ END_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ END_AZI   : chr [1:902297] "" "" "" "" ...
##  $ END_LOCATI: chr [1:902297] "" "" "" "" ...
##  $ LENGTH    : num [1:902297] 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
##  $ WIDTH     : num [1:902297] 100 150 123 100 150 177 33 33 100 100 ...
##  $ F         : int [1:902297] 3 2 2 2 2 2 2 1 3 3 ...
##  $ MAG       : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ FATALITIES: num [1:902297] 0 0 0 0 0 0 0 0 1 0 ...
##  $ INJURIES  : num [1:902297] 15 0 2 2 2 6 1 0 14 0 ...
##  $ PROPDMG   : num [1:902297] 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
##  $ PROPDMGEXP: chr [1:902297] "K" "K" "K" "K" ...
##  $ CROPDMG   : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
##  $ CROPDMGEXP: chr [1:902297] "" "" "" "" ...
##  $ WFO       : chr [1:902297] "" "" "" "" ...
##  $ STATEOFFIC: chr [1:902297] "" "" "" "" ...
##  $ ZONENAMES : chr [1:902297] "" "" "" "" ...
##  $ LATITUDE  : num [1:902297] 3040 3042 3340 3458 3412 ...
##  $ LONGITUDE : num [1:902297] 8812 8755 8742 8626 8642 ...
##  $ LATITUDE_E: num [1:902297] 3051 0 0 0 0 ...
##  $ LONGITUDE_: num [1:902297] 8806 0 0 0 0 ...
##  $ REMARKS   : chr [1:902297] "" "" "" "" ...
##  $ REFNUM    : num [1:902297] 1 2 3 4 5 6 7 8 9 10 ...

6.3 `NA` Presence

# Printing the summ of LATITUDE
summary(raw_data$LATITUDE)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0    2802    3540    2875    4019    9706      47

# Printing the summ of Latitude_e
summary(raw_data$LATITUDE_E)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
##       0       0       0    1452    3549    9706      40

6.4 Exponential variables´s treatment

The PROPDMGEXP and CROPDMGEXP(Properties Exponential and Crop Exponential) variables store the base 10 exponential value of PROPDMG and CROPDMG, respectively. For this reason, it is necessary to combine PROPDMG and PROPDMGEXP into a single variable and do the same to CROPDMG and CROPDMGEXP

Therefore, it is essential to understand how the PROPDMGEXP and CROPDMGEXP were recorded.

# Printing the unique values to exponential to properties
unique(raw_data$PROPDMGEXP)

##  [1] "K" "M" ""  "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"

# Printing the unique values to exponential to Crops.
unique(raw_data$CROPDMGEXP)

## [1] ""  "M" "K" "m" "B" "?" "0" "k" "2"

Finally, the notation is not uniform because sometimes they use Capital or Lowercase (e.g., K or k). In other cases, they use a number or insert a question mark (?) or plus/minus signal (e.g., - or +).

*PROPDMGEXP

?: Has 8 (eight) observations with this notation, none of the rows of PROPDM and CROPDMG has values;
+: Has 5 (five) observations with this notation, and;
-: Has 1 (one) observation with this notation and 1 (one) fatality. I will consider like “0”.

*CROPDMGEXP

?: Has 7 (seven) observations with this notation, all those with CROPDMG as zero;

6.5 Grouping data by `Event type`

The raw dataset has 985 Event Types, the code below will present the first 10.

raw_data %>%
    group_by(EVTYPE) %>%
    summarise(n = n()) %>%
    arrange(desc(n)) %>%
    head(10)

## # A tibble: 10 × 2
##    EVTYPE                  n
##    <chr>               <int>
##  1 HAIL               288661
##  2 TSTM WIND          219940
##  3 THUNDERSTORM WIND   82563
##  4 TORNADO             60652
##  5 FLASH FLOOD         54277
##  6 FLOOD               25326
##  7 THUNDERSTORM WINDS  20843
##  8 HIGH WIND           20212
##  9 LIGHTNING           15754
## 10 HEAVY SNOW          15708

7 Working on the data

There are some troubles working on these date variables: BGN_DATE, END_DATE, and END_TIME, helpfully we are not using these variables as a part of the course project.

Keeping up with the main goal of this project, the dataset will be subsetted as follows:

7.1 Subsseting the raw data

As we move on the project, we are looking for the most harmful events concerning in: population health and the economic consequences (loses in USD american dollars).

In that order, the data that will be subsetted from the raw_data will be:

*Variables:

+EVTYPE: Event type;

+INJURIES: Quantity of injuries;

+FATALITIES: Quantity of fatalities;

+PROPDMG: Amount in USD american dollars in Damages in properties;

+CROPDMG: Amount in USD american dollars in Damages in crops;

+PROPDMGEXP: he exponential in base 10 used to notate the PROPDMG;

+CROPDMGEXP: The exponential in base 10 used to notate the CROPDMG

# Subsetting the dataset
df_subset <- select(raw_data,EVTYPE,INJURIES,FATALITIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)

7.2 Merging damage and exponential variables

It is necessary to merge PROPDMG and PROPDMGEXP into a single variable, so I will create the PROPDMGABS using the following rules.

*Adjusted variables:

+All 0, and - will be converted into 1;

+All 1 will be converted into 10;

+All H, h and 2 will be converted into 100;

+All K, k and 3 will be converted into 1000;

+All 4 will be converted into 10000;

+All 5 will be converted into 100000;

+All M, m and 6 will be converted into 1000000;

+All 7 will be converted into 10000000;

+All 8 will be converted into 100000000;

+All B, b and 9 will be converted into 1000000000;

Following the same principles, I will merge CROPDMG and CROPDMG to create CROPDMGABS:

*Adjusted variables:

+All 0, and ? will be converted into 1;

+All H,h and 2 will be converted into 100;

+All K, k and 3 will be converted into 1000;

+All M, m and 6 will be converted into 1000000;

+All B, b and 9 will be converted into 1000000000

The PROPDMGABS and CROPDMGABS will be in charge of storing the absolute values of Properties Damages and Crops Damages.

# Merging exponential column and value column.
df_subset_2 <- df_subset %>%
  # Creating a new column PROPDMGABS
    mutate(PROPDMGABS = case_when(
        PROPDMGEXP %in% c("", "-",0) ~ PROPDMG * 1,
        PROPDMGEXP %in% c(1)         ~ PROPDMG * 10,
        PROPDMGEXP %in% c("H","h",2) ~ PROPDMG * 100,
        PROPDMGEXP %in% c("K","k",3) ~ PROPDMG * 1000,
        PROPDMGEXP %in% c(4)         ~ PROPDMG * 10000,
        PROPDMGEXP %in% c(5)         ~ PROPDMG * 100000,
        PROPDMGEXP %in% c("M","m",6) ~ PROPDMG * 1000000,
        PROPDMGEXP %in% c(7)         ~ PROPDMG * 10000000,
        PROPDMGEXP %in% c(8)         ~ PROPDMG * 100000000,
        PROPDMGEXP %in% c("B","b",9) ~ PROPDMG * 1000000000)) %>%

# Creating a new column CROPDMGABS
    mutate(CROPDMGABS = case_when(
        CROPDMGEXP %in% c("?","", 0)   ~ CROPDMG * 1,
        CROPDMGEXP %in% c("H", "h", 2) ~ CROPDMG * 100,
        CROPDMGEXP %in% c("K", "k", 3) ~ CROPDMG * 1000,
        CROPDMGEXP %in% c("M", "m", 6) ~ CROPDMG * 1000000,
        CROPDMGEXP %in% c("B", "b", 9) ~ CROPDMG * 1000000000))

7.3 Uniform `Event type`

The EVTYPE variable is not tidy because there are a lot of classifications with the same meaning but different notations. Hence, I will standardize this variable following the rules inside of the next chunk.

# Standardizing the EVTYPE variable.
df_subset_3 <- df_subset_2 %>%
    
    # Converting all value to be UPPERCASE.
    mutate(EVTYPE = stringr::str_to_upper(EVTYPE)) %>%
    
    # Rules to aggregate events into a new variable called EVTYPE_2
    mutate(EVTYPE_2 = case_when(
        grepl(pattern = "*HURRICANE*|*TYPHOON*", x = EVTYPE) ~ "HURRICANE",
        grepl(pattern = "*TORNADO*", x = EVTYPE) ~ "TORNADO",
        grepl(pattern = "*THUNDERSTORM WIN*|*TSTM*|*FUNNEL CLOUD*|*FUNNEL*|*THUNDERSTORM*", x = EVTYPE) ~ "THUNDERSTORM WIND",
        grepl(pattern = "*FLOOD*|*FLD*", x = EVTYPE) ~ "FLOOD",
        grepl(pattern = "*HAIL*", x = EVTYPE) ~ "HAIL",
        grepl(pattern = "*HIGH WIND*|*STRONG WIND*|WIND", x = EVTYPE) ~ "HIGH WIND",
        grepl(pattern = "*FOREST FIRE*|*WILD FIRES*|*WILDFIRE*", x = EVTYPE) ~ "FOREST FIRE",
        grepl(pattern = "*LIGHTNING*", x = EVTYPE) ~ "LIGHTNING",
        grepl(pattern = "*HEAT*|*DROUGHT*|*EXCESSIVE HEAT*|*RECORD HIGH*|*WARMTH*|*UNSEASONABLY WARM*|*UNSEASONABLY DRY*|*TEMPERATURE RECORD*|*RECORD TEMPERATURE*|*DRY*", x = EVTYPE) ~ "EXCESSIVE HEAT",
        grepl(pattern = "*WINTER WEATHER*|*FROST/FREEZE*|*EXTREME COLD*|*COLD*|*ICE*|*FREEZE*|*FROST*|*WINTRY MIX*|*LOW TEMPERATURE*", x = EVTYPE) ~ "EXCESSIVE COLD",
        grepl(pattern = "*WINTER STORM*|*SLEET*|*SNOW*|*BLIZZARD*|*FREEZING RAIN*", x = EVTYPE) ~ "ICE STORM",
        grepl(pattern = "*LANDSLIDE*|*MUDSLIDE*|*MUD SLIDE*", x = EVTYPE) ~ "LANDSLIDE",
        grepl(pattern = "*AVALANCHE*", x = EVTYPE) ~ "AVALANCHE",
        grepl(pattern = "*FOG*|*SMOKE*", x = EVTYPE) ~ "FOG",
        grepl(pattern = "*DUST*", x = EVTYPE) ~ "DUST",
        grepl(pattern = "*CURRENT*|*SURF*", x = EVTYPE) ~ "CURRENT",
        grepl(pattern = "*TSUNAMI*", x = EVTYPE) ~ "TSUNAMI",
        grepl(pattern = "*WATERSPOUT*", x = EVTYPE) ~ "WATERSPOUT")) %>%
    
    # Creating a OTHER type of event to aggregate minor events.
    mutate(EVTYPE_2 = case_when(
        is.na(EVTYPE_2) ~ "OTHER",
        !is.na(EVTYPE_2) ~ EVTYPE_2)) %>%
    
    # Dropping EVTYPE because it is not necessary to carry it.
    select(-EVTYPE) %>%
    
    # Renaming the EVTYPE_2 to be the new EVTYPE
    mutate(EVTYPE = EVTYPE_2)

After the standardization, the new EVTYPE has 19 categories.

# Printing the standards of Event Type
unique(df_subset_3$EVTYPE)

##  [1] "TORNADO"           "THUNDERSTORM WIND" "HAIL"             
##  [4] "EXCESSIVE COLD"    "ICE STORM"         "FLOOD"            
##  [7] "HURRICANE"         "EXCESSIVE HEAT"    "LIGHTNING"        
## [10] "FOG"               "CURRENT"           "HIGH WIND"        
## [13] "OTHER"             "WATERSPOUT"        "AVALANCHE"        
## [16] "DUST"              "FOREST FIRE"       "LANDSLIDE"        
## [19] "TSUNAMI"

7.4 Getting the `Tidy Data`

The Tidy Data is result of combining: EVTYPE, INJURIES, FATALITIES, PROPDMGABS and CROPDMGABS

# Creating the tidy dataset.
df_tidy <- df_subset_3 %>%
    
    # Dropping the columns: PROPDMGEXP, PROPDMG, CROPDMGEXP, CROPDMG
    select(-c(PROPDMGEXP, PROPDMG, CROPDMGEXP, CROPDMG))

7.5 Getting the `Results`

#Question 1:

Across the United States, which types of events (as indicated in the EVTYPE start color red, start verbatim, EVTYPE, end verbatim, end color red variable) are most harmful with respect to population health?

As first option, I define harmful as the summation of: INJURIES and FATALITIES for each EVTYPE (as it was defined before)

In that order, the next step was creating the following variables:

*New variables:

+INJURIES_AVG: average of injuries as a proportion of its quantities;

+FATALITIES_AVG: average of fatalities as a proportion of its quantities;

+HARMFUL_AVG: average of harmful as a proportion of its quantities;

# This subset will creates averages of each EVTYPE and add the Harmful variable.
df_tidy_q1 <- df_tidy %>%
    
    # Creating the HARMFUL variable.
    mutate(HARMFUL = INJURIES + FATALITIES) %>%
    
    # Aggregate everything using EVTYPE.
    group_by(EVTYPE) %>%
    
    # Summarizes the data using the group by EVTYPE.
    summarise(HARMFUL = sum(HARMFUL),
              QTY = n(),
              FATALITIES = sum(FATALITIES),
              INJURIES = sum(INJURIES)) %>%
    
    # Arrange to be descendant.
    arrange(desc(HARMFUL)) %>%
    
    # Adding averages variables.
    mutate(HARMFUL_AVG = HARMFUL/QTY,
           FATALITIES_AVG = FATALITIES/QTY,
           INJURIES_AVG = INJURIES/QTY)

The adjusted dataset will be shown below:

# Printing the table
head(df_tidy_q1, 19) %>% kableExtra::kbl() %>% kableExtra::kable_styling()

EVTYPE	HARMFUL	QTY	FATALITIES	INJURIES	HARMFUL_AVG	FATALITIES_AVG	INJURIES_AVG
TORNADO	97068	60700	5661	91407	1.5991433	0.0932619	1.5058814
EXCESSIVE HEAT	14132	33788	3467	10665	0.4182550	0.1026104	0.3156446
THUNDERSTORM WIND	10276	343796	729	9547	0.0298898	0.0021204	0.0277694
FLOOD	10236	86133	1553	8683	0.1188395	0.0180303	0.1008092
LIGHTNING	6048	15763	817	5231	0.3836833	0.0518302	0.3318531
EXCESSIVE COLD	3972	14341	483	3489	0.2769681	0.0336797	0.2432885
HIGH WIND	2690	28173	696	1994	0.0954815	0.0247045	0.0707770
ICE STORM	2565	15973	335	2230	0.1605835	0.0209729	0.1396106
FOREST FIRE	1696	4232	90	1606	0.4007561	0.0212665	0.3794896
HURRICANE	1468	299	135	1333	4.9096990	0.4515050	4.4581940
CURRENT	1442	1937	712	730	0.7444502	0.3675787	0.3768715
HAIL	1386	289281	15	1371	0.0047912	0.0000519	0.0047393
FOG	1156	1855	80	1076	0.6231806	0.0431267	0.5800539
DUST	507	586	24	483	0.8651877	0.0409556	0.8242321
AVALANCHE	394	386	224	170	1.0207254	0.5803109	0.4404145
OTHER	344	546	44	300	0.6300366	0.0805861	0.5494505
TSUNAMI	162	20	33	129	8.1000000	1.6500000	6.4500000
LANDSLIDE	99	643	44	55	0.1539658	0.0684292	0.0855365
WATERSPOUT	32	3845	3	29	0.0083225	0.0007802	0.0075423

Once, based on the Tidy Data on the df_tidy_q1 dataframe the plot shows as follows:

# Creating a auxiliary dataset to plot 1.
# It will be necessary to put the dataset in such a way to be helpful to ggplot2.
rbind(
    
    # Gathering the dataset to have absolute data in rows.
    df_tidy_q1 %>%
    select(EVTYPE, FATALITIES, INJURIES) %>%
    mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE)) %>%
    pivot_longer(cols = 2:3, names_to = "HARMFUL", values_to = "VALUES") %>%
    mutate(VALUE_TYPE = factor("ABSOLUTE")),
    
    # Gathering the dataset to average have average data in rows.
    df_tidy_q1 %>%
    arrange(desc(HARMFUL_AVG)) %>%
    select(EVTYPE, FATALITIES_AVG, INJURIES_AVG) %>%
    mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE)) %>%
    pivot_longer(cols = 2:3, names_to = "HARMFUL", values_to = "VALUES")%>%
    mutate(VALUE_TYPE = factor("AVERAGE"))) %>%
    
    # Plotting the graph using GGPLOT2
    ggplot(aes(fill = HARMFUL, y = VALUES, x = EVTYPE)) + 
    
    # Defining a stacked barplot.
    geom_bar(position = "stack", stat = "identity") +
    
    # Defining the graph title.
    ggplot2::labs(title = "Most Harmful Event Type") + 
    
    # Defining the x-axis label
    ggplot2::xlab("Event Type") +
    
    # Defining the y-axis label
    ggplot2::ylab("Health Harmful") +
    
    # Adjusting the x-axis to be 90 degrees.
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
          legend.position = "right",
          plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5)) +
    
    # Removing the legend title.
        ggplot2::guides(fill = ggplot2::guide_legend(title = "")) + 
    
    # Creating facet to show absolute and average values.
    facet_grid(rows = vars(VALUE_TYPE), scales = "free")

Remarks:

In term of absolute value, there are no doubt that TORNADO events are the most harmful to health due to the high frequency of tornadoes. The number of tornadoes observed in the dataset is more than 60 thousand (between 1950 and 2011), which results in 5.661 FATALITIES, 91.407 INJURIES, and 97.068 health HARMFUL.

However, whether you analyze it in terms of average, in that case, the TSUNAMI event type has the highest averages of INJURIES and FATALITIES, resulting in increased health HARMFUL. The meaning of seeing the average is to identify the most deadly event type.

Question 2:

Across the United States, which types of events have the greatest economic consequences?

As was response in question 1, the next step was defining a new variable for economic consequences, as the result of aggregating properties damages: Properties Damages (PROPDMGABS), and Crop Damage (CROPDMGABS).

# This subset will creates averages of each EVTYPE and add the Harmful variable.
df_tidy_q2 <- df_tidy %>%
    
    # Creating the HARMFUL variable.
    mutate(EXPENSES = PROPDMGABS + CROPDMGABS) %>%
    
    # Aggregate everything using EVTYPE.
    group_by(EVTYPE) %>%
    
    # Summarizes the data using the group by EVTYPE.
    summarise(EXPENSES = sum(EXPENSES, na.rm = TRUE),
              QTY = n(),
              PROPDMGABS = sum(PROPDMGABS, na.rm = TRUE),
              CROPDMGABS = sum(CROPDMGABS, na.rm = TRUE)) %>%
    
    # Arrange to be descendant.
    arrange(desc(EXPENSES)) %>%
    
    # Adding averages variables.
    mutate(EXPENSES_AVG = EXPENSES/QTY,
           PROPDMGABS_AVG = PROPDMGABS/QTY,
           CROPDMGABS_AVG = CROPDMGABS/QTY)

The adjusted dataset will be shown below.

# Printing the table from dataframe 
head(df_tidy_q2, 19) %>% kableExtra::kbl() %>% kableExtra::kable_styling()

EVTYPE	EXPENSES	QTY	PROPDMGABS	CROPDMGABS	EXPENSES_AVG	PROPDMGABS_AVG	CROPDMGABS_AVG
FLOOD	180658667663	86133	168270100463	12388567200	2.097438e+06	1.953608e+06	1.438307e+05
HURRICANE	90872527810	299	85356410010	5516117800	3.039215e+08	2.854729e+08	1.844855e+07
TORNADO	59020779387	60700	58603317867	417461520	9.723357e+05	9.654583e+05	6.877455e+03
CURRENT	48056407000	1937	48055552000	855000	2.480971e+07	2.480927e+07	4.414042e+02
EXCESSIVE HEAT	21065038320	33788	5257925140	15807113180	6.234473e+05	1.556152e+05	4.678322e+05
EXCESSIVE COLD	21015496360	14341	11877309560	9138186800	1.465414e+06	8.282065e+05	6.372071e+05
HAIL	19024452136	289281	15977564513	3046887623	6.576461e+04	5.523199e+04	1.053262e+04
THUNDERSTORM WIND	12456657288	343796	11184948300	1271708988	3.623270e+04	3.253368e+04	3.699022e+03
FOREST FIRE	8899845130	4232	8496563500	403281630	2.102988e+06	2.007695e+06	9.529339e+04
ICE STORM	7551932751	15973	7412418751	139514000	4.727936e+05	4.640593e+05	8.734364e+03
HIGH WIND	7048888488	28173	6268408088	780480400	2.502001e+05	2.224970e+05	2.770313e+04
LIGHTNING	952534520	15763	940442430	12092090	6.042850e+04	5.966139e+04	7.671186e+02
LANDSLIDE	346693100	643	326676100	20017000	5.391806e+05	5.080499e+05	3.113064e+04
OTHER	169455250	546	10666300	158788950	3.103576e+05	1.953535e+04	2.908223e+05
TSUNAMI	144082000	20	144062000	20000	7.204100e+06	7.203100e+06	1.000000e+03
FOG	22929500	1855	22929500	0	1.236092e+04	1.236092e+04	0.000000e+00
WATERSPOUT	9564200	3845	9564200	0	2.487438e+03	2.487438e+03	0.000000e+00
DUST	9388130	586	6288130	3100000	1.602070e+04	1.073060e+04	5.290102e+03
AVALANCHE	3721800	386	3721800	0	9.641969e+03	9.641969e+03	0.000000e+00

Based on the df_tidy_q2, let’s visualize it in a graph.

# Creating a auxiliary dataset to plot 1.

rbind(
    
    # Gathering the dataset to have absolute data in rows.
    df_tidy_q2 %>%
    select(EVTYPE, PROPDMGABS, CROPDMGABS) %>%
    mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE),
           PROPDMGABS = PROPDMGABS/1000000,
           CROPDMGABS = CROPDMGABS/1000000) %>%
    pivot_longer(cols = 2:3, names_to = "EXPENSES", values_to = "VALUES") %>%
    mutate(VALUE_TYPE = factor("ABSOLUTE")),
    
    # Gathering the dataset to average have average data in rows.
    df_tidy_q2 %>%
    arrange(desc(EXPENSES_AVG)) %>%
    select(EVTYPE, PROPDMGABS_AVG, CROPDMGABS_AVG) %>%
    mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE),
           PROPDMGABS_AVG = PROPDMGABS_AVG/1000000,
           CROPDMGABS_AVG = CROPDMGABS_AVG/1000000) %>%
    pivot_longer(cols = 2:3, names_to = "EXPENSES", values_to = "VALUES")%>%
    mutate(VALUE_TYPE = factor("AVERAGE"))) %>%
    
    # Plotting the graph using GGPLOT2
    ggplot(aes(fill = EXPENSES, y = VALUES, x = EVTYPE)) + 
    
    # Defining a stacked barplot.
    geom_bar(position = "stack", stat = "identity") +
    
    # Defining the graph title.
    ggplot2::labs(title = "Economic Consequence of Event Types") + 
    
    # Defining the x-axis label
    ggplot2::xlab("Event Type") +
    
    # Defining the y-axis label
    ggplot2::ylab("Economic Impact [Million USD]") +
    
    # Adjusting the x-axis to be 90 degrees.
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
          legend.position = "right",
          plot.title = element_text(hjust = 0.5),
          plot.subtitle = element_text(hjust = 0.5)) +
    
    # Removing the legend title.
        ggplot2::guides(fill = ggplot2::guide_legend(title = "")) + 
    
    # Creating facet to show absolute and average values.
    facet_grid(rows = vars(VALUE_TYPE), scales = "free")

Remarks:

FLOOD is the event with the greatest economic consequence (most of it related to Properties Damages). Between 1950 and 2011, it causes Damages to Properties and Crops that reach more than USD 180 billion. However, the HURRICANE is one event type with the highest average in terms of damage. On average, one hurricane’s outcome is around USD 300 million in damages.

Finally, it is also important to point out that EXCESSIVE HEAT is the worst event related to Crops, which is almost USD 15.8 billion in Damages to Crops.

Module 5: Reproducible Research - Course Project 2

Roddy Mendoza Marriott

2025-06-19

1.Introduction

2.Data

3.Questions to be answered

4.Loading libraries

5.Getting the data

5.1 Creating a data directory

5.2 Loading the data

6 Some EDA on the data

6.1 Dimensions

6.2 Head and Tail

6.3 Variables

6.3 `NA` Presence

6.4 Exponential variables´s treatment

6.5 Grouping data by `Event type`

7 Working on the data

7.1 Subsseting the raw data

7.2 Merging damage and exponential variables

7.3 Uniform `Event type`

7.4 Getting the `Tidy Data`

7.5 Getting the `Results`

Remarks:

Question 2:

Remarks:

Module 5: Reproducible Research - Course Project 2

Roddy Mendoza Marriott

2025-06-19

1.Introduction

2.Data

3.Questions to be answered

4.Loading libraries

5.Getting the data

5.1 Creating a data directory

5.2 Loading the data

6 Some EDA on the data

6.1 Dimensions

6.2 Head and Tail

6.3 Variables

6.3 NA Presence

6.4 Exponential variables´s treatment

6.5 Grouping data by Event type

7 Working on the data

7.1 Subsseting the raw data

7.2 Merging damage and exponential variables

7.3 Uniform Event type

7.4 Getting the Tidy Data

7.5 Getting the Results

Remarks:

Question 2:

Remarks:

6.3 `NA` Presence

6.5 Grouping data by `Event type`

7.3 Uniform `Event type`

7.4 Getting the `Tidy Data`

7.5 Getting the `Results`