Storms and other severe weather events can cause both public health and economic problems for communities and municipalities. Many severe events can result in fatalities, injuries, and property damage, and preventing such outcomes to the extent possible is a key concern.
This project involves exploring the U.S. National Oceanic and Atmospheric Administration’s (NOAA) storm database. This database tracks characteristics of major storms and weather events in the United States, including when and where they occur, as well as estimates of any fatalities, injuries, and property damage.
The data for this assignment come in the form of a comma-separated-value file compressed via the bzip2 algorithm to reduce its size. You can download the file from the course web site:
There is also some documentation of the database available. Here you will find how some of the variables are constructed/defined.
National Weather Service Storm Data Documentation
National Climatic Data Center Storm Events
The events in the database start in the year 1950 and end in November 2011. In the earlier years of the database there are generally fewer events recorded, most likely due to a lack of good records. More recent years should be considered more complete.
The data analysis must address the following questions:
Across the United States, which types of events (as indicated in
the EVTYPE start color red, start verbatim,
EVTYPE, end verbatim, end color red variable) are most
harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
library(ggplot2)
library(tidyverse)
library(rmarkdown)
library(magrittr)
library(kableExtra)
library(DT)
# 1. Create a data directory
if(!base::file.exists("data")) {
base::dir.create("data")
}
# 2. Download files and store it in data directory.
if(!base::file.exists("./data/repdata_data_StormData.csv.bz2")){
utils::download.file(url = "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2",
destfile = "./data/repdata_data_StormData.csv.bz2")
}
# 1. the raw data
raw_data <- utils::read.csv(file = "./data/repdata_data_StormData.csv.bz2")
# 2. converting into a data frame
raw_data <- dplyr::tbl_df(raw_data)
This section presents a few remarks in the EDA on the data from NOAA.
base::dim(raw_data)
## [1] 902297 37
Printing the first 5 rows from dataset
# printing the first 5 rows from raw data
head(raw_data,5)
## # A tibble: 5 × 37
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 1 4/18/1950… 0130 CST 97 MOBILE AL TORNA… 0
## 2 1 4/18/1950… 0145 CST 3 BALDWIN AL TORNA… 0
## 3 1 2/20/1951… 1600 CST 57 FAYETTE AL TORNA… 0
## 4 1 6/8/1951 … 0900 CST 89 MADISON AL TORNA… 0
## 5 1 11/15/195… 1500 CST 43 CULLMAN AL TORNA… 0
## # ℹ 28 more variables: BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>,
## # END_TIME <chr>, COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>,
## # END_AZI <chr>, END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>,
## # MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>,
## # PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>,
## # STATEOFFIC <chr>, ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
## # LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>
Printing the last 5 rows from dataset
# printing the last 5 rows from raw data
tail(raw_data,5)
## # A tibble: 5 × 37
## STATE__ BGN_DATE BGN_TIME TIME_ZONE COUNTY COUNTYNAME STATE EVTYPE BGN_RANGE
## <dbl> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
## 1 56 11/30/201… 10:30:0… MST 7 WYZ007 - … WY HIGH … 0
## 2 30 11/10/201… 02:48:0… MST 9 MTZ009 - … MT HIGH … 0
## 3 2 11/8/2011… 02:58:0… AKS 213 AKZ213 AK HIGH … 0
## 4 2 11/9/2011… 10:21:0… AKS 202 AKZ202 AK BLIZZ… 0
## 5 1 11/28/201… 08:00:0… CST 6 ALZ006 AL HEAVY… 0
## # ℹ 28 more variables: BGN_AZI <chr>, BGN_LOCATI <chr>, END_DATE <chr>,
## # END_TIME <chr>, COUNTY_END <dbl>, COUNTYENDN <lgl>, END_RANGE <dbl>,
## # END_AZI <chr>, END_LOCATI <chr>, LENGTH <dbl>, WIDTH <dbl>, F <int>,
## # MAG <dbl>, FATALITIES <dbl>, INJURIES <dbl>, PROPDMG <dbl>,
## # PROPDMGEXP <chr>, CROPDMG <dbl>, CROPDMGEXP <chr>, WFO <chr>,
## # STATEOFFIC <chr>, ZONENAMES <chr>, LATITUDE <dbl>, LONGITUDE <dbl>,
## # LATITUDE_E <dbl>, LONGITUDE_ <dbl>, REMARKS <chr>, REFNUM <dbl>
The dataset from NOAA has 37 variables, as wee seen using the
str function
utils::str(raw_data)
## tibble [902,297 × 37] (S3: tbl_df/tbl/data.frame)
## $ STATE__ : num [1:902297] 1 1 1 1 1 1 1 1 1 1 ...
## $ BGN_DATE : chr [1:902297] "4/18/1950 0:00:00" "4/18/1950 0:00:00" "2/20/1951 0:00:00" "6/8/1951 0:00:00" ...
## $ BGN_TIME : chr [1:902297] "0130" "0145" "1600" "0900" ...
## $ TIME_ZONE : chr [1:902297] "CST" "CST" "CST" "CST" ...
## $ COUNTY : num [1:902297] 97 3 57 89 43 77 9 123 125 57 ...
## $ COUNTYNAME: chr [1:902297] "MOBILE" "BALDWIN" "FAYETTE" "MADISON" ...
## $ STATE : chr [1:902297] "AL" "AL" "AL" "AL" ...
## $ EVTYPE : chr [1:902297] "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ BGN_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ BGN_AZI : chr [1:902297] "" "" "" "" ...
## $ BGN_LOCATI: chr [1:902297] "" "" "" "" ...
## $ END_DATE : chr [1:902297] "" "" "" "" ...
## $ END_TIME : chr [1:902297] "" "" "" "" ...
## $ COUNTY_END: num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ COUNTYENDN: logi [1:902297] NA NA NA NA NA NA ...
## $ END_RANGE : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ END_AZI : chr [1:902297] "" "" "" "" ...
## $ END_LOCATI: chr [1:902297] "" "" "" "" ...
## $ LENGTH : num [1:902297] 14 2 0.1 0 0 1.5 1.5 0 3.3 2.3 ...
## $ WIDTH : num [1:902297] 100 150 123 100 150 177 33 33 100 100 ...
## $ F : int [1:902297] 3 2 2 2 2 2 2 1 3 3 ...
## $ MAG : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ FATALITIES: num [1:902297] 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num [1:902297] 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num [1:902297] 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr [1:902297] "K" "K" "K" "K" ...
## $ CROPDMG : num [1:902297] 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr [1:902297] "" "" "" "" ...
## $ WFO : chr [1:902297] "" "" "" "" ...
## $ STATEOFFIC: chr [1:902297] "" "" "" "" ...
## $ ZONENAMES : chr [1:902297] "" "" "" "" ...
## $ LATITUDE : num [1:902297] 3040 3042 3340 3458 3412 ...
## $ LONGITUDE : num [1:902297] 8812 8755 8742 8626 8642 ...
## $ LATITUDE_E: num [1:902297] 3051 0 0 0 0 ...
## $ LONGITUDE_: num [1:902297] 8806 0 0 0 0 ...
## $ REMARKS : chr [1:902297] "" "" "" "" ...
## $ REFNUM : num [1:902297] 1 2 3 4 5 6 7 8 9 10 ...
NA Presence# Printing the summ of LATITUDE
summary(raw_data$LATITUDE)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 2802 3540 2875 4019 9706 47
# Printing the summ of Latitude_e
summary(raw_data$LATITUDE_E)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 0 0 0 1452 3549 9706 40
The PROPDMGEXP and CROPDMGEXP(Properties
Exponential and Crop Exponential) variables store the base 10
exponential value of PROPDMG and CROPDMG,
respectively. For this reason, it is necessary to combine
PROPDMG and PROPDMGEXP into a single variable
and do the same to CROPDMG and CROPDMGEXP
Therefore, it is essential to understand how the
PROPDMGEXP and CROPDMGEXP were recorded.
# Printing the unique values to exponential to properties
unique(raw_data$PROPDMGEXP)
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
# Printing the unique values to exponential to Crops.
unique(raw_data$CROPDMGEXP)
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
Finally, the notation is not uniform because sometimes they use
Capital or Lowercase (e.g., K or k). In other
cases, they use a number or insert a question mark (?) or
plus/minus signal (e.g., - or +).
*PROPDMGEXP
?: Has 8 (eight) observations with this notation, none
of the rows of PROPDM and CROPDMG has
values;+: Has 5 (five) observations with this notation,
and;-: Has 1 (one) observation with this notation and 1
(one) fatality. I will consider like “0”.*CROPDMGEXP
?: Has 7 (seven) observations with this notation, all
those with CROPDMG as zero;Event typeThe raw dataset has 985 Event Types, the code below will present the first 10.
raw_data %>%
group_by(EVTYPE) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>%
head(10)
## # A tibble: 10 × 2
## EVTYPE n
## <chr> <int>
## 1 HAIL 288661
## 2 TSTM WIND 219940
## 3 THUNDERSTORM WIND 82563
## 4 TORNADO 60652
## 5 FLASH FLOOD 54277
## 6 FLOOD 25326
## 7 THUNDERSTORM WINDS 20843
## 8 HIGH WIND 20212
## 9 LIGHTNING 15754
## 10 HEAVY SNOW 15708
There are some troubles working on these date variables:
BGN_DATE, END_DATE, and END_TIME,
helpfully we are not using these variables as a part of the course
project.
Keeping up with the main goal of this project, the dataset will be subsetted as follows:
As we move on the project, we are looking for the most harmful events concerning in: population health and the economic consequences (loses in USD american dollars).
In that order, the data that will be subsetted from the
raw_data will be:
*Variables:
+EVTYPE: Event type;
+INJURIES: Quantity of injuries;
+FATALITIES: Quantity of fatalities;
+PROPDMG: Amount in USD american dollars in Damages in
properties;
+CROPDMG: Amount in USD american dollars in Damages in
crops;
+PROPDMGEXP: he exponential in base 10 used to notate
the PROPDMG;
+CROPDMGEXP: The exponential in base 10 used to notate
the CROPDMG
# Subsetting the dataset
df_subset <- select(raw_data,EVTYPE,INJURIES,FATALITIES,PROPDMG,PROPDMGEXP,CROPDMG,CROPDMGEXP)
It is necessary to merge PROPDMG and
PROPDMGEXP into a single variable, so I will create the
PROPDMGABS using the following rules.
*Adjusted variables:
+All 0, and - will be converted into 1;
+All 1 will be converted into 10;
+All H, h and 2 will be
converted into 100;
+All K, k and 3 will be
converted into 1000;
+All 4 will be converted into 10000;
+All 5 will be converted into 100000;
+All M, m and 6 will be
converted into 1000000;
+All 7 will be converted into 10000000;
+All 8 will be converted into 100000000;
+All B, b and 9 will be
converted into 1000000000;
Following the same principles, I will merge CROPDMG and
CROPDMG to create CROPDMGABS:
*Adjusted variables:
+All 0, and ? will be converted into 1;
+All H,h and 2 will be
converted into 100;
+All K, k and 3 will be
converted into 1000;
+All M, m and 6 will be
converted into 1000000;
+All B, b and 9 will be
converted into 1000000000
The PROPDMGABS and CROPDMGABS will be in
charge of storing the absolute values of Properties Damages and Crops
Damages.
# Merging exponential column and value column.
df_subset_2 <- df_subset %>%
# Creating a new column PROPDMGABS
mutate(PROPDMGABS = case_when(
PROPDMGEXP %in% c("", "-",0) ~ PROPDMG * 1,
PROPDMGEXP %in% c(1) ~ PROPDMG * 10,
PROPDMGEXP %in% c("H","h",2) ~ PROPDMG * 100,
PROPDMGEXP %in% c("K","k",3) ~ PROPDMG * 1000,
PROPDMGEXP %in% c(4) ~ PROPDMG * 10000,
PROPDMGEXP %in% c(5) ~ PROPDMG * 100000,
PROPDMGEXP %in% c("M","m",6) ~ PROPDMG * 1000000,
PROPDMGEXP %in% c(7) ~ PROPDMG * 10000000,
PROPDMGEXP %in% c(8) ~ PROPDMG * 100000000,
PROPDMGEXP %in% c("B","b",9) ~ PROPDMG * 1000000000)) %>%
# Creating a new column CROPDMGABS
mutate(CROPDMGABS = case_when(
CROPDMGEXP %in% c("?","", 0) ~ CROPDMG * 1,
CROPDMGEXP %in% c("H", "h", 2) ~ CROPDMG * 100,
CROPDMGEXP %in% c("K", "k", 3) ~ CROPDMG * 1000,
CROPDMGEXP %in% c("M", "m", 6) ~ CROPDMG * 1000000,
CROPDMGEXP %in% c("B", "b", 9) ~ CROPDMG * 1000000000))
Event typeThe EVTYPE variable is not tidy because there are a lot
of classifications with the same meaning but different notations. Hence,
I will standardize this variable following the rules inside of the next
chunk.
# Standardizing the EVTYPE variable.
df_subset_3 <- df_subset_2 %>%
# Converting all value to be UPPERCASE.
mutate(EVTYPE = stringr::str_to_upper(EVTYPE)) %>%
# Rules to aggregate events into a new variable called EVTYPE_2
mutate(EVTYPE_2 = case_when(
grepl(pattern = "*HURRICANE*|*TYPHOON*", x = EVTYPE) ~ "HURRICANE",
grepl(pattern = "*TORNADO*", x = EVTYPE) ~ "TORNADO",
grepl(pattern = "*THUNDERSTORM WIN*|*TSTM*|*FUNNEL CLOUD*|*FUNNEL*|*THUNDERSTORM*", x = EVTYPE) ~ "THUNDERSTORM WIND",
grepl(pattern = "*FLOOD*|*FLD*", x = EVTYPE) ~ "FLOOD",
grepl(pattern = "*HAIL*", x = EVTYPE) ~ "HAIL",
grepl(pattern = "*HIGH WIND*|*STRONG WIND*|WIND", x = EVTYPE) ~ "HIGH WIND",
grepl(pattern = "*FOREST FIRE*|*WILD FIRES*|*WILDFIRE*", x = EVTYPE) ~ "FOREST FIRE",
grepl(pattern = "*LIGHTNING*", x = EVTYPE) ~ "LIGHTNING",
grepl(pattern = "*HEAT*|*DROUGHT*|*EXCESSIVE HEAT*|*RECORD HIGH*|*WARMTH*|*UNSEASONABLY WARM*|*UNSEASONABLY DRY*|*TEMPERATURE RECORD*|*RECORD TEMPERATURE*|*DRY*", x = EVTYPE) ~ "EXCESSIVE HEAT",
grepl(pattern = "*WINTER WEATHER*|*FROST/FREEZE*|*EXTREME COLD*|*COLD*|*ICE*|*FREEZE*|*FROST*|*WINTRY MIX*|*LOW TEMPERATURE*", x = EVTYPE) ~ "EXCESSIVE COLD",
grepl(pattern = "*WINTER STORM*|*SLEET*|*SNOW*|*BLIZZARD*|*FREEZING RAIN*", x = EVTYPE) ~ "ICE STORM",
grepl(pattern = "*LANDSLIDE*|*MUDSLIDE*|*MUD SLIDE*", x = EVTYPE) ~ "LANDSLIDE",
grepl(pattern = "*AVALANCHE*", x = EVTYPE) ~ "AVALANCHE",
grepl(pattern = "*FOG*|*SMOKE*", x = EVTYPE) ~ "FOG",
grepl(pattern = "*DUST*", x = EVTYPE) ~ "DUST",
grepl(pattern = "*CURRENT*|*SURF*", x = EVTYPE) ~ "CURRENT",
grepl(pattern = "*TSUNAMI*", x = EVTYPE) ~ "TSUNAMI",
grepl(pattern = "*WATERSPOUT*", x = EVTYPE) ~ "WATERSPOUT")) %>%
# Creating a OTHER type of event to aggregate minor events.
mutate(EVTYPE_2 = case_when(
is.na(EVTYPE_2) ~ "OTHER",
!is.na(EVTYPE_2) ~ EVTYPE_2)) %>%
# Dropping EVTYPE because it is not necessary to carry it.
select(-EVTYPE) %>%
# Renaming the EVTYPE_2 to be the new EVTYPE
mutate(EVTYPE = EVTYPE_2)
After the standardization, the new EVTYPE has 19
categories.
# Printing the standards of Event Type
unique(df_subset_3$EVTYPE)
## [1] "TORNADO" "THUNDERSTORM WIND" "HAIL"
## [4] "EXCESSIVE COLD" "ICE STORM" "FLOOD"
## [7] "HURRICANE" "EXCESSIVE HEAT" "LIGHTNING"
## [10] "FOG" "CURRENT" "HIGH WIND"
## [13] "OTHER" "WATERSPOUT" "AVALANCHE"
## [16] "DUST" "FOREST FIRE" "LANDSLIDE"
## [19] "TSUNAMI"
Tidy DataThe Tidy Data is result of combining:
EVTYPE, INJURIES, FATALITIES,
PROPDMGABS and CROPDMGABS
# Creating the tidy dataset.
df_tidy <- df_subset_3 %>%
# Dropping the columns: PROPDMGEXP, PROPDMG, CROPDMGEXP, CROPDMG
select(-c(PROPDMGEXP, PROPDMG, CROPDMGEXP, CROPDMG))
Results#Question 1:
Across the United States, which types of events (as indicated in the
EVTYPE start color red, start verbatim,
EVTYPE, end verbatim, end color red variable) are most
harmful with respect to population health?
As first option, I define harmful as the summation of:
INJURIES and FATALITIES for each
EVTYPE (as it was defined before)
In that order, the next step was creating the following variables:
*New variables:
+INJURIES_AVG: average of injuries as a proportion of
its quantities;
+FATALITIES_AVG: average of fatalities as a proportion
of its quantities;
+HARMFUL_AVG: average of harmful as a proportion of its
quantities;
# This subset will creates averages of each EVTYPE and add the Harmful variable.
df_tidy_q1 <- df_tidy %>%
# Creating the HARMFUL variable.
mutate(HARMFUL = INJURIES + FATALITIES) %>%
# Aggregate everything using EVTYPE.
group_by(EVTYPE) %>%
# Summarizes the data using the group by EVTYPE.
summarise(HARMFUL = sum(HARMFUL),
QTY = n(),
FATALITIES = sum(FATALITIES),
INJURIES = sum(INJURIES)) %>%
# Arrange to be descendant.
arrange(desc(HARMFUL)) %>%
# Adding averages variables.
mutate(HARMFUL_AVG = HARMFUL/QTY,
FATALITIES_AVG = FATALITIES/QTY,
INJURIES_AVG = INJURIES/QTY)
The adjusted dataset will be shown below:
# Printing the table
head(df_tidy_q1, 19) %>% kableExtra::kbl() %>% kableExtra::kable_styling()
| EVTYPE | HARMFUL | QTY | FATALITIES | INJURIES | HARMFUL_AVG | FATALITIES_AVG | INJURIES_AVG |
|---|---|---|---|---|---|---|---|
| TORNADO | 97068 | 60700 | 5661 | 91407 | 1.5991433 | 0.0932619 | 1.5058814 |
| EXCESSIVE HEAT | 14132 | 33788 | 3467 | 10665 | 0.4182550 | 0.1026104 | 0.3156446 |
| THUNDERSTORM WIND | 10276 | 343796 | 729 | 9547 | 0.0298898 | 0.0021204 | 0.0277694 |
| FLOOD | 10236 | 86133 | 1553 | 8683 | 0.1188395 | 0.0180303 | 0.1008092 |
| LIGHTNING | 6048 | 15763 | 817 | 5231 | 0.3836833 | 0.0518302 | 0.3318531 |
| EXCESSIVE COLD | 3972 | 14341 | 483 | 3489 | 0.2769681 | 0.0336797 | 0.2432885 |
| HIGH WIND | 2690 | 28173 | 696 | 1994 | 0.0954815 | 0.0247045 | 0.0707770 |
| ICE STORM | 2565 | 15973 | 335 | 2230 | 0.1605835 | 0.0209729 | 0.1396106 |
| FOREST FIRE | 1696 | 4232 | 90 | 1606 | 0.4007561 | 0.0212665 | 0.3794896 |
| HURRICANE | 1468 | 299 | 135 | 1333 | 4.9096990 | 0.4515050 | 4.4581940 |
| CURRENT | 1442 | 1937 | 712 | 730 | 0.7444502 | 0.3675787 | 0.3768715 |
| HAIL | 1386 | 289281 | 15 | 1371 | 0.0047912 | 0.0000519 | 0.0047393 |
| FOG | 1156 | 1855 | 80 | 1076 | 0.6231806 | 0.0431267 | 0.5800539 |
| DUST | 507 | 586 | 24 | 483 | 0.8651877 | 0.0409556 | 0.8242321 |
| AVALANCHE | 394 | 386 | 224 | 170 | 1.0207254 | 0.5803109 | 0.4404145 |
| OTHER | 344 | 546 | 44 | 300 | 0.6300366 | 0.0805861 | 0.5494505 |
| TSUNAMI | 162 | 20 | 33 | 129 | 8.1000000 | 1.6500000 | 6.4500000 |
| LANDSLIDE | 99 | 643 | 44 | 55 | 0.1539658 | 0.0684292 | 0.0855365 |
| WATERSPOUT | 32 | 3845 | 3 | 29 | 0.0083225 | 0.0007802 | 0.0075423 |
Once, based on the Tidy Data on the
df_tidy_q1 dataframe the plot shows as follows:
# Creating a auxiliary dataset to plot 1.
# It will be necessary to put the dataset in such a way to be helpful to ggplot2.
rbind(
# Gathering the dataset to have absolute data in rows.
df_tidy_q1 %>%
select(EVTYPE, FATALITIES, INJURIES) %>%
mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE)) %>%
pivot_longer(cols = 2:3, names_to = "HARMFUL", values_to = "VALUES") %>%
mutate(VALUE_TYPE = factor("ABSOLUTE")),
# Gathering the dataset to average have average data in rows.
df_tidy_q1 %>%
arrange(desc(HARMFUL_AVG)) %>%
select(EVTYPE, FATALITIES_AVG, INJURIES_AVG) %>%
mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE)) %>%
pivot_longer(cols = 2:3, names_to = "HARMFUL", values_to = "VALUES")%>%
mutate(VALUE_TYPE = factor("AVERAGE"))) %>%
# Plotting the graph using GGPLOT2
ggplot(aes(fill = HARMFUL, y = VALUES, x = EVTYPE)) +
# Defining a stacked barplot.
geom_bar(position = "stack", stat = "identity") +
# Defining the graph title.
ggplot2::labs(title = "Most Harmful Event Type") +
# Defining the x-axis label
ggplot2::xlab("Event Type") +
# Defining the y-axis label
ggplot2::ylab("Health Harmful") +
# Adjusting the x-axis to be 90 degrees.
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
legend.position = "right",
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
# Removing the legend title.
ggplot2::guides(fill = ggplot2::guide_legend(title = "")) +
# Creating facet to show absolute and average values.
facet_grid(rows = vars(VALUE_TYPE), scales = "free")
In term of absolute value, there are no doubt that
TORNADO events are the most harmful to health due to the
high frequency of tornadoes. The number of tornadoes observed in the
dataset is more than 60 thousand (between 1950 and 2011), which results
in 5.661 FATALITIES, 91.407 INJURIES, and
97.068 health HARMFUL.
However, whether you analyze it in terms of average, in that case,
the TSUNAMI event type has the highest averages of
INJURIES and FATALITIES, resulting in
increased health HARMFUL. The meaning of seeing the
average is to identify the most deadly event type.
Across the United States, which types of events have the greatest economic consequences?
As was response in question 1, the next step was defining a new
variable for economic consequences, as the result of
aggregating properties damages: Properties Damages
(PROPDMGABS), and Crop Damage
(CROPDMGABS).
# This subset will creates averages of each EVTYPE and add the Harmful variable.
df_tidy_q2 <- df_tidy %>%
# Creating the HARMFUL variable.
mutate(EXPENSES = PROPDMGABS + CROPDMGABS) %>%
# Aggregate everything using EVTYPE.
group_by(EVTYPE) %>%
# Summarizes the data using the group by EVTYPE.
summarise(EXPENSES = sum(EXPENSES, na.rm = TRUE),
QTY = n(),
PROPDMGABS = sum(PROPDMGABS, na.rm = TRUE),
CROPDMGABS = sum(CROPDMGABS, na.rm = TRUE)) %>%
# Arrange to be descendant.
arrange(desc(EXPENSES)) %>%
# Adding averages variables.
mutate(EXPENSES_AVG = EXPENSES/QTY,
PROPDMGABS_AVG = PROPDMGABS/QTY,
CROPDMGABS_AVG = CROPDMGABS/QTY)
The adjusted dataset will be shown below.
# Printing the table from dataframe
head(df_tidy_q2, 19) %>% kableExtra::kbl() %>% kableExtra::kable_styling()
| EVTYPE | EXPENSES | QTY | PROPDMGABS | CROPDMGABS | EXPENSES_AVG | PROPDMGABS_AVG | CROPDMGABS_AVG |
|---|---|---|---|---|---|---|---|
| FLOOD | 180658667663 | 86133 | 168270100463 | 12388567200 | 2.097438e+06 | 1.953608e+06 | 1.438307e+05 |
| HURRICANE | 90872527810 | 299 | 85356410010 | 5516117800 | 3.039215e+08 | 2.854729e+08 | 1.844855e+07 |
| TORNADO | 59020779387 | 60700 | 58603317867 | 417461520 | 9.723357e+05 | 9.654583e+05 | 6.877455e+03 |
| CURRENT | 48056407000 | 1937 | 48055552000 | 855000 | 2.480971e+07 | 2.480927e+07 | 4.414042e+02 |
| EXCESSIVE HEAT | 21065038320 | 33788 | 5257925140 | 15807113180 | 6.234473e+05 | 1.556152e+05 | 4.678322e+05 |
| EXCESSIVE COLD | 21015496360 | 14341 | 11877309560 | 9138186800 | 1.465414e+06 | 8.282065e+05 | 6.372071e+05 |
| HAIL | 19024452136 | 289281 | 15977564513 | 3046887623 | 6.576461e+04 | 5.523199e+04 | 1.053262e+04 |
| THUNDERSTORM WIND | 12456657288 | 343796 | 11184948300 | 1271708988 | 3.623270e+04 | 3.253368e+04 | 3.699022e+03 |
| FOREST FIRE | 8899845130 | 4232 | 8496563500 | 403281630 | 2.102988e+06 | 2.007695e+06 | 9.529339e+04 |
| ICE STORM | 7551932751 | 15973 | 7412418751 | 139514000 | 4.727936e+05 | 4.640593e+05 | 8.734364e+03 |
| HIGH WIND | 7048888488 | 28173 | 6268408088 | 780480400 | 2.502001e+05 | 2.224970e+05 | 2.770313e+04 |
| LIGHTNING | 952534520 | 15763 | 940442430 | 12092090 | 6.042850e+04 | 5.966139e+04 | 7.671186e+02 |
| LANDSLIDE | 346693100 | 643 | 326676100 | 20017000 | 5.391806e+05 | 5.080499e+05 | 3.113064e+04 |
| OTHER | 169455250 | 546 | 10666300 | 158788950 | 3.103576e+05 | 1.953535e+04 | 2.908223e+05 |
| TSUNAMI | 144082000 | 20 | 144062000 | 20000 | 7.204100e+06 | 7.203100e+06 | 1.000000e+03 |
| FOG | 22929500 | 1855 | 22929500 | 0 | 1.236092e+04 | 1.236092e+04 | 0.000000e+00 |
| WATERSPOUT | 9564200 | 3845 | 9564200 | 0 | 2.487438e+03 | 2.487438e+03 | 0.000000e+00 |
| DUST | 9388130 | 586 | 6288130 | 3100000 | 1.602070e+04 | 1.073060e+04 | 5.290102e+03 |
| AVALANCHE | 3721800 | 386 | 3721800 | 0 | 9.641969e+03 | 9.641969e+03 | 0.000000e+00 |
Based on the df_tidy_q2, let’s visualize it in a
graph.
# Creating a auxiliary dataset to plot 1.
rbind(
# Gathering the dataset to have absolute data in rows.
df_tidy_q2 %>%
select(EVTYPE, PROPDMGABS, CROPDMGABS) %>%
mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE),
PROPDMGABS = PROPDMGABS/1000000,
CROPDMGABS = CROPDMGABS/1000000) %>%
pivot_longer(cols = 2:3, names_to = "EXPENSES", values_to = "VALUES") %>%
mutate(VALUE_TYPE = factor("ABSOLUTE")),
# Gathering the dataset to average have average data in rows.
df_tidy_q2 %>%
arrange(desc(EXPENSES_AVG)) %>%
select(EVTYPE, PROPDMGABS_AVG, CROPDMGABS_AVG) %>%
mutate(EVTYPE = factor(EVTYPE, levels = EVTYPE),
PROPDMGABS_AVG = PROPDMGABS_AVG/1000000,
CROPDMGABS_AVG = CROPDMGABS_AVG/1000000) %>%
pivot_longer(cols = 2:3, names_to = "EXPENSES", values_to = "VALUES")%>%
mutate(VALUE_TYPE = factor("AVERAGE"))) %>%
# Plotting the graph using GGPLOT2
ggplot(aes(fill = EXPENSES, y = VALUES, x = EVTYPE)) +
# Defining a stacked barplot.
geom_bar(position = "stack", stat = "identity") +
# Defining the graph title.
ggplot2::labs(title = "Economic Consequence of Event Types") +
# Defining the x-axis label
ggplot2::xlab("Event Type") +
# Defining the y-axis label
ggplot2::ylab("Economic Impact [Million USD]") +
# Adjusting the x-axis to be 90 degrees.
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1),
legend.position = "right",
plot.title = element_text(hjust = 0.5),
plot.subtitle = element_text(hjust = 0.5)) +
# Removing the legend title.
ggplot2::guides(fill = ggplot2::guide_legend(title = "")) +
# Creating facet to show absolute and average values.
facet_grid(rows = vars(VALUE_TYPE), scales = "free")
FLOOD is the event with the greatest economic
consequence (most of it related to Properties Damages). Between 1950 and
2011, it causes Damages to Properties and Crops that reach more than USD
180 billion. However, the HURRICANE is one event type with
the highest average in terms of damage. On average, one hurricane’s
outcome is around USD 300 million in damages.
Finally, it is also important to point out that
EXCESSIVE HEAT is the worst event related to Crops, which
is almost USD 15.8 billion in Damages to Crops.