The goal of this assignment is to explore the NOAA Storm Database and answer some basic questions about severe weather events. Specifically, we are going to try to answer two questions:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
Load packages necessary for the analysis and show session info
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
sessionInfo()
## R version 4.0.5 (2021-03-31)
## Platform: i386-w64-mingw32/i386 (32-bit)
## Running under: Windows 10 x64 (build 19043)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Canada.1252 LC_CTYPE=English_Canada.1252
## [3] LC_MONETARY=English_Canada.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Canada.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] ggplot2_3.3.5 dplyr_1.0.7
##
## loaded via a namespace (and not attached):
## [1] knitr_1.32 magrittr_2.0.1 munsell_0.5.0 tidyselect_1.1.0
## [5] colorspace_2.0-0 R6_2.5.0 rlang_0.4.10 fansi_0.4.2
## [9] stringr_1.4.0 tools_4.0.5 grid_4.0.5 gtable_0.3.0
## [13] xfun_0.22 utf8_1.2.1 DBI_1.1.1 withr_2.4.1
## [17] jquerylib_0.1.4 htmltools_0.5.1.1 ellipsis_0.3.1 assertthat_0.2.1
## [21] yaml_2.2.1 digest_0.6.27 tibble_3.1.0 lifecycle_1.0.0
## [25] crayon_1.4.1 purrr_0.3.4 vctrs_0.3.7 glue_1.4.2
## [29] evaluate_0.14 rmarkdown_2.11 stringi_1.7.6 compiler_4.0.5
## [33] pillar_1.6.0 scales_1.1.1 generics_0.1.0 pkgconfig_2.0.3
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
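# The file is bzip2-compressed; read.csv() decompresses it transparently
download.file(url, destfile = "dataset.csv.bz2", mode = "wb")
dataset_NOAA_original <- read.csv("dataset.csv.bz2", na.strings = "?")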
The variables in the dataset that are relevant to the analysis questions are:
EVTYPE: Event types e.g. Tornado, Flood.
FATALITIES: Number of fatalities caused.
INJURIES: Number of injuries caused.
PROPDMG: Property damage caused.
PROPDMGEXP: Alphabetical characters used to signify magnitude for property damage include “K” for thousands, “M” for millions, and “B” for billions.
CROPDMG: Crop damage caused.
CROPDMGEXP: Alphabetical characters used to signify magnitude for crop damage.
Therefore, I created a subset of the original data with these variables.
NOAA_clean <- dataset_NOAA_original[,c('EVTYPE','FATALITIES','INJURIES', 'PROPDMG', 'PROPDMGEXP', 'CROPDMG', 'CROPDMGEXP')]
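To make the role of the magnitude columns concrete, here is a minimal sketch on made-up values (not rows from the dataset) showing how a damage figure and its exponent code combine into a dollar amount. The `multipliers` lookup and the `toy` data frame are illustrative assumptions; codes other than K, M, and B are treated as a multiplier of 1, mirroring the approach used later in this report.

```r
# Toy data: hypothetical values, not rows from the NOAA dataset
toy <- data.frame(
  PROPDMG    = c(25, 2.5, 1.2, 10),
  PROPDMGEXP = c("K", "M", "B", ""),
  stringsAsFactors = FALSE
)

# Lookup table from exponent code to numeric multiplier (assumed codes only)
multipliers <- c(K = 1e3, M = 1e6, B = 1e9)

# Codes missing from the lookup (e.g. "") yield NA, so fall back to 1
mult <- unname(multipliers[toy$PROPDMGEXP])
mult[is.na(mult)] <- 1

# Damage in dollars = reported figure times its multiplier
toy$PROPDMGVALUE <- toy$PROPDMG * mult
print(toy)
```

For example, a `PROPDMG` of 25 with a `PROPDMGEXP` of "K" corresponds to 25,000 dollars.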
Check the structure of the data set and prepare it for analysis.
There are some inconsistencies in the variable EVTYPE: for example, the same event is sometimes typed as Tornado and sometimes as Torndao. I correct some of these inconsistencies in a new variable, CATEGORY, so that the original data is still available in EVTYPE.
str(NOAA_clean)
## 'data.frame': 902297 obs. of 7 variables:
## $ EVTYPE : chr "TORNADO" "TORNADO" "TORNADO" "TORNADO" ...
## $ FATALITIES: num 0 0 0 0 0 0 0 0 1 0 ...
## $ INJURIES : num 15 0 2 2 2 6 1 0 14 0 ...
## $ PROPDMG : num 25 2.5 25 2.5 2.5 2.5 2.5 2.5 25 25 ...
## $ PROPDMGEXP: chr "K" "K" "K" "K" ...
## $ CROPDMG : num 0 0 0 0 0 0 0 0 0 0 ...
## $ CROPDMGEXP: chr "" "" "" "" ...
# Convert to upper case for consistent matching
NOAA_clean$EVTYPE <- toupper(NOAA_clean$EVTYPE)
# Remove extra spaces at the beginning and end of the event type strings
NOAA_clean$EVTYPE <- trimws(NOAA_clean$EVTYPE)
# Remove rows whose event type is a "summary" entry
# (!grepl is safe even when nothing matches, unlike negative indexing with grep)
NOAA_clean <- NOAA_clean[!grepl("SUMMARY", NOAA_clean$EVTYPE), ]
# Add the CATEGORY variable, keeping EVTYPE unchanged
NOAA_clean$CATEGORY <- NOAA_clean$EVTYPE
NOAA_clean$CATEGORY[grepl("VOLCANIC", NOAA_clean$CATEGORY)] <- "VOLCANIC"
NOAA_clean$CATEGORY[grepl("DROUGHT|DRY|DRIEST", NOAA_clean$CATEGORY)] <- "DROUGHT"
NOAA_clean$CATEGORY[grepl("TSTM|THUNDERSTORM", NOAA_clean$CATEGORY)] <- "THUNDERSTORM"
NOAA_clean$CATEGORY[grepl("TORNADO|TORNDAO", NOAA_clean$CATEGORY)] <- "TORNADO"
NOAA_clean$CATEGORY[grepl("WATERSPOUT|WAYTERSPOUT", NOAA_clean$CATEGORY)] <- "WATERSPOUT"
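# Exclude rows already labelled WATERSPOUT, which the WATER pattern would otherwise overwrite
NOAA_clean$CATEGORY[grepl("FLO|RAIN|PRECIP|DROWNING|SHOWER|WATER", NOAA_clean$CATEGORY) & NOAA_clean$CATEGORY != "WATERSPOUT"] <- "FLOODING"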
NOAA_clean$CATEGORY[grepl("SLIDE|AVALANCE|AVALANCHE|LANDSLUMP", NOAA_clean$CATEGORY)] <- "LANDSLIDE"
NOAA_clean$CATEGORY[grepl("HEAT|HOT|EXCESSIVE|WARM|HIGH|RECORD|HYPERTHERMIA", NOAA_clean$CATEGORY)] <- "HEAT"
NOAA_clean$CATEGORY[grepl("LOW|SNOW|ICE|FROST|BLIZZARD|HAIL|COLD|COOL|ICY|HYPOTHERMIA|CHILL|FREEZE|WINTER|FREEZING", NOAA_clean$CATEGORY)] <- "COLD"
NOAA_clean$CATEGORY[grepl("WILDFIRE|FIRE", NOAA_clean$CATEGORY)] <- "WILDFIRE"
NOAA_clean$CATEGORY[grepl("LIGHTNING|LIGHTING|LIGNTNING", NOAA_clean$CATEGORY)] <- "LIGHTNING"
NOAA_clean$CATEGORY[grepl("FOG|VOG|DUST|SMOKE", NOAA_clean$CATEGORY)] <- "VISIBILITY"
NOAA_clean$CATEGORY <- factor(NOAA_clean$CATEGORY)
# Calculation of property and crop damage values
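# toupper() so that lower-case codes such as "k" or "m" are not treated as a multiplier of 1
NOAA_clean$PROPDMGEXP <- recode(toupper(NOAA_clean$PROPDMGEXP), 'K' = 1000, 'M' = 1000000, 'B' = 1000000000, .default = 1)
NOAA_clean$CROPDMGEXP <- recode(toupper(NOAA_clean$CROPDMGEXP), 'K' = 1000, 'M' = 1000000, 'B' = 1000000000, .default = 1)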
NOAA_clean$PROPDMGVALUE <- NOAA_clean$PROPDMG * NOAA_clean$PROPDMGEXP
NOAA_clean$CROPDMGVALUE <- NOAA_clean$CROPDMG * NOAA_clean$CROPDMGEXP
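Because each `grepl()` rule overwrites CATEGORY in place, the order of the rules matters: a later pattern can match a label that an earlier rule has just assigned. The sketch below illustrates the effect on a made-up vector. The helper `categorize()` and its shortened rule sets are hypothetical and only serve to show the behaviour.

```r
# Hypothetical helper: apply (label = pattern) rules in order, overwriting in place
categorize <- function(x, rules) {
  for (label in names(rules)) {
    x[grepl(rules[[label]], x)] <- label
  }
  x
}

# Made-up event types for illustration
events <- c("WAYTERSPOUT", "WATERSPOUT", "FLASH FLOOD", "HEAVY RAIN")

# Here the WATER pattern also matches the freshly assigned "WATERSPOUT" label
overlapping <- c(WATERSPOUT = "WATERSPOUT|WAYTERSPOUT", FLOODING = "FLO|RAIN|WATER")
with_overlap <- categorize(events, overlapping)
print(with_overlap)

# Without WATER in the flooding rule, the two categories stay apart
separate <- c(WATERSPOUT = "WATERSPOUT|WAYTERSPOUT", FLOODING = "FLO|RAIN")
without_overlap <- categorize(events, separate)
print(without_overlap)
```

Running `table(NOAA_clean$CATEGORY)` after the recoding step is a quick way to spot categories that were unexpectedly absorbed by a later rule.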
The health-related calculations come first, followed by the economic calculations.
# Ten categories with the most fatalities
sum_death_by_category <- aggregate(FATALITIES ~ CATEGORY, NOAA_clean, sum)
sum_death_by_category <- head(sum_death_by_category[order(-sum_death_by_category$FATALITIES), ], 10) %>%
  rename(DAMAGE = FATALITIES) %>%
  mutate(TYPE = "FATALITIES")
# Ten categories with the most injuries
sum_injury_by_category <- aggregate(INJURIES ~ CATEGORY, NOAA_clean, sum)
sum_injury_by_category <- head(sum_injury_by_category[order(-sum_injury_by_category$INJURIES), ], 10) %>%
  rename(DAMAGE = INJURIES) %>%
  mutate(TYPE = "INJURIES")
# Combine fatalities and injuries into one dataset
harmful_category <- bind_rows(sum_death_by_category, sum_injury_by_category)
# Ten categories with the greatest crop damage
sum_crop_by_category <- aggregate(CROPDMGVALUE ~ CATEGORY, NOAA_clean, sum)
sum_crop_by_category <- head(sum_crop_by_category[order(-sum_crop_by_category$CROPDMGVALUE), ], 10) %>%
  rename(DAMAGE = CROPDMGVALUE) %>%
  mutate(TYPE = "CROP DAMAGE")
# Ten categories with the greatest property damage
sum_property_by_category <- aggregate(PROPDMGVALUE ~ CATEGORY, NOAA_clean, sum)
sum_property_by_category <- head(sum_property_by_category[order(-sum_property_by_category$PROPDMGVALUE), ], 10) %>%
  rename(DAMAGE = PROPDMGVALUE) %>%
  mutate(TYPE = "PROPERTY DAMAGE")
# Combine property and crop damage into one dataset
economic_category <- bind_rows(sum_property_by_category, sum_crop_by_category)
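The four summaries above follow the same pattern: sum a measure by category, keep the ten largest totals, and label the result. As a sketch, that pattern could be factored into a small helper. The function name `top_by_category()` and the toy data are hypothetical and not part of the original analysis.

```r
# Hypothetical helper: total `measure` per CATEGORY and keep the n largest totals
top_by_category <- function(data, measure, type, n = 10) {
  totals <- aggregate(data[[measure]], by = list(CATEGORY = data$CATEGORY), FUN = sum)
  names(totals)[2] <- "DAMAGE"                    # same column name as in the report
  totals <- totals[order(-totals$DAMAGE), ]       # largest totals first
  totals$TYPE <- type                             # label used for the plot legend
  head(totals, n)
}

# Toy data: made-up numbers for illustration only
toy <- data.frame(
  CATEGORY   = c("TORNADO", "HEAT", "TORNADO", "FLOODING"),
  FATALITIES = c(3, 5, 4, 1)
)

top_fatal <- top_by_category(toy, "FATALITIES", "FATALITIES", n = 2)
print(top_fatal)
```

With the real data, a call such as `top_by_category(NOAA_clean, "INJURIES", "INJURIES")` would be expected to reproduce the corresponding summary, assuming the columns are named as in this report.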
Based on the plots and the calculations performed on the dataset, we can determine the following:
Across the United States, which types of events (as indicated in the EVTYPE variable) are most harmful with respect to population health?
The top three types of events that are most harmful with respect to population health are:
1. TORNADO
2. HEAT
3. FLOODING
Across the United States, which types of events have the greatest economic consequences?
The top three types of events with the greatest economic consequences are:
1. FLOODING
2. STORM SURGE
3. HURRICANE/TYPHOON
# Question 1 plot
# FUN = sum orders the bars by the stacked total of fatalities and injuries
ggplot(harmful_category, aes(x = reorder(CATEGORY, -DAMAGE, FUN = sum), y = DAMAGE, fill = TYPE)) +
  geom_bar(stat = "identity") +
  labs(title = "Most Harmful Events to Population Health",
       x = "Category",
       y = "Count") +
  theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5))
# Question 2 plot
# FUN = sum orders the bars by the stacked total of property and crop damage
ggplot(economic_category, aes(x = reorder(CATEGORY, -DAMAGE, FUN = sum), y = DAMAGE, fill = TYPE)) +
  geom_bar(stat = "identity") +
  labs(title = "Events with the Greatest Economic Consequences",
       x = "Category",
       y = "Damage (USD)") +
  theme(axis.text.x = element_text(angle = 90), plot.title = element_text(hjust = 0.5))