In this assignment, our task is to explore the NOAA Storm Database (the events in the database start in year 1950 and end in November 2011) and provide data analysis that answers two questions related to severe weather events:
Across the United States, which types of events are most harmful with respect to population health?
Across the United States, which types of events have the greatest economic consequences?
I use R and the tidyverse suite of tools (specifically
dplyr and ggplot2) to process, transform, and analyze this data.
Based on my analysis, I conclude that tornadoes are the most harmful events with respect to population health, as assessed using fatality and injury data. Floods, in comparison, are the events with the greatest economic consequences, as assessed by property and crop damage.
The data was downloaded from the internet using the URL
provided. Then read.csv was used to read in the CSV for
analysis.
library(tidyverse)  # provides dplyr (glimpse, select, mutate) and ggplot2
url <- "https://d396qusza40orc.cloudfront.net/repdata%2Fdata%2FStormData.csv.bz2"
temp <- tempfile(fileext = ".bz2")
download.file(url, temp)
df <- read.csv(bzfile(temp))
unlink(temp)
glimpse(df)
## Rows: 902,297
## Columns: 37
## $ STATE__ <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,…
## $ BGN_DATE <chr> "4/18/1950 0:00:00", "4/18/1950 0:00:00", "2/20/1951 0:00:0…
## $ BGN_TIME <chr> "0130", "0145", "1600", "0900", "1500", "2000", "0100", "09…
## $ TIME_ZONE <chr> "CST", "CST", "CST", "CST", "CST", "CST", "CST", "CST", "CS…
## $ COUNTY <dbl> 97, 3, 57, 89, 43, 77, 9, 123, 125, 57, 43, 9, 73, 49, 107,…
## $ COUNTYNAME <chr> "MOBILE", "BALDWIN", "FAYETTE", "MADISON", "CULLMAN", "LAUD…
## $ STATE <chr> "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL", "AL",…
## $ EVTYPE <chr> "TORNADO", "TORNADO", "TORNADO", "TORNADO", "TORNADO", "TOR…
## $ BGN_RANGE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ BGN_AZI <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ BGN_LOCATI <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ END_DATE <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ END_TIME <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ COUNTY_END <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ COUNTYENDN <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
## $ END_RANGE <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ END_AZI <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ END_LOCATI <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ LENGTH <dbl> 14.0, 2.0, 0.1, 0.0, 0.0, 1.5, 1.5, 0.0, 3.3, 2.3, 1.3, 4.7…
## $ WIDTH <dbl> 100, 150, 123, 100, 150, 177, 33, 33, 100, 100, 400, 400, 2…
## $ F <int> 3, 2, 2, 2, 2, 2, 2, 1, 3, 3, 1, 1, 3, 3, 3, 4, 1, 1, 1, 1,…
## $ MAG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ FATALITIES <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 4, 0, 0, 0, 0,…
## $ INJURIES <dbl> 15, 0, 2, 2, 2, 6, 1, 0, 14, 0, 3, 3, 26, 12, 6, 50, 2, 0, …
## $ PROPDMG <dbl> 25.0, 2.5, 25.0, 2.5, 2.5, 2.5, 2.5, 2.5, 25.0, 25.0, 2.5, …
## $ PROPDMGEXP <chr> "K", "K", "K", "K", "K", "K", "K", "K", "K", "K", "M", "M",…
## $ CROPDMG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,…
## $ CROPDMGEXP <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ WFO <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ STATEOFFIC <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ ZONENAMES <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ LATITUDE <dbl> 3040, 3042, 3340, 3458, 3412, 3450, 3405, 3255, 3334, 3336,…
## $ LONGITUDE <dbl> 8812, 8755, 8742, 8626, 8642, 8748, 8631, 8558, 8740, 8738,…
## $ LATITUDE_E <dbl> 3051, 0, 0, 0, 0, 0, 0, 0, 3336, 3337, 3402, 3404, 0, 3432,…
## $ LONGITUDE_ <dbl> 8806, 0, 0, 0, 0, 0, 0, 0, 8738, 8737, 8644, 8640, 0, 8540,…
## $ REMARKS <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",…
## $ REFNUM <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, …
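The loading step above relies on read.csv accepting a bzfile connection, so the archive never has to be decompressed by hand. A minimal sketch of that round trip on a toy file follows; the file and the toy data frame are made up for illustration and are not part of the storm data.

```r
# Toy data standing in for the real storm file (hypothetical values).
toy <- data.frame(EVTYPE = c("TORNADO", "FLOOD"), FATALITIES = c(1, 0))

# Write the toy data through a bzip2-compressing connection.
tmp <- tempfile(fileext = ".csv.bz2")
con <- bzfile(tmp, open = "w")
write.csv(toy, con, row.names = FALSE)
close(con)

# read.csv() opens the connection itself; bzfile() decompresses on the fly.
toy_back <- read.csv(bzfile(tmp))
unlink(tmp)

print(toy_back)
```

The same pattern applies to the downloaded file: only the path passed to bzfile changes.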
The dataset contains 37 columns and 902,297 rows. To answer the questions given, not all columns will be necessary. Data was subsetted keeping the following columns:
EVTYPE: event type
FATALITIES: number of fatalities (used in population health analysis)
INJURIES: number of injuries (used in population health analysis)
PROPDMG: property damage estimate (used in economic analysis)
CROPDMG: crop damage estimate (used in economic analysis)
PROPDMGEXP: property damage estimate magnitude (used in economic analysis)
CROPDMGEXP: crop damage estimate magnitude (used in economic analysis)
df <- df %>%
  select(EVTYPE, FATALITIES, INJURIES, PROPDMG, CROPDMG, PROPDMGEXP, CROPDMGEXP)
To check the class of each column selected, I used
sapply.
sapply(df, class)
## EVTYPE FATALITIES INJURIES PROPDMG CROPDMG PROPDMGEXP
## "character" "numeric" "numeric" "numeric" "numeric" "character"
## CROPDMGEXP
## "character"
Next, we’ll check if there are any NAs present in the dataset.
sum(is.na(df))
## [1] 0
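A zero NA count does not mean every field is filled in: the glimpse output shows character columns such as CROPDMGEXP holding empty strings, which is.na does not flag. As a rough sketch on made-up rows (the toy data frame below is an assumption, not the real data), empty strings can be counted per character column like this:

```r
# Hypothetical rows mimicking the shape of the selected columns.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "HAIL"),
  PROPDMG    = c(25, 0, 2.5),
  PROPDMGEXP = c("K", "", "M"),
  CROPDMGEXP = c("", "", "K"),
  stringsAsFactors = FALSE
)

# is.na() sees no missing values here...
na_count <- sum(is.na(toy))

# ...but the character columns still contain empty strings.
is_chr <- vapply(toy, is.character, logical(1))
empty_counts <- colSums(toy[is_chr] == "")

print(na_count)
print(empty_counts)
```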
Next, we’ll check the EVTYPE column for unique values.
There are 985 unique values in this column. I will not
revise the contents of this column, since the focus is on which events
are most harmful with respect to population health and the economy. As a result, any variant spellings of the same event are counted as separate event types.
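For reference, a count of unique values like the one quoted above can be obtained with length(unique(...)). The sketch below uses a made-up vector rather than the real EVTYPE column, and also shows how case and whitespace differences inflate the count. The normalization step is illustrative only and is not applied in this analysis.

```r
# Hypothetical event labels with case and whitespace variants.
evtype <- c("TORNADO", "Tornado", " TORNADO", "FLOOD", "FLOOD", "HAIL")

# Raw count: every distinct string is its own event type.
n_raw <- length(unique(evtype))

# Count after trimming whitespace and upper-casing (not used in the report).
n_normalized <- length(unique(toupper(trimws(evtype))))

print(c(raw = n_raw, normalized = n_normalized))
```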
Finally, the PROPDMG and CROPDMG columns
require additional transformation. They are currently stored without their
units, which are held in their respective EXP
columns:
H means hundreds
K means thousands
M means millions
B means billions
Any other EXP value is treated as zero damage. The scaled property and crop damage columns will be summed, and a new column called
TOTALDMG will be created.
df <- df %>%
  mutate(
    PROPDMGTOTAL = case_when(
      PROPDMGEXP == "H" ~ PROPDMG * 10^2,
      PROPDMGEXP == "K" ~ PROPDMG * 10^3,
      PROPDMGEXP == "M" ~ PROPDMG * 10^6,
      PROPDMGEXP == "B" ~ PROPDMG * 10^9,
      TRUE ~ 0
    ),
    CROPDMGTOTAL = case_when(
      CROPDMGEXP == "H" ~ CROPDMG * 10^2,
      CROPDMGEXP == "K" ~ CROPDMG * 10^3,
      CROPDMGEXP == "M" ~ CROPDMG * 10^6,
      CROPDMGEXP == "B" ~ CROPDMG * 10^9,
      TRUE ~ 0
    ),
    TOTALDMG = PROPDMGTOTAL + CROPDMGTOTAL
  )
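To make the scaling rule concrete, here is a small base-R sketch of the same mapping, written as a lookup table instead of case_when. The helper name scale_damage and the sample values are hypothetical. It follows the rule used above: H, K, M, and B scale the amount, and anything else (including lowercase letters and empty strings) yields zero.

```r
# Hypothetical helper mirroring the case_when() logic above.
scale_damage <- function(amount, exp) {
  multipliers <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)
  mult <- unname(multipliers[exp])  # unmatched codes (e.g. "k", "") give NA
  mult[is.na(mult)] <- 0            # treat unrecognized codes as zero damage
  amount * mult
}

# Made-up amounts and exponent codes.
amount <- c(25, 2.5, 1, 3, 5)
exp    <- c("K", "M", "B", "k", "")

scaled <- scale_damage(amount, exp)
print(scaled)
```

Note that the match is case-sensitive, so a lowercase "k" contributes nothing to the total, exactly as in the case_when version.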
To determine which events are most harmful with respect to population
health, we’ll use the FATALITIES and INJURIES
columns.
df %>%
  group_by(EVTYPE) %>%
  summarize(total_fatalities = sum(FATALITIES, na.rm = TRUE)) %>%
  arrange(desc(total_fatalities)) %>%
  slice(1:10) %>%
  ggplot(aes(x = reorder(EVTYPE, total_fatalities), y = total_fatalities)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 25, vjust = 1, hjust = 1)) +
  labs(
    x = "Event Type",
    y = "Total Fatalities",
    title = "Total Number of Fatalities for the top 10 Weather Events"
  )
We can conclude based on fatalities that tornadoes are most harmful with
respect to population health in the US.
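The ranking behind the plot is a group, sum, sort, and take-the-top pattern. A minimal base-R sketch of that pattern on made-up rows is shown here; the toy values are assumptions and the real pipeline uses dplyr as shown above.

```r
# Hypothetical event records.
toy <- data.frame(
  EVTYPE     = c("TORNADO", "FLOOD", "TORNADO", "HEAT", "FLOOD", "HAIL"),
  FATALITIES = c(4, 1, 2, 3, 0, 0)
)

# Sum fatalities per event type.
totals <- tapply(toy$FATALITIES, toy$EVTYPE, sum, na.rm = TRUE)

# Sort descending and keep the top two event types.
top2 <- head(sort(totals, decreasing = TRUE), 2)

print(top2)
```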
df %>%
  group_by(EVTYPE) %>%
  summarize(total_injuries = sum(INJURIES, na.rm = TRUE)) %>%
  arrange(desc(total_injuries)) %>%
  slice(1:10) %>%
  ggplot(aes(x = reorder(EVTYPE, total_injuries), y = total_injuries)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 25, vjust = 1, hjust = 1)) +
  labs(
    x = "Event Type",
    y = "Total Injuries",
    title = "Total Number of Injuries for the top 10 Weather Events"
  )
The previous finding is supported when looking at injuries. Tornadoes once again pose the greatest threat to population health compared to other severe weather events.
df %>%
  group_by(EVTYPE) %>%
  summarize(total_dmg = sum(TOTALDMG, na.rm = TRUE)) %>%
  arrange(desc(total_dmg)) %>%
  slice(1:10) %>%
  ggplot(aes(x = reorder(EVTYPE, total_dmg), y = total_dmg)) +
  geom_col() +
  theme(axis.text.x = element_text(angle = 25, vjust = 1, hjust = 1)) +
  labs(
    x = "Event Type",
    y = "Damages ($)",
    title = "Property and Crop Damage for the top 10 Weather Events"
  )
When looking at combined property and crop damage, floods have the greatest economic consequences. Interestingly, tornadoes are also among the top 10 events.
sessionInfo()
## R version 4.4.3 (2025-02-28)
## Platform: aarch64-apple-darwin20
## Running under: macOS Sequoia 15.4.1
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/New_York
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] lubridate_1.9.4 forcats_1.0.1 stringr_1.6.0 dplyr_1.1.4
## [5] purrr_1.2.0 readr_2.1.6 tidyr_1.3.1 tibble_3.3.0
## [9] ggplot2_4.0.1 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.4.3 tidyselect_1.2.1
## [5] jquerylib_0.1.4 scales_1.4.0 yaml_2.3.10 fastmap_1.2.0
## [9] R6_2.6.1 labeling_0.4.3 generics_0.1.4 knitr_1.50
## [13] bslib_0.9.0 pillar_1.11.1 RColorBrewer_1.1-3 tzdb_0.5.0
## [17] rlang_1.1.6 stringi_1.8.7 cachem_1.1.0 xfun_0.54
## [21] sass_0.4.10 S7_0.2.1 timechange_0.3.0 cli_3.6.5
## [25] withr_3.0.2 magrittr_2.0.4 digest_0.6.39 grid_4.4.3
## [29] rstudioapi_0.17.1 hms_1.1.4 lifecycle_1.0.4 vctrs_0.6.5
## [33] evaluate_1.0.5 glue_1.8.0 farver_2.1.2 rmarkdown_2.30
## [37] tools_4.4.3 pkgconfig_2.0.3 htmltools_0.5.8.1