The data set contained information regarding major weather events across the U.S., starting in 1950 and ending in November 2011. It was sourced from the NOAA storm database, comprising 902,297 records across 37 variables in CSV format compressed via the bzip2 algorithm.
The project set out to identify the weather events that were most harmful to population health and those that had the greatest economic consequences.
Since the data set was extensive, the project utilized the data.table package for both data acquisition and preprocessing, while the results were produced in the tidyverse.
Data profiling revealed inconsistent categorical labels, motivating a structured preprocessing pipeline.
The preprocessing pipeline consisted of many steps. First, dates before Jan 1st, 1996 were filtered out, because NOAA didn’t begin recording all weather event types until that date. Second, str_extract() was used for automatic and bulk transformation of most unique values of the “event_type” column,and then, fcase() was employed to manually gather the remaining unique values into groups of weather events, while data entry errors were filtered out. Third, two derived columns were produced holding the total property and agricultural damages, after recoding their metric prefixes columns. Last, summary statistics were calculated for each unique weather event type.
Regarding the health consequences, tornadoes had the most devastating effect, with a total number of 22183 fatalities and injuries. Floods and heat followed with half the previous count, while thunderstorm winds, lightning, and winter storms were also dangerous. Finally, heat was responsible for the most deaths out of all event types, with tornadoes and floods following.
With regard to the economic consequences, floods were undoubtedly the most catastrophic, resulting in a total of 170.8 billion dollars in damages. Moreover, hurricanes were also calamitous, leading to a financial loss of 95.4 billion dollars, while tornadoes and hail followed. Last, drought was the most destructive agriculturally, causing $13.4B in crop damages.
The “repdata_data_StormData.csv” file holds 902,297 rows and 37 columns. Hence, data.table::fread() will be used to read in the data, which is faster than read.csv() and read_csv().
The inspection will be held through the prism of the relevant columns for this project:
Important Note:
relevant_column_names <- c("BGN_DATE","EVTYPE", "FATALITIES", "INJURIES",
"PROPDMG", "PROPDMGEXP", "CROPDMG", "CROPDMGEXP")Transposed summary of the data.table:
## Rows: 902,297
## Columns: 8
## $ BGN_DATE <chr> "4/18/1950 0:00:00", "4/18/1950 0:00:00", "2/20/1951 0:00:0~
## $ EVTYPE <chr> "TORNADO", "TORNADO", "TORNADO", "TORNADO", "TORNADO", "TOR~
## $ FATALITIES <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 4, 0, 0, 0, 0,~
## $ INJURIES <dbl> 15, 0, 2, 2, 2, 6, 1, 0, 14, 0, 3, 3, 26, 12, 6, 50, 2, 0, ~
## $ PROPDMG <dbl> 25.0, 2.5, 25.0, 2.5, 2.5, 2.5, 2.5, 2.5, 25.0, 25.0, 2.5, ~
## $ PROPDMGEXP <chr> "K", "K", "K", "K", "K", "K", "K", "K", "K", "K", "M", "M",~
## $ CROPDMG <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ CROPDMGEXP <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "",~
NAs:
## # A tibble: 8 x 3
## variable n_miss pct_miss
## <chr> <int> <num>
## 1 BGN_DATE 0 0
## 2 EVTYPE 0 0
## 3 FATALITIES 0 0
## 4 INJURIES 0 0
## 5 PROPDMG 0 0
## 6 PROPDMGEXP 0 0
## 7 CROPDMG 0 0
## 8 CROPDMGEXP 0 0
The 8 relevant columns do not contain any NAs.
Check for the existence of duplicate records:
## Empty data.table (0 rows and 38 cols): STATE__,BGN_DATE,BGN_TIME,TIME_ZONE,COUNTY,COUNTYNAME...
There are no duplicates in “dt”.
Data summary by relevant variable:
| Name | dt[, ..relevant_column_na… |
| Number of rows | 902297 |
| Number of columns | 8 |
| Key | NULL |
| _______________________ | |
| Column type frequency: | |
| character | 4 |
| numeric | 4 |
| ________________________ | |
| Group variables | None |
Variable type: character
| skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
|---|---|---|---|---|---|---|---|
| BGN_DATE | 0 | 1 | 16 | 18 | 0 | 16335 | 0 |
| EVTYPE | 0 | 1 | 1 | 30 | 0 | 985 | 0 |
| PROPDMGEXP | 0 | 1 | 0 | 1 | 465934 | 19 | 0 |
| CROPDMGEXP | 0 | 1 | 0 | 1 | 618413 | 9 | 0 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 |
|---|---|---|---|---|---|---|---|---|---|
| FATALITIES | 0 | 1 | 0.02 | 0.77 | 0 | 0 | 0 | 0.0 | 583 |
| INJURIES | 0 | 1 | 0.16 | 5.43 | 0 | 0 | 0 | 0.0 | 1700 |
| PROPDMG | 0 | 1 | 12.06 | 59.48 | 0 | 0 | 0 | 0.5 | 5000 |
| CROPDMG | 0 | 1 | 1.53 | 22.17 | 0 | 0 | 0 | 0.0 | 990 |
Observations from the summary statistics:
Unique values of the “PROPDMGEXP” column:
## [1] "K" "M" "" "B" "m" "+" "0" "5" "6" "?" "4" "2" "3" "h" "7" "H" "-" "1" "8"
Unique values of the “CROPDMGEXP” column:
## [1] "" "M" "K" "m" "B" "?" "0" "k" "2"
Steps in the Preprocessing Pipeline:
Why filter out records before 01/01/1996?
setnames(dt, relevant_column_names,
c("begin_date","event_type", "fatalities", "injuries", "property_damage",
"property_damage_exponent", "crop_damage", "crop_damage_exponent")
)
invisible(dt[
, `:=`(
begin_date = as_date(mdy_hms(begin_date)),
event_type = str_to_title(str_trim(event_type, side = "both"))
)
])results <- dt[
begin_date >= mdy("01/01/1996")
][
, event_type := str_extract(event_type, regex(pattern, ignore_case = T))
][
, event_type := recode_event_type(event_type)
][
!(event_type %chin% filter_out_vals)
][
, `:=`(
total_property_damage =
property_damage *
exponent_to_multiplier(property_damage_exponent),
total_crop_damage =
crop_damage *
exponent_to_multiplier(crop_damage_exponent)
)
][
, .(
count = .N,
fatalities_per_type = sum(fatalities),
injuries_per_type = sum(injuries),
total_property_damage_per_type = sum(total_property_damage),
total_crop_damage_per_type = sum(total_crop_damage)
), by = event_type
][
, `:=`(
health_consequences_per_type = fatalities_per_type + injuries_per_type,
damages_per_type =
total_property_damage_per_type + total_crop_damage_per_type
)
][
order(-health_consequences_per_type, -damages_per_type)
]Check whether the transformation of the “EVTYPE/”event_type” column was successful. There should only exist 20 unique values (levels later) of that column, all stemming from the application of the “recode_event_type” recoding function in the previous preprocessing pipeline.
## [1] "Tornado" "Flood" "Heat"
## [4] "Thunderstorm Wind" "Lightning" "Winter Storm"
## [7] "Wind" "Hurricane/Typhoon" "Marine/Coastal"
## [10] "Fire" "Fog" "Hail"
## [13] "Cold" "Dust Storm" "Avalanche"
## [16] "Heavy Rain" "Landslide" "Drought"
## [19] "Volcanic Ash" "Smoke"
## Rows: 20
## Columns: 8
## $ event_type <chr> "Tornado", "Flood", "Heat", "Thundersto~
## $ count <int> 32618, 79727, 2410, 217013, 13204, 3967~
## $ fatalities_per_type <dbl> 1513, 1348, 2035, 389, 651, 530, 401, 1~
## $ injuries_per_type <dbl> 20670, 8525, 7685, 5156, 4141, 3591, 15~
## $ total_property_damage_per_type <dbl> 24622817010, 164399831670, 9243700, 791~
## $ total_crop_damage_per_type <dbl> 283425010, 6383043200, 492578500, 10169~
## $ health_consequences_per_type <dbl> 22183, 9873, 9720, 5545, 4792, 4121, 19~
## $ damages_per_type <dbl> 24906242020, 170782874870, 501822200, 8~
## [1] "No Missing Values"
if_else(nrow(suppressMessages(janitor::get_dupes(results))) == 0, "No Duplicates", "Duplicates Present") ## [1] "No Duplicates"
Convert the “results” data.table to tibble to facilitate further exploration, since “results” now only contains 20 rows and 8 columns:
Create the tibble that will hold the data for all health consequences:
health_consequences <- results |>
select(event_type, count, fatalities_per_type, injuries_per_type,
health_consequences_per_type) |>
arrange(desc(health_consequences_per_type)) |>
mutate(event_type = as_factor(event_type))
health_consequences## # A tibble: 20 x 5
## event_type count fatalities_per_type injuries_per_type
## <fct> <int> <dbl> <dbl>
## 1 Tornado 32618 1513 20670
## 2 Flood 79727 1348 8525
## 3 Heat 2410 2035 7685
## 4 Thunderstorm Wind 217013 389 5156
## 5 Lightning 13204 651 4141
## 6 Winter Storm 39679 530 3591
## 7 Wind 24547 401 1506
## 8 Hurricane/Typhoon 1015 182 1666
## 9 Marine/Coastal 8432 740 906
## 10 Fire 4176 87 1458
## 11 Fog 1774 69 855
## 12 Hail 208210 7 723
## 13 Cold 3727 380 130
## 14 Dust Storm 559 13 415
## 15 Avalanche 378 223 156
## 16 Heavy Rain 11859 102 249
## 17 Landslide 620 43 55
## 18 Drought 2634 3 29
## 19 Volcanic Ash 30 0 0
## 20 Smoke 21 0 0
## # i 1 more variable: health_consequences_per_type <dbl>
health_consequences |>
select(-count) |>
slice_head(n = 10) |>
pivot_longer(cols = c(fatalities_per_type, injuries_per_type),
names_to = "component",
values_to = "value") |>
mutate(component = factor(component,
levels = c("fatalities_per_type", "injuries_per_type"))) |>
# flipped in a horizontal bar chart
ggplot(aes(x = value, y = fct_rev(event_type), fill = component)) +
geom_bar(stat = "identity", position = "stack", width = 0.8) +
geom_text(
data = health_consequences |> slice_head(n = 10), # Original wide data (1 row per bar)
aes(
x = health_consequences_per_type,
y = fct_rev(event_type),
label = health_consequences_per_type,
),
inherit.aes = F, # detach from parent aes() so fill & component don't interfere
size = 3.5, nudge_x = 650
) +
scale_x_continuous(expand = expansion(mult = c(0.01, 0.1)),
breaks = seq(from = 0, to = 25000, by = 2500)) +
scale_fill_manual(name = "Harm Inflicted:",
values = c("fatalities_per_type" = "#0072B2",
"injuries_per_type" = "#D55E00"), # Blue & Vermillion
labels = c("fatalities_per_type" = "Fatalities",
"injuries_per_type" = "Injuries")
) +
labs(x = "Count of Harmed Individuals (Fatalities & Injuries)", y = "Weather Event Type") +
theme_bw(base_size = 12) +
theme(
axis.text = element_text(size = 12),
legend.position = c(0.865,0.17),
legend.background = element_blank(),
legend.key = element_blank(),
plot.margin = margin(6, 16, 6, 8),
panel.grid.major.y = element_blank()
)Weather event types that are most harmful to population health across the U.S. from 1996-01-01 to 2011-11-30.
Health Consequences Insights:
Create the tibble that will hold the data for all economic consequences:
economic_consequences <- results |>
select(event_type, count, total_property_damage_per_type,
total_crop_damage_per_type, damages_per_type) |>
mutate(
total_property_damage_per_type = total_property_damage_per_type/1e9,
total_crop_damage_per_type = total_crop_damage_per_type/1e9,
damages_per_type = damages_per_type/1e9
) |>
arrange(desc(damages_per_type)) |>
mutate(event_type = as_factor(event_type))
economic_consequences## # A tibble: 20 x 5
## event_type count total_property_damage_per_t~1 total_crop_damage_pe~2
## <fct> <int> <dbl> <dbl>
## 1 Flood 79727 164. 6.38
## 2 Hurricane/Typhoon 1015 89.4 6.03
## 3 Tornado 32618 24.6 0.283
## 4 Hail 208210 14.6 2.50
## 5 Drought 2634 1.05 13.4
## 6 Thunderstorm Wind 217013 7.91 1.02
## 7 Fire 4176 7.76 0.402
## 8 Winter Storm 39679 6.41 0.121
## 9 Wind 24547 5.43 0.716
## 10 Cold 3727 0.0417 2.64
## 11 Heavy Rain 11859 0.600 0.730
## 12 Lightning 13204 0.743 0.00690
## 13 Heat 2410 0.00924 0.493
## 14 Landslide 620 0.327 0.0200
## 15 Marine/Coastal 8432 0.264 0.0402
## 16 Fog 1774 0.0226 0
## 17 Dust Storm 559 0.00616 0.0031
## 18 Avalanche 378 0.00371 0
## 19 Volcanic Ash 30 0.0005 0
## 20 Smoke 21 0.0001 0
## # i abbreviated names: 1: total_property_damage_per_type,
## # 2: total_crop_damage_per_type
## # i 1 more variable: damages_per_type <dbl>
wide_economic_consequences <- economic_consequences |> slice_head(n = 10)
long_economic_consequences <- economic_consequences |>
select(-count) |>
slice_head(n = 10) |>
pivot_longer(cols = c(total_property_damage_per_type, total_crop_damage_per_type),
names_to = "component",
values_to = "value") |>
mutate(component = factor(component,
levels = c("total_crop_damage_per_type",
"total_property_damage_per_type")))
long_economic_consequences |>
ggplot(aes(x = value, y = fct_rev(event_type), fill = component)) +
geom_bar(stat = "identity", position = "stack", width = 0.8) +
geom_text(
data = wide_economic_consequences,
aes(
x = damages_per_type,
y = fct_rev(event_type),
label = scales::dollar(damages_per_type,
accuracy = 0.1, suffix = "B")
),
inherit.aes = F, size = 3.5, nudge_x = 6
) +
scale_x_continuous(expand = expansion(mult = c(0.01, 0.1)),
breaks = seq(from = 0, to = 180, by = 20),
labels = scales::label_dollar(suffix = "B")) +
scale_fill_manual(name = "Economic Consequence:",
values = c("total_crop_damage_per_type" = "#0072B2",
"total_property_damage_per_type" = "#D55E00"),
labels = c("total_crop_damage_per_type" = "Crop Damage",
"total_property_damage_per_type" = "Property Damage")
) +
labs(x = "U.S. Dollars (Billions)", y = "Weather Event Type") +
theme_bw(base_size = 12) +
theme(
axis.text = element_text(size = 12),
legend.position = c(0.832,0.17),
legend.background = element_blank(),
legend.key = element_blank(),
plot.margin = margin(6, 15, 6, 8),
panel.grid.major.y = element_blank()
)Weather event types with the greatest economic consequences across the U.S. from 1996-01-01 to 2011-11-30.
Economic Consequences Insights:
## R version 4.5.3 (2026-03-11 ucrt)
## Platform: x86_64-w64-mingw32/x64
## Running under: Windows 10 x64 (build 17763)
##
## Matrix products: default
## LAPACK version 3.12.1
##
## locale:
## [1] LC_COLLATE=English_United States.1252
## [2] LC_CTYPE=English_United States.1252
## [3] LC_MONETARY=English_United States.1252
## [4] LC_NUMERIC=C
## [5] LC_TIME=English_United States.1252
##
## time zone: Europe/Athens
## tzcode source: internal
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] data.table_1.18.2.1 lubridate_1.9.5 forcats_1.0.1
## [4] stringr_1.6.0 dplyr_1.2.0 purrr_1.2.1
## [7] readr_2.1.6 tidyr_1.3.2 tibble_3.3.1
## [10] ggplot2_4.0.2 tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] gtable_0.3.6 jsonlite_2.0.0 compiler_4.5.3 visdat_0.6.0
## [5] tidyselect_1.2.1 snakecase_0.11.1 jquerylib_0.1.4 scales_1.4.0
## [9] yaml_2.3.12 fastmap_1.2.0 R6_2.6.1 generics_0.1.4
## [13] knitr_1.51 janitor_2.2.1 bslib_0.10.0 pillar_1.11.1
## [17] RColorBrewer_1.1-3 tzdb_0.5.0 rlang_1.1.7 utf8_1.2.6
## [21] stringi_1.8.7 cachem_1.1.0 xfun_0.56 sass_0.4.10
## [25] S7_0.2.1 otel_0.2.0 timechange_0.4.0 cli_3.6.5
## [29] withr_3.0.2 magrittr_2.0.4 digest_0.6.39 grid_4.5.3
## [33] rstudioapi_0.18.0 hms_1.1.4 lifecycle_1.0.5 vctrs_0.7.1
## [37] evaluate_1.0.5 glue_1.8.0 farver_2.1.2 naniar_1.1.0
## [41] rmarkdown_2.31 tools_4.5.3 pkgconfig_2.0.3 htmltools_0.5.9