knitr::opts_chunk$set(echo = TRUE)
library(lubridate, warn.conflicts = FALSE)
library(vroom)
library(ggplot2)
library(stringr)
library(dplyr, warn.conflicts = FALSE)
In this report I aim to identify the weather events across the USA that are
most harmful (fatalities and injuries) and most damaging (property and crop
damage). To investigate this I used the Storm Data of the National Weather
Service. The events in this data range from 1950 to 2011.
From this data I found that tornadoes are the most harmful and most damaging
weather event type across the USA.
A follow-up report should look at the data for each state, because not every
state experiences the same severity of weather events.
The analysis was conducted with the following environment:
sessionInfo()
## R version 4.5.0 (2025-04-11)
## Platform: x86_64-redhat-linux-gnu
## Running under: Fedora Linux 42 (Workstation Edition)
##
## Matrix products: default
## BLAS/LAPACK: FlexiBLAS OPENBLAS-OPENMP; LAPACK version 3.12.0
##
## locale:
## [1] LC_CTYPE=de_DE.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_DE.UTF-8 LC_COLLATE=de_DE.UTF-8
## [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_DE.UTF-8
## [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## time zone: Europe/Berlin
## tzcode source: system (glibc)
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] dplyr_1.1.4 stringr_1.5.1 ggplot2_3.5.2 vroom_1.6.5
## [5] lubridate_1.9.4
##
## loaded via a namespace (and not attached):
## [1] bit_4.6.0 gtable_0.3.6 jsonlite_2.0.0 compiler_4.5.0
## [5] crayon_1.5.3 tidyselect_1.2.1 jquerylib_0.1.4 scales_1.4.0
## [9] yaml_2.3.10 fastmap_1.2.0 R6_2.6.1 generics_0.1.4
## [13] knitr_1.50 tibble_3.3.0 bslib_0.9.0 pillar_1.10.2
## [17] RColorBrewer_1.1-3 tzdb_0.5.0 rlang_1.1.6 stringi_1.8.7
## [21] cachem_1.1.0 xfun_0.52 sass_0.4.10 bit64_4.6.0-1
## [25] timechange_0.3.0 cli_3.6.5 withr_3.0.2 magrittr_2.0.3
## [29] digest_0.6.37 grid_4.5.0 rstudioapi_0.17.1 lifecycle_1.0.4
## [33] vctrs_0.6.5 evaluate_1.0.4 glue_1.8.0 farver_2.1.2
## [37] rmarkdown_2.29 tools_4.5.0 pkgconfig_2.0.3 htmltools_0.5.8.1
The first thing I did was open the CSV file in a text editor and check the headers, delimiter and data. With this information in mind, I loaded the file, setting the data types for some of the columns and converting the headers to upper camel case.
file <- vroom("repdata_data_StormData.csv",
delim = ",",
na = c("", "?", "NA"),
col_types = cols(
State = col_integer(),
BgnDate = col_datetime(format = "%m/%d/%Y %H:%M:%S"),
BgnTime = col_time(format = "%H%M")
),
.name_repair = ~ janitor::make_clean_names(., case = "upper_camel"))
After reading the file, the dataset has a total of 902297 rows and 37 columns.
dim(file)
## [1] 902297 37
We are interested in harm and damages per event type. Harm is recorded in the columns Fatalities and Injuries, and damages in the columns Propdmg and Cropdmg. The damage values are not raw dollar amounts: the actual value depends on the multiplier code given in the columns Propdmgexp and Cropdmgexp.
harmDamageData <- file %>%
as_tibble %>%
select(Evtype, Fatalities, Injuries, Propdmg, Cropdmg, Propdmgexp, Cropdmgexp) %>%
group_by(Evtype)
There are no missing values in the damage numbers:
sum(is.na(harmDamageData$Propdmg))
## [1] 0
sum(is.na(harmDamageData$Cropdmg))
## [1] 0
We have to handle the following multipliers:
unique(union(harmDamageData$Propdmgexp, harmDamageData$Cropdmgexp))
## [1] "K" "M" NA "B" "m" "+" "0" "5" "6" "4" "2" "3" "h" "7" "H" "-" "1" "8" "k"
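Before deciding how to treat unusual codes such as "+" or "-", it can help to see how many rows carry each code. The following is a minimal sketch on a made-up vector (`toy_exp` is a hypothetical stand-in, not the real column); the same call could be applied to Propdmgexp or Cropdmgexp.

```r
# Made-up exponent codes standing in for Propdmgexp / Cropdmgexp.
toy_exp <- c("K", "K", "M", NA, "+", "k", "0")

# Upper-case first so that "k" and "K" are counted together,
# and keep missing values visible in the tabulation.
exp_counts <- table(toupper(toy_exp), useNA = "ifany")
exp_counts
```

If the rare codes turn out to cover only a handful of rows, treating them as a multiplier of 1 has little effect on the totals.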
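I construct the multiplier table and compute the damage values. The letter codes stand for hundreds (H), thousands (K), millions (M) and billions (B), and the digits are treated as powers of ten. Missing codes and the symbols "+" and "-" are treated as a multiplier of 1.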
multipliers <- c(
  H = 100,
  K = 1000,
  M = 1000000,
  B = 1000000000,
  "0" = 1,
  "1" = 10,
  "2" = 100,
  "3" = 1000,
  "4" = 10000,
  "5" = 100000,
  "6" = 1000000,
  "7" = 10000000,
  "8" = 100000000
)
harmDamageData <- harmDamageData %>%
  # normalise lower-case codes such as "k", "m" and "h"
  mutate(Propdmgexp = toupper(Propdmgexp), Cropdmgexp = toupper(Cropdmgexp)) %>%
  # missing codes, "-" and "+" get a multiplier of 1; all other codes are looked up
  mutate(PropdmgMul = if_else(is.na(Propdmgexp) | Propdmgexp == "-" | Propdmgexp == "+", 1, unname(multipliers[Propdmgexp])),
         CropdmgMul = if_else(is.na(Cropdmgexp) | Cropdmgexp == "-" | Cropdmgexp == "+", 1, unname(multipliers[Cropdmgexp]))) %>%
  mutate(Propdmg = Propdmg * PropdmgMul, Cropdmg = Cropdmg * CropdmgMul)
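To make the lookup concrete, here is a small worked sketch in base R on made-up rows. The names `multiplier_lookup` and `toy` are hypothetical, and only the letter codes are included for brevity; the fallback to 1 mirrors the treatment of missing codes and "+" described above.

```r
# Reduced lookup table with the letter codes only (illustration, not the full table).
multiplier_lookup <- c(H = 1e2, K = 1e3, M = 1e6, B = 1e9)

# Made-up damage values and exponent codes.
toy <- data.frame(
  dmg = c(25, 2.5, 10, 3),
  exp = c("K", "m", NA, "+"),
  stringsAsFactors = FALSE
)

# Upper-case the code and look it up; codes without a table entry
# (here NA and "+") come back as NA and are replaced by 1.
mul <- unname(multiplier_lookup[toupper(toy$exp)])
mul[is.na(mul)] <- 1

toy$usd <- toy$dmg * mul
toy$usd
```

In this sketch 25 with code "K" becomes 25000 USD and 2.5 with code "m" becomes 2.5 million USD, while the rows without a usable code keep their recorded value.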
Because the USD values in Propdmg and Cropdmg are quite large, I scale them to millions of USD.
harmDamageData$Propdmg <- harmDamageData$Propdmg / 1000000
harmDamageData$Cropdmg <- harmDamageData$Cropdmg / 1000000
In total there are 977 unique event types:
length(unique(harmDamageData$Evtype))
## [1] 977
The event types contain duplicates and entries that combine several events. To clean them, I manually group similar events and filter out the summary rows.
harmDamageData$Evtype <- str_to_lower(harmDamageData$Evtype)
harmDamageData <- harmDamageData %>% filter(!str_detect(Evtype, "summary"))
harmDamageData[str_detect(harmDamageData$Evtype, "blizzard"),]$Evtype = "blizzard"
harmDamageData[str_detect(harmDamageData$Evtype, "high winds"),]$Evtype = "high winds"
harmDamageData[str_detect(harmDamageData$Evtype, "freeze"),]$Evtype = "freeze"
harmDamageData[str_detect(harmDamageData$Evtype, "dust devil"),]$Evtype = "dust devil"
harmDamageData[!str_detect(harmDamageData$Evtype, "dust devil") & str_detect(harmDamageData$Evtype, "dust"),]$Evtype = "dust storm"
harmDamageData[str_detect(harmDamageData$Evtype, "frost|freeze"),]$Evtype = "frost/freeze"
harmDamageData[str_detect(harmDamageData$Evtype, "flash flood"),]$Evtype = "flash flood"
harmDamageData[str_detect(harmDamageData$Evtype, "thunderstorm") & !str_detect(harmDamageData$Evtype, "marine thunderstorm"),]$Evtype = "thunderstorm"
harmDamageData[str_detect(harmDamageData$Evtype, "volcanic ash"),]$Evtype = "volcanic ash"
harmDamageData[str_detect(harmDamageData$Evtype, "fire"),]$Evtype = "wildfire"
harmDamageData[str_detect(harmDamageData$Evtype, "winter storm"),]$Evtype = "winterstorm"
harmDamageData[str_detect(harmDamageData$Evtype, "winter") & ! str_detect(harmDamageData$Evtype, "winterstorm"),]$Evtype = "winter weather"
harmDamageData[str_detect(harmDamageData$Evtype, "record") & str_detect(harmDamageData$Evtype, "cold|low|cool"),]$Evtype = "frost/freeze"
harmDamageData[str_detect(harmDamageData$Evtype, "record") & str_detect(harmDamageData$Evtype, "high|warm|heat|temperature"),]$Evtype = "heat"
harmDamageData[str_detect(harmDamageData$Evtype, "microburst|macroburst|mircoburst|downburst|gustnado|thundestorm"),]$Evtype = "thunderstorm wind"
harmDamageData[str_detect(harmDamageData$Evtype, "thunderestorm|thunderstrom|thundertorm|tunderstorm|tstm|thundertsorm|thundeerstorm|thunerstorm"),]$Evtype = "thunderstorm wind"
harmDamageData[str_detect(harmDamageData$Evtype, "drought") & str_detect(harmDamageData$Evtype, "heat"),]$Evtype = "drought"
harmDamageData[str_detect(harmDamageData$Evtype, "snow drought"),]$Evtype = "winter weather"
harmDamageData[str_detect(harmDamageData$Evtype, "hurricane|typhoon"),]$Evtype = "hurricane/typhoon"
harmDamageData[str_detect(harmDamageData$Evtype, "tropical.*storm"),]$Evtype = "tropical storm"
harmDamageData[str_detect(harmDamageData$Evtype, "dry"),]$Evtype = "drought"
harmDamageData[str_detect(harmDamageData$Evtype, "tornado"),]$Evtype = "tornado"
harmDamageData[str_detect(harmDamageData$Evtype, "waterspout"),]$Evtype = "waterspout"
harmDamageData[str_detect(harmDamageData$Evtype, "funnel"),]$Evtype = "funnel cloud"
harmDamageData[str_detect(harmDamageData$Evtype, "volc"),]$Evtype = "volcanic ash"
harmDamageData[str_detect(harmDamageData$Evtype, "ava"),]$Evtype = "avalanche"
harmDamageData[str_detect(harmDamageData$Evtype, "(coastal.*flood)|(flood.*coastal)"),]$Evtype = "coastal flood"
harmDamageData[str_detect(harmDamageData$Evtype, "rock|mud|slide"),]$Evtype = "debris flow"
harmDamageData[str_detect(harmDamageData$Evtype, "fog|vog") & !str_detect(harmDamageData$Evtype, "freez"),]$Evtype = "dense fog"
harmDamageData[str_detect(harmDamageData$Evtype, "smoke"),]$Evtype = "dense smoke"
harmDamageData[str_detect(harmDamageData$Evtype, "(extreme|excessive).*heat"),]$Evtype = "excessive heat"
harmDamageData[str_detect(harmDamageData$Evtype, "heat wav"),]$Evtype = "excessive heat"
harmDamageData[str_detect(harmDamageData$Evtype, "(extreme|bitter wind).*(cold|wind|chill)"),]$Evtype = "extreme cold/wind chill"
harmDamageData[str_detect(harmDamageData$Evtype, "(flash.*flood)|(flood.*flash)|(flash floo)"),]$Evtype = "flash flood"
harmDamageData[str_detect(harmDamageData$Evtype, "erosion|erosin"),]$Evtype = "coastal flood"
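# exclude "flash flood" as well, so the separate flash flood category assigned above is not overwritten
harmDamageData[str_detect(harmDamageData$Evtype, "flood") & !str_detect(harmDamageData$Evtype, "coastal flood|flash flood"),]$Evtype = "flood"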
harmDamageData[str_detect(harmDamageData$Evtype, "hail"),]$Evtype = "hail"
harmDamageData[str_detect(harmDamageData$Evtype, "(heavy.*rain)|(rain.*heavy)"),]$Evtype = "heavy rain"
harmDamageData[str_detect(harmDamageData$Evtype, "(heavy.*snow)|(snow.*heavy)"),]$Evtype = "heavy snow"
harmDamageData[str_detect(harmDamageData$Evtype, "(accum|record|exces).*snow"),]$Evtype = "heavy snow"
harmDamageData[str_detect(harmDamageData$Evtype, "snow.*(accum|record|exces)"),]$Evtype = "heavy snow"
harmDamageData[str_detect(harmDamageData$Evtype, "(high|heavy).*surf"),]$Evtype = "high surf"
harmDamageData[str_detect(harmDamageData$Evtype, "surf|(high wav)"),]$Evtype = "high surf"
harmDamageData[str_detect(harmDamageData$Evtype, "(high|extreme).*wind") & !str_detect(harmDamageData$Evtype, "chill|marine"),]$Evtype = "high wind"
harmDamageData[str_detect(harmDamageData$Evtype, "(wind.*gust)|(gust.*wind)"),]$Evtype = "high wind"
harmDamageData[str_detect(harmDamageData$Evtype, "strong.*wind") & !str_detect(harmDamageData$Evtype, "chill|marine"),]$Evtype = "high wind"
harmDamageData[str_detect(harmDamageData$Evtype, "(ice.*storm)|(storm.*ice)"),]$Evtype = "ice storm"
harmDamageData[str_detect(harmDamageData$Evtype, "(lake.*snow)"),]$Evtype = "lake-effect snow"
harmDamageData[str_detect(harmDamageData$Evtype, "lightn|lighting"),]$Evtype = "lightning"
harmDamageData[str_detect(harmDamageData$Evtype, "current"),]$Evtype = "rip current"
harmDamageData[str_detect(harmDamageData$Evtype, "surge|((blow|high).*tide)") & !str_detect(harmDamageData$Evtype,"astronomical"),]$Evtype = "storm surge/tide"
harmDamageData[str_detect(harmDamageData$Evtype, "^wind"),]$Evtype = "strong wind"
harmDamageData[str_detect(harmDamageData$Evtype, "(storm f|thuderstorm).*wind"),]$Evtype = "thunderstorm wind"
harmDamageData[str_detect(harmDamageData$Evtype, "wind storm"),]$Evtype = "thunderstorm wind"
harmDamageData[str_detect(harmDamageData$Evtype, "water.*spout"),]$Evtype = "waterspout"
harmDamageData[str_detect(harmDamageData$Evtype, "(snow|sleet).*storm"),]$Evtype = "winterstorm"
harmDamageData[str_detect(harmDamageData$Evtype, "blow"),]$Evtype = "winterstorm"
harmDamageData[str_detect(harmDamageData$Evtype, "ice") & !str_detect(harmDamageData$Evtype, "storm"),]$Evtype = "winter weather"
harmDamageData[str_detect(harmDamageData$Evtype, "snow") & !str_detect(harmDamageData$Evtype, "heavy|lake-effect"),]$Evtype = "winter weather"
harmDamageData[str_detect(harmDamageData$Evtype, "freezi"),]$Evtype = "winter weather"
harmDamageData[str_detect(harmDamageData$Evtype, "rain"),]$Evtype = "heavy rain"
harmDamageData[str_detect(harmDamageData$Evtype, "warm"),]$Evtype = "heat"
harmDamageData[str_detect(harmDamageData$Evtype, "wet micoburst"),]$Evtype = "thunderstorm"
harmDamageData[str_detect(harmDamageData$Evtype, "wet"),]$Evtype = "flood"
harmDamageData[str_detect(harmDamageData$Evtype, "cloud"),]$Evtype = "funnel cloud"
harmDamageData[str_detect(harmDamageData$Evtype, "cold") & !str_detect(harmDamageData$Evtype, "extreme"),]$Evtype = "cold/wind chill"
harmDamageData[str_detect(harmDamageData$Evtype, "wnd"),]$Evtype = "high wind"
harmDamageData[str_detect(harmDamageData$Evtype, "wintry mix"),]$Evtype = "winter weather"
harmDamageData[str_detect(harmDamageData$Evtype, "fld"),]$Evtype = "flash flood"
harmDamageData[str_detect(harmDamageData$Evtype, "driest"),]$Evtype = "drought"
harmDamageData[str_detect(harmDamageData$Evtype, "^thunderstorm"),]$Evtype = "thunderstorm wind"
I stop the manual cleaning of the event types at this point, because continuing would take an unreasonable amount of time.
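The long sequence of assignments above could also be expressed as an ordered rule table, which makes the precedence between patterns easier to audit. The sketch below is an illustration only: `rules` and `recode_evtype` are hypothetical names, the table holds just four example patterns, and it uses a first-match-wins policy on the original label. That differs from the code above, where later assignments can overwrite earlier ones.

```r
library(stringr)

# Hypothetical rule table: names are regular expressions, values are the
# cleaned labels. Order matters, the first matching pattern wins.
rules <- c(
  "flash.*flood"   = "flash flood",
  "coastal.*flood" = "coastal flood",
  "flood"          = "flood",
  "tornado"        = "tornado"
)

recode_evtype <- function(evtype, rules) {
  evtype <- str_to_lower(evtype)
  out <- evtype                          # unmatched labels are kept as they are
  matched <- rep(FALSE, length(evtype))
  for (i in seq_along(rules)) {
    # only rows that no earlier rule has claimed can be relabelled
    hit <- !matched & str_detect(evtype, names(rules)[i])
    out[hit] <- rules[[i]]
    matched <- matched | hit
  }
  out
}

cleaned <- recode_evtype(c("FLASH FLOODING", "River Flood", "TORNADO F2", "hail"), rules)
cleaned
```

With this policy, "FLASH FLOODING" stays in the flash flood category even though it also matches the more general "flood" pattern further down the table.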
Before plotting, I create the summary tables for harm and damages, each sorted by Total in descending order.
# harmDamageData is grouped by Evtype, so each sum is computed per event type
harmSummary <- summarize(harmDamageData, FatalitiesSum = sum(Fatalities, na.rm = TRUE), InjuriesSum = sum(Injuries, na.rm = TRUE)) %>%
  mutate(Total = FatalitiesSum + InjuriesSum) %>%
  arrange(desc(Total))
damageSummary <- summarize(harmDamageData, PropdmgSum = sum(Propdmg, na.rm = TRUE), CropdmgSum = sum(Cropdmg, na.rm = TRUE)) %>%
  mutate(Total = PropdmgSum + CropdmgSum) %>%
  arrange(desc(Total))
The damage totals, given in millions of USD, span a wide range:
range(damageSummary$Total)
## [1] 0.0 180354.9
To show which events are the most harmful, I take the top 10 of the summary and plot them in order.
gHarm <- ggplot(harmSummary[1:10,] %>%
mutate(Evtype = factor(Evtype, levels = Evtype)), aes(x = Total, y = Evtype, fill = Evtype))
gHarm + geom_col() +
labs(x = "Total harm", y = "Event type", title = "Top 10 most harmful event types", fill = "Event types")
The picture is very clear: tornadoes are by far the most harmful event type.
Plotting the top 10 damaging events shows a different picture:
gDamage <- ggplot(damageSummary[1:10,] %>%
mutate(Evtype = factor(Evtype, levels = Evtype)), aes(x = Total, y = Evtype, fill = Evtype))
gDamage + geom_col() +
labs(x = "Total damage (mil USD)", y = "Event type", title = "Top 10 most damaging event types", fill = "Event types")
Tornadoes are again at the top, but this time the gap to the other event types is not as large. The order of the event types also differs: flash floods are in second place instead of excessive heat.
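Both plots select the top 10 with positional indexing (`[1:10,]`), which relies on the summary tables having been sorted beforehand. As a hedged alternative, the sketch below picks the top rows with `slice_max()` on a made-up table; `toy_summary` and its numbers are invented for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Made-up summary table; the totals are not taken from the Storm Data.
toy_summary <- tibble(
  Evtype = c("hail", "tornado", "flood", "heat"),
  Total  = c(15, 970, 86, 120)
)

top3 <- toy_summary %>%
  slice_max(Total, n = 3, with_ties = FALSE) %>%
  arrange(desc(Total)) %>%
  # reverse the level order so the largest bar is drawn at the top of a
  # horizontal bar chart (the first factor level sits at the bottom of the y axis)
  mutate(Evtype = factor(Evtype, levels = rev(Evtype)))

top3
```

The selection no longer depends on an earlier `arrange()` call, and the reversed factor levels put the largest total at the top of the chart rather than at the bottom.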
Further investigation could explore harm and damage at the state level. Another open question is whether the focus should be on the events that cause the most harm or on those that cause the most damage.
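As a starting point for such a state-level follow-up, the sketch below finds the most harmful event type per state on made-up rows. The column name `StateAbbr` and all values are assumptions for illustration; the actual name of the state column after header cleaning would need to be checked in the loaded data.

```r
library(dplyr, warn.conflicts = FALSE)

# Made-up events; StateAbbr is a hypothetical column name.
toy_events <- tibble(
  StateAbbr  = c("TX", "TX", "TX", "FL", "FL"),
  Evtype     = c("tornado", "hail", "tornado", "hurricane/typhoon", "lightning"),
  Fatalities = c(3, 0, 1, 5, 1),
  Injuries   = c(20, 2, 4, 30, 0)
)

worst_per_state <- toy_events %>%
  group_by(StateAbbr, Evtype) %>%
  summarize(Total = sum(Fatalities + Injuries), .groups = "drop") %>%
  # within each state, keep only the event type with the largest total
  group_by(StateAbbr) %>%
  slice_max(Total, n = 1, with_ties = FALSE) %>%
  ungroup()

worst_per_state
```

The same pattern would work for damages by summing Propdmg and Cropdmg instead of Fatalities and Injuries.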