library(tidyverse)
library(rvest)
library(janitor)
library(knitr)
National Storms
National Storms
The NOAA provides data on tropical storms and hurricanes on their website, however the data is not very clean.
In this article we will focus on scraping the data from the web and cleaning it into usable fashion. We will then compare the storms’ quantitative impact by category.
Data recommended by Kim Koon
Web scraping
First we will scrape the data from the web page. It is already formatted as an HTML table so scraping should be simple.
As an intersting note, this table was actually updated to include Hurricane Helene in between writing this notebook and publishing.
# Read html
<- "https://vlab.noaa.gov/web/mdl/historic-storms"
url <- read_html(url) webpage
# Specify the node we're interested in
<- html_node(webpage, xpath = "//table[1]") div_node
# Convert it to a table
<- div_node |>
table html_table()
# Our starting point
head(table)
# A tibble: 6 × 7
`2024-Storm` Date `Storm-Tide` Obs Guidance `Cat, Pres, Dead, $bn` Area
<chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 "2024-Helene" Sep … "Obs w4: >9… "(FE… "Warn-A… (Cat4, 938, 221+, $27… "N.W…
2 "2024-Francine" Sep … "w2: 5 mhhw" "" "Warn-A… (Cat2, 972, 0, $1.5) "LA,…
3 "2024-Gilma-cp" Aug … "w1: 3 mhhw" "" "PS-Adv… (Cat4, 949, -, -) "HI"
4 "2024-Hone-cp" Aug … "w1: 3 mhhw" "" "PS-Adv… (Cat1, 988, -, -) "HI"
5 "2024-Post-\nT… Aug … "w3: 6.3 mh… "" "P-ETSS" (Cat4, 947, -, -) "AK"
6 "2024-Debby" Aug … "w2: 5.1 mh… "(FE… "Warn-A… (Cat1, 979, 10, >$1) "N.W…
write_csv(table, "national_storms_raw_data.csv")
Data cleaning
Now we can focus on cleaning the data. As we can see, there are many combined columns in this dataset, with mixed delimiters, white spaces, and structure.
We will try to leave the values itself as intact as possible and perform minimal transformation; as we do not understand the meaning behind each variable well, we should leave that to the meteorologists.
Clean column names
Janitor is a nice package to quickly get data into workable shape. Here we’ll just clean the column names so they’re a bit more workable.
<- read.csv("national_storms_raw_data.csv")
df
<- janitor::clean_names(table) df
Separating the columns
Now we will focus on separating the data into tidy and respective columns, one column at a time.
# fix storm column
<- df |>
df mutate(x2024_storm = str_replace_all(x2024_storm, "\\s+", "")) |>
separate_wider_regex(x2024_storm,
patterns = c(
year = "\\d{4}",
"-",
storm = ".*"
),too_few = "align_start"
)
# fix date column
<- df |>
df separate_wider_regex(date,
patterns = c(
month = "\\w{3}",
"\\s",
day = "\\d{1,2}"
),too_few = "align_start"
)
# Fix surge column
# Extract class
<- df |>
class mutate(class = str_extract(storm_tide, "w\\d")) |>
select(class)
# Extract mhhw
<- df |>
mhhw mutate(mhhw = str_extract(storm_tide, "(\\d*\\.?\\d+) mhhw")) |>
mutate(mhhw = str_extract(mhhw, "\\d*\\.?\\d+")) |>
select(mhhw)
# Reassble the dataframe
<- df |>
df cbind(class, mhhw) |>
select(-storm_tide)
# fix cat_pres_dead_bn column
<- df |> separate_wider_delim(
df delim = ",",
cat_pres_dead_bn, names = c("category", "pres", "dead", "bn")
|>
) mutate(
across(
:bn,
category~ str_replace_all(.x, "[-)(>$+]", "")
) )
#
<- df |> separate_wider_delim(
df
area,",",
names_sep = "",
too_few = "align_start"
)
# Show our progress
head(df)
# A tibble: 6 × 19
year storm month day obs guidance category pres dead bn area1 area2
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2024 Helene Sep 23 "(FE… "Warn-A… Cat4 " 93… " 22… " 27… N.W.… " GA"
2 2024 Franc… Sep 12 "" "Warn-A… Cat2 " 97… " 0" " 1.… LA " MS"
3 2024 Gilma… Aug 30 "" "PS-Adv… Cat4 " 94… " " " " HI <NA>
4 2024 Hone-… Aug 25 "" "PS-Adv… Cat1 " 98… " " " " HI <NA>
5 2024 Post-… Aug 22 "" "P-ETSS" Cat4 " 94… " " " " AK <NA>
6 2024 Debby Aug 5 "(FE… "Warn-A… Cat1 " 97… " 10" " 1" N.W.… " GA"
# ℹ 7 more variables: area3 <chr>, area4 <chr>, area5 <chr>, area6 <chr>,
# area7 <chr>, class <chr>, mhhw <chr>
Long format and nulls
This data is actually relatively well suited to a wide format given the amount of dimensions, but we can improve the areas
column by converting that into long format.
We’ll also make sure we’ve replaced any empty strings will null values.
<- df |> pivot_longer(
df cols = starts_with("area"),
names_to = NULL,
values_to = "area",
values_drop_na = TRUE)
# Fill NAs
<- df |>
df mutate(across(c(year, day, pres, dead, bn, mhhw), ~as.numeric(.x))) |>
mutate(across(where(is.character), ~ na_if(.,"")))
head(df)
# A tibble: 6 × 13
year storm month day obs guidance category pres dead bn class mhhw
<dbl> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <chr> <dbl>
1 2024 Helene Sep 23 (FEV) Warn-A6… Cat4 938 221 27.5 w4 9.3
2 2024 Helene Sep 23 (FEV) Warn-A6… Cat4 938 221 27.5 w4 9.3
3 2024 Helene Sep 23 (FEV) Warn-A6… Cat4 938 221 27.5 w4 9.3
4 2024 Helene Sep 23 (FEV) Warn-A6… Cat4 938 221 27.5 w4 9.3
5 2024 Helene Sep 23 (FEV) Warn-A6… Cat4 938 221 27.5 w4 9.3
6 2024 Helene Sep 23 (FEV) Warn-A6… Cat4 938 221 27.5 w4 9.3
# ℹ 1 more variable: area <chr>
Analysis
Now that our data is tidy, we can perform some analysis. We’d like to know the quantitative impact of a storm by its category, so we will group by category and then visualize the impact for pres
, dead
, bn
, and mhhw
.
<- c('Cat5', 'Cat4', 'Cat3', 'Cat2', 'Cat1', 'TS')
category_levels
<- df |>
gb pivot_longer(cols = c(pres, dead, bn, mhhw),
names_to = "metric",
values_to = "value") |>
mutate(category = factor(category, levels = category_levels))
= c(
storm_palette pres = "#2f3e46",
dead = "#354f52",
bn = "#52796f",
mhhw = "#84a98c"
)
ggplot(gb, aes(x = category, y = value, fill = metric)) +
geom_bar(stat = 'identity', position = 'dodge') +
labs(
title = "Impact of Tropical Storms by Category",
x = "Category",
y = "Metric") +
theme_minimal() +
theme(legend.position = "none") +
scale_fill_manual(values = storm_palette) +
facet_wrap(~ metric, scales = "free")
We had to do some research to understand these metrics but we understand them to be as below;
bn
indicates the dollar value of destruction, in billionsdead
indicates the number of individuals killed by the stormmhhw
represents the Mean Higher High Water, or the average level of floodingpres
represents the mean pressure of the storm
We can see from our plots that Category 5 storms are the most lethal my a considerable margin, as well as the mostly costly. Flooding is relatively similar as measured by mhhw
for category 5 and 4, while tropical storms have less impact.
Interestingly, tropical storms seem to have the highest pressure. Upon further research, I discovered that higher pressures indicate the storm is less “concentrated” and potentially dissipating. This is consistent with our results as tropical storms are considered the least severe type of storm.
Conclusion
In this article, we scraped tropical storms data from a NOAA webpage. We then cleaned and transformed the data into usable shape. We then visualized the storms by category to better understand the relative impacts.