National Storms

Author

Kevin Havis

Published

September 28, 2024

National Storms

The NOAA provides data on tropical storms and hurricanes on their website, however the data is not very clean.

In this article we will focus on scraping the data from the web and cleaning it into usable fashion. We will then compare the storms’ quantitative impact by category.

Data recommended by Kim Koon

library(tidyverse)
library(rvest)
library(janitor)
library(knitr)

Web scraping

First we will scrape the data from the web page. It is already formatted as an HTML table so scraping should be simple.

As an intersting note, this table was actually updated to include Hurricane Helene in between writing this notebook and publishing.

# Read html

url <- "https://vlab.noaa.gov/web/mdl/historic-storms"
webpage <- read_html(url)
# Specify the node we're interested in
div_node <- html_node(webpage, xpath = "//table[1]")
# Convert it to a table
table <- div_node |> 
  html_table()
# Our starting point
head(table)
# A tibble: 6 × 7
  `2024-Storm`    Date  `Storm-Tide` Obs   Guidance `Cat, Pres, Dead, $bn` Area 
  <chr>           <chr> <chr>        <chr> <chr>    <chr>                  <chr>
1 "2024-Helene"   Sep … "Obs w4: >9… "(FE… "Warn-A… (Cat4, 938, 221+, $27… "N.W…
2 "2024-Francine" Sep … "w2: 5 mhhw" ""    "Warn-A… (Cat2, 972, 0, $1.5)   "LA,…
3 "2024-Gilma-cp" Aug … "w1: 3 mhhw" ""    "PS-Adv… (Cat4, 949, -, -)      "HI" 
4 "2024-Hone-cp"  Aug … "w1: 3 mhhw" ""    "PS-Adv… (Cat1, 988, -, -)      "HI" 
5 "2024-Post-\nT… Aug … "w3: 6.3 mh… ""    "P-ETSS" (Cat4, 947, -, -)      "AK" 
6 "2024-Debby"    Aug … "w2: 5.1 mh… "(FE… "Warn-A… (Cat1, 979, 10, >$1)   "N.W…
write_csv(table, "national_storms_raw_data.csv")

Data cleaning

Now we can focus on cleaning the data. As we can see, there are many combined columns in this dataset, with mixed delimiters, white spaces, and structure.

We will try to leave the values itself as intact as possible and perform minimal transformation; as we do not understand the meaning behind each variable well, we should leave that to the meteorologists.

Clean column names

Janitor is a nice package to quickly get data into workable shape. Here we’ll just clean the column names so they’re a bit more workable.

df <- read.csv("national_storms_raw_data.csv")

df <- janitor::clean_names(table)

Separating the columns

Now we will focus on separating the data into tidy and respective columns, one column at a time.

# fix storm column

df <- df |> 
  mutate(x2024_storm = str_replace_all(x2024_storm, "\\s+", "")) |> 
  separate_wider_regex(x2024_storm,
                        patterns = c(
                          year = "\\d{4}",
                          "-",
                          storm = ".*"
                          ),
                          too_few = "align_start"
                       )
# fix date column

df <- df |> 
  separate_wider_regex(date, 
                       patterns = c(
                         month = "\\w{3}",
                         "\\s",
                         day = "\\d{1,2}"
                       ),
                       too_few = "align_start"
                       )
# Fix surge column

# Extract class
class <- df |> 
  mutate(class = str_extract(storm_tide, "w\\d")) |> 
  select(class)

# Extract mhhw
mhhw <- df |> 
  mutate(mhhw = str_extract(storm_tide, "(\\d*\\.?\\d+) mhhw")) |> 
  mutate(mhhw = str_extract(mhhw, "\\d*\\.?\\d+")) |> 
  select(mhhw)

# Reassble the dataframe
df <- df |> 
  cbind(class, mhhw) |> 
  select(-storm_tide)
# fix cat_pres_dead_bn column

df <- df |> separate_wider_delim(
    cat_pres_dead_bn, delim = ",",
    names = c("category", "pres", "dead", "bn")
    ) |> 
  mutate(
    across(
      category:bn,
      ~ str_replace_all(.x, "[-)(>$+]", "")
      )
    )
# 
df <- df |> separate_wider_delim(
  area,
  ",",
  names_sep = "",
  too_few = "align_start"
  )
# Show our progress
head(df)
# A tibble: 6 × 19
  year  storm  month day   obs   guidance category pres  dead  bn    area1 area2
  <chr> <chr>  <chr> <chr> <chr> <chr>    <chr>    <chr> <chr> <chr> <chr> <chr>
1 2024  Helene Sep   23    "(FE… "Warn-A… Cat4     " 93… " 22… " 27… N.W.… " GA"
2 2024  Franc… Sep   12    ""    "Warn-A… Cat2     " 97… " 0"  " 1.… LA    " MS"
3 2024  Gilma… Aug   30    ""    "PS-Adv… Cat4     " 94… " "   " "   HI     <NA>
4 2024  Hone-… Aug   25    ""    "PS-Adv… Cat1     " 98… " "   " "   HI     <NA>
5 2024  Post-… Aug   22    ""    "P-ETSS" Cat4     " 94… " "   " "   AK     <NA>
6 2024  Debby  Aug   5     "(FE… "Warn-A… Cat1     " 97… " 10" " 1"  N.W.… " GA"
# ℹ 7 more variables: area3 <chr>, area4 <chr>, area5 <chr>, area6 <chr>,
#   area7 <chr>, class <chr>, mhhw <chr>

Long format and nulls

This data is actually relatively well suited to a wide format given the amount of dimensions, but we can improve the areas column by converting that into long format.

We’ll also make sure we’ve replaced any empty strings will null values.

df <- df |> pivot_longer(
  cols = starts_with("area"),
  names_to = NULL,
  values_to = "area",
  values_drop_na = TRUE)
# Fill NAs

df <- df |> 
  mutate(across(c(year, day, pres, dead, bn, mhhw), ~as.numeric(.x))) |> 
  mutate(across(where(is.character), ~ na_if(.,"")))
head(df)
# A tibble: 6 × 13
   year storm  month   day obs   guidance category  pres  dead    bn class  mhhw
  <dbl> <chr>  <chr> <dbl> <chr> <chr>    <chr>    <dbl> <dbl> <dbl> <chr> <dbl>
1  2024 Helene Sep      23 (FEV) Warn-A6… Cat4       938   221  27.5 w4      9.3
2  2024 Helene Sep      23 (FEV) Warn-A6… Cat4       938   221  27.5 w4      9.3
3  2024 Helene Sep      23 (FEV) Warn-A6… Cat4       938   221  27.5 w4      9.3
4  2024 Helene Sep      23 (FEV) Warn-A6… Cat4       938   221  27.5 w4      9.3
5  2024 Helene Sep      23 (FEV) Warn-A6… Cat4       938   221  27.5 w4      9.3
6  2024 Helene Sep      23 (FEV) Warn-A6… Cat4       938   221  27.5 w4      9.3
# ℹ 1 more variable: area <chr>

Analysis

Now that our data is tidy, we can perform some analysis. We’d like to know the quantitative impact of a storm by its category, so we will group by category and then visualize the impact for pres, dead, bn, and mhhw.

category_levels <- c('Cat5', 'Cat4', 'Cat3', 'Cat2', 'Cat1', 'TS')

gb <- df |> 
  pivot_longer(cols = c(pres, dead, bn, mhhw),
               names_to = "metric",
               values_to = "value") |> 
  mutate(category = factor(category, levels = category_levels))

storm_palette = c(
  pres = "#2f3e46",
  dead = "#354f52",
  bn = "#52796f",
  mhhw = "#84a98c"
)

ggplot(gb, aes(x = category, y = value, fill = metric)) +
  geom_bar(stat = 'identity', position = 'dodge') +
  labs(
    title = "Impact of Tropical Storms by Category",
    x = "Category",
    y = "Metric") +
  theme_minimal() +
  theme(legend.position = "none") +
  scale_fill_manual(values = storm_palette) +
  facet_wrap(~ metric, scales = "free")

We had to do some research to understand these metrics but we understand them to be as below;

  • bn indicates the dollar value of destruction, in billions
  • dead indicates the number of individuals killed by the storm
  • mhhw represents the Mean Higher High Water, or the average level of flooding
  • pres represents the mean pressure of the storm

We can see from our plots that Category 5 storms are the most lethal my a considerable margin, as well as the mostly costly. Flooding is relatively similar as measured by mhhw for category 5 and 4, while tropical storms have less impact.

Interestingly, tropical storms seem to have the highest pressure. Upon further research, I discovered that higher pressures indicate the storm is less “concentrated” and potentially dissipating. This is consistent with our results as tropical storms are considered the least severe type of storm.

Conclusion

In this article, we scraped tropical storms data from a NOAA webpage. We then cleaned and transformed the data into usable shape. We then visualized the storms by category to better understand the relative impacts.