1 Purpose and context

The aim of this analysis was to explore whether Atlantic hurricanes can be meaningfully grouped based on selected characteristics describing their intensity, duration, and basic spatial properties, such as the location of origin and first landfall.

Rather than attempting to classify hurricanes according to predefined categories, the analysis follows an exploratory approach. The objective was to examine whether recurring patterns emerge directly from historical data and whether these patterns remain stable across different clustering methods.

This analysis was prepared as part of the course Unsupervised Machine Learning at the University of Warsaw.

2 Libraries

library(tidyverse)
library(lubridate)
library(cluster)
library(dbscan)
library(sf)
library(rnaturalearth)
library(rnaturalearthdata)
library(maps)
library(factoextra)
library(gridExtra)

3 Data

This analysis is based on the HURDAT2 Atlantic Summary and Track Datasets, a publicly available dataset hosted on the Kaggle platform.

The data originate from the official HURDAT2 (Hurricane Database 2) maintained by the National Hurricane Center (NHC) and include detailed information on Atlantic tropical cyclones, such as: - storm tracks and trajectories, - maximum sustained wind speeds, - minimum central pressure, - landfall locations, - basic life-cycle characteristics of each storm.

The dataset was published on Kaggle by Wesley Neill and is available at:
https://www.kaggle.com/datasets/wesleyneill/hurdat2-atlantic-summary-and-track-datasets

For the purpose of this project, selected variables describing storm intensity, spatial characteristics (initial and final positions), and the number of recorded observations during the storm lifetime were used. Prior to clustering, the data were cleaned and standardized to ensure comparability of features.

It should be noted that the selected variables provide a simplified representation of each hurricane. The analysis is based on aggregated characteristics at the storm level and does not account for temporal changes in intensity or trajectory during the hurricane’s lifetime.

As a result, the clustering reflects overall storm properties rather than detailed dynamic behavior, and the findings should be interpreted within this limitation.

data_dir <- params$data_dir

hur     <- read_csv(file.path(data_dir, "Atlantic_Hurricane_Tracks.csv"))
hur_sum <- read_csv(file.path(data_dir, "Atlantic_Hurricane_Summary_Data.csv"))

glimpse(hur)
## Rows: 51,840
## Columns: 23
## $ storm_id     <chr> "AL011851", "AL011851", "AL011851", "AL011851", "AL011851…
## $ name         <chr> "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "U…
## $ records      <dbl> 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 1…
## $ date         <date> 1851-06-25, 1851-06-25, 1851-06-25, 1851-06-25, 1851-06-…
## $ time         <chr> "0000", "0600", "1200", "1800", "2100", "0000", "0600", "…
## $ record_id    <chr> NA, NA, NA, NA, "L", NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ status       <chr> "HU", "HU", "HU", "HU", "HU", "HU", "TS", "TS", "TS", "TS…
## $ latitude     <dbl> 28.0, 28.0, 28.0, 28.1, 28.2, 28.2, 28.3, 28.4, 28.6, 29.…
## $ longitude    <dbl> -94.8, -95.4, -96.0, -96.5, -96.8, -97.0, -97.6, -98.3, -…
## $ max_wind     <dbl> 80, 80, 80, 80, 80, 70, 60, 60, 50, 50, 40, 40, 40, 40, 8…
## $ min_pressure <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ ne34ktr      <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ se34ktr      <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ sw34ktr      <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ nw34ktr      <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ ne50ktr      <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ se50ktr      <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ sw50ktr      <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ nw50ktr      <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ ne64ktr      <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ se64ktr      <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ sw64ktr      <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ nw64ktr      <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
glimpse(hur_sum)
## Rows: 1,893
## Columns: 17
## $ storm_id          <chr> "AL011851", "AL021851", "AL031851", "AL041851", "AL0…
## $ name              <chr> "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED…
## $ records           <dbl> 14, 1, 1, 49, 16, 17, 45, 8, 20, 36, 25, 1, 1, 48, 1…
## $ landfall          <dbl> 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ hurricane         <dbl> 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0…
## $ hurr_landfall     <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ lifetime_max_wind <dbl> 80, 80, 50, 100, 50, 60, 100, 70, 70, 80, 90, 50, 40…
## $ lifetime_min_pres <dbl> NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, Na…
## $ init_latitude     <dbl> 28.0, 22.2, 12.0, 13.4, 32.5, 28.7, 20.5, 17.0, 26.4…
## $ init_longitude    <dbl> -94.8, -97.6, -60.0, -48.0, -73.5, -78.0, -67.1, -64…
## $ init_date         <date> 1851-06-25, 1851-07-05, 1851-07-10, 1851-08-16, 185…
## $ final_latitude    <dbl> 31.0, 22.2, 12.0, 48.5, 32.5, 42.0, 41.0, 19.7, 34.0…
## $ final_longitude   <dbl> -100.2, -97.6, -60.0, -54.2, -73.5, -71.6, -68.0, -7…
## $ final_date        <date> 1851-06-28, 1851-07-05, 1851-07-10, 1851-08-27, 185…
## $ flf_latitude      <dbl> 28.2, NaN, NaN, 30.1, NaN, 41.1, 30.2, NaN, 28.0, Na…
## $ flf_longitude     <dbl> -96.8, NaN, NaN, -85.7, NaN, -71.7, -88.6, NaN, -82.…
## $ flf_date          <chr> "1851-06-25 00:00:00", "NaN", "NaN", "1851-08-23 00:…

4 Preprocessing and feature construction

4.1 Dates and time variables

To make the time-related fields usable in the analysis, the date columns were converted to appropriate formats. Initial and final storm dates were stored as Date, while the first landfall timestamp was parsed as a datetime value. Based on these fields, additional features were derived (storm lifetime in days, season year, and the month of first landfall), which are later used for summary comparisons.

hur_sum <- hur_sum %>% 
  mutate(
    init_date  = as.Date(init_date),
    final_date = as.Date(final_date),
    flf_date   = ymd_hms(flf_date, quiet = TRUE),
    lifetime_days = as.numeric(difftime(final_date, init_date, units = "days")),
    season        = year(init_date),
    flf_month     = month(flf_date)
  )

4.2 Filtering – storms with recorded first landfall

The analysis focuses on hurricanes for which the first landfall was recorded. Storms without available landfall latitude or longitude were excluded so that all retained observations contain valid spatial information related to landfall. As a result, the dataset represents storms that directly affected land areas, which is important for interpreting landfall-related patterns.

hur_usa <- hur_sum %>% 
  filter(!is.na(flf_latitude),
         !is.na(flf_longitude))

4.3 A collection of hurricane-level features

From the filtered dataset, a set of hurricane-level features was constructed. The selected variables describe storm duration, intensity, spatial characteristics at different stages of the storm lifecycle, and seasonal timing. Each observation represents a single hurricane, summarized by its key attributes, which form the basis for subsequent clustering analyses.

hur_feat <- hur_usa %>% 
  select(
    storm_id,
    season,
    records,
    lifetime_days,
    lifetime_max_wind,
    init_latitude, init_longitude,
    final_latitude, final_longitude,
    flf_latitude, flf_longitude,
    flf_month
  )

summary(hur_feat)
##    storm_id             season        records       lifetime_days  
##  Length:648         Min.   :1851   Min.   :  1.00   Min.   : 0.00  
##  Class :character   1st Qu.:1903   1st Qu.: 17.00   1st Qu.: 4.00  
##  Mode  :character   Median :1951   Median : 28.00   Median : 6.00  
##                     Mean   :1948   Mean   : 31.37   Mean   : 7.13  
##                     3rd Qu.:1999   3rd Qu.: 41.00   3rd Qu.:10.00  
##                     Max.   :2019   Max.   :133.00   Max.   :32.00  
##  lifetime_max_wind init_latitude   init_longitude   final_latitude 
##  Min.   : 25.0     Min.   : 7.50   Min.   :-97.50   Min.   : 9.10  
##  1st Qu.: 50.0     1st Qu.:14.60   1st Qu.:-87.30   1st Qu.:30.50  
##  Median : 70.0     Median :19.80   Median :-79.35   Median :35.40  
##  Mean   : 77.6     Mean   :19.99   Mean   :-73.76   Mean   :36.97  
##  3rd Qu.:100.0     3rd Qu.:25.00   3rd Qu.:-61.67   3rd Qu.:44.50  
##  Max.   :160.0     Max.   :44.00   Max.   :-16.90   Max.   :81.00  
##  final_longitude    flf_latitude   flf_longitude      flf_month     
##  Min.   :-109.50   Min.   :10.10   Min.   :-98.00   Min.   : 1.000  
##  1st Qu.: -91.17   1st Qu.:23.10   1st Qu.:-90.20   1st Qu.: 8.000  
##  Median : -79.30   Median :29.00   Median :-83.10   Median : 9.000  
##  Mean   : -74.06   Mean   :27.34   Mean   :-82.46   Mean   : 8.475  
##  3rd Qu.: -62.90   3rd Qu.:30.30   3rd Qu.:-78.08   3rd Qu.: 9.000  
##  Max.   :  63.00   Max.   :51.90   Max.   : -7.10   Max.   :12.000

4.4 Removal of deficiencies in key features + standardization

Before applying distance-based clustering methods, storms with missing values in the features used for clustering were removed to ensure that all algorithms operate on the same set of observations.

Only the variables used directly in clustering were standardized using z-scores. This prevents variables measured on different scales (e.g., coordinates vs. counts) from dominating the distance calculations.

hur_feat_complete <- hur_feat %>% 
  drop_na(
    records,
    lifetime_max_wind,
    init_latitude, init_longitude,
    flf_latitude, flf_longitude
  )

feat_scaled_tbl <- hur_feat_complete %>% 
  select(
    records,
    lifetime_max_wind,
    init_latitude, init_longitude,
    flf_latitude, flf_longitude
  ) %>% 
  scale() %>% 
  as_tibble()

summary(feat_scaled_tbl)
##     records        lifetime_max_wind init_latitude      init_longitude   
##  Min.   :-1.6778   Min.   :-1.7334   Min.   :-2.01993   Min.   :-1.2737  
##  1st Qu.:-0.7939   1st Qu.:-0.9096   1st Qu.:-0.87126   1st Qu.:-0.7265  
##  Median :-0.1862   Median :-0.2505   Median :-0.02999   Median :-0.3000  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.5320   3rd Qu.: 0.7382   3rd Qu.: 0.81129   3rd Qu.: 0.6482  
##  Max.   : 5.6146   Max.   : 2.7155   Max.   : 3.88520   Max.   : 3.0503  
##   flf_latitude     flf_longitude     
##  Min.   :-2.5192   Min.   :-1.31949  
##  1st Qu.:-0.6199   1st Qu.:-0.65735  
##  Median : 0.2420   Median :-0.05464  
##  Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.4320   3rd Qu.: 0.37193  
##  Max.   : 3.5876   Max.   : 6.39694

5 K-means: selecting the number of clusters

Selecting the number of clusters is one of the main practical decisions in k-means. Here, the choice of k is guided using the average silhouette width, computed for a range of candidate values.

Silhouette is a useful heuristic: it rewards compact clusters that are also well separated from each other. In practice, differences between nearby values of k can be small, so the silhouette result is treated as guidance rather than an absolute rule.

5.1 Average silhouette for different k

For each candidate value of k, k-means was fitted to the standardized feature set and evaluated using the silhouette coefficient. The mean silhouette width across storms is reported below; the value of k with the highest mean silhouette is used as the default choice for the rest of the analysis.

set.seed(123)

n_unique <- nrow(distinct(feat_scaled_tbl))

k_min <- 2
k_max <- min(8, n_unique - 1)
k_range <- k_min:k_max

sil_means <- purrr::map_dbl(k_range, function(k) {
  km <- kmeans(feat_scaled_tbl, centers = k, nstart = 25)
  sil <- silhouette(km$cluster, dist(feat_scaled_tbl))
  mean(sil[, "sil_width"])
})

sil_df <- tibble(k = k_range, sil_mean = sil_means)

ggplot(sil_df, aes(x = k, y = sil_mean)) +
  geom_line() +
  geom_point() +
  theme_minimal() +
  labs(
    title = "Average silhouette for different k (k-means)",
    x = "Number of clusters (k)",
    y = "Average silhouette width"
  )

k_best <- sil_df$k[which.max(sil_df$sil_mean)]
k_best
## [1] 2

5.2 Comparison of k selection: k-means vs. PAM vs. CLARA

Different clustering algorithms may suggest different optimal numbers of clusters due to their underlying assumptions and optimization criteria. To assess the robustness of the cluster structure, the optimal number of clusters was evaluated using the silhouette method for three algorithms: k-means, Partitioning Around Medoids (PAM), and CLARA.

For each method, the average silhouette width was computed across a range of candidate values of k. Comparing the resulting silhouette profiles allows for assessing whether different algorithms converge toward a similar number of clusters, thereby increasing confidence in the selected solution.

set.seed(123)

a <- fviz_nbclust(
  feat_scaled_tbl,
  FUNcluster = kmeans,
  method = "silhouette"
) + ggtitle("k-means")

b <- fviz_nbclust(
  feat_scaled_tbl,
  FUNcluster = pam,
  method = "silhouette"
) + ggtitle("PAM")

clara_fun <- function(x, k) {
  clara(as.matrix(x), k = k, metric = "euclidean", stand = FALSE)
}

c <- fviz_nbclust(
  feat_scaled_tbl,
  FUNcluster = clara_fun,
  method = "silhouette"
) + ggtitle("CLARA")

grid.arrange(a, b, c, ncol = 2,
             top = "Optimal number of clusters (silhouette)")

6 K-means: model, evaluation, and interpretation

After choosing k, k-means was fitted to the standardized feature set. Multiple random starts were used to reduce the risk of a suboptimal solution due to initialization.

To get a quick sense of how well-separated the clusters are in the chosen feature space, the silhouette coefficient is reported. The average silhouette gives a single summary number, while the silhouette plot shows how consistently individual storms fit their assigned cluster.

set.seed(123)
km_best <- kmeans(feat_scaled_tbl, centers = k_best, nstart = 25)

sil_best <- silhouette(km_best$cluster, dist(feat_scaled_tbl))
mean(sil_best[, "sil_width"])
## [1] 0.329483
plot(
  sil_best,
  main = paste0("Silhouette plot – k-means, k = ", k_best),
  col = 2:(k_best + 1),
  border = NA
)

6.1 Cluster profiles (average characteristics)

To interpret the k-means solution, cluster-level summaries were computed by aggregating hurricane characteristics within each cluster. For every cluster, we report the number of storms and the mean values of selected variables describing storm intensity, duration, and spatial properties (initial position, final position, and first landfall location), as well as the month of first landfall.

These profiles provide an intuitive way to characterize how clusters differ from one another (e.g., typical storm intensity, lifespan, and geographical behavior). The accompanying bar chart shows the distribution of observations across clusters, which is useful for identifying potential class imbalance (very small or dominant clusters) that may affect interpretability.

Since these are cluster averages, they can hide within-cluster variability; the profiles should be read as rough summaries rather than precise “typical storms”.

hur_clusters <- hur_feat_complete %>% 
  mutate(cluster_kmeans = factor(km_best$cluster))

cluster_profiles <- hur_clusters %>% 
  group_by(cluster_kmeans) %>% 
  summarise(
    n = n(),
    across(
      c(records, lifetime_days, lifetime_max_wind,
        init_latitude, init_longitude,
        final_latitude, final_longitude,
        flf_latitude, flf_longitude, flf_month),
      ~ mean(.x, na.rm = TRUE),
      .names = "mean_{.col}"
    ),
    .groups = "drop"
  )

cluster_profiles
ggplot(hur_clusters, aes(x = cluster_kmeans)) +
  geom_bar() +
  theme_minimal() +
  labs(
    title = "Number of hurricanes in clusters (k-means)",
    x = "Cluster (k-means)",
    y = "Number of hurricanes"
  )

7 Comparison of methods for k = 2, 3, 4 (k-means, PAM, CLARA)

To further assess the stability and interpretability of the clustering structure, three partitioning methods—k-means, PAM, and CLARA—were compared for different numbers of clusters (k = 2, 3, and 4). Visual cluster representations were used to examine how consistently each method partitions the feature space and how cluster separation changes as the number of clusters increases.

Comparing methods across multiple values of k helps identify solutions that are robust to the choice of algorithm. Consistent patterns across methods suggest a strong underlying cluster structure, while large discrepancies may indicate sensitivity to initialization, distance calculations, or the presence of outliers.

set.seed(123)

km2 <- kmeans(feat_scaled_tbl, centers = 2, nstart = 25)
km3 <- kmeans(feat_scaled_tbl, centers = 3, nstart = 25)
km4 <- kmeans(feat_scaled_tbl, centers = 4, nstart = 25)

pam2 <- pam(feat_scaled_tbl, k = 2)
pam3 <- pam(feat_scaled_tbl, k = 3)
pam4 <- pam(feat_scaled_tbl, k = 4)

clara2 <- clara(as.matrix(feat_scaled_tbl), k = 2, metric = "euclidean", stand = FALSE, samples = 50)
clara3 <- clara(as.matrix(feat_scaled_tbl), k = 3, metric = "euclidean", stand = FALSE, samples = 50)
clara4 <- clara(as.matrix(feat_scaled_tbl), k = 4, metric = "euclidean", stand = FALSE, samples = 50)

p_k2_1 <- fviz_cluster(km2, data = feat_scaled_tbl, geom = "point") + ggtitle("K-means (k = 2)")
p_k2_2 <- fviz_cluster(pam2, geom = "point")                        + ggtitle("PAM (k = 2)")
p_k2_3 <- fviz_cluster(clara2, geom = "point")                      + ggtitle("CLARA (k = 2)")
grid.arrange(p_k2_1, p_k2_2, p_k2_3, ncol = 2, top = "Clustering – k = 2")

p_k3_1 <- fviz_cluster(km3, data = feat_scaled_tbl, geom = "point") + ggtitle("K-means (k = 3)")
p_k3_2 <- fviz_cluster(pam3, geom = "point")                        + ggtitle("PAM (k = 3)")
p_k3_3 <- fviz_cluster(clara3, geom = "point")                      + ggtitle("CLARA (k = 3)")
grid.arrange(p_k3_1, p_k3_2, p_k3_3, ncol = 2, top = "Clustering – k = 3")

p_k4_1 <- fviz_cluster(km4, data = feat_scaled_tbl, geom = "point") + ggtitle("K-means (k = 4)")
p_k4_2 <- fviz_cluster(pam4, geom = "point")                        + ggtitle("PAM (k = 4)")
p_k4_3 <- fviz_cluster(clara4, geom = "point")                      + ggtitle("CLARA (k = 4)")
grid.arrange(p_k4_1, p_k4_2, p_k4_3, ncol = 2, top = "Clustering – k = 4")

8 Hierarchical clustering (Ward)

Ward’s hierarchical clustering provides another view of the structure in the standardized feature space. At a high level, the dendrogram suggests a separation that is broadly consistent with a two-cluster solution.

Cutting the dendrogram at k = 2 yields clusters that align reasonably well with the k-means split, which suggests that the two-group structure is not purely an artifact of k-means initialization. At the same time, the dendrogram also shows gradual substructure below that level, which may reflect continuous variation rather than clearly separated hurricane “types”.

d <- dist(feat_scaled_tbl)
hc <- hclust(d, method = "ward.D2")

plot(hc, labels = FALSE,
     main = "Hierarchical clustering of hurricanes (Ward.D2)",
     xlab = "", sub = "")

rect.hclust(hc, k = k_best, border = "red")

clusters_hc <- cutree(hc, k = k_best)
hur_clusters <- hur_clusters %>% mutate(cluster_hc = factor(clusters_hc))

9 Time series analysis: clusters in decades

The stacked bar chart shows the relative share of k-means clusters across successive decades. Both clusters appear throughout the observation period, but their proportions vary over time.

This pattern may reflect changes in the composition of storms as captured by the selected features. However, it may also be influenced by long-term differences in data coverage and measurement practices across historical periods, so the decade trends should be interpreted cautiously.

The plot focuses on relative shares rather than absolute counts, which helps compare decades with different overall levels of hurricane activity.

hur_clusters %>% 
  mutate(decade = floor(season / 10) * 10) %>% 
  count(decade, cluster_kmeans) %>% 
  group_by(decade) %>% 
  mutate(prop = n / sum(n)) %>% 
  ggplot(aes(x = decade, y = prop, fill = cluster_kmeans)) +
  geom_col(position = "fill") +
  theme_minimal() +
  labs(
    title = "Hurricane cluster share in subsequent decades (k-means)",
    x = "Decade",
    y = "Share in a given decade",
    fill = "Cluster (k-means)"
  )

10 CLARA: k-medoids on samples + comparison with k-means

To check whether the two-cluster structure is sensitive to the choice of algorithm, CLARA was applied using the same k as in the k-means solution. Since CLARA is based on medoids and operates on subsamples, it can be less sensitive to extreme observations than k-means.

The resulting assignments were compared with k-means using a contingency table to see whether storms tend to stay in the same group across methods.

k_clara <- k_best
set.seed(123)

clara_fit <- clara(
  x = as.matrix(feat_scaled_tbl),
  k = k_clara,
  metric = "euclidean",
  stand = FALSE,
  samples = 50
)

hur_clara <- hur_feat_complete %>% 
  mutate(cluster_clara = factor(clara_fit$clustering))

hur_compare <- hur_clusters %>% 
  select(storm_id, cluster_kmeans) %>% 
  inner_join(hur_clara %>% select(storm_id, cluster_clara), by = "storm_id")

table(hur_compare$cluster_kmeans, hur_compare$cluster_clara)
##    
##       1   2
##   1   0 258
##   2 340  50

11 DBSCAN: density clusters + outliers

The k-nearest neighbor distance plot suggests a natural choice of the neighborhood radius (eps) around 1.5, corresponding to a visible change in slope of the curve. Using this value together with minPts = 4, the DBSCAN algorithm identifies a single dominant dense cluster and a small number of observations classified as noise.

In total, only 14 hurricanes are labeled as outliers, indicating that the majority of storms form a relatively homogeneous density structure in the standardized feature space. This result suggests that, at the chosen scale, the data do not naturally separate into multiple dense regions.

Compared to k-means and hierarchical clustering, DBSCAN provides a complementary perspective by highlighting atypical storms rather than enforcing a fixed number of clusters.

kNNdistplot(as.matrix(feat_scaled_tbl), k = 4)
abline(h = 1.5, col = "red")

eps_val <- 1.5  
min_pts <- 4

db_fit <- dbscan(
  as.matrix(feat_scaled_tbl),
  eps = eps_val,
  minPts = min_pts
)

hur_db <- hur_feat_complete %>% 
  mutate(cluster_dbscan = db_fit$cluster) %>% 
  mutate(
    cluster_dbscan = ifelse(cluster_dbscan == 0, NA, cluster_dbscan),
    cluster_dbscan = factor(cluster_dbscan)
  )

sum(is.na(hur_db$cluster_dbscan))
## [1] 14

12 Visualizations on the map

To better interpret the clusters, the first landfall locations were plotted on a map and colored by the k-means cluster assignment. This helps check whether the groups differ not only in the feature space but also geographically.

The maps are mainly descriptive, but they provide an intuitive geographic interpretation of the clustering results and suggest differences in preferred landfall regions.

12.1 Map from rnaturalearth + sf (Atlantic Ocean / USA)

world <- ne_countries(scale = "medium", returnclass = "sf")
world_crop <- st_crop(world, xmin = -100, xmax = -10, ymin = 0, ymax = 50)

ggplot() +
  geom_sf(data = world_crop, fill = "gray90", color = "gray60") +
  geom_point(
    data = hur_clusters,
    aes(x = flf_longitude, y = flf_latitude, color = cluster_kmeans),
    size = 2, alpha = 0.8
  ) +
  coord_sf(xlim = c(-100, -10), ylim = c(0, 50), expand = FALSE) +
  theme_minimal() +
  labs(
    title = "Hurricanes – clustering (k-means) against a coastal map",
    x = "Longitude (°)",
    y = "Latitude (°)",
    color = "Cluster (k-means)"
  )

12.2 Simpler map (maps)

usa <- map_data("world")

ggplot() +
  geom_polygon(
    data = usa,
    aes(x = long, y = lat, group = group),
    fill = "gray90", color = "gray60"
  ) +
  geom_point(
    data = hur_clusters,
    aes(x = flf_longitude, y = flf_latitude, color = cluster_kmeans),
    size = 2, alpha = 0.8
  ) +
  coord_fixed(xlim = c(-100, -10), ylim = c(0, 50)) +
  theme_minimal() +
  labs(
    title = "Hurricanes – k-means on a map background (maps)",
    x = "Longitude (°)",
    y = "Latitude (°)",
    color = "Cluster (k-means)"
  )

13 Comparison of methods: k-means, Ward, CLARA, DBSCAN

Comparing cluster size distributions for k-means, Ward, and CLARA shows that all three methods produce two groups of comparable size (i.e., no extreme imbalance where one cluster contains only a handful of storms). This makes the two-cluster solution easier to interpret and less likely to be driven by a very small subset of observations.

In contrast, DBSCAN identifies one large dense cluster and a small group of noise points. While this structure is useful for detecting atypical storms, it is less suitable for describing multiple hurricane types.

Overall, the consistency of cluster sizes across k-means, Ward, and CLARA suggests a stable and interpretable partition of the data.

clusters_all <- hur_feat_complete %>% 
  mutate(
    kmeans = factor(km_best$cluster),
    ward   = factor(clusters_hc),
    clara  = factor(clara_fit$clustering),
    dbscan_raw = db_fit$cluster
  ) %>% 
  mutate(
    dbscan = ifelse(dbscan_raw == 0, NA, dbscan_raw),
    dbscan = factor(dbscan)
  )

mean_silhouette <- function(labels, dist_mat) {
  labs_clean <- na.omit(labels)
  if (length(unique(labs_clean)) < 2) return(NA_real_)
  sil <- silhouette(
    as.integer(labs_clean),
    as.dist(as.matrix(dist_mat)[!is.na(labels), !is.na(labels)])
  )
  mean(sil[, "sil_width"])
}

d <- dist(feat_scaled_tbl)

kmeans_summary <- clusters_all %>% 
  count(kmeans, name = "n_points") %>% 
  summarise(
    method = "kmeans",
    n_clusters = n(),
    avg_cluster_size = mean(n_points),
    min_cluster_size = min(n_points),
    max_cluster_size = max(n_points)
  ) %>% 
  mutate(mean_sil = mean_silhouette(clusters_all$kmeans, d))

ward_summary <- clusters_all %>% 
  count(ward, name = "n_points") %>% 
  summarise(
    method = "ward",
    n_clusters = n(),
    avg_cluster_size = mean(n_points),
    min_cluster_size = min(n_points),
    max_cluster_size = max(n_points)
  ) %>% 
  mutate(mean_sil = mean_silhouette(clusters_all$ward, d))

clara_summary <- clusters_all %>% 
  count(clara, name = "n_points") %>% 
  summarise(
    method = "clara",
    n_clusters = n(),
    avg_cluster_size = mean(n_points),
    min_cluster_size = min(n_points),
    max_cluster_size = max(n_points)
  ) %>% 
  mutate(mean_sil = mean_silhouette(clusters_all$clara, d))

dbscan_summary <- clusters_all %>% 
  filter(!is.na(dbscan)) %>% 
  count(dbscan, name = "n_points") %>% 
  summarise(
    method = "dbscan",
    n_clusters = n(),
    avg_cluster_size = mean(n_points),
    min_cluster_size = min(n_points),
    max_cluster_size = max(n_points)
  ) %>% 
  mutate(
    mean_sil = mean_silhouette(clusters_all$dbscan, d),
    n_noise  = sum(is.na(clusters_all$dbscan))
  )

methods_summary <- bind_rows(kmeans_summary, ward_summary, clara_summary, dbscan_summary)
methods_summary
table(clusters_all$kmeans, clusters_all$ward)
##    
##       1   2
##   1  28 230
##   2 367  23
table(clusters_all$kmeans, clusters_all$clara)
##    
##       1   2
##   1   0 258
##   2 340  50
clusters_all %>% 
  pivot_longer(cols = c(kmeans, ward, clara),
               names_to = "method",
               values_to = "cluster") %>% 
  ggplot(aes(x = cluster)) +
  geom_bar() +
  facet_wrap(~ method, scales = "free_y") +
  theme_minimal() +
  labs(
    title = "Distribution of cluster sizes for different methods",
    x = "Cluster",
    y = "Number of hurricanes"
  )

clusters_all %>% 
  mutate(dbscan_label = ifelse(is.na(dbscan), "noise", as.character(dbscan))) %>% 
  ggplot(aes(x = dbscan_label)) +
  geom_bar() +
  theme_minimal() +
  labs(
    title = "DBSCAN – cluster and outlier counts",
    x = "Label (cluster/noise)",
    y = "Number of hurricanes"
  )

14 Conclusions

Overall, with the selected aggregated features, a two-group structure appears repeatedly across several clustering approaches.

However, the resulting clustering strongly depends on the choice of variables and the aggregation of each hurricane into a single observation. Incorporating temporal evolution of storm intensity or additional environmental factors could reveal a more complex structure.

Therefore, the results should be viewed as exploratory and descriptive rather than as a definitive classification of Atlantic hurricanes.