The aim of this analysis was to explore whether Atlantic hurricanes can be meaningfully grouped based on selected characteristics describing their intensity, duration, and basic spatial properties, such as the location of origin and first landfall.
Rather than attempting to classify hurricanes according to predefined categories, the analysis follows an exploratory approach. The objective was to examine whether recurring patterns emerge directly from historical data and whether these patterns remain stable across different clustering methods.
This analysis was prepared as part of the course Unsupervised Machine Learning at the University of Warsaw.
library(tidyverse)
library(lubridate)
library(cluster)
library(dbscan)
library(sf)
library(rnaturalearth)
library(rnaturalearthdata)
library(maps)
library(factoextra)
library(gridExtra)
This analysis is based on the HURDAT2 Atlantic Summary and Track Datasets, a publicly available dataset hosted on the Kaggle platform.
The data originate from the official HURDAT2 (Hurricane Database 2) maintained by the National Hurricane Center (NHC) and include detailed information on Atlantic tropical cyclones, such as: - storm tracks and trajectories, - maximum sustained wind speeds, - minimum central pressure, - landfall locations, - basic life-cycle characteristics of each storm.
The dataset was published on Kaggle by Wesley Neill
and is available at:
https://www.kaggle.com/datasets/wesleyneill/hurdat2-atlantic-summary-and-track-datasets
For the purpose of this project, selected variables describing storm intensity, spatial characteristics (initial and final positions), and the number of recorded observations during the storm lifetime were used. Prior to clustering, the data were cleaned and standardized to ensure comparability of features.
It should be noted that the selected variables provide a simplified representation of each hurricane. The analysis is based on aggregated characteristics at the storm level and does not account for temporal changes in intensity or trajectory during the hurricane’s lifetime.
As a result, the clustering reflects overall storm properties rather than detailed dynamic behavior, and the findings should be interpreted within this limitation.
Atlantic_Hurricane_Tracks.csv – trajectories (time
records),Atlantic_Hurricane_Summary_Data.csv – summary data at
the storm level.data_dir <- params$data_dir
hur <- read_csv(file.path(data_dir, "Atlantic_Hurricane_Tracks.csv"))
hur_sum <- read_csv(file.path(data_dir, "Atlantic_Hurricane_Summary_Data.csv"))
glimpse(hur)
## Rows: 51,840
## Columns: 23
## $ storm_id <chr> "AL011851", "AL011851", "AL011851", "AL011851", "AL011851…
## $ name <chr> "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "U…
## $ records <dbl> 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 1…
## $ date <date> 1851-06-25, 1851-06-25, 1851-06-25, 1851-06-25, 1851-06-…
## $ time <chr> "0000", "0600", "1200", "1800", "2100", "0000", "0600", "…
## $ record_id <chr> NA, NA, NA, NA, "L", NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ status <chr> "HU", "HU", "HU", "HU", "HU", "HU", "TS", "TS", "TS", "TS…
## $ latitude <dbl> 28.0, 28.0, 28.0, 28.1, 28.2, 28.2, 28.3, 28.4, 28.6, 29.…
## $ longitude <dbl> -94.8, -95.4, -96.0, -96.5, -96.8, -97.0, -97.6, -98.3, -…
## $ max_wind <dbl> 80, 80, 80, 80, 80, 70, 60, 60, 50, 50, 40, 40, 40, 40, 8…
## $ min_pressure <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ ne34ktr <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ se34ktr <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ sw34ktr <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ nw34ktr <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ ne50ktr <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ se50ktr <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ sw50ktr <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ nw50ktr <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ ne64ktr <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ se64ktr <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ sw64ktr <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
## $ nw64ktr <dbl> -999, -999, -999, -999, -999, -999, -999, -999, -999, -99…
glimpse(hur_sum)
## Rows: 1,893
## Columns: 17
## $ storm_id <chr> "AL011851", "AL021851", "AL031851", "AL041851", "AL0…
## $ name <chr> "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED", "UNNAMED…
## $ records <dbl> 14, 1, 1, 49, 16, 17, 45, 8, 20, 36, 25, 1, 1, 48, 1…
## $ landfall <dbl> 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ hurricane <dbl> 1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 0…
## $ hurr_landfall <dbl> 1, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0…
## $ lifetime_max_wind <dbl> 80, 80, 50, 100, 50, 60, 100, 70, 70, 80, 90, 50, 40…
## $ lifetime_min_pres <dbl> NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, NaN, Na…
## $ init_latitude <dbl> 28.0, 22.2, 12.0, 13.4, 32.5, 28.7, 20.5, 17.0, 26.4…
## $ init_longitude <dbl> -94.8, -97.6, -60.0, -48.0, -73.5, -78.0, -67.1, -64…
## $ init_date <date> 1851-06-25, 1851-07-05, 1851-07-10, 1851-08-16, 185…
## $ final_latitude <dbl> 31.0, 22.2, 12.0, 48.5, 32.5, 42.0, 41.0, 19.7, 34.0…
## $ final_longitude <dbl> -100.2, -97.6, -60.0, -54.2, -73.5, -71.6, -68.0, -7…
## $ final_date <date> 1851-06-28, 1851-07-05, 1851-07-10, 1851-08-27, 185…
## $ flf_latitude <dbl> 28.2, NaN, NaN, 30.1, NaN, 41.1, 30.2, NaN, 28.0, Na…
## $ flf_longitude <dbl> -96.8, NaN, NaN, -85.7, NaN, -71.7, -88.6, NaN, -82.…
## $ flf_date <chr> "1851-06-25 00:00:00", "NaN", "NaN", "1851-08-23 00:…
To make the time-related fields usable in the analysis, the date
columns were converted to appropriate formats. Initial and final storm
dates were stored as Date, while the first landfall
timestamp was parsed as a datetime value. Based on these fields,
additional features were derived (storm lifetime in days, season year,
and the month of first landfall), which are later used for summary
comparisons.
hur_sum <- hur_sum %>%
mutate(
init_date = as.Date(init_date),
final_date = as.Date(final_date),
flf_date = ymd_hms(flf_date, quiet = TRUE),
lifetime_days = as.numeric(difftime(final_date, init_date, units = "days")),
season = year(init_date),
flf_month = month(flf_date)
)
The analysis focuses on hurricanes for which the first landfall was recorded. Storms without available landfall latitude or longitude were excluded so that all retained observations contain valid spatial information related to landfall. As a result, the dataset represents storms that directly affected land areas, which is important for interpreting landfall-related patterns.
hur_usa <- hur_sum %>%
filter(!is.na(flf_latitude),
!is.na(flf_longitude))
From the filtered dataset, a set of hurricane-level features was constructed. The selected variables describe storm duration, intensity, spatial characteristics at different stages of the storm lifecycle, and seasonal timing. Each observation represents a single hurricane, summarized by its key attributes, which form the basis for subsequent clustering analyses.
hur_feat <- hur_usa %>%
select(
storm_id,
season,
records,
lifetime_days,
lifetime_max_wind,
init_latitude, init_longitude,
final_latitude, final_longitude,
flf_latitude, flf_longitude,
flf_month
)
summary(hur_feat)
## storm_id season records lifetime_days
## Length:648 Min. :1851 Min. : 1.00 Min. : 0.00
## Class :character 1st Qu.:1903 1st Qu.: 17.00 1st Qu.: 4.00
## Mode :character Median :1951 Median : 28.00 Median : 6.00
## Mean :1948 Mean : 31.37 Mean : 7.13
## 3rd Qu.:1999 3rd Qu.: 41.00 3rd Qu.:10.00
## Max. :2019 Max. :133.00 Max. :32.00
## lifetime_max_wind init_latitude init_longitude final_latitude
## Min. : 25.0 Min. : 7.50 Min. :-97.50 Min. : 9.10
## 1st Qu.: 50.0 1st Qu.:14.60 1st Qu.:-87.30 1st Qu.:30.50
## Median : 70.0 Median :19.80 Median :-79.35 Median :35.40
## Mean : 77.6 Mean :19.99 Mean :-73.76 Mean :36.97
## 3rd Qu.:100.0 3rd Qu.:25.00 3rd Qu.:-61.67 3rd Qu.:44.50
## Max. :160.0 Max. :44.00 Max. :-16.90 Max. :81.00
## final_longitude flf_latitude flf_longitude flf_month
## Min. :-109.50 Min. :10.10 Min. :-98.00 Min. : 1.000
## 1st Qu.: -91.17 1st Qu.:23.10 1st Qu.:-90.20 1st Qu.: 8.000
## Median : -79.30 Median :29.00 Median :-83.10 Median : 9.000
## Mean : -74.06 Mean :27.34 Mean :-82.46 Mean : 8.475
## 3rd Qu.: -62.90 3rd Qu.:30.30 3rd Qu.:-78.08 3rd Qu.: 9.000
## Max. : 63.00 Max. :51.90 Max. : -7.10 Max. :12.000
Before applying distance-based clustering methods, storms with missing values in the features used for clustering were removed to ensure that all algorithms operate on the same set of observations.
Only the variables used directly in clustering were standardized using z-scores. This prevents variables measured on different scales (e.g., coordinates vs. counts) from dominating the distance calculations.
hur_feat_complete <- hur_feat %>%
drop_na(
records,
lifetime_max_wind,
init_latitude, init_longitude,
flf_latitude, flf_longitude
)
feat_scaled_tbl <- hur_feat_complete %>%
select(
records,
lifetime_max_wind,
init_latitude, init_longitude,
flf_latitude, flf_longitude
) %>%
scale() %>%
as_tibble()
summary(feat_scaled_tbl)
## records lifetime_max_wind init_latitude init_longitude
## Min. :-1.6778 Min. :-1.7334 Min. :-2.01993 Min. :-1.2737
## 1st Qu.:-0.7939 1st Qu.:-0.9096 1st Qu.:-0.87126 1st Qu.:-0.7265
## Median :-0.1862 Median :-0.2505 Median :-0.02999 Median :-0.3000
## Mean : 0.0000 Mean : 0.0000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.5320 3rd Qu.: 0.7382 3rd Qu.: 0.81129 3rd Qu.: 0.6482
## Max. : 5.6146 Max. : 2.7155 Max. : 3.88520 Max. : 3.0503
## flf_latitude flf_longitude
## Min. :-2.5192 Min. :-1.31949
## 1st Qu.:-0.6199 1st Qu.:-0.65735
## Median : 0.2420 Median :-0.05464
## Mean : 0.0000 Mean : 0.00000
## 3rd Qu.: 0.4320 3rd Qu.: 0.37193
## Max. : 3.5876 Max. : 6.39694
Selecting the number of clusters is one of the main practical
decisions in k-means. Here, the choice of k is guided using
the average silhouette width, computed for a range of candidate
values.
Silhouette is a useful heuristic: it rewards compact clusters that
are also well separated from each other. In practice, differences
between nearby values of k can be small, so the silhouette
result is treated as guidance rather than an absolute rule.
For each candidate value of k, k-means was fitted to the
standardized feature set and evaluated using the silhouette coefficient.
The mean silhouette width across storms is reported below; the value of
k with the highest mean silhouette is used as the default
choice for the rest of the analysis.
set.seed(123)
n_unique <- nrow(distinct(feat_scaled_tbl))
k_min <- 2
k_max <- min(8, n_unique - 1)
k_range <- k_min:k_max
sil_means <- purrr::map_dbl(k_range, function(k) {
km <- kmeans(feat_scaled_tbl, centers = k, nstart = 25)
sil <- silhouette(km$cluster, dist(feat_scaled_tbl))
mean(sil[, "sil_width"])
})
sil_df <- tibble(k = k_range, sil_mean = sil_means)
ggplot(sil_df, aes(x = k, y = sil_mean)) +
geom_line() +
geom_point() +
theme_minimal() +
labs(
title = "Average silhouette for different k (k-means)",
x = "Number of clusters (k)",
y = "Average silhouette width"
)
k_best <- sil_df$k[which.max(sil_df$sil_mean)]
k_best
## [1] 2
Different clustering algorithms may suggest different optimal numbers of clusters due to their underlying assumptions and optimization criteria. To assess the robustness of the cluster structure, the optimal number of clusters was evaluated using the silhouette method for three algorithms: k-means, Partitioning Around Medoids (PAM), and CLARA.
For each method, the average silhouette width was computed across a range of candidate values of k. Comparing the resulting silhouette profiles allows for assessing whether different algorithms converge toward a similar number of clusters, thereby increasing confidence in the selected solution.
set.seed(123)
a <- fviz_nbclust(
feat_scaled_tbl,
FUNcluster = kmeans,
method = "silhouette"
) + ggtitle("k-means")
b <- fviz_nbclust(
feat_scaled_tbl,
FUNcluster = pam,
method = "silhouette"
) + ggtitle("PAM")
clara_fun <- function(x, k) {
clara(as.matrix(x), k = k, metric = "euclidean", stand = FALSE)
}
c <- fviz_nbclust(
feat_scaled_tbl,
FUNcluster = clara_fun,
method = "silhouette"
) + ggtitle("CLARA")
grid.arrange(a, b, c, ncol = 2,
top = "Optimal number of clusters (silhouette)")
After choosing k, k-means was fitted to the standardized
feature set. Multiple random starts were used to reduce the risk of a
suboptimal solution due to initialization.
To get a quick sense of how well-separated the clusters are in the chosen feature space, the silhouette coefficient is reported. The average silhouette gives a single summary number, while the silhouette plot shows how consistently individual storms fit their assigned cluster.
set.seed(123)
km_best <- kmeans(feat_scaled_tbl, centers = k_best, nstart = 25)
sil_best <- silhouette(km_best$cluster, dist(feat_scaled_tbl))
mean(sil_best[, "sil_width"])
## [1] 0.329483
plot(
sil_best,
main = paste0("Silhouette plot – k-means, k = ", k_best),
col = 2:(k_best + 1),
border = NA
)
To interpret the k-means solution, cluster-level summaries were computed by aggregating hurricane characteristics within each cluster. For every cluster, we report the number of storms and the mean values of selected variables describing storm intensity, duration, and spatial properties (initial position, final position, and first landfall location), as well as the month of first landfall.
These profiles provide an intuitive way to characterize how clusters differ from one another (e.g., typical storm intensity, lifespan, and geographical behavior). The accompanying bar chart shows the distribution of observations across clusters, which is useful for identifying potential class imbalance (very small or dominant clusters) that may affect interpretability.
Since these are cluster averages, they can hide within-cluster variability; the profiles should be read as rough summaries rather than precise “typical storms”.
hur_clusters <- hur_feat_complete %>%
mutate(cluster_kmeans = factor(km_best$cluster))
cluster_profiles <- hur_clusters %>%
group_by(cluster_kmeans) %>%
summarise(
n = n(),
across(
c(records, lifetime_days, lifetime_max_wind,
init_latitude, init_longitude,
final_latitude, final_longitude,
flf_latitude, flf_longitude, flf_month),
~ mean(.x, na.rm = TRUE),
.names = "mean_{.col}"
),
.groups = "drop"
)
cluster_profiles
ggplot(hur_clusters, aes(x = cluster_kmeans)) +
geom_bar() +
theme_minimal() +
labs(
title = "Number of hurricanes in clusters (k-means)",
x = "Cluster (k-means)",
y = "Number of hurricanes"
)
To further assess the stability and interpretability of the clustering structure, three partitioning methods—k-means, PAM, and CLARA—were compared for different numbers of clusters (k = 2, 3, and 4). Visual cluster representations were used to examine how consistently each method partitions the feature space and how cluster separation changes as the number of clusters increases.
Comparing methods across multiple values of k helps identify solutions that are robust to the choice of algorithm. Consistent patterns across methods suggest a strong underlying cluster structure, while large discrepancies may indicate sensitivity to initialization, distance calculations, or the presence of outliers.
set.seed(123)
km2 <- kmeans(feat_scaled_tbl, centers = 2, nstart = 25)
km3 <- kmeans(feat_scaled_tbl, centers = 3, nstart = 25)
km4 <- kmeans(feat_scaled_tbl, centers = 4, nstart = 25)
pam2 <- pam(feat_scaled_tbl, k = 2)
pam3 <- pam(feat_scaled_tbl, k = 3)
pam4 <- pam(feat_scaled_tbl, k = 4)
clara2 <- clara(as.matrix(feat_scaled_tbl), k = 2, metric = "euclidean", stand = FALSE, samples = 50)
clara3 <- clara(as.matrix(feat_scaled_tbl), k = 3, metric = "euclidean", stand = FALSE, samples = 50)
clara4 <- clara(as.matrix(feat_scaled_tbl), k = 4, metric = "euclidean", stand = FALSE, samples = 50)
p_k2_1 <- fviz_cluster(km2, data = feat_scaled_tbl, geom = "point") + ggtitle("K-means (k = 2)")
p_k2_2 <- fviz_cluster(pam2, geom = "point") + ggtitle("PAM (k = 2)")
p_k2_3 <- fviz_cluster(clara2, geom = "point") + ggtitle("CLARA (k = 2)")
grid.arrange(p_k2_1, p_k2_2, p_k2_3, ncol = 2, top = "Clustering – k = 2")
p_k3_1 <- fviz_cluster(km3, data = feat_scaled_tbl, geom = "point") + ggtitle("K-means (k = 3)")
p_k3_2 <- fviz_cluster(pam3, geom = "point") + ggtitle("PAM (k = 3)")
p_k3_3 <- fviz_cluster(clara3, geom = "point") + ggtitle("CLARA (k = 3)")
grid.arrange(p_k3_1, p_k3_2, p_k3_3, ncol = 2, top = "Clustering – k = 3")
p_k4_1 <- fviz_cluster(km4, data = feat_scaled_tbl, geom = "point") + ggtitle("K-means (k = 4)")
p_k4_2 <- fviz_cluster(pam4, geom = "point") + ggtitle("PAM (k = 4)")
p_k4_3 <- fviz_cluster(clara4, geom = "point") + ggtitle("CLARA (k = 4)")
grid.arrange(p_k4_1, p_k4_2, p_k4_3, ncol = 2, top = "Clustering – k = 4")
Ward’s hierarchical clustering provides another view of the structure in the standardized feature space. At a high level, the dendrogram suggests a separation that is broadly consistent with a two-cluster solution.
Cutting the dendrogram at k = 2 yields clusters that
align reasonably well with the k-means split, which suggests that the
two-group structure is not purely an artifact of k-means initialization.
At the same time, the dendrogram also shows gradual substructure below
that level, which may reflect continuous variation rather than clearly
separated hurricane “types”.
d <- dist(feat_scaled_tbl)
hc <- hclust(d, method = "ward.D2")
plot(hc, labels = FALSE,
main = "Hierarchical clustering of hurricanes (Ward.D2)",
xlab = "", sub = "")
rect.hclust(hc, k = k_best, border = "red")
clusters_hc <- cutree(hc, k = k_best)
hur_clusters <- hur_clusters %>% mutate(cluster_hc = factor(clusters_hc))
The stacked bar chart shows the relative share of k-means clusters across successive decades. Both clusters appear throughout the observation period, but their proportions vary over time.
This pattern may reflect changes in the composition of storms as captured by the selected features. However, it may also be influenced by long-term differences in data coverage and measurement practices across historical periods, so the decade trends should be interpreted cautiously.
The plot focuses on relative shares rather than absolute counts, which helps compare decades with different overall levels of hurricane activity.
hur_clusters %>%
mutate(decade = floor(season / 10) * 10) %>%
count(decade, cluster_kmeans) %>%
group_by(decade) %>%
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = decade, y = prop, fill = cluster_kmeans)) +
geom_col(position = "fill") +
theme_minimal() +
labs(
title = "Hurricane cluster share in subsequent decades (k-means)",
x = "Decade",
y = "Share in a given decade",
fill = "Cluster (k-means)"
)
To check whether the two-cluster structure is sensitive to the choice
of algorithm, CLARA was applied using the same k as in the
k-means solution. Since CLARA is based on medoids and operates on
subsamples, it can be less sensitive to extreme observations than
k-means.
The resulting assignments were compared with k-means using a contingency table to see whether storms tend to stay in the same group across methods.
k_clara <- k_best
set.seed(123)
clara_fit <- clara(
x = as.matrix(feat_scaled_tbl),
k = k_clara,
metric = "euclidean",
stand = FALSE,
samples = 50
)
hur_clara <- hur_feat_complete %>%
mutate(cluster_clara = factor(clara_fit$clustering))
hur_compare <- hur_clusters %>%
select(storm_id, cluster_kmeans) %>%
inner_join(hur_clara %>% select(storm_id, cluster_clara), by = "storm_id")
table(hur_compare$cluster_kmeans, hur_compare$cluster_clara)
##
## 1 2
## 1 0 258
## 2 340 50
The k-nearest neighbor distance plot suggests a natural choice of the neighborhood radius (eps) around 1.5, corresponding to a visible change in slope of the curve. Using this value together with minPts = 4, the DBSCAN algorithm identifies a single dominant dense cluster and a small number of observations classified as noise.
In total, only 14 hurricanes are labeled as outliers, indicating that the majority of storms form a relatively homogeneous density structure in the standardized feature space. This result suggests that, at the chosen scale, the data do not naturally separate into multiple dense regions.
Compared to k-means and hierarchical clustering, DBSCAN provides a complementary perspective by highlighting atypical storms rather than enforcing a fixed number of clusters.
kNNdistplot(as.matrix(feat_scaled_tbl), k = 4)
abline(h = 1.5, col = "red")
eps_val <- 1.5
min_pts <- 4
db_fit <- dbscan(
as.matrix(feat_scaled_tbl),
eps = eps_val,
minPts = min_pts
)
hur_db <- hur_feat_complete %>%
mutate(cluster_dbscan = db_fit$cluster) %>%
mutate(
cluster_dbscan = ifelse(cluster_dbscan == 0, NA, cluster_dbscan),
cluster_dbscan = factor(cluster_dbscan)
)
sum(is.na(hur_db$cluster_dbscan))
## [1] 14
To better interpret the clusters, the first landfall locations were plotted on a map and colored by the k-means cluster assignment. This helps check whether the groups differ not only in the feature space but also geographically.
The maps are mainly descriptive, but they provide an intuitive geographic interpretation of the clustering results and suggest differences in preferred landfall regions.
rnaturalearth + sf (Atlantic Ocean / USA)world <- ne_countries(scale = "medium", returnclass = "sf")
world_crop <- st_crop(world, xmin = -100, xmax = -10, ymin = 0, ymax = 50)
ggplot() +
geom_sf(data = world_crop, fill = "gray90", color = "gray60") +
geom_point(
data = hur_clusters,
aes(x = flf_longitude, y = flf_latitude, color = cluster_kmeans),
size = 2, alpha = 0.8
) +
coord_sf(xlim = c(-100, -10), ylim = c(0, 50), expand = FALSE) +
theme_minimal() +
labs(
title = "Hurricanes – clustering (k-means) against a coastal map",
x = "Longitude (°)",
y = "Latitude (°)",
color = "Cluster (k-means)"
)
maps)usa <- map_data("world")
ggplot() +
geom_polygon(
data = usa,
aes(x = long, y = lat, group = group),
fill = "gray90", color = "gray60"
) +
geom_point(
data = hur_clusters,
aes(x = flf_longitude, y = flf_latitude, color = cluster_kmeans),
size = 2, alpha = 0.8
) +
coord_fixed(xlim = c(-100, -10), ylim = c(0, 50)) +
theme_minimal() +
labs(
title = "Hurricanes – k-means on a map background (maps)",
x = "Longitude (°)",
y = "Latitude (°)",
color = "Cluster (k-means)"
)
Comparing cluster size distributions for k-means, Ward, and CLARA shows that all three methods produce two groups of comparable size (i.e., no extreme imbalance where one cluster contains only a handful of storms). This makes the two-cluster solution easier to interpret and less likely to be driven by a very small subset of observations.
In contrast, DBSCAN identifies one large dense cluster and a small group of noise points. While this structure is useful for detecting atypical storms, it is less suitable for describing multiple hurricane types.
Overall, the consistency of cluster sizes across k-means, Ward, and CLARA suggests a stable and interpretable partition of the data.
clusters_all <- hur_feat_complete %>%
mutate(
kmeans = factor(km_best$cluster),
ward = factor(clusters_hc),
clara = factor(clara_fit$clustering),
dbscan_raw = db_fit$cluster
) %>%
mutate(
dbscan = ifelse(dbscan_raw == 0, NA, dbscan_raw),
dbscan = factor(dbscan)
)
mean_silhouette <- function(labels, dist_mat) {
labs_clean <- na.omit(labels)
if (length(unique(labs_clean)) < 2) return(NA_real_)
sil <- silhouette(
as.integer(labs_clean),
as.dist(as.matrix(dist_mat)[!is.na(labels), !is.na(labels)])
)
mean(sil[, "sil_width"])
}
d <- dist(feat_scaled_tbl)
kmeans_summary <- clusters_all %>%
count(kmeans, name = "n_points") %>%
summarise(
method = "kmeans",
n_clusters = n(),
avg_cluster_size = mean(n_points),
min_cluster_size = min(n_points),
max_cluster_size = max(n_points)
) %>%
mutate(mean_sil = mean_silhouette(clusters_all$kmeans, d))
ward_summary <- clusters_all %>%
count(ward, name = "n_points") %>%
summarise(
method = "ward",
n_clusters = n(),
avg_cluster_size = mean(n_points),
min_cluster_size = min(n_points),
max_cluster_size = max(n_points)
) %>%
mutate(mean_sil = mean_silhouette(clusters_all$ward, d))
clara_summary <- clusters_all %>%
count(clara, name = "n_points") %>%
summarise(
method = "clara",
n_clusters = n(),
avg_cluster_size = mean(n_points),
min_cluster_size = min(n_points),
max_cluster_size = max(n_points)
) %>%
mutate(mean_sil = mean_silhouette(clusters_all$clara, d))
dbscan_summary <- clusters_all %>%
filter(!is.na(dbscan)) %>%
count(dbscan, name = "n_points") %>%
summarise(
method = "dbscan",
n_clusters = n(),
avg_cluster_size = mean(n_points),
min_cluster_size = min(n_points),
max_cluster_size = max(n_points)
) %>%
mutate(
mean_sil = mean_silhouette(clusters_all$dbscan, d),
n_noise = sum(is.na(clusters_all$dbscan))
)
methods_summary <- bind_rows(kmeans_summary, ward_summary, clara_summary, dbscan_summary)
methods_summary
table(clusters_all$kmeans, clusters_all$ward)
##
## 1 2
## 1 28 230
## 2 367 23
table(clusters_all$kmeans, clusters_all$clara)
##
## 1 2
## 1 0 258
## 2 340 50
clusters_all %>%
pivot_longer(cols = c(kmeans, ward, clara),
names_to = "method",
values_to = "cluster") %>%
ggplot(aes(x = cluster)) +
geom_bar() +
facet_wrap(~ method, scales = "free_y") +
theme_minimal() +
labs(
title = "Distribution of cluster sizes for different methods",
x = "Cluster",
y = "Number of hurricanes"
)
clusters_all %>%
mutate(dbscan_label = ifelse(is.na(dbscan), "noise", as.character(dbscan))) %>%
ggplot(aes(x = dbscan_label)) +
geom_bar() +
theme_minimal() +
labs(
title = "DBSCAN – cluster and outlier counts",
x = "Label (cluster/noise)",
y = "Number of hurricanes"
)
Overall, with the selected aggregated features, a two-group structure appears repeatedly across several clustering approaches.
However, the resulting clustering strongly depends on the choice of variables and the aggregation of each hurricane into a single observation. Incorporating temporal evolution of storm intensity or additional environmental factors could reveal a more complex structure.
Therefore, the results should be viewed as exploratory and descriptive rather than as a definitive classification of Atlantic hurricanes.