1 Introduction

This report analyzes the Forest Cover Type dataset using two density-based clustering methods: DBSCAN and HDBSCAN.

The report follows a clear, step-by-step flow:

Load the data and look at its structure.
Explore the data with summary statistics and plots.
Prepare the data (sampling 5000, scaling, PCA).
Run DBSCAN and HDBSCAN.
Compare the two methods.
Check whether small clusters are just outliers.
Profile each cluster to understand what it represents.

2 Load Data

# Change this path to match the location of your file
df <- read.csv("D:\\OneDrive - Universitas Islam Indonesia\\College\\Spatial Data Science\\covtype.csv")

cat("Number of rows:", nrow(df), "\n")

## Number of rows: 581012

cat("Number of columns:", ncol(df), "\n")

## Number of columns: 55

I use the 10 continuous numeric variables. The binary variables (Wilderness_Area, Soil_Type) and the label (Cover_Type) are kept aside, because they are categorical

numeric_cols <- c(
  "Elevation", "Aspect", "Slope",
  "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
  "Horizontal_Distance_To_Roadways",
  "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm",
  "Horizontal_Distance_To_Fire_Points"
)

kable(data.frame(No = 1:length(numeric_cols), Variable = numeric_cols),
      caption = "The 10 numeric variables used in this analysis")

The 10 numeric variables used in this analysis
No	Variable
1	Elevation
2	Aspect
3	Slope
4	Horizontal_Distance_To_Hydrology
5	Vertical_Distance_To_Hydrology
6	Horizontal_Distance_To_Roadways
7	Hillshade_9am
8	Hillshade_Noon
9	Hillshade_3pm
10	Horizontal_Distance_To_Fire_Points

2.1 Sampling

The full dataset has over 580,000 rows, which is too large for DBSCAN and HDBSCAN to handle directly. I take a random sample of 5,000 rows and use this same sample for both the EDA and the clustering.

set.seed(42)
sample_size <- 5000
sample_idx <- sample(seq_len(nrow(df)), sample_size)

df_sample <- df[sample_idx, c(numeric_cols, "Cover_Type")]
X_sample  <- df_sample[, numeric_cols]

cat("Sample size used:", nrow(X_sample), "rows\n")

## Sample size used: 5000 rows

3 Exploratory Data Analysis (EDA)

3.1 Summary Statistics

summary_stats <- data.frame(
  Variable = numeric_cols,
  Min    = sapply(X_sample, min),
  Mean   = round(sapply(X_sample, mean), 1),
  Median = sapply(X_sample, median),
  Max    = sapply(X_sample, max),
  SD     = round(sapply(X_sample, sd), 1)
)
rownames(summary_stats) <- NULL
kable(summary_stats, caption = "Summary statistics of the numeric variables")

Summary statistics of the numeric variables
Variable	Min	Mean	Median	Max	SD
Elevation	1898	2960.5	2995	3857	280.2
Aspect	0	156.5	131	359	111.5
Slope	0	14.0	13	49	7.4
Horizontal_Distance_To_Hydrology	0	266.8	216	1307	212.2
Vertical_Distance_To_Hydrology	-152	45.8	29	397	58.1
Horizontal_Distance_To_Roadways	0	2330.0	1959	6975	1564.1
Hillshade_9am	75	212.1	218	254	26.4
Hillshade_Noon	114	223.6	227	254	19.9
Hillshade_3pm	0	142.9	143	250	38.1
Horizontal_Distance_To_Fire_Points	30	1984.0	1719	7117	1335.9

3.2 Distribution of Each Variable

Histograms show the shape of each variable. Some variables are skewed to the right, meaning most values are small but a few are very large.

hist_data <- melt(X_sample)

ggplot(hist_data, aes(x = value)) +
  geom_histogram(bins = 30, fill = "#2c7fb8", color = "white") +
  facet_wrap(~variable, scales = "free", ncol = 3) +
  labs(title = "Distribution of Each Numeric Variable", x = "Value", y = "Count") +
  theme_report

3.3 Spread and Outliers

To compare the spread of all variables on one scale, we standardize them first (mean 0, standard deviation 1). Points far above or below the boxes are potential outliers.

scaled_data <- melt(as.data.frame(scale(X_sample)))

ggplot(scaled_data, aes(x = variable, y = value)) +
  geom_boxplot(fill = "#7fcdbb", outlier.color = "red", outlier.alpha = 0.3) +
  coord_flip() +
  labs(title = "Standardized Values and Outliers", x = "", y = "Standardized value") +
  theme_report

3.4 Correlation Between Variables

The heatmap below shows how strongly variables move together. Values near +1 (red) mean strong positive correlation; values near -1 (blue) mean strong negative correlation.

cor_matrix <- cor(X_sample)
cor_data <- melt(cor_matrix)

ggplot(cor_data, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(value, 2)), size = 2.8) +
  scale_fill_gradient2(low = "#2166ac", mid = "white", high = "#b2182b",
                       midpoint = 0, limits = c(-1, 1), name = "Correlation") +
  labs(title = "Correlation Between Numeric Variables", x = "", y = "") +
  theme_report +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

3.5 Distribution of the Cover Type Label

Although we do not use Cover_Type for clustering, looking at it gives useful context.

cover_labels <- c("1" = "Spruce/Fir", "2" = "Lodgepole Pine",
                  "3" = "Ponderosa Pine", "4" = "Cottonwood/Willow",
                  "5" = "Aspen", "6" = "Douglas-fir", "7" = "Krummholz")

cover_df <- as.data.frame(table(df_sample$Cover_Type))
colnames(cover_df) <- c("Code", "Count")
cover_df$Type <- cover_labels[as.character(cover_df$Code)]

ggplot(cover_df, aes(x = reorder(Type, -Count), y = Count, fill = Type)) +
  geom_col() +
  geom_text(aes(label = Count), vjust = -0.3, size = 3) +
  labs(title = "Number of Points per Forest Cover Type (in the sample)",
       x = "", y = "Count") +
  theme_report +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),
        legend.position = "none")

4 Data Preparation

4.1 Scaling and PCA

Because the variables use different units (meters, degrees, brightness), we standardize them so each has equal weight. Then we apply Principal Component Analysis (PCA) to combine the 10 variables into a smaller set of components, keeping enough to explain at least 80% of the total variance.

X_scaled <- scale(X_sample)
pca_result <- prcomp(X_scaled, center = FALSE, scale. = FALSE)

variance_table <- as.data.frame(round(summary(pca_result)$importance, 4))
kable(variance_table, caption = "Variance explained by each principal component")

Variance explained by each principal component
	PC1	PC2	PC3	PC4	PC5	PC6	PC7	PC8	PC9	PC10
Standard deviation	1.6121	1.4678	1.3125	1.0304	0.8869	0.7437	0.6818	0.5919	0.5527	0.0442
Proportion of Variance	0.2599	0.2154	0.1723	0.1062	0.0787	0.0553	0.0465	0.0350	0.0305	0.0002
Cumulative Proportion	0.2599	0.4753	0.6476	0.7538	0.8324	0.8878	0.9342	0.9693	0.9998	1.0000

cum_var <- summary(pca_result)$importance[3, ]
n_pc_used <- which(cum_var >= 0.80)[1]

cat("Number of components kept (>= 80% of variance):", n_pc_used, "\n")

## Number of components kept (>= 80% of variance): 5

X_pca <- pca_result$x[, 1:n_pc_used]

4.2 Scree Plot

The scree plot shows how much variance each component explains (bars) and the running total (red line). We keep components up to the point where the total reaches 80%.

scree_df <- data.frame(
  PC = factor(paste0("PC", seq_along(cum_var)), levels = paste0("PC", seq_along(cum_var))),
  Proportion = summary(pca_result)$importance[2, ],
  Cumulative = cum_var
)

ggplot(scree_df, aes(x = PC)) +
  geom_col(aes(y = Proportion), fill = "#2c7fb8") +
  geom_line(aes(y = Cumulative, group = 1), color = "#b2182b", linewidth = 1) +
  geom_point(aes(y = Cumulative), color = "#b2182b", size = 2) +
  geom_hline(yintercept = 0.80, linetype = "dashed", color = "gray40") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Scree Plot: Variance Explained by Each Component",
       x = "Principal Component", y = "Variance explained") +
  theme_report

4.3 Variable Contributions

This plot shows how each original variable loads onto the first three components. Variables with bars in the same direction contribute similarly to that component.

load_df <- as.data.frame(pca_result$rotation[, 1:3])
load_df$Variable <- rownames(load_df)
load_data <- melt(load_df, id.vars = "Variable")

ggplot(load_data, aes(x = Variable, y = value, fill = value > 0)) +
  geom_col() +
  facet_wrap(~variable, ncol = 3) +
  coord_flip() +
  scale_fill_manual(values = c("FALSE" = "#2166ac", "TRUE" = "#b2182b"),
                    labels = c("Negative", "Positive"), name = "Loading") +
  labs(title = "Variable Loadings on PC1, PC2, PC3", x = "", y = "Loading") +
  theme_report

5 DBSCAN Clustering

DBSCAN uses two settings: eps (how close points must be to count as neighbors) and minPts (how many neighbors a point needs to be part of a dense region). After testing several combinations, the clearest and most stable result came from:

eps = 1.00
minPts = 15

db_eps    <- 1.00
db_minpts <- 15

5.1 Choosing eps: k-Distance Plot

A common way to choose eps is the k-distance plot. We sort the distance from each point to its k-th nearest neighbor (here k = minPts).

kd <- kNNdist(X_pca, k = db_minpts)
if (is.matrix(kd)) kd <- kd[, ncol(kd)]
kd_sorted <- sort(kd)

kd_df <- data.frame(Index = seq_along(kd_sorted), Distance = kd_sorted)

ggplot(kd_df, aes(x = Index, y = Distance)) +
  geom_line(color = "#2c7fb8", linewidth = 0.8) +
  geom_hline(yintercept = db_eps, linetype = "dashed", color = "#b2182b") +
  annotate("text", x = 0, y = db_eps, label = paste0("eps = ", db_eps),
           hjust = 0, vjust = -0.5, color = "#b2182b") +
  labs(title = paste0("k-Distance Plot (k = ", db_minpts, ")"),
       x = "Points sorted by distance", y = "Distance to k-th neighbor") +
  theme_report

5.2 Run and Summary

db_result <- dbscan(X_pca, eps = db_eps, minPts = db_minpts)

db_table <- as.data.frame(table(db_result$cluster))
colnames(db_table) <- c("Cluster", "Points")
db_table$Cluster <- ifelse(db_table$Cluster == "0", "Noise",
                            paste0("Cluster ", db_table$Cluster))
db_table$Percent <- round(db_table$Points / nrow(X_pca) * 100, 2)

kable(db_table, caption = paste0("DBSCAN result (eps = ", db_eps,
                                 ", minPts = ", db_minpts, ")"))

DBSCAN result (eps = 1, minPts = 15)
Cluster	Points	Percent
Noise	714	14.28
Cluster 1	4251	85.02
Cluster 2	35	0.70

5.3 2D Visualization

db_plot_data <- as.data.frame(X_pca[, 1:2])
colnames(db_plot_data) <- c("PC1", "PC2")
db_plot_data$cluster <- factor(db_result$cluster)
levels(db_plot_data$cluster) <- ifelse(levels(db_plot_data$cluster) == "0", "Noise",
                                        paste0("Cluster ", levels(db_plot_data$cluster)))

ggplot(db_plot_data, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(alpha = 0.6, size = 1.5) +
  labs(title = paste0("DBSCAN (eps = ", db_eps, ", minPts = ", db_minpts, ")"),
       x = "PC1", y = "PC2", color = "Cluster") +
  theme_report

6 HDBSCAN Clustering

HDBSCAN is a more flexible version of DBSCAN. Instead of one fixed eps, it examines many density levels at once and finds clusters of varying density automatically. Its main setting is minPts. The value that gave the lowest noise while still separating some groups was:

minPts = 5

hdb_minpts <- 5
hdb_result <- hdbscan(X_pca, minPts = hdb_minpts)

hdb_table <- as.data.frame(table(hdb_result$cluster))
colnames(hdb_table) <- c("Cluster", "Points")
hdb_table$Cluster <- ifelse(hdb_table$Cluster == "0", "Noise",
                             paste0("Cluster ", hdb_table$Cluster))
hdb_table$Percent <- round(hdb_table$Points / nrow(X_pca) * 100, 2)

kable(hdb_table, caption = paste0("HDBSCAN result (minPts = ", hdb_minpts, ")"))

HDBSCAN result (minPts = 5)
Cluster	Points	Percent
Noise	452	9.04
Cluster 1	12	0.24
Cluster 2	6	0.12
Cluster 3	4530	90.60

6.1 2D Visualization

hdb_plot_data <- as.data.frame(X_pca[, 1:2])
colnames(hdb_plot_data) <- c("PC1", "PC2")
hdb_plot_data$cluster <- factor(hdb_result$cluster)
levels(hdb_plot_data$cluster) <- ifelse(levels(hdb_plot_data$cluster) == "0", "Noise",
                                         paste0("Cluster ", levels(hdb_plot_data$cluster)))

ggplot(hdb_plot_data, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(alpha = 0.6, size = 1.5) +
  labs(title = paste0("HDBSCAN (minPts = ", hdb_minpts, ")"),
       x = "PC1", y = "PC2", color = "Cluster") +
  theme_report

6.2 Membership Probability

Unlike DBSCAN, HDBSCAN gives each point a probability (0 to 1) showing how strongly it belongs to its cluster. Values near 1 mean solid membership; values near 0 mean the point sits at the edge.

memb_df <- data.frame(prob = hdb_result$membership_prob)

ggplot(memb_df, aes(x = prob)) +
  geom_histogram(bins = 30, fill = "#7fcdbb", color = "white") +
  labs(title = "Distribution of Membership Probabilities",
       x = "Membership probability", y = "Count") +
  theme_report

7 Comparing DBSCAN and HDBSCAN

db_n_clusters  <- length(setdiff(unique(db_result$cluster), 0))
hdb_n_clusters <- length(setdiff(unique(hdb_result$cluster), 0))
db_noise_pct   <- round(sum(db_result$cluster == 0)  / length(db_result$cluster)  * 100, 2)
hdb_noise_pct  <- round(sum(hdb_result$cluster == 0) / length(hdb_result$cluster) * 100, 2)

compare_table <- data.frame(
  Method = c("DBSCAN", "HDBSCAN"),
  Number_of_Clusters = c(db_n_clusters, hdb_n_clusters),
  Noise_Percent = c(db_noise_pct, hdb_noise_pct)
)
kable(compare_table, caption = "Comparison of DBSCAN and HDBSCAN")

Comparison of DBSCAN and HDBSCAN
Method	Number_of_Clusters	Noise_Percent
DBSCAN	2	14.28
HDBSCAN	3	9.04

compare_data <- melt(compare_table, id.vars = "Method")

ggplot(compare_data, aes(x = Method, y = value, fill = Method)) +
  geom_col() +
  geom_text(aes(label = value), vjust = -0.3, size = 3.5) +
  facet_wrap(~variable, scales = "free_y") +
  labs(title = "DBSCAN vs HDBSCAN", x = "", y = "") +
  theme_report +
  theme(legend.position = "none")

8 Are the Small Clusters Just Outliers?

We test whether the small clusters contain statistically extreme points, using Mahalanobis distance. This method looks at all 10 variables together and flags a point as an outlier if its overall combination of values is unusually far from the average.

mean_vec   <- colMeans(X_scaled)
cov_matrix <- cov(X_scaled)
mahal_dist <- mahalanobis(X_scaled, center = mean_vec, cov = cov_matrix)
threshold  <- qchisq(0.975, df = length(numeric_cols))
is_outlier <- mahal_dist > threshold

cat("Points flagged as outliers:", sum(is_outlier),
    sprintf("(%.1f%% of the sample)\n", mean(is_outlier) * 100))

## Points flagged as outliers: 301 (6.0% of the sample)

outlier_by_cluster <- aggregate(is_outlier ~ hdb_plot_data$cluster, FUN = mean)
colnames(outlier_by_cluster) <- c("Cluster", "Outlier_Percent")
outlier_by_cluster$Outlier_Percent <- round(outlier_by_cluster$Outlier_Percent * 100, 1)

kable(outlier_by_cluster,
      caption = "Percentage of outliers within each HDBSCAN cluster")

Percentage of outliers within each HDBSCAN cluster
Cluster	Outlier_Percent
Noise	39.8
Cluster 1	33.3
Cluster 2	0.0
Cluster 3	2.6

ggplot(outlier_by_cluster, aes(x = reorder(Cluster, -Outlier_Percent),
                               y = Outlier_Percent, fill = Cluster)) +
  geom_col() +
  geom_text(aes(label = paste0(Outlier_Percent, "%")), vjust = -0.3, size = 3.5) +
  labs(title = "Outlier Percentage per Cluster",
       x = "", y = "Outliers (%)") +
  theme_report +
  theme(legend.position = "none")

9 Cluster Profiling

Profiling means describing what each cluster represents in terms of the original variables. We use the HDBSCAN clusters here because they give more groups to compare.

9.1 Average Values per Cluster

hdb_interp <- cbind(X_sample, cluster = hdb_plot_data$cluster)
hdb_profile <- aggregate(. ~ cluster, data = hdb_interp, mean)
hdb_profile[, -1] <- round(hdb_profile[, -1], 1)

kable(hdb_profile, caption = "Average value of each variable per HDBSCAN cluster")

Average value of each variable per HDBSCAN cluster
cluster	Elevation	Aspect	Slope	Horizontal_Distance_To_Hydrology	Vertical_Distance_To_Hydrology	Horizontal_Distance_To_Roadways	Hillshade_9am	Hillshade_Noon	Hillshade_3pm	Horizontal_Distance_To_Fire_Points
Noise	2888.8	184.8	23.8	386.8	117.3	2159.6	189.7	201.6	138.0	2116.1
Cluster 1	3307.4	166.9	8.6	1034.4	195.0	2593.8	225.6	243.1	150.2	1531.8
Cluster 2	2398.5	313.7	24.8	412.2	183.7	789.5	146.0	211.8	201.8	438.2
Cluster 3	2967.4	153.4	13.0	252.5	38.1	2348.3	214.4	225.7	143.3	1974.0

9.2 Profile Heatmap

The raw averages are hard to compare because variables use different scales. Here we standardize each variable across the clusters (so red means “high for this cluster” and blue means “low”).

prof_mat <- as.matrix(hdb_profile[, -1])
rownames(prof_mat) <- hdb_profile$cluster

prof_scaled <- scale(prof_mat)              # z-score each variable across clusters
prof_scaled[is.nan(prof_scaled)] <- 0       # guard against zero-variance columns

prof_data <- melt(prof_scaled)
colnames(prof_data) <- c("Cluster", "Variable", "Value")

ggplot(prof_data, aes(x = Variable, y = Cluster, fill = Value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "#2166ac", mid = "white", high = "#b2182b",
                       midpoint = 0, name = "Standardized\nmean") +
  labs(title = "Cluster Profile Heatmap (standardized means)", x = "", y = "") +
  theme_report +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

9.3 Key Variables by Cluster

Boxplots of a few important variables show how the clusters differ in real units.

key_vars <- c("Elevation", "Slope", "Horizontal_Distance_To_Roadways")

box_df <- hdb_interp[, c("cluster", key_vars)]
box_data <- melt(box_df, id.vars = "cluster")

ggplot(box_data, aes(x = cluster, y = value, fill = cluster)) +
  geom_boxplot(outlier.size = 0.5, outlier.alpha = 0.3) +
  facet_wrap(~variable, scales = "free_y") +
  labs(title = "Key Variables Across Clusters", x = "", y = "") +
  theme_report +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),
        legend.position = "none")

Clustering Analysis: DBSCAN vs HDBSCAN

Forest Cover Type

Faiz Fadhlirobby

2026-07-01