1 Introduction

This report analyzes the Forest Cover Type dataset using two density-based clustering methods: DBSCAN and HDBSCAN.

The report follows a clear, step-by-step flow:

  1. Load the data and look at its structure.
  2. Explore the data with summary statistics and plots.
  3. Prepare the data (sampling 5000, scaling, PCA).
  4. Run DBSCAN and HDBSCAN.
  5. Compare the two methods.
  6. Check whether small clusters are just outliers.
  7. Profile each cluster to understand what it represents.

2 Load Data

# Change this path to match the location of your file
df <- read.csv("D:\\OneDrive - Universitas Islam Indonesia\\College\\Spatial Data Science\\covtype.csv")

cat("Number of rows:", nrow(df), "\n")
## Number of rows: 581012
cat("Number of columns:", ncol(df), "\n")
## Number of columns: 55

I use the 10 continuous numeric variables. The binary variables (Wilderness_Area, Soil_Type) and the label (Cover_Type) are kept aside, because they are categorical

numeric_cols <- c(
  "Elevation", "Aspect", "Slope",
  "Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
  "Horizontal_Distance_To_Roadways",
  "Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm",
  "Horizontal_Distance_To_Fire_Points"
)

kable(data.frame(No = 1:length(numeric_cols), Variable = numeric_cols),
      caption = "The 10 numeric variables used in this analysis")
The 10 numeric variables used in this analysis
No Variable
1 Elevation
2 Aspect
3 Slope
4 Horizontal_Distance_To_Hydrology
5 Vertical_Distance_To_Hydrology
6 Horizontal_Distance_To_Roadways
7 Hillshade_9am
8 Hillshade_Noon
9 Hillshade_3pm
10 Horizontal_Distance_To_Fire_Points

2.1 Sampling

The full dataset has over 580,000 rows, which is too large for DBSCAN and HDBSCAN to handle directly. I take a random sample of 5,000 rows and use this same sample for both the EDA and the clustering.

set.seed(42)
sample_size <- 5000
sample_idx <- sample(seq_len(nrow(df)), sample_size)

df_sample <- df[sample_idx, c(numeric_cols, "Cover_Type")]
X_sample  <- df_sample[, numeric_cols]

cat("Sample size used:", nrow(X_sample), "rows\n")
## Sample size used: 5000 rows

3 Exploratory Data Analysis (EDA)

3.1 Summary Statistics

summary_stats <- data.frame(
  Variable = numeric_cols,
  Min    = sapply(X_sample, min),
  Mean   = round(sapply(X_sample, mean), 1),
  Median = sapply(X_sample, median),
  Max    = sapply(X_sample, max),
  SD     = round(sapply(X_sample, sd), 1)
)
rownames(summary_stats) <- NULL
kable(summary_stats, caption = "Summary statistics of the numeric variables")
Summary statistics of the numeric variables
Variable Min Mean Median Max SD
Elevation 1898 2960.5 2995 3857 280.2
Aspect 0 156.5 131 359 111.5
Slope 0 14.0 13 49 7.4
Horizontal_Distance_To_Hydrology 0 266.8 216 1307 212.2
Vertical_Distance_To_Hydrology -152 45.8 29 397 58.1
Horizontal_Distance_To_Roadways 0 2330.0 1959 6975 1564.1
Hillshade_9am 75 212.1 218 254 26.4
Hillshade_Noon 114 223.6 227 254 19.9
Hillshade_3pm 0 142.9 143 250 38.1
Horizontal_Distance_To_Fire_Points 30 1984.0 1719 7117 1335.9

3.2 Distribution of Each Variable

Histograms show the shape of each variable. Some variables are skewed to the right, meaning most values are small but a few are very large.

hist_data <- melt(X_sample)

ggplot(hist_data, aes(x = value)) +
  geom_histogram(bins = 30, fill = "#2c7fb8", color = "white") +
  facet_wrap(~variable, scales = "free", ncol = 3) +
  labs(title = "Distribution of Each Numeric Variable", x = "Value", y = "Count") +
  theme_report

3.3 Spread and Outliers

To compare the spread of all variables on one scale, we standardize them first (mean 0, standard deviation 1). Points far above or below the boxes are potential outliers.

scaled_data <- melt(as.data.frame(scale(X_sample)))

ggplot(scaled_data, aes(x = variable, y = value)) +
  geom_boxplot(fill = "#7fcdbb", outlier.color = "red", outlier.alpha = 0.3) +
  coord_flip() +
  labs(title = "Standardized Values and Outliers", x = "", y = "Standardized value") +
  theme_report

3.4 Correlation Between Variables

The heatmap below shows how strongly variables move together. Values near +1 (red) mean strong positive correlation; values near -1 (blue) mean strong negative correlation.

cor_matrix <- cor(X_sample)
cor_data <- melt(cor_matrix)

ggplot(cor_data, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +
  geom_text(aes(label = round(value, 2)), size = 2.8) +
  scale_fill_gradient2(low = "#2166ac", mid = "white", high = "#b2182b",
                       midpoint = 0, limits = c(-1, 1), name = "Correlation") +
  labs(title = "Correlation Between Numeric Variables", x = "", y = "") +
  theme_report +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

3.5 Distribution of the Cover Type Label

Although we do not use Cover_Type for clustering, looking at it gives useful context.

cover_labels <- c("1" = "Spruce/Fir", "2" = "Lodgepole Pine",
                  "3" = "Ponderosa Pine", "4" = "Cottonwood/Willow",
                  "5" = "Aspen", "6" = "Douglas-fir", "7" = "Krummholz")

cover_df <- as.data.frame(table(df_sample$Cover_Type))
colnames(cover_df) <- c("Code", "Count")
cover_df$Type <- cover_labels[as.character(cover_df$Code)]

ggplot(cover_df, aes(x = reorder(Type, -Count), y = Count, fill = Type)) +
  geom_col() +
  geom_text(aes(label = Count), vjust = -0.3, size = 3) +
  labs(title = "Number of Points per Forest Cover Type (in the sample)",
       x = "", y = "Count") +
  theme_report +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),
        legend.position = "none")

4 Data Preparation

4.1 Scaling and PCA

Because the variables use different units (meters, degrees, brightness), we standardize them so each has equal weight. Then we apply Principal Component Analysis (PCA) to combine the 10 variables into a smaller set of components, keeping enough to explain at least 80% of the total variance.

X_scaled <- scale(X_sample)
pca_result <- prcomp(X_scaled, center = FALSE, scale. = FALSE)

variance_table <- as.data.frame(round(summary(pca_result)$importance, 4))
kable(variance_table, caption = "Variance explained by each principal component")
Variance explained by each principal component
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10
Standard deviation 1.6121 1.4678 1.3125 1.0304 0.8869 0.7437 0.6818 0.5919 0.5527 0.0442
Proportion of Variance 0.2599 0.2154 0.1723 0.1062 0.0787 0.0553 0.0465 0.0350 0.0305 0.0002
Cumulative Proportion 0.2599 0.4753 0.6476 0.7538 0.8324 0.8878 0.9342 0.9693 0.9998 1.0000
cum_var <- summary(pca_result)$importance[3, ]
n_pc_used <- which(cum_var >= 0.80)[1]

cat("Number of components kept (>= 80% of variance):", n_pc_used, "\n")
## Number of components kept (>= 80% of variance): 5
X_pca <- pca_result$x[, 1:n_pc_used]

4.2 Scree Plot

The scree plot shows how much variance each component explains (bars) and the running total (red line). We keep components up to the point where the total reaches 80%.

scree_df <- data.frame(
  PC = factor(paste0("PC", seq_along(cum_var)), levels = paste0("PC", seq_along(cum_var))),
  Proportion = summary(pca_result)$importance[2, ],
  Cumulative = cum_var
)

ggplot(scree_df, aes(x = PC)) +
  geom_col(aes(y = Proportion), fill = "#2c7fb8") +
  geom_line(aes(y = Cumulative, group = 1), color = "#b2182b", linewidth = 1) +
  geom_point(aes(y = Cumulative), color = "#b2182b", size = 2) +
  geom_hline(yintercept = 0.80, linetype = "dashed", color = "gray40") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Scree Plot: Variance Explained by Each Component",
       x = "Principal Component", y = "Variance explained") +
  theme_report

4.3 Variable Contributions

This plot shows how each original variable loads onto the first three components. Variables with bars in the same direction contribute similarly to that component.

load_df <- as.data.frame(pca_result$rotation[, 1:3])
load_df$Variable <- rownames(load_df)
load_data <- melt(load_df, id.vars = "Variable")

ggplot(load_data, aes(x = Variable, y = value, fill = value > 0)) +
  geom_col() +
  facet_wrap(~variable, ncol = 3) +
  coord_flip() +
  scale_fill_manual(values = c("FALSE" = "#2166ac", "TRUE" = "#b2182b"),
                    labels = c("Negative", "Positive"), name = "Loading") +
  labs(title = "Variable Loadings on PC1, PC2, PC3", x = "", y = "Loading") +
  theme_report

5 DBSCAN Clustering

DBSCAN uses two settings: eps (how close points must be to count as neighbors) and minPts (how many neighbors a point needs to be part of a dense region). After testing several combinations, the clearest and most stable result came from:

  • eps = 1.00
  • minPts = 15
db_eps    <- 1.00
db_minpts <- 15

5.1 Choosing eps: k-Distance Plot

A common way to choose eps is the k-distance plot. We sort the distance from each point to its k-th nearest neighbor (here k = minPts).

kd <- kNNdist(X_pca, k = db_minpts)
if (is.matrix(kd)) kd <- kd[, ncol(kd)]
kd_sorted <- sort(kd)

kd_df <- data.frame(Index = seq_along(kd_sorted), Distance = kd_sorted)

ggplot(kd_df, aes(x = Index, y = Distance)) +
  geom_line(color = "#2c7fb8", linewidth = 0.8) +
  geom_hline(yintercept = db_eps, linetype = "dashed", color = "#b2182b") +
  annotate("text", x = 0, y = db_eps, label = paste0("eps = ", db_eps),
           hjust = 0, vjust = -0.5, color = "#b2182b") +
  labs(title = paste0("k-Distance Plot (k = ", db_minpts, ")"),
       x = "Points sorted by distance", y = "Distance to k-th neighbor") +
  theme_report

5.2 Run and Summary

db_result <- dbscan(X_pca, eps = db_eps, minPts = db_minpts)

db_table <- as.data.frame(table(db_result$cluster))
colnames(db_table) <- c("Cluster", "Points")
db_table$Cluster <- ifelse(db_table$Cluster == "0", "Noise",
                            paste0("Cluster ", db_table$Cluster))
db_table$Percent <- round(db_table$Points / nrow(X_pca) * 100, 2)

kable(db_table, caption = paste0("DBSCAN result (eps = ", db_eps,
                                 ", minPts = ", db_minpts, ")"))
DBSCAN result (eps = 1, minPts = 15)
Cluster Points Percent
Noise 714 14.28
Cluster 1 4251 85.02
Cluster 2 35 0.70

5.3 2D Visualization

db_plot_data <- as.data.frame(X_pca[, 1:2])
colnames(db_plot_data) <- c("PC1", "PC2")
db_plot_data$cluster <- factor(db_result$cluster)
levels(db_plot_data$cluster) <- ifelse(levels(db_plot_data$cluster) == "0", "Noise",
                                        paste0("Cluster ", levels(db_plot_data$cluster)))

ggplot(db_plot_data, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(alpha = 0.6, size = 1.5) +
  labs(title = paste0("DBSCAN (eps = ", db_eps, ", minPts = ", db_minpts, ")"),
       x = "PC1", y = "PC2", color = "Cluster") +
  theme_report

6 HDBSCAN Clustering

HDBSCAN is a more flexible version of DBSCAN. Instead of one fixed eps, it examines many density levels at once and finds clusters of varying density automatically. Its main setting is minPts. The value that gave the lowest noise while still separating some groups was:

  • minPts = 5
hdb_minpts <- 5
hdb_result <- hdbscan(X_pca, minPts = hdb_minpts)

hdb_table <- as.data.frame(table(hdb_result$cluster))
colnames(hdb_table) <- c("Cluster", "Points")
hdb_table$Cluster <- ifelse(hdb_table$Cluster == "0", "Noise",
                             paste0("Cluster ", hdb_table$Cluster))
hdb_table$Percent <- round(hdb_table$Points / nrow(X_pca) * 100, 2)

kable(hdb_table, caption = paste0("HDBSCAN result (minPts = ", hdb_minpts, ")"))
HDBSCAN result (minPts = 5)
Cluster Points Percent
Noise 452 9.04
Cluster 1 12 0.24
Cluster 2 6 0.12
Cluster 3 4530 90.60

6.1 2D Visualization

hdb_plot_data <- as.data.frame(X_pca[, 1:2])
colnames(hdb_plot_data) <- c("PC1", "PC2")
hdb_plot_data$cluster <- factor(hdb_result$cluster)
levels(hdb_plot_data$cluster) <- ifelse(levels(hdb_plot_data$cluster) == "0", "Noise",
                                         paste0("Cluster ", levels(hdb_plot_data$cluster)))

ggplot(hdb_plot_data, aes(x = PC1, y = PC2, color = cluster)) +
  geom_point(alpha = 0.6, size = 1.5) +
  labs(title = paste0("HDBSCAN (minPts = ", hdb_minpts, ")"),
       x = "PC1", y = "PC2", color = "Cluster") +
  theme_report

6.2 Membership Probability

Unlike DBSCAN, HDBSCAN gives each point a probability (0 to 1) showing how strongly it belongs to its cluster. Values near 1 mean solid membership; values near 0 mean the point sits at the edge.

memb_df <- data.frame(prob = hdb_result$membership_prob)

ggplot(memb_df, aes(x = prob)) +
  geom_histogram(bins = 30, fill = "#7fcdbb", color = "white") +
  labs(title = "Distribution of Membership Probabilities",
       x = "Membership probability", y = "Count") +
  theme_report

7 Comparing DBSCAN and HDBSCAN

db_n_clusters  <- length(setdiff(unique(db_result$cluster), 0))
hdb_n_clusters <- length(setdiff(unique(hdb_result$cluster), 0))
db_noise_pct   <- round(sum(db_result$cluster == 0)  / length(db_result$cluster)  * 100, 2)
hdb_noise_pct  <- round(sum(hdb_result$cluster == 0) / length(hdb_result$cluster) * 100, 2)

compare_table <- data.frame(
  Method = c("DBSCAN", "HDBSCAN"),
  Number_of_Clusters = c(db_n_clusters, hdb_n_clusters),
  Noise_Percent = c(db_noise_pct, hdb_noise_pct)
)
kable(compare_table, caption = "Comparison of DBSCAN and HDBSCAN")
Comparison of DBSCAN and HDBSCAN
Method Number_of_Clusters Noise_Percent
DBSCAN 2 14.28
HDBSCAN 3 9.04
compare_data <- melt(compare_table, id.vars = "Method")

ggplot(compare_data, aes(x = Method, y = value, fill = Method)) +
  geom_col() +
  geom_text(aes(label = value), vjust = -0.3, size = 3.5) +
  facet_wrap(~variable, scales = "free_y") +
  labs(title = "DBSCAN vs HDBSCAN", x = "", y = "") +
  theme_report +
  theme(legend.position = "none")

8 Are the Small Clusters Just Outliers?

We test whether the small clusters contain statistically extreme points, using Mahalanobis distance. This method looks at all 10 variables together and flags a point as an outlier if its overall combination of values is unusually far from the average.

mean_vec   <- colMeans(X_scaled)
cov_matrix <- cov(X_scaled)
mahal_dist <- mahalanobis(X_scaled, center = mean_vec, cov = cov_matrix)
threshold  <- qchisq(0.975, df = length(numeric_cols))
is_outlier <- mahal_dist > threshold

cat("Points flagged as outliers:", sum(is_outlier),
    sprintf("(%.1f%% of the sample)\n", mean(is_outlier) * 100))
## Points flagged as outliers: 301 (6.0% of the sample)
outlier_by_cluster <- aggregate(is_outlier ~ hdb_plot_data$cluster, FUN = mean)
colnames(outlier_by_cluster) <- c("Cluster", "Outlier_Percent")
outlier_by_cluster$Outlier_Percent <- round(outlier_by_cluster$Outlier_Percent * 100, 1)

kable(outlier_by_cluster,
      caption = "Percentage of outliers within each HDBSCAN cluster")
Percentage of outliers within each HDBSCAN cluster
Cluster Outlier_Percent
Noise 39.8
Cluster 1 33.3
Cluster 2 0.0
Cluster 3 2.6
ggplot(outlier_by_cluster, aes(x = reorder(Cluster, -Outlier_Percent),
                               y = Outlier_Percent, fill = Cluster)) +
  geom_col() +
  geom_text(aes(label = paste0(Outlier_Percent, "%")), vjust = -0.3, size = 3.5) +
  labs(title = "Outlier Percentage per Cluster",
       x = "", y = "Outliers (%)") +
  theme_report +
  theme(legend.position = "none")

9 Cluster Profiling

Profiling means describing what each cluster represents in terms of the original variables. We use the HDBSCAN clusters here because they give more groups to compare.

9.1 Average Values per Cluster

hdb_interp <- cbind(X_sample, cluster = hdb_plot_data$cluster)
hdb_profile <- aggregate(. ~ cluster, data = hdb_interp, mean)
hdb_profile[, -1] <- round(hdb_profile[, -1], 1)

kable(hdb_profile, caption = "Average value of each variable per HDBSCAN cluster")
Average value of each variable per HDBSCAN cluster
cluster Elevation Aspect Slope Horizontal_Distance_To_Hydrology Vertical_Distance_To_Hydrology Horizontal_Distance_To_Roadways Hillshade_9am Hillshade_Noon Hillshade_3pm Horizontal_Distance_To_Fire_Points
Noise 2888.8 184.8 23.8 386.8 117.3 2159.6 189.7 201.6 138.0 2116.1
Cluster 1 3307.4 166.9 8.6 1034.4 195.0 2593.8 225.6 243.1 150.2 1531.8
Cluster 2 2398.5 313.7 24.8 412.2 183.7 789.5 146.0 211.8 201.8 438.2
Cluster 3 2967.4 153.4 13.0 252.5 38.1 2348.3 214.4 225.7 143.3 1974.0

9.2 Profile Heatmap

The raw averages are hard to compare because variables use different scales. Here we standardize each variable across the clusters (so red means “high for this cluster” and blue means “low”).

prof_mat <- as.matrix(hdb_profile[, -1])
rownames(prof_mat) <- hdb_profile$cluster

prof_scaled <- scale(prof_mat)              # z-score each variable across clusters
prof_scaled[is.nan(prof_scaled)] <- 0       # guard against zero-variance columns

prof_data <- melt(prof_scaled)
colnames(prof_data) <- c("Cluster", "Variable", "Value")

ggplot(prof_data, aes(x = Variable, y = Cluster, fill = Value)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "#2166ac", mid = "white", high = "#b2182b",
                       midpoint = 0, name = "Standardized\nmean") +
  labs(title = "Cluster Profile Heatmap (standardized means)", x = "", y = "") +
  theme_report +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

9.3 Key Variables by Cluster

Boxplots of a few important variables show how the clusters differ in real units.

key_vars <- c("Elevation", "Slope", "Horizontal_Distance_To_Roadways")

box_df <- hdb_interp[, c("cluster", key_vars)]
box_data <- melt(box_df, id.vars = "cluster")

ggplot(box_data, aes(x = cluster, y = value, fill = cluster)) +
  geom_boxplot(outlier.size = 0.5, outlier.alpha = 0.3) +
  facet_wrap(~variable, scales = "free_y") +
  labs(title = "Key Variables Across Clusters", x = "", y = "") +
  theme_report +
  theme(axis.text.x = element_text(angle = 30, hjust = 1),
        legend.position = "none")