This report analyzes the Forest Cover Type dataset using two density-based clustering methods: DBSCAN and HDBSCAN.
The report follows a clear, step-by-step flow:
# Change this path to match the location of your file
df <- read.csv("D:\\OneDrive - Universitas Islam Indonesia\\College\\Spatial Data Science\\covtype.csv")
cat("Number of rows:", nrow(df), "\n")
## Number of rows: 581012
cat("Number of columns:", ncol(df), "\n")
## Number of columns: 55
I use the 10 continuous numeric variables. The binary variables
(Wilderness_Area, Soil_Type) and the label
(Cover_Type) are kept aside, because they are
categorical
numeric_cols <- c(
"Elevation", "Aspect", "Slope",
"Horizontal_Distance_To_Hydrology", "Vertical_Distance_To_Hydrology",
"Horizontal_Distance_To_Roadways",
"Hillshade_9am", "Hillshade_Noon", "Hillshade_3pm",
"Horizontal_Distance_To_Fire_Points"
)
kable(data.frame(No = 1:length(numeric_cols), Variable = numeric_cols),
caption = "The 10 numeric variables used in this analysis")
| No | Variable |
|---|---|
| 1 | Elevation |
| 2 | Aspect |
| 3 | Slope |
| 4 | Horizontal_Distance_To_Hydrology |
| 5 | Vertical_Distance_To_Hydrology |
| 6 | Horizontal_Distance_To_Roadways |
| 7 | Hillshade_9am |
| 8 | Hillshade_Noon |
| 9 | Hillshade_3pm |
| 10 | Horizontal_Distance_To_Fire_Points |
The full dataset has over 580,000 rows, which is too large for DBSCAN and HDBSCAN to handle directly. I take a random sample of 5,000 rows and use this same sample for both the EDA and the clustering.
set.seed(42)
sample_size <- 5000
sample_idx <- sample(seq_len(nrow(df)), sample_size)
df_sample <- df[sample_idx, c(numeric_cols, "Cover_Type")]
X_sample <- df_sample[, numeric_cols]
cat("Sample size used:", nrow(X_sample), "rows\n")
## Sample size used: 5000 rows
summary_stats <- data.frame(
Variable = numeric_cols,
Min = sapply(X_sample, min),
Mean = round(sapply(X_sample, mean), 1),
Median = sapply(X_sample, median),
Max = sapply(X_sample, max),
SD = round(sapply(X_sample, sd), 1)
)
rownames(summary_stats) <- NULL
kable(summary_stats, caption = "Summary statistics of the numeric variables")
| Variable | Min | Mean | Median | Max | SD |
|---|---|---|---|---|---|
| Elevation | 1898 | 2960.5 | 2995 | 3857 | 280.2 |
| Aspect | 0 | 156.5 | 131 | 359 | 111.5 |
| Slope | 0 | 14.0 | 13 | 49 | 7.4 |
| Horizontal_Distance_To_Hydrology | 0 | 266.8 | 216 | 1307 | 212.2 |
| Vertical_Distance_To_Hydrology | -152 | 45.8 | 29 | 397 | 58.1 |
| Horizontal_Distance_To_Roadways | 0 | 2330.0 | 1959 | 6975 | 1564.1 |
| Hillshade_9am | 75 | 212.1 | 218 | 254 | 26.4 |
| Hillshade_Noon | 114 | 223.6 | 227 | 254 | 19.9 |
| Hillshade_3pm | 0 | 142.9 | 143 | 250 | 38.1 |
| Horizontal_Distance_To_Fire_Points | 30 | 1984.0 | 1719 | 7117 | 1335.9 |
Histograms show the shape of each variable. Some variables are skewed to the right, meaning most values are small but a few are very large.
hist_data <- melt(X_sample)
ggplot(hist_data, aes(x = value)) +
geom_histogram(bins = 30, fill = "#2c7fb8", color = "white") +
facet_wrap(~variable, scales = "free", ncol = 3) +
labs(title = "Distribution of Each Numeric Variable", x = "Value", y = "Count") +
theme_report
To compare the spread of all variables on one scale, we standardize them first (mean 0, standard deviation 1). Points far above or below the boxes are potential outliers.
scaled_data <- melt(as.data.frame(scale(X_sample)))
ggplot(scaled_data, aes(x = variable, y = value)) +
geom_boxplot(fill = "#7fcdbb", outlier.color = "red", outlier.alpha = 0.3) +
coord_flip() +
labs(title = "Standardized Values and Outliers", x = "", y = "Standardized value") +
theme_report
The heatmap below shows how strongly variables move together. Values near +1 (red) mean strong positive correlation; values near -1 (blue) mean strong negative correlation.
cor_matrix <- cor(X_sample)
cor_data <- melt(cor_matrix)
ggplot(cor_data, aes(x = Var1, y = Var2, fill = value)) +
geom_tile(color = "white") +
geom_text(aes(label = round(value, 2)), size = 2.8) +
scale_fill_gradient2(low = "#2166ac", mid = "white", high = "#b2182b",
midpoint = 0, limits = c(-1, 1), name = "Correlation") +
labs(title = "Correlation Between Numeric Variables", x = "", y = "") +
theme_report +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Although we do not use Cover_Type for clustering,
looking at it gives useful context.
cover_labels <- c("1" = "Spruce/Fir", "2" = "Lodgepole Pine",
"3" = "Ponderosa Pine", "4" = "Cottonwood/Willow",
"5" = "Aspen", "6" = "Douglas-fir", "7" = "Krummholz")
cover_df <- as.data.frame(table(df_sample$Cover_Type))
colnames(cover_df) <- c("Code", "Count")
cover_df$Type <- cover_labels[as.character(cover_df$Code)]
ggplot(cover_df, aes(x = reorder(Type, -Count), y = Count, fill = Type)) +
geom_col() +
geom_text(aes(label = Count), vjust = -0.3, size = 3) +
labs(title = "Number of Points per Forest Cover Type (in the sample)",
x = "", y = "Count") +
theme_report +
theme(axis.text.x = element_text(angle = 30, hjust = 1),
legend.position = "none")
Because the variables use different units (meters, degrees, brightness), we standardize them so each has equal weight. Then we apply Principal Component Analysis (PCA) to combine the 10 variables into a smaller set of components, keeping enough to explain at least 80% of the total variance.
X_scaled <- scale(X_sample)
pca_result <- prcomp(X_scaled, center = FALSE, scale. = FALSE)
variance_table <- as.data.frame(round(summary(pca_result)$importance, 4))
kable(variance_table, caption = "Variance explained by each principal component")
| PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | |
|---|---|---|---|---|---|---|---|---|---|---|
| Standard deviation | 1.6121 | 1.4678 | 1.3125 | 1.0304 | 0.8869 | 0.7437 | 0.6818 | 0.5919 | 0.5527 | 0.0442 |
| Proportion of Variance | 0.2599 | 0.2154 | 0.1723 | 0.1062 | 0.0787 | 0.0553 | 0.0465 | 0.0350 | 0.0305 | 0.0002 |
| Cumulative Proportion | 0.2599 | 0.4753 | 0.6476 | 0.7538 | 0.8324 | 0.8878 | 0.9342 | 0.9693 | 0.9998 | 1.0000 |
cum_var <- summary(pca_result)$importance[3, ]
n_pc_used <- which(cum_var >= 0.80)[1]
cat("Number of components kept (>= 80% of variance):", n_pc_used, "\n")
## Number of components kept (>= 80% of variance): 5
X_pca <- pca_result$x[, 1:n_pc_used]
The scree plot shows how much variance each component explains (bars) and the running total (red line). We keep components up to the point where the total reaches 80%.
scree_df <- data.frame(
PC = factor(paste0("PC", seq_along(cum_var)), levels = paste0("PC", seq_along(cum_var))),
Proportion = summary(pca_result)$importance[2, ],
Cumulative = cum_var
)
ggplot(scree_df, aes(x = PC)) +
geom_col(aes(y = Proportion), fill = "#2c7fb8") +
geom_line(aes(y = Cumulative, group = 1), color = "#b2182b", linewidth = 1) +
geom_point(aes(y = Cumulative), color = "#b2182b", size = 2) +
geom_hline(yintercept = 0.80, linetype = "dashed", color = "gray40") +
scale_y_continuous(labels = scales::percent) +
labs(title = "Scree Plot: Variance Explained by Each Component",
x = "Principal Component", y = "Variance explained") +
theme_report
This plot shows how each original variable loads onto the first three components. Variables with bars in the same direction contribute similarly to that component.
load_df <- as.data.frame(pca_result$rotation[, 1:3])
load_df$Variable <- rownames(load_df)
load_data <- melt(load_df, id.vars = "Variable")
ggplot(load_data, aes(x = Variable, y = value, fill = value > 0)) +
geom_col() +
facet_wrap(~variable, ncol = 3) +
coord_flip() +
scale_fill_manual(values = c("FALSE" = "#2166ac", "TRUE" = "#b2182b"),
labels = c("Negative", "Positive"), name = "Loading") +
labs(title = "Variable Loadings on PC1, PC2, PC3", x = "", y = "Loading") +
theme_report
DBSCAN uses two settings: eps (how close points must be to count as neighbors) and minPts (how many neighbors a point needs to be part of a dense region). After testing several combinations, the clearest and most stable result came from:
db_eps <- 1.00
db_minpts <- 15
A common way to choose eps is the k-distance plot. We
sort the distance from each point to its k-th nearest neighbor (here k =
minPts).
kd <- kNNdist(X_pca, k = db_minpts)
if (is.matrix(kd)) kd <- kd[, ncol(kd)]
kd_sorted <- sort(kd)
kd_df <- data.frame(Index = seq_along(kd_sorted), Distance = kd_sorted)
ggplot(kd_df, aes(x = Index, y = Distance)) +
geom_line(color = "#2c7fb8", linewidth = 0.8) +
geom_hline(yintercept = db_eps, linetype = "dashed", color = "#b2182b") +
annotate("text", x = 0, y = db_eps, label = paste0("eps = ", db_eps),
hjust = 0, vjust = -0.5, color = "#b2182b") +
labs(title = paste0("k-Distance Plot (k = ", db_minpts, ")"),
x = "Points sorted by distance", y = "Distance to k-th neighbor") +
theme_report
db_result <- dbscan(X_pca, eps = db_eps, minPts = db_minpts)
db_table <- as.data.frame(table(db_result$cluster))
colnames(db_table) <- c("Cluster", "Points")
db_table$Cluster <- ifelse(db_table$Cluster == "0", "Noise",
paste0("Cluster ", db_table$Cluster))
db_table$Percent <- round(db_table$Points / nrow(X_pca) * 100, 2)
kable(db_table, caption = paste0("DBSCAN result (eps = ", db_eps,
", minPts = ", db_minpts, ")"))
| Cluster | Points | Percent |
|---|---|---|
| Noise | 714 | 14.28 |
| Cluster 1 | 4251 | 85.02 |
| Cluster 2 | 35 | 0.70 |
db_plot_data <- as.data.frame(X_pca[, 1:2])
colnames(db_plot_data) <- c("PC1", "PC2")
db_plot_data$cluster <- factor(db_result$cluster)
levels(db_plot_data$cluster) <- ifelse(levels(db_plot_data$cluster) == "0", "Noise",
paste0("Cluster ", levels(db_plot_data$cluster)))
ggplot(db_plot_data, aes(x = PC1, y = PC2, color = cluster)) +
geom_point(alpha = 0.6, size = 1.5) +
labs(title = paste0("DBSCAN (eps = ", db_eps, ", minPts = ", db_minpts, ")"),
x = "PC1", y = "PC2", color = "Cluster") +
theme_report
HDBSCAN is a more flexible version of DBSCAN. Instead of one fixed
eps, it examines many density levels at once and finds
clusters of varying density automatically. Its main setting is
minPts. The value that gave the lowest noise while still
separating some groups was:
hdb_minpts <- 5
hdb_result <- hdbscan(X_pca, minPts = hdb_minpts)
hdb_table <- as.data.frame(table(hdb_result$cluster))
colnames(hdb_table) <- c("Cluster", "Points")
hdb_table$Cluster <- ifelse(hdb_table$Cluster == "0", "Noise",
paste0("Cluster ", hdb_table$Cluster))
hdb_table$Percent <- round(hdb_table$Points / nrow(X_pca) * 100, 2)
kable(hdb_table, caption = paste0("HDBSCAN result (minPts = ", hdb_minpts, ")"))
| Cluster | Points | Percent |
|---|---|---|
| Noise | 452 | 9.04 |
| Cluster 1 | 12 | 0.24 |
| Cluster 2 | 6 | 0.12 |
| Cluster 3 | 4530 | 90.60 |
hdb_plot_data <- as.data.frame(X_pca[, 1:2])
colnames(hdb_plot_data) <- c("PC1", "PC2")
hdb_plot_data$cluster <- factor(hdb_result$cluster)
levels(hdb_plot_data$cluster) <- ifelse(levels(hdb_plot_data$cluster) == "0", "Noise",
paste0("Cluster ", levels(hdb_plot_data$cluster)))
ggplot(hdb_plot_data, aes(x = PC1, y = PC2, color = cluster)) +
geom_point(alpha = 0.6, size = 1.5) +
labs(title = paste0("HDBSCAN (minPts = ", hdb_minpts, ")"),
x = "PC1", y = "PC2", color = "Cluster") +
theme_report
Unlike DBSCAN, HDBSCAN gives each point a probability (0 to 1) showing how strongly it belongs to its cluster. Values near 1 mean solid membership; values near 0 mean the point sits at the edge.
memb_df <- data.frame(prob = hdb_result$membership_prob)
ggplot(memb_df, aes(x = prob)) +
geom_histogram(bins = 30, fill = "#7fcdbb", color = "white") +
labs(title = "Distribution of Membership Probabilities",
x = "Membership probability", y = "Count") +
theme_report
db_n_clusters <- length(setdiff(unique(db_result$cluster), 0))
hdb_n_clusters <- length(setdiff(unique(hdb_result$cluster), 0))
db_noise_pct <- round(sum(db_result$cluster == 0) / length(db_result$cluster) * 100, 2)
hdb_noise_pct <- round(sum(hdb_result$cluster == 0) / length(hdb_result$cluster) * 100, 2)
compare_table <- data.frame(
Method = c("DBSCAN", "HDBSCAN"),
Number_of_Clusters = c(db_n_clusters, hdb_n_clusters),
Noise_Percent = c(db_noise_pct, hdb_noise_pct)
)
kable(compare_table, caption = "Comparison of DBSCAN and HDBSCAN")
| Method | Number_of_Clusters | Noise_Percent |
|---|---|---|
| DBSCAN | 2 | 14.28 |
| HDBSCAN | 3 | 9.04 |
compare_data <- melt(compare_table, id.vars = "Method")
ggplot(compare_data, aes(x = Method, y = value, fill = Method)) +
geom_col() +
geom_text(aes(label = value), vjust = -0.3, size = 3.5) +
facet_wrap(~variable, scales = "free_y") +
labs(title = "DBSCAN vs HDBSCAN", x = "", y = "") +
theme_report +
theme(legend.position = "none")
We test whether the small clusters contain statistically extreme points, using Mahalanobis distance. This method looks at all 10 variables together and flags a point as an outlier if its overall combination of values is unusually far from the average.
mean_vec <- colMeans(X_scaled)
cov_matrix <- cov(X_scaled)
mahal_dist <- mahalanobis(X_scaled, center = mean_vec, cov = cov_matrix)
threshold <- qchisq(0.975, df = length(numeric_cols))
is_outlier <- mahal_dist > threshold
cat("Points flagged as outliers:", sum(is_outlier),
sprintf("(%.1f%% of the sample)\n", mean(is_outlier) * 100))
## Points flagged as outliers: 301 (6.0% of the sample)
outlier_by_cluster <- aggregate(is_outlier ~ hdb_plot_data$cluster, FUN = mean)
colnames(outlier_by_cluster) <- c("Cluster", "Outlier_Percent")
outlier_by_cluster$Outlier_Percent <- round(outlier_by_cluster$Outlier_Percent * 100, 1)
kable(outlier_by_cluster,
caption = "Percentage of outliers within each HDBSCAN cluster")
| Cluster | Outlier_Percent |
|---|---|
| Noise | 39.8 |
| Cluster 1 | 33.3 |
| Cluster 2 | 0.0 |
| Cluster 3 | 2.6 |
ggplot(outlier_by_cluster, aes(x = reorder(Cluster, -Outlier_Percent),
y = Outlier_Percent, fill = Cluster)) +
geom_col() +
geom_text(aes(label = paste0(Outlier_Percent, "%")), vjust = -0.3, size = 3.5) +
labs(title = "Outlier Percentage per Cluster",
x = "", y = "Outliers (%)") +
theme_report +
theme(legend.position = "none")
Profiling means describing what each cluster represents in terms of the original variables. We use the HDBSCAN clusters here because they give more groups to compare.
hdb_interp <- cbind(X_sample, cluster = hdb_plot_data$cluster)
hdb_profile <- aggregate(. ~ cluster, data = hdb_interp, mean)
hdb_profile[, -1] <- round(hdb_profile[, -1], 1)
kable(hdb_profile, caption = "Average value of each variable per HDBSCAN cluster")
| cluster | Elevation | Aspect | Slope | Horizontal_Distance_To_Hydrology | Vertical_Distance_To_Hydrology | Horizontal_Distance_To_Roadways | Hillshade_9am | Hillshade_Noon | Hillshade_3pm | Horizontal_Distance_To_Fire_Points |
|---|---|---|---|---|---|---|---|---|---|---|
| Noise | 2888.8 | 184.8 | 23.8 | 386.8 | 117.3 | 2159.6 | 189.7 | 201.6 | 138.0 | 2116.1 |
| Cluster 1 | 3307.4 | 166.9 | 8.6 | 1034.4 | 195.0 | 2593.8 | 225.6 | 243.1 | 150.2 | 1531.8 |
| Cluster 2 | 2398.5 | 313.7 | 24.8 | 412.2 | 183.7 | 789.5 | 146.0 | 211.8 | 201.8 | 438.2 |
| Cluster 3 | 2967.4 | 153.4 | 13.0 | 252.5 | 38.1 | 2348.3 | 214.4 | 225.7 | 143.3 | 1974.0 |
The raw averages are hard to compare because variables use different scales. Here we standardize each variable across the clusters (so red means “high for this cluster” and blue means “low”).
prof_mat <- as.matrix(hdb_profile[, -1])
rownames(prof_mat) <- hdb_profile$cluster
prof_scaled <- scale(prof_mat) # z-score each variable across clusters
prof_scaled[is.nan(prof_scaled)] <- 0 # guard against zero-variance columns
prof_data <- melt(prof_scaled)
colnames(prof_data) <- c("Cluster", "Variable", "Value")
ggplot(prof_data, aes(x = Variable, y = Cluster, fill = Value)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "#2166ac", mid = "white", high = "#b2182b",
midpoint = 0, name = "Standardized\nmean") +
labs(title = "Cluster Profile Heatmap (standardized means)", x = "", y = "") +
theme_report +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Boxplots of a few important variables show how the clusters differ in real units.
key_vars <- c("Elevation", "Slope", "Horizontal_Distance_To_Roadways")
box_df <- hdb_interp[, c("cluster", key_vars)]
box_data <- melt(box_df, id.vars = "cluster")
ggplot(box_data, aes(x = cluster, y = value, fill = cluster)) +
geom_boxplot(outlier.size = 0.5, outlier.alpha = 0.3) +
facet_wrap(~variable, scales = "free_y") +
labs(title = "Key Variables Across Clusters", x = "", y = "") +
theme_report +
theme(axis.text.x = element_text(angle = 30, hjust = 1),
legend.position = "none")