K-means Clustering
When I examined the results of K-means clustering on my data with K values of 2, 3, and 4, I noticed some clear differences in how the observations grouped. In particular, I focused on how well defined each cluster was and how much overlap there was between clusters in each scenario.
For K=2, the division was quite clear, splitting the data into two distinct groups. The first principal component explained 37.7% of the variance, a good indication that the plot captured a significant share of the variability in my dataset. This simple bifurcation might reflect fundamental differences in the data, perhaps corresponding to two types of crops or soil conditions in my agricultural study.
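As a quick check on that figure, here is a minimal sketch (my addition; it assumes the scaled_data object created in the code further below). fviz_cluster projects the observations onto the first two principal components, so the percentages on its axes should match the output of prcomp:
# Sanity check on the variance-explained figure (assumes 'scaled_data' exists).
pca <- prcomp(scaled_data)
summary(pca)$importance["Proportion of Variance", ] # Dim1 should match the ~37.7% reported above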
Moving to K=3, the data was segmented into three groups, and I began to see a more nuanced breakdown. This might illustrate more specific characteristics such as varying water usage or yield efficiency among different crop types. The three-cluster solution seemed to offer a good balance, capturing complex patterns without much overlap, which suggested a meaningful categorization that could guide more targeted agricultural strategies.
With K=4, however, the plot showed some overlap among the clusters, especially noticeable between what appeared as the third and fourth groups. This overlap indicated that adding another cluster might not be providing additional useful information, as the new cluster seemed to fragment one of the existing groups rather than identifying a new, distinct category.
Detailed Statistical Insights
Variance Explained: Each plot attributed 37.7% of the variance to the first dimension, which reassured me that the primary structure of the data was captured consistently across the different K values.
Cluster Purity: With K=2 and K=3, the clusters were more homogeneous and better separated than with K=4, where the cluster boundaries became blurred.
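To back up these impressions with a standard diagnostic, here is a sketch (my addition, assuming scaled_data and the factoextra package from the setup code below) that plots the total within-cluster sum of squares and the average silhouette width across candidate values of K:
# Elbow and silhouette diagnostics for choosing K.
fviz_nbclust(scaled_data, kmeans, method = "wss")        # Elbow: look for the bend in the curve
fviz_nbclust(scaled_data, kmeans, method = "silhouette") # Higher average width = better-separated clusters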
# I'm installing the 'tidyverse' for data manipulation and visualization; it makes handling data frames easier.
#install.packages("tidyverse")
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.3 ✔ tidyr 1.3.1
## ✔ purrr 1.0.2
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# I'm also installing 'factoextra' for clustering visualization; it helps visualize the results from K-means.
#install.packages("factoextra")
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
# I'm setting a seed to make my random data generation reproducible.
set.seed(123)
# I'm creating a data frame to simulate agricultural data. I'm imagining features like Yield, WaterUsage, and SoilQuality.
agri_data <- data.frame(
  Yield = rnorm(100, mean = 50, sd = 10), # I chose a normal distribution for yield with a mean of 50 and sd of 10.
  WaterUsage = runif(100, min = 20, max = 100), # I'm using a uniform distribution for water usage to vary it broadly.
  SoilQuality = rnorm(100, mean = 5, sd = 1.5) # Soil quality is normally distributed around a mean score of 5.
)
# I'm scaling the data because K-means clustering performs better when features are on the same scale.
scaled_data <- scale(agri_data) # This function centers and scales each feature.
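Just to confirm the scaling behaved as expected, a quick sanity check (my addition):
# Each column of scaled_data should now have mean ~0 and standard deviation 1.
round(colMeans(scaled_data), 10)
apply(scaled_data, 2, sd)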
# I'm performing K-means clustering with K = 2, 3, and 4 to explore different cluster configurations.
set.seed(123) # Ensuring reproducibility for clustering results.
k2 <- kmeans(scaled_data, centers = 2, nstart = 25) # I use nstart = 25 to try 25 random initial configurations.
k3 <- kmeans(scaled_data, centers = 3, nstart = 25)
k4 <- kmeans(scaled_data, centers = 4, nstart = 25)
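Before plotting, a quick look at the fitted objects helps quantify the narrative above (a sketch, my addition):
# Total within-cluster sum of squares should drop as K grows; a small
# marginal drop from K = 3 to K = 4 would support the point that the
# fourth cluster adds little. Cluster sizes show how the groups split.
sapply(list(k2 = k2, k3 = k3, k4 = k4), function(k) k$tot.withinss)
lapply(list(k2 = k2, k3 = k3, k4 = k4), function(k) k$size)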
# My first draft of a function to plot data points colored by cluster assignment.
# I kept it commented out; the working version below takes the data as an
# explicit parameter and fixes the title (centers is a K x p matrix, so the
# original paste() on k$centers would not print K).
#plot_clusters <- function(k) {
#  fviz_cluster(k, data = scaled_data) +
#    ggtitle(paste("K-means Clustering with K =", nrow(k$centers))) +
#    theme_minimal() # A minimal theme for a clean visualization.
#}
# Plotting each cluster scenario:
#plot1 <- plot_clusters(k2)
#plot2 <- plot_clusters(k3)
#plot3 <- plot_clusters(k4)
# grid.arrange displays all plots side by side for easy comparison.
#install.packages("gridExtra")
#library(gridExtra)
#grid.arrange(plot1, plot2, plot3, nrow = 1)
# Load the plotting libraries (both were already attached above; reloading is harmless).
library(ggplot2)
library(factoextra) # For nicer cluster visualizations.
# 'scaled_data' is the dataset used for clustering, and 'k2', 'k3', 'k4' hold the k-means results.
# Here I adjust the custom function to take the data as an explicit parameter:
plot_clusters <- function(k_result, data) {
  fviz_cluster(k_result, data = data, # The data used for clustering
               geom = "point",        # Plot observations as points (no text labels)
               stand = FALSE,         # Data are already scaled, so don't standardize again
               ellipse = TRUE,        # Draw ellipses around clusters
               repel = TRUE           # Avoid label overlap (only relevant when labels are drawn)
  ) +
    theme_minimal() +                  # A minimal theme for aesthetics
    theme(legend.position = "right") + # Position the legend on the right
    labs(title = paste("K-means Clustering with K =", nrow(k_result$centers))) + # centers is a K x p matrix, so nrow() gives K
    scale_color_brewer(palette = "Set1") # A distinct, clear color palette
}
# 'scaled_data' is the standardized version of the original dataset.
# Visualizing the clusters for each K:
plot1 <- plot_clusters(k2, scaled_data)
plot2 <- plot_clusters(k3, scaled_data)
plot3 <- plot_clusters(k4, scaled_data)
# Displaying all plots side by side using gridExtra
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
grid.arrange(plot1, plot2, plot3, nrow = 1)
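Finally, to connect the clusters back to the agricultural variables, here is a sketch (my addition) that attaches the K = 3 assignments to the original, unscaled data and summarizes each cluster's mean Yield, WaterUsage, and SoilQuality:
# Interpreting the K = 3 clusters in terms of the original variables.
agri_data %>%
  mutate(Cluster = factor(k3$cluster)) %>%
  group_by(Cluster) %>%
  summarise(across(everything(), mean))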