We will analyze data on various physical characteristics of wheat seed kernels and cluster them into groups using several clustering techniques.

Data Preparation

Data Import


We have taken the dataset from Kaggle (data source: Wheat Seed).
The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa, and Canadian, 70 elements each, randomly selected for the experiment. The internal kernel structure was visualized in high quality using a soft X-ray technique, which is non-destructive and considerably cheaper than more sophisticated imaging techniques such as scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine-harvested wheat grain originating from experimental fields explored at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.
The objective is to demonstrate, in R, how to analyze the relevant wheat seed measurements and cluster the kernels into groups.

We are importing the Wheat seed (kernels) data and performing preliminary analysis.

library(rmarkdown)
library(ggplot2)
library(GGally)
library(grid)
library(gridExtra)
library(corrplot)
library(tidyverse)
library(tidyr)
library(caret)
library(knitr)
library(mgcv)
library(dplyr)
library(ROCR)
library(factoextra)
library(fpc)
library(cluster)
library(mclust)
set.seed(12345)
seed <- read.csv("~/Seed_Data.csv")
dim(seed)
## [1] 210   8
str(seed)
## 'data.frame':    210 obs. of  8 variables:
##  $ A     : num  15.3 14.9 14.3 13.8 16.1 ...
##  $ P     : num  14.8 14.6 14.1 13.9 15 ...
##  $ C     : num  0.871 0.881 0.905 0.895 0.903 ...
##  $ LK    : num  5.76 5.55 5.29 5.32 5.66 ...
##  $ WK    : num  3.31 3.33 3.34 3.38 3.56 ...
##  $ A_Coef: num  2.22 1.02 2.7 2.26 1.35 ...
##  $ LKG   : num  5.22 4.96 4.83 4.8 5.17 ...
##  $ target: int  0 0 0 0 0 0 0 0 0 0 ...

There are 210 records of kernels and 8 variables.

We will convert the variable target to a factor and rename the columns to something meaningful.

seed$target <- seed$target + 1      # shift labels from 0-2 to 1-3
seed$target <- factor(seed$target)  # treat variety as a categorical factor
names(seed)[names(seed) == "A"] <- "Area"
names(seed)[names(seed) == "P"] <- "Perimeter"
names(seed)[names(seed) == "C"] <- "Compactness"
names(seed)[names(seed) == "LK"] <- "Length"
names(seed)[names(seed) == "WK"] <- "Width"
names(seed)[names(seed) == "LKG"] <- "Length_groove"

Here is the data.

paged_table(seed)

Data Cleaning


Missing Values

We are checking if there are any missing values in any of the variables.

colSums(is.na(seed))
##          Area     Perimeter   Compactness        Length         Width 
##             0             0             0             0             0 
##        A_Coef Length_groove        target 
##             0             0             0
print(paste(sum(complete.cases(seed)),"Complete cases!"))
## [1] "210 Complete cases!"
any(is.null(seed))
## [1] FALSE

We see that there are no missing values in any of the variables, and all 210 cases are complete. The checks that follow confirm there are also no invalid or spurious values in the dataset.

Negative Values

We are checking if there are any negative values in any of the variables.

sum(seed[, 1:7] < 0)
## [1] 0

We see that there are no negative values in any of the variables.

Duplicate Records

We are checking whether there are duplicate records in the dataset; duplicates are redundant and would need to be removed.

nrow(unique(seed))
## [1] 210

There are no duplicate records in the dataset.

Data Dictionary


Variable        Type     Description
Area            double   area of the kernel (A)
Perimeter       double   perimeter of the kernel (P)
Compactness     double   compactness of the kernel, equal to 4*pi*A/P^2
Length          double   length of the kernel
Width           double   width of the kernel
A_Coef          double   asymmetry coefficient
Length_groove   double   length of the kernel groove
target          factor   variety of kernel - Kama, Rosa, or Canadian, stored as 1, 2, and 3
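
Since the dictionary defines compactness as 4*pi*A/P^2, we can sanity-check the column against that formula. This is an optional sketch; the tolerance is an assumption to allow for rounding in the source data:

all.equal(4 * pi * seed$Area / seed$Perimeter^2, seed$Compactness,
          tolerance = 1e-3)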

EDA

Correlation between variables

corrplot(cor(seed[,1:7]), type = "lower", method = "number")

  • Area and Perimeter are highly positively correlated.
  • Area and Width are highly positively correlated, as are Length and Perimeter.
  • Length and Area, Width and Perimeter, and Length and Length_groove are also all highly positively correlated.
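
GGally is already loaded, so a pairs plot offers a complementary view of the same correlations alongside the per-variety distributions. A minimal sketch, assuming seed is still the renamed data frame with the factor target:

ggpairs(seed, columns = 1:7, mapping = aes(color = target, alpha = 0.5))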

Outliers and relation between target and other variables

There are very few outliers in some of the variables.
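
As a sketch of how one might verify this, the code below draws a boxplot of every numeric variable by variety (again assuming seed is still the renamed data frame with the factor target):

seed %>%
  pivot_longer(cols = -target, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = target, y = value, fill = target)) +
  geom_boxplot() +
  facet_wrap(~ variable, scales = "free_y")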

K-means Clustering

Scaling the data

Scale data to have zero mean and unit variance for each column:

Class <- factor(seed$target, levels = 1:3)  # keep the true variety labels for later evaluation
seed <- scale(seed[, -8])                   # drop target and standardize the features
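
scale() subtracts each column mean and divides by the column standard deviation. As a quick optional check that the transformation behaved as expected:

round(colMeans(seed), 10)  # column means should be numerically zero
apply(seed, 2, sd)         # column standard deviations should be one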

K-means Clustering

To begin, the distance between each pair of records in the dataset is measured and visualized:

distance <- get_dist(seed)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))

Clusterings with k = 2, 3, 4, and 5 are fitted and plotted with fviz_cluster to see how the data points are grouped.

k2 <- kmeans(seed, centers = 2, nstart = 25)
k3 <- kmeans(seed, centers = 3, nstart = 25)
k4 <- kmeans(seed, centers = 4, nstart = 25)
k5 <- kmeans(seed, centers = 5, nstart = 25)
p1 <- fviz_cluster(k2, geom = "point", data = seed) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point",  data = seed) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point",  data = seed) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point",  data = seed) + ggtitle("k = 5")

grid.arrange(p1, p2, p3, p4, nrow = 2)

Finding Optimal number of clusters

Elbow Method

# total within-group sum of squares for k = 1 (the whole data set)
wss <- (nrow(seed) - 1) * sum(apply(seed, 2, var))
# within-group sum of squares for k = 2 to 12
for (i in 2:12) wss[i] <- sum(kmeans(seed, centers = i)$withinss)
plot(1:12, wss, type = "b", xlab = "Number of Clusters",
     ylab = "Within groups sum of squares")

We see a bend (elbow) at k = 3; therefore, 3 is the optimal number of clusters.
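
As a cross-check, factoextra provides fviz_nbclust(), which computes and plots the same within-cluster sum of squares curve; a sketch that should agree with the manual loop above up to the randomness of kmeans:

fviz_nbclust(seed, kmeans, method = "wss", k.max = 12)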

Silhouette Method

d <- dist(seed, method = "euclidean")
# store k, average silhouette width, and Dunn index for k = 2 to 15
result <- matrix(nrow = 14, ncol = 3)
for (i in 2:15) {
  cluster_result <- kmeans(seed, i)
  clusterstat <- cluster.stats(d, cluster_result$cluster)
  result[i - 1, 1] <- i
  result[i - 1, 2] <- clusterstat$avg.silwidth
  result[i - 1, 3] <- clusterstat$dunn
}
plot(result[, c(1, 2)], type = "l", ylab = "silhouette width", xlab = "number of clusters")

The higher the Silhouette value, the better the clustering results. We see that k=2 is the best and k=3 is the next best.
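
For reference, the average silhouette width of a specific solution can also be computed directly with cluster::silhouette(); a minimal sketch for the k = 3 model fitted earlier:

sil <- silhouette(k3$cluster, d)  # d is the euclidean distance matrix from above
mean(sil[, "sil_width"])          # average silhouette width for k = 3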

Dunn Index

plot(result[,c(1,3)], type = "l", ylab = 'dunn index', xlab = 'number of clusters')

The higher the Dunn index, the better the clustering results. k = 3 is the optimal number of clusters.

Final K-means Clusters

plotcluster(seed, k3$cluster)

table(Class, k3$cluster)
##      
## Class  1  2  3
##     1  6 62  2
##     2  0  5 65
##     3 66  4  0
adjustedRandIndex(Class, k3$cluster)
## [1] 0.7732937
k3$centers
##         Area  Perimeter Compactness     Length        Width      A_Coef
## 1 -1.0277967 -1.0042491  -0.9626050 -0.8955451 -1.082995635  0.69314821
## 2 -0.1407831 -0.1696372   0.4485346 -0.2571999  0.001643014 -0.66034079
## 3  1.2536860  1.2589580   0.5591283  1.2349319  1.162075101 -0.04511157
##   Length_groove
## 1    -0.6233191
## 2    -0.5844965
## 3     1.2892273
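
The confusion table above shows that cluster labels are an arbitrary permutation of the true classes. One illustrative way (not part of the original analysis) to map each k-means cluster to its majority variety:

# for each cluster (column), find the true class (row) with the majority of kernels
apply(table(Class, k3$cluster), 2, which.max)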

Hierarchical Clustering

Our goal here is to develop clusters using Hierarchical Clustering.

Finding Optimal number of clusters

Elbow Method

fviz_nbclust(seed, FUN = hcut, method = "wss")

We see a bend (elbow) at k = 3; therefore, 3 is the optimal number of clusters.

Silhouette Method

fviz_nbclust(seed, FUN = hcut, method = "silhouette")

The higher the Silhouette value, the better the clustering results. We see that k=2 is the best and k=3 is the next best.

Linkage Methods

Single Linkage

euclidian_dist <- dist(seed, method = "euclidean")

# Hierarchical clustering using Single Linkage
hc1 <- hclust(euclidian_dist, method = "single" )


fviz_dend(hc1, k = 3,
          cex = 0.5, # label size
          k_colors = c("#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE, # Add rectangle around groups
          rect_border = c("#00AFBB", "#E7B800", "#FC4E07"),
          rect_fill = TRUE)

Complete Linkage

# Hierarchical clustering using Complete Linkage
hc2 <- hclust(euclidian_dist, method = "complete" )

fviz_dend(hc2, k = 3,
          cex = 0.5, # label size
          k_colors = c("#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE, # Add rectangle around groups
          rect_border = c("#00AFBB", "#E7B800", "#FC4E07"),
          rect_fill = TRUE)

Average Linkage

# Hierarchical clustering using Average Linkage
hc3 <- hclust(euclidian_dist, method = "average" )

fviz_dend(hc3, k = 3,
          cex = 0.5, # label size
          k_colors = c("#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE, # Add rectangle around groups
          rect_border = c("#00AFBB", "#E7B800", "#FC4E07"),
          rect_fill = TRUE)

Ward Linkage

# Hierarchical clustering using Ward Linkage
hc4 <- hclust(euclidian_dist, method = "ward.D2" )

fviz_dend(hc4, k = 3,
          cex = 0.5, # label size
          k_colors = c("#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE, # Add rectangle around groups
          rect_border = c("#00AFBB", "#E7B800", "#FC4E07"),
          rect_fill = TRUE)

Comparing all linkage methods

m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")

# function to compute coefficient
ac <- function(x) {
  agnes(seed, method = x)$ac
}

map_dbl(m, ac)
##   average    single  complete      ward 
## 0.8649547 0.6055379 0.9231512 0.9846264

The closer the agglomerative coefficient (ac) is to 1, the stronger the clustering structure found.
In this case, Ward’s method yields the best clustering, so we choose Ward’s method as the final hierarchical clustering model.
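
To complement the agglomerative coefficient, one could also compare the 3-cluster cuts of all four trees against the true varieties using the adjusted Rand index (a sketch; higher is better):

sapply(list(single = hc1, complete = hc2, average = hc3, ward = hc4),
       function(h) adjustedRandIndex(Class, cutree(h, k = 3)))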

Final Hierarchical Clusters

# creating the final model by cutting the Ward tree into 3 clusters
seed.3clust <- cutree(hc4, k = 3)
table(Class, seed.3clust)
##      seed.3clust
## Class  1  2  3
##     1 64  4  2
##     2  4 66  0
##     3  5  0 65
adjustedRandIndex(Class, seed.3clust)
## [1] 0.7969983
fviz_cluster(list(data = seed, cluster = seed.3clust))

aggregate(seed,by = list(seed.3clust),FUN = mean)
##   Group.1       Area  Perimeter Compactness     Length       Width      A_Coef
## 1       1 -0.2228693 -0.2494138   0.3466797 -0.3392301 -0.08512438 -0.72363072
## 2       2  1.2110889  1.2145429   0.5671502  1.1954000  1.12789917 -0.04059959
## 3       3 -1.0224890 -0.9971761  -0.9702707 -0.8793165 -1.08565466  0.83085096
##   Length_groove
## 1    -0.6549463
## 2     1.2397237
## 3    -0.5816354

Gaussian Mixture Model Clustering

mclust_result = Mclust(seed, G = 3)
summary(mclust_result)
## ---------------------------------------------------- 
## Gaussian finite mixture model fitted by EM algorithm 
## ---------------------------------------------------- 
## 
## Mclust VEV (ellipsoidal, equal shape) model with 3 components: 
## 
##  log-likelihood   n df      BIC      ICL
##        286.9906 210 95 66.00606 61.46508
## 
## Clustering table:
##  1  2  3 
## 76 69 65
plot(mclust_result)

table(Class, mclust_result$classification)
##      
## Class  1  2  3
##     1 62  4  4
##     2  5 65  0
##     3  9  0 61
adjustedRandIndex(Class, mclust_result$classification)
## [1] 0.7115291
aggregate(seed,by = list(mclust_result$classification),FUN = mean)
##   Group.1       Area Perimeter Compactness     Length       Width      A_Coef
## 1       1 -0.2002911 -0.224837   0.3781574 -0.3000798 -0.05186882 -0.39638520
## 2       2  1.2337589  1.235022   0.5859183  1.1966159  1.15320886 -0.09004604
## 3       3 -1.0754960 -1.048137  -1.0641281 -0.9193913 -1.16352893  0.55905310
##   Length_groove
## 1    -0.6031837
## 2     1.2358422
## 3    -0.6066331
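
Here we fixed G = 3 because the true number of varieties is known. As an optional extension, Mclust can select the number of components itself by BIC; a minimal sketch:

# fit models with 1 to 9 components and pick the best by BIC
mclust_bic <- Mclust(seed, G = 1:9)
summary(mclust_bic$BIC)
plot(mclust_bic, what = "BIC")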