We will analyze data on various characteristics of wheat seeds and cluster the kernels into groups using several clustering techniques.
The dataset is taken from Kaggle (data source: Wheat Seed).
The examined group comprised kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each, randomly selected for the experiment. A high-quality visualization of the internal kernel structure was obtained using a soft X-ray technique, which is non-destructive and considerably cheaper than more sophisticated imaging techniques such as scanning microscopy or laser technology. The images were recorded on 13x18 cm X-ray KODAK plates. Studies were conducted using combine-harvested wheat grain originating from experimental fields, examined at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.
The objective is to demonstrate, in R, how to analyze the relevant wheat seed data and cluster the kernels into groups.
We are importing the Wheat seed (kernels) data and performing preliminary analysis.
library(rmarkdown)
library(ggplot2)
library(GGally)
library(grid)
library(gridExtra)
library(corrplot)
library(tidyverse)
library(tidyr)
library(caret)
library(knitr)
library(mgcv)
library(dplyr)
library(ROCR)
library(factoextra)
library(fpc)
library(cluster)
library(mclust)
set.seed(12345)
seed <- read.csv("~/Seed_Data.csv")
dim(seed)
## [1] 210 8
str(seed)
## 'data.frame': 210 obs. of 8 variables:
## $ A : num 15.3 14.9 14.3 13.8 16.1 ...
## $ P : num 14.8 14.6 14.1 13.9 15 ...
## $ C : num 0.871 0.881 0.905 0.895 0.903 ...
## $ LK : num 5.76 5.55 5.29 5.32 5.66 ...
## $ WK : num 3.31 3.33 3.34 3.38 3.56 ...
## $ A_Coef: num 2.22 1.02 2.7 2.26 1.35 ...
## $ LKG : num 5.22 4.96 4.83 4.8 5.17 ...
## $ target: int 0 0 0 0 0 0 0 0 0 0 ...
There are 210 records of kernels and 8 variables.
We will convert the variable target to a factor and rename the columns to something more meaningful.
seed$target <- seed$target + 1
seed$target <- factor(seed$target)
names(seed)[names(seed) == "A"] <- "Area"
names(seed)[names(seed) == "P"] <- "Perimeter"
names(seed)[names(seed) == "C"] <- "Compactness"
names(seed)[names(seed) == "LK"] <- "Length"
names(seed)[names(seed) == "WK"] <- "Width"
names(seed)[names(seed) == "LKG"] <- "Length_groove"
Here is the data.
paged_table(seed)
We are checking if there are any missing values in any of the variables.
colSums(is.na(seed))
## Area Perimeter Compactness Length Width
## 0 0 0 0 0
## A_Coef Length_groove target
## 0 0 0
print(paste(sum(complete.cases(seed)),"Complete cases!"))
## [1] "210 Complete cases!"
any(is.null(seed))
## [1] FALSE
We see that there are no missing values in any of the variables and there are 210 complete cases. Also, there are no invalid or spurious data in the dataset.
We are checking if there are any negative values in any of the variables.
sum(seed[, 1:7] < 0)
## [1] 0
We see that there are no negative values in any of the variables.
We are checking for duplicate records in the dataset, as duplicates would be redundant and would need to be removed.
dim(unique(seed))[1]
## [1] 210
There are no duplicate records in the dataset.
Variable | Type | Description |
---|---|---|
Area | double | area of the kernel |
Perimeter | double | perimeter of the kernel |
Compactness | double | compactness of the kernel, equal to 4πA/P² |
Length | double | length of kernel |
Width | double | width of kernel |
A_Coef | double | asymmetry coefficient |
Length_groove | double | length of kernel groove |
target | factor | variety of kernel - Kama, Rosa, or Canadian, stored as 1, 2, and 3 |
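As a quick sanity check (a sketch, not part of the original analysis), the Compactness column can be compared against the value implied by Area and Perimeter; if the stated formula holds, the differences should be close to zero.
# Compare Compactness with 4*pi*Area/Perimeter^2
summary(4 * pi * seed$Area / seed$Perimeter^2 - seed$Compactness)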
We examine the pairwise correlations among the seven numeric variables:
corrplot(cor(seed[,1:7]), type = "lower", method = "number")
There are very few outliers in some of the variables.
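One way to inspect this claim visually is with boxplots of the unscaled measurements; this is a minimal sketch using base R, not part of the original code.
# Boxplots of the seven numeric variables; points beyond the whiskers are potential outliers
boxplot(seed[, 1:7], las = 2, main = "Kernel measurements")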
Scale data to have zero mean and unit variance for each column:
Class <- factor(seed$target, levels = 1:3)
seed <- scale(seed[,-8])
To begin, the distance between each pair of records in the dataset is measured and visualized:
distance <- get_dist(seed)
fviz_dist(distance, gradient = list(low = "#00AFBB", mid = "white", high = "#FC4E07"))
K-means solutions with k = 2, 3, 4, and 5 are fitted and the resulting clusters visualized with fviz_cluster to see how the data points are grouped.
k2 <- kmeans(seed, centers = 2, nstart = 25)
k3 <- kmeans(seed, centers = 3, nstart = 25)
k4 <- kmeans(seed, centers = 4, nstart = 25)
k5 <- kmeans(seed, centers = 5, nstart = 25)
p1 <- fviz_cluster(k2, geom = "point", data = seed) + ggtitle("k = 2")
p2 <- fviz_cluster(k3, geom = "point", data = seed) + ggtitle("k = 3")
p3 <- fviz_cluster(k4, geom = "point", data = seed) + ggtitle("k = 4")
p4 <- fviz_cluster(k5, geom = "point", data = seed) + ggtitle("k = 5")
grid.arrange(p1, p2, p3, p4, nrow = 2)
# Within-groups sum of squares for k = 1 is the total sum of squares of the scaled data
wss <- (nrow(seed) - 1) * sum(apply(seed, 2, var))
# Total within-cluster sum of squares for k = 2 to 12
for (i in 2:12) wss[i] <- sum(kmeans(seed, centers = i)$withinss)
plot(1:12, wss, type = "b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
We see a bend (elbow) at k=3, therefore, 3 is the optimal number of clusters.
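As a cross-check, the same elbow curve can be produced directly with factoextra (already loaded); this is a sketch, not part of the original code.
fviz_nbclust(seed, kmeans, method = "wss")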
d = dist(seed, method = "euclidean")
# For k = 2 to 15, record the average silhouette width and Dunn index of the k-means solution
result = matrix(nrow = 14, ncol = 3)
for (i in 2:15) {
cluster_result = kmeans(seed, i)
clusterstat = cluster.stats(d, cluster_result$cluster)
result[i - 1,1] = i
result[i - 1,2] = clusterstat$avg.silwidth
result[i - 1,3] = clusterstat$dunn
}
plot(result[,c(1,2)], type = "l", ylab = 'silhouette width', xlab = 'number of clusters')
The silhouette width measures how close each point is to its own cluster relative to the nearest neighbouring cluster; the higher the average silhouette value, the better the clustering. We see that k = 2 is the best and k = 3 is the next best.
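To look more closely at the k = 3 solution, per-observation silhouette widths can be inspected. This is a minimal sketch (not part of the original analysis) using cluster::silhouette and factoextra::fviz_silhouette, both already loaded, and the distance matrix d computed above.
# Silhouette widths of the k = 3 k-means solution
sil <- silhouette(k3$cluster, d)
summary(sil)$clus.avg.widths # average silhouette width per cluster
fviz_silhouette(sil)         # silhouette plot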
plot(result[,c(1,3)], type = "l", ylab = 'dunn index', xlab = 'number of clusters')
The Dunn index is the ratio of the smallest between-cluster distance to the largest within-cluster diameter; the higher the Dunn index, the better the clustering. By this measure, k = 3 is the optimal number of clusters.
plotcluster(seed, k3$cluster)
table(Class, k3$cluster)
##
## Class 1 2 3
## 1 6 62 2
## 2 0 5 65
## 3 66 4 0
adjustedRandIndex(Class, k3$cluster)
## [1] 0.7732937
k3$centers
## Area Perimeter Compactness Length Width A_Coef
## 1 -1.0277967 -1.0042491 -0.9626050 -0.8955451 -1.082995635 0.69314821
## 2 -0.1407831 -0.1696372 0.4485346 -0.2571999 0.001643014 -0.66034079
## 3 1.2536860 1.2589580 0.5591283 1.2349319 1.162075101 -0.04511157
## Length_groove
## 1 -0.6233191
## 2 -0.5844965
## 3 1.2892273
Our goal here is to develop clusters using Hierarchical Clustering.
fviz_nbclust(seed, FUN = hcut, method = "wss")
We see a bend (elbow) at k=3, therefore, 3 is the optimal number of clusters.
fviz_nbclust(seed, FUN = hcut, method = "silhouette")
The higher the Silhouette value, the better the clustering results. We see that k=2 is the best and k=3 is the next best.
euclidian_dist <- dist(seed, method = "euclidean")
# Hierarchical clustering using Single Linkage
hc1 <- hclust(euclidian_dist, method = "single" )
fviz_dend(hc1, k = 3,
          cex = 0.5,                # label size
          k_colors = c("#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE,              # add rectangle around groups
          rect_border = c("#00AFBB", "#E7B800", "#FC4E07"),
          rect_fill = TRUE)
# Hierarchical clustering using Complete Linkage
hc2 <- hclust(euclidian_dist, method = "complete" )
fviz_dend(hc2, k = 3,
          cex = 0.5,                # label size
          k_colors = c("#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE,              # add rectangle around groups
          rect_border = c("#00AFBB", "#E7B800", "#FC4E07"),
          rect_fill = TRUE)
# Hierarchical clustering using Average Linkage
hc3 <- hclust(euclidian_dist, method = "average" )
fviz_dend(hc3, k = 3,
          cex = 0.5,                # label size
          k_colors = c("#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE,              # add rectangle around groups
          rect_border = c("#00AFBB", "#E7B800", "#FC4E07"),
          rect_fill = TRUE)
# Hierarchical clustering using Ward Linkage
hc4 <- hclust(euclidian_dist, method = "ward.D2" )
fviz_dend(hc4, k = 3,
          cex = 0.5,                # label size
          k_colors = c("#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE,              # add rectangle around groups
          rect_border = c("#00AFBB", "#E7B800", "#FC4E07"),
          rect_fill = TRUE)
m <- c( "average", "single", "complete", "ward")
names(m) <- c( "average", "single", "complete", "ward")
# function to compute coefficient
ac <- function(x) {
agnes(seed, method = x)$ac
}
map_dbl(m, ac)
## average single complete ward
## 0.8649547 0.6055379 0.9231512 0.9846264
The closer the agglomerative coefficient (ac) is to 1, the stronger the clustering structure found by the model.
In this case, Ward’s method yields the best results of clustering so we choose Ward’s as the final hierarchical clustering model.
# Creating the final model: cut the Ward dendrogram into 3 clusters
seed.3clust = cutree(hc4,k = 3)
table(Class, seed.3clust)
## seed.3clust
## Class 1 2 3
## 1 64 4 2
## 2 4 66 0
## 3 5 0 65
adjustedRandIndex(Class, seed.3clust)
## [1] 0.7969983
fviz_cluster(list(data = seed, cluster = seed.3clust))
aggregate(seed,by = list(seed.3clust),FUN = mean)
## Group.1 Area Perimeter Compactness Length Width A_Coef
## 1 1 -0.2228693 -0.2494138 0.3466797 -0.3392301 -0.08512438 -0.72363072
## 2 2 1.2110889 1.2145429 0.5671502 1.1954000 1.12789917 -0.04059959
## 3 3 -1.0224890 -0.9971761 -0.9702707 -0.8793165 -1.08565466 0.83085096
## Length_groove
## 1 -0.6549463
## 2 1.2397237
## 3 -0.5816354
Finally, we fit a Gaussian finite mixture model (model-based clustering) with the mclust package, fixing the number of components at 3.
mclust_result = Mclust(seed, G = 3)
summary(mclust_result)
## ----------------------------------------------------
## Gaussian finite mixture model fitted by EM algorithm
## ----------------------------------------------------
##
## Mclust VEV (ellipsoidal, equal shape) model with 3 components:
##
## log-likelihood n df BIC ICL
## 286.9906 210 95 66.00606 61.46508
##
## Clustering table:
## 1 2 3
## 76 69 65
plot(mclust_result)
table(Class, mclust_result$classification)
##
## Class 1 2 3
## 1 62 4 4
## 2 5 65 0
## 3 9 0 61
adjustedRandIndex(Class, mclust_result$classification)
## [1] 0.7115291
aggregate(seed,by = list(mclust_result$classification),FUN = mean)
## Group.1 Area Perimeter Compactness Length Width A_Coef
## 1 1 -0.2002911 -0.224837 0.3781574 -0.3000798 -0.05186882 -0.39638520
## 2 2 1.2337589 1.235022 0.5859183 1.1966159 1.15320886 -0.09004604
## 3 3 -1.0754960 -1.048137 -1.0641281 -0.9193913 -1.16352893 0.55905310
## Length_groove
## 1 -0.6031837
## 2 1.2358422
## 3 -0.6066331
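The summary above reports the BIC of the three-component fit. As an aside (a sketch, not part of the original analysis), mclust can also choose the number of components itself by comparing BIC across candidate models.
# Fit mixture models over a range of component counts and compare their BIC values
bic <- mclustBIC(seed)
summary(bic) # top models ranked by BIC
plot(bic)    # BIC curves across models and numbers of components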