Unsupervised Learning: Wheat Seed Analysis

Introduction to Wheat

Wheat is a grass widely cultivated for its seed, a cereal grain which is a worldwide staple food. Wheat is grown on more land area than any other food crop (220.4 million hectares, 2014). World trade in wheat is greater than for all other crops combined. In 2016, world production of wheat was 749 million tonnes, making it the second most-produced cereal after maize.

Since 1960, world production of wheat and other grain crops has tripled and is expected to grow further through the middle of the 21st century. Global demand for wheat is increasing due to the unique viscoelastic and adhesive properties of gluten proteins, which facilitate the production of processed foods, whose consumption is increasing as a result of the worldwide industrialization process and the westernization of the diet.

In this article, I will demonstrate an unsupervised learning analysis using Wheat Seed dataset from UCI Machine Learning. The analysis includes clustering using K-means algorithm and dimensionality reduction using principal component analysis (PCA).

Through this analysis, I want to evaluate the possibility of performing clustering to produce a new label for the dataset and the possibility of dimensionality reduction using PCA. Additionally, I will also analyze the pattern of the data to obtain insights by combining PCA and clustering.

Solution

Import Library

library(dplyr)
library(tidyr)
library(GGally)
library(gridExtra)
library(factoextra)
library(FactoMineR)
library(plotly)
source("ncluster.R")
source("catnum.R")

Read Data

The dataset were obtained from UCI Machine learning Repository about measurements of geometrical properties of kernels belonging to three different varieties of wheat: Kama, Rosa and Canadian, 70 elements each. The kernels were randomly selected for the experiment conducted at the Institute of Agrophysics of the Polish Academy of Sciences in Lublin.

seed <- read.csv("Seed_Data.csv")
glimpse(seed)

## Observations: 210
## Variables: 8
## $ A      <dbl> 15.26, 14.88, 14.29, 13.84, 16.14, 14.38, 14.69, 14.11,...
## $ P      <dbl> 14.84, 14.57, 14.09, 13.94, 14.99, 14.21, 14.49, 14.10,...
## $ C      <dbl> 0.8710, 0.8811, 0.9050, 0.8955, 0.9034, 0.8951, 0.8799,...
## $ LK     <dbl> 5.763, 5.554, 5.291, 5.324, 5.658, 5.386, 5.563, 5.420,...
## $ WK     <dbl> 3.312, 3.333, 3.337, 3.379, 3.562, 3.312, 3.259, 3.302,...
## $ A_Coef <dbl> 2.2210, 1.0180, 2.6990, 2.2590, 1.3550, 2.4620, 3.5860,...
## $ LKG    <dbl> 5.220, 4.956, 4.825, 4.805, 5.175, 4.956, 5.219, 5.000,...
## $ target <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

In the data, seven geometric parameters of wheat kernels were measured:

A: area
P: perimeter
C: compactness ($4*pi*A/P^2$),
LK: length of kernel
WK: width of kernel
A_Coef: asymmetry coefficient
LKG: length of kernel groove
target: kernel type

Data Wrangling

anyNA(seed)

## [1] FALSE

names(seed) <- c("Area", "Perimeter", "Compactness", "Length", "Width", "Asymetry.coef", "Grove.length", "Type")
seed$Type <- as.factor(seed$Type)
glimpse(seed)

## Observations: 210
## Variables: 8
## $ Area          <dbl> 15.26, 14.88, 14.29, 13.84, 16.14, 14.38, 14.69,...
## $ Perimeter     <dbl> 14.84, 14.57, 14.09, 13.94, 14.99, 14.21, 14.49,...
## $ Compactness   <dbl> 0.8710, 0.8811, 0.9050, 0.8955, 0.9034, 0.8951, ...
## $ Length        <dbl> 5.763, 5.554, 5.291, 5.324, 5.658, 5.386, 5.563,...
## $ Width         <dbl> 3.312, 3.333, 3.337, 3.379, 3.562, 3.312, 3.259,...
## $ Asymetry.coef <dbl> 2.2210, 1.0180, 2.6990, 2.2590, 1.3550, 2.4620, ...
## $ Grove.length  <dbl> 5.220, 4.956, 4.825, 4.805, 5.175, 4.956, 5.219,...
## $ Type          <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...

Exploratory Data Analysis

1. Clustering Opportunity

# Inspecting the difference between numerical variable based on Kernel Type
a <- catnum(seed, "Area", "Type")
b <- catnum(seed, "Perimeter", "Type")
c <- catnum(seed, "Compactness", "Type")
d <- catnum(seed, "Length", "Type")
e <- catnum(seed, "Width", "Type")
f <- catnum(seed, "Asymetry.coef", "Type")
g <- catnum(seed, "Grove.length", "Type")

grid.arrange(a,b,c,d,e,f,g, ncol=3)

From the barplot, we can see that in general, Kernel type 1 has the highest value on most of the variables. Even so, There was a slight difference when we use Asymetry.coef, whereas the value of Kernel 0 < 1 < 2. Compactness also reveals a different result whereas the value of Kernel 0 almost equal to Kernel 1 and then followed by Kernel 2.

By having a distinct separation of values between each type of Kernel, this data might be suitable for clustering using K-means. Although the data already contains labels of the type of Kernels, clustering will still be used to prove whether the geometrical properties of Kernels can be used to produce clusters that resemble Kernel’s type. If not, can we produce new clusters based on those geometrical properties? Additionally, visualization and profiling of the clustering result will be performed to gain valuable insights.

2. Dimensionality Reduction Opportunity

Here we will explore the data distribution of each numeric variable using density plot and the correlation between each variable using scatterplot which were provided within ggpairs function from GGally package.

ggpairs(seed[,c(1:7)], showStrips = F) + 
  theme(axis.text = element_text(colour = "black", size = 11),
        strip.background = element_rect(fill = "#d63d2d"),
        strip.text = element_text(colour = "white", size = 12,
                                  face = "bold"))

It can be seen that there is a strong correlation between some variables from the data, including Area, Perimeter, Length, and Width. This result indicates that this dataset has multicollinearity and might not be suitable for various classification algorithms (which have non-multicollinearity as their assumption).

Principal Component Analysis can be performed for this data to produce non-multicollinearity data, while also reducing the dimension of the data and retaining as much as information possible. The result of this analysis can be utilized further for classification purpose with lower computation.

Data Pre-processing

# scaling data
seed_z <- scale(seed[,-8])

UL: Clustering

# obtain k optimum
wss(seed_z)

bss(seed_z)

btratio(seed_z)

From the plots, we can see that 3 is the optimum number of K. After k=3, increasing the number of K does not result in a considerable decrease of the total within sum of squares (strong internal cohesion) nor a considerable increase of between sum of square and between/total sum of squares ratio (maximum external separation).

# k-means clustering
set.seed(50)
seed_k <- kmeans(seed_z, 3)

# result analysis
seed_k

## K-means clustering with 3 clusters of sizes 67, 72, 71
## 
## Cluster means:
##         Area  Perimeter Compactness     Length        Width Asymetry.coef
## 1  1.2536860  1.2589580   0.5591283  1.2349319  1.162075101   -0.04511157
## 2 -1.0277967 -1.0042491  -0.9626050 -0.8955451 -1.082995635    0.69314821
## 3 -0.1407831 -0.1696372   0.4485346 -0.2571999  0.001643014   -0.66034079
##   Grove.length
## 1    1.2892273
## 2   -0.6233191
## 3   -0.5844965
## 
## Clustering vector:
##   [1] 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [36] 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 3 2 3 3 3 3 3 2
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 3 1 1 3 1 3 3 1
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2
## [176] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 2 3 2 2 2 2 2 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 139.5542 144.5954 144.4586
##  (between_SS / total_SS =  70.7 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

From the clustering result, it is very interesting that the size of each cluster is not proportional to one another (cluster 1: 67; cluster 2: 72; cluster 3: 71), revealing a slightly different result than the true observations from each Kernel type (70 each).

This indicates that there might be Kernels with similar geometrical properties which originate from different type/species. In the following, I will try to compare the clustering result (clustering vector) with the actual label to see how many observations fall into a different class.

seed$Type

##   [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [36] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [176] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## Levels: 0 1 2

seed_k$cluster

##   [1] 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3
##  [36] 3 3 1 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 2 2 3 2 3 3 3 3 3 2
##  [71] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [106] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1 1 1 1 3 1 1 3 1 3 3 1
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2
## [176] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2 3 2 3 2 2 2 2 2 2 2 2

7+5+4 # total of miss-class

## [1] 16

(16/210)*100 # percentage of miss-class from the data

## [1] 7.619048

From the calculation, 16 observations (7.6% of observations) fall into the different class than it should be. This indicates that the geometrical properties of Kernels alone are not sufficient enough to obtain a clustering that resembles Kernels types. Additional properties such as genetic and metabolites properties of each Kernel might be needed to overcome this problem.

# additional information
seed_k$tot.withinss # total within sum of squares

## [1] 428.6082

seed_k$betweenss # between sum of squares

## [1] 1034.392

seed_k$totss # total sum of squares

## [1] 1463

seed_k$iter # number of iteration needed to obtain optimum clustering

## [1] 3

Nevertheless, new cluster can be made using this dataset and these new clusters also have different characteristics owned by each cluster. Visualization and profiling of cluster results can give us additional information about each clusters which can be useful for us from a business perspective.

To visualize the result of K-means clustering we can use various functions from factoextra package or by combining it with PCA. This time will use factoextra package (I will combine the result with PCA in a later section).

# data preparation for visualization & profiling
seed$cluster <- as.factor(seed_k$cluster)
seed

# clustering visualization
fviz_cluster(object = seed_k, 
             data = seed_z)

# cluster profiling
seed %>%
  group_by(cluster) %>% 
  summarise_all(.funs = "mean") %>% 
  select(-Type)

From the profiling result, we can derive insights:

Cluster 1: consist mostly of kernels with the highest value on their geometrical attributes.
Cluster 2: consist mostly of kernels with the lowest value on their geometrical attributes but highest in their asymetry coefficient.
Cluster 3: consist mostly of kernels with the middle value on their geometrical attributes.

These characteristics from each cluster can also be visualized by combining clustering and PCA in the later section.

UL: Principal Component Analysis

# PCA
seed_pca <- PCA(seed[,1:7], graph = F)
seed_pca$eig # analyze cumulative variance of each PC

##          eigenvalue percentage of variance
## comp 1 5.0312011860            71.87430266
## comp 2 1.1975728470            17.10818353
## comp 3 0.6780034386             9.68576341
## comp 4 0.0683644770             0.97663539
## comp 5 0.0187136090             0.26733727
## comp 6 0.0053320457             0.07617208
## comp 7 0.0008123968             0.01160567
##        cumulative percentage of variance
## comp 1                          71.87430
## comp 2                          88.98249
## comp 3                          98.66825
## comp 4                          99.64488
## comp 5                          99.91222
## comp 6                          99.98839
## comp 7                         100.00000

Through PCA, I can retain some informative principal components (high in cumulative variance) from Kernels dataset to perform dimensionality reduction. By doing this, I can reduce the dimension of the dataset while also retaining as much information as possible.

In this study, I want to retain at least 90% of the information from our data. From the PCA summary (seed_pca$eig), I picked PC1-PC3 from a total of 7 PC. By doing this, I was able to reduce ~57% of dimension from my original data while retaining 98.7% of the information from the data.

We can extract the values of PC1-PC3 from all of the observations and put it into a new data frame. This data frame can later be analyzed using supervised learning classification technique or other purposes.

# making a new data frame from PCA result
seed_x <- data.frame(seed_pca$ind$coord[,1:3])
# another way: 
# seed_x <- PCA(seed[,1:7], graph = F, ncp = 3)$ind$coord

seed_xx <- cbind(seed_x, Type = seed$Type)
seed_xx

Combining Clustering and PCA

From the previous section, we have discussed that PCA can be combined with clustering to obtain better visualization of our clustering result, or simply to understand the pattern in our dataset. This can be done by using a biplot, a common plot in PCA to visualize high dimensional data using PC1 and PC2 as the axes.

We can use plot.PCA to visualize a PCA object with added arguments for customization.

# analysis of clustering result
par(mfcol=c(1,2)) # graphical parameter to arrange plots
plot.PCA(x = seed_pca, choix = "ind", label = "quali", col.ind = seed$Type, title = "Colored by Type")
plot.PCA(x = seed_pca, choix = "ind", label = "quali", col.ind = seed$cluster, title = "Colored by Cluster")

The plots above are examples of individual factor map of a biplot. The points in the plot resemble observations and colored by their Type (original Kernel type) and Cluster (Kernel by clustering result). Dim1 and Dim2 are PC1 and PC2 respectively, with their own share (percentage) of information from the total information of the dataset.

From the biplot, we can clearly see in the Colored by Type plot, some observations from different clusters were located really close with one another and an overlapping view of clusters can be seen. Meanwhile, in the Colored by Cluster plot, we can see that the clusters separate nicely without overlapping view of clusters.

This visualization supports the assumption made during clustering result analysis, which was, “..there might be Kernels with similar geometrical properties which originate from different type/species. This indicates that the geometrical properties of Kernels alone are not sufficient enough to obtain a clustering that resembles Kernels types.”

After this, I will focus on the interpretation of biplots which observations were colored based on clusters that we have made before.

# analysis of biplot
par(mfcol=c(1,2)) # graphical parameter to arrange plots
plot.PCA(x = seed_pca, choix = "ind", label = "quali", col.ind = seed$cluster, title = "Colored by Cluster")
legend(x = "topright", levels(seed$cluster), pch=19, col=1:4)
plot.PCA(x = seed_pca, choix = "var", title = "Variable Factor Map")

Some insights from the plots are:

The dataset can be clustered according to their geometrical properties.
There are no considerable outliers in data.
Variables that were highly contributed to PC1 are Groove.length, Length, Perimeter, Area, Width.
Variables that were highly contributed to PC2 are Asymmetry.coef and Compactness.
Variables Asymmetry.coef possibly has a negative correlation with Compactness.
Variables Asymmetry.coef and Compactness possibly have a low correlation with Grove.length and other geometrical variables which contribute to PC1.
The three clusters were very different from one another in terms of their score in variables that were highly contributed to PC1. Cluster with the highest score in those variables is cluster 1, followed by cluster 3, then cluster 2.
Additionally, what differentiates cluster 3 with cluster 1 & 2 is that its Asymmetry.coef value which was lower than the other 2 clusters. Notice that most of the observations in cluster 3 positioned below the 0 coordinates of PC2, whereas Asymmetry.coef was highly contributed to PC2.

Summary

From the unsupervised learning analysis above, we can summarize that:

K-means clustering can be done using this dataset, although, the clusters did not resemble Kernels types. Geometrical properties of Kernels alone are not sufficient enough to obtain a clustering that resembles Kernels types. Additional properties such as genetic and metabolites properties of each Kernel might be needed to obtain such clustering.
Dimensionality reduction can be performed using this dataset. To perform dimensionality reduction, we can pick PCs from a total of 7 PC according to the total information we want to retain. In this article, I used 3 PCs to reduce ~57% of dimension from my original data while retaining 98.7% of the information from the data.
The improved data set obtained from unsupervised learning (eg.PCA) can be utilized further for supervised learning (classification) or for better data visualization (high dimensional data) with various insights.

Side note:

For classification purpose, I will still use Type as my target class instead of Cluster (which I obtained from clustering), if I want to classify Kernels based on the Type/Species.

It will be another story though, If your actually don’t care about the Kernel’s Type, and only want to classify Kernels based on its size/geometrical properties. This can happen in many industrial situations. For example when you want to obtain Kernels with bigger size for industrial/production purposes.

Bonus: 3D-plot Visualization for Multidimentional Data

seed_xc <- cbind(seed_x, cluster = seed$cluster)

plot_ly(seed_xc, x = ~Dim.1, y = ~Dim.2, z = ~Dim.3, color = ~cluster, colors = c('black', 'red', 'green')) %>%
  add_markers() %>%
  layout(scene = list(xaxis = list(title = 'Dim.1'),
                     yaxis = list(title = 'Dim.2'),
                     zaxis = list(title = 'Dim.3')))