INTRODUCTION

This study applies two unsupervised learning approaches, dimension reduction and clustering, to a real-life dataset. Dimension reduction is used to make the subsequent cluster analysis faster and more efficient. In this project, principal component analysis (PCA) and the K-means algorithm are used for dimension reduction and cluster analysis, respectively.

DATA TYPE

Description of the Dataset

The variables in the dataset are geometric features of wheat seeds from three wheat varieties: Kama, Rosa and Canadian. The variety of each seed is recorded in the seedtype variable. There are 70 observations for each variety.

Each variable is represented as follows:

  • Seedtype: Represents the Kama, Rosa and Canadian types as 1, 2 and 3, respectively.

  • Area(A): Area of each wheat seed.

  • Perimeter(P): Perimeter value of each wheat seed.

  • Compactness(C): Compactness value of each wheat seed, calculated as C = 4 * pi * A / P^2.

  • Lengthofkernel: Length of kernel.

  • Widthofkernel: Width of kernel.

  • Asymmetrycoefficient: Asymmetry coefficient.

  • Lengthofkernelgroove: Length of kernel groove.
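As a quick check of the compactness formula, the following sketch recomputes C for the first observation of the dataset (area 15.26, perimeter 14.84). The helper name compactness is hypothetical and not part of the original analysis. Small differences from the stored value are expected because the inputs are rounded.

```r
# Hypothetical helper (not part of the original analysis):
# compactness C = 4 * pi * A / P^2, which equals 1 for a perfect circle.
compactness <- function(area, perimeter) {
  4 * pi * area / perimeter^2
}

# First observation of the seeds data: area = 15.26, perimeter = 14.84
c_first <- compactness(15.26, 14.84)
print(round(c_first, 3))

# Sanity check: a circle of radius 1 has area pi and perimeter 2 * pi
c_circle <- compactness(pi, 2 * pi)
print(c_circle)
```

The recomputed value is about 0.871, in line with the 0.8710 stored in the compactness column for that observation.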

Data Set Characteristics: Multivariate

Attribute Characteristics: Real

Number of Instances: 210

Number of Attributes: 7

Missing Values: No missing values

The dataset is stored as a tab-delimited text file (*.txt) and is imported into R as the data frame seeds, using the function shown below:

seeds <- read.delim("seeds_dataset.txt", stringsAsFactors = FALSE)
head(seeds)
##   seedtype  area perimeter compactness lengthofkernel widthofkernel
## 1        1 15.26     14.84      0.8710          5.763         3.312
## 2        1 14.88     14.57      0.8811          5.554         3.333
## 3        1 14.29     14.09      0.9050          5.291         3.337
## 4        1 13.84     13.94      0.8955          5.324         3.379
## 5        1 16.14     14.99      0.9034          5.658         3.562
## 6        1 14.38     14.21      0.8951          5.386         3.312
##   asymmetrycoefficient lengthofkernelgroove
## 1                2.221                5.220
## 2                1.018                4.956
## 3                2.699                4.825
## 4                2.259                4.805
## 5                1.355                5.175
## 6                2.462                4.956
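The import step can be followed by a few sanity checks. The sketch below is only an illustration: it writes the first three rows shown above to a temporary tab-delimited file that stands in for seeds_dataset.txt, reads it back with read.delim, and runs checks that could equally be applied to the full data frame. The object names (tmp, sample_seeds and so on) are hypothetical.

```r
# Write a few rows copied from the head(seeds) output to a temporary
# tab-delimited file, standing in for seeds_dataset.txt.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  paste("seedtype", "area", "perimeter", "compactness", "lengthofkernel",
        "widthofkernel", "asymmetrycoefficient", "lengthofkernelgroove",
        sep = "\t"),
  "1\t15.26\t14.84\t0.8710\t5.763\t3.312\t2.221\t5.220",
  "1\t14.88\t14.57\t0.8811\t5.554\t3.333\t1.018\t4.956",
  "1\t14.29\t14.09\t0.9050\t5.291\t3.337\t2.699\t4.825"
), tmp)

sample_seeds <- read.delim(tmp, stringsAsFactors = FALSE)

# Checks that could also be run on the full data frame after import
dims <- dim(sample_seeds)                             # rows, columns
n_missing <- sum(is.na(sample_seeds))                 # number of missing values
type_counts <- table(sample_seeds$seedtype)           # observations per seed type
all_numeric <- all(sapply(sample_seeds, is.numeric))  # every column numeric?

print(dims)
print(type_counts)
```

On the full dataset, the same checks would be expected to report 210 rows, no missing values and 70 observations per seed type, according to the description above.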

ANALYSIS

First, the relationships between the variables are analysed. According to the scatterplot matrix below, most of the variables are linearly related and positively correlated with each other. In particular, there is an almost perfect linear relationship between area and perimeter.

plot(seeds[,2:8])
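The visual impression from the scatterplot matrix can be quantified with a correlation matrix. The sketch below uses only the first six observations, copied from the head(seeds) output, so the numbers are illustrative and not representative of all 210 seeds. On the full data the same call would be cor(seeds[,2:8]). The name sample_seeds is hypothetical.

```r
# First six observations, copied from the head(seeds) output above
sample_seeds <- data.frame(
  area           = c(15.26, 14.88, 14.29, 13.84, 16.14, 14.38),
  perimeter      = c(14.84, 14.57, 14.09, 13.94, 14.99, 14.21),
  lengthofkernel = c(5.763, 5.554, 5.291, 5.324, 5.658, 5.386)
)

# Pearson correlation matrix; values near 1 indicate a strong
# positive linear relationship
cor_mat <- cor(sample_seeds)
print(round(cor_mat, 2))

r_area_perimeter <- cor_mat["area", "perimeter"]
```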

Principal Component Analysis (PCA)

In this step, PCA is performed on the dataset with centering and scaling. Scaling puts the variables on a comparable scale, so that no variable dominates because of its units.

mypr <- prcomp(seeds[,2:8], center = TRUE, scale. = TRUE)
summary(mypr)
## Importance of components:
##                           PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.2430 1.0943 0.82341 0.26147 0.13680 0.07302 0.02850
## Proportion of Variance 0.7187 0.1711 0.09686 0.00977 0.00267 0.00076 0.00012
## Cumulative Proportion  0.7187 0.8898 0.98668 0.99645 0.99912 0.99988 1.00000

As the scree plot below shows, the first three components explain 98.7% of the variance, whereas 85% of the variance is considered sufficient for this PCA.

library(factoextra)  # fviz_* and get_pca_* helper functions

fviz_eig(mypr, addlabels = TRUE, ylim = c(0, 75))
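The proportions reported by summary(mypr) follow directly from the component standard deviations. The sketch below recomputes them from the rounded values printed above. The helper n_needed is hypothetical and simply returns the smallest number of components whose cumulative proportion reaches a given threshold.

```r
# Standard deviations of the seven components, copied from summary(mypr)
sdev <- c(2.2430, 1.0943, 0.82341, 0.26147, 0.13680, 0.07302, 0.02850)

eigenvalues <- sdev^2                       # variance carried by each component
prop_var <- eigenvalues / sum(eigenvalues)  # proportion of variance
cum_var <- cumsum(prop_var)                 # cumulative proportion

print(round(rbind(prop_var, cum_var), 4))

# Hypothetical helper: smallest number of components whose cumulative
# proportion of variance reaches the threshold
n_needed <- function(threshold) which(cum_var >= threshold)[1]

print(n_needed(0.85))
print(n_needed(0.98))
```

Because the variables are scaled, the eigenvalues sum to roughly 7, the number of variables. The first two components already pass the 85% mark (cumulative proportion 0.8898), and the three components retained in this analysis reach 0.9867.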

The PCA results for the variables can also be extracted:

var <- get_pca_var(mypr)
var
## Principal Component Analysis Results for variables
##  ===================================================
##   Name       Description                                    
## 1 "$coord"   "Coordinates for the variables"                
## 2 "$cor"     "Correlations between variables and dimensions"
## 3 "$cos2"    "Cos2 for the variables"                       
## 4 "$contrib" "contributions of the variables"

Cos2 (squared cosine) measures how well each variable is represented on the factor map:

head(var$cos2, 4)
##                    Dim.1        Dim.2        Dim.3       Dim.4        Dim.5
## area           0.9939475 0.0008450341 0.0004537914 0.002563424 0.0007819319
## perimeter      0.9810106 0.0084506414 0.0024277403 0.005967849 0.0005683714
## compactness    0.3860875 0.3353216539 0.2688363174 0.007572511 0.0020708334
## lengthofkernel 0.9026272 0.0508079572 0.0304376025 0.004743335 0.0109831433
##                       Dim.6        Dim.7
## area           0.0009696239 4.386450e-04
## perimeter      0.0012093247 3.655035e-04
## compactness    0.0001069541 4.276371e-06
## lengthofkernel 0.0003990721 1.739726e-06

The plot of the cos2 values below shows that area, perimeter, lengthofkernel, widthofkernel and lengthofkernelgroove are the variables best represented by, and contributing most to, dimension 1.

library(corrplot)

corrplot(var$cos2, is.corr=FALSE)
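To make the meaning of cos2 and of the contributions concrete, the sketch below derives both from a prcomp fit using base R only. It is an illustration under stated assumptions: R's built-in iris measurements stand in for the seeds data, and the formulas (coordinates as loadings times standard deviations, contributions as cos2 normalised per component) are believed to match the definitions behind get_pca_var, but that has not been checked against the package source here.

```r
# Stand-in data: the four numeric columns of R's built-in iris dataset
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Variable coordinates: loadings scaled by the component standard deviations.
# With scaled data these equal the variable-component correlations.
coord <- sweep(pca$rotation, 2, pca$sdev, `*`)

cos2 <- coord^2                                      # quality of representation
contrib <- sweep(cos2, 2, colSums(cos2), `/`) * 100  # contribution in percent

print(round(cos2, 3))
print(round(contrib, 1))
```

Each row of cos2 sums to 1 across all components, which is also visible in the seeds output above: the area row adds up to 1 over Dim.1 to Dim.7. Within a single dimension, the ranking of variables by cos2 and by contribution is the same.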

Additionally, the PCA scores and loadings are shown together in the following biplot:

fviz_pca_biplot(mypr,
                # Fill individuals by groups
                geom.ind = "point",
                addEllipses = TRUE, label = "var",
                pointshape = 21,
                pointsize = 2.5,
                fill.ind = as.factor(seeds$seedtype),
                col.ind = "black",
                # Color each variable arrow separately
                col.var = factor(c("area", "perimeter", "compactness", "lengthofkernel",
                                   "widthofkernel", "asymmetrycoefficient", "lengthofkernelgroove")),
                legend.title = list(fill = "Type of seed", color = "Variables"),
                repel = TRUE        # Avoid label overplotting
)+ggpubr::fill_palette("jco")+      # Individual fill color
  ggpubr::color_palette("npg")      # Variable colors

A biplot is created to display the result of the PCA. Each variable that went into the PCA has an associated arrow, which points in the direction of increasing values of that variable. The PCA scores of each seed type can also be observed as groups of points, outlined by the ellipses.
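The two ingredients of a biplot, scores (the points) and loadings (the arrow directions), can be reproduced by hand. The following sketch again uses the built-in iris measurements as a stand-in for the seeds data, and the object names are hypothetical. It shows that the scores returned by prcomp are the standardised data projected onto the loadings.

```r
# Stand-in data: the four numeric columns of R's built-in iris dataset
X <- as.matrix(iris[, 1:4])
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Scores (points in the biplot): standardised data projected onto the loadings
Z <- scale(X, center = TRUE, scale = TRUE)
scores_manual <- Z %*% pca$rotation

# Loadings (directions of the arrows) for the first two components
loadings_12 <- pca$rotation[, 1:2]
print(round(loadings_12, 3))

# Largest absolute difference between manual scores and prcomp's scores
max_diff <- max(abs(scores_manual - pca$x))
print(max_diff < 1e-10)
```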

K-means

First, the scores of the first three principal components are selected and stored in a new data frame.

comp <- data.frame(mypr$x[,1:3])

In this section, the clusters are created using the Euclidean distance metric. First, the optimal number of clusters is determined. The average silhouette method suggests two clusters.

fviz_nbclust(comp, kmeans, method = "silhouette") + theme_classic()
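The average silhouette criterion can also be computed directly, which shows what the plot summarises. The sketch below is a minimal version under stated assumptions: PCA scores of the built-in iris data stand in for comp, the range of candidate cluster numbers (2 to 6) is arbitrary, and the names avg_sil and best_k are hypothetical. It uses silhouette from the cluster package, as in the Silhouette section of this analysis.

```r
library(cluster)  # silhouette()

set.seed(42)  # k-means starts from random centres

# Stand-in for comp: first three PCA scores of the built-in iris data
scores <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)$x[, 1:3]
d <- dist(scores)  # Euclidean distances between observations

# Average silhouette width for each candidate number of clusters
candidates <- 2:6
avg_sil <- sapply(candidates, function(n_clusters) {
  km <- kmeans(scores, centers = n_clusters, nstart = 25)
  mean(silhouette(km$cluster, d)[, "sil_width"])
})
names(avg_sil) <- candidates

print(round(avg_sil, 3))

# The suggested number of clusters maximises the average silhouette width
best_k <- candidates[which.max(avg_sil)]
print(best_k)
```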

According to the silhouette method, the number of clusters is set to 2, and the k-means algorithm is run on the first three principal components. The clustering can then be observed for each pair of the three components.

library(RColorBrewer)  # brewer.pal()
library(scales)        # alpha()

k <- kmeans(comp, 2, nstart=25, iter.max=1000)
palette(alpha(brewer.pal(9,'Set1'), 0.5))
plot(comp, col=k$cluster, pch=16)

These three components can also be viewed together in a 3D plot.

library(scatterplot3d)

scatterplot3d(comp[,1:3], pch=16, color = k$cluster,
              grid = TRUE, box = FALSE)

The components of the fitted k-means object are summarised below:

summary(k)
##              Length Class  Mode   
## cluster      210    -none- numeric
## centers        6    -none- numeric
## totss          1    -none- numeric
## withinss       2    -none- numeric
## tot.withinss   1    -none- numeric
## betweenss      1    -none- numeric
## size           2    -none- numeric
## iter           1    -none- numeric
## ifault         1    -none- numeric

The number of members in each cluster is shown below:

table(k$cluster)
## 
##   1   2 
## 133  77
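The components listed by summary(k) can be inspected individually. The sketch below is an illustration on stand-in data (PCA scores of the built-in iris dataset, so the numbers differ from the seeds results) and uses hypothetical object names. It shows the cluster sizes, the centre matrix and the decomposition of the total sum of squares into within-cluster and between-cluster parts.

```r
set.seed(1)  # k-means starts from random centres

# Stand-in for comp: first three PCA scores of the built-in iris data
scores <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)$x[, 1:3]
km <- kmeans(scores, centers = 2, nstart = 25, iter.max = 1000)

sizes <- km$size       # members per cluster, same counts as table(km$cluster)
centers <- km$centers  # one row per cluster, one column per component

# Total sum of squares = within-cluster part + between-cluster part
explained <- km$betweenss / km$totss  # share of variation between clusters

print(sizes)
print(dim(centers))
print(round(explained, 3))
```

With two clusters and three components the centre matrix has 2 x 3 = 6 entries, which is why centers is reported with length 6 in the summary above.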

Silhouette

The average silhouette width always lies between -1 and 1, and a value closer to 1 implies higher clustering quality. The average silhouette width for this experiment is 0.47, which suggests that the PCA scores are suitable for cluster analysis.

library(cluster)  # silhouette()

sile <- silhouette(k$cluster, dist(comp))

fviz_silhouette(sile)
##   cluster size ave.sil.width
## 1       1  133          0.45
## 2       2   77          0.52
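The overall average silhouette width is the size-weighted mean of the per-cluster averages. The short sketch below recomputes it from the table above. Because the per-cluster widths are rounded to two decimals, the result only approximates the reported value of 0.47.

```r
# Per-cluster sizes and average silhouette widths, copied from the output above
size <- c(133, 77)
ave_sil_width <- c(0.45, 0.52)

# Overall average: per-cluster averages weighted by cluster size
overall <- sum(size * ave_sil_width) / sum(size)
print(round(overall, 3))
```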

CONCLUSION

In conclusion, principal component analysis (PCA) and the k-means clustering algorithm were applied in this project. The analysis started by examining the relationships between the variables. Then, principal component analysis was performed and the percentage of explained variance for each dimension was determined. After that, a correlation plot was created to identify the variables contributing to each dimension, and the PCA scores and loadings were represented with a biplot. The first three principal components were selected and the optimal number of clusters for k-means clustering was determined. The k-means algorithm was then run on the first three principal components and the result was examined on a 3D plot. Finally, the silhouette width was used as an evaluation metric to assess the quality of the clustering.

REFERENCES

1. The dataset was obtained from the UCI Machine Learning Repository - https://archive.ics.uci.edu/ml/datasets/seeds