K-means is an unsupervised machine learning algorithm used to find groups of observations (clusters) that share similar characteristics.

A cluster is defined as a group of observations that are more similar to each other than they are to the observations in other groups.

Cluster analysis is widely used in the biological and behavioral sciences, marketing, and medical research. For example, a psychological researcher might cluster data on the symptoms and demographics of depressed patients, seeking to uncover subtypes of depression. The hope would be that finding such subtypes might lead to more targeted and effective treatments and a better understanding of the disorder. Marketing researchers use cluster analysis as a customer-segmentation strategy. Customers are arranged into clusters based on the similarity of their demographics and buying behaviors. Marketing campaigns are then tailored to appeal to one or more of these subgroups.

The two most popular clustering approaches are hierarchical agglomerative clustering and partitioning clustering.

In this topic, we discuss k-means clustering, which is a partitioning method.

Common steps in cluster analysis

An effective cluster analysis is a multistep process with numerous decision points. Each decision can affect the quality and usefulness of the results.

1. Choose appropriate attributes

The first (and perhaps most important) step is to select variables that you feel may be important for identifying and understanding differences among groups of observations within the data. For example, in a study of depression, you might want to assess one or more of the following: psychological symptoms; physical symptoms; age at onset; number, duration, and timing of episodes; number of hospitalizations; functional status with regard to self-care; social and work history; current age; gender; ethnicity; socioeconomic status; marital status; family medical history; and response to previous treatments. A sophisticated cluster analysis can’t compensate for a poor choice of variables.

2. Scale the data

If the variables in the analysis vary in range, the variables with the largest range will have the greatest impact on the results. This is often undesirable, and analysts scale the data before continuing. The most popular approach is to standardize each variable to a mean of 0 and a standard deviation of 1.

3. Screen for outliers

Many clustering techniques are sensitive to outliers, which can distort the cluster solutions obtained. You can screen for (and remove) univariate outliers using functions from the outliers package. The mvoutlier package contains functions that can be used to identify multivariate outliers.
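
As a rough sketch (assuming those two packages are installed; df and x1 are placeholder names for a numeric data frame and one of its columns), the screening might look like this:

library(outliers)
grubbs.test(df$x1)   # tests whether the most extreme value of one variable is an outlier

library(mvoutlier)
aq.plot(df)          # flags observations with unusually large robust Mahalanobis distances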

4. Calculate distances

The most popular measure of the distance between two observations is the Euclidean distance, but the Manhattan, Canberra, asymmetric binary, maximum, and Minkowski distance measures are also available.
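
All of these are available through base R's dist() function (df again being a placeholder for a numeric data frame):

d_euclidean <- dist(df, method = "euclidean")          # the default
d_manhattan <- dist(df, method = "manhattan")
d_minkowski <- dist(df, method = "minkowski", p = 3)   # Minkowski distance with power 3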

5. Select a clustering algorithm

Next, select a method of clustering the data. Hierarchical clustering is useful for smaller problems (say, 150 observations or less) and where a nested hierarchy of groupings is desired. The partitioning method can handle much larger problems but requires that the number of clusters be specified in advance.
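
In R these two approaches correspond to hclust(), which works from a distance matrix, and kmeans(), which needs the number of clusters up front. A minimal sketch (df is a placeholder for a scaled numeric data frame; the choice of 3 clusters is arbitrary):

hc <- hclust(dist(df), method = "ward.D2")   # hierarchical: no k required in advance
km <- kmeans(df, centers = 3, nstart = 25)   # partitioning: k must be specified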

6. Determine the number of clusters present

In order to obtain a final cluster solution, you must decide how many clusters are present in the data.

7. Obtain a final clustering solution

Once the number of clusters has been determined, a final clustering is performed to extract that number of subgroups.

8. Visualize the results

Visualization can help you determine the meaning and usefulness of the cluster solution. The results of a hierarchical clustering are usually presented as a dendrogram. Partitioning results are typically visualized using a bivariate cluster plot.
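
Continuing the hypothetical hc and km objects from the sketch in step 5:

plot(hc, hang = -1)                       # dendrogram of the hierarchical solution
factoextra::fviz_cluster(km, data = df)   # bivariate cluster plot of the k-means solution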

9. Interpret the clusters

Once a cluster solution has been obtained, you must interpret (and possibly name) the clusters. What do the observations in a cluster have in common? How do they differ from the observations in other clusters? This step is typically accomplished by obtaining summary statistics for each variable by cluster. For continuous data, the mean or median for each variable within each cluster is calculated. For mixed data (data that contain categorical variables), the summary statistics will also include modes or category distributions.
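
For example, reusing the hypothetical km object and data frame df from the sketches above:

# Median of each variable within each cluster
aggregate(df, by = list(cluster = km$cluster), FUN = median)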

10. Validate the results

Validating the cluster solution involves asking the question, “Are these groupings in some sense real, and not a manifestation of unique aspects of this dataset or statistical technique?” If a different cluster method or different sample is employed, would the same clusters be obtained? The fpc, clv, and clValid packages each contain functions for evaluating the stability of a clustering solution.
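
As a sketch of one such check (assuming the fpc package; df and k = 3 are placeholders), a bootstrap stability assessment might look like:

library(fpc)
cb <- clusterboot(df, B = 50, clustermethod = kmeansCBI, k = 3, seed = 123)
cb$bootmean   # clusterwise Jaccard similarities; values near 1 suggest stable clusters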

Enough theory; let's turn to practice. I recommend installing all the required libraries first and then importing the Wines data. This dataset contains 13 chemical measurements on 178 Italian wine samples and is available from the UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html.

Install the required packages

Let's install and load all the required packages. We use a handful of very useful packages here; you don't have to use the same ones and can substitute any package that serves your purpose.

We load libraries for data wrangling and visualization, together with more specialized clustering tools.

The code below defines a helper function that installs a package if it is missing and then loads it, so there is no need to call library() separately.

# Let's clean up the workspace first
rm(list = ls(all = TRUE))
gc()


# Helper: install a package if it is missing, then load it
packages <- function(x) {
  x <- as.character(match.call()[[2]])
  if (!require(x, character.only = TRUE)) {
    install.packages(pkgs = x, repos = "http://cran.r-project.org")
    require(x, character.only = TRUE)
  }
}

packages(tidyverse) # data manipulation
packages(corrplot) # correlation plots
packages(gridExtra) # arranging multiple grid-based plots
packages(GGally) # ggplot2 extension (scatterplot matrices etc.)
packages(cluster) # clustering algorithms 
packages(factoextra) # clustering algorithms & visualization

Load the Data

We will be loading the Wines data from our local machine. The file is in ‘.csv’ format.

setwd("C:/Users/Abdul_Yunus/Desktop/Yunus_Personal/Learning/k Means Clustering")

wines <- read.csv("Input/Wine.csv")


As we have said before, k-means is an unsupervised machine learning algorithm and works with unlabeled data, so we don't need the Customer_Segment label (the 14th column). Let's remove this column from our data.

wines <- wines[,-14]
head(wines)
##   Alcohol Malic_Acid  Ash Ash_Alcanity Magnesium Total_Phenols Flavanoids
## 1   14.23       1.71 2.43         15.6       127          2.80       3.06
## 2   13.20       1.78 2.14         11.2       100          2.65       2.76
## 3   13.16       2.36 2.67         18.6       101          2.80       3.24
## 4   14.37       1.95 2.50         16.8       113          3.85       3.49
## 5   13.24       2.59 2.87         21.0       118          2.80       2.69
## 6   14.20       1.76 2.45         15.2       112          3.27       3.39
##   Nonflavanoid_Phenols Proanthocyanins Color_Intensity  Hue OD280 Proline
## 1                 0.28            2.29            5.64 1.04  3.92    1065
## 2                 0.26            1.28            4.38 1.05  3.40    1050
## 3                 0.30            2.81            5.68 1.03  3.17    1185
## 4                 0.24            2.18            7.80 0.86  3.45    1480
## 5                 0.39            1.82            4.32 1.04  2.93     735
## 6                 0.34            1.97            6.75 1.05  2.85    1450

Data Analysis

As a first step we get an overview of the data using the summary() and str() functions.

Let’s check the summary of the data set

summary(wines)
##     Alcohol        Malic_Acid         Ash         Ash_Alcanity  
##  Min.   :11.03   Min.   :0.740   Min.   :1.360   Min.   :10.60  
##  1st Qu.:12.36   1st Qu.:1.603   1st Qu.:2.210   1st Qu.:17.20  
##  Median :13.05   Median :1.865   Median :2.360   Median :19.50  
##  Mean   :13.00   Mean   :2.336   Mean   :2.367   Mean   :19.49  
##  3rd Qu.:13.68   3rd Qu.:3.083   3rd Qu.:2.558   3rd Qu.:21.50  
##  Max.   :14.83   Max.   :5.800   Max.   :3.230   Max.   :30.00  
##    Magnesium      Total_Phenols     Flavanoids    Nonflavanoid_Phenols
##  Min.   : 70.00   Min.   :0.980   Min.   :0.340   Min.   :0.1300      
##  1st Qu.: 88.00   1st Qu.:1.742   1st Qu.:1.205   1st Qu.:0.2700      
##  Median : 98.00   Median :2.355   Median :2.135   Median :0.3400      
##  Mean   : 99.74   Mean   :2.295   Mean   :2.029   Mean   :0.3619      
##  3rd Qu.:107.00   3rd Qu.:2.800   3rd Qu.:2.875   3rd Qu.:0.4375      
##  Max.   :162.00   Max.   :3.880   Max.   :5.080   Max.   :0.6600      
##  Proanthocyanins Color_Intensity       Hue             OD280      
##  Min.   :0.410   Min.   : 1.280   Min.   :0.4800   Min.   :1.270  
##  1st Qu.:1.250   1st Qu.: 3.220   1st Qu.:0.7825   1st Qu.:1.938  
##  Median :1.555   Median : 4.690   Median :0.9650   Median :2.780  
##  Mean   :1.591   Mean   : 5.058   Mean   :0.9574   Mean   :2.612  
##  3rd Qu.:1.950   3rd Qu.: 6.200   3rd Qu.:1.1200   3rd Qu.:3.170  
##  Max.   :3.580   Max.   :13.000   Max.   :1.7100   Max.   :4.000  
##     Proline      
##  Min.   : 278.0  
##  1st Qu.: 500.5  
##  Median : 673.5  
##  Mean   : 746.9  
##  3rd Qu.: 985.0  
##  Max.   :1680.0
str(wines)
## 'data.frame':    178 obs. of  13 variables:
##  $ Alcohol             : num  14.2 13.2 13.2 14.4 13.2 ...
##  $ Malic_Acid          : num  1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
##  $ Ash                 : num  2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
##  $ Ash_Alcanity        : num  15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
##  $ Magnesium           : int  127 100 101 113 118 112 96 121 97 98 ...
##  $ Total_Phenols       : num  2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
##  $ Flavanoids          : num  3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
##  $ Nonflavanoid_Phenols: num  0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
##  $ Proanthocyanins     : num  2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
##  $ Color_Intensity     : num  5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
##  $ Hue                 : num  1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
##  $ OD280               : num  3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
##  $ Proline             : int  1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...

We can see that all the variables are numeric or integer, so they can all be used here. It is always advisable, though, to use only the variables that are relevant to the cluster analysis.

Let's visualize the variables in the data by plotting a histogram of each attribute.

wines %>%
  gather(attributes, value, 1:13) %>%
  ggplot(aes(x = value)) +
  geom_histogram(fill = 'lightblue2', color = 'black') +
  facet_wrap(~attributes, scales = 'free_x') +
  labs(x="Values", y="Frequency") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let's build a correlation matrix to understand the relationships between the attributes.

corrplot(cor(wines), type = 'upper', method = 'number', tl.cex = 0.9)

There is a strong linear correlation between Total_Phenols and Flavanoids. We can model the relationship between these two variables by fitting a linear equation.

# Relationship between Phenols and Flavanoids
ggplot(wines, aes(x = Total_Phenols, y = Flavanoids)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  theme_bw()
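
The smooth drawn by geom_smooth(method = 'lm') is an ordinary least-squares line; if we also want its coefficients and R-squared, a quick optional lm() fit gives them:

fit <- lm(Flavanoids ~ Total_Phenols, data = wines)
summary(fit)   # slope, intercept and R-squared of the fitted line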

Now let's prepare our data for k-means clustering.

From the data summary we saw that the variables are on very different scales, so we need to standardize (normalize) the data. We can do this manually by subtracting each variable's mean and dividing by its standard deviation, or simply use the scale() function.

winesNorm <- as.data.frame(scale(wines))
head(winesNorm)
##     Alcohol  Malic_Acid        Ash Ash_Alcanity  Magnesium Total_Phenols
## 1 1.5143408 -0.56066822  0.2313998   -1.1663032 1.90852151     0.8067217
## 2 0.2455968 -0.49800856 -0.8256672   -2.4838405 0.01809398     0.5670481
## 3 0.1963252  0.02117152  1.1062139   -0.2679823 0.08810981     0.8067217
## 4 1.6867914 -0.34583508  0.4865539   -0.8069748 0.92829983     2.4844372
## 5 0.2948684  0.22705328  1.8352256    0.4506745 1.27837900     0.8067217
## 6 1.4773871 -0.51591132  0.3043010   -1.2860793 0.85828399     1.5576991
##   Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity
## 1  1.0319081           -0.6577078       1.2214385       0.2510088
## 2  0.7315653           -0.8184106      -0.5431887      -0.2924962
## 3  1.2121137           -0.4970050       2.1299594       0.2682629
## 4  1.4623994           -0.9791134       1.0292513       1.1827317
## 5  0.6614853            0.2261576       0.4002753      -0.3183774
## 6  1.3622851           -0.1755994       0.6623487       0.7298108
##          Hue     OD280     Proline
## 1  0.3611585 1.8427215  1.01015939
## 2  0.4049085 1.1103172  0.96252635
## 3  0.3174085 0.7863692  1.39122370
## 4 -0.4263410 1.1807407  2.32800680
## 5  0.3611585 0.4483365 -0.03776747
## 6  0.4049085 0.3356589  2.23274072
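
The scale() call above is exactly the manual mean/standard-deviation standardization described earlier; as a quick check on one column:

# Manual standardization of the Alcohol column matches scale()
alcohol_manual <- (wines$Alcohol - mean(wines$Alcohol)) / sd(wines$Alcohol)
all.equal(alcohol_manual, winesNorm$Alcohol)   # TRUE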

Computing k-means clustering in R

We can compute k-means in R with the kmeans() function. Here we will group the data into two clusters (centers = 2). The kmeans() function also has an nstart option that attempts multiple initial configurations and reports the best one; for example, nstart = 25 generates 25 initial configurations. This approach is often recommended.

set.seed(123)

wines_K2 <- kmeans(winesNorm, centers = 2, nstart = 25)
print(wines_K2)
## K-means clustering with 2 clusters of sizes 87, 91
## 
## Cluster means:
##      Alcohol Malic_Acid         Ash Ash_Alcanity  Magnesium Total_Phenols
## 1  0.3248845 -0.3529345  0.05207966   -0.4899811  0.3206911     0.7826625
## 2 -0.3106038  0.3374209 -0.04979045    0.4684435 -0.3065948    -0.7482598
##   Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity
## 1  0.8235093           -0.5921337       0.6378483      -0.1024529
## 2 -0.7873111            0.5661058      -0.6098110       0.0979495
##          Hue      OD280    Proline
## 1  0.5633135  0.7146506  0.6051873
## 2 -0.5385525 -0.6832374 -0.5785857
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 1 2 2 1
##  [71] 2 1 2 1 1 2 1 2 1 1 1 1 2 2 1 1 2 2 2 2 2 2 2 1 1 1 2 1 1 1 1 2 2 2 1
## [106] 2 2 2 2 1 1 2 2 2 2 1 2 2 2 2 1 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [176] 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 765.0965 884.3435
##  (between_SS / total_SS =  28.3 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

An analyst should always try to visualize the data and the results, so let's visualize the clusters we have created so far.

fviz_cluster(wines_K2, data = winesNorm)

When we print the model we built (wines_K2), it shows information such as the number of clusters, the cluster centers, the cluster sizes, and the sums of squares. Let's see how to extract these components of the model individually.

# Clusters to which each point is associated
wines_K2$cluster
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 1 2 2 1
##  [71] 2 1 2 1 1 2 1 2 1 1 1 1 2 2 1 1 2 2 2 2 2 2 2 1 1 1 2 1 1 1 1 2 2 2 1
## [106] 2 2 2 2 1 1 2 2 2 2 1 2 2 2 2 1 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [176] 2 2 2
# Cluster centers
wines_K2$centers
##      Alcohol Malic_Acid         Ash Ash_Alcanity  Magnesium Total_Phenols
## 1  0.3248845 -0.3529345  0.05207966   -0.4899811  0.3206911     0.7826625
## 2 -0.3106038  0.3374209 -0.04979045    0.4684435 -0.3065948    -0.7482598
##   Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity
## 1  0.8235093           -0.5921337       0.6378483      -0.1024529
## 2 -0.7873111            0.5661058      -0.6098110       0.0979495
##          Hue      OD280    Proline
## 1  0.5633135  0.7146506  0.6051873
## 2 -0.5385525 -0.6832374 -0.5785857
# Cluster size
wines_K2$size
## [1] 87 91
# Between clusters sum of square
wines_K2$betweenss
## [1] 651.56
# Within cluster sum of square
wines_K2$withinss
## [1] 765.0965 884.3435
# Total within-cluster sum of squares
wines_K2$tot.withinss
## [1] 1649.44
# Total sum of square
wines_K2$totss
## [1] 2301

Because the number of clusters (k) must be set before we start the algorithm, it is often advantageous to use several different values of k and examine the differences in the results.

We can execute the same process for 3, 4, and 5 clusters, and the results are shown in the figure:

wines_K3 <- kmeans(winesNorm, centers = 3, nstart = 25)
wines_K4 <- kmeans(winesNorm, centers = 4, nstart = 25)
wines_K5 <- kmeans(winesNorm, centers = 5, nstart = 25)

We can plot these clusters for different values of K to compare them.

p1 <- fviz_cluster(wines_K2, geom = "point", data = winesNorm) + ggtitle(" K = 2")
p2 <- fviz_cluster(wines_K3, geom = "point", data = winesNorm) + ggtitle(" K = 3")
p3 <- fviz_cluster(wines_K4, geom = "point", data = winesNorm) + ggtitle(" K = 4")
p4 <- fviz_cluster(wines_K5, geom = "point", data = winesNorm) + ggtitle(" K = 5")

grid.arrange(p1, p2, p3, p4, nrow = 2)

Determining Optimal Clusters

K-means clustering requires that you specify in advance the number of clusters to extract. A plot of the total within-groups sums of squares against the number of clusters in a k-means solution can be helpful. A bend in the graph can suggest the appropriate number of clusters.

Below are three common methods for determining the optimal number of clusters:

  1. Elbow method
  2. Silhouette method
  3. Gap statistic

# Determining Optimal clusters (k) Using Elbow method
fviz_nbclust(x = winesNorm, FUNcluster = kmeans, method = 'wss')

The one-line call above works well for finding the number of clusters with the elbow method; however, we can do the same thing by writing our own function that takes the data (winesNorm) as input. The code below defines such a function and uses it to plot the within-group sum of squares against the number of clusters.

wssplot <- function(data, nc = 15, seed = 1234) {
  # For k = 1, the within-group sum of squares equals the total sum of squares
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(x = data, centers = i, nstart = 25)$withinss)
  }
  plot(1:nc, wss, type = 'b', xlab = 'Number of Clusters', ylab = 'Within Group Sum of Squares',
       main = 'Elbow Method Plot to Find Optimal Number of Clusters', frame.plot = TRUE,
       col = 'blue', lwd = 1.5)
}

wssplot(winesNorm)

# Determining Optimal clusters (k) Using Average Silhouette Method

fviz_nbclust(x = winesNorm, FUNcluster = kmeans, method = 'silhouette')

There is another method, the gap statistic, for finding the optimal value of k.

# compute gap statistic
set.seed(123)
gap_stat <- clusGap(x = winesNorm, FUNcluster = kmeans, K.max = 15, nstart = 25, B = 50)

# Print the result
print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = winesNorm, FUNcluster = kmeans, K.max = 15, B = 50,     nstart = 25)
## B=50 simulated reference sets, k = 1..15; spaceH0="scaledPCA"
##  --> Number of clusters (method 'firstmax'): 3
##           logW   E.logW       gap     SE.sim
##  [1,] 5.377557 5.862345 0.4847882 0.01285667
##  [2,] 5.203497 5.756033 0.5525361 0.01335832
##  [3,] 5.066929 5.693411 0.6264815 0.01219051
##  [4,] 5.023946 5.647048 0.6231019 0.01197491
##  [5,] 4.989519 5.609867 0.6203484 0.01247597
##  [6,] 4.957563 5.577967 0.6204043 0.01285007
##  [7,] 4.929594 5.549966 0.6203728 0.01296987
##  [8,] 4.906154 5.524312 0.6181580 0.01306180
##  [9,] 4.876410 5.500781 0.6243715 0.01334715
## [10,] 4.854848 5.479572 0.6247242 0.01342649
## [11,] 4.824462 5.459325 0.6348630 0.01349015
## [12,] 4.802637 5.440691 0.6380531 0.01350594
## [13,] 4.780001 5.422362 0.6423608 0.01367134
## [14,] 4.762135 5.404725 0.6425899 0.01349749
## [15,] 4.742632 5.387435 0.6448032 0.01335952
# plot the result to determine the optimal number of clusters.
fviz_gap_stat(gap_stat)

With most of these approaches suggesting 3 as the optimal number of clusters, we can perform the final analysis and extract the results using k = 3.

# Compute k-means clustering with k = 3
set.seed(123)
final <- kmeans(winesNorm, centers = 3, nstart = 25)
print(final)
## K-means clustering with 3 clusters of sizes 62, 51, 65
## 
## Cluster means:
##      Alcohol Malic_Acid        Ash Ash_Alcanity   Magnesium Total_Phenols
## 1  0.8328826 -0.3029551  0.3636801   -0.6084749  0.57596208    0.88274724
## 2  0.1644436  0.8690954  0.1863726    0.5228924 -0.07526047   -0.97657548
## 3 -0.9234669 -0.3929331 -0.4931257    0.1701220 -0.49032869   -0.07576891
##    Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity
## 1  0.97506900          -0.56050853      0.57865427       0.1705823
## 2 -1.21182921           0.72402116     -0.77751312       0.9388902
## 3  0.02075402          -0.03343924      0.05810161      -0.8993770
##          Hue      OD280    Proline
## 1  0.4726504  0.7770551  1.1220202
## 2 -1.1615122 -1.2887761 -0.4059428
## 3  0.4605046  0.2700025 -0.7517257
## 
## Clustering vector:
##   [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
##  [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3
##  [71] 3 3 3 1 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 1 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [176] 2 2 2
## 
## Within cluster sum of squares by cluster:
## [1] 385.6983 326.3537 558.6971
##  (between_SS / total_SS =  44.8 %)
## 
## Available components:
## 
## [1] "cluster"      "centers"      "totss"        "withinss"    
## [5] "tot.withinss" "betweenss"    "size"         "iter"        
## [9] "ifault"

We can visualize the results using the code below.

fviz_cluster(final, data = winesNorm)

We can extract the cluster assignments and add them to our data to compute descriptive statistics at the cluster level.

winesNorm %>% 
  mutate(Cluster = final$cluster) %>%
  group_by(Cluster) %>%
  summarize_all('median')
## # A tibble: 3 x 14
##   Cluster Alcohol Malic_Acid     Ash Ash_Alcanity Magnesium Total_Phenols
##     <int>   <dbl>      <dbl>   <dbl>        <dbl>     <dbl>         <dbl>
## 1       1   0.905     -0.511  0.286        -0.747     0.403         0.847
## 2       2   0.135      0.836  0.0491        0.451    -0.192        -1.03 
## 3       3  -0.925     -0.650 -0.461         0.151    -0.822        -0.152
## # ... with 7 more variables: Flavanoids <dbl>, Nonflavanoid_Phenols <dbl>,
## #   Proanthocyanins <dbl>, Color_Intensity <dbl>, Hue <dbl>, OD280 <dbl>,
## #   Proline <dbl>
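
Because winesNorm is standardized, these medians are z-scores. For interpretation it is often easier to profile the clusters on the original measurement scale, for example:

wines %>% 
  mutate(Cluster = final$cluster) %>%
  group_by(Cluster) %>%
  summarize_all('median')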

Summary

K-means clustering is a simple algorithm that partitions n observations into k clusters, with each observation belonging to the cluster with the nearest mean (center).
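
To make that definition concrete, here is a minimal, illustrative version of the two alternating steps (assign each observation to its nearest center, then recompute each center as the mean of its members). It is a sketch for intuition only, not a replacement for kmeans():

simple_kmeans <- function(x, k, iters = 10) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]   # random initial centers
  for (i in seq_len(iters)) {
    # Assignment step: each observation goes to its nearest center (Euclidean distance)
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    cluster <- apply(d, 1, which.min)
    # Update step: each center becomes the mean of its assigned observations
    # (assumes no cluster ends up empty; kmeans() handles that case properly)
    centers <- apply(x, 2, function(col) tapply(col, cluster, mean))
  }
  list(cluster = cluster, centers = centers)
}

# e.g. simple_kmeans(winesNorm, k = 3) should produce groups similar to kmeans()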

So far we have learned:

  • How to prepare the data for k-means clustering: use numerical variables, and normalizing (scaling) the data is recommended.
  • How to explore and analyze the available data.
  • How to find the optimal number of clusters using the elbow method, the silhouette method, and the gap statistic.
  • How to partition the data using the optimal number of clusters.