K-means is an unsupervised machine learning algorithm used to find groups of observations (clusters) that share similar characteristics.
A cluster is defined as a group of observations that are more similar to each other than they are to the observations in other groups.
Cluster analysis is widely used in the biological and behavioral sciences, marketing, and medical research. For example, a psychological researcher might cluster data on the symptoms and demographics of depressed patients, seeking to uncover subtypes of depression. The hope would be that finding such subtypes might lead to more targeted and effective treatments and a better understanding of the disorder. Marketing researchers use cluster analysis as a customer-segmentation strategy. Customers are arranged into clusters based on the similarity of their demographics and buying behaviors. Marketing campaigns are then tailored to appeal to one or more of these subgroups.
The two most popular clustering approaches are hierarchical agglomerative clustering and partitioning clustering.
In this article, we discuss k-means clustering, which falls under partitioning clustering.
An effective cluster analysis is a multistep process with numerous decision points. Each decision can affect the quality and usefulness of the results.
1. Choose appropriate attributes
The first (and perhaps most important) step is to select variables that you feel may be important for identifying and understanding differences among groups of observations within the data. For example, in a study of depression, you might want to assess one or more of the following: psychological symptoms; physical symptoms; age at onset; number, duration, and timing of episodes; number of hospitalizations; functional status with regard to self-care; social and work history; current age; gender; ethnicity; socioeconomic status; marital status; family medical history; and response to previous treatments. A sophisticated cluster analysis can’t compensate for a poor choice of variables.
2. Scale the data
If the variables in the analysis vary in range, the variables with the largest range will have the greatest impact on the results. This is often undesirable, and analysts scale the data before continuing. The most popular approach is to standardize each variable to a mean of 0 and a standard deviation of 1.
3. Screen for outliers
Many clustering techniques are sensitive to outliers, which can distort the cluster solutions obtained. You can screen for (and remove) univariate outliers using functions from the outliers package. The mvoutlier package contains functions that can be used to identify multivariate outliers.
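As a small illustration of this step (a sketch only, using the built-in iris measurements as stand-in data rather than any dataset from this article), the outliers package provides univariate checks and mvoutlier provides multivariate ones:
library(outliers)
library(mvoutlier)

dat <- iris[, 1:4]            # stand-in numeric data
outlier(dat$Sepal.Width)      # value most different from the mean of one variable
grubbs.test(dat$Sepal.Width)  # formal test for a single univariate outlier
aq_res <- aq.plot(dat)        # adjusted-quantile plot flagging multivariate outliers
which(aq_res$outliers)        # row indices flagged as multivariate outliers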
4. Calculate distances
The most popular measure of the distance between two observations is the Euclidean distance, but the Manhattan, Canberra, asymmetric binary, maximum, and Minkowski distance measures are also available.
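For intuition, here is a tiny sketch of these distance measures using base R’s dist function on a made-up two-row matrix:
x <- rbind(c(1, 2, 3),
           c(4, 6, 8))

dist(x, method = "euclidean")        # sqrt((4-1)^2 + (6-2)^2 + (8-3)^2)
dist(x, method = "manhattan")        # |4-1| + |6-2| + |8-3|
dist(x, method = "maximum")          # max(|4-1|, |6-2|, |8-3|)
dist(x, method = "minkowski", p = 3) # generalizes Euclidean and Manhattan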
5. Select a clustering algorithm
Next, select a method of clustering the data. Hierarchical clustering is useful for smaller problems (say, 150 observations or less) and where a nested hierarchy of groupings is desired. The partitioning method can handle much larger problems but requires that the number of clusters be specified in advance.
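As a rough sketch of the contrast (again on stand-in iris data, not part of this article’s workflow), hierarchical clustering builds a nested tree that can be cut at any level, while the partitioning approach needs k specified up front:
dat <- scale(iris[, 1:4])

hc <- hclust(dist(dat), method = "ward.D2")  # agglomerative hierarchical clustering
hc_groups <- cutree(hc, k = 3)               # cut the dendrogram into 3 groups

km <- kmeans(dat, centers = 3, nstart = 25)  # partitioning: k fixed in advance
table(hc_groups, km$cluster)                 # compare the two solutions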
6. Determine the number of clusters present
In order to obtain a final cluster solution, you must decide how many clusters are present in the data.
7. Obtain a final clustering solution
Once the number of clusters has been determined, a final clustering is performed to extract that number of subgroups.
8. Visualize the results
Visualization can help you determine the meaning and usefulness of the cluster solution. The results of a hierarchical clustering are usually presented as a dendrogram. Partitioning results are typically visualized using a bivariate cluster plot.
9. Interpret the clusters
Once a cluster solution has been obtained, you must interpret (and possibly name) the clusters. What do the observations in a cluster have in common? How do they differ from the observations in other clusters? This step is typically accomplished by obtaining summary statistics for each variable by cluster. For continuous data, the mean or median for each variable within each cluster is calculated. For mixed data (data that contain categorical variables), the summary statistics will also include modes or category distributions.
10. Validate the results
Validating the cluster solution involves asking the question, “Are these groupings in some sense real, and not a manifestation of unique aspects of this dataset or statistical technique?” If a different cluster method or different sample is employed, would the same clusters be obtained? The fpc, clv, and clValid packages each contain functions for evaluating the stability of a clustering solution.
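As a hedged illustration of this step (a sketch on stand-in iris data; clusterboot is from the fpc package), a bootstrap stability check reports how often each cluster reappears in resampled data:
library(fpc)

dat <- scale(iris[, 1:4])
cb <- clusterboot(dat, B = 50, clustermethod = kmeansCBI, krange = 3, seed = 123)
cb$bootmean  # mean Jaccard similarity per cluster; values close to 1 indicate stable clusters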
Enough theory for now; let’s focus on the practical part. I recommend installing all the required libraries first and then importing the Wines data. This dataset contains 13 chemical measurements on 178 Italian wine samples and is available from the UCI Machine Learning Repository: http://www.ics.uci.edu/~mlearn/MLRepository.html.
Let’s install and load all the required packages. In this section we use some very useful packages, but it is not necessary that you use the same ones; any package that serves your purpose will do.
We load a range of libraries for general data wrangling and visualization, together with more specialized tools.
The code below defines a function that installs a package if you don’t have it and then loads it, so there is no need to call the library function separately.
# Let's clean the unnecessary items
gc()
rm(list = ls(all = TRUE))

packages <- function(x) {
  x <- as.character(match.call()[[2]])
  if (!require(x, character.only = TRUE)) {
    install.packages(pkgs = x, repos = "http://cran.r-project.org")
    require(x, character.only = TRUE)
  }
}
packages(tidyverse)  # data manipulation
packages(corrplot)   # correlation plots
packages(gridExtra)  # arranging multiple plots on a page
packages(GGally)     # ggplot2 extensions (plot matrices)
packages(cluster)    # clustering algorithms
packages(factoextra) # clustering algorithms & visualization
We will be loading the Wines data from our local machine. The file is in ‘.csv’ format.
setwd("C:/Users/Abdul_Yunus/Desktop/Yunus_Personal/Learning/k Means Clustering")
wines <- read.csv("Input/Wine.csv")
As we have said before, k-means is an unsupervised machine learning algorithm and works with unlabeled data. We don’t need the Customer_Segment column. Let’s remove this column from our data.
wines <- wines[,-14]
head(wines)
## Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols Flavanoids
## 1 14.23 1.71 2.43 15.6 127 2.80 3.06
## 2 13.20 1.78 2.14 11.2 100 2.65 2.76
## 3 13.16 2.36 2.67 18.6 101 2.80 3.24
## 4 14.37 1.95 2.50 16.8 113 3.85 3.49
## 5 13.24 2.59 2.87 21.0 118 2.80 2.69
## 6 14.20 1.76 2.45 15.2 112 3.27 3.39
## Nonflavanoid_Phenols Proanthocyanins Color_Intensity Hue OD280 Proline
## 1 0.28 2.29 5.64 1.04 3.92 1065
## 2 0.26 1.28 4.38 1.05 3.40 1050
## 3 0.30 2.81 5.68 1.03 3.17 1185
## 4 0.24 2.18 7.80 0.86 3.45 1480
## 5 0.39 1.82 4.32 1.04 2.93 735
## 6 0.34 1.97 6.75 1.05 2.85 1450
As a first step, we will get an overview of the data set using the summary and str functions.
Let’s check the summary of the data set
summary(wines)
## Alcohol Malic_Acid Ash Ash_Alcanity
## Min. :11.03 Min. :0.740 Min. :1.360 Min. :10.60
## 1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210 1st Qu.:17.20
## Median :13.05 Median :1.865 Median :2.360 Median :19.50
## Mean :13.00 Mean :2.336 Mean :2.367 Mean :19.49
## 3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558 3rd Qu.:21.50
## Max. :14.83 Max. :5.800 Max. :3.230 Max. :30.00
## Magnesium Total_Phenols Flavanoids Nonflavanoid_Phenols
## Min. : 70.00 Min. :0.980 Min. :0.340 Min. :0.1300
## 1st Qu.: 88.00 1st Qu.:1.742 1st Qu.:1.205 1st Qu.:0.2700
## Median : 98.00 Median :2.355 Median :2.135 Median :0.3400
## Mean : 99.74 Mean :2.295 Mean :2.029 Mean :0.3619
## 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.875 3rd Qu.:0.4375
## Max. :162.00 Max. :3.880 Max. :5.080 Max. :0.6600
## Proanthocyanins Color_Intensity Hue OD280
## Min. :0.410 Min. : 1.280 Min. :0.4800 Min. :1.270
## 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825 1st Qu.:1.938
## Median :1.555 Median : 4.690 Median :0.9650 Median :2.780
## Mean :1.591 Mean : 5.058 Mean :0.9574 Mean :2.612
## 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200 3rd Qu.:3.170
## Max. :3.580 Max. :13.000 Max. :1.7100 Max. :4.000
## Proline
## Min. : 278.0
## 1st Qu.: 500.5
## Median : 673.5
## Mean : 746.9
## 3rd Qu.: 985.0
## Max. :1680.0
str(wines)
## 'data.frame': 178 obs. of 13 variables:
## $ Alcohol : num 14.2 13.2 13.2 14.4 13.2 ...
## $ Malic_Acid : num 1.71 1.78 2.36 1.95 2.59 1.76 1.87 2.15 1.64 1.35 ...
## $ Ash : num 2.43 2.14 2.67 2.5 2.87 2.45 2.45 2.61 2.17 2.27 ...
## $ Ash_Alcanity : num 15.6 11.2 18.6 16.8 21 15.2 14.6 17.6 14 16 ...
## $ Magnesium : int 127 100 101 113 118 112 96 121 97 98 ...
## $ Total_Phenols : num 2.8 2.65 2.8 3.85 2.8 3.27 2.5 2.6 2.8 2.98 ...
## $ Flavanoids : num 3.06 2.76 3.24 3.49 2.69 3.39 2.52 2.51 2.98 3.15 ...
## $ Nonflavanoid_Phenols: num 0.28 0.26 0.3 0.24 0.39 0.34 0.3 0.31 0.29 0.22 ...
## $ Proanthocyanins : num 2.29 1.28 2.81 2.18 1.82 1.97 1.98 1.25 1.98 1.85 ...
## $ Color_Intensity : num 5.64 4.38 5.68 7.8 4.32 6.75 5.25 5.05 5.2 7.22 ...
## $ Hue : num 1.04 1.05 1.03 0.86 1.04 1.05 1.02 1.06 1.08 1.01 ...
## $ OD280 : num 3.92 3.4 3.17 3.45 2.93 2.85 3.58 3.58 2.85 3.55 ...
## $ Proline : int 1065 1050 1185 1480 735 1450 1290 1295 1045 1045 ...
We can see that all the variables are either numeric or integer, so we can use them here. However, it is always advisable to use only the relevant variables for the cluster analysis.
Let’s visualize the variables available in the data by plotting a histogram of each attribute.
wines %>%
  gather(attributes, value, 1:13) %>%
  ggplot(aes(x = value)) +
  geom_histogram(fill = 'lightblue2', color = 'black') +
  facet_wrap(~attributes, scales = 'free_x') +
  labs(x = "Values", y = "Frequency") +
  theme_bw()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let’s build a correlation matrix to understand the relationships between the attributes.
corrplot(cor(wines), type = 'upper', method = 'number', tl.cex = 0.9)
There is a strong linear correlation between Total_Phenols and Flavanoids. We can model the relationship between these two variables by fitting a linear equation.
# Relationship between Phenols and Flavanoids
ggplot(wines, aes(x = Total_Phenols, y = Flavanoids)) +
  geom_point() +
  geom_smooth(method = 'lm', se = FALSE) +
  theme_bw()
Let’s prepare our data for k-means clustering.
From the data summary we have seen that the variables are on different scales, so we need to either scale the data or normalize it. We can standardize each variable using its mean and standard deviation, which is exactly what the scale function does.
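Just to confirm the equivalence (a quick optional check, not required for the analysis), standardizing one column by hand gives the same values as scale:
manual <- (wines$Alcohol - mean(wines$Alcohol)) / sd(wines$Alcohol)
all.equal(manual, as.numeric(scale(wines$Alcohol)))  # TRUE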
winesNorm <- as.data.frame(scale(wines))
head(winesNorm)
## Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols
## 1 1.5143408 -0.56066822 0.2313998 -1.1663032 1.90852151 0.8067217
## 2 0.2455968 -0.49800856 -0.8256672 -2.4838405 0.01809398 0.5670481
## 3 0.1963252 0.02117152 1.1062139 -0.2679823 0.08810981 0.8067217
## 4 1.6867914 -0.34583508 0.4865539 -0.8069748 0.92829983 2.4844372
## 5 0.2948684 0.22705328 1.8352256 0.4506745 1.27837900 0.8067217
## 6 1.4773871 -0.51591132 0.3043010 -1.2860793 0.85828399 1.5576991
## Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity
## 1 1.0319081 -0.6577078 1.2214385 0.2510088
## 2 0.7315653 -0.8184106 -0.5431887 -0.2924962
## 3 1.2121137 -0.4970050 2.1299594 0.2682629
## 4 1.4623994 -0.9791134 1.0292513 1.1827317
## 5 0.6614853 0.2261576 0.4002753 -0.3183774
## 6 1.3622851 -0.1755994 0.6623487 0.7298108
## Hue OD280 Proline
## 1 0.3611585 1.8427215 1.01015939
## 2 0.4049085 1.1103172 0.96252635
## 3 0.3174085 0.7863692 1.39122370
## 4 -0.4263410 1.1807407 2.32800680
## 5 0.3611585 0.4483365 -0.03776747
## 6 0.4049085 0.3356589 2.23274072
**Computing k-means clustering in R**
We can compute k-means in R with the kmeans function. Here we will group the data into two clusters (centers = 2). The kmeans function also has an nstart option that attempts multiple initial configurations and reports the best one; for example, nstart = 25 generates 25 initial configurations. This approach is often recommended.
set.seed(123)
wines_K2 <- kmeans(winesNorm, centers = 2, nstart = 25)
print(wines_K2)
## K-means clustering with 2 clusters of sizes 87, 91
##
## Cluster means:
## Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols
## 1 0.3248845 -0.3529345 0.05207966 -0.4899811 0.3206911 0.7826625
## 2 -0.3106038 0.3374209 -0.04979045 0.4684435 -0.3065948 -0.7482598
## Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity
## 1 0.8235093 -0.5921337 0.6378483 -0.1024529
## 2 -0.7873111 0.5661058 -0.6098110 0.0979495
## Hue OD280 Proline
## 1 0.5633135 0.7146506 0.6051873
## 2 -0.5385525 -0.6832374 -0.5785857
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 1 2 2 1
## [71] 2 1 2 1 1 2 1 2 1 1 1 1 2 2 1 1 2 2 2 2 2 2 2 1 1 1 2 1 1 1 1 2 2 2 1
## [106] 2 2 2 2 1 1 2 2 2 2 1 2 2 2 2 1 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [176] 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 765.0965 884.3435
## (between_SS / total_SS = 28.3 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
An analyst always tries to visualize the data and results, so let’s visualize the clusters we have created so far.
fviz_cluster(wines_K2, data = winesNorm)
When we print the model we built (wines_K2), it shows information such as the number of clusters, the cluster centers, the cluster sizes, and the sums of squares. Let’s check how to get these attributes of our model.
# Clusters to which each point is associated
wines_K2$cluster
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 1 2 1 1 2 2 1
## [71] 2 1 2 1 1 2 1 2 1 1 1 1 2 2 1 1 2 2 2 2 2 2 2 1 1 1 2 1 1 1 1 2 2 2 1
## [106] 2 2 2 2 1 1 2 2 2 2 1 2 2 2 2 1 1 2 2 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [176] 2 2 2
# Cluster centers
wines_K2$centers
## Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols
## 1 0.3248845 -0.3529345 0.05207966 -0.4899811 0.3206911 0.7826625
## 2 -0.3106038 0.3374209 -0.04979045 0.4684435 -0.3065948 -0.7482598
## Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity
## 1 0.8235093 -0.5921337 0.6378483 -0.1024529
## 2 -0.7873111 0.5661058 -0.6098110 0.0979495
## Hue OD280 Proline
## 1 0.5633135 0.7146506 0.6051873
## 2 -0.5385525 -0.6832374 -0.5785857
# Cluster size
wines_K2$size
## [1] 87 91
# Between-cluster sum of squares
wines_K2$betweenss
## [1] 651.56
# Within-cluster sum of squares
wines_K2$withinss
## [1] 765.0965 884.3435
# Total within-cluster sum of squares
wines_K2$tot.withinss
## [1] 1649.44
# Total sum of squares
wines_K2$totss
## [1] 2301
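The between_SS / total_SS ratio shown when printing the model can be recomputed from these components; it is the share of the total variance explained by the separation between clusters:
wines_K2$betweenss / wines_K2$totss  # roughly 0.283, i.e. the 28.3 % reported above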
Because the number of clusters (k) must be set before we start the algorithm, it is often advantageous to use several different values of k and examine the differences in the results.
We can execute the same process for 3, 4, and 5 clusters, and the results are shown in the figure:
wines_K3 <- kmeans(winesNorm, centers = 3, nstart = 25)
wines_K4 <- kmeans(winesNorm, centers = 4, nstart = 25)
wines_K5 <- kmeans(winesNorm, centers = 5, nstart = 25)
We can plot these clusters for different values of K to compare them.
p1 <- fviz_cluster(wines_K2, geom = "point", data = winesNorm) + ggtitle(" K = 2")
p2 <- fviz_cluster(wines_K3, geom = "point", data = winesNorm) + ggtitle(" K = 3")
p3 <- fviz_cluster(wines_K4, geom = "point", data = winesNorm) + ggtitle(" K = 4")
p4 <- fviz_cluster(wines_K5, geom = "point", data = winesNorm) + ggtitle(" K = 5")
grid.arrange(p1, p2, p3, p4, nrow = 2)
K-means clustering requires that you specify in advance the number of clusters to extract. A plot of the total within-groups sums of squares against the number of clusters in a k-means solution can be helpful. A bend in the graph can suggest the appropriate number of clusters.
Below are some methods to determine the optimal number of clusters.
# Determining Optimal clusters (k) Using Elbow method
fviz_nbclust(x = winesNorm, FUNcluster = kmeans, method = 'wss')
The one-line call above works well for finding the number of clusters with the elbow method; however, we can do the same thing by writing a function that takes our data (winesNorm) as input. The lines below define such a function and use it to plot the within-group sum of squares against the number of clusters.
wssplot <- function(data, nc = 15, seed = 1234) {
  wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
  for (i in 2:nc) {
    set.seed(seed)
    wss[i] <- sum(kmeans(x = data, centers = i, nstart = 25)$withinss)
  }
  plot(1:nc, wss, type = 'b', xlab = 'Number of Clusters', ylab = 'Within Group Sum of Square',
       main = 'Elbow Method Plot to Find Optimal Number of Clusters', frame.plot = T,
       col = 'blue', lwd = 1.5)
}
wssplot(winesNorm)
# Determining Optimal clusters (k) Using Average Silhouette Method
fviz_nbclust(x = winesNorm, FUNcluster = kmeans, method = 'silhouette')
There is another method, called the gap statistic, that can be used to find the optimal value of k.
# compute gap statistic
set.seed(123)
gap_stat <- clusGap(x = winesNorm, FUNcluster = kmeans, K.max = 15, nstart = 25, B = 50)
# Print the result
print(gap_stat, method = "firstmax")
## Clustering Gap statistic ["clusGap"] from call:
## clusGap(x = winesNorm, FUNcluster = kmeans, K.max = 15, B = 50, nstart = 25)
## B=50 simulated reference sets, k = 1..15; spaceH0="scaledPCA"
## --> Number of clusters (method 'firstmax'): 3
## logW E.logW gap SE.sim
## [1,] 5.377557 5.862345 0.4847882 0.01285667
## [2,] 5.203497 5.756033 0.5525361 0.01335832
## [3,] 5.066929 5.693411 0.6264815 0.01219051
## [4,] 5.023946 5.647048 0.6231019 0.01197491
## [5,] 4.989519 5.609867 0.6203484 0.01247597
## [6,] 4.957563 5.577967 0.6204043 0.01285007
## [7,] 4.929594 5.549966 0.6203728 0.01296987
## [8,] 4.906154 5.524312 0.6181580 0.01306180
## [9,] 4.876410 5.500781 0.6243715 0.01334715
## [10,] 4.854848 5.479572 0.6247242 0.01342649
## [11,] 4.824462 5.459325 0.6348630 0.01349015
## [12,] 4.802637 5.440691 0.6380531 0.01350594
## [13,] 4.780001 5.422362 0.6423608 0.01367134
## [14,] 4.762135 5.404725 0.6425899 0.01349749
## [15,] 4.742632 5.387435 0.6448032 0.01335952
# plot the result to determine the optimal number of clusters.
fviz_gap_stat(gap_stat)
With most of these approaches suggesting 3 as the optimal number of clusters, we can perform the final analysis and extract the results using 3 clusters.
# Compute k-means clustering with k = 3
set.seed(123)
final <- kmeans(winesNorm, centers = 3, nstart = 25)
print(final)
## K-means clustering with 3 clusters of sizes 62, 51, 65
##
## Cluster means:
## Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols
## 1 0.8328826 -0.3029551 0.3636801 -0.6084749 0.57596208 0.88274724
## 2 0.1644436 0.8690954 0.1863726 0.5228924 -0.07526047 -0.97657548
## 3 -0.9234669 -0.3929331 -0.4931257 0.1701220 -0.49032869 -0.07576891
## Flavanoids Nonflavanoid_Phenols Proanthocyanins Color_Intensity
## 1 0.97506900 -0.56050853 0.57865427 0.1705823
## 2 -1.21182921 0.72402116 -0.77751312 0.9388902
## 3 0.02075402 -0.03343924 0.05810161 -0.8993770
## Hue OD280 Proline
## 1 0.4726504 0.7770551 1.1220202
## 2 -1.1615122 -1.2887761 -0.4059428
## 3 0.4605046 0.2700025 -0.7517257
##
## Clustering vector:
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 3 3 2 3 3 3 3 3 3 3 3
## [71] 3 3 3 1 3 3 3 3 3 3 3 3 3 2 3 3 3 3 3 3 3 3 3 3 3 1 3 3 3 3 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 1 3 3 3 3 3 3 3 3 2 2 2 2 2 2 2 2 2 2
## [141] 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
## [176] 2 2 2
##
## Within cluster sum of squares by cluster:
## [1] 385.6983 326.3537 558.6971
## (between_SS / total_SS = 44.8 %)
##
## Available components:
##
## [1] "cluster" "centers" "totss" "withinss"
## [5] "tot.withinss" "betweenss" "size" "iter"
## [9] "ifault"
We can visualize the results using the below code.
fviz_cluster(final, data = winesNorm)
We can extract the cluster assignments and add them to our initial data to do some descriptive statistics at the cluster level.
winesNorm %>%
mutate(Cluster = final$cluster) %>%
group_by(Cluster) %>%
summarize_all('median')
## # A tibble: 3 x 14
## Cluster Alcohol Malic_Acid Ash Ash_Alcanity Magnesium Total_Phenols
## <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 0.905 -0.511 0.286 -0.747 0.403 0.847
## 2 2 0.135 0.836 0.0491 0.451 -0.192 -1.03
## 3 3 -0.925 -0.650 -0.461 0.151 -0.822 -0.152
## # ... with 7 more variables: Flavanoids <dbl>, Nonflavanoid_Phenols <dbl>,
## # Proanthocyanins <dbl>, Color_Intensity <dbl>, Hue <dbl>, OD280 <dbl>,
## # Proline <dbl>
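The medians above are on the standardized scale. As a quick follow-up sketch (assuming the unscaled wines data frame is still in the workspace), the same summary can be computed on the original measurement scale, which is often easier to interpret:
wines %>%
  mutate(Cluster = final$cluster) %>%
  group_by(Cluster) %>%
  summarize_all('median')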
K-means clustering is a simple algorithm that partitions n observations into k clusters in which each observation belongs to the cluster with the nearest mean.
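To make the “nearest mean” idea concrete, here is a bare-bones sketch of the iteration itself. It is for intuition only; simple_kmeans is a hypothetical helper that ignores empty clusters and restarts, and in practice you should keep using the kmeans function:
simple_kmeans <- function(x, k, iters = 10) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random initial means
  for (i in seq_len(iters)) {
    # assign each observation to the cluster with the nearest mean
    d  <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    cl <- max.col(-d)
    # recompute each cluster mean from its current members
    centers <- apply(x, 2, function(col) tapply(col, cl, mean))
  }
  list(cluster = cl, centers = centers)
}

set.seed(123)
str(simple_kmeans(winesNorm, k = 3))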
So far we have learned what k-means clustering is, the typical steps of a cluster analysis, how to run the kmeans function in R, how to choose the number of clusters using the elbow, silhouette, and gap statistic methods, and how to visualize and summarize the final cluster solution.