K-means clustering is a method of unsupervised learning used when you have unlabeled data (i.e., data without defined categories or groups).
The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:
The centroids of the K clusters, which can be used to label new data
Labels for the training data (each data point is assigned to a single cluster)
The K-means clustering algorithm is used to find groups which have not been explicitly labeled in the data. This can be used to confirm business assumptions about what types of groups exist or to identify unknown groups in complex data sets.
Once the algorithm has been run and the groups are defined, any new data can be easily assigned to the correct group.
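Before turning to real data, here is a minimal sketch of both outputs on made-up 2-D data (the values are illustrative only): kmeans() returns the centroids and the training labels, and the centroids can then be used to label a new point.
#Toy data: two well-separated groups in 2-D (illustrative only)
set.seed(42)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))
km <- kmeans(toy, centers = 2)
km$centers #the centroids of the K clusters
km$cluster #a label for each training point
#Assign a new point to the cluster with the nearest centroid
new.point <- c(4.8, 5.1)
dists <- apply(km$centers, 1, function(ctr) sum((ctr - new.point)^2))
which.min(dists)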
To explore the k-means clustering technique, we use the Wholesale Customers dataset, which can be downloaded from https://archive.ics.uci.edu/ml/datasets/Wholesale+customers. The code is run in order to identify the best number of clusters, i.e., the best k value for this dataset.
The data is imported and then explored. Fortunately, there is no missing data.
library(readxl) #read_excel() comes from the readxl package
CustomerData <- read_excel("~/Harrisburg/MachineLearning/Lab 3/Wholesale customers data.csv.xlsx")
Customerdata <- data.frame(CustomerData)
summary(Customerdata)
## Channel Region Fresh Milk
## Min. :1.000 Min. :1.000 Min. : 3 Min. : 55
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 3128 1st Qu.: 1533
## Median :1.000 Median :3.000 Median : 8504 Median : 3627
## Mean :1.323 Mean :2.543 Mean : 12000 Mean : 5796
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 16934 3rd Qu.: 7190
## Max. :2.000 Max. :3.000 Max. :112151 Max. :73498
## Grocery Frozen Detergents_Paper Delicassen
## Min. : 3 Min. : 25.0 Min. : 3.0 Min. : 3.0
## 1st Qu.: 2153 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
## Median : 4756 Median : 1526.0 Median : 816.5 Median : 965.5
## Mean : 7951 Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
## 3rd Qu.:10656 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
## Max. :92780 Max. :60869.0 Max. :40827.0 Max. :47943.0
sum(is.na(Customerdata))
## [1] 0
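There are no missing values here, but if there were, one simple option (shown only for completeness; it is a no-op on this dataset) is complete-case filtering:
#Drop any rows containing missing values (a no-op here, since there are none)
Customerdata <- na.omit(Customerdata)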
With the function below, we identify the top 5 customers in each of the columns passed to it so they can be removed as outliers. (The two ID fields, Channel and Region, are dropped later, before the final clustering.)
top.n.custs <- function(data, cols, n=5)
{
  #Initialize a vector to hold the row indices of customers being removed
  idx.to.remove <- integer(0)
  for (c in cols) #For every column in the data passed to this function
  {
    #order() returns the sorted row indices (e.g. row 15, 3, 7, 1, ...)
    #rather than the sorted values themselves
    col.order <- order(data[,c], decreasing=TRUE) #Sort column "c" in descending order (biggest on top)
    idx <- head(col.order, n) #Take the indices of the first n rows of sorted column c
    #Combine and de-duplicate the row indices that need to be removed
    idx.to.remove <- union(idx.to.remove, idx)
  }
  #Return the indices of customers to be removed
  return(idx.to.remove)
}
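As a quick illustration of the building blocks used above (a throwaway vector, for demonstration only):
x <- c(10, 50, 30, 40, 20)
order(x, decreasing=TRUE) #returns the indices 2 4 3 5 1, not the values
head(order(x, decreasing=TRUE), 2) #indices of the top 2 values: 2 4
union(c(2, 4), c(4, 1)) #combined and de-duplicated: 2 4 1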
We then determine how many customers will be removed, using the function defined above.
#How Many Customers to be Removed?
top.custs <-top.n.custs(Customerdata, cols = 1:5,n=5)
length(top.custs)
## [1] 18
#Examine the customers flagged for removal
Customerdata[top.custs,]
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 2 3 12669 9656 7561 214 2674 1338
## 2 2 3 7057 9810 9568 1762 3293 1776
## 3 2 3 6353 8808 7684 2405 3516 7844
## 5 2 3 22615 5410 7198 3915 1777 5185
## 6 2 3 9413 8259 5126 666 1795 1451
## 4 1 3 13265 1196 4221 6404 507 1788
## 182 1 3 112151 29627 18148 16745 4948 8550
## 126 1 3 76237 3473 7102 16538 778 918
## 285 1 3 68951 4411 12609 8692 751 2406
## 40 1 3 56159 555 902 10002 212 2916
## 259 1 1 56083 4563 2124 6422 730 3321
## 87 2 3 22925 73498 32114 987 20070 903
## 48 2 3 44466 54259 55571 7782 24171 6465
## 86 2 3 16117 46197 92780 1026 40827 2944
## 184 1 3 36847 43950 20170 36534 239 47943
## 62 2 3 35942 38369 59598 3254 26701 2017
## 334 2 2 8565 4980 67298 131 38102 1215
## 66 2 3 85 20959 45828 36 24231 1423
#Remove these customers from the dataset
data.rm.top <- Customerdata[-top.custs, ]
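A quick sanity check on the removal (the Wholesale Customers dataset has 440 rows, and 18 indices were found above):
nrow(Customerdata) #440 customers before removal
nrow(data.rm.top) #440 - 18 = 422 customers remaining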
We can now examine the summary statistics of the remaining data.
#Examine summary stats for the remaining data
print(summary(data.rm.top))
## Channel Region Fresh Milk
## Min. :1.00 Min. :1.000 Min. : 3 Min. : 55
## 1st Qu.:1.00 1st Qu.:2.000 1st Qu.: 3072 1st Qu.: 1497
## Median :1.00 Median :3.000 Median : 8130 Median : 3582
## Mean :1.31 Mean :2.531 Mean :11076 Mean : 5172
## 3rd Qu.:2.00 3rd Qu.:3.000 3rd Qu.:16251 3rd Qu.: 6962
## Max. :2.00 Max. :3.000 Max. :56082 Max. :36423
## Grocery Frozen Detergents_Paper Delicassen
## Min. : 3 Min. : 25.0 Min. : 3.0 Min. : 3.0
## 1st Qu.: 2132 1st Qu.: 738.8 1st Qu.: 255.2 1st Qu.: 398.0
## Median : 4603 Median : 1487.5 Median : 799.5 Median : 904.5
## Mean : 7211 Mean : 2910.3 Mean : 2541.6 Mean : 1352.0
## 3rd Qu.:10391 3rd Qu.: 3428.0 3rd Qu.: 3879.2 3rd Qu.: 1752.2
## Max. :39694 Max. :60869.0 Max. :19410.0 Max. :16523.0
We can now try a range of k values and compare the within-cluster and between-cluster sums of squares to help us select the best k.
set.seed(76964057) #Set the seed for reproducibility
#Try k from 2 to 20
rng <- 2:20
#Number of times to run the k-means algorithm for each k
tries <- 100
#Set up empty vectors to hold the averaged results
avg.totw.ss <- integer(length(rng))
avg.totb.ss <- integer(length(rng))
avg.tot.ss <- integer(length(rng))
#For each value of k in the range
for(v in rng){
  #Set up empty vectors to hold the results of each try
  v.totw.ss <- integer(tries)
  b.totb.ss <- integer(tries)
  tot.ss <- integer(tries)
  #Run k-means "tries" times for this k
  for(i in 1:tries){
    k.temp <- kmeans(data.rm.top, centers=v)
    #Store the total within-cluster sum of squares
    v.totw.ss[i] <- k.temp$tot.withinss
    #Store the between-cluster sum of squares
    b.totb.ss[i] <- k.temp$betweenss
    #Store the total sum of squares
    tot.ss[i] <- k.temp$totss
  }
  #Average the within and between sums of squares over the tries
  avg.totw.ss[v-1] <- mean(v.totw.ss)
  avg.totb.ss[v-1] <- mean(b.totb.ss)
  avg.tot.ss[v-1] <- mean(tot.ss)
}
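A useful identity for sanity-checking the loop: every kmeans() fit satisfies totss = tot.withinss + betweenss, so the averaged vectors should satisfy it as well, up to floating-point tolerance:
#The total SS decomposes into the within and between parts
all.equal(avg.totw.ss + avg.totb.ss, avg.tot.ss) #expected TRUE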
plot(rng,avg.totw.ss,type="b", main="Total Within SS by Various K",
ylab="Average Total Within Sum of Squares",
xlab="Value of K")
At this point the graph bends (the "elbow") and stops making meaningful gains around k = 5.
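To make the elbow judgment less purely visual, one can tabulate the percentage drop in the average within-SS at each step (a convenience sketch built on the avg.totw.ss vector computed above); the gains should flatten once the elbow is passed:
#Percentage improvement in average within-SS when moving from k to k+1
improvement <- -diff(avg.totw.ss) / head(avg.totw.ss, -1) * 100
round(data.frame(k = rng[-1], pct.gain = improvement), 1)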
plot(rng,avg.totb.ss,type="b", main="Total between SS by Various K",
ylab="Average Total Between Sum of Squares",
xlab="Value of K")
#Plot the ratio of betweenss/total ss and withinss / total ss for evaluation
plot(rng,avg.totb.ss/avg.tot.ss,type="b", main="Ratio of between ss / the total ss by Various K",
ylab="Ratio Between SS / Total SS",
xlab="Value of K")
abline(h=0.85, col="red")
plot(rng,avg.totw.ss/avg.tot.ss,type="b", main="Ratio of within ss / the total ss by Various K",
ylab="Ratio Within SS / Total SS",
xlab="Value of K")
abline(h=0.15, col="red")
We then run k-means with the chosen number of clusters.
#Run k-means with the chosen number of clusters, removing columns 1 and 2 (Channel and Region)
n <- 3
k <- kmeans(data.rm.top[,-c(1,2)], centers=n)
Below is a printout of the analysis with 3 clusters.
#Display cluster centers
print(k$centers)
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 5165.797 12491.532 19078.823 1644.633 8395.608 1843.6709
## 2 6623.979 3210.054 4070.158 2569.363 1221.833 973.7792
## 3 25984.252 4127.505 5426.165 4675.359 1126.621 1856.1456
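To make the centers easier to read, one can list each cluster's highest-spending categories (a convenience sketch, not part of the original analysis):
#For each cluster center, the two categories with the highest average spend
apply(k$centers, 1, function(ctr) names(sort(ctr, decreasing=TRUE))[1:2])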
#Give a count of data points in each cluster
print(table(k$cluster))
##
## 1 2 3
## 79 240 103
#Generate a plot of the clusters
library(cluster)
clusplot(data.rm.top, k$cluster, main='2D representation of the Cluster solution',
color=TRUE, shade=TRUE,
labels=2, lines=0)
Interpretation: the two components shown in the cluster plot explain about 60% of the variability in the multivariate data.
Next, we perform a k-means analysis on the Wine dataset from UCI: https://archive.ics.uci.edu/ml/datasets/wine. The dataset has 13 variables with 178 measurements.
The dataset is imported and checked for missing values.
library(readxl)
WineData <- read_excel("~/Harrisburg/MachineLearning/Lab 3/WineDataset.xlsx")
##Exploring the data by checking for missing data
sum(is.na(WineData))
## [1] 0
summary(WineData)
## Class Identifier Alcohol Malic Acid Ash
## Min. :1.000 Min. :11.03 Min. :0.740 Min. :1.360
## 1st Qu.:1.000 1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210
## Median :2.000 Median :13.05 Median :1.865 Median :2.360
## Mean :1.938 Mean :13.00 Mean :2.336 Mean :2.367
## 3rd Qu.:3.000 3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558
## Max. :3.000 Max. :14.83 Max. :5.800 Max. :3.230
## Alcalinity of ash Magnesium Total phenols Flavanoids
## Min. :10.60 Min. : 70.00 Min. :0.980 Min. :0.340
## 1st Qu.:17.20 1st Qu.: 88.00 1st Qu.:1.742 1st Qu.:1.205
## Median :19.50 Median : 98.00 Median :2.355 Median :2.135
## Mean :19.49 Mean : 99.74 Mean :2.295 Mean :2.029
## 3rd Qu.:21.50 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.875
## Max. :30.00 Max. :162.00 Max. :3.880 Max. :5.080
## Nonflavanoid phenols Proanthocyanins Color intensity Hue
## Min. :0.1300 Min. :0.410 Min. : 1.280 Min. :0.4800
## 1st Qu.:0.2700 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825
## Median :0.3400 Median :1.555 Median : 4.690 Median :0.9650
## Mean :0.3619 Mean :1.591 Mean : 5.058 Mean :0.9574
## 3rd Qu.:0.4375 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200
## Max. :0.6600 Max. :3.580 Max. :13.000 Max. :1.7100
## Diluted Wines Proline
## Min. :1.270 Min. : 278.0
## 1st Qu.:1.938 1st Qu.: 500.5
## Median :2.780 Median : 673.5
## Mean :2.612 Mean : 746.9
## 3rd Qu.:3.170 3rd Qu.: 985.0
## Max. :4.000 Max. :1680.0
head(WineData)
## # A tibble: 6 x 14
## `Class Identifi~ Alcohol `Malic Acid` Ash `Alcalinity of ~ Magnesium
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 14.2 1.71 2.43 15.6 127
## 2 1 13.2 1.78 2.14 11.2 100
## 3 1 13.2 2.36 2.67 18.6 101
## 4 1 14.4 1.95 2.5 16.8 113
## 5 1 13.2 2.59 2.87 21 118
## 6 1 14.2 1.76 2.45 15.2 112
## # ... with 8 more variables: `Total phenols` <dbl>, Flavanoids <dbl>,
## # `Nonflavanoid phenols` <dbl>, Proanthocyanins <dbl>, `Color
## # intensity` <dbl>, Hue <dbl>, `Diluted Wines` <dbl>, Proline <dbl>
Plot the within-cluster sum of squares to determine an initial value for "k", using the wssplot function defined below; the NbClust function is used afterward.
wssplot <- function(data, nc=15, seed=1234)
{
  #For k = 1, the within-groups SS equals the total SS,
  #i.e. (n - 1) times the sum of the column variances
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc)
  {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)
  }
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")
}
The first column is removed since it is a categorical variable, the remaining variables are standardized with scale(), and the result is run through the function created above.
##Remove the first column which is categorical
wine<- scale(WineData[-1])
wssplot(wine)
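As a quick sanity check on the standardization (illustrative only), every scaled column should have mean approximately 0 and standard deviation 1:
round(colMeans(wine), 3) #all approximately 0
round(apply(wine, 2, sd), 3) #all 1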
NbClust() from the NbClust package is then used to recommend the number of clusters; its results are stored in the variable "nc".
#Determine the recommended number of clusters with NbClust
library(NbClust) #NbClust() comes from the NbClust package
set.seed(1234)
nc <- NbClust(wine, min.nc=2, max.nc = 15, method = "kmeans")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 15 proposed 3 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
## * 1 proposed 12 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
print(table(nc$Best.n[1,]))
##
## 0 1 2 3 10 12 14 15
## 2 1 4 15 1 1 1 1
From the barplot below, 3 would be the appropriate number of clusters to choose.
barplot(table(nc$Best.n[1,]),
        xlab = "Number of Clusters", ylab = "Number of Criteria",
        main = "Number of Clusters Chosen by 26 Criteria")
n <- 3 #Best number of clusters according to NbClust
#Conduct the k-Means analysis using the best number of clusters
set.seed(1234)
fit.km <- kmeans(wine, n, nstart=25)
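The nstart=25 argument asks kmeans() to try 25 random sets of starting centers and keep the solution with the lowest total within-cluster SS, which guards against a poor local optimum. A quick illustration (results vary from run to run, since the starts are random):
#A single random start can land in a worse local optimum;
#nstart=25 keeps the best of 25 starts
kmeans(wine, n, nstart=1)$tot.withinss
kmeans(wine, n, nstart=25)$tot.withinss #typically less than or equal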
print(fit.km$size)
## [1] 62 65 51
print(fit.km$centers)
## Alcohol Malic Acid Ash Alcalinity of ash Magnesium
## 1 0.8328826 -0.3029551 0.3636801 -0.6084749 0.57596208
## 2 -0.9234669 -0.3929331 -0.4931257 0.1701220 -0.49032869
## 3 0.1644436 0.8690954 0.1863726 0.5228924 -0.07526047
## Total phenols Flavanoids Nonflavanoid phenols Proanthocyanins
## 1 0.88274724 0.97506900 -0.56050853 0.57865427
## 2 -0.07576891 0.02075402 -0.03343924 0.05810161
## 3 -0.97657548 -1.21182921 0.72402116 -0.77751312
## Color intensity Hue Diluted Wines Proline
## 1 0.1705823 0.4726504 0.7770551 1.1220202
## 2 -0.8993770 0.4605046 0.2700025 -0.7517257
## 3 0.9388902 -1.1615122 -1.2887761 -0.4059428
print(aggregate(WineData[-1], by=list(cluster=fit.km$cluster), mean))
## cluster Alcohol Malic Acid Ash Alcalinity of ash Magnesium
## 1 1 13.67677 1.997903 2.466290 17.46290 107.96774
## 2 2 12.25092 1.897385 2.231231 20.06308 92.73846
## 3 3 13.13412 3.307255 2.417647 21.24118 98.66667
## Total phenols Flavanoids Nonflavanoid phenols Proanthocyanins
## 1 2.847581 3.0032258 0.2920968 1.922097
## 2 2.247692 2.0500000 0.3576923 1.624154
## 3 1.683922 0.8188235 0.4519608 1.145882
## Color intensity Hue Diluted Wines Proline
## 1 5.453548 1.0654839 3.163387 1100.2258
## 2 2.973077 1.0627077 2.803385 510.1692
## 3 7.234706 0.6919608 1.696667 619.0588
A confusion (truth) table is created to evaluate how well the model performed. The adjusted Rand index, via randIndex() from the flexclust package, is used to quantify the agreement between the categorical variable Class Identifier and the cluster assignment.
#Use a confusion or truth table to evaluate how well the k-Means analysis performed
ct.km <- table(WineData$`Class Identifier`, fit.km$cluster)
print(ct.km)
##
## 1 2 3
## 1 59 0 0
## 2 3 65 3
## 3 0 0 48
#Quantify the agreement between Class Identifier and cluster
library(flexclust) #randIndex() comes from the flexclust package
randIndex(ct.km)
## ARI
## 0.897495
The adjusted Rand index of roughly 0.9 is quite good, given that the scale ranges from -1 (no agreement) to 1 (perfect agreement).
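For intuition about the scale (a throwaway check, not part of the original analysis), shuffling the cluster labels should drive the index toward 0:
set.seed(1234)
shuffled <- sample(fit.km$cluster) #break any real association with the classes
randIndex(table(WineData$`Class Identifier`, shuffled)) #expected near 0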
clusplot(wine, fit.km$cluster, main='2D representation of the Cluster solution', color=TRUE, shade=TRUE, labels=2, lines=0)
Interpretation: the two components shown in the plot explain about 55% of the variability in the multivariate data.
We now train a classifier to predict the cluster assignments for the wine dataset (3 clusters). The confusion table above makes it clear that there were some misclassifications; once the model is built, we can check its accuracy to judge how significant those misclassifications are.
df <- data.frame(k=fit.km$cluster, wine)
str(df)
## 'data.frame': 178 obs. of 14 variables:
## $ k : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Alcohol : num 1.514 0.246 0.196 1.687 0.295 ...
## $ Malic.Acid : num -0.5607 -0.498 0.0212 -0.3458 0.2271 ...
## $ Ash : num 0.231 -0.826 1.106 0.487 1.835 ...
## $ Alcalinity.of.ash : num -1.166 -2.484 -0.268 -0.807 0.451 ...
## $ Magnesium : num 1.9085 0.0181 0.0881 0.9283 1.2784 ...
## $ Total.phenols : num 0.807 0.567 0.807 2.484 0.807 ...
## $ Flavanoids : num 1.032 0.732 1.212 1.462 0.661 ...
## $ Nonflavanoid.phenols: num -0.658 -0.818 -0.497 -0.979 0.226 ...
## $ Proanthocyanins : num 1.221 -0.543 2.13 1.029 0.4 ...
## $ Color.intensity : num 0.251 -0.292 0.268 1.183 -0.318 ...
## $ Hue : num 0.361 0.405 0.317 -0.426 0.361 ...
## $ Diluted.Wines : num 1.843 1.11 0.786 1.181 0.448 ...
## $ Proline : num 1.0102 0.9625 1.3912 2.328 -0.0378 ...
#Randomize the dataset
rdf <- df[sample(1:nrow(df)), ]
print(head(rdf))
## k Alcohol Malic.Acid Ash Alcalinity.of.ash Magnesium
## 93 2 -0.3826162 -0.7217931 -0.38826018 0.3608424 -1.3822227
## 69 2 0.4180475 -1.2499245 -0.02375431 -0.7470867 0.7182523
## 13 1 0.9230815 -0.5427655 0.15849862 -1.0465271 -0.7520802
## 57 1 1.5020229 -0.5696196 -0.24245783 -0.9566950 1.2783790
## 117 2 -1.4542737 -0.7755014 -1.37242601 0.3907864 -0.9621277
## 161 3 -0.7891070 1.3370245 0.04914686 0.4506745 -0.8220960
## Total.phenols Flavanoids Nonflavanoid.phenols Proanthocyanins
## 93 -1.462188745 -0.5699201 1.7528342 0.05084419
## 69 0.375309174 -0.7301029 1.5117800 -2.04574255
## 13 0.487156874 0.7315653 -0.5773564 0.38280376
## 57 1.445851440 0.9718395 -0.8184106 0.76717799
## 117 -0.503494178 -0.4297602 -0.4970050 -0.10639981
## 161 0.007809591 -1.1105371 1.1100230 -0.96250606
## Color.intensity Hue Diluted.Wines Proline
## 93 -0.8661960 0.01115870 -0.7770322 -0.799896093
## 69 -0.8144336 0.27365854 -0.9601332 0.009865569
## 13 0.2337547 0.84240820 0.4060824 1.819921051
## 57 0.5702101 -0.07634125 0.9835550 0.708483475
## 117 -1.3406845 -0.03259127 1.0117244 -0.799896093
## 161 1.1180287 -1.73884025 -1.4530976 -0.720507695
The wine dataset is then split 80/20 for training and testing.
#Split 80% of the shuffled rows for training and the rest for testing
winetrain <- rdf[1:(as.integer(.8*nrow(rdf))-1), ]
winetest <- rdf[(as.integer(.8*nrow(rdf))):nrow(rdf), ]
#Train the classifier and plot the resulting tree
library(rpart) #rpart() comes from the rpart package
library(rattle) #fancyRpartPlot() comes from the rattle package
fit <- rpart(k ~ ., data=winetrain, method="class")
fancyRpartPlot(fit)
The predictions and the accuracy of the model are then computed; on the held-out test set the accuracy is about 92%.
pred<-predict(fit, winetest, type="class")
print(table(pred,winetest$k))
##
## pred 1 2 3
## 1 11 1 0
## 2 0 12 1
## 3 0 1 11
p <- table(pred, winetest$k)
Accuracy <- sum(diag(p))/sum(p)*100 #percentage of correct predictions on the test set
Accuracy
## [1] 91.89189