K-means clustering is a method of unsupervised learning used when you have unlabeled data (i.e., data without defined categories or groups).
The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K. The algorithm works iteratively to assign each data point to one of K groups based on the features that are provided. Data points are clustered based on feature similarity. The results of the K-means clustering algorithm are:
The centroids of the K clusters, which can be used to label new data
Labels for the training data (each data point is assigned to a single cluster)
The K-means clustering algorithm is used to find groups which have not been explicitly labeled in the data. This can be used to confirm business assumptions about what types of groups exist or to identify unknown groups in complex data sets.
Once the algorithm has been run and the groups are defined, any new data can be easily assigned to the correct group.
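Before turning to real data, here is a minimal sketch of both outputs on made-up 2-D data (the values are illustrative only): kmeans() returns the centroids and the training labels, and the centroids can then be used to label a new point.
#Toy data: two well-separated groups in 2-D (illustrative only)
set.seed(42)
toy <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
             matrix(rnorm(40, mean = 5), ncol = 2))
km <- kmeans(toy, centers = 2)
km$centers #the centroids of the K clusters
km$cluster #a label for each training point
#Assign a new point to the cluster with the nearest centroid
new.point <- c(4.8, 5.1)
dists <- apply(km$centers, 1, function(ctr) sum((ctr - new.point)^2))
which.min(dists)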
To explore the k-means clustering technique, we use the Wholesale Customers dataset, which can be downloaded from https://archive.ics.uci.edu/ml/datasets/Wholesale+customers. The code is run in order to identify the best number of clusters, i.e., the best k value for this dataset.
The data is imported and then explored. Fortunately, there is no missing data.
library(readxl) #read_excel() comes from the readxl package
CustomerData <- read_excel("~/Harrisburg/MachineLearning/Lab 3/Wholesale customers data.csv.xlsx")
Customerdata <- data.frame(CustomerData)
summary(Customerdata)
## Channel Region Fresh Milk
## Min. :1.000 Min. :1.000 Min. : 3 Min. : 55
## 1st Qu.:1.000 1st Qu.:2.000 1st Qu.: 3128 1st Qu.: 1533
## Median :1.000 Median :3.000 Median : 8504 Median : 3627
## Mean :1.323 Mean :2.543 Mean : 12000 Mean : 5796
## 3rd Qu.:2.000 3rd Qu.:3.000 3rd Qu.: 16934 3rd Qu.: 7190
## Max. :2.000 Max. :3.000 Max. :112151 Max. :73498
## Grocery Frozen Detergents_Paper Delicassen
## Min. : 3 Min. : 25.0 Min. : 3.0 Min. : 3.0
## 1st Qu.: 2153 1st Qu.: 742.2 1st Qu.: 256.8 1st Qu.: 408.2
## Median : 4756 Median : 1526.0 Median : 816.5 Median : 965.5
## Mean : 7951 Mean : 3071.9 Mean : 2881.5 Mean : 1524.9
## 3rd Qu.:10656 3rd Qu.: 3554.2 3rd Qu.: 3922.0 3rd Qu.: 1820.2
## Max. :92780 Max. :60869.0 Max. :40827.0 Max. :47943.0
sum(is.na(Customerdata))
## [1] 0
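There are no missing values here, but if there were, one simple option (shown only for completeness; it is a no-op on this dataset) is complete-case filtering:
#Drop any rows containing missing values (a no-op here, since there are none)
Customerdata <- na.omit(Customerdata)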
With the function below, we identify the top 5 customers in each of the columns passed to it so they can be removed as outliers. (The two ID fields, Channel and Region, are dropped later, before the final clustering.)
top.n.custs <- function(data, cols, n=5)
{
  #Initialize a vector to hold the row indices of customers being removed
  idx.to.remove <- integer(0)
  for (c in cols) #For every column in the data passed to this function
  {
    #order() returns the sorted row indices (e.g. row 15, 3, 7, 1, ...)
    #rather than the sorted values themselves
    col.order <- order(data[,c], decreasing=TRUE) #Sort column "c" in descending order (biggest on top)
    idx <- head(col.order, n) #Take the indices of the first n rows of sorted column c
    #Combine and de-duplicate the row indices that need to be removed
    idx.to.remove <- union(idx.to.remove, idx)
  }
  #Return the indices of customers to be removed
  return(idx.to.remove)
}
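As a quick illustration of the building blocks used above (a throwaway vector, for demonstration only):
x <- c(10, 50, 30, 40, 20)
order(x, decreasing=TRUE) #returns the indices 2 4 3 5 1, not the values
head(order(x, decreasing=TRUE), 2) #indices of the top 2 values: 2 4
union(c(2, 4), c(4, 1)) #combined and de-duplicated: 2 4 1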
We then determine how many customers will be removed, using the function defined above.
#How Many Customers to be Removed?
top.custs <-top.n.custs(Customerdata, cols = 1:5,n=5)
length(top.custs)
## [1] 18
#Examine the customers flagged for removal
Customerdata[top.custs,]
## Channel Region Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 2 3 12669 9656 7561 214 2674 1338
## 2 2 3 7057 9810 9568 1762 3293 1776
## 3 2 3 6353 8808 7684 2405 3516 7844
## 5 2 3 22615 5410 7198 3915 1777 5185
## 6 2 3 9413 8259 5126 666 1795 1451
## 4 1 3 13265 1196 4221 6404 507 1788
## 182 1 3 112151 29627 18148 16745 4948 8550
## 126 1 3 76237 3473 7102 16538 778 918
## 285 1 3 68951 4411 12609 8692 751 2406
## 40 1 3 56159 555 902 10002 212 2916
## 259 1 1 56083 4563 2124 6422 730 3321
## 87 2 3 22925 73498 32114 987 20070 903
## 48 2 3 44466 54259 55571 7782 24171 6465
## 86 2 3 16117 46197 92780 1026 40827 2944
## 184 1 3 36847 43950 20170 36534 239 47943
## 62 2 3 35942 38369 59598 3254 26701 2017
## 334 2 2 8565 4980 67298 131 38102 1215
## 66 2 3 85 20959 45828 36 24231 1423
#Remove these customers from the dataset
data.rm.top <- Customerdata[-top.custs, ]
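A quick sanity check on the removal (the Wholesale Customers dataset has 440 rows, and 18 indices were found above):
nrow(Customerdata) #440 customers before removal
nrow(data.rm.top) #440 - 18 = 422 customers remaining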
We can now examine the summary statistics of the remaining data.
#Examine summary stats for the remaining data
print(summary(data.rm.top))
## Channel Region Fresh Milk
## Min. :1.00 Min. :1.000 Min. : 3 Min. : 55
## 1st Qu.:1.00 1st Qu.:2.000 1st Qu.: 3072 1st Qu.: 1497
## Median :1.00 Median :3.000 Median : 8130 Median : 3582
## Mean :1.31 Mean :2.531 Mean :11076 Mean : 5172
## 3rd Qu.:2.00 3rd Qu.:3.000 3rd Qu.:16251 3rd Qu.: 6962
## Max. :2.00 Max. :3.000 Max. :56082 Max. :36423
## Grocery Frozen Detergents_Paper Delicassen
## Min. : 3 Min. : 25.0 Min. : 3.0 Min. : 3.0
## 1st Qu.: 2132 1st Qu.: 738.8 1st Qu.: 255.2 1st Qu.: 398.0
## Median : 4603 Median : 1487.5 Median : 799.5 Median : 904.5
## Mean : 7211 Mean : 2910.3 Mean : 2541.6 Mean : 1352.0
## 3rd Qu.:10391 3rd Qu.: 3428.0 3rd Qu.: 3879.2 3rd Qu.: 1752.2
## Max. :39694 Max. :60869.0 Max. :19410.0 Max. :16523.0
We can now try a range of k values and compare the within-cluster and between-cluster sums of squares to help us select the best k.
set.seed(76964057) #Set the seed for reproducibility
#Try k from 2 to 20
rng <- 2:20
#Number of times to run the k-means algorithm for each k
tries <- 100
#Set up empty vectors to hold the averaged results
avg.totw.ss <- integer(length(rng))
avg.totb.ss <- integer(length(rng))
avg.tot.ss <- integer(length(rng))
#For each value of k in the range
for(v in rng){
  #Set up empty vectors to hold the results of each try
  v.totw.ss <- integer(tries)
  b.totb.ss <- integer(tries)
  tot.ss <- integer(tries)
  #Run k-means "tries" times for this k
  for(i in 1:tries){
    k.temp <- kmeans(data.rm.top, centers=v)
    #Store the total within-cluster sum of squares
    v.totw.ss[i] <- k.temp$tot.withinss
    #Store the between-cluster sum of squares
    b.totb.ss[i] <- k.temp$betweenss
    #Store the total sum of squares
    tot.ss[i] <- k.temp$totss
  }
  #Average the within and between sums of squares over the tries
  avg.totw.ss[v-1] <- mean(v.totw.ss)
  avg.totb.ss[v-1] <- mean(b.totb.ss)
  avg.tot.ss[v-1] <- mean(tot.ss)
}
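A useful identity for sanity-checking the loop: every kmeans() fit satisfies totss = tot.withinss + betweenss, so the averaged vectors should satisfy it as well, up to floating-point tolerance:
#The total SS decomposes into the within and between parts
all.equal(avg.totw.ss + avg.totb.ss, avg.tot.ss) #expected TRUE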
plot(rng,avg.totw.ss,type="b", main="Total Within SS by Various K",
ylab="Average Total Within Sum of Squares",
xlab="Value of K")
At this point the graph bends (the "elbow") and stops making meaningful gains around k = 5.
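To make the elbow judgment less purely visual, one can tabulate the percentage drop in the average within-SS at each step (a convenience sketch built on the avg.totw.ss vector computed above); the gains should flatten once the elbow is passed:
#Percentage improvement in average within-SS when moving from k to k+1
improvement <- -diff(avg.totw.ss) / head(avg.totw.ss, -1) * 100
round(data.frame(k = rng[-1], pct.gain = improvement), 1)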
plot(rng,avg.totb.ss,type="b", main="Total between SS by Various K",
ylab="Average Total Between Sum of Squares",
xlab="Value of K")
#Plot the ratio of betweenss/total ss and withinss / total ss for evaluation
plot(rng,avg.totb.ss/avg.tot.ss,type="b", main="Ratio of between ss / the total ss by Various K",
ylab="Ratio Between SS / Total SS",
xlab="Value of K")
abline(h=0.85, col="red")
plot(rng,avg.totw.ss/avg.tot.ss,type="b", main="Ratio of within ss / the total ss by Various K",
ylab="Ratio Within SS / Total SS",
xlab="Value of K")
abline(h=0.15, col="red")
We then run k-means with the chosen number of clusters.
#Run k-means with the chosen number of clusters, removing columns 1 and 2 (Channel and Region)
n <- 3
k <- kmeans(data.rm.top[,-c(1,2)], centers=n)
Below is a printout of the analysis with 3 clusters.
#Display cluster centers
print(k$centers)
## Fresh Milk Grocery Frozen Detergents_Paper Delicassen
## 1 5165.797 12491.532 19078.823 1644.633 8395.608 1843.6709
## 2 6623.979 3210.054 4070.158 2569.363 1221.833 973.7792
## 3 25984.252 4127.505 5426.165 4675.359 1126.621 1856.1456
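To make the centers easier to read, one can list each cluster's highest-spending categories (a convenience sketch, not part of the original analysis):
#For each cluster center, the two categories with the highest average spend
apply(k$centers, 1, function(ctr) names(sort(ctr, decreasing=TRUE))[1:2])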
#Give a count of data points in each cluster
print(table(k$cluster))
##
## 1 2 3
## 79 240 103
#Generate a plot of the clusters
library(cluster)
clusplot(data.rm.top, k$cluster, main='2D representation of the Cluster solution',
color=TRUE, shade=TRUE,
labels=2, lines=0)
Interpretation: the two components shown in the cluster plot explain about 60% of the variability in the multivariate data.
Next, we perform a k-means analysis on the Wine dataset from UCI: https://archive.ics.uci.edu/ml/datasets/wine. The dataset has 13 variables with 178 measurements.
The dataset is imported and checked for missing values.
library(readxl)
WineData <- read_excel("~/Harrisburg/MachineLearning/Lab 3/WineDataset.xlsx")
##Exploring the data by checking for missing data
sum(is.na(WineData))
## [1] 0
summary(WineData)
## Class Identifier Alcohol Malic Acid Ash
## Min. :1.000 Min. :11.03 Min. :0.740 Min. :1.360
## 1st Qu.:1.000 1st Qu.:12.36 1st Qu.:1.603 1st Qu.:2.210
## Median :2.000 Median :13.05 Median :1.865 Median :2.360
## Mean :1.938 Mean :13.00 Mean :2.336 Mean :2.367
## 3rd Qu.:3.000 3rd Qu.:13.68 3rd Qu.:3.083 3rd Qu.:2.558
## Max. :3.000 Max. :14.83 Max. :5.800 Max. :3.230
## Alcalinity of ash Magnesium Total phenols Flavanoids
## Min. :10.60 Min. : 70.00 Min. :0.980 Min. :0.340
## 1st Qu.:17.20 1st Qu.: 88.00 1st Qu.:1.742 1st Qu.:1.205
## Median :19.50 Median : 98.00 Median :2.355 Median :2.135
## Mean :19.49 Mean : 99.74 Mean :2.295 Mean :2.029
## 3rd Qu.:21.50 3rd Qu.:107.00 3rd Qu.:2.800 3rd Qu.:2.875
## Max. :30.00 Max. :162.00 Max. :3.880 Max. :5.080
## Nonflavanoid phenols Proanthocyanins Color intensity Hue
## Min. :0.1300 Min. :0.410 Min. : 1.280 Min. :0.4800
## 1st Qu.:0.2700 1st Qu.:1.250 1st Qu.: 3.220 1st Qu.:0.7825
## Median :0.3400 Median :1.555 Median : 4.690 Median :0.9650
## Mean :0.3619 Mean :1.591 Mean : 5.058 Mean :0.9574
## 3rd Qu.:0.4375 3rd Qu.:1.950 3rd Qu.: 6.200 3rd Qu.:1.1200
## Max. :0.6600 Max. :3.580 Max. :13.000 Max. :1.7100
## Diluted Wines Proline
## Min. :1.270 Min. : 278.0
## 1st Qu.:1.938 1st Qu.: 500.5
## Median :2.780 Median : 673.5
## Mean :2.612 Mean : 746.9
## 3rd Qu.:3.170 3rd Qu.: 985.0
## Max. :4.000 Max. :1680.0
head(WineData)
## # A tibble: 6 x 14
## `Class Identifi~ Alcohol `Malic Acid` Ash `Alcalinity of ~ Magnesium
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 14.2 1.71 2.43 15.6 127
## 2 1 13.2 1.78 2.14 11.2 100
## 3 1 13.2 2.36 2.67 18.6 101
## 4 1 14.4 1.95 2.5 16.8 113
## 5 1 13.2 2.59 2.87 21 118
## 6 1 14.2 1.76 2.45 15.2 112
## # ... with 8 more variables: `Total phenols` <dbl>, Flavanoids <dbl>,
## # `Nonflavanoid phenols` <dbl>, Proanthocyanins <dbl>, `Color
## # intensity` <dbl>, Hue <dbl>, `Diluted Wines` <dbl>, Proline <dbl>
Plot the within-cluster sum of squares to determine an initial value for "k", using the wssplot function defined below; the NbClust function is used afterward.
wssplot <- function(data, nc=15, seed=1234)
{
  #For k = 1, the within-groups SS equals the total SS,
  #i.e. (n - 1) times the sum of the column variances
  wss <- (nrow(data)-1)*sum(apply(data,2,var))
  for (i in 2:nc)
  {
    set.seed(seed)
    wss[i] <- sum(kmeans(data, centers=i)$withinss)
  }
  plot(1:nc, wss, type="b", xlab="Number of Clusters",
       ylab="Within groups sum of squares")
}
The first column is removed since it is a categorical variable, the remaining variables are standardized with scale(), and the result is run through the function created above.
##Remove the first column which is categorical
wine<- scale(WineData[-1])
wssplot(wine)
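As a quick sanity check on the standardization (illustrative only), every scaled column should have mean approximately 0 and standard deviation 1:
round(colMeans(wine), 3) #all approximately 0
round(apply(wine, 2, sd), 3) #all 1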
NbClust() from the NbClust package is then used to recommend the number of clusters; its results are stored in the variable "nc".
#Determine the recommended number of clusters with NbClust
library(NbClust) #NbClust() comes from the NbClust package
set.seed(1234)
nc <- NbClust(wine, min.nc=2, max.nc = 15, method = "kmeans")
## *** : The Hubert index is a graphical method of determining the number of clusters.
## In the plot of Hubert index, we seek a significant knee that corresponds to a
## significant increase of the value of the measure i.e the significant peak in Hubert
## index second differences plot.
##
## *** : The D index is a graphical method of determining the number of clusters.
## In the plot of D index, we seek a significant knee (the significant peak in Dindex
## second differences plot) that corresponds to a significant increase of the value of
## the measure.
##
## *******************************************************************
## * Among all indices:
## * 4 proposed 2 as the best number of clusters
## * 15 proposed 3 as the best number of clusters
## * 1 proposed 10 as the best number of clusters
## * 1 proposed 12 as the best number of clusters
## * 1 proposed 14 as the best number of clusters
## * 1 proposed 15 as the best number of clusters
##
## ***** Conclusion *****
##
## * According to the majority rule, the best number of clusters is 3
##
##
## *******************************************************************
print(table(nc$Best.n[1,]))
##
## 0 1 2 3 10 12 14 15
## 2 1 4 15 1 1 1 1
From the barplot below, 3 would be the appropriate number of clusters to choose.
barplot(table(nc$Best.n[1,]),
        xlab = "Number of Clusters", ylab = "Number of Criteria",
        main = "Number of Clusters Chosen by 26 Criteria")
n <- 3 #Best number of clusters according to NbClust
#Conduct the k-Means analysis using the best number of clusters
set.seed(1234)
fit.km <- kmeans(wine, n, nstart=25)
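The nstart=25 argument asks kmeans() to try 25 random sets of starting centers and keep the solution with the lowest total within-cluster SS, which guards against a poor local optimum. A quick illustration (results vary from run to run, since the starts are random):
#A single random start can land in a worse local optimum;
#nstart=25 keeps the best of 25 starts
kmeans(wine, n, nstart=1)$tot.withinss
kmeans(wine, n, nstart=25)$tot.withinss #typically less than or equal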
print(fit.km$size)
## [1] 62 65 51
print(fit.km$centers)
## Alcohol Malic Acid Ash Alcalinity of ash Magnesium
## 1 0.8328826 -0.3029551 0.3636801 -0.6084749 0.57596208
## 2 -0.9234669 -0.3929331 -0.4931257 0.1701220 -0.49032869
## 3 0.1644436 0.8690954 0.1863726 0.5228924 -0.07526047
## Total phenols Flavanoids Nonflavanoid phenols Proanthocyanins
## 1 0.88274724 0.97506900 -0.56050853 0.57865427
## 2 -0.07576891 0.02075402 -0.03343924 0.05810161
## 3 -0.97657548 -1.21182921 0.72402116 -0.77751312
## Color intensity Hue Diluted Wines Proline
## 1 0.1705823 0.4726504 0.7770551 1.1220202
## 2 -0.8993770 0.4605046 0.2700025 -0.7517257
## 3 0.9388902 -1.1615122 -1.2887761 -0.4059428
print(aggregate(WineData[-1], by=list(cluster=fit.km$cluster), mean))
## cluster Alcohol Malic Acid Ash Alcalinity of ash Magnesium
## 1 1 13.67677 1.997903 2.466290 17.46290 107.96774
## 2 2 12.25092 1.897385 2.231231 20.06308 92.73846
## 3 3 13.13412 3.307255 2.417647 21.24118 98.66667
## Total phenols Flavanoids Nonflavanoid phenols Proanthocyanins
## 1 2.847581 3.0032258 0.2920968 1.922097
## 2 2.247692 2.0500000 0.3576923 1.624154
## 3 1.683922 0.8188235 0.4519608 1.145882
## Color intensity Hue Diluted Wines Proline
## 1 5.453548 1.0654839 3.163387 1100.2258
## 2 2.973077 1.0627077 2.803385 510.1692
## 3 7.234706 0.6919608 1.696667 619.0588
A confusion (truth) table is created to evaluate how well the model performed. The adjusted Rand index, via randIndex() from the flexclust package, is used to quantify the agreement between the categorical variable Class Identifier and the cluster assignment.
#Use a confusion or truth table to evaluate how well the k-Means analysis performed
ct.km <- table(WineData$`Class Identifier`, fit.km$cluster)
print(ct.km)
##
## 1 2 3
## 1 59 0 0
## 2 3 65 3
## 3 0 0 48
#Quantify the agreement between Class Identifier and cluster
library(flexclust) #randIndex() comes from the flexclust package
randIndex(ct.km)
## ARI
## 0.897495
The adjusted Rand index of roughly 0.9 is quite good, given that the scale ranges from -1 (no agreement) to 1 (perfect agreement).
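For intuition about the scale (a throwaway check, not part of the original analysis), shuffling the cluster labels should drive the index toward 0:
set.seed(1234)
shuffled <- sample(fit.km$cluster) #break any real association with the classes
randIndex(table(WineData$`Class Identifier`, shuffled)) #expected near 0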
clusplot(wine, fit.km$cluster, main='2D representation of the Cluster solution', color=TRUE, shade=TRUE, labels=2, lines=0)
Interpretation: the two components shown in the plot explain about 55% of the variability in the multivariate data.
We now train a classifier to predict the cluster assignments for the wine dataset (3 clusters). The confusion table above makes it clear that there were some misclassifications; once the model is built, we can check its accuracy to judge how significant those misclassifications are.
df <- data.frame(k=fit.km$cluster, wine)
str(df)
## 'data.frame': 178 obs. of 14 variables:
## $ k : int 1 1 1 1 1 1 1 1 1 1 ...
## $ Alcohol : num 1.514 0.246 0.196 1.687 0.295 ...
## $ Malic.Acid : num -0.5607 -0.498 0.0212 -0.3458 0.2271 ...
## $ Ash : num 0.231 -0.826 1.106 0.487 1.835 ...
## $ Alcalinity.of.ash : num -1.166 -2.484 -0.268 -0.807 0.451 ...
## $ Magnesium : num 1.9085 0.0181 0.0881 0.9283 1.2784 ...
## $ Total.phenols : num 0.807 0.567 0.807 2.484 0.807 ...
## $ Flavanoids : num 1.032 0.732 1.212 1.462 0.661 ...
## $ Nonflavanoid.phenols: num -0.658 -0.818 -0.497 -0.979 0.226 ...
## $ Proanthocyanins : num 1.221 -0.543 2.13 1.029 0.4 ...
## $ Color.intensity : num 0.251 -0.292 0.268 1.183 -0.318 ...
## $ Hue : num 0.361 0.405 0.317 -0.426 0.361 ...
## $ Diluted.Wines : num 1.843 1.11 0.786 1.181 0.448 ...
## $ Proline : num 1.0102 0.9625 1.3912 2.328 -0.0378 ...
#Randomize the dataset
rdf <- df[sample(1:nrow(df)), ]
print(head(rdf))
## k Alcohol Malic.Acid Ash Alcalinity.of.ash Magnesium
## 93 2 -0.3826162 -0.7217931 -0.38826018 0.3608424 -1.3822227
## 69 2 0.4180475 -1.2499245 -0.02375431 -0.7470867 0.7182523
## 13 1 0.9230815 -0.5427655 0.15849862 -1.0465271 -0.7520802
## 57 1 1.5020229 -0.5696196 -0.24245783 -0.9566950 1.2783790
## 117 2 -1.4542737 -0.7755014 -1.37242601 0.3907864 -0.9621277
## 161 3 -0.7891070 1.3370245 0.04914686 0.4506745 -0.8220960
## Total.phenols Flavanoids Nonflavanoid.phenols Proanthocyanins
## 93 -1.462188745 -0.5699201 1.7528342 0.05084419
## 69 0.375309174 -0.7301029 1.5117800 -2.04574255
## 13 0.487156874 0.7315653 -0.5773564 0.38280376
## 57 1.445851440 0.9718395 -0.8184106 0.76717799
## 117 -0.503494178 -0.4297602 -0.4970050 -0.10639981
## 161 0.007809591 -1.1105371 1.1100230 -0.96250606
## Color.intensity Hue Diluted.Wines Proline
## 93 -0.8661960 0.01115870 -0.7770322 -0.799896093
## 69 -0.8144336 0.27365854 -0.9601332 0.009865569
## 13 0.2337547 0.84240820 0.4060824 1.819921051
## 57 0.5702101 -0.07634125 0.9835550 0.708483475
## 117 -1.3406845 -0.03259127 1.0117244 -0.799896093
## 161 1.1180287 -1.73884025 -1.4530976 -0.720507695
The wine dataset is then split 80/20 for training and testing.
#Split 80% of the shuffled rows for training and the rest for testing
winetrain <- rdf[1:(as.integer(.8*nrow(rdf))-1), ]
winetest <- rdf[(as.integer(.8*nrow(rdf))):nrow(rdf), ]
#Train the classifier and plot the resulting tree
library(rpart) #rpart() comes from the rpart package
library(rattle) #fancyRpartPlot() comes from the rattle package
fit <- rpart(k ~ ., data=winetrain, method="class")
fancyRpartPlot(fit)
The predictions and the accuracy of the model are then computed; on the held-out test set the accuracy is about 92%.
pred<-predict(fit, winetest, type="class")
print(table(pred,winetest$k))
##
## pred 1 2 3
## 1 11 1 0
## 2 0 12 1
## 3 0 1 11
p <- table(pred, winetest$k)
Accuracy <- sum(diag(p))/sum(p)*100 #percentage of correct predictions on the test set
Accuracy
## [1] 91.89189