library(stats)
library(apcluster)
library(FactoMineR)
library(mclust)
library(cluster)
library(dendextend)
library(vegan)
library(dplyr)
library(ClusterR)
library(flexclust)
library(tidyverse)
library(clustertend)
library(fpc)
library(NbClust)
library(mice)
library(data.table)
library(factoextra)
library(ggplot2)
library(rstatix)
library(ggcorrplot)

INTRODUCTION

I divided the process into 7 sections:

  1. Reading data, EDA and data manipulation
  2. Clusterability of the data
  3. Determining the optimal number of clusters
  4. Applying KMeans and PAM
  5. Quality measurement
  6. Hierarchical clustering
  7. Affinity Propagation (AP)

### 1- Reading data, EDA and data manipulation

The data was provided here.

# Setting up the working directory
setwd("D:\\SOFTS\\Unsupervised Learning\\Clustering")

cust <- read.csv("Cust_Segmentation.csv")
# Inspecting the dataset
str(cust)
## 'data.frame':    850 obs. of  10 variables:
##  $ Customer.Id    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Age            : int  41 47 33 29 47 40 38 42 26 47 ...
##  $ Edu            : int  2 1 2 2 1 1 2 3 1 3 ...
##  $ Years.Employed : int  6 26 10 4 31 23 4 0 5 23 ...
##  $ Income         : int  19 100 57 19 253 81 56 64 18 115 ...
##  $ Card.Debt      : num  0.124 4.582 6.111 0.681 9.308 ...
##  $ Other.Debt     : num  1.073 8.218 5.802 0.516 8.908 ...
##  $ Defaulted      : int  0 0 1 0 0 NA 0 0 NA 0 ...
##  $ Address        : chr  "NBA001" "NBA021" "NBA013" "NBA009" ...
##  $ DebtIncomeRatio: num  6.3 12.8 20.9 6.3 7.2 10.9 1.6 6.6 15.5 4 ...
sum(is.na(cust))
## [1] 150
table(cust$Defaulted, useNA = "ifany")
## 
##    0    1 <NA> 
##  517  183  150
summary(cust)
##   Customer.Id         Age             Edu        Years.Employed  
##  Min.   :  1.0   Min.   :20.00   Min.   :1.000   Min.   : 0.000  
##  1st Qu.:213.2   1st Qu.:29.00   1st Qu.:1.000   1st Qu.: 3.000  
##  Median :425.5   Median :34.00   Median :1.000   Median : 7.000  
##  Mean   :425.5   Mean   :35.03   Mean   :1.711   Mean   : 8.566  
##  3rd Qu.:637.8   3rd Qu.:41.00   3rd Qu.:2.000   3rd Qu.:13.000  
##  Max.   :850.0   Max.   :56.00   Max.   :5.000   Max.   :33.000  
##                                                                  
##      Income         Card.Debt         Other.Debt       Defaulted     
##  Min.   : 13.00   Min.   : 0.0120   Min.   : 0.046   Min.   :0.0000  
##  1st Qu.: 24.00   1st Qu.: 0.3825   1st Qu.: 1.046   1st Qu.:0.0000  
##  Median : 35.00   Median : 0.8850   Median : 2.003   Median :0.0000  
##  Mean   : 46.68   Mean   : 1.5768   Mean   : 3.079   Mean   :0.2614  
##  3rd Qu.: 55.75   3rd Qu.: 1.8985   3rd Qu.: 3.903   3rd Qu.:1.0000  
##  Max.   :446.00   Max.   :20.5610   Max.   :35.197   Max.   :1.0000  
##                                                      NA's   :150     
##    Address          DebtIncomeRatio
##  Length:850         Min.   : 0.10  
##  Class :character   1st Qu.: 5.10  
##  Mode  :character   Median : 8.70  
##                     Mean   :10.17  
##                     3rd Qu.:13.80  
##                     Max.   :41.30  
## 
dim(cust)
## [1] 850  10

We have NA values in “Defaulted”, and since their proportion is high (about 17%), we can’t simply drop them. Instead, we try to impute them using what we find in the other columns. By definition, a default is a failure to repay, so it may be related to debt.
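As a quick check of that proportion directly in code:

mean(is.na(cust$Defaulted)) # 150 / 850 ≈ 0.176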

# Let's create a column for total debts
cust$Tot.Debt <- cust$Card.Debt + cust$Other.Debt
# Omitting NAs for now for the purpose of plotting
ggplot(na.omit(cust), aes(x= Defaulted,y = Tot.Debt)) + 
  geom_bar(stat = "identity",fill = "darkolivegreen3") + theme_grey() 

At the moment we are not interested in “Address” or “Customer.Id” (which is identical to the row number). An alternative would be to perform one-hot encoding on “Address”, as sketched below.
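For reference, a minimal sketch of how such an encoding could look (addr_dummies is a hypothetical name; this matrix is not used in the rest of the analysis):

# One-hot encode Address into 0/1 indicator columns, one per neighborhood
addr_dummies <- model.matrix(~ Address - 1, data = cust)
dim(addr_dummies) # 850 rows, one column per unique Address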

unique(cust$Address)
##  [1] "NBA001" "NBA021" "NBA013" "NBA009" "NBA008" "NBA016" "NBA006" "NBA011"
##  [9] "NBA010" "NBA003" "NBA000" "NBA019" "NBA005" "NBA004" "NBA022" "NBA018"
## [17] "NBA002" "NBA007" "NBA026" "NBA020" "NBA012" "NBA014" "NBA024" "NBA015"
## [25] "NBA017" "NBA023" "NBA025" "NBA027" "NBA031" "NBA030" "NBA034" "NBA029"
cust$Customer.Id  <- NULL
#cust$Address <- NULL

# Let's see if we can find a better indicator for imputation
corr <- cor(na.omit(cust[,-c(8)]))
head(corr[, c(1:6,8,9)])
##                      Age         Edu Years.Employed    Income  Card.Debt
## Age            1.0000000  0.02232500      0.5364968 0.4787099 0.29521432
## Edu            0.0223250  1.00000000     -0.1536208 0.2351905 0.08827721
## Years.Employed 0.5364968 -0.15362077      1.0000000 0.6196813 0.40369784
## Income         0.4787099  0.23519050      0.6196813 1.0000000 0.57019584
## Card.Debt      0.2952143  0.08827721      0.4036978 0.5701958 1.00000000
## Other.Debt     0.3402130  0.16545833      0.4060894 0.6106627 0.63310841
##                Other.Debt DebtIncomeRatio  Tot.Debt
## Age             0.3402130     0.016398077 0.3551216
## Edu             0.1654583     0.008838431 0.1488629
## Years.Employed  0.4060894    -0.031182215 0.4460161
## Income          0.6106627    -0.026777293 0.6548030
## Card.Debt       0.6331084     0.501772450 0.8551813
## Other.Debt      1.0000000     0.584867409 0.9426418
p.mat <- cor_pmat(cust[,-c(8)])
p.mat
##                          Age          Edu Years.Employed       Income
## Age             0.000000e+00 7.054521e-01   1.302892e-69 2.498712e-49
## Edu             7.054521e-01 0.000000e+00   9.656215e-06 1.272493e-10
## Years.Employed  1.302892e-69 9.656215e-06   0.000000e+00 2.494527e-93
## Income          2.498712e-49 1.272493e-10   2.494527e-93 0.000000e+00
## Card.Debt       1.185165e-16 3.833610e-03   7.073432e-31 8.277685e-69
## Other.Debt      3.879856e-24 3.730578e-05   1.311358e-36 2.046181e-85
## Defaulted       2.592350e-04 2.376509e-03   2.347198e-14 6.056036e-02
## DebtIncomeRatio 8.104186e-01 8.146335e-01   3.274991e-01 3.000762e-01
## Tot.Debt        3.032113e-25 6.248996e-05   1.124862e-41 5.510299e-99
##                     Card.Debt    Other.Debt    Defaulted DebtIncomeRatio
## Age              1.185165e-16  3.879856e-24 2.592350e-04    8.104186e-01
## Edu              3.833610e-03  3.730578e-05 2.376509e-03    8.146335e-01
## Years.Employed   7.073432e-31  1.311358e-36 2.347198e-14    3.274991e-01
## Income           8.277685e-69  2.046181e-85 6.056036e-02    3.000762e-01
## Card.Debt        0.000000e+00 3.951032e-101 5.254008e-11    9.582665e-59
## Other.Debt      3.951032e-101  0.000000e+00 1.092506e-04    3.435477e-75
## Defaulted        5.254008e-11  1.092506e-04 0.000000e+00    8.657458e-27
## DebtIncomeRatio  9.582665e-59  3.435477e-75 8.657458e-27    0.000000e+00
## Tot.Debt        2.839299e-246  0.000000e+00 5.930600e-08    1.909865e-85
##                      Tot.Debt
## Age              3.032113e-25
## Edu              6.248996e-05
## Years.Employed   1.124862e-41
## Income           5.510299e-99
## Card.Debt       2.839299e-246
## Other.Debt       0.000000e+00
## Defaulted        5.930600e-08
## DebtIncomeRatio  1.909865e-85
## Tot.Debt         0.000000e+00
ggcorrplot(corr, lab = T)


It turns out that “Years.Employed” and “DebtIncomeRatio” have the strongest negative and positive correlations with “Defaulted”, respectively, which makes sense: the longer a customer has been employed, the less likely they are to default, and the lower their debt relative to income, the lower their probability of defaulting. Let’s use these columns, as well as another imputation method, to fill the missing values in Defaulted.
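We can read both correlations straight off the correlation matrix computed above (the Defaulted row was not shown by head()):

corr["Defaulted", c("Years.Employed", "DebtIncomeRatio")]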

def_emp <- aggregate(data = cust, Defaulted ~ Years.Employed, mean, na.rm = TRUE)
def_debt <- aggregate(data = cust, Defaulted ~ DebtIncomeRatio, mean, na.rm = TRUE)
ggplot(def_emp,aes(x = Years.Employed, y = Defaulted)) + geom_line(linetype = "dashed")+
  geom_point(size = 2, shape = 21, fill = "darkolivegreen3")+ 
  ggtitle("Defaulting based on employed years")+xlab("Years employed")+ylab("Defaulted")+theme_grey() + 
  geom_vline(xintercept = 15, linetype="dashed",color = "red", size=.5) 

# After around 15 years of employment, the default rate decreases

ggplot(def_debt,aes(x = DebtIncomeRatio, y = Defaulted))+
  geom_point(size = 2, shape = 21, fill = "darkolivegreen3")+ 
  ggtitle("Defaulting based on debt income ratio")+xlab("Debt income ratio")+ylab("Defaulted")+theme_grey() + 
  geom_vline(xintercept = 22, linetype="dashed",color = "red", size=.5) 

# Apparently, once the ratio exceeds 22, the customer is more likely to default

# Another way would be to use the mice package to impute the Defaulted column based on all the other columns
cust2 <- cust # A copy of original df
res <- mice(cust2, m=1, maxit = 50, method = 'pmm', seed = 500)
## 
##  iter imp variable
##   1   1  Defaulted
##   2   1  Defaulted
##   3   1  Defaulted
##   4   1  Defaulted
##   5   1  Defaulted
##   6   1  Defaulted
##   7   1  Defaulted
##   8   1  Defaulted
##   9   1  Defaulted
##   10   1  Defaulted
##   11   1  Defaulted
##   12   1  Defaulted
##   13   1  Defaulted
##   14   1  Defaulted
##   15   1  Defaulted
##   16   1  Defaulted
##   17   1  Defaulted
##   18   1  Defaulted
##   19   1  Defaulted
##   20   1  Defaulted
##   21   1  Defaulted
##   22   1  Defaulted
##   23   1  Defaulted
##   24   1  Defaulted
##   25   1  Defaulted
##   26   1  Defaulted
##   27   1  Defaulted
##   28   1  Defaulted
##   29   1  Defaulted
##   30   1  Defaulted
##   31   1  Defaulted
##   32   1  Defaulted
##   33   1  Defaulted
##   34   1  Defaulted
##   35   1  Defaulted
##   36   1  Defaulted
##   37   1  Defaulted
##   38   1  Defaulted
##   39   1  Defaulted
##   40   1  Defaulted
##   41   1  Defaulted
##   42   1  Defaulted
##   43   1  Defaulted
##   44   1  Defaulted
##   45   1  Defaulted
##   46   1  Defaulted
##   47   1  Defaulted
##   48   1  Defaulted
##   49   1  Defaulted
##   50   1  Defaulted
## Warning: Number of logged events: 51
Defaulted_mice <- res$imp$Defaulted
# Replace each missing value with its mice-imputed counterpart
na_idx <- which(is.na(cust2$Defaulted))
cust2$Defaulted[na_idx] <- Defaulted_mice[as.character(na_idx), ]
sum(is.na(cust2$Defaulted))
## [1] 0
# Using employed years to impute our missing values in "Defaulted"
cust$Defaulted <- ifelse(!is.na(cust$Defaulted), cust$Defaulted,
                         ifelse(cust$Years.Employed >= 15, 0, 1))
sum(is.na(cust$Defaulted))
## [1] 0
# Now comparing these two imputed values
compare <- data.frame(cust2$Defaulted, cust$Defaulted)
colnames(compare) <- c("mice",'empirical')
table(compare)
##     empirical
## mice   0   1
##    0 550  87
##    1   3 210
# We can't formally compare the accuracy of the two imputations here (there is no ground truth to validate against), but mice draws on all columns, so we continue with the mice imputation
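# As a quick check, the two imputations agree on (550 + 210) / 850 ≈ 89% of rows:
mean(compare$mice == compare$empirical)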
cust <- cust2
table(cust$Address[cust$Defaulted == 1]) %>%
  as.data.frame() %>% 
  arrange(desc(Freq)) # Most of the customers who defaulted live in neighborhood NBA000
##      Var1 Freq
## 1  NBA000   26
## 2  NBA002   24
## 3  NBA003   23
## 4  NBA001   18
## 5  NBA006   18
## 6  NBA004   15
## 7  NBA005   14
## 8  NBA008   12
## 9  NBA011   11
## 10 NBA009    6
## 11 NBA012    6
## 12 NBA014    6
## 13 NBA007    5
## 14 NBA010    5
## 15 NBA013    5
## 16 NBA015    5
## 17 NBA018    4
## 18 NBA016    2
## 19 NBA026    2
## 20 NBA017    1
## 21 NBA019    1
## 22 NBA020    1
## 23 NBA021    1
## 24 NBA023    1
## 25 NBA029    1
# Scaling the numeric columns (z-score standardization)
cust_scaled <- scale(cust[,-c(8)])
cust_scaled <- as.data.frame(cust_scaled)
cust_scaled$Address <- cust$Address
# Head of our dataframe so far
knitr::kable(head(cust_scaled), format="markdown")
| Age | Edu | Years.Employed | Income | Card.Debt | Other.Debt | Defaulted | DebtIncomeRatio | Tot.Debt | Address |
|-----|-----|----------------|--------|-----------|------------|-----------|-----------------|----------|---------|
| 0.7424783 | 0.3119388 | -0.3785669 | -0.7180358 | -0.6834088 | -0.5901417 | -0.5779157 | -0.5761859 | -0.6863713 | NBA001 |
| 1.4886141 | -0.7658984 | 2.5722067 | 1.3835101 | 1.4136414 | 1.5120716 | -0.5779157 | 0.3911565 | 1.6162894 | NBA021 |
| -0.2523695 | 0.3119388 | 0.2115878 | 0.2678746 | 2.1328854 | 0.8012322 | 1.7283206 | 1.5966138 | 1.4402608 | NBA013 |
| -0.7497933 | 0.3119388 | -0.6736443 | -0.7180358 | -0.4213951 | -0.7540231 | -0.5779157 | -0.5761859 | -0.6863713 | NBA009 |
| 1.4886141 | -0.7658984 | 3.3099001 | 5.3530970 | 3.6367592 | 1.7150845 | -0.5779157 | -0.4422462 | 2.6911158 | NBA008 |
| 0.6181223 | -0.7658984 | 2.1295907 | 0.8905549 | -0.2722778 | 1.3982078 | -0.5779157 | 0.1083949 | 0.8282289 | NBA016 |

### 2- Clusterability of the data

# From factoextra (returns H, more info can be found in Hopkins folder)
# If the value of Hopkins statistic is close to 1 (far above 0.5), then we can conclude that the dataset is significantly clusterable.
cust_scaled <- cust_scaled[,-10]
get_clust_tendency(cust_scaled, n = nrow(cust_scaled) - 1, seed =123, graph = F)$hopkins_stat
## [1] 0.8761302
# From clustertend (returns 1-H)
hopkins(cust_scaled, n = nrow(cust_scaled) - 1)
## $H
## [1] 0.1218657
# VAT: Visual Assessment of cluster Tendency
fviz_dist(dist(cust_scaled), show_labels = FALSE)+
  labs(title = "Customer data")


### 3- Determining the optimal number of clusters

# Manually (elbow method: within-groups sum of squares)
wss <- (nrow(cust_scaled)-1)*sum(apply(cust_scaled,2,var))
for (i in 2:10) wss[i] <- sum(kmeans(cust_scaled,
                                     centers=i)$withinss)
plot(1:10, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

# From factoextra
fviz_nbclust(cust_scaled, FUNcluster = kmeans, method = "wss")

fviz_nbclust(cust_scaled, FUNcluster = kmeans, method = "silhouette")

di <- get_dist(cust_scaled, method="euclidean")
fviz_nbclust(cust_scaled, FUNcluster = kmeans, diss = di, method = "gap_stat", iter.max=30)

# From vegan
fit <- cascadeKM(scale(cust_scaled, center = TRUE,  
                       scale = TRUE), 2, 10, iter = 500)
plot(fit, sortg = TRUE, grpmts.plot = TRUE)

calinski.best <- as.numeric(which.max(fit$results[2,]))

# From mclust
cust_scaled_clust <- Mclust(as.matrix(cust_scaled), G=2:10)
m.best <- dim(cust_scaled_clust$z)[2]
cat("model-based optimal number of clusters:", m.best, "\n")
## model-based optimal number of clusters: 10
plot(cust_scaled_clust)

# From NbClust 
nb <- NbClust(cust_scaled, distance="euclidean", min.nc=2, 
              max.nc=10, method="complete", index="ch")
nb$Best.nc
## Number_clusters     Value_Index 
##           2.000         153.572
# From clusterR
opt<-Optimal_Clusters_KMeans(cust_scaled, max_clusters=10, 
                             plot_clusters = TRUE)

opt<-Optimal_Clusters_KMeans(cust_scaled, max_clusters=10, 
                             plot_clusters = TRUE, criterion="silhouette")

opt_md<-Optimal_Clusters_KMeans(cust_scaled, max_clusters=10, plot_clusters=TRUE,
                                criterion="AIC")

opt_md<-Optimal_Clusters_Medoids(cust_scaled, 10, 'euclidean', plot_clusters=TRUE)
opt_md<-Optimal_Clusters_Medoids(cust_scaled, 10, 'euclidean', plot_clusters=TRUE,criterion="silhouette")


### 4- Applying KMeans and PAM

# With k = 2 and k = 3 
km2 <- kmeans(cust_scaled, 2, iter.max = 200)
km3 <- kmeans(cust_scaled, 3, iter.max = 200)
fviz_cluster(km3, data = cust_scaled)

fviz_cluster(km2, data = cust_scaled, repel = T)

km2$centers
##          Age         Edu Years.Employed     Income  Card.Debt Other.Debt
## 1  0.8548769  0.33266641      1.0485861  1.2913721  1.3096706  1.5153080
## 2 -0.1921625 -0.07477804     -0.2357052 -0.2902796 -0.2943928 -0.3406168
##    Defaulted DebtIncomeRatio   Tot.Debt
## 1  0.1021283       0.7949790  1.5746090
## 2 -0.0229568      -0.1786984 -0.3539467
km2$tot.withinss
## [1] 5588.591
km3$centers
##          Age         Edu Years.Employed      Income  Card.Debt Other.Debt
## 1 -0.7798103  0.06106287     -0.7546209 -0.52382102 -0.2607314 -0.3323292
## 2  0.4681357 -0.16252901      0.3640069  0.04552653 -0.2455679 -0.2392926
## 3  0.8018041  0.39104609      1.0968199  1.50823662  1.7178229  1.9237848
##    Defaulted DebtIncomeRatio   Tot.Debt
## 1  0.5089313       0.1464984 -0.3341555
## 2 -0.5074962      -0.4152083 -0.2650047
## 3  0.2049352       1.0293157  2.0223206
km3$tot.withinss
## [1] 4786.98
# Using a custom k-means wrapper with a larger iter.max (and more bootstrap
# samples) to avoid convergence warnings during the gap-statistic computation
MyKmeansFUN <- function(x, k) list(cluster = kmeans(x, k, iter.max = 200)$cluster)
fviz_nbclust(cust_scaled, FUNcluster = MyKmeansFUN, method = "gap_stat", 
             diss = dist(cust_scaled), nboot = 200)

fviz_nbclust(cust_scaled, FUNcluster = MyKmeansFUN, method = "gap_stat", 
             diss = dist(cust_scaled, method = "minkowski"), nboot = 100)

pam2 <- eclust(cust_scaled, "pam", k = 2, graph = T, nboot = 100) # PAM with 2 clusters

e.km2 <-eclust(cust_scaled, "kmeans", k= 2, graph = T, nboot = 100) 

pam3 <-eclust(cust_scaled, "pam", k= 3, graph = T, nboot = 100) 

e.km3 <-eclust(cust_scaled, "kmeans", k= 3, graph = T, nboot = 100) 

c2 <- pam(cust_scaled,2)
c3 <- pam(cust_scaled,3)
c2$medoids
##               Age       Edu Years.Employed      Income  Card.Debt Other.Debt
## [1,] -0.003657528 0.3119388     0.06404914 -0.43264071 -0.4359776 -0.4262603
## [2,] -0.376725427 0.3119388    -0.52610557 -0.09535555  0.1078066  0.4605236
##       Defaulted DebtIncomeRatio   Tot.Debt
## [1,] -0.5779157      -0.3827174 -0.4714457
## [2,]  1.7283206       0.7185646  0.3561071
class(c2)
## [1] "pam"       "partition"
c3$medoids
##             Age        Edu Years.Employed     Income  Card.Debt Other.Debt
## [1,] -0.2523695 -0.7658984     -0.2310282 -0.5364208 -0.4863105 -0.6066181
## [2,]  0.8668342  0.3119388      0.5066652  0.2159846  0.4497886  0.7762234
## [3,] -0.9985053  0.3119388     -0.6736443 -0.6402008 -0.2308825 -0.3709466
##       Defaulted DebtIncomeRatio   Tot.Debt
## [1,] -0.5779157      -0.6208324 -0.6143325
## [2,] -0.5779157       0.7185646  0.7133241
## [3,]  1.7283206       0.4506852 -0.3476105
opt_aut <- pamk(cust_scaled, krange=2:5, criterion="asw", 
                usepam=TRUE, scaling=FALSE, alpha=0.001, 
                diss=inherits(cust_scaled, "dist"), critout=FALSE)
pamk(cust_scaled, krange=2:5, criterion="ch", 
                usepam=TRUE, scaling=FALSE, alpha=0.001, 
                diss=inherits(cust_scaled, "dist"), critout=T) # With Calinski-Harabasz criterion
## 2  clusters  172.947 
## 3  clusters  219.2966 
## 4  clusters  220.8544 
## 5  clusters  204.9717
## $pamobject
## Medoids:
##       ID        Age        Edu Years.Employed       Income    Card.Debt
## [1,] 120 -0.2523695 -0.7658984     -0.2310282 -0.536420757 -0.486310545
## [2,] 171  0.4937663  0.3119388      1.0968199  1.098115009  1.611210028
## [3,] 499  0.2450544  0.3119388      0.3591265  0.008424498  0.009492704
## [4,] 105 -0.9985053  0.3119388     -0.6736443 -0.640200806 -0.230882496
##       Other.Debt  Defaulted DebtIncomeRatio    Tot.Debt
## [1,] -0.60661812 -0.5779157      -0.6208324 -0.61433252
## [2,]  1.36702033 -0.5779157       0.6143893  1.60180227
## [3,] -0.04818554 -0.5779157      -0.0850736 -0.02849658
## [4,] -0.37094662  1.7283206       0.4506852 -0.34761046
## Clustering vector:
##   [1] 1 2 2 1 2 2 1 3 1 3 3 1 1 3 4 1 1 3 3 3 1 3 4 3 2 3 1 3 3 3 3 3 4 3 1 3 4
##  [38] 4 1 2 4 2 3 2 3 2 1 1 1 3 3 2 4 3 2 4 4 1 3 1 3 3 2 2 1 1 3 4 1 4 3 3 3 3
##  [75] 1 1 1 1 2 3 3 2 2 4 3 4 4 1 4 4 3 1 4 4 3 1 1 1 4 1 4 2 3 4 4 2 1 1 1 3 3
## [112] 3 1 4 1 3 1 4 2 1 1 1 3 3 1 3 1 1 1 1 1 3 3 3 1 3 3 1 3 4 4 1 3 1 2 3 3 4
## [149] 1 1 1 3 4 4 3 4 3 3 4 4 4 3 3 3 1 3 3 1 1 3 2 1 3 4 1 4 3 1 4 3 4 3 1 2 4
## [186] 2 1 1 3 3 4 3 3 4 3 3 1 3 2 1 2 1 4 1 1 3 1 2 3 4 1 1 1 3 1 4 4 2 3 3 4 3
## [223] 1 4 1 1 2 1 3 3 1 3 2 4 2 1 3 4 2 1 1 1 1 1 1 3 3 3 3 3 1 3 1 3 3 4 3 4 1
## [260] 1 1 2 3 4 2 3 1 2 1 1 4 1 1 3 3 4 4 4 1 1 1 2 3 1 2 1 2 1 3 2 1 3 1 2 1 1
## [297] 4 1 4 2 3 3 1 1 4 3 3 4 4 1 3 1 1 1 1 4 1 4 3 4 2 1 2 1 1 1 4 4 3 3 3 2 4
## [334] 3 3 4 1 2 3 3 3 1 1 4 2 3 1 3 1 1 2 3 1 1 1 1 2 3 3 1 2 4 1 4 3 1 3 3 2 1
## [371] 4 3 3 1 4 3 4 3 4 4 1 1 1 1 2 1 3 3 1 1 1 2 1 2 1 3 1 3 1 1 3 1 4 1 1 2 4
## [408] 2 3 1 1 3 3 4 4 3 2 4 2 1 3 2 1 3 2 3 1 3 3 1 2 3 4 1 2 1 1 1 2 3 1 3 1 2
## [445] 3 4 1 3 3 1 2 4 1 1 2 1 4 1 1 3 3 2 1 3 1 3 3 1 1 2 2 3 3 3 3 4 3 1 1 4 4
## [482] 4 3 1 1 1 3 2 4 4 1 2 1 2 1 1 4 2 3 4 4 4 2 2 4 3 3 4 3 4 4 3 4 2 3 3 4 1
## [519] 4 2 2 3 3 4 3 3 4 3 4 1 1 3 2 2 1 4 4 1 1 4 3 1 4 3 3 1 4 1 3 1 2 2 3 3 1
## [556] 3 2 1 4 1 3 3 3 4 4 3 3 1 2 1 3 2 1 4 1 3 1 4 4 3 2 4 3 1 1 4 3 1 1 2 4 2
## [593] 3 4 4 4 4 1 3 3 1 3 4 2 3 2 3 3 4 1 4 1 4 1 3 4 4 4 4 1 2 1 4 4 1 3 1 1 3
## [630] 2 3 4 3 3 2 1 3 2 1 3 1 1 2 4 3 2 3 4 1 1 4 1 2 3 1 1 1 2 3 1 1 1 3 2 1 1
## [667] 3 1 1 4 3 3 2 1 1 3 1 3 3 1 3 1 4 3 1 3 4 3 1 1 1 1 3 2 4 1 2 1 1 4 3 4 1
## [704] 4 4 3 1 1 2 4 2 3 2 2 1 4 1 3 1 2 1 1 1 3 4 2 1 4 1 3 3 3 4 1 2 1 1 3 3 3
## [741] 4 3 1 4 3 3 3 3 4 3 1 1 1 3 4 1 1 1 3 4 2 2 1 3 1 4 3 3 1 3 4 1 4 1 3 3 2
## [778] 3 3 3 1 1 1 3 2 1 3 3 3 1 3 2 2 2 4 3 4 1 4 1 1 2 4 1 1 1 3 3 2 4 1 1 1 4
## [815] 1 1 4 1 3 1 4 3 1 4 4 3 2 3 3 1 2 1 3 3 1 4 1 1 1 1 3 3 1 1 1 1 1 4 1 3
## Objective function:
##    build     swap 
## 2.006165 1.987011 
## 
## Available components:
##  [1] "medoids"    "id.med"     "clustering" "objective"  "isolation" 
##  [6] "clusinfo"   "silinfo"    "diss"       "call"       "data"      
## 
## $nc
## [1] 4
## 
## $crit
## [1]   0.0000 172.9470 219.2966 220.8544 204.9717
# Report the chosen number of clusters and plot
cat("number of clusters estimated by optimum average silhouette width:", opt_aut$nc, "\n")
## number of clusters estimated by optimum average silhouette width: 2
plot(pam(cust_scaled, opt_aut$nc)) #For k = 2

plot(pam(cust_scaled, 3)) #For k = 3


# Computing distance matrices with different metrics
dis.euc <- dist(cust_scaled, method = "euclidean")
dis.max <- dist(cust_scaled, method = "maximum")
dis.man <- dist(cust_scaled, method = "manhattan")
dis.can <- dist(cust_scaled, method = "canberra")
dis.min <- dist(cust_scaled, method = "minkowski")

# Silhouette plots for k = 3 and k = 2 with different distance metrics
plot(silhouette(km3$cluster, dis.min),  border=NA)

plot(silhouette(km3$cluster, dis.can),  border=NA)

plot(silhouette(km3$cluster, dis.man),  border=NA)

plot(silhouette(opt_aut$pamobject$cluster, dis.max),  border=NA)

plot(silhouette(opt_aut$pamobject$cluster, dis.min),  border=NA)

plot(silhouette(opt_aut$pamobject$cluster, dis.can),  border=NA)

plot(silhouette(opt_aut$pamobject$cluster, dis.euc),  border=NA)


It looks like for k = 2 with the euclidean and minkowski metrics, the second cluster has an almost entirely negative silhouette width, while with the canberra metric the silhouette widths are mostly positive.
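As a quick numeric check of this reading, we can compare average silhouette widths directly (higher is better), reusing the distance matrices computed above:

# Average silhouette width of the k = 2 partition under two of the metrics
mean(silhouette(opt_aut$pamobject$cluster, dis.euc)[, "sil_width"])
mean(silhouette(opt_aut$pamobject$cluster, dis.can)[, "sil_width"])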

# Check the assignment into clusters
cust_scaled$cluster2 <- km2$cluster
cust_scaled$cluster3 <- km3$cluster

aggregate(data = cust_scaled, Income ~ cluster2, mean)
##   cluster2     Income
## 1        1  1.2913721
## 2        2 -0.2902796
aggregate(data = cust_scaled, Income ~ cluster3, mean)
##   cluster3      Income
## 1        1 -0.52382102
## 2        2  0.04552653
## 3        3  1.50823662
aggregate(data = cust_scaled, Edu ~ cluster2, mean)
##   cluster2         Edu
## 1        1  0.33266641
## 2        2 -0.07477804
aggregate(data = cust_scaled, Edu ~ cluster3, mean)
##   cluster3         Edu
## 1        1  0.06106287
## 2        2 -0.16252901
## 3        3  0.39104609
aggregate(data = cust_scaled, Tot.Debt ~ cluster2, mean)
##   cluster2   Tot.Debt
## 1        1  1.5746090
## 2        2 -0.3539467
aggregate(data = cust_scaled, Tot.Debt ~ cluster3, mean)
##   cluster3   Tot.Debt
## 1        1 -0.3341555
## 2        2 -0.2650047
## 3        3  2.0223206
km2$size
## [1] 156 694
km3$size
## [1] 348 393 109
dim(cust_scaled)
## [1] 850  11
fviz_cluster(list(data=cust_scaled[,-c(10,11)], cluster=km2$cluster), 
             ellipse.type="norm", geom="point", stand=FALSE, 
             palette="jco", ggtheme=theme_grey())

fviz_cluster(list(data=cust_scaled[,-c(10,11)], cluster=km3$cluster), 
             ellipse.type="norm", geom="text", stand=FALSE, 
             palette="simpsons", ggtheme=theme_grey())

# Boxplots of the variables by cluster
groupBWplot(cust_scaled, km2$cluster, alpha=0.05) 

groupBWplot(cust_scaled, km3$cluster, alpha=0.05) 

k2 <- flexclust::cclust(cust_scaled, 2, simple=FALSE, save.data=TRUE)
k2
## kcca object of family 'kmeans' 
## 
## call:
## flexclust::cclust(x = cust_scaled, k = 2, simple = FALSE, save.data = TRUE)
## 
## cluster sizes:
## 
##   1   2 
## 700 150
summary(k2)
## kcca object of family 'kmeans' 
## 
## call:
## flexclust::cclust(x = cust_scaled, k = 2, simple = FALSE, save.data = TRUE)
## 
## cluster info:
##   size  av_dist  max_dist separation
## 1  700 2.164105  5.145281   2.770636
## 2  150 3.408690 12.229879   2.699075
## 
## convergence after 4 iterations
## sum of within cluster distances: 2026.177
k3 <- flexclust::cclust(cust_scaled, 3, simple=FALSE, save.data=TRUE)
k3
## kcca object of family 'kmeans' 
## 
## call:
## flexclust::cclust(x = cust_scaled, k = 3, simple = FALSE, save.data = TRUE)
## 
## cluster sizes:
## 
##   1   2   3 
## 109 348 393
summary(k3)
## kcca object of family 'kmeans' 
## 
## call:
## flexclust::cclust(x = cust_scaled, k = 3, simple = FALSE, save.data = TRUE)
## 
## cluster info:
##   size  av_dist  max_dist separation
## 1  109 3.547315 11.529581   3.025906
## 2  348 2.053865  4.676066   1.901020
## 3  393 1.846303  4.894070   1.679237
## 
## convergence after 7 iterations
## sum of within cluster distances: 1826.999

### 5- Quality measurement

# For k = 2: cluster columns 1:3 and columns 4:7 separately, then compare the partitions
d1 <- flexclust::cclust(cust_scaled[,1:3], 2, dist="euclidean")
d2 <- flexclust::cclust(cust_scaled[,4:7], 2, dist="euclidean")
randIndex(d1, d2)
##        ARI 
## 0.03828895
comPart(d1, d2)
##        ARI         RI          J         FM 
## 0.03828895 0.55142798 0.48241855 0.65362212
# For k = 3, with the manhattan distance
d3 <- flexclust::cclust(cust_scaled[,1:3], 3, dist="manhattan")
d4 <- flexclust::cclust(cust_scaled[,4:7], 3, dist="manhattan")
randIndex(d3, d4)
##       ARI 
## 0.1104012
comPart(d3, d4)
##       ARI        RI         J        FM 
## 0.1104012 0.5784134 0.2865732 0.4475317
# Calinski-Harabasz & Duda-Hart
round(calinhara(cust_scaled, km2$cluster),digits=2)
## [1] 342.45
round(calinhara(cust_scaled, km3$cluster),digits=2)
## [1] 291.85
dudahart2(cust_scaled, km2$cluster)
## $p.value
## [1] 0
## 
## $dh
## [1] 0.7123348
## 
## $compare
## [1] 0.8986265
## 
## $cluster1
## [1] FALSE
## 
## $alpha
## [1] 0.001
## 
## $z
## [1] 3.090232
dudahart2(cust_scaled, km3$cluster)
## $p.value
## [1] 0
## 
## $dh
## [1] 0.3819788
## 
## $compare
## [1] 0.8986265
## 
## $cluster1
## [1] FALSE
## 
## $alpha
## [1] 0.001
## 
## $z
## [1] 3.090232
# Shadow statistics
shadow(k2)
##         1         2 
## 0.6109953 0.7739530
shadow(k3)
##         1         2         3 
## 0.7748512 0.7947313 0.7622407
plot(shadow(k2))

plot(shadow(k3))

### 6- Hierarchical clustering

cust_scaled$cluster2 <- NULL
cust_scaled$cluster3 <- NULL
cust_scaled$Address <- cust$Address
length(unique(cust_scaled$Address)) # 32 neighborhoods
## [1] 32
colnames(cust_scaled)
##  [1] "Age"             "Edu"             "Years.Employed"  "Income"         
##  [5] "Card.Debt"       "Other.Debt"      "Defaulted"       "DebtIncomeRatio"
##  [9] "Tot.Debt"        "Address"
# We switch to data.table so we can aggregate the other variables by Address
cust_scaled <- data.table(cust_scaled)
# Mean of every column for each unique address, to construct a new data frame
ed <- cust_scaled[,.(Education = sum(Edu)/.N),.(Address)]
ag <- cust_scaled[,.(Age = sum(Age)/.N),.(Address)]
ye <- cust_scaled[,.(Years.Employed = sum(Years.Employed)/.N),.(Address)]
inc <- cust_scaled[,.(Income = sum(Income)/.N),.(Address)]
dir <- cust_scaled[,.(DebtIncomeRatio = sum(DebtIncomeRatio)/.N),.(Address)]
td <- cust_scaled[,.(Tot.Debt = sum(Tot.Debt)/.N),.(Address)]
de <- cust_scaled[,.(Defaulted = sum(Defaulted)/.N),.(Address)]
new <- as.data.frame(Reduce(function(x, y) merge(x, y, all=TRUE), list(ed,ag,ye,inc,dir,td,de)))
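# Note: the same aggregation can be done in a single data.table call, e.g.
# new_alt <- as.data.frame(cust_scaled[, lapply(.SD, mean), by = Address])
# (a sketch; the column order differs, but the values match the merged result above)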

rownames(new) <- sort(unique(cust_scaled$Address))
rownames(new) == new$Address
##  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
## [31] TRUE TRUE
new$Address <- NULL

# Dissimilarity matrix
dis <- dist(new, method = "euclidean")
hc01 <- hclust(dis, method = "complete") # Complete linkage
plot(hc01, cex = 0.6, hang = -1) # Simple dendrogram

# Better visualization with three clusters
fviz_dend(hc01, k =3, # Cut in three groups
          cex = 0.5, # label size
          k_colors = c("#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE # Add rectangle around groups
)

hc02 <- hclust(dis, method = "ward.D2") # Ward linkage
fviz_dend(hc02, k =3, # Cut in three groups
          cex = 0.5, # label size
          k_colors = c("#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE # Add rectangle around groups
)

# This time with the cophenetic distance
res.coph <- cophenetic(hc02)
# Correlation between cophenetic distance and the euclidean distance
cor(dis, res.coph) # Not strongly correlated, so it's worth comparing their dendrograms
## [1] 0.5916284
res.hc <- hclust(d =res.coph, method = "complete")
fviz_dend(res.hc, k =3, # Cut in three groups
          cex = 0.5, # label size
          k_colors = c("#00AFBB", "#E7B800", "#FC4E07"),
          color_labels_by_k = TRUE, # color labels by groups
          rect = TRUE # Add rectangle around groups
)

hc02 <- agnes(new, metric = "euclidean", method = "complete")
hc02$ac # Close to 1, suggesting a reasonably strong clustering structure
## [1] 0.8366533
fviz_dend(hc02, cex = 0.8, k =3,k_colors = c("#00AFBB", "#E7B800", "#FC4E07"))

# Trying multiple metrics and linkage methods
a <- data.frame(matrix(vector(), 4, 5, # Creating an empty df to store the agglomerative coefficients
                       dimnames=list(c( "average", "single", "complete", "ward"), c("euclidean", "manhattan", "canberra","minkowski","maximum"))),
                stringsAsFactors=F)

for (i in colnames(a)) {
  for (j in row.names(a)){
    a[j,i] <- agnes(new, method = j, metric = i)$ac
  }
}
knitr::kable(a, format="markdown")
| | euclidean | manhattan | canberra | minkowski | maximum |
|---|---|---|---|---|---|
| average | 0.8090837 | 0.8293903 | 0.8090837 | 0.8090837 | 0.8090837 |
| single | 0.7941292 | 0.7992851 | 0.7941292 | 0.7941292 | 0.7941292 |
| complete | 0.8366533 | 0.8441085 | 0.8366533 | 0.8366533 | 0.8366533 |
| ward | 0.8952770 | 0.8938113 | 0.8952770 | 0.8952770 | 0.8952770 |
colnames(a)[apply(a,1,which.max)]
## [1] "manhattan" "manhattan" "manhattan" "euclidean"
# Manhattan was the best metric for average, single and complete linkage; euclidean for ward
which(a == max(a), arr.ind = TRUE)
##      row col
## ward   4   1
## ward   4   3
## ward   4   4
## ward   4   5
# The overall maximum is reached by ward linkage, under every metric except manhattan
hc03 <- agnes(new, method = "ward")
cutree(as.hclust(hc03), k = 3)
## NBA000 NBA001 NBA002 NBA003 NBA004 NBA005 NBA006 NBA007 NBA008 NBA009 NBA010 
##      1      1      1      1      1      1      1      1      1      1      1 
## NBA011 NBA012 NBA013 NBA014 NBA015 NBA016 NBA017 NBA018 NBA019 NBA020 NBA021 
##      1      1      1      2      1      2      2      2      2      2      2 
## NBA022 NBA023 NBA024 NBA025 NBA026 NBA027 NBA029 NBA030 NBA031 NBA034 
##      2      2      2      2      2      2      3      3      2      2
pltree(hc03, cex = 0.6, hang = -1, main = "Dendrogram - Agnes") 

hc04 <- diana(new, metric = "euclidean") # We can also perform divisive clustering with DIANA
cutree(as.hclust(hc04), k = 3)
## NBA000 NBA001 NBA002 NBA003 NBA004 NBA005 NBA006 NBA007 NBA008 NBA009 NBA010 
##      1      1      1      1      1      1      1      1      1      1      1 
## NBA011 NBA012 NBA013 NBA014 NBA015 NBA016 NBA017 NBA018 NBA019 NBA020 NBA021 
##      1      1      1      2      1      1      1      1      1      2      2 
## NBA022 NBA023 NBA024 NBA025 NBA026 NBA027 NBA029 NBA030 NBA031 NBA034 
##      2      2      2      2      2      2      3      2      2      2
hc04$dc
## [1] 0.8072509
pltree(hc04, cex = 0.6, hang = -1, main = "Dendrogram - Diana")

hc05 <- hclust(dis, method = "ward.D2" )
sub_grp <- cutree(hc05, k = 3) # Cutting into 3 groups
table(sub_grp)
## sub_grp
##  1  2  3 
## 15 15  2
new$cluster <- sub_grp
knitr::kable(new, format="markdown")
| | Education | Age | Years.Employed | Income | DebtIncomeRatio | Tot.Debt | Defaulted | cluster |
|---|---|---|---|---|---|---|---|---|
| NBA000 | -0.0653043 | -0.5259526 | -0.4154516 | -0.3595623 | -0.0736639 | -0.3218842 | 0.4214534 | 1 |
| NBA001 | 0.0538651 | -0.5168448 | -0.3931130 | -0.2393321 | -0.0117107 | -0.1184801 | 0.0067639 | 1 |
| NBA002 | -0.2042086 | -0.5921590 | -0.2788224 | -0.3091278 | 0.0077829 | -0.2081700 | 0.2016571 | 1 |
| NBA003 | 0.0375802 | -0.5191695 | -0.3249165 | -0.2331000 | 0.2131114 | -0.1272292 | 0.3865104 | 1 |
| NBA004 | 0.0146044 | -0.4131746 | -0.2233969 | -0.2559462 | -0.0583883 | -0.2271147 | 0.0185247 | 1 |
| NBA005 | -0.0891169 | -0.2436935 | -0.3476867 | -0.2751605 | 0.1381592 | -0.0987493 | 0.1729519 | 1 |
| NBA006 | 0.0963713 | -0.4115451 | -0.2280774 | -0.1399810 | 0.0321981 | -0.0347796 | 0.2523294 | 1 |
| NBA007 | 0.1016291 | -0.2766340 | -0.0906865 | -0.0852307 | -0.1119341 | -0.2246416 | -0.2966673 | 1 |
| NBA008 | 0.0479786 | -0.0594908 | 0.2627747 | 0.1985114 | 0.1086986 | 0.2668028 | -0.0131231 | 1 |
| NBA009 | -0.2868597 | 0.0460849 | 0.1492937 | -0.0071425 | -0.0903650 | 0.0291741 | -0.2704175 | 1 |
| NBA010 | -0.0958915 | -0.0103795 | 0.0441115 | -0.1465444 | -0.0219248 | -0.1314298 | -0.2662621 | 1 |
| NBA011 | -0.2269798 | 0.4454057 | 0.1296219 | 0.1194115 | 0.2113300 | 0.1741193 | 0.1267676 | 1 |
| NBA012 | 0.1579620 | 0.1429049 | 0.2800879 | 0.1177642 | -0.0850736 | 0.1675263 | -0.0837222 | 1 |
| NBA013 | 0.1649610 | 0.2846222 | -0.0700769 | -0.1071487 | -0.0444858 | -0.0627840 | -0.0537711 | 1 |
| NBA014 | 0.2670289 | 0.7062078 | 0.6603513 | 1.0289283 | 0.0953730 | 0.6026117 | -0.0013566 | 2 |
| NBA015 | -0.2868597 | 0.4937663 | 0.1951946 | -0.1097694 | 0.0926859 | 0.0146451 | 0.0627055 | 1 |
| NBA016 | 0.2629462 | 0.5389867 | 0.3054761 | 0.4648208 | -0.2156310 | 0.0401865 | -0.3682578 | 2 |
| NBA017 | 0.2580469 | 0.6865181 | 0.2410956 | 0.4430035 | -0.0285213 | 0.2998057 | -0.4626039 | 2 |
| NBA018 | 0.5429039 | 0.7868911 | 0.3802035 | 0.1659478 | -0.0276709 | 0.4221355 | 0.0810090 | 2 |
| NBA019 | -0.2943447 | 0.9601012 | 0.3406842 | 0.0992320 | -0.0646106 | -0.0357774 | -0.4337759 | 2 |
| NBA020 | -0.2269798 | 1.5041586 | 1.0599352 | 0.8192061 | 0.5864852 | 0.9542968 | -0.2896361 | 2 |
| NBA021 | -0.1191961 | 1.0906750 | 1.2296047 | 0.5351082 | -0.2844949 | 0.3314790 | -0.3472920 | 2 |
| NBA022 | -0.1671000 | 1.0188249 | 0.8837085 | 1.1874812 | -0.6373682 | 0.3569009 | -0.5779157 | 2 |
| NBA023 | 0.6058944 | 1.3529530 | 0.6944417 | 0.8103612 | -0.1919548 | 0.2421765 | -0.3682578 | 2 |
| NBA024 | 0.0424795 | 1.1777241 | 0.3960112 | 0.2873334 | -0.3715558 | 0.0394243 | -0.5779157 | 2 |
| NBA025 | -0.4066194 | 1.7925953 | 1.4574700 | 1.3863929 | -0.2553920 | 0.5145836 | -0.5779157 | 2 |
| NBA026 | 0.0963713 | 1.6627124 | 0.5066652 | 0.8075309 | 0.1381592 | 0.8161828 | -0.1166684 | 2 |
| NBA027 | 0.3119388 | 1.7995040 | -0.0834895 | 0.0149108 | 0.1493209 | 0.1867267 | -0.5779157 | 2 |
| NBA029 | 2.4676132 | 2.2347499 | -0.5261056 | 0.8127199 | 1.0608550 | 1.7540164 | 1.7283206 | 3 |
| NBA030 | 2.4676132 | 1.9860379 | 0.9492812 | -0.5364208 | 0.5102140 | -0.2221876 | -0.5779157 | 3 |
| NBA031 | 0.3119388 | 2.0482159 | 1.0230506 | 0.1381496 | -1.0152105 | -0.6139356 | -0.5779157 | 2 |
| NBA034 | -0.7658984 | 2.3591058 | 1.3918973 | 1.7467403 | -0.2487777 | 0.9990978 | -0.5779157 | 2 |
# Plot with cluster borders
plot(hc05, cex = 0.6)
rect.hclust(hc05, k = 3, border = 2:5)

fviz_cluster(list(data = new, cluster = sub_grp)) #KMeans-like visualization

# Comparing two dendrograms
dend1 <- as.dendrogram (hc01)
dend2 <- as.dendrogram (hc02)
dend_list <- dendlist(dend1, dend2)
tanglegram(sort(dend1), sort(dend2), # A somewhat better tanglegram
           highlight_distinct_edges = FALSE,
           common_subtrees_color_lines = FALSE,
           common_subtrees_color_branches = TRUE,
           main = paste("entanglement =", round(entanglement(dend_list), 2))
) # A lower entanglement coefficient corresponds to a better alignment.
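We could also search for a rotation of the two trees that lowers the entanglement before plotting; a minimal sketch using dendextend's untangle (step1side is one of its search heuristics):

# Try to find a lower-entanglement layout of the two dendrograms
entanglement(untangle(dend_list, method = "step1side"))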

# Optimal number of clusters
fviz_nbclust(new, FUN = hcut, method = "wss") # Elbow: 2

fviz_nbclust(new, FUN = hcut, method = "silhouette") # Silhouette: 3

gap_stat <- clusGap(new, FUN = hcut, nstart = 50, K.max = 10, B = 100)
fviz_gap_stat(gap_stat)  # Gap statistic: 2

### 7- Affinity Propagation (AP)

Luckily, unlike k-means, we don’t have to specify the number of clusters. AP is based on similarities between pairs of data points and iteratively selects a set of representative observations, the so-called exemplars.
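As a minimal sketch of the call pattern on simulated toy data (the two Gaussian blobs below are hypothetical and only illustrate the API):

# AP on a tiny simulated dataset: two well-separated Gaussian blobs
set.seed(1)
toy <- rbind(cbind(rnorm(30), rnorm(30)),
             cbind(rnorm(30, mean = 5), rnorm(30, mean = 5)))
ap_toy <- apcluster(negDistMat(r = 2), toy)
length(ap_toy@clusters) # number of clusters found automatically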

# r is the exponent in the similarity s = -d^r (in negDistMat, r = 2 means negative squared distances)
set.seed(100)
apres01 <- apcluster(negDistMat(r=2), cust_scaled, details=TRUE)

# Clustering results
length(apres01@clusters) # 60 clusters!
## [1] 60
plot(apres01, cust_scaled)

# Let's select only a few columns to get a closer visualization
# But which columns explain the most variance? The almighty PCA is the answer!
res.pca01 <- PCA(cust_scaled[,c(1:9)], graph = FALSE)
fviz_cos2(res.pca01, choice = "var", axes = 1:2)

# We choose Tot.Debt, Years.Employed and Income (Card.Debt and Other.Debt are the ingredients of Tot.Debt)
c <- cust_scaled[,c("Tot.Debt","Years.Employed","Income")]

# Performing the AP again on the new dataframe
apres02 <- apcluster(negDistMat(r=2), c, details=TRUE)
length(apres02@clusters) # 48 clusters!
## [1] 48
plot(apres02, c)

# Count singleton clusters (clusters containing a single observation)
cou <- sapply(apres02@clusters, length) == 1
sum(cou) # 6 clusters with only one observation!
## [1] 6
# We didn't provide q, which by default is the median of similarities
# Let's change it to 0.001 to see if we can get a smaller number of clusters
apres03 <- apcluster(s=apres02@sim, q=0.001) # Reusing similarity matrix from previous run
show(apres03) # 4 clusters
## 
## APResult object
## 
## Number of samples     =  850 
## Number of iterations  =  133 
## Input preference      =  -146.3287 
## Sum of similarities   =  -861.2376 
## Sum of preferences    =  -585.3148 
## Net similarity        =  -1446.552 
## Number of clusters    =  4 
## 
## Exemplars:
##    171 396 557 661
## Clusters:
##    Cluster 1, exemplar 171:
##       2 3 6 10 24 25 40 42 44 46 55 63 64 73 79 82 83 102 106 119 145 157 158 
##       171 198 201 218 222 227 235 239 246 262 265 283 285 287 290 300 321 323 
##       332 338 345 357 361 368 369 385 392 398 417 419 422 425 431 435 439 451 
##       470 471 474 488 492 494 498 503 504 514 521 534 541 551 554 569 581 590 
##       592 604 606 621 630 635 638 640 643 647 653 654 658 664 676 684 697 709 
##       711 713 714 718 720 726 732 735 747 761 762 777 785 793 794 802 809 826 
##       827 831
##    Cluster 2, exemplar 396:
##       11 12 19 20 22 23 26 28 29 30 31 32 35 50 51 52 54 59 60 61 62 66 67 71 
##       75 77 80 81 85 91 98 101 103 108 109 110 111 113 116 118 123 124 125 126 
##       131 132 135 137 139 141 143 147 151 152 155 156 160 162 163 164 168 169 
##       173 175 177 180 182 186 188 189 190 191 192 193 195 206 207 209 211 212 
##       213 214 217 219 220 223 230 231 232 233 237 242 244 247 248 249 253 255 
##       257 263 267 268 270 273 274 289 292 294 298 301 302 306 307 311 315 319 
##       326 329 330 331 335 336 339 340 341 342 343 348 353 355 356 367 370 373 
##       377 378 386 387 393 394 396 402 406 408 409 411 412 413 416 421 423 424 
##       426 428 429 432 437 440 441 442 445 447 449 450 452 460 467 469 473 475 
##       477 479 481 483 487 493 495 499 507 509 515 516 520 525 527 528 532 535 
##       540 544 545 546 547 549 555 561 562 563 564 566 567 571 575 580 583 584 
##       585 586 587 593 599 600 601 602 610 612 614 615 617 619 620 626 629 631 
##       632 633 634 637 639 645 646 650 657 659 663 665 666 667 668 669 670 671 
##       672 678 679 680 681 683 686 688 694 698 702 706 712 716 717 719 724 730 
##       731 736 740 741 742 746 748 753 759 764 767 768 770 774 775 776 778 779 
##       781 783 784 787 788 791 805 807 811 814 822 833 839 841 842 843 844 849 
##       850
##    Cluster 3, exemplar 557:
##       5 184 199 208 282 351 444 455 462 533 552 557 572 673 792
##    Cluster 4, exemplar 661:
##       1 4 7 8 9 13 14 15 16 17 18 21 27 33 34 36 37 38 39 41 43 45 47 48 49 53 
##       56 57 58 65 68 69 70 72 74 76 78 84 86 87 88 89 90 92 93 94 95 96 97 99 
##       100 104 105 107 112 114 115 117 120 121 122 127 128 129 130 133 134 136 
##       138 140 142 144 146 148 149 150 153 154 159 161 165 166 167 170 172 174 
##       176 178 179 181 183 185 187 194 196 197 200 202 203 204 205 210 215 216 
##       221 224 225 226 228 229 234 236 238 240 241 243 245 250 251 252 254 256 
##       258 259 260 261 264 266 269 271 272 275 276 277 278 279 280 281 284 286 
##       288 291 293 295 296 297 299 303 304 305 308 309 310 312 313 314 316 317 
##       318 320 322 324 325 327 328 333 334 337 344 346 347 349 350 352 354 358 
##       359 360 362 363 364 365 366 371 372 374 375 376 379 380 381 382 383 384 
##       388 389 390 391 395 397 399 400 401 403 404 405 407 410 414 415 418 420 
##       427 430 433 434 436 438 443 446 448 453 454 456 457 458 459 461 463 464 
##       465 466 468 472 476 478 480 482 484 485 486 489 490 491 496 497 500 501 
##       502 505 506 508 510 511 512 513 517 518 519 522 523 524 526 529 530 531 
##       536 537 538 539 542 543 548 550 553 556 558 559 560 565 568 570 573 574 
##       576 577 578 579 582 588 589 591 594 595 596 597 598 603 605 607 608 609 
##       611 613 616 618 622 623 624 625 627 628 636 641 642 644 648 649 651 652 
##       655 656 660 661 662 674 675 677 682 685 687 689 690 691 692 693 695 696 
##       699 700 701 703 704 705 707 708 710 715 721 722 723 725 727 728 729 733 
##       734 737 738 739 743 744 745 749 750 751 752 754 755 756 757 758 760 763 
##       765 766 769 771 772 773 780 782 786 789 790 795 796 797 798 799 800 801 
##       803 804 806 808 810 812 813 815 816 817 818 819 820 821 823 824 825 828 
##       829 830 832 834 835 836 837 838 840 845 846 847 848
plot(apres03, c)