Data Set Introduction

This data set contains 214 instances, each corresponding to a glass fragment. Each instance is described by 11 attributes.

ID RI Na Mg Al Si K Ca Ba Fe Type
1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1

Attribute Information:

Col number Col name Description
1 ID identification number: 1 to 214
2 RI refractive index
3 Na Sodium (unit of measurement: weight percent in the corresponding oxide, as for attributes 4-10)
4 Mg Magnesium
5 Al Aluminum
6 Si Silicon
7 K Potassium
8 Ca Calcium
9 Ba Barium
10 Fe Iron
11 Type type of glass (class attribute), 7 different types (see below)

Type of glass:

There are 7 types of glass, grouped into 5 groups.

Code Glass description
1 building_windows_float_processed
2 building_windows_non_float_processed
3 vehicle_windows_float_processed
4 vehicle_windows_non_float_processed (none in this database)
5 containers
6 tableware
7 headlamps

As suggested by the data set description, this can be framed as a classification problem; here we approach it in an unsupervised way, with clustering. The study of classification of types of glass was motivated by criminological investigation: at the scene of the crime, the glass left behind can be used as evidence if it is correctly identified.

Cluster analysis will try to find patterns in the chemical composition of a sample of glass fragments.

The clusters found might be useful for drawing inferences about the type of glass.

Exploratory Data Analysis and Preprocessing

There are no missing values (NA) in the data set.
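A quick way to verify this, together with one possible way to obtain per-column summary statistics similar to the table below (the use of psych::describe here is an assumption, not necessarily how the table was produced):

sum(is.na(glass))    # total number of missing values; 0 for this data set
library(psych)       # assumption: psych package for descriptive statistics
describe(glass)      # n, mean, sd, median, trimmed mean, mad, min, max, range, skew, kurtosis, se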

column n mean sd median trimmed mad min max range skew kurtosis se
ID 214 107.500 61.921 107.500 107.500 53.500 1.000 214.000 213.000 0.000 1.800 4.233
RI 214 1.518 0.003 1.518 1.518 0.001 1.511 1.534 0.023 1.614 7.789 0.000
Na 214 13.408 0.817 13.300 13.377 0.435 10.730 17.380 6.650 0.451 5.953 0.056
Mg 214 2.685 1.442 3.480 2.866 0.205 0.000 4.490 4.490 -1.144 2.571 0.099
Al 214 1.445 0.499 1.360 1.412 0.210 0.290 3.500 3.210 0.901 4.985 0.034
Si 214 72.651 0.775 72.790 72.707 0.385 69.810 75.410 5.600 -0.725 5.871 0.053
K 214 0.497 0.652 0.555 0.432 0.115 0.000 6.210 6.210 6.506 56.392 0.045
Ca 214 8.957 1.423 8.600 8.742 0.445 5.430 16.190 10.760 2.033 9.499 0.097
Ba 214 0.175 0.497 0.000 0.034 0.000 0.000 3.150 3.150 3.392 15.222 0.034
Fe 214 0.057 0.097 0.000 0.036 0.000 0.000 0.510 0.510 1.742 5.572 0.007
Type 214 2.780 2.104 2.000 2.483 1.000 1.000 7.000 6.000 1.107 2.699 0.144

We can remove the attribute ID since it is only there for identification purposes; it carries no information useful for clustering.

glass<-glass %>% select(-ID)

Attribute Type needs to be transformed into a factor.

glass$Type<-factor(glass$Type)
levels(glass$Type)
## [1] "1" "2" "3" "5" "6" "7"

From the summary table and the histogram plots we can also detect possible outliers. For example, the maximum values of K, Ba and Fe are far higher than the corresponding third quartiles/medians. The histograms below shed some light on this.

First we check K. It seems that there are a few instances with a very high value of K.
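A minimal sketch of the kind of histogram referred to here, assuming ggplot2 (the original plotting code is not shown):

library(ggplot2)
# distribution of potassium (K); a handful of observations lie far to the right
ggplot(glass, aes(x = K)) +
  geom_histogram(bins = 30)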

We decide that these observations are outliers and discard them from the analysis, since they could create issues during clustering.

glass<-glass[glass$K<6,]

The story is different for Ba: most of the examples in our data set have Ba = 0. We decide to discard this variable from the analysis, since it is unlikely to be very useful during clustering.

glass<-glass %>% select(-Ba)

Also for Fe we have many examples with value 0. By the same logic used before, we decide to discard Fe from our analysis.

glass<-glass %>% select(-Fe)

Ca (Calcium) and RI (refractive index) are rather highly (positively) correlated. From the scatter plot below it seems that the relationship between them is linear: RI increases as Ca increases.
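The strength of this relationship can be checked directly, for example with the Pearson correlation and a simple scatter plot (a sketch; the original plotting code is not shown):

cor(glass$RI, glass$Ca)    # Pearson correlation between refractive index and calcium
# scatter plot of RI against Ca
ggplot(glass, aes(x = Ca, y = RI)) +
  geom_point()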

This suggests that we can consider only one of these two variables. We decide to remove RI and keep Ca, since the data set is mainly composed of chemical elements.

glass<-glass %>% select(-RI)

The variables have rather different scales. This can bias the analysis, since attributes on larger scales would dominate the distance computation and steer the clustering. Therefore, the next step is to standardize the data.

For each numeric column, each value \(x_{ij}\) (attribute \(i\), instance \(j\)) is transformed in the following way:

\[ s_{ij} = \frac{x_{ij} - \mu_{i}}{\sigma_{i}} \]

where \(\mu_i\) and \(\sigma_i\) are the mean and standard deviation of attribute \(i\). Only numeric variables get scaled. The attribute Type is removed since it stores the class of each row (glass fragment); this variable will be used for external evaluation.

# standardize a numeric vector: subtract the mean and divide by the standard deviation
my_scale <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
glass_scl <- glass %>% mutate_if(is.numeric, my_scale)   # scale all numeric columns
glass_scl <- glass_scl %>% select(-Type)                 # drop the class attribute
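The same standardization could also be obtained with base R's scale(); a small sketch (glass_scl_base is a hypothetical name, not used elsewhere in this report):

# equivalent standardization with base R, dropping the class attribute first
glass_scl_base <- as.data.frame(scale(glass %>% select(-Type)))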

First 6 examples of the new scaled data set that will be used for clustering

Na Mg Al Si K Ca
0.279 1.249 -0.692 -1.185 -1.120 -0.160
0.584 0.625 -0.147 0.079 0.108 -0.809
0.144 0.589 0.231 0.425 -0.155 -0.844
-0.246 0.688 -0.293 -0.080 0.371 -0.534
-0.173 0.639 -0.398 0.545 0.312 -0.640
-0.758 0.632 0.398 0.399 0.575 -0.640

Outlier detection could be more aggressive, but with the information and the domain knowledge available we stop here and start our cluster analysis.

K-means Clustering

k-means in R implicitly assumes Euclidean distances. We are fine with this since we are dealing with continuous numeric variables that have been standardized (zero mean, unit variance).

eucl_dist<- dist(glass_scl, method = "euclidean")

The heatmap below visualizes the distance matrix. Red corresponds to low distance values and hence high similarity, while blue corresponds to high distance values and low similarity. The original data frame is already sorted according to the class type; for this reason the map shows boxy blue and red areas.

fviz_dist(eucl_dist, show_labels = F)

The k-means algorithm uses a random set of initial centers to arrive at the final partition. Because the initial centers are chosen randomly, the same kmeans(..) call may give different results every time it is run, and thus partitions of slightly different quality. The nstart argument of kmeans() allows us to run the algorithm several times with different initial centers, in order to obtain a potentially better partition:

nstart is set equal to 20: the algorithm is run 20 times with different random initial centers, and the run with the lowest total within-cluster SSE is kept, so there is no need to repeat this step manually with different seeds.

iter.max is the maximum number of times the algorithm will repeat the cluster assignment and centroid update. We set iter.max = 10, since the algorithm usually converges in only a few iterations.

With the code below we run the k-means algorithm 20 times, each time with a different value of K (from 1 to 20). For every run of k-means we store a few important clustering quantities: the total within-cluster sum of squares, the total between-cluster sum of squares, the total sum of squares and the number of iterations needed to converge.

n_cluster <- 20                      # maximum number of clusters to try

tot_within_ss <- rep(NA, n_cluster)  # total within-cluster sum of squares per K
tot_btwn_ss <- rep(NA, n_cluster)    # total between-cluster sum of squares per K
tot_ss <- rep(NA, n_cluster)         # total sum of squares
iter <- rep(NA, n_cluster)           # iterations needed to converge

for (i in seq(n_cluster)){
  set.seed(123)
  k <- kmeans(glass_scl, centers = i, iter.max = 10, nstart = 20, trace = FALSE)
  tot_within_ss[i] <- k$tot.withinss
  tot_btwn_ss[i] <- k$betweenss
  tot_ss[i] <- k$totss
  iter[i] <- k$iter
}

We store all these values in a data frame for convenience.
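A possible way to collect the quantities computed in the loop above (k_results is our name for the data frame, not one from the original code):

# gather the per-K results into a single data frame
k_results <- tibble(
  K = seq(n_cluster),
  tot_within_ss = tot_within_ss,
  tot_btwn_ss = tot_btwn_ss,
  tot_ss = tot_ss,
  iter = iter
)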

K tot_within_ss tot_btwn_ss tot_ss iter
1 1266.000 0.000 1266 1
2 923.704 342.296 1266 1
3 727.838 538.162 1266 3
4 573.340 692.660 1266 3
5 480.079 785.921 1266 3
6 433.135 832.865 1266 4
7 396.504 869.496 1266 3
8 351.760 914.240 1266 4
9 331.683 934.317 1266 5
10 295.656 970.344 1266 4
11 284.256 981.744 1266 6
12 268.885 997.115 1266 3
13 246.104 1019.896 1266 5
14 226.813 1039.187 1266 6
15 214.057 1051.943 1266 4
16 196.561 1069.439 1266 5
17 185.403 1080.597 1266 4
18 178.046 1087.954 1266 5
19 167.244 1098.756 1266 4
20 157.016 1108.984 1266 4

The plot below shows how the within-cluster SSE (blue line) and the between-cluster SS (red line) vary with the number of clusters. As expected, the within-cluster SSE decreases as K increases, while the between-cluster SS increases.
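A sketch of how such a plot could be drawn with ggplot2, assuming the hypothetical k_results data frame introduced above:

# within-cluster SSE (blue) and between-cluster SS (red) as a function of K
ggplot(k_results, aes(x = K)) +
  geom_line(aes(y = tot_within_ss), color = "blue") +
  geom_line(aes(y = tot_btwn_ss), color = "red") +
  labs(y = "sum of squares")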

Based on the plot above we would probably choose a K between 5 and 10, since the 'elbow' shape is not very sharp. We pick K = 5 because the slope of SSE(K) tends to flatten after 5. Additionally, for higher values of K we end up with clusters composed of one or very few observations. This choice is also in line with the number of classes in our data set (6 types of glass, coming from 5 broader groups).

We also check the average silhouette to determine the optimal value of K. Each point in a cluster has a silhouette value calculated with the following formula:

\[ s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \]

\(a_i\) = average distance of point \(i\) to the other points in its own cluster

\(b_i\) = minimum, over the other clusters, of the average distance of point \(i\) to the points in that cluster
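A sketch of how the average silhouette width could be computed for a range of K values, assuming the cluster package (the original code for this step is not shown):

library(cluster)    # assumption: silhouette() from the cluster package
# average silhouette width for K = 2, ..., 10
avg_sil <- sapply(2:10, function(k) {
  set.seed(123)
  km <- kmeans(glass_scl, centers = k, iter.max = 10, nstart = 20)
  sil <- silhouette(km$cluster, eucl_dist)   # silhouette value of every observation
  mean(sil[, "sil_width"])
})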

This method suggests K = 3; however, K = 5 has a similar average silhouette value.

set.seed(123)
k_best <- 5   # chosen number of clusters (see the discussion above)
k_opt <- kmeans(glass_scl, centers = k_best, iter.max = 10, nstart = 20, trace = FALSE)

The table below shows the centroids (value of each component: Na, Mg, Al, Si, K, Ca), the size and the within-cluster SSE for each cluster.

cluster Na Mg Al Si K Ca size withinss
1 0.6258018 0.4543167 -0.8527016 -1.0607431 -0.8436043 0.3495860 38 93.98991
2 0.4352799 -0.4711667 1.4115040 -0.3885802 3.6289213 -1.6811328 6 70.47981
3 -0.9081129 -1.7192798 -0.2547484 -0.0992432 -0.3968132 2.4748820 19 141.74728
4 -0.3976774 0.5255852 -0.0805090 0.1823138 0.4084275 -0.3745720 123 102.93498
5 1.5298582 -1.7852961 1.4875564 0.8500286 -1.2466806 -0.1595338 26 70.92652
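All the quantities in this table are available directly from the fitted kmeans object, for example:

k_opt$centers    # centroids in the scaled variable space
k_opt$size       # number of observations per cluster
k_opt$withinss   # within-cluster sum of squares per cluster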

Cluster 3 is the second smallest group and it has a much higher within-cluster SSE than the others.

Our clusters have dimensionality 6, hence we cannot plot them as they are. We use PCA (principal component analysis) to visualize them: PCA can be thought of as a way to construct new axes, as combinations of the original variables, that point in the directions of maximal variance (in the original variable space).
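One way to produce such a plot is fviz_cluster() from factoextra (the same package as fviz_dist above); this is a sketch, not necessarily the original plotting code:

# project the observations on the first two principal components, coloured by cluster
fviz_cluster(k_opt, data = glass_scl, geom = "point", ellipse.type = "convex")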

The plot below confirms what was found earlier: cluster 3 is 'sparser' than the others, while clusters 1 and 4 are well defined and compact.

External Evaluation

The confusion matrix below shows the composition of each cluster in terms of glass type.

Ideally we would like as little variety as possible within a cluster, which corresponds to purity = 1 and entropy = 0.
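The construction of this contingency table is not shown in the original code; one possible way to build it, using the k-means assignments and the glass data frame (which still contains Type):

# rows: k-means clusters, columns: glass type (an assumed construction)
cm_km <- table(paste0("cluster", k_opt$cluster), glass$Type)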

1 2 3 5 6 7
cluster1 21 5 6 0 4 2
cluster2 0 1 0 1 0 4
cluster3 0 10 0 8 1 0
cluster4 49 60 11 2 0 1
cluster5 0 0 0 0 4 22

We use entropy and purity to evaluate our clusters externally against the glass type.

\[ Entropy(i):\ e_i = -\sum_{j=1}^{L} p_{ij}\cdot \log_2(p_{ij}) \qquad Purity(i) = \max_j\, p_{ij} \]

with \(i\): cluster, \(j\): class (glass type) and \(p_{ij}\): the fraction of observations in cluster \(i\) belonging to class \(j\).

cl_n <- nrow(cm_km)    # number of clusters
ent <- rep(NA, cl_n)   # entropy of each cluster
pur <- rep(NA, cl_n)   # purity of each cluster

for (i in seq_len(cl_n)){
  p <- cm_km[i, ] / sum(cm_km[i, ])   # class proportions within cluster i
  pur[i] <- max(p) %>% round(2)       # purity: largest class proportion
  ent[i] <- -sum(p * log2(p + exp(-100))) %>% round(2)   # small offset avoids log2(0)
}

\[ Tot\_entropy = \sum_{i=1}^{K} \frac{m_i}{m} \cdot e_i \qquad Tot\_purity = \sum_{i=1}^{K} \frac{m_i}{m} \cdot Purity(i) \]

where \(m_i\) is the number of observations in cluster \(i\) and \(m\) the total number of observations.

cluster1 cluster2 cluster3 cluster4 cluster5
Entropy 1.84 1.25 1.24 1.50 0.62
Purity 0.55 0.67 0.53 0.49 0.85
weights <- apply(cm_km, FUN = sum, MARGIN = 1) / sum(cm_km)   # cluster weights m_i / m
tot_purity_km <- sum(weights * pur) %>% round(2)
tot_entropy_km <- sum(weights * ent) %>% round(2)
km_entpur <- rbind(tot_entropy_km, tot_purity_km) %>% as.data.frame() %>% rename(value = V1)
km_entpur %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
value
tot_entropy_km 1.42
tot_purity_km 0.55

The heatmap below is produced with the observations grouped according to the results of the k-means analysis. The map shows 2-3 very similar groups, corresponding to the grey boxes (distance values below approximately 2).
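One way such a map could be produced is to reorder the observations by their k-means cluster label before recomputing and plotting the distance matrix; this is only a sketch and may not match the grey-scale figure exactly:

ord <- order(k_opt$cluster)   # sort observations by cluster label
fviz_dist(dist(glass_scl[ord, ]), show_labels = FALSE)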

Hierarchical Clustering

To perform hierarchical cluster analysis we use the function hclust. Initially, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster. At each stage distances between clusters are recomputed according to the particular clustering method being used.

We choose to perform hierarchical clustering with Ward linkage, since the results returned are the most comparable with k-means. This method minimizes the total within-cluster variance: at each step, the pair of clusters with the minimum between-cluster distance is merged. In R, the ward.D2 method squares the dissimilarities before updating the clusters.

The Ward method is less susceptible to noise and outliers and tends to form globular clusters.

hclust_ward<-hclust(eucl_dist, method = "ward.D2")

The algorithm uses all the examples in our data frame, so there is no need for subsampling to have the same sample size in both clustering algorithms.

hclust_ward$order %>% max()
## [1] 212

The barplot below can help us decide the number of clusters: it plots the dendrogram merge height against the number of clusters.
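A sketch of how such a barplot could be built from the merge heights stored in the hclust object (the original plotting code is not shown):

# heights of the last merges, largest first; after the j-th largest merge the
# dendrogram has j clusters, so the x-axis can be read as "number of clusters"
h <- sort(hclust_ward$height, decreasing = TRUE)[1:15]
barplot(h, names.arg = 1:15,
        xlab = "number of clusters", ylab = "merge height")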

At number of clusters = 4 we find the threshold after which the jump in height is no longer significant. We choose this value as the number of clusters.

The dendrogram below shows how all the observations in the data set get clustered/split.

As we move up the tree, objects that are similar to each other are combined into branches, which are themselves fused at a higher height. The height of the fusion, provided on the vertical axis, indicates the distance between two objects/clusters. The higher the height of the fusion, the less similar the objects are.

The dendrogram can also help us determine the optimal number of clusters. If we cut the dendrogram at height 20 we get 4 clusters.

These clusters seem well separated.

The table below tells us how many observations we have in each cluster.
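The assignments for a 4-cluster cut can be obtained with cutree; a sketch (the cluster1-cluster4 labels are added to match the tables that follow):

# cut the Ward dendrogram into 4 clusters and count the observations per cluster
hc_cl <- paste0("cluster", cutree(hclust_ward, k = 4))
table(hc_cl)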

hc_cl Freq
cluster1 36
cluster2 127
cluster3 21
cluster4 28

As done earlier, we plot the clusters in 2D thanks to PCA. Again, cluster 3 is the most spread out, while clusters 2 and 4 are the most compact.

External Evaluation

For each observation we attach the cluster label from the hierarchical clustering. By comparing this with the actual class we can calculate purity and entropy for each cluster, as well as the totals.
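As for k-means, the contingency table cm_hc used below can be built from the cluster labels and the glass type (an assumed construction, not shown in the original code):

# rows: hierarchical clusters, columns: glass type
cm_hc <- table(hc_cl, glass$Type)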

1 2 3 5 6 7
cluster1 20 3 6 0 4 3
cluster2 50 60 11 1 0 5
cluster3 0 11 0 10 0 0
cluster4 0 2 0 0 5 21
cl_n<-nrow(cm_hc)
ent<-rep(NA, cl_n)
pur<-rep(NA, cl_n)

for (i in seq(1:cl_n)){
  p<-cm_hc[i,]/sum(cm_hc[i,])
  pur[i]<-max(p) %>% round(2)
  ent[i]<- -sum(p*log2(p + exp(-100))) %>% round(2)
}
cluster1 cluster2 cluster3 cluster4
Entropy 1.85 1.58 1.00 1.03
Purity 0.56 0.47 0.52 0.75
value
tot_entropy_hc 1.50
tot_purity_hc 0.53

The heatmap shows 3 rather clear groups with low distance values (grey).

Evaluation, Comparison and Conclusion

From the 3 heatmaps above it is clear that there is a pattern in the data, and both algorithms spot it. The main difference lies in the number of clusters found.

Hierarchical clustering makes it easier to determine the number of clusters: the cut is quite clear for K = 4. K-means is less clear-cut in determining the optimal K, especially when the within-cluster SSE is used to decide.

The entropy and purity values for hierarchical clustering and k-means are very similar.

K-means is known to have problems with outliers. In this project we removed only the observations that were very far away from the rest of the points. Therefore, a better strategy might be to put more effort into finding and excluding outliers; this should yield crisper clusters.

More outliers?

One additional method to detect outliers is LOF (Local Outlier Factor). Observations whose LOF score exceeds a chosen threshold are considered outliers.

LOF is somewhat similar in spirit to DBSCAN: it compares the local density of a point to the local densities of its neighbors. Points that have a substantially lower density than their neighbors are considered outliers.

The red line is an arbitrary cut-off to decide which observations can be considered outliers.

# lof() is assumed to come from the dbscan package; k is the neighbourhood size
lof <- lof(glass_scl, k = 5)

# plot the sorted LOF scores; the horizontal line marks the chosen cut-off
ggplot(tibble(index = seq_len(length(lof)), lof = sort(lof)), aes(index, lof)) +
  geom_line() +
  geom_hline(yintercept = 2, color = "red")

This is just a suggestion for further investigation; we do not remove observations based on the LOF method in this project.

References

https://www.statsandr.com/blog/outliers-detection-in-r/

https://rawgit.com/mhahsler/Introduction_to_Data_Mining_R_Examples/master/chap7.html#internal-cluster-validation

https://www.statsandr.com/blog/clustering-analysis-k-means-and-hierarchical-clustering-by-hand-and-in-r/

Practical Guide to Cluster Analysis in R by A. Kassambara (Datanovia)