Data Set Introduction

This data set contains 214 instances, each corresponding to a glass fragment. Each instance is described by 11 attributes.

ID RI Na Mg Al Si K Ca Ba Fe Type
1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00 1
2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00 1
3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00 1
4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00 1
5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00 1
6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26 1

Attribute Information:

Col number Col name Description
1 ID identification number: 1 to 214
2 RI refractive index
3 Na Sodium (unit of measurement: weight percent in the corresponding oxide, as for attributes 4-10)
4 Mg Magnesium
5 Al Aluminum
6 Si Silicon
7 K Potassium
8 Ca Calcium
9 Ba Barium
10 Fe Iron
11 Type type of glass (class attribute), 7 different types (see below)

Type of glass:

There are 7 types of glass, grouped into 5 groups.

Code Glass description
1 building_windows_float_processed
2 building_windows_non_float_processed
3 vehicle_windows_float_processed
4 vehicle_windows_non_float_processed (none in this database)
5 containers
6 tableware
7 headlamps

As suggested by the data set description, this can be framed as a classification problem; here we approach it in an unsupervised way, with clustering. The study of classification of types of glass was motivated by criminological investigation: at the scene of the crime, the glass left behind can be used as evidence if it is correctly identified.

Cluster analysis will try to find patterns in the chemical composition of a sample of glass fragments.

The clusters found might be useful for drawing inferences about the type of glass.

Exploratory Data Analysis and Preprocessing

There are no missing values (NA) in the data set.
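A quick way to verify this, together with one possible way to obtain per-column summary statistics similar to the table below (the use of psych::describe here is an assumption, not necessarily how the table was produced):

sum(is.na(glass))    # total number of missing values; 0 for this data set
library(psych)       # assumption: psych package for descriptive statistics
describe(glass)      # n, mean, sd, median, trimmed mean, mad, min, max, range, skew, kurtosis, se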

column n mean sd median trimmed mad min max range skew kurtosis se
ID 214 107.500 61.921 107.500 107.500 53.500 1.000 214.000 213.000 0.000 1.800 4.233
RI 214 1.518 0.003 1.518 1.518 0.001 1.511 1.534 0.023 1.614 7.789 0.000
Na 214 13.408 0.817 13.300 13.377 0.435 10.730 17.380 6.650 0.451 5.953 0.056
Mg 214 2.685 1.442 3.480 2.866 0.205 0.000 4.490 4.490 -1.144 2.571 0.099
Al 214 1.445 0.499 1.360 1.412 0.210 0.290 3.500 3.210 0.901 4.985 0.034
Si 214 72.651 0.775 72.790 72.707 0.385 69.810 75.410 5.600 -0.725 5.871 0.053
K 214 0.497 0.652 0.555 0.432 0.115 0.000 6.210 6.210 6.506 56.392 0.045
Ca 214 8.957 1.423 8.600 8.742 0.445 5.430 16.190 10.760 2.033 9.499 0.097
Ba 214 0.175 0.497 0.000 0.034 0.000 0.000 3.150 3.150 3.392 15.222 0.034
Fe 214 0.057 0.097 0.000 0.036 0.000 0.000 0.510 0.510 1.742 5.572 0.007
Type 214 2.780 2.104 2.000 2.483 1.000 1.000 7.000 6.000 1.107 2.699 0.144

We can remove the attribute ID since it is only there for identification purposes; it carries no information useful for clustering.

glass<-glass %>% select(-ID)

Attribute Type needs to be transformed into a factor.

glass$Type<-factor(glass$Type)
levels(glass$Type)
## [1] "1" "2" "3" "5" "6" "7"

From the summary table and the histogram plots we can also detect possible outliers. For example, the maximum values of K, Ba and Fe are far higher than the corresponding third quartiles/medians. The histograms below shed some light on this.

First we check K. It seems that there are a few instances with a very high value of K.
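A minimal sketch of the kind of histogram referred to here, assuming ggplot2 (the original plotting code is not shown):

library(ggplot2)
# distribution of potassium (K); a handful of observations lie far to the right
ggplot(glass, aes(x = K)) +
  geom_histogram(bins = 30)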

We decide that these observations are outliers and discard them from the analysis, since they could create issues during clustering.

glass<-glass[glass$K<6,]

The story is different for Ba: most of the examples in our data set have Ba = 0. We decide to discard this variable from the analysis, since it is unlikely to be very useful during clustering.

glass<-glass %>% select(-Ba)

Also for Fe we have many examples with value 0. By the same logic used before, we decide to discard Fe from our analysis.

glass<-glass %>% select(-Fe)

Ca (Calcium) and RI (refractive index) are rather highly (positively) correlated. From the scatter plot below it seems that the relationship between them is linear: RI increases as Ca increases.
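The strength of this relationship can be checked directly, for example with the Pearson correlation and a simple scatter plot (a sketch; the original plotting code is not shown):

cor(glass$RI, glass$Ca)    # Pearson correlation between refractive index and calcium
# scatter plot of RI against Ca
ggplot(glass, aes(x = Ca, y = RI)) +
  geom_point()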

This suggests that we can consider only one of these two variables. We decide to remove RI and keep Ca, since the data set is mainly composed of chemical elements.

glass<-glass %>% select(-RI)

The variables have rather different scales. This can bias the analysis, since attributes on larger scales would dominate the distance computation and steer the clustering. Therefore, the next step is to standardize the data.

For each numeric column, each value \(x_{ij}\) (attribute \(i\), instance \(j\)) is transformed in the following way:

\[ s_{ij} = \frac{x_{ij} - \mu_{i}}{\sigma_{i}} \]

where \(\mu_i\) and \(\sigma_i\) are the mean and standard deviation of attribute \(i\). Only numeric variables get scaled. The attribute Type is removed since it stores the class of each row (glass fragment); this variable will be used for external evaluation.

# standardize a numeric vector: subtract the mean and divide by the standard deviation
my_scale <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
glass_scl <- glass %>% mutate_if(is.numeric, my_scale)   # scale all numeric columns
glass_scl <- glass_scl %>% select(-Type)                 # drop the class attribute
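The same standardization could also be obtained with base R's scale(); a small sketch (glass_scl_base is a hypothetical name, not used elsewhere in this report):

# equivalent standardization with base R, dropping the class attribute first
glass_scl_base <- as.data.frame(scale(glass %>% select(-Type)))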

First 6 examples of the new scaled data set that will be used for clustering

Na Mg Al Si K Ca
0.279 1.249 -0.692 -1.185 -1.120 -0.160
0.584 0.625 -0.147 0.079 0.108 -0.809
0.144 0.589 0.231 0.425 -0.155 -0.844
-0.246 0.688 -0.293 -0.080 0.371 -0.534
-0.173 0.639 -0.398 0.545 0.312 -0.640
-0.758 0.632 0.398 0.399 0.575 -0.640

Outlier detection could be more aggressive, but with the information and the domain knowledge available we stop here and start our cluster analysis.

K-means Clustering

k-means in R implicitly assumes Euclidean distances. We are fine with this since we are dealing with continuous numeric variables that have been standardized (zero mean, unit variance).

eucl_dist<- dist(glass_scl, method = "euclidean")

The heatmap below visualizes the distance matrix. Red corresponds to low distance values and hence high similarity, while blue corresponds to high distance values and low similarity. The original data frame is already sorted according to the class type; for this reason the map shows boxy blue and red areas.

fviz_dist(eucl_dist, show_labels = F)

The k-means algorithm uses a random set of initial centers to arrive at the final partition. Because the initial centers are chosen randomly, the same kmeans(..) call may give different results every time it is run, and thus partitions of slightly different quality. The nstart argument of kmeans() allows us to run the algorithm several times with different initial centers, in order to obtain a potentially better partition:

nstart is set equal to 20: the algorithm is run 20 times with different random initial centers, and the run with the lowest total within-cluster SSE is kept, so there is no need to repeat this step manually with different seeds.

iter.max is the maximum number of times the algorithm will repeat the cluster assignment and centroid update. We set iter.max = 10, since the algorithm usually converges in only a few iterations.

With the code below we run the k-means algorithm 20 times, each time with a different value of K (from 1 to 20). For every run of k-means we store a few important clustering quantities: the total within-cluster sum of squares, the total between-cluster sum of squares, the total sum of squares and the number of iterations needed to converge.

n_cluster <- 20                      # maximum number of clusters to try

tot_within_ss <- rep(NA, n_cluster)  # total within-cluster sum of squares per K
tot_btwn_ss <- rep(NA, n_cluster)    # total between-cluster sum of squares per K
tot_ss <- rep(NA, n_cluster)         # total sum of squares
iter <- rep(NA, n_cluster)           # iterations needed to converge

for (i in seq(n_cluster)){
  set.seed(123)
  k <- kmeans(glass_scl, centers = i, iter.max = 10, nstart = 20, trace = FALSE)
  tot_within_ss[i] <- k$tot.withinss
  tot_btwn_ss[i] <- k$betweenss
  tot_ss[i] <- k$totss
  iter[i] <- k$iter
}

We store all these values in a data frame for convenience.
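A possible way to collect the quantities computed in the loop above (k_results is our name for the data frame, not one from the original code):

# gather the per-K results into a single data frame
k_results <- tibble(
  K = seq(n_cluster),
  tot_within_ss = tot_within_ss,
  tot_btwn_ss = tot_btwn_ss,
  tot_ss = tot_ss,
  iter = iter
)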

K tot_within_ss tot_btwn_ss tot_ss iter
1 1266.000 0.000 1266 1
2 923.704 342.296 1266 1
3 727.838 538.162 1266 3
4 573.340 692.660 1266 3
5 480.079 785.921 1266 3
6 433.135 832.865 1266 4
7 396.504 869.496 1266 3
8 351.760 914.240 1266 4
9 331.683 934.317 1266 5
10 295.656 970.344 1266 4
11 284.256 981.744 1266 6
12 268.885 997.115 1266 3
13 246.104 1019.896 1266 5
14 226.813 1039.187 1266 6
15 214.057 1051.943 1266 4
16 196.561 1069.439 1266 5
17 185.403 1080.597 1266 4
18 178.046 1087.954 1266 5
19 167.244 1098.756 1266 4
20 157.016 1108.984 1266 4

The plot below shows how the within-cluster SSE (blue line) and the between-cluster SS (red line) vary with the number of clusters. As expected, the within-cluster SSE decreases as K increases, while the between-cluster SS increases.
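A sketch of how such a plot could be drawn with ggplot2, assuming the hypothetical k_results data frame introduced above:

# within-cluster SSE (blue) and between-cluster SS (red) as a function of K
ggplot(k_results, aes(x = K)) +
  geom_line(aes(y = tot_within_ss), color = "blue") +
  geom_line(aes(y = tot_btwn_ss), color = "red") +
  labs(y = "sum of squares")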

Based on the plot above we would probably choose a K between 5 and 10, since the 'elbow' shape is not very sharp. We pick K = 5 because the slope of SSE(K) tends to flatten after 5. Additionally, for higher values of K we end up with clusters composed of one or very few observations. This choice is also in line with the number of classes in our data set (6 types of glass, coming from 5 broader groups).

We also check the average silhouette to determine the optimal value of K. Each point in a cluster has a silhouette value calculated with the following formula:

\[ s_i = \frac{b_i - a_i}{\max(a_i, b_i)} \]

\(a_i\) = average distance of point \(i\) to the other points in its own cluster

\(b_i\) = minimum, over the other clusters, of the average distance of point \(i\) to the points in that cluster
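A sketch of how the average silhouette width could be computed for a range of K values, assuming the cluster package (the original code for this step is not shown):

library(cluster)    # assumption: silhouette() from the cluster package
# average silhouette width for K = 2, ..., 10
avg_sil <- sapply(2:10, function(k) {
  set.seed(123)
  km <- kmeans(glass_scl, centers = k, iter.max = 10, nstart = 20)
  sil <- silhouette(km$cluster, eucl_dist)   # silhouette value of every observation
  mean(sil[, "sil_width"])
})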

This method suggests K = 3; however, K = 5 has a similar average silhouette value.

set.seed(123)
k_best <- 5   # chosen number of clusters (see the discussion above)
k_opt <- kmeans(glass_scl, centers = k_best, iter.max = 10, nstart = 20, trace = FALSE)

The table below shows the centroids (value of each component: Na, Mg, Al, Si, K, Ca), the size and the within-cluster SSE for each cluster.

cluster Na Mg Al Si K Ca size withinss
1 0.6258018 0.4543167 -0.8527016 -1.0607431 -0.8436043 0.3495860 38 93.98991
2 0.4352799 -0.4711667 1.4115040 -0.3885802 3.6289213 -1.6811328 6 70.47981
3 -0.9081129 -1.7192798 -0.2547484 -0.0992432 -0.3968132 2.4748820 19 141.74728
4 -0.3976774 0.5255852 -0.0805090 0.1823138 0.4084275 -0.3745720 123 102.93498
5 1.5298582 -1.7852961 1.4875564 0.8500286 -1.2466806 -0.1595338 26 70.92652
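All the quantities in this table are available directly from the fitted kmeans object, for example:

k_opt$centers    # centroids in the scaled variable space
k_opt$size       # number of observations per cluster
k_opt$withinss   # within-cluster sum of squares per cluster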

Cluster 3 is the second smallest group and it has a much higher within-cluster SSE than the others.

Our clusters have dimensionality 6, hence we cannot plot them as they are. We use PCA (principal component analysis) to visualize them: PCA can be thought of as a way to construct new axes, as combinations of the original variables, that point in the directions of maximal variance (in the original variable space).
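One way to produce such a plot is fviz_cluster() from factoextra (the same package as fviz_dist above); this is a sketch, not necessarily the original plotting code:

# project the observations on the first two principal components, coloured by cluster
fviz_cluster(k_opt, data = glass_scl, geom = "point", ellipse.type = "convex")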

The plot below confirms what was found earlier: cluster 3 is 'sparser' than the others, while clusters 1 and 4 are well defined and compact.

External Evaluation

The confusion matrix below shows the composition of each cluster in terms of glass type.

Ideally we would like as little variety as possible within a cluster, which corresponds to purity = 1 and entropy = 0.
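The construction of this contingency table is not shown in the original code; one possible way to build it, using the k-means assignments and the glass data frame (which still contains Type):

# rows: k-means clusters, columns: glass type (an assumed construction)
cm_km <- table(paste0("cluster", k_opt$cluster), glass$Type)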

1 2 3 5 6 7
cluster1 21 5 6 0 4 2
cluster2 0 1 0 1 0 4
cluster3 0 10 0 8 1 0
cluster4 49 60 11 2 0 1
cluster5 0 0 0 0 4 22

We use entropy and purity to evaluate our clusters externally against the glass type.

\[ Entropy(i):\ e_i = -\sum_{j=1}^{L} p_{ij}\cdot \log_2(p_{ij}) \qquad Purity(i) = \max_j\, p_{ij} \]

with \(i\): cluster, \(j\): class (glass type) and \(p_{ij}\): the fraction of observations in cluster \(i\) belonging to class \(j\).

cl_n <- nrow(cm_km)    # number of clusters
ent <- rep(NA, cl_n)   # entropy of each cluster
pur <- rep(NA, cl_n)   # purity of each cluster

for (i in seq_len(cl_n)){
  p <- cm_km[i, ] / sum(cm_km[i, ])   # class proportions within cluster i
  pur[i] <- max(p) %>% round(2)       # purity: largest class proportion
  ent[i] <- -sum(p * log2(p + exp(-100))) %>% round(2)   # small offset avoids log2(0)
}

\[ Tot\_entropy = \sum_{i=1}^{K} \frac{m_i}{m} \cdot e_i \qquad Tot\_purity = \sum_{i=1}^{K} \frac{m_i}{m} \cdot Purity(i) \]

where \(m_i\) is the number of observations in cluster \(i\) and \(m\) the total number of observations.

cluster1 cluster2 cluster3 cluster4 cluster5
Entropy 1.84 1.25 1.24 1.50 0.62
Purity 0.55 0.67 0.53 0.49 0.85
weights <- apply(cm_km, FUN = sum, MARGIN = 1) / sum(cm_km)   # cluster weights m_i / m
tot_purity_km <- sum(weights * pur) %>% round(2)
tot_entropy_km <- sum(weights * ent) %>% round(2)
km_entpur <- rbind(tot_entropy_km, tot_purity_km) %>% as.data.frame() %>% rename(value = V1)
km_entpur %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
value
tot_entropy_km 1.42
tot_purity_km 0.55

The heatmap below is produced with the observations grouped according to the results of the k-means analysis. The map shows 2-3 very similar groups, corresponding to the grey boxes (distance values below approximately 2).
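One way such a map could be produced is to reorder the observations by their k-means cluster label before recomputing and plotting the distance matrix; this is only a sketch and may not match the grey-scale figure exactly:

ord <- order(k_opt$cluster)   # sort observations by cluster label
fviz_dist(dist(glass_scl[ord, ]), show_labels = FALSE)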

Hierarchical Clustering

To perform hierarchical cluster analysis we use the function hclust. Initially, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster. At each stage distances between clusters are recomputed according to the particular clustering method being used.

We choose to perform hierarchical clustering with Ward linkage, since the results returned are the most comparable with k-means. This method minimizes the total within-cluster variance: at each step, the pair of clusters with the minimum between-cluster distance is merged. In R, the ward.D2 method squares the dissimilarities before updating the clusters.

The Ward method is less susceptible to noise and outliers and tends to form globular clusters.

hclust_ward<-hclust(eucl_dist, method = "ward.D2")

The algorithm uses all the examples in our data frame, so there is no need for subsampling to have the same sample size in both clustering algorithms.

hclust_ward$order %>% max()
## [1] 212

The barplot below can help us decide the number of clusters: it plots the dendrogram merge height against the number of clusters.
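A sketch of how such a barplot could be built from the merge heights stored in the hclust object (the original plotting code is not shown):

# heights of the last merges, largest first; after the j-th largest merge the
# dendrogram has j clusters, so the x-axis can be read as "number of clusters"
h <- sort(hclust_ward$height, decreasing = TRUE)[1:15]
barplot(h, names.arg = 1:15,
        xlab = "number of clusters", ylab = "merge height")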

At number of clusters = 4 we find the threshold after which the jump in height is no longer significant. We choose this value as the number of clusters.

The dendrogram below shows how all the observations in the data set get clustered/split.

As we move up the tree, objects that are similar to each other are combined into branches, which are themselves fused at a higher height. The height of the fusion, provided on the vertical axis, indicates the distance between two objects/clusters. The higher the height of the fusion, the less similar the objects are.

The dendrogram can also help us determine the optimal number of clusters. If we cut the dendrogram at height 20 we get 4 clusters.

These clusters seem well separated.

The table below tells us how many observations we have in each cluster.
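The assignments for a 4-cluster cut can be obtained with cutree; a sketch (the cluster1-cluster4 labels are added to match the tables that follow):

# cut the Ward dendrogram into 4 clusters and count the observations per cluster
hc_cl <- paste0("cluster", cutree(hclust_ward, k = 4))
table(hc_cl)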

hc_cl Freq
cluster1 36
cluster2 127
cluster3 21
cluster4 28

As done earlier, we plot the clusters in 2D thanks to PCA. Again, cluster 3 is the most spread out, while clusters 2 and 4 are the most compact.

External Evaluation

For each observation we attach the cluster label from the hierarchical clustering. By comparing this with the actual class we can calculate purity and entropy for each cluster, as well as the totals.
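As for k-means, the contingency table cm_hc used below can be built from the cluster labels and the glass type (an assumed construction, not shown in the original code):

# rows: hierarchical clusters, columns: glass type
cm_hc <- table(hc_cl, glass$Type)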

1 2 3 5 6 7
cluster1 20 3 6 0 4 3
cluster2 50 60 11 1 0 5
cluster3 0 11 0 10 0 0
cluster4 0 2 0 0 5 21
cl_n<-nrow(cm_hc)
ent<-rep(NA, cl_n)
pur<-rep(NA, cl_n)

for (i in seq(1:cl_n)){
  p<-cm_hc[i,]/sum(cm_hc[i,])
  pur[i]<-max(p) %>% round(2)
  ent[i]<- -sum(p*log2(p + exp(-100))) %>% round(2)
}
cluster1 cluster2 cluster3 cluster4
Entropy 1.85 1.58 1.00 1.03
Purity 0.56 0.47 0.52 0.75
value
tot_entropy_hc 1.50
tot_purity_hc 0.53

The heatmap shows 3 rather clear groups with low distance values (grey).

Evaluation, Comparison and Conclusion

From the 3 heatmaps above it is clear that there is a pattern in the data, and both algorithms spot it. The main difference lies in the number of clusters found.

Hierarchical clustering makes it easier to determine the number of clusters: the cut is quite clear for K = 4. K-means is less clear-cut in determining the optimal K, especially when the within-cluster SSE is used to decide.

The entropy and purity values for hierarchical clustering and k-means are very similar.

K-means is known to have problems with outliers. In this project we removed only the observations that were very far away from the rest of the points. Therefore, a better strategy might be to put more effort into finding and excluding outliers; this should yield crisper clusters.

More outliers?

One additional method to detect outliers is LOF (Local Outlier Factor). Observations whose LOF score exceeds a chosen threshold are considered outliers.

LOF is somewhat similar in spirit to DBSCAN: it compares the local density of a point to the local densities of its neighbors. Points that have a substantially lower density than their neighbors are considered outliers.

The red line is an arbitrary cut-off to decide which observations can be considered outliers.

# lof() is assumed to come from the dbscan package; k is the neighbourhood size
lof <- lof(glass_scl, k = 5)

# plot the sorted LOF scores; the horizontal line marks the chosen cut-off
ggplot(tibble(index = seq_len(length(lof)), lof = sort(lof)), aes(index, lof)) +
  geom_line() +
  geom_hline(yintercept = 2, color = "red")

This is just a suggestion for further investigation; we do not remove observations based on the LOF method in this project.

References

https://www.statsandr.com/blog/outliers-detection-in-r/

https://rawgit.com/mhahsler/Introduction_to_Data_Mining_R_Examples/master/chap7.html#internal-cluster-validation

https://www.statsandr.com/blog/clustering-analysis-k-means-and-hierarchical-clustering-by-hand-and-in-r/

Practical Guide to Cluster Analysis in R by A. Kassambara (Datanovia)