This data set contains 214 instances, each corresponding to a glass fragment. Each instance is described by 11 attributes.
| ID | RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1.52101 | 13.64 | 4.49 | 1.10 | 71.78 | 0.06 | 8.75 | 0 | 0.00 | 1 |
| 2 | 1.51761 | 13.89 | 3.60 | 1.36 | 72.73 | 0.48 | 7.83 | 0 | 0.00 | 1 |
| 3 | 1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0 | 0.00 | 1 |
| 4 | 1.51766 | 13.21 | 3.69 | 1.29 | 72.61 | 0.57 | 8.22 | 0 | 0.00 | 1 |
| 5 | 1.51742 | 13.27 | 3.62 | 1.24 | 73.08 | 0.55 | 8.07 | 0 | 0.00 | 1 |
| 6 | 1.51596 | 12.79 | 3.61 | 1.62 | 72.97 | 0.64 | 8.07 | 0 | 0.26 | 1 |
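The preview above can be reproduced with a sketch like the following; the file name and the explicit column names are assumptions (the UCI glass.data file has no header row):

library(tidyverse)   # dplyr/ggplot2, used throughout
library(factoextra)  # fviz_* functions used later

# hypothetical path; adjust to where the data file is stored
glass <- read.csv("glass.data", header = FALSE,
                  col.names = c("ID", "RI", "Na", "Mg", "Al", "Si",
                                "K", "Ca", "Ba", "Fe", "Type"))
head(glass)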
Attribute Information:
| Col number | Col name | Description |
|---|---|---|
| 1 | Id | number: 1 to 214 |
| 2 | RI | refractive index |
| 3 | Na | Sodium (unit measurement: weight percent in corresponding oxide, as are attributes 4-10) |
| 4 | Mg | Magnesium |
| 5 | Al | Aluminum |
| 6 | Si | Silicon |
| 7 | K | Potassium |
| 8 | Ca | Calcium |
| 9 | Ba | Barium |
| 10 | Fe | Iron |
| 11 | Type | Type of glass (class attribute): 7 different types (see below) |
Type of glass:
There are 7 types of glass, grouped into 5 broader groups:
| Code | Glass description |
|---|---|
| 1 | building_windows_float_processed |
| 2 | building_windows_non_float_processed |
| 3 | vehicle_windows_float_processed |
| 4 | vehicle_windows_non_float_processed (none in this database) |
| 5 | containers |
| 6 | tableware |
| 7 | headlamps |
As suggested by the data set description, this can be treated as a classification problem; here we approach it in an unsupervised way, through clustering. The study of the classification of glass types was motivated by criminological investigation: glass left at the scene of a crime can be used as evidence if it is correctly identified.
Cluster analysis will try to find patterns in the chemical composition of a sample of glass fragments.
The clusters found might be useful to draw inferences on the type of glass.
There are no NA (missing) values in the data set.
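A quick check, for example with is.na(), confirms this:

# count the missing values per column; every entry should be zero
colSums(is.na(glass))
sum(is.na(glass))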
| column | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | 214 | 107.500 | 61.921 | 107.500 | 107.500 | 53.500 | 1.000 | 214.000 | 213.000 | 0.000 | 1.800 | 4.233 |
| RI | 214 | 1.518 | 0.003 | 1.518 | 1.518 | 0.001 | 1.511 | 1.534 | 0.023 | 1.614 | 7.789 | 0.000 |
| Na | 214 | 13.408 | 0.817 | 13.300 | 13.377 | 0.435 | 10.730 | 17.380 | 6.650 | 0.451 | 5.953 | 0.056 |
| Mg | 214 | 2.685 | 1.442 | 3.480 | 2.866 | 0.205 | 0.000 | 4.490 | 4.490 | -1.144 | 2.571 | 0.099 |
| Al | 214 | 1.445 | 0.499 | 1.360 | 1.412 | 0.210 | 0.290 | 3.500 | 3.210 | 0.901 | 4.985 | 0.034 |
| Si | 214 | 72.651 | 0.775 | 72.790 | 72.707 | 0.385 | 69.810 | 75.410 | 5.600 | -0.725 | 5.871 | 0.053 |
| K | 214 | 0.497 | 0.652 | 0.555 | 0.432 | 0.115 | 0.000 | 6.210 | 6.210 | 6.506 | 56.392 | 0.045 |
| Ca | 214 | 8.957 | 1.423 | 8.600 | 8.742 | 0.445 | 5.430 | 16.190 | 10.760 | 2.033 | 9.499 | 0.097 |
| Ba | 214 | 0.175 | 0.497 | 0.000 | 0.034 | 0.000 | 0.000 | 3.150 | 3.150 | 3.392 | 15.222 | 0.034 |
| Fe | 214 | 0.057 | 0.097 | 0.000 | 0.036 | 0.000 | 0.000 | 0.510 | 0.510 | 1.742 | 5.572 | 0.007 |
| Type | 214 | 2.780 | 2.104 | 2.000 | 2.483 | 1.000 | 1.000 | 7.000 | 6.000 | 1.107 | 2.699 | 0.144 |
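A summary table similar to the one above can be produced, for instance, with psych::describe(); note that skewness and kurtosis conventions differ between packages, so those columns may not match exactly:

library(psych)

# per-column descriptive statistics (n, mean, sd, median, trimmed mean, mad, ...)
describe(glass)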
We can remove the attribute ID since it is only used for identification; it carries no information useful for clustering.
glass<-glass %>% select(-ID)
Attribute Type needs to be transformed into a factor.
glass$Type<-factor(glass$Type)
levels(glass$Type)
## [1] "1" "2" "3" "5" "6" "7"
From the summary table and the histograms we can also detect possible outliers. For example, the maximum values of K, Ba and Fe are far higher than the corresponding third quartiles/medians. The histograms below shed some light on this.
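A possible way to draw these histograms with ggplot2 (the column selection and binning are choices of this sketch):

library(tidyverse)

# one histogram per element, on free scales
glass %>%
  select(K, Ba, Fe) %>%
  pivot_longer(everything(), names_to = "element", values_to = "value") %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ element, scales = "free")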
First we check K. A few instances have a very high value of K.
We treat these observations as outliers and discard them from the analysis, since they could create issues during clustering.
glass<-glass[glass$K<6,]
The situation is different for Ba: most of the examples in our data set have Ba = 0. We decide to drop this variable from the analysis, since it will most likely not be very useful for clustering.
glass<-glass %>% select(-Ba)
Fe also has many examples with value 0. By the same logic, we decide to drop Fe from the analysis.
glass<-glass %>% select(-Fe)
Ca (calcium) and RI (refractive index) are rather strongly positively correlated. The scatter plot below suggests that the relationship between the two is linear: RI increases as Ca increases.
This suggests that we can keep only one of these two variables. We decide to remove RI and keep calcium, as the data set is mainly composed of chemical elements.
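Before dropping RI, a quick numerical check of this relationship (a sketch; the actual figure in the report may have been produced differently):

library(ggplot2)

# linear correlation between refractive index and calcium
cor(glass$RI, glass$Ca)

# scatter plot with a fitted regression line
ggplot(glass, aes(Ca, RI)) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE)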
glass<-glass %>% select(-RI)
The variables have rather different scales; this can distort the analysis, since attributes on larger scales would dominate the distances and steer the clustering. The next step is therefore to standardize the data.
For each numeric column, each value \(x_{ij}\) (attribute \(i\), instance \(j\)) is standardized as follows:
\[
s_{ij} = \frac{x_{ij} - \mu_{i}}{\sigma_{i}}
\]
Only numeric variables are scaled. The attribute Type is removed since it stores the class of each row (glass fragment); this variable will be used for external evaluation.
my_scale <- function(x) (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
glass_scl<- glass %>% mutate_if(is.numeric, my_scale)
glass_scl<-glass_scl %>% select(-Type)
The first 6 rows of the new scaled data set that will be used for clustering:
| Na | Mg | Al | Si | K | Ca |
|---|---|---|---|---|---|
| 0.279 | 1.249 | -0.692 | -1.185 | -1.120 | -0.160 |
| 0.584 | 0.625 | -0.147 | 0.079 | 0.108 | -0.809 |
| 0.144 | 0.589 | 0.231 | 0.425 | -0.155 | -0.844 |
| -0.246 | 0.688 | -0.293 | -0.080 | 0.371 | -0.534 |
| -0.173 | 0.639 | -0.398 | 0.545 | 0.312 | -0.640 |
| -0.758 | 0.632 | 0.398 | 0.399 | 0.575 | -0.640 |
Outlier detection could be more aggressive, but with the information and domain knowledge available we stop here and start the cluster analysis.
The k-means implementation in R implicitly assumes Euclidean distances. This is fine here, since we are dealing with continuous numeric variables that have been standardized.
eucl_dist<- dist(glass_scl, method = "euclidean")
The heatmap below visualizes the distance matrix. Red refers to low distance values and hence high similarity, while blue corresponds to high distance values and low similarity. The original data frame is already sorted by class type, which is why the map shows boxy blue and red areas.
fviz_dist(eucl_dist, show_labels = F)
The k-means algorithm uses a random set of initial centers to arrive at the final partition. Because the initial centers are chosen at random, the same kmeans(..) call may give different results every time it is run, and thus partitions of slightly different quality. The nstart argument of kmeans() allows us to run the algorithm several times with different initial centers, in order to obtain a potentially better partition:
nstart is set to 20: the random initial assignment of centers is performed 20 times, the function keeps the best result (the one with the lowest total within-cluster SSE) and continues from there, so there is no need to rerun the function manually with different seeds.
iter.max is the maximum number of iterations of cluster assignment and centroid updates. We set iter.max = 10, since k-means usually converges within a few iterations.
With the code below we run the k-means algorithm 20 times, each time with a different value of K (from 1 to 20). For every run we store a few important clustering quantities: the total within-cluster sum of squares, the between-cluster sum of squares, the total sum of squares and the number of iterations needed to converge.
n_cluster <- 20  # values of K to try

tot_within_ss <- rep(NA, n_cluster)
tot_btwn_ss <- rep(NA, n_cluster)
tot_ss <- rep(NA, n_cluster)
iter <- rep(NA, n_cluster)

for (i in seq(n_cluster)) {
  set.seed(123)
  k <- kmeans(glass_scl, centers = i, iter.max = 10, nstart = 20, trace = FALSE)
  tot_within_ss[i] <- k$tot.withinss  # within-cluster SSE
  tot_btwn_ss[i] <- k$betweenss       # between-cluster SS
  tot_ss[i] <- k$totss                # total SS (constant)
  iter[i] <- k$iter                   # iterations to convergence
}
We store all these in a data frame for convenience
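For example (the name km_res for this data frame is an assumption of this sketch):

library(tibble)

# collect the per-K results in a single data frame
km_res <- tibble(K = seq(n_cluster), tot_within_ss, tot_btwn_ss, tot_ss, iter)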
| K | tot_within_ss | tot_btwn_ss | tot_ss | iter |
|---|---|---|---|---|
| 1 | 1266.000 | 0.000 | 1266 | 1 |
| 2 | 923.704 | 342.296 | 1266 | 1 |
| 3 | 727.838 | 538.162 | 1266 | 3 |
| 4 | 573.340 | 692.660 | 1266 | 3 |
| 5 | 480.079 | 785.921 | 1266 | 3 |
| 6 | 433.135 | 832.865 | 1266 | 4 |
| 7 | 396.504 | 869.496 | 1266 | 3 |
| 8 | 351.760 | 914.240 | 1266 | 4 |
| 9 | 331.683 | 934.317 | 1266 | 5 |
| 10 | 295.656 | 970.344 | 1266 | 4 |
| 11 | 284.256 | 981.744 | 1266 | 6 |
| 12 | 268.885 | 997.115 | 1266 | 3 |
| 13 | 246.104 | 1019.896 | 1266 | 5 |
| 14 | 226.813 | 1039.187 | 1266 | 6 |
| 15 | 214.057 | 1051.943 | 1266 | 4 |
| 16 | 196.561 | 1069.439 | 1266 | 5 |
| 17 | 185.403 | 1080.597 | 1266 | 4 |
| 18 | 178.046 | 1087.954 | 1266 | 5 |
| 19 | 167.244 | 1098.756 | 1266 | 4 |
| 20 | 157.016 | 1108.984 | 1266 | 4 |
The plot below shows how the within-cluster SSE (blue line) and the BSS (red line) vary with the number of clusters. As expected, the within-cluster SSE decreases as K increases, while the BSS increases.
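One way to draw this plot from the km_res data frame sketched above:

library(ggplot2)

# within-cluster SSE in blue, between-cluster SS in red
ggplot(km_res, aes(K)) +
  geom_line(aes(y = tot_within_ss), color = "blue") +
  geom_point(aes(y = tot_within_ss), color = "blue") +
  geom_line(aes(y = tot_btwn_ss), color = "red") +
  geom_point(aes(y = tot_btwn_ss), color = "red") +
  labs(y = "Sum of squares")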
Based on the plot above we would probably choose a K between 5 and 10, since the 'elbow' is not very sharp. We pick K = 5 because the slope of SSE(K) tends to flatten after 5. Additionally, for higher values of K we end up with clusters composed of one or very few observations. This choice is also consistent with the number of classes in our data set (6 types of glass, coming from 5 broader groups).
We also check the average silhouette to determine the optimal value of K. Each point of a cluster has a silhouette value calculated with the following formula:
\[
s_i = \frac{b_i - a_i}{\max(a_i, b_i)}
\]
\(a_i\) = average distance of point \(i\) to the other points in its cluster
\(b_i\) = minimum, over the other clusters, of the average distance of point \(i\) to the points in that cluster
This method suggests K = 3; however, K = 5 has a similar average silhouette value.
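A possible way to produce the average-silhouette plot, assuming factoextra was used (further k-means settings are omitted in this sketch):

library(factoextra)

# average silhouette width for K = 2, ..., 10
fviz_nbclust(glass_scl, kmeans, method = "silhouette", k.max = 10)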
k_best <- 5  # chosen from the elbow and silhouette analysis above
set.seed(123)
k_opt <- kmeans(glass_scl, centers = k_best, iter.max = 10, nstart = 20, trace = FALSE)
The table below shows the centroids (value of each component: Na, Mg, Al, Si, K, Ca), the size and the within-cluster SSE for each cluster.
| cluster | Na | Mg | Al | Si | K | Ca | size | withinss |
|---|---|---|---|---|---|---|---|---|
| 1 | 0.6258018 | 0.4543167 | -0.8527016 | -1.0607431 | -0.8436043 | 0.3495860 | 38 | 93.98991 |
| 2 | 0.4352799 | -0.4711667 | 1.4115040 | -0.3885802 | 3.6289213 | -1.6811328 | 6 | 70.47981 |
| 3 | -0.9081129 | -1.7192798 | -0.2547484 | -0.0992432 | -0.3968132 | 2.4748820 | 19 | 141.74728 |
| 4 | -0.3976774 | 0.5255852 | -0.0805090 | 0.1823138 | 0.4084275 | -0.3745720 | 123 | 102.93498 |
| 5 | 1.5298582 | -1.7852961 | 1.4875564 | 0.8500286 | -1.2466806 | -0.1595338 | 26 | 70.92652 |
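A table like the one above can be assembled directly from the kmeans object, for example:

# centroids, cluster sizes and within-cluster SSE (sketch)
data.frame(cluster = seq_len(nrow(k_opt$centers)),
           k_opt$centers,
           size = k_opt$size,
           withinss = k_opt$withinss)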
Cluster 3 is the second smallest group, yet it has a much higher within-cluster SSE than the others.
Our clusters have dimensionality 6, hence we cannot plot them directly. We use PCA (principal component analysis) to plot them in two dimensions. PCA can be thought of as a way to construct new axes, as linear combinations of the original variables, that point in the directions of maximal variance in the original variable space.
The plot below shows the same results found earlier: cluster 3 is 'sparser' than the other ones, while clusters 1 and 4 are well defined and compact.
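The 2D projection can be obtained, for instance, with factoextra::fviz_cluster, which performs the PCA internally (an assumption about how the figure was produced):

library(factoextra)

# k-means clusters plotted on the first two principal components
fviz_cluster(k_opt, data = glass_scl, geom = "point")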
The confusion matrix below shows the composition of each cluster in terms of glass type.
Ideally we would like as little variety as possible within a cluster, which corresponds to purity = 1 and entropy = 0.
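The matrix can be built by cross-tabulating the cluster assignment with the Type column, which was kept in the unscaled data frame; cm_km is the name used in the code further down:

# rows: k-means clusters, columns: glass types
cm_km <- table(k_opt$cluster, glass$Type)
cm_km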
| | 1 | 2 | 3 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|
| cluster1 | 21 | 5 | 6 | 0 | 4 | 2 |
| cluster2 | 0 | 1 | 0 | 1 | 0 | 4 |
| cluster3 | 0 | 10 | 0 | 8 | 1 | 0 |
| cluster4 | 49 | 60 | 11 | 2 | 0 | 1 |
| cluster5 | 0 | 0 | 0 | 0 | 4 | 22 |
We use entropy and purity to evaluate our clusters externally against the glass type.
\[ Entropy(i): e_i = -\sum_{j=1}^{L} p_{ij}\cdot \log_2(p_{ij}) \] \[ Purity(i) = \max_j p_{ij} \] with \(i\): cluster and \(j\): class (glass type)
cl_n <- nrow(cm_km)
ent <- rep(NA, cl_n)
pur <- rep(NA, cl_n)
for (i in seq_len(cl_n)) {
  p <- cm_km[i, ] / sum(cm_km[i, ])   # class proportions within cluster i
  pur[i] <- max(p) %>% round(2)       # purity: largest class proportion
  ent[i] <- -sum(p * log2(p + exp(-100))) %>% round(2)  # entropy; exp(-100) avoids log2(0)
}
\[ Tot\_entropy = \sum_{i=1}^{K} \frac{m_i}{m} \cdot entropy(i) \] \[ Tot\_purity = \sum_{i=1}^{K} \frac{m_i}{m} \cdot purity(i) \] where \(m_i\) is the size of cluster \(i\) and \(m\) the total number of observations.
| | cluster1 | cluster2 | cluster3 | cluster4 | cluster5 |
|---|---|---|---|---|---|
| Entropy | 1.84 | 1.25 | 1.24 | 1.50 | 0.62 |
| Purity | 0.55 | 0.67 | 0.53 | 0.49 | 0.85 |
# weight each cluster by its relative size, then aggregate
weights <- apply(cm_km, FUN = sum, MARGIN = 1) / sum(cm_km)
tot_purity_km <- sum(weights * pur) %>% round(2)
tot_entropy_km <- sum(weights * ent) %>% round(2)
km_entpur <- rbind(tot_entropy_km, tot_purity_km) %>% as.data.frame() %>% rename(value = V1)
km_entpur %>% kbl() %>% kable_styling(bootstrap_options = c("striped", "hover", "condensed"))
| | value |
|---|---|
| tot_entropy_km | 1.42 |
| tot_purity_km | 0.55 |
The heatmap below is produced with the observations grouped according to the k-means results. The map shows two to three very similar groups, corresponding to the grey boxes (distance values below roughly 2).
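A sketch of how such a map could be produced, by reordering the observations by cluster membership before plotting the distance matrix (the original figure may have been generated differently):

library(factoextra)

# sort observations by k-means cluster, then plot the distances without further reordering
ord <- order(k_opt$cluster)
fviz_dist(dist(glass_scl[ord, ], method = "euclidean"), order = FALSE, show_labels = FALSE)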
To perform hierarchical cluster analysis we use the function hclust. Initially, each object is assigned to its own cluster and then the algorithm proceeds iteratively, at each stage joining the two most similar clusters, continuing until there is just a single cluster. At each stage distances between clusters are recomputed according to the particular clustering method being used.
We choose to perform hierarchical clustering with Ward's linkage, since the results returned are the most comparable with k-means. This method minimizes the total within-cluster variance: at each step the pair of clusters with the minimum between-cluster (Ward) distance, i.e. the smallest increase in within-cluster variance, is merged. In R, the ward.D2 method uses squared dissimilarities.
Ward's method is less susceptible to noise and outliers and tends to form globular clusters.
hclust_ward<-hclust(eucl_dist, method = "ward.D2")
The algorithm uses all the examples in our data frame, so there is no need for subsampling to have the same number of observations in both clustering approaches.
hclust_ward$order %>% max()
## [1] 212
The barplot below can help us decide the number of clusters: it plots the dendrogram (merge) height against the number of clusters.
At 4 clusters we reach the threshold after which the jumps in height are no longer significant, so we choose 4 as the number of clusters.
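One way to build such a barplot from the merge heights stored in the hclust object (mapping the i-th highest merge to i + 1 clusters is a convention of this sketch):

# heights of the last 20 merges, from the top of the dendrogram downwards
h <- rev(hclust_ward$height)[1:20]
barplot(h, names.arg = seq_along(h) + 1,
        xlab = "Number of clusters", ylab = "Merge height")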
The dendrogram below shows how all the observations in the data set are clustered/split.
As we move up the tree, objects that are similar to each other are combined into branches, which are themselves fused at a higher height. The height of the fusion, provided on the vertical axis, indicates the distance between two objects/clusters. The higher the height of the fusion, the less similar the objects are.
The dendrogram can also help us determining the optimal number of clusters. If we cut the dendrogram at height 20 we get 4 clusters.
These clusters seem well separated.
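A sketch of how the dendrogram, the four-cluster cut and the cluster assignment can be obtained (hc_cl is the name used below; the exact cluster labels may differ from those shown in the tables):

# dendrogram with the 4-cluster cut highlighted
plot(hclust_ward, labels = FALSE, hang = -1)
rect.hclust(hclust_ward, k = 4, border = "red")

# cluster membership for each observation
hc_cl <- cutree(hclust_ward, k = 4)
table(hc_cl)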
The table below tells us how many observations we have in each cluster.
| hc_cl | Freq |
|---|---|
| cluster1 | 36 |
| cluster2 | 127 |
| cluster3 | 21 |
| cluster4 | 28 |
As done earlier, we plot the clusters in 2D thanks to PCA. Again cluster 3 is the most spread out, while clusters 2 and 4 are the most compact ones.
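The PCA-based plot for the hierarchical solution can be drawn in the same way, for example:

library(factoextra)

# same 2D visualization, using the hierarchical cluster labels
fviz_cluster(list(data = glass_scl, cluster = hc_cl), geom = "point")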
For each observation we attach the label from the hierarchical clustering. By comparing it with the actual class we can calculate purity and entropy for each cluster, as well as the totals.
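As before, the confusion matrix (cm_hc in the code below) is a cross-tabulation of cluster labels and glass types:

# rows: hierarchical clusters, columns: glass types
cm_hc <- table(hc_cl, glass$Type)
cm_hc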
| | 1 | 2 | 3 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|
| cluster1 | 20 | 3 | 6 | 0 | 4 | 3 |
| cluster2 | 50 | 60 | 11 | 1 | 0 | 5 |
| cluster3 | 0 | 11 | 0 | 10 | 0 | 0 |
| cluster4 | 0 | 2 | 0 | 0 | 5 | 21 |
cl_n <- nrow(cm_hc)
ent <- rep(NA, cl_n)
pur <- rep(NA, cl_n)
for (i in seq_len(cl_n)) {
  p <- cm_hc[i, ] / sum(cm_hc[i, ])   # class proportions within cluster i
  pur[i] <- max(p) %>% round(2)
  ent[i] <- -sum(p * log2(p + exp(-100))) %>% round(2)
}
| | cluster1 | cluster2 | cluster3 | cluster4 |
|---|---|---|---|---|
| Entropy | 1.85 | 1.58 | 1.00 | 1.03 |
| Purity | 0.56 | 0.47 | 0.52 | 0.75 |
| | value |
|---|---|
| tot_entropy_hc | 1.50 |
| tot_purity_hc | 0.53 |
The heatmap shows three rather clear groups with low distance values (grey).
From the three heatmaps above it is clear that there is a pattern in the data, and both algorithms detect it. The main difference is the number of clusters found.
Hierarchical clustering makes it easier to determine the number of clusters. The cut is quite clear for K = 4. K-means seems less clear in determining the optimal K, this is especially true when within-cluster SSE is used to decide K.
The entropy and purity for Hierarchical and K-means are very similar.
K-means is known to have problems with outliers. In this project we removed only the observations that were very far away from the rest of the points. A better strategy might therefore be to put more effort into finding and excluding outliers; this should yield crisper clusters.
One additional method to detect outliers is LOF (Local Outlier Factor). Observations whose LOF score exceeds a chosen threshold are considered outliers.
LOF is somewhat similar in spirit to DBSCAN: it compares the local density of a point with the local densities of its neighbours; points with a substantially lower density than their neighbours are considered outliers.
The red line in the plot below is an arbitrary cut to decide which observations can be considered outliers.
# LOF score for each observation, based on its 5 nearest neighbours
lof <- lof(glass_scl, k = 5)

# sorted LOF scores; the red line marks the arbitrary threshold of 2
ggplot(tibble(index = seq_len(length(lof)), lof = sort(lof)), aes(index, lof)) +
  geom_line() +
  geom_hline(yintercept = 2, color = "red")
This is just a suggestion for further investigation; we do not remove observations based on the LOF method in this project.
References
https://www.statsandr.com/blog/outliers-detection-in-r/
Practical Guide to Cluster Analysis in R by A. Kassambara (Datanovia)