1 Introduction

The diamonds dataset from R contains various measurements of geometrical and physical characteristics and the price of individual diamonds. But we do not know what the name and type of diamonds in each row. This dataset is suitable for unsupervised learning because we can attempt to cluster the diamonds together based on some unifying characteristics.

2 Data Description

The description of each column in the dataset is as follows

A data frame with 53940 rows and 10 variables:

  • price : price in US dollars ($326 - $18,823)
  • carat : weight of the diamond (0.2–5.01)
  • cut : quality of the cut (Fair, Good, Very Good, Premium, Ideal)
  • color : diamond colour, from D (best) to J (worst)
  • clarity : a measurement of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
  • x : length in mm (0–10.74)
  • y : width in mm (0–58.9)
  • z : depth in mm (0–31.8)
  • depth : total depth percentage = z / mean(x, y) = 2 * z / (x + y) (43–79)
  • table : width of top of diamond relative to widest point (43–95)

3 EDA

Since the data is already available with R we can start immediately with EDA.

glimpse(diamonds)
#> Rows: 53,940
#> Columns: 10
#> $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
#> $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
#> $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
#> $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
#> $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
#> $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
#> $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
#> $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
#> $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
#> $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

The factor type columns (cut, color, and clarity) are all ordered so we can convert them into numbers that represents their level, the higher the number, the higher the level.

diamonds_cust <- diamonds %>% mutate(color = as.integer(color), cut = as.integer(cut), clarity = as.integer(clarity))
glimpse(diamonds_cust)
#> Rows: 53,940
#> Columns: 10
#> $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
#> $ cut     <int> 5, 4, 2, 4, 2, 3, 3, 3, 1, 3, 2, 5, 4, 5, 4, 4, 5, 2, 2, 3, 2,…
#> $ color   <int> 2, 2, 2, 6, 7, 7, 6, 5, 2, 5, 7, 7, 3, 7, 2, 2, 6, 7, 7, 7, 6,…
#> $ clarity <int> 2, 3, 5, 4, 2, 6, 7, 3, 4, 5, 3, 5, 3, 2, 2, 1, 2, 3, 3, 3, 2,…
#> $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
#> $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
#> $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
#> $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
#> $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
#> $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
summary(diamonds_cust)
#>      carat             cut            color          clarity     
#>  Min.   :0.2000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
#>  1st Qu.:0.4000   1st Qu.:3.000   1st Qu.:2.000   1st Qu.:3.000  
#>  Median :0.7000   Median :4.000   Median :4.000   Median :4.000  
#>  Mean   :0.7979   Mean   :3.904   Mean   :3.594   Mean   :4.051  
#>  3rd Qu.:1.0400   3rd Qu.:5.000   3rd Qu.:5.000   3rd Qu.:5.000  
#>  Max.   :5.0100   Max.   :5.000   Max.   :7.000   Max.   :8.000  
#>      depth           table           price             x         
#>  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
#>  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
#>  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
#>  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
#>  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
#>  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
#>        y                z         
#>  Min.   : 0.000   Min.   : 0.000  
#>  1st Qu.: 4.720   1st Qu.: 2.910  
#>  Median : 5.710   Median : 3.530  
#>  Mean   : 5.735   Mean   : 3.539  
#>  3rd Qu.: 6.540   3rd Qu.: 4.040  
#>  Max.   :58.900   Max.   :31.800

Let us now check the correlation between all these numerical columns.

library(GGally)
ggcorr(diamonds_cust, label = T)

There are some strongly correlated columns; we may be able to use pca to take out some of the columns later.

diamonds_cust %>% summarise_all(n_distinct)

The above table shows that most of the columns have unique values in the hundreds. This can cause a crash so we will subset the data first to prevent it and also to make the analysis faster.

RNGkind(sample.kind = "Rounding")
set.seed(100)
intrain <- sample(nrow(diamonds_cust), nrow(diamonds_cust)*0.002)
diamonds_subset <- diamonds_cust[intrain,]
dim(diamonds_subset)
#> [1] 107  10

We have sampled randomly ~ 100 observations from the diamonds_cust dataset

# before clustering we need to scale the data 
diamonds_subset_scaled <- scale(diamonds_subset)
diamonds_subset %>% head(100)%>% summarise_all(n_distinct)

Now the number of unique values has decreased to below 100.

4 Principal Component Analysis (PCA)

Before clustering we will do PCA. PCA is used to reduce the number of dimensions (the number of predictor columns) but it can be used to identify outliers and also how many principal axes need to be kept to keep the desired proportion of information in the data. We will first use PCA to identify outliers that may need to be removed.

library(FactoMineR)
pca_diamonds <- PCA(diamonds_subset_scaled, scale.unit=F)

The PCA plot of individual components show that there are 2 outliers, data 31 and 46. Their distance is not too far from the other data though so we will leave them.

diamonds_subset_scaled[c(81,31,46,61),]
#>         carat        cut      color     clarity      depth      table    price
#> [1,] 1.759908 -0.9475777  0.7841719 -0.06749522 -2.0932551  1.8104633 2.175733
#> [2,] 2.480731 -2.7421053 -0.3811947 -1.27115997  2.1930379  1.8104633 2.838569
#> [3,] 3.006737  0.8469499  1.9495386 -0.06749522  0.2447229 -0.1359780 2.895002
#> [4,] 2.305395  0.8469499  0.2014886 -1.27115997  0.8681837 -0.6225884 2.272057
#>             x        y        z
#> [1,] 1.700956 1.729817 1.406407
#> [2,] 1.861021 1.814214 2.102851
#> [3,] 2.324366 2.371232 2.343928
#> [4,] 1.869445 1.797334 1.915347

5 KMeans clustering

library(factoextra)
fviz_nbclust(diamonds_subset_scaled, kmeans, method = "wss")

From the graph there are 2 elbow points, one at 2 and the other at 7, we’ll be using 2 as the number of clusters.

set.seed(100)
diamonds_cluster <- kmeans(diamonds_subset_scaled, centers = 2)
diamonds_subset$cluster <- diamonds_cluster$cluster
fviz_cluster(diamonds_cluster, data = diamonds_subset_scaled)

As can be seen, the data has been grouped into two separate clusters. According to the Kmeans clustering result, there are 2 groups or types of diamonds in the diamonds_subset dataset. What are the characteristics of these clusters ?

diamonds_subset %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

Cluster 1 has a higher number of all columns except cut and clarity. The reverse happens for cluster 2. It seems we can label cluster 1 diamonds as big and expensive while cluster 2 diamonds can be labeled as smaller and less expensive.

6 Conclusion

We have used PCA to identify outliers in the subset of diamonds dataset and from the visualization as well as the values, decided to keep the outliers. We then used KMeans clustering to neatly group the diamonds into 2 clusters, where cluster 1 is better in all columns except cut and clarity compared to cluster 2. If we were to assign a label to cluster 1 and cluster 2 we can use the label more expensive or bigger for cluster 1 and less expensive or smaller for cluster 2.