DATA 622-Final Project: White Diamonds Inventory and Market Demand Analysis

Introduction

The diamond industry is one of the most dynamic, competitive, and lucrative sectors in the global luxury market, and inventory management plays an essential role in meeting customer demand while maintaining operational efficiency. This final project, “White Diamonds Inventory and Market Demand Analysis,” focuses on analyzing key inventory patterns and market dynamics of certified natural white diamonds.

I am an inventory assistant and sales representative at a wholesale company in New York that deals in GIA-certified natural white diamonds. I am actively involved in managing inventory and facilitating sales to retailers across the United States. Our company manufactures diamonds in Hong Kong and India, specializing in both loose diamonds and finished diamond jewelry such as bracelets, rings, necklaces, pendants, and earrings.

My daily responsibilities are: (1) Updating and managing newly manufactured loose diamonds on the Diamond Track Online System, which is connected to major diamond e-commerce platforms like RapNet, Blue Nile, and R2Net. These platforms serve as intermediaries where vendors and consignees can browse available diamonds and place orders. (2) Preparing customer memoranda and invoices, communicating directly with clients to confirm orders, and coordinating the shipment of merchandise to ensure timely and accurate delivery. (3) Facilitating smooth interactions with retail store representatives, addressing their needs, and ensuring high-quality service to maintain long-term business relationships.

The purpose of this project is to analyze our company’s updated inventory summary to identify patterns in inventory and market demand that can improve sales performance. Using descriptive statistics, exploratory data analysis, and clustering algorithms, the project looks for the key features of diamonds’ 4Cs (cut, color, clarity, and carat) that most influence inventory turnover and sales.

Loading Required Libraries

library(dplyr)     # For data manipulation

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(tidyr)     # For data cleaning and tidying
library(data.table)  # For fast data manipulation

## 
## Attaching package: 'data.table'

## The following objects are masked from 'package:dplyr':
## 
##     between, first, last

library(ggplot2)   # Equivalent to matplotlib and seaborn
library(cowplot)   # For combining plots
library(gridExtra) # For arranging multiple ggplots

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(caret)      # For train/test splits, cross-validation, and grid search

## Loading required package: lattice

library(modelr)    # Simplified model training utilities
library(MASS)         # For additional regression methods

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

library(class)        # For k-NN classification
library(mice)         # For advanced multiple imputation

## 
## Attaching package: 'mice'

## The following object is masked from 'package:stats':
## 
##     filter

## The following objects are masked from 'package:base':
## 
##     cbind, rbind

library(tidyr)        # For simple data cleaning with `replace_na()`
library(recipes)      # Flexible data preprocessing pipelines

## 
## Attaching package: 'recipes'

## The following object is masked from 'package:stats':
## 
##     step

library(scales)       # For scaling data
library(ModelMetrics)      # Additional model evaluation metrics

## 
## Attaching package: 'ModelMetrics'

## The following objects are masked from 'package:modelr':
## 
##     mae, mse, rmse

## The following objects are masked from 'package:caret':
## 
##     confusionMatrix, precision, recall, sensitivity, specificity

## The following object is masked from 'package:base':
## 
##     kappa

library(readr)

## 
## Attaching package: 'readr'

## The following object is masked from 'package:scales':
## 
##     col_factor

library(RColorBrewer)
library(jpeg)
library(FactoMineR)
# Absolute path to the image
img <- readJPEG("/Users/lwinnandarshwe/622FinalProject/Round.jpg")
# Display the image
grid::grid.raster(img)
# Add a caption 
grid::grid.text("BR 3.22CT F VS2 GIA #5506448239", 
                x = 0.5, y = 0.95,  
                gp = grid::gpar(fontsize = 12, col = "black"))

The GIA lab report for this diamond is available here: https://dtol-cert-images.s3.amazonaws.com/GIA_pdf/5506448239.pdf

Importing White Diamonds Dataset

The WhiteDiamondSummary_2024_12_11.csv file is an updated inventory list of natural white diamonds that our company is working on. The dataset has 1,957 rows and 21 columns that include both numerical and categorical features related to the characteristics, pricing, and physical dimensions of the diamonds.

WhiteDiamonds <- read_csv("~/622FInalProject/WhiteDiamondSummary_2024_12_11.csv")

## Rows: 1957 Columns: 21
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (8): Shape, Color, Clarity, Lab, Pol, Sym, Fluor, FluorescenceIntensity
## dbl (13): Stock#, Weight, PricePC, Total, List, ListPrice, CertNum, Depth%, ...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

# Add a new column "Size" by calculating the product of "Len", "Width", and "Depth"
WhiteDiamonds$Size <- WhiteDiamonds$Len * WhiteDiamonds$Width * WhiteDiamonds$Depth

glimpse(WhiteDiamonds)

## Rows: 1,957
## Columns: 22
## $ `Stock#`              <dbl> 46763, 48874, 48427, 51335, 28090, 48428, 46762,…
## $ Shape                 <chr> "AC", "AC", "AC", "AC", "AC", "AC", "AC", "AC", …
## $ Weight                <dbl> 0.51, 0.60, 0.70, 0.71, 0.90, 0.90, 0.92, 1.00, …
## $ Color                 <chr> "F", "K", "G", "F", "F", "J", "D", "F", "G", "G"…
## $ Clarity               <chr> "VVS1", "VS1", "VS1", "IF", "VS2", "SI1", "VVS2"…
## $ PricePC               <dbl> 1430.00, 840.00, 1856.00, 2091.00, 2050.00, 1500…
## $ Total                 <dbl> 729.30, 504.00, 1299.20, 1484.61, 1845.00, 1350.…
## $ List                  <dbl> -45.00, -40.00, -42.00, -49.00, -50.00, -40.00, …
## $ ListPrice             <dbl> 2600, 1400, 3200, 4100, 4100, 2500, 5800, 6100, …
## $ Lab                   <chr> "GIA", "GIA", "GIA", "GIA", "GIA", "GIA", "GIA",…
## $ CertNum               <dbl> 5222192008, 5222560068, 6432525015, 6401442738, …
## $ `Depth%`              <dbl> 67.8, 67.2, 69.4, 68.0, 69.1, 69.8, 69.5, 69.6, …
## $ `Table%`              <dbl> 65, 61, 61, 64, 62, 69, 66, 67, 70, 71, 67, 62, …
## $ Len                   <dbl> 4.31, 4.80, 4.85, 5.00, 5.25, 5.27, 5.32, 5.28, …
## $ Width                 <dbl> 4.29, 4.71, 4.74, 4.88, 5.16, 5.13, 5.08, 5.25, …
## $ Depth                 <dbl> 2.91, 3.16, 3.29, 3.32, 3.57, 3.58, 3.53, 3.65, …
## $ Ratio                 <dbl> 1.00, 1.02, 1.02, 1.02, 1.02, 1.03, 1.05, 1.01, …
## $ Pol                   <chr> "VG", "VG", "EX", "EX", "EX", "EX", "EX", "VG", …
## $ Sym                   <chr> "VG", "VG", "VG", "VG", "EX", "VG", "VG", "EX", …
## $ Fluor                 <chr> "M", "N", "N", "N", "N", "N", "M", "N", "F", "N"…
## $ FluorescenceIntensity <chr> "M", "N", "N", "N", "N", "N", "M", "N", "F", "N"…
## $ Size                  <dbl> 53.80561, 71.44128, 75.63381, 81.00800, 96.71130…

# Calculate the frequency of each unique value in the 'Shape' column
table(WhiteDiamonds$Shape)

## 
##  AC  BR  CU  EC  HS  MQ  OV  PR  PS  RA 
##  23 637 249 215  42  72 322 127  90 180

# Sort the frequency counts in descending order
sort(table(WhiteDiamonds$Shape), decreasing = TRUE)

## 
##  BR  OV  CU  EC  RA  PR  PS  MQ  HS  AC 
## 637 322 249 215 180 127  90  72  42  23

Data Description

The file contains 13 numerical variables with quantitative details such as Weight (Carat), Price per Carat, Total Price, Ratio, and Dimensions (Length, Width, Depth), plus the derived Size column added above. There are 8 categorical variables with qualitative attributes such as Shape, Color, Clarity, Polish (Pol), Symmetry (Sym), Lab, Fluorescence (Fluor), and Fluorescence Intensity.

The diamonds are categorized into 10 different shapes. Round Brilliant (BR) dominates the inventory with 637 stones, followed by Oval (OV) with 322, Cushion (CU) with 249, and other fancy shapes such as Emerald Cut (EC) with 215, Radiant (RA) with 180, Princess (PR) with 127, Pear Shape (PS) with 90, and Marquise (MQ) with 72. Heart Shape (HS) with 42 and Asscher Cut (AC) with 23 are the least frequent, suggesting niche demand.

Diamonds vary in color (e.g., D, F, K) and clarity (e.g., IF, VVS1, VS2). These attributes significantly influence the price and desirability of diamonds.

Diamond weight ranges from 0.51 carats to larger stones exceeding 5 carats. Most diamonds fall in the range of 0.50–1.50 carats, which suggests that mid-sized stones see the highest market demand.

Some premium-quality diamonds are priced significantly higher than the rest. The average price per carat provides insight into cost trends across size and the 4Cs.

Size is calculated from the dimensions (Length × Width × Depth) and approximates the overall physical volume of each diamond. It correlates positively with weight but may also highlight outliers.

Data Preparation

Data preparation is essential to ensure the dataset is clean, consistent, and suitable for analysis, particularly for machine learning and statistical modeling.

#Remove Rows with Missing Values
WhiteDiamonds <- na.omit(WhiteDiamonds)
#Check missing values
sum(is.na(WhiteDiamonds))

## [1] 0

# Encode the 'Shape' column as numeric codes (factor levels are assigned alphabetically)
WhiteDiamonds$Shape <- as.numeric(factor(WhiteDiamonds$Shape))
# Encode the 'Color' column
WhiteDiamonds$Color <- as.numeric(factor(WhiteDiamonds$Color))
# Encode the 'Clarity' column
WhiteDiamonds$Clarity <- as.numeric(factor(WhiteDiamonds$Clarity))
# View the first few rows of the dataset
head(WhiteDiamonds)

## # A tibble: 6 × 22
##   `Stock#` Shape Weight Color Clarity PricePC Total  List ListPrice Lab  
##      <dbl> <dbl>  <dbl> <dbl>   <dbl>   <dbl> <dbl> <dbl>     <dbl> <chr>
## 1    46763     1   0.51     3       9    1430  729.   -45      2600 GIA  
## 2    48874     1   0.6      8       7     840  504    -40      1400 GIA  
## 3    48427     1   0.7      4       7    1856 1299.   -42      3200 GIA  
## 4    51335     1   0.71     3       4    2091 1485.   -49      4100 GIA  
## 5    28090     1   0.9      3       8    2050 1845    -50      4100 GIA  
## 6    48428     1   0.9      7       5    1500 1350    -40      2500 GIA  
## # ℹ 12 more variables: CertNum <dbl>, `Depth%` <dbl>, `Table%` <dbl>,
## #   Len <dbl>, Width <dbl>, Depth <dbl>, Ratio <dbl>, Pol <chr>, Sym <chr>,
## #   Fluor <chr>, FluorescenceIntensity <chr>, Size <dbl>

I found 5 missing values across all observations and addressed them by removing the affected rows, which has minimal impact on a dataset of this size. The categorical features Shape, Color, and Clarity were converted to numeric codes using label encoding. The numerical features (Weight, Price, Size) have varying scales, which could bias distance- and variance-based methods (K-Means, K-Nearest Neighbors, PCA), so they are rescaled later in the analysis, using min-max scaling to the range [0, 1] and z-score standardization via scale(). The cleaned dataset is now ready for exploratory data analysis (EDA), correlation analysis, and the machine learning applications used to understand market demand and optimize inventory management.
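
One caveat on label encoding: `as.numeric(factor(x))` assigns codes in alphabetical order. For Color this happens to match the D-to-Z grading scale, but for Clarity it does not (alphabetically, IF comes before SI1, which comes before VVS1). The sketch below, which uses small hypothetical vectors rather than the inventory file and assumes the standard GIA scales, shows how explicit factor levels would keep the codes in grade order.

```r
# Hypothetical sample grades; not taken from the inventory file
clarity <- c("VVS1", "VS1", "IF", "VS2", "SI1", "VVS2")
color   <- c("F", "K", "G", "D", "J")

# GIA clarity scale, best to worst
clarity_levels <- c("FL", "IF", "VVS1", "VVS2", "VS1", "VS2",
                    "SI1", "SI2", "I1", "I2", "I3")
# GIA color scale, D (colorless) through Z
color_levels <- LETTERS[4:26]

# Alphabetical encoding: the code order ignores quality
alpha_codes <- as.integer(factor(clarity))

# Ordered encoding: 1 = best grade
clarity_codes <- as.integer(factor(clarity, levels = clarity_levels, ordered = TRUE))
color_codes   <- as.integer(factor(color, levels = color_levels, ordered = TRUE))

print(data.frame(clarity, alpha_codes, clarity_codes))
print(data.frame(color, color_codes))
```

With ordered codes, a larger number consistently means a lower grade, which makes correlations and distance calculations on these columns easier to interpret.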

Train/Test Split

# Define predictor variables 
X <- WhiteDiamonds[, !names(WhiteDiamonds) %in% "Shape"]  # Drop the 'Shape' column
y <- WhiteDiamonds$Shape  # Target variable

# Split data into training and testing sets
set.seed(42)
train_index <- createDataPartition(y, p = 0.8, list = FALSE)

X_train <- X[train_index, ]
X_test <- X[-train_index, ]
y_train <- y[train_index]
y_test <- y[-train_index]

# Print the dimensions of the training set
cat("X_train dimensions:", dim(X_train), "\n")

## X_train dimensions: 1565 21

cat("y_train length:", length(y_train), "\n")

## y_train length: 1565

The WhiteDiamonds dataset was split into training and testing subsets. The predictor set X_train contains 1,565 rows and 21 features, representing 80% of the cleaned dataset. The target vector y_train contains the 1,565 corresponding shape labels for the training data.
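
Because Shape is a class label with very unequal frequencies, a split that samples within each class keeps rare shapes represented in both subsets. Per caret's documentation, createDataPartition() stratifies by class when y is a factor, whereas a numeric y is grouped by percentiles, so passing as.factor(y) may be worth considering. The base-R sketch below illustrates the idea on hypothetical labels; the class counts are made up for illustration.

```r
set.seed(42)
# Hypothetical class labels with unequal frequencies
labels <- factor(rep(c("BR", "OV", "AC"), times = c(60, 30, 10)))

# Sample 80% of the row indices within each class
train_index <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = floor(0.8 * length(idx)))]
}))

train_labels <- labels[train_index]
test_labels  <- labels[-train_index]

# Class proportions should match in both subsets
print(prop.table(table(train_labels)))
print(prop.table(table(test_labels)))
```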

Exploratory Data Analysis

# Features to plot (Shape, Color, and Clarity are label-encoded at this point)
numeric_features <- c("Shape", "Color", "Clarity","Weight", "Total", "Size", "Depth%", "Ratio")
# Adjust margins and grid layout
par(mfrow = c(3, 3), mar = c(4, 4, 2, 1))  # 3x3 grid with smaller margins
# Plot each numeric column
for (col in numeric_features) {
  boxplot(
    WhiteDiamonds[[col]],
    main = paste(" ", col),
    col = "skyblue",
    horizontal = TRUE
  )
}

The encoded features Shape, Color, and Clarity show a relatively even spread without many extreme outliers, and their codes are mostly symmetric around the median. Because these are label-encoded categories, the boxplots summarize the codes rather than a measured quantity.

Weight is right-skewed, with data concentrated at the lower end and many outliers at the higher end, indicating rare, heavier diamonds that might influence pricing. Total and Size have extremely right-skewed distributions with numerous outliers that correspond to premium diamonds with unique features and significantly higher prices.

Depth% and Ratio are critical for assessing diamond proportions. Depth% shows slight right-skewness and Ratio is evenly distributed. Outliers in Depth% may indicate stones whose proportions affect their appearance and quality.
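
The skewness described above can also be quantified. The sketch below uses simulated log-normal values as a stand-in for Total (they are not inventory data) and a hand-written moment-based skewness() helper, which is an assumption of this example rather than a function used elsewhere in the report. It also shows that a log transform, one common option for right-skewed prices, reduces both the skewness and the number of boxplot outliers.

```r
set.seed(1)
# Simulated right-skewed prices (log-normal); hypothetical values, not inventory data
total <- rlnorm(500, meanlog = 8, sdlog = 1)

# Moment-based sample skewness: positive values indicate a longer right tail
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3

skew_raw <- skewness(total)
skew_log <- skewness(log(total))

# Boxplot rule: points beyond 1.5 * IQR from the quartiles are flagged as outliers
n_outliers_raw <- length(boxplot.stats(total)$out)
n_outliers_log <- length(boxplot.stats(log(total))$out)

cat("Skewness raw:", round(skew_raw, 2), " log:", round(skew_log, 2), "\n")
cat("Boxplot outliers raw:", n_outliers_raw, " log:", n_outliers_log, "\n")
```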

# Select numeric columns
WhiteDiamonds_numeric <- WhiteDiamonds[, sapply(WhiteDiamonds, is.numeric)]

# Compute the correlation matrix
df_corr <- cor(WhiteDiamonds_numeric)
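# Reshape the correlation matrix to long format (columns Var1, Var2, value)
# Base R is used here so the step does not depend on reshape2 being installed
df_corr_melt <- as.data.frame(as.table(df_corr), responseName = "value")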
# Create a diverging color palette
diverging_palette <- colorRampPalette(brewer.pal(11, "RdBu"))(100)

# Plot the heatmap
ggplot(data = df_corr_melt, aes(x = Var1, y = Var2, fill = value)) +
  geom_tile(color = "white") +  # Add white grid lines
  scale_fill_gradientn(colors = diverging_palette, limits = c(-1, 1), name = "Correlation") +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1),
    axis.title.x = element_blank(),
    axis.title.y = element_blank(),
    plot.title = element_text(size = 20, hjust = 0.5)
  ) +
  labs(title = "WhiteDiamonds Correlation Heatmap")

Strong positive correlations exist among Weight, Size, and price (Total and PricePC), reflecting that larger diamonds typically cost more. Depth% and Table% exhibit weaker correlations with price, suggesting only a limited linear association with pricing.
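
A heatmap is easier to read alongside a ranked list of the strongest pairs. The sketch below shows one way to build such a list in base R; it uses the built-in mtcars data as a stand-in, and the column choice is arbitrary.

```r
# mtcars columns stand in for the diamond features
corr <- cor(mtcars[, c("mpg", "wt", "hp", "disp")])

# Keep each pair once by blanking the upper triangle and the diagonal
corr[upper.tri(corr, diag = TRUE)] <- NA

# Long format, then sort by absolute correlation (strongest first)
pairs <- as.data.frame(as.table(corr), responseName = "r")
pairs <- pairs[!is.na(pairs$r), ]
pairs <- pairs[order(-abs(pairs$r)), ]

print(pairs, row.names = FALSE)
```

Applied to the correlation matrix computed above, the same steps would list pairs such as Weight and Size near the top.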

# List of features to plot
features <- c("Shape", "Color", "Clarity","Weight", "Total", "Size", "Ratio")

# Create individual ggplots for each feature
# (.data[[feature]] replaces the deprecated aes_string())
plots <- lapply(features, function(feature) {
  ggplot(WhiteDiamonds, aes(x = .data[[feature]])) +
    geom_histogram(bins = 20, fill = "skyblue", color = "black", alpha = 0.7) +
    labs(title = feature, x = feature, y = "Frequency") +
    theme_minimal()
})

# Arrange plots in a three-column grid
grid.arrange(grobs = plots, ncol = 3)

The histograms for the numerical features Weight, Total, and Size show right-skewed distributions, meaning that smaller, lower-priced diamonds are more frequent.

Weight is highly positively correlated with PricePC (price per carat) and Total (total price). As the weight of a diamond increases, its total price and price per carat also increase significantly. Size shows a strong positive correlation with the physical attributes Len, Width, and Depth: larger diamonds naturally have greater physical dimensions, which in turn influence pricing. Variables such as Color, Clarity, and Shape show weaker correlations with numerical features like PricePC or Size.

# List of features to plot
features <- c("Weight", "Total", "Size")
# Initialize an empty list to store plots
plots <- list()

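# Loop through all pairs of features
# (loop variables are named x_var / y_var so they do not overwrite the target vector y)
plot_id <- 1
for (x_var in features) {
  for (y_var in features) {
    # !! unquotes the current names so each plot keeps its own pair of columns
    plots[[plot_id]] <- ggplot(WhiteDiamonds, aes(x = !!rlang::sym(x_var), y = !!rlang::sym(y_var))) +
      geom_point(alpha = 0.6) +
      geom_smooth(method = "lm", se = FALSE, color = "blue") +
      labs(x = x_var, y = y_var) +
      theme_minimal()
    plot_id <- plot_id + 1
  }
}

# Arrange the plots in a 3x3 grid
do.call(grid.arrange, c(plots, ncol = 3))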

## `geom_smooth()` using formula = 'y ~ x'

This scatter plot matrix demonstrates that Weight and Size are highly correlated with each other and strongly influence the Total price of diamonds. The variables Weight and Size have the strongest positive relationship, evident from their tight clustering around the regression line. Both are strong predictors of the total price. The distributions highlight that smaller diamonds are more common, but price varies widely even for similar weights or sizes.

# Normalize the dataset to a [0, 1] range
normalize <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Apply normalization to all numeric columns in WhiteDiamonds
WhiteDiamonds_normalized <- as.data.frame(lapply(WhiteDiamonds, function(col) {
  if (is.numeric(col)) {
    normalize(col)
  } else {
    col  # Keep non-numeric columns unchanged
  }
}))
# Build a data frame
WhiteDiamonds_normalized <- as.data.frame(WhiteDiamonds_normalized)

The numeric features in WhiteDiamonds, such as Weight, Total, Size, Depth% and Ratio, along with the label-encoded Shape, Color, and Clarity, vary widely in range. Normalization puts all numeric features on a consistent scale, preventing variables with larger magnitudes from dominating the analysis. This matters for PCA (Principal Component Analysis), where unscaled high-variance variables would dominate the principal components, and for K-Means clustering and K-Nearest Neighbors (KNN), which use Euclidean distance and are therefore sensitive to the scale of features. In this report, the min-max normalized data feed the KNN model, while PCA and K-Means use z-score standardization via scale().
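
The normalization above is computed on the full dataset. When the data are later split for a supervised model, a common alternative is to learn the minimum and maximum from the training rows only and reuse them for the test rows. The sketch below illustrates this with two hypothetical helper functions, minmax_fit() and minmax_apply(), and made-up values; neither helper is part of the analysis above.

```r
# Learn per-column min and max from the training rows only
minmax_fit <- function(train) {
  list(min = sapply(train, min), max = sapply(train, max))
}

# Apply previously learned parameters to any data frame with the same columns
minmax_apply <- function(df, fit) {
  as.data.frame(Map(function(col, lo, hi) (col - lo) / (hi - lo),
                    df, fit$min, fit$max))
}

# Hypothetical values for illustration
train <- data.frame(Weight = c(0.5, 1.0, 2.0), Total = c(700, 5000, 20000))
test  <- data.frame(Weight = c(1.5, 3.0),      Total = c(9000, 40000))

fit <- minmax_fit(train)
train_scaled <- minmax_apply(train, fit)
test_scaled  <- minmax_apply(test, fit)

print(train_scaled)
print(test_scaled)  # values above 1 are possible for stones larger than any in training
```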

# List of features to plot
features <- c("Shape", "Weight", "Color", "Clarity","Total", "Size")

# Create individual ggplots for each feature
# (.data[[feature]] replaces the deprecated aes_string())
plots <- lapply(features, function(feature) {
  ggplot(WhiteDiamonds_normalized, aes(x = .data[[feature]])) +
    geom_histogram(bins = 20, fill = "skyblue", color = "black", alpha = 0.7) +
    labs(title = feature, x = feature, y = "Frequency") +
    theme_minimal()
})

# Arrange plots in a three-column grid
grid.arrange(grobs = plots, ncol = 3)

The histograms illustrate the distribution of six key features from the WhiteDiamonds dataset after normalization to a [0, 1] range. Weight, Total, and Size remain heavily right-skewed, indicating that smaller and less expensive diamonds dominate the inventory. Min-max scaling changes the range of each feature but not the shape of its distribution.

Principal Component Analysis (PCA)

Principal Component Analysis (PCA) is a statistical method widely used for dimensionality reduction in machine learning and data analysis. PCA transforms high-dimensional data into a lower-dimensional space while retaining as much variance as possible.
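
To make "retaining as much variance as possible" concrete, the sketch below computes the share of variance explained by each component. It uses base R's prcomp() on the built-in iris measurements as a stand-in, not the FactoMineR PCA() call or the diamond data used in this report.

```r
# Built-in iris measurements stand in for the diamond features
X <- iris[, 1:4]

# Center and scale so every variable contributes on the same footing
pca <- prcomp(X, center = TRUE, scale. = TRUE)

# Variance explained by each principal component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

print(round(var_explained, 3))          # share per component
print(round(cumsum(var_explained), 3))  # cumulative share
print(round(pca$rotation[, 1:2], 3))    # loadings on the first two components
```

If the first two components account for most of the variance, a two-dimensional plot is a reasonable summary of the data.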

RNGkind(sample.kind = "Rounding")
set.seed(100)
intrain <- sample(nrow(WhiteDiamonds), nrow(WhiteDiamonds)*0.1)
diamonds_subset <- WhiteDiamonds[intrain,]
dim(diamonds_subset)

## [1] 195  22

# Select specific columns using their names
selected_data <- diamonds_subset[, c("Shape", "Weight", "Color", "Clarity", "Total", "Size")]
# Standardize each column to zero mean and unit variance
selected_data_scaled <- scale(selected_data)
# The data are already standardized, so PCA() is told not to rescale them
pca_diamonds <- PCA(selected_data_scaled, scale.unit = FALSE)

PCA() draws two graphs. In the individuals plot, each point represents an observation in the dataset, projected onto the new two-dimensional PCA space (Dim 1 and Dim 2). Points close to each other indicate observations with similar characteristics in the original dataset.

In the variables plot, the circle represents the correlation scale: variables close to the circumference are well represented by the two principal components, while variables closer to the origin contribute less. Weight, Total, and Size, whose arrows point in the same direction, are positively correlated. Variables with arrows pointing in opposite directions are negatively correlated; for example, Color shows a negative correlation with Weight and Total. Clarity and Total, whose arrows are at roughly right angles (90°) to each other, are approximately uncorrelated.

K-Means Clustering

K-Means Clustering is an unsupervised machine learning algorithm that partitions data into distinct groups (clusters) based on their similarities. It minimizes intra-cluster variance and discovers patterns and structure in data without predefined labels. Here, K-Means groups diamonds based on features like Shape, Weight, Size, Clarity, Color, and Total Price. This kind of segmentation can support inventory management, demand forecasting, and market segmentation.

library(factoextra)

## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa

fviz_nbclust(selected_data_scaled, kmeans, method = "wss", k.max = 10) +
  labs(
    title = "Elbow Method for Optimal Clusters",
    x = "Number of Clusters (k)",
    y = "Total Within-Cluster Sum of Squares (WSS)"
  ) +
  theme_minimal()

set.seed(100)
diamonds_cluster <- kmeans(selected_data_scaled, centers = 3)
diamonds_subset$cluster <- diamonds_cluster$cluster
# Visualize clusters
fviz_cluster(diamonds_cluster, data = selected_data_scaled)

diamonds_subset %>% 
  group_by(cluster) %>% 
  summarise_all(mean)

## # A tibble: 3 × 23
##   cluster `Stock#` Shape Weight Color Clarity PricePC  Total  List ListPrice
##     <int>    <dbl> <dbl>  <dbl> <dbl>   <dbl>   <dbl>  <dbl> <dbl>     <dbl>
## 1       1   53620.  4.28  1.49   5.93    6.59   3738.  5902. -33.8     5759.
## 2       2   53561.  4.18  2.78   5.71    6.65   7768. 21848. -31.8    11924.
## 3       3   51982   5.42  0.711  3.69    6.91   2039.  1688. -33.3     3184.
## # ℹ 13 more variables: Lab <dbl>, CertNum <dbl>, `Depth%` <dbl>,
## #   `Table%` <dbl>, Len <dbl>, Width <dbl>, Depth <dbl>, Ratio <dbl>,
## #   Pol <dbl>, Sym <dbl>, Fluor <dbl>, FluorescenceIntensity <dbl>, Size <dbl>

# Total within-cluster sum of squares
wss <- sum(diamonds_cluster$withinss)
cat("Total WSS for the K-means model:", wss, "\n")

## Total WSS for the K-means model: 698.3866

# Compare cluster assignments with the encoded Shape labels.
# Note: cluster numbers are arbitrary and there are 3 clusters versus 10 shapes,
# so the diagonal share below is only a rough agreement figure, not a true accuracy.
actual_labels <- diamonds_subset$Shape
# Cross-tabulate clusters against shapes
confusion_matrix <- table(Predicted = diamonds_cluster$cluster, Actual = actual_labels)

# Share of counts on the diagonal of the table
accuracy <- sum(diag(confusion_matrix)) / sum(confusion_matrix)
cat("Accuracy of the K-means model:", accuracy * 100, "%")

## Accuracy of the K-means model: 11.28205 %

The elbow point (around k = 3) marks the number of clusters beyond which adding more clusters yields only small reductions in the total within-cluster sum of squares, so k = 3 balances fit and model simplicity.
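
The elbow can also be checked numerically rather than by eye. The sketch below computes the total within-cluster sum of squares for k = 1 to 10 and the percentage drop at each step; it uses the built-in iris data as a stand-in, and nstart = 25 is an assumed setting that reduces sensitivity to the random starting centers.

```r
set.seed(100)
# Standardized iris measurements stand in for the diamond features
X <- scale(iris[, 1:4])

# Total within-cluster sum of squares for k = 1..10
wss <- sapply(1:10, function(k) kmeans(X, centers = k, nstart = 25)$tot.withinss)

# Percentage reduction in WSS gained by each additional cluster
drop_pct <- -diff(wss) / wss[-length(wss)] * 100

print(round(wss, 1))
print(round(drop_pct, 1))
```

The elbow is roughly where the percentage drop falls off sharply and then stays small.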

K-means identifies three clusters in the 10% sample of 195 diamonds. Based on the cluster means above, Cluster 3 groups the smallest, lowest-priced diamonds (mean weight 0.71 ct, mean Total about 1,688), Cluster 1 groups mid-range stones (mean weight 1.49 ct, mean Total about 5,902), and Cluster 2 groups the largest, most expensive stones (mean weight 2.78 ct, mean Total about 21,848).

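The agreement between the K-means cluster numbers and the encoded shape labels is 11.28% at k = 3. This figure should be read with caution: cluster numbers are arbitrary, and three clusters cannot map one-to-one onto ten shape classes. A low value therefore mainly indicates that the clusters do not correspond to diamond shapes; it does not by itself measure how well the clustering captures the structure of the data.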
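
When true labels are available, one measure that does not depend on how clusters happen to be numbered is purity, which credits each cluster with its most common actual class. The sketch below compares purity with the diagonal-agreement figure on the built-in iris data, used here as a stand-in; it is an illustration of the measure, not a re-analysis of the diamond clusters.

```r
set.seed(100)
# Standardized iris measurements stand in for the diamond features
X <- scale(iris[, 1:4])
km <- kmeans(X, centers = 3, nstart = 25)

# Rows are clusters, columns are actual classes
tab <- table(Cluster = km$cluster, Actual = iris$Species)

# Diagonal agreement depends on the arbitrary cluster numbering
diag_agreement <- sum(diag(tab)) / sum(tab)

# Purity: each cluster is credited with its most common actual class
purity <- sum(apply(tab, 1, max)) / sum(tab)

print(tab)
cat("Diagonal agreement:", round(diag_agreement * 100, 2), "%\n")
cat("Purity:", round(purity * 100, 2), "%\n")
```

Purity is never lower than the diagonal agreement, because each row's maximum is at least as large as its diagonal entry.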

K-Nearest Neighbors (KNN) Classification

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm used for classification and regression tasks. It classifies new observations based on the similarity to previously labeled data points in the training dataset.

# Keep an unscaled copy as a data frame; its second column (Shape) supplies the labels
dia <- data.frame(WhiteDiamonds)
set.seed(123)  # Set seed for reproducibility
n <- nrow(WhiteDiamonds_normalized)
ran <- sample(1:n, size = 0.8 * n)  # Row indices for the 80% training set

# Class labels: encoded Shape values converted to factors
dia_target <- as.factor(dia[ran, 2])
test_target <- as.factor(dia[-ran, 2])

# Training and test predictors: numeric columns only
# Note: the normalized Shape column is numeric, so it remains among the predictors here
dia_train <- WhiteDiamonds_normalized[ran, sapply(WhiteDiamonds_normalized, is.numeric)]
dia_test <- WhiteDiamonds_normalized[-ran, sapply(WhiteDiamonds_normalized, is.numeric)]

# Fit k-NN with k = 10 and predict the test labels
pr <- knn(dia_train, dia_test, cl = dia_target, k = 10)

# Create the confusion matrix
tb <- table(pr, test_target)

# Accuracy: share of diagonal (correct) counts, in percent
accuracy <- function(x){sum(diag(x)/(sum(rowSums(x)))) * 100}
accuracy(tb)

## [1] 92.83887

# Print the result with a message
cat("The accuracy of the KNN model is", round(accuracy(tb), 2), "%\n")

## The accuracy of the KNN model is 92.84 %

# Add predicted shape classes to the testing dataset
test_results <- dia_test
test_results$Predicted_Cluster <- pr

# Select two features for scatter plot
feature_x <- "Weight"
feature_y <- "Size"

# Create scatter plot colored by predicted shape class
ggplot(test_results, aes(x = .data[[feature_x]], y = .data[[feature_y]], color = Predicted_Cluster)) +
  geom_point(size = 3, alpha = 0.7) +
  labs(title = "Scatterplot of kNN Predicted Shape Classes",
       x = feature_x,
       y = feature_y,
       color = "Predicted shape") +
  theme_minimal()

The scatterplot of KNN predictions shows the relationship between two numeric features of the white diamonds: Weight (x-axis) and Size (y-axis). The legend on the right lists the predicted shape codes, such as 2, 3, …, 10. Classes 2 and 3 are visibly larger and span a broader range of Weight and Size, while classes 6 and 7 are densely concentrated within a smaller range. There is a clear positive correlation between Weight and Size, as expected: larger weights correspond to larger sizes, with data points aligning diagonally. In the lower range of Weight and Size the classes overlap, because the data points there are more similar and harder to distinguish. Two caveats apply to the reported accuracy. First, the predictors passed to knn() include the normalized Shape column itself, so the 92.84% figure is likely optimistic. Second, the 11.28% figure for K-Means is not a comparable accuracy measure. With those caveats, KNN appears better suited than K-Means to classification tasks that support market segmentation, inventory management, and demand forecasting.
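
For reference, the sketch below shows a KNN workflow in which the label column is kept out of the predictors, so the reported accuracy reflects only the other features. It uses the built-in iris data as a stand-in and the same knn() function from the class package; as in the analysis above, min-max scaling is applied before the split, and k = 10 is carried over for illustration.

```r
library(class)  # provides knn()

set.seed(123)
# Predictors: the four measurements only, min-max scaled; the label is excluded
X <- as.data.frame(lapply(iris[, 1:4], function(v) (v - min(v)) / (max(v) - min(v))))
y <- iris$Species

# 80/20 train/test split on row indices
train_id <- sample(nrow(X), size = floor(0.8 * nrow(X)))

# Predict test labels from the 10 nearest training neighbors
pred <- knn(train = X[train_id, ], test = X[-train_id, ], cl = y[train_id], k = 10)

# Confusion matrix and accuracy in percent
tab <- table(Predicted = pred, Actual = y[-train_id])
accuracy_pct <- sum(diag(tab)) / sum(tab) * 100

print(tab)
cat("Accuracy:", round(accuracy_pct, 2), "%\n")
```

Applied to the diamond data, the equivalent step would be to drop the Shape column from dia_train and dia_test before calling knn().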

Key Findings

The Round Brilliant (BR) shape constitutes over 32% of the inventory, suggesting that market demand is highest for this classic shape. High-clarity diamonds (e.g., VVS1, IF) are less frequent but more expensive. Similarly, premium color grades (D, F) are more valuable than lower grades (J, K). A significant price escalation is observed as carat weight increases, particularly for stones above 1 carat. Some diamonds with unusual Size or Depth% values suggest anomalies, probably stones with shallow proportions or pricing inconsistencies.

Conclusions

Round Brilliant diamonds make up the largest share of the inventory, which suggests they dominate market demand. The findings can inform pricing for larger, premium-quality stones to maximize profitability, and they highlight demand patterns for specific characteristics such as VVS1 clarity and D-F color. PCA (Principal Component Analysis) reduces the dataset's dimensionality, enabling clear visualization in two dimensions. K-means clustering differentiates three segments of diamonds, which reflect inventory patterns and market demand. KNN classification identifies groups of diamonds with similar characteristics and features, which can support inventory management.

Essay

The project analyzed white diamond inventory levels and market demand using machine learning algorithms that I learned in the course: Principal Component Analysis (PCA), K-Means Clustering, and K-Nearest Neighbors (KNN) Classification. Because I work for a diamond wholesale company, I work closely with large datasets of natural diamond inventories and facilitate sales to retail stores across the United States.

The dataset used for this analysis, WhiteDiamondSummary_2024_12_11.csv, comprises 1,957 rows and 21 features, including numerical features such as Weight, Price, and Size and categorical features such as Shape, Color, and Clarity. Preprocessing included handling missing values by removal, applying label encoding to categorical features, and scaling numerical features to a range of [0, 1].

During exploratory data analysis (EDA) on the cleaned dataset, I found that Round Brilliant (BR) diamonds dominate the inventory, comprising over 32% of the total, followed by Oval (OV), Cushion (CU), and Emerald Cut (EC). Weight, Size, and Total Price were strongly positively correlated, and their distributions were markedly right-skewed, indicating the prevalence of smaller, more affordable diamonds in the inventory.

To reduce dimensionality and visualize high-dimensional data, Principal Component Analysis (PCA) was applied. The PCA graph revealed that features such as Weight, Size, and Total Price contributed most significantly to the first principal component, confirming their dominance in explaining variance. Meanwhile, features like Clarity and Color showed weaker contributions and even negative correlations, emphasizing their lesser impact relative to the physical dimensions and pricing attributes. PCA provided a simplified visualization of the inventory data, enabling the identification of relationships and patterns.

The K-Means Clustering algorithm was implemented to segment diamonds into groups based on feature similarity. Using the Elbow Method, the number of clusters was set to three. The cluster means indicate three segments: cluster three contains smaller diamonds with lower prices, cluster one contains mid-range diamonds, and cluster two captures premium diamonds with higher weights and prices. The agreement between cluster numbers and shape labels was low at 11.28%. Because cluster numbers are arbitrary and K-means is unsupervised, this figure mainly shows that the segments do not correspond to diamond shapes; it is not a direct measure of how well the clusters capture the structure of the data.

To complement K-means clustering with a supervised method, K-Nearest Neighbors (KNN) classification was applied to predict the Shape of diamonds from the normalized numeric features, such as Weight and Size. The dataset was split into training and testing subsets, with 80% used for training and 20% for testing. The KNN model achieved an accuracy of 92.84%. In the scatterplot of predictions, larger weights corresponded to larger sizes, and overlapping classes in the lower ranges of Weight and Size indicated similarities among smaller diamonds, which is consistent with the inventory. This accuracy should be read with caution: the predictors included the normalized Shape column itself, and the K-means agreement figure is not directly comparable. Even so, the result suggests that KNN is suitable for classifying diamond shapes and for tasks requiring predictive modeling.

In conclusion, this project provides a decision-support basis for inventory management, market segmentation, and demand forecasting in the competitive diamond industry.