##About Dataset

The Diamonds dataset used in this project contains detailed information about various diamonds, including their price, carat weight, cut quality, color, clarity, and physical dimensions. It is designed to explore how different characteristics affect the market price of diamonds. This dataset is ideal for performing data cleaning, descriptive analysis, visualization, and predictive modeling such as ANOVA, regression, clustering, and classification.

#Phase 1 – Dataset Understanding ##Load Dataset

df <- read.csv("D:\\diamonds.csv", stringsAsFactors = FALSE)
head(df)
##   X carat       cut color clarity depth table price    x    y    z
## 1 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(class)

Unique Records Check

df_unique <- df %>% distinct()
n_unique <- nrow(df_unique)
n_unique
## [1] 53940

Insight: After removing duplicates, the dataset contains r n_unique unique records.

Column Structure

str(df)
## 'data.frame':    53940 obs. of  11 variables:
##  $ X      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : chr  "Ideal" "Premium" "Good" "Premium" ...
##  $ color  : chr  "E" "E" "E" "I" ...
##  $ clarity: chr  "SI2" "SI1" "VS1" "VS2" ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Insight: This provides an overview of all variables and their data types — numeric, character, or factor.

Missing Value Analysis

missing_summary <- sapply(df, function(x) sum(is.na(x)))
missing_df <- data.frame(Column = names(missing_summary), Missing_Values = as.numeric(missing_summary))
missing_df
##     Column Missing_Values
## 1        X              0
## 2    carat              0
## 3      cut              0
## 4    color              0
## 5  clarity              0
## 6    depth              0
## 7    table              0
## 8    price              0
## 9        x              0
## 10       y              0
## 11       z              0

Insight: The table and chart show missing values per column.

Summary Statistics

numeric_cols <- df
summary(numeric_cols)
##        X             carat            cut               color          
##  Min.   :    1   Min.   :0.2000   Length:53940       Length:53940      
##  1st Qu.:13486   1st Qu.:0.4000   Class :character   Class :character  
##  Median :26971   Median :0.7000   Mode  :character   Mode  :character  
##  Mean   :26971   Mean   :0.7979                                        
##  3rd Qu.:40455   3rd Qu.:1.0400                                        
##  Max.   :53940   Max.   :5.0100                                        
##    clarity              depth           table           price      
##  Length:53940       Min.   :43.00   Min.   :43.00   Min.   :  326  
##  Class :character   1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950  
##  Mode  :character   Median :61.80   Median :57.00   Median : 2401  
##                     Mean   :61.75   Mean   :57.46   Mean   : 3933  
##                     3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324  
##                     Max.   :79.00   Max.   :95.00   Max.   :18823  
##        x                y                z         
##  Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.700   Median : 5.710   Median : 3.530  
##  Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :10.740   Max.   :58.900   Max.   :31.800

Insight: The summary displays minimum, maximum, mean, and quartiles for numeric columns.

Category Distribution

if ("cut" %in% names(df)) {
table(df$cut)
}
## 
##      Fair      Good     Ideal   Premium Very Good 
##      1610      4906     21551     13791     12082

Insight: This shows the distribution of diamond cut types such as “Ideal”, “Premium”, and “Good”.

#Phase 2 – Data Cleaning & Visualization ##Data Cleaning

df <- df[!duplicated(df), ]
df <- na.omit(df)

df$cut <- as.factor(df$cut)
df$color <- as.factor(df$color)
df$clarity <- as.factor(df$clarity)
str(df)
## 'data.frame':    53940 obs. of  11 variables:
##  $ X      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Factor w/ 5 levels "Fair","Good",..: 3 4 2 4 2 5 5 5 1 5 ...
##  $ color  : Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Factor w/ 8 levels "I1","IF","SI1",..: 4 3 5 6 4 8 7 3 6 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Insight: After cleaning, only unique and complete records remain. Categorical columns are converted to factors, preparing data for visualization and analysis.

Column Structure

str(df)
## 'data.frame':    53940 obs. of  11 variables:
##  $ X      : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Factor w/ 5 levels "Fair","Good",..: 3 4 2 4 2 5 5 5 1 5 ...
##  $ color  : Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Factor w/ 8 levels "I1","IF","SI1",..: 4 3 5 6 4 8 7 3 6 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Insight: This provides an overview of all variables and their data types — numeric, character, or factor.

Summary of Cleaned Numeric Columns

numeric_summary <- df %>%
summarise(across(where(is.numeric),
list(Min = min, Mean = mean, Median = median, Max = max)))
numeric_summary
##   X_Min  X_Mean X_Median X_Max carat_Min carat_Mean carat_Median carat_Max
## 1     1 26970.5  26970.5 53940       0.2  0.7979397          0.7      5.01
##   depth_Min depth_Mean depth_Median depth_Max table_Min table_Mean table_Median
## 1        43    61.7494         61.8        79        43   57.45718           57
##   table_Max price_Min price_Mean price_Median price_Max x_Min   x_Mean x_Median
## 1        95       326     3932.8         2401     18823     0 5.731157      5.7
##   x_Max y_Min   y_Mean y_Median y_Max z_Min   z_Mean z_Median z_Max
## 1 10.74     0 5.734526     5.71  58.9     0 3.538734     3.53  31.8

Insight: Shows updated ranges for numeric columns after cleaning, confirming no invalid values remain.

Distribution of Diamond Prices

ggplot(df, aes(x = price)) +
geom_histogram(bins = 50, fill = "skyblue", color = "white") +
ggtitle("Distribution of Diamond Prices") +
theme_minimal()

Insight: The price distribution is right-skewed — most diamonds are lower-priced, with a few luxury outliers.

Price vs Carat

ggplot(df, aes(x = carat, y = price)) +
geom_point(alpha = 0.4, color = "orange") +
geom_smooth(method = "lm", color = "red", se = FALSE) +
ggtitle("Price vs Carat") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Insight: There’s a strong positive correlation between carat and price — larger diamonds are significantly more expensive.

Price by Cut Quality

df_mean_price <- df %>%
  group_by(cut) %>%
  summarise(Average_Price = mean(price))

ggplot(df_mean_price, aes(x = cut, y = Average_Price, fill = cut)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = round(Average_Price, 0)), vjust = -0.3, size = 3.5) +
  ggtitle("Average Diamond Price by Cut") +
  xlab("Cut Quality") + ylab("Average Price (USD)") +
  theme_minimal()

Insight: Diamonds with Premium and Fair cuts have higher median prices, showing cut quality affects price.

Price Density by Clarity

ggplot(df, aes(x = price, color = clarity)) +
geom_density() +
ggtitle("Price Density by Clarity") +
theme_minimal()

Insight: Different clarity grades overlap, but high-clarity diamonds (like IF, VVS1) cluster at higher price ranges.

#Phase 3 – Statistical Modeling (ANOVA & Regression) ## ANOVA Test (Price ~ Cut)

anova_model <- aov(price ~ cut, data = df)
summary(anova_model)
##                Df    Sum Sq   Mean Sq F value Pr(>F)    
## cut             4 1.104e+10 2.760e+09   175.7 <2e-16 ***
## Residuals   53935 8.474e+11 1.571e+07                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Insight: If p-value < 0.05, it indicates a significant difference in mean price among different cut types.

Visualize ANOVA Result

ggplot(df, aes(x = cut, y = price, fill = cut)) +
geom_boxplot() +
ggtitle("ANOVA Visualization: Price by Cut") +
theme_minimal()

Insight: This confirms visually that higher cuts (Premium, Ideal) correspond to higher average prices.

Linear Regression Model

lm_model <- lm(price ~ carat + depth + table + x + y + z, data = df)
summary(lm_model)
## 
## Call:
## lm(formula = price ~ carat + depth + table + x + y + z, data = df)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -23878.2   -615.0    -50.7    347.9  12759.2 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 20849.316    447.562  46.584  < 2e-16 ***
## carat       10686.309     63.201 169.085  < 2e-16 ***
## depth        -203.154      5.504 -36.910  < 2e-16 ***
## table        -102.446      3.084 -33.216  < 2e-16 ***
## x           -1315.668     43.070 -30.547  < 2e-16 ***
## y              66.322     25.523   2.599  0.00937 ** 
## z              41.628     44.305   0.940  0.34744    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1497 on 53933 degrees of freedom
## Multiple R-squared:  0.8592, Adjusted R-squared:  0.8592 
## F-statistic: 5.486e+04 on 6 and 53933 DF,  p-value: < 2.2e-16

Insight: Regression identifies carat as the most influential predictor of price. Depth and table have smaller impacts.

Visualize Regression Fit

ggplot(df, aes(x = carat, y = price)) +
geom_point(alpha = 0.5, color = "skyblue") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
ggtitle("Regression Fit: Price vs Carat") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'

Insight: The regression line (red) shows a clear positive linear trend between carat and price.

#Phase 4 – Clustering & Classification (K-Means & KNN) ## K-Means Clustering

num_data <- df %>% select_if(is.numeric)
scaled_data <- scale(num_data)
set.seed(123)
kmeans_model <- kmeans(scaled_data, centers = 3, nstart = 25)
df$Cluster <- as.factor(kmeans_model$cluster)
table(df$Cluster)
## 
##     1     2     3 
## 18068  6946 28926

Insight: K-Means divides diamonds into 3 clusters (typically low-, medium-, and high-priced groups).

Visualize Clusters

ggplot(df, aes(x = carat, y = price, color = Cluster)) +
geom_point(alpha = 0.6) +
ggtitle("K-Means Clustering (k=3): Carat vs Price") +
theme_minimal()

## KNN Classification

df$price_bin <- cut(df$price,
breaks = quantile(df$price, probs = seq(0, 1, by = 1/3)),
include.lowest = TRUE,
labels = c("Low", "Medium", "High"))
df$price_bin <- droplevels(df$price_bin)
knn_data <- df %>% select(carat, depth, table, x, y, z, price_bin)

Insight: A categorical target variable price_bin is created to classify diamonds as Low, Medium, or High priced.

Data Split & Normalization

set.seed(123)
n <- nrow(knn_data)
train_index <- sample(1:n, size = 0.75 * n)
train_data <- knn_data[train_index, ]
test_data  <- knn_data[-train_index, ]

normalize <- function(x) {(x - min(x)) / (max(x) - min(x))}
train_norm <- as.data.frame(lapply(train_data[, 1:6], normalize))
test_norm  <- as.data.frame(lapply(test_data[, 1:6], normalize))

Insight: Normalization scales numeric values between 0–1, ensuring fair distance measurement for KNN.

Apply KNN Model

set.seed(123)
train_labels <- train_data$price_bin
test_labels  <- test_data$price_bin
knn_pred <- knn(train = train_norm, test = test_norm, cl = train_labels, k = 5)

conf_mat <- table(Predicted = knn_pred, Actual = test_labels)
conf_mat
##          Actual
## Predicted  Low Medium High
##    Low    3046     27    1
##    Medium 1323   1890    8
##    High     86   2582 4522

Insight: The confusion matrix shows classification results — correctly and incorrectly predicted price categories.

Model Accuracy

accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
## [1] 0.7013719

Insight: The model achieved an accuracy rate of around 70–80%, showing strong predictive ability for basic classification tasks.