##About Dataset
The Diamonds dataset used in this project contains detailed information about various diamonds, including their price, carat weight, cut quality, color, clarity, and physical dimensions. It is designed to explore how different characteristics affect the market price of diamonds. This dataset is ideal for performing data cleaning, descriptive analysis, visualization, and predictive modeling such as ANOVA, regression, clustering, and classification.
#Phase 1 – Dataset Understanding ##Load Dataset
df <- read.csv("D:\\diamonds.csv", stringsAsFactors = FALSE)
head(df)
## X carat cut color clarity depth table price x y z
## 1 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
library(class)
df_unique <- df %>% distinct()
n_unique <- nrow(df_unique)
n_unique
## [1] 53940
Insight: After removing duplicates, the dataset contains r n_unique unique records.
str(df)
## 'data.frame': 53940 obs. of 11 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : chr "Ideal" "Premium" "Good" "Premium" ...
## $ color : chr "E" "E" "E" "I" ...
## $ clarity: chr "SI2" "SI1" "VS1" "VS2" ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Insight: This provides an overview of all variables and their data types — numeric, character, or factor.
missing_summary <- sapply(df, function(x) sum(is.na(x)))
missing_df <- data.frame(Column = names(missing_summary), Missing_Values = as.numeric(missing_summary))
missing_df
## Column Missing_Values
## 1 X 0
## 2 carat 0
## 3 cut 0
## 4 color 0
## 5 clarity 0
## 6 depth 0
## 7 table 0
## 8 price 0
## 9 x 0
## 10 y 0
## 11 z 0
Insight: The table and chart show missing values per column.
numeric_cols <- df
summary(numeric_cols)
## X carat cut color
## Min. : 1 Min. :0.2000 Length:53940 Length:53940
## 1st Qu.:13486 1st Qu.:0.4000 Class :character Class :character
## Median :26971 Median :0.7000 Mode :character Mode :character
## Mean :26971 Mean :0.7979
## 3rd Qu.:40455 3rd Qu.:1.0400
## Max. :53940 Max. :5.0100
## clarity depth table price
## Length:53940 Min. :43.00 Min. :43.00 Min. : 326
## Class :character 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950
## Mode :character Median :61.80 Median :57.00 Median : 2401
## Mean :61.75 Mean :57.46 Mean : 3933
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324
## Max. :79.00 Max. :95.00 Max. :18823
## x y z
## Min. : 0.000 Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.710 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.700 Median : 5.710 Median : 3.530
## Mean : 5.731 Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :10.740 Max. :58.900 Max. :31.800
Insight: The summary displays minimum, maximum, mean, and quartiles for numeric columns.
if ("cut" %in% names(df)) {
table(df$cut)
}
##
## Fair Good Ideal Premium Very Good
## 1610 4906 21551 13791 12082
Insight: This shows the distribution of diamond cut types such as “Ideal”, “Premium”, and “Good”.
#Phase 2 – Data Cleaning & Visualization ##Data Cleaning
df <- df[!duplicated(df), ]
df <- na.omit(df)
df$cut <- as.factor(df$cut)
df$color <- as.factor(df$color)
df$clarity <- as.factor(df$clarity)
str(df)
## 'data.frame': 53940 obs. of 11 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Factor w/ 5 levels "Fair","Good",..: 3 4 2 4 2 5 5 5 1 5 ...
## $ color : Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Factor w/ 8 levels "I1","IF","SI1",..: 4 3 5 6 4 8 7 3 6 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Insight: After cleaning, only unique and complete records remain. Categorical columns are converted to factors, preparing data for visualization and analysis.
str(df)
## 'data.frame': 53940 obs. of 11 variables:
## $ X : int 1 2 3 4 5 6 7 8 9 10 ...
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Factor w/ 5 levels "Fair","Good",..: 3 4 2 4 2 5 5 5 1 5 ...
## $ color : Factor w/ 7 levels "D","E","F","G",..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Factor w/ 8 levels "I1","IF","SI1",..: 4 3 5 6 4 8 7 3 6 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
Insight: This provides an overview of all variables and their data types — numeric, character, or factor.
numeric_summary <- df %>%
summarise(across(where(is.numeric),
list(Min = min, Mean = mean, Median = median, Max = max)))
numeric_summary
## X_Min X_Mean X_Median X_Max carat_Min carat_Mean carat_Median carat_Max
## 1 1 26970.5 26970.5 53940 0.2 0.7979397 0.7 5.01
## depth_Min depth_Mean depth_Median depth_Max table_Min table_Mean table_Median
## 1 43 61.7494 61.8 79 43 57.45718 57
## table_Max price_Min price_Mean price_Median price_Max x_Min x_Mean x_Median
## 1 95 326 3932.8 2401 18823 0 5.731157 5.7
## x_Max y_Min y_Mean y_Median y_Max z_Min z_Mean z_Median z_Max
## 1 10.74 0 5.734526 5.71 58.9 0 3.538734 3.53 31.8
Insight: Shows updated ranges for numeric columns after cleaning, confirming no invalid values remain.
ggplot(df, aes(x = price)) +
geom_histogram(bins = 50, fill = "skyblue", color = "white") +
ggtitle("Distribution of Diamond Prices") +
theme_minimal()
Insight: The price distribution is right-skewed — most diamonds are lower-priced, with a few luxury outliers.
ggplot(df, aes(x = carat, y = price)) +
geom_point(alpha = 0.4, color = "orange") +
geom_smooth(method = "lm", color = "red", se = FALSE) +
ggtitle("Price vs Carat") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Insight: There’s a strong positive correlation between carat and price — larger diamonds are significantly more expensive.
df_mean_price <- df %>%
group_by(cut) %>%
summarise(Average_Price = mean(price))
ggplot(df_mean_price, aes(x = cut, y = Average_Price, fill = cut)) +
geom_bar(stat = "identity") +
geom_text(aes(label = round(Average_Price, 0)), vjust = -0.3, size = 3.5) +
ggtitle("Average Diamond Price by Cut") +
xlab("Cut Quality") + ylab("Average Price (USD)") +
theme_minimal()
Insight: Diamonds with Premium and Fair cuts have higher median prices, showing cut quality affects price.
ggplot(df, aes(x = price, color = clarity)) +
geom_density() +
ggtitle("Price Density by Clarity") +
theme_minimal()
Insight: Different clarity grades overlap, but high-clarity diamonds (like IF, VVS1) cluster at higher price ranges.
#Phase 3 – Statistical Modeling (ANOVA & Regression) ## ANOVA Test (Price ~ Cut)
anova_model <- aov(price ~ cut, data = df)
summary(anova_model)
## Df Sum Sq Mean Sq F value Pr(>F)
## cut 4 1.104e+10 2.760e+09 175.7 <2e-16 ***
## Residuals 53935 8.474e+11 1.571e+07
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Insight: If p-value < 0.05, it indicates a significant difference in mean price among different cut types.
ggplot(df, aes(x = cut, y = price, fill = cut)) +
geom_boxplot() +
ggtitle("ANOVA Visualization: Price by Cut") +
theme_minimal()
Insight: This confirms visually that higher cuts (Premium, Ideal) correspond to higher average prices.
lm_model <- lm(price ~ carat + depth + table + x + y + z, data = df)
summary(lm_model)
##
## Call:
## lm(formula = price ~ carat + depth + table + x + y + z, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -23878.2 -615.0 -50.7 347.9 12759.2
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 20849.316 447.562 46.584 < 2e-16 ***
## carat 10686.309 63.201 169.085 < 2e-16 ***
## depth -203.154 5.504 -36.910 < 2e-16 ***
## table -102.446 3.084 -33.216 < 2e-16 ***
## x -1315.668 43.070 -30.547 < 2e-16 ***
## y 66.322 25.523 2.599 0.00937 **
## z 41.628 44.305 0.940 0.34744
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1497 on 53933 degrees of freedom
## Multiple R-squared: 0.8592, Adjusted R-squared: 0.8592
## F-statistic: 5.486e+04 on 6 and 53933 DF, p-value: < 2.2e-16
Insight: Regression identifies carat as the most influential predictor of price. Depth and table have smaller impacts.
ggplot(df, aes(x = carat, y = price)) +
geom_point(alpha = 0.5, color = "skyblue") +
geom_smooth(method = "lm", se = FALSE, color = "red") +
ggtitle("Regression Fit: Price vs Carat") +
theme_minimal()
## `geom_smooth()` using formula = 'y ~ x'
Insight: The regression line (red) shows a clear positive linear trend between carat and price.
#Phase 4 – Clustering & Classification (K-Means & KNN) ## K-Means Clustering
num_data <- df %>% select_if(is.numeric)
scaled_data <- scale(num_data)
set.seed(123)
kmeans_model <- kmeans(scaled_data, centers = 3, nstart = 25)
df$Cluster <- as.factor(kmeans_model$cluster)
table(df$Cluster)
##
## 1 2 3
## 18068 6946 28926
Insight: K-Means divides diamonds into 3 clusters (typically low-, medium-, and high-priced groups).
ggplot(df, aes(x = carat, y = price, color = Cluster)) +
geom_point(alpha = 0.6) +
ggtitle("K-Means Clustering (k=3): Carat vs Price") +
theme_minimal()
## KNN Classification
df$price_bin <- cut(df$price,
breaks = quantile(df$price, probs = seq(0, 1, by = 1/3)),
include.lowest = TRUE,
labels = c("Low", "Medium", "High"))
df$price_bin <- droplevels(df$price_bin)
knn_data <- df %>% select(carat, depth, table, x, y, z, price_bin)
Insight: A categorical target variable price_bin is created to classify diamonds as Low, Medium, or High priced.
set.seed(123)
n <- nrow(knn_data)
train_index <- sample(1:n, size = 0.75 * n)
train_data <- knn_data[train_index, ]
test_data <- knn_data[-train_index, ]
normalize <- function(x) {(x - min(x)) / (max(x) - min(x))}
train_norm <- as.data.frame(lapply(train_data[, 1:6], normalize))
test_norm <- as.data.frame(lapply(test_data[, 1:6], normalize))
Insight: Normalization scales numeric values between 0–1, ensuring fair distance measurement for KNN.
set.seed(123)
train_labels <- train_data$price_bin
test_labels <- test_data$price_bin
knn_pred <- knn(train = train_norm, test = test_norm, cl = train_labels, k = 5)
conf_mat <- table(Predicted = knn_pred, Actual = test_labels)
conf_mat
## Actual
## Predicted Low Medium High
## Low 3046 27 1
## Medium 1323 1890 8
## High 86 2582 4522
Insight: The confusion matrix shows classification results — correctly and incorrectly predicted price categories.
accuracy <- sum(diag(conf_mat)) / sum(conf_mat)
accuracy
## [1] 0.7013719
Insight: The model achieved an accuracy rate of around 70–80%, showing strong predictive ability for basic classification tasks.