Assignment1

Author

XiaoFei Mei

Published

February 1, 2026

Introduction

This is an analysis of a dataset of diamond records from Kaggle: https://www.kaggle.com/datasets/shivam2503/diamonds?select=diamonds.csv.

The file has 11 columns: an index column plus 10 characteristics of each diamond, such as carat weight, cut grade, color, clarity, and price. My goal is to produce a clear, data-driven report that shows which characteristic has the strongest positive relationship with diamond price.

My plan is to first perform exploratory data analysis to understand the distributions and data types of the 11 columns, and to check each column for missing data and outliers. I will then build a model that can identify the most influential factors, and finally use visualization tools to present those factors.

I anticipate two challenges. First, diamond prices can be extreme at both ends of the range, and such outliers would make it difficult to find the common factors that affect price. Second, there may be no single factor with a dominant relationship to price, but rather a combination of factors that are nearly equally important. In that case it would be difficult to make a strong case for which factors affect price the most.

Data Exploration

#| echo: false
#| warning: false
#| message: false
# Load libraries (chunk options above must come first in the chunk to take effect)

suppressPackageStartupMessages({
  library(tidyverse)
  library(ggplot2)
  library(skimr)
  library(DataExplorer)
  library(corrplot)
  library(GGally)
  library(psych)
  library(moments)
})

First I load the data and run an initial check on the columns, looking for missing values and confirming the data type of each column.

The results below show that no column has missing values. The file contains the diamonds dataset plus an auto-generated index column that starts at 1.

As for data types, the measurements are numeric and the categorical variables (cut, color, and clarity) are stored as character.

file_url <- "https://raw.githubusercontent.com/xiaofeimei1/Data607-A1/refs/heads/main/diamonds.csv"
data <- readr::read_csv(file_url, show_col_types = FALSE)
New names:
• `` -> `...1`
dim(data)
[1] 53940    11
names(data)
 [1] "...1"    "carat"   "cut"     "color"   "clarity" "depth"   "table"  
 [8] "price"   "x"       "y"       "z"      
head(data)
# A tibble: 6 × 11
   ...1 carat cut       color clarity depth table price     x     y     z
  <dbl> <dbl> <chr>     <chr> <chr>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1     1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
2     2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
3     3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
4     4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
5     5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
6     6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
sapply(data, class)
       ...1       carat         cut       color     clarity       depth 
  "numeric"   "numeric" "character" "character" "character"   "numeric" 
      table       price           x           y           z 
  "numeric"   "numeric"   "numeric"   "numeric"   "numeric" 
colSums(is.na(data))
   ...1   carat     cut   color clarity   depth   table   price       x       y 
      0       0       0       0       0       0       0       0       0       0 
      z 
      0 
str(data)
spc_tbl_ [53,940 × 11] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ ...1   : num [1:53940] 1 2 3 4 5 6 7 8 9 10 ...
 $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
 $ cut    : chr [1:53940] "Ideal" "Premium" "Good" "Premium" ...
 $ color  : chr [1:53940] "E" "E" "E" "I" ...
 $ clarity: chr [1:53940] "SI2" "SI1" "VS1" "VS2" ...
 $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
 $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
 $ price  : num [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
 $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
 $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
 $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
 - attr(*, "spec")=
  .. cols(
  ..   ...1 = col_double(),
  ..   carat = col_double(),
  ..   cut = col_character(),
  ..   color = col_character(),
  ..   clarity = col_character(),
  ..   depth = col_double(),
  ..   table = col_double(),
  ..   price = col_double(),
  ..   x = col_double(),
  ..   y = col_double(),
  ..   z = col_double()
  .. )
 - attr(*, "problems")=<externalptr> 
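The checks above show an unnamed index column (...1) and three grading columns stored as plain character. The following is a minimal sketch of one possible cleanup step, not part of the original analysis. It uses a small hand-typed copy of the six rows printed by head(data), with fewer columns, so that it runs on its own. The names sample_rows and clean_rows are hypothetical, and the level orderings are an assumption based on the conventional grading scales (cut from Fair to Ideal, color from J to D, clarity from I1 to IF).

```r
# Abbreviated, hand-typed copy of the six rows shown by head(data)
sample_rows <- data.frame(
  index   = 1:6,
  carat   = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24),
  cut     = c("Ideal", "Premium", "Good", "Premium", "Good", "Very Good"),
  color   = c("E", "E", "E", "I", "J", "J"),
  clarity = c("SI2", "SI1", "VS1", "VS2", "SI2", "VVS2"),
  price   = c(326, 326, 327, 334, 335, 336),
  stringsAsFactors = FALSE
)
# Mimic the name readr assigned to the unnamed first column
names(sample_rows)[1] <- "...1"

# Drop the auto-generated index column; it carries no information about the diamond
clean_rows <- sample_rows[, names(sample_rows) != "...1"]

# Assumed grade orderings, each listed from lowest to highest quality
cut_levels     <- c("Fair", "Good", "Very Good", "Premium", "Ideal")
color_levels   <- c("J", "I", "H", "G", "F", "E", "D")
clarity_levels <- c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF")

# Store the grades as ordered factors so plots and summaries respect the ranking
clean_rows$cut     <- factor(clean_rows$cut, levels = cut_levels, ordered = TRUE)
clean_rows$color   <- factor(clean_rows$color, levels = color_levels, ordered = TRUE)
clean_rows$clarity <- factor(clean_rows$clarity, levels = clarity_levels, ordered = TRUE)

str(clean_rows)
```

Applied to the full data frame, the same steps would make the boxplot axes and legends follow the grading order instead of alphabetical order.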

Data Analysis

For the numeric columns, I will use scatter plots with a fitted linear trend line to show the relationship between each characteristic and price.

From the charts below, we can observe a general trend: higher carat weight typically results in a higher price. Similar patterns are evident for other characteristics such as x, y, and z measurements, as well as table percentage.

This trend, however, does not apply to depth percentage. This is logical from an industry standpoint, as diamonds are primarily evaluated based on their face-up appearance. For instance:

  • If two diamonds share identical characteristics except for carat weight, the larger diamond will generally command a higher price.

  • If two diamonds have the same carat weight, but one has a larger table percentage and a smaller depth, it will appear larger when viewed from above. As a result, it is likely to be valued slightly higher.

#  carat vs price 
p1 <- ggplot(data, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1, size = 0.5) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Price vs Carat",
       x = "Carat",
       y = "Price") +
  theme_minimal()

# measurements vs price
p2 <- ggplot(data, aes(x = x, y = price)) +
  geom_point(alpha = 0.1, size = 0.5) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Price vs X (Length)",
       x = "X (mm)",
       y = "Price") +
  theme_minimal()

p3 <- ggplot(data, aes(x = y, y = price)) +
  geom_point(alpha = 0.1, size = 0.5) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Price vs Y (Width)",
       x = "Y (mm)",
       y = "Price") +
  theme_minimal()

p4 <- ggplot(data, aes(x = z, y = price)) +
  geom_point(alpha = 0.1, size = 0.5) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Price vs Z (Depth)",
       x = "Z (mm)",
       y = "Price") +
  theme_minimal()

# table and depth  vs price
p5 <- ggplot(data, aes(x = depth, y = price)) +
  geom_point(alpha = 0.1, size = 0.5) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Price vs Depth %",
       x = "Depth (%)",
       y = "Price") +
  theme_minimal()

p6 <- ggplot(data, aes(x = table, y = price)) +
  geom_point(alpha = 0.1, size = 0.5) +
  geom_smooth(method = "lm", color = "red", se = FALSE) +
  labs(title = "Price vs Table %",
       x = "Table (%)",
       y = "Price") +
  theme_minimal()

# print all plots
print(p1)

print(p2)

print(p3)

print(p4)

print(p5)

print(p6)
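Scatter plots show the direction of each trend, but a printed coefficient makes it easier to judge how strong each one is. The sketch below computes the Pearson correlation between price and each numeric column. It is an addition for illustration, and it assumes that the diamonds dataset bundled with ggplot2 contains the same records as the CSV used in this report, so that the example runs without downloading the file. The names numeric_cols and price_cor are hypothetical.

```r
# Assumption: ggplot2's bundled diamonds data matches the CSV (minus the index column)
diamonds_df <- as.data.frame(ggplot2::diamonds)

numeric_cols <- c("carat", "depth", "table", "x", "y", "z")

# Pearson correlation of each numeric characteristic with price
price_cor <- sapply(numeric_cols, function(col) {
  cor(diamonds_df[[col]], diamonds_df$price)
})

# Strongest positive relationship first
sort(round(price_cor, 3), decreasing = TRUE)
```

Pearson correlation only measures linear association, so a rank-based alternative such as cor(..., method = "spearman") may be worth comparing if the scatter plots look curved.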

Let's further examine how the categorical variables (cut, color, and clarity) influence diamond pricing.

As the plots below show, when cut and color are compared, Premium cut diamonds are generally associated with higher prices, particularly in the D, E, and F color range, where prices reach above the $10,000 mark.

When cut and clarity are combined, VVS2 clarity with Premium cut shows higher price points, but these samples are relatively rare because few diamonds fall into that category. Interestingly, some high-priced diamonds appear in the I1 clarity range even with Fair cuts, suggesting that other factors, such as the size and weight of the diamond, drive these outliers.

# Interaction between categorical variables and price

# Cut vs Color interaction
p_interaction1 <- ggplot(data, aes(x = cut, y = price, fill = color)) +
  geom_boxplot() +
  labs(title = "Price by Cut and Color",
       x = "Cut",
       y = "Price") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

# Cut vs Clarity interaction  
p_interaction2 <- ggplot(data, aes(x = cut, y = price, fill = clarity)) +
  geom_boxplot() +
  labs(title = "Price by Cut and Clarity",
       x = "Cut",
       y = "Price") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

print(p_interaction1)

print(p_interaction2)
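One way to check the suggestion that size explains the expensive I1 diamonds is to compare the typical carat weight in each clarity grade. The sketch below is an illustrative addition. It again assumes that ggplot2's bundled diamonds data matches the CSV, and by_clarity is a hypothetical name.

```r
# Assumption: ggplot2's bundled diamonds data matches the CSV used in this report
diamonds_df <- as.data.frame(ggplot2::diamonds)

# Median carat weight and median price within each clarity grade
by_clarity <- aggregate(cbind(carat, price) ~ clarity,
                        data = diamonds_df,
                        FUN = median)

by_clarity
```

If the lower clarity grades turn out to have larger median carat weights, that would support reading their high prices as a size effect rather than a clarity effect.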

Conclusion

This analysis suggests that carat weight and the physical measurements show the clearest relationship with diamond price. They appear to play a more substantial role than the categorical variables cut, color, and clarity. While premium combinations of these categorical features can elevate prices, their impact appears secondary to the fundamental dimensions of the diamond itself.

To draw a clear conclusion on how cut, color, and clarity affect price, further analysis would need to isolate their effects from the effect of size. That requires a sample large enough to represent every combination of the categorical variables, and a comparison of diamonds that have the same carat weight and measurements and differ only in cut, color, and clarity. This approach would give clearer insight into how the three grades influence price when diamonds are otherwise comparable in size and dimensions.

One further thought about this study: diamond prices fluctuate with the market, which is influenced by external factors such as seasonal demand, geopolitical and trade policies, and global economic conditions. To strengthen future analysis, an ideal dataset would include time-adjusted prices, or price records with dates that are as close together as possible. Such enhancements would help separate the intrinsic value factors (carat, cut, color, clarity, etc.) from external market influences, providing more robust insight into what determines diamond prices.
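A regression that includes carat alongside the three grades is one common way to approximate this "otherwise comparable" comparison without finding exactly matched diamonds. The sketch below is an illustrative addition, not a result from this report. It assumes ggplot2's bundled diamonds data matches the CSV, and it models log(price) against log(carat) as a modelling choice made here for the example. The names size_only and full_model are hypothetical.

```r
# Assumption: ggplot2's bundled diamonds data matches the CSV used in this report
diamonds_df <- as.data.frame(ggplot2::diamonds)

# Use unordered factors so each coefficient is a plain difference from the baseline grade
for (col in c("cut", "color", "clarity")) {
  diamonds_df[[col]] <- factor(diamonds_df[[col]], ordered = FALSE)
}

# Model 1: size only
size_only <- lm(log(price) ~ log(carat), data = diamonds_df)

# Model 2: size plus the three grades, so each grade effect is estimated at a fixed carat
full_model <- lm(log(price) ~ log(carat) + cut + color + clarity, data = diamonds_df)

# Share of variance in log(price) explained by each model
c(size_only = summary(size_only)$r.squared,
  full      = summary(full_model)$r.squared)

# F-test for whether adding the grades improves on the size-only model
anova(size_only, full_model)
```

The gap between the two R-squared values gives a rough sense of how much the grades add once size is accounted for, and the coefficients of full_model estimate the price difference between grades at a fixed carat weight.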