DATA 606 Final Project: White Diamond Inventory Statistics

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.3     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
## 
## Attaching package: 'mice'
## 
## 
## The following object is masked from 'package:stats':
## 
##     filter
## 
## 
## The following objects are masked from 'package:base':
## 
##     cbind, rbind
## 
## 
## 
## Attaching package: 'scales'
## 
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## 
## The following object is masked from 'package:readr':
## 
##     col_factor
## 
## 
## corrplot 0.92 loaded
## 
## Rows: 2255 Columns: 20
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (7): Shape, Color, Clarity, Lab, Polish, Symmetry, FluorescenceIntensity
## dbl (13): No., Weight, PricePC, Total, List, ListPrice, CertNum, Depth%, Tab...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Part 1 - Introduction

Diamond forms under high temperature and pressure conditions that exist only about 100 miles beneath the earth’s surface. Diamond’s carbon atoms are bonded in essentially the same way in all directions.Another mineral, graphite, also contains only carbon, but its formation process and crystal structure are very different.Graphite is so soft that we can write with it using laser inscription, while diamond is so hard that we can only scratch it with another diamond. Diamond’s characteristic chemical composition and crystal structure make it a unique member of the mineral kingdom. Diamond is the only gem made of a single element: It is typically about 99.95 percent carbon. The other 0.05 percent can include one or more trace elements, which are atoms that aren’t part of the diamond’s essential chemistry. Some trace elements can influence its color or crystal shape.

Part 2 - Data

The white diamond inventory data set has 2,255 observations and 20 variables. In order to analyze the value of diamonds, firstly, data is cleaned in which missing value will be imputed with the mean, check for duplicates each observation is unique, consistency in Categorical Data and visualize with boxplot to display distribution, central tendency, and spread of a data set to identify outliers and irregularities in data.

# Display Structure of Data and its summary statistics
str(diamonds_df)

## 'data.frame':    2255 obs. of  20 variables:
##  $ No.                  : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Shape                : chr  "AC" "AC" "AC" "AC" ...
##  $ Weight               : num  0.51 0.6 0.7 0.71 0.9 0.9 0.92 1 1 1.01 ...
##  $ Color                : chr  "F" "K" "G" "F" ...
##  $ Clarity              : chr  "VVS1" "VS1" "VS1" "IF" ...
##  $ PricePC              : num  1856 1040 2381 2496 2500 ...
##  $ Total                : num  947 624 1667 1772 2250 ...
##  $ List                 : num  -42 -35 -41.9 -52 -50 ...
##  $ ListPrice            : num  3200 1600 4100 5200 5000 2700 6900 7500 7700 7000 ...
##  $ Lab                  : chr  "GIA" "GIA" "GIA" "GIA" ...
##  $ CertNum              : num  5.22e+09 5.22e+09 6.43e+09 6.40e+09 1.23e+09 ...
##  $ Depth%               : num  67.8 67.2 69.4 68 69.1 69.8 69.5 69.6 62.1 72.3 ...
##  $ Table%               : num  65 61 61 64 62 69 66 67 70 71 ...
##  $ Len                  : num  4.31 4.8 4.85 5 5.25 5.27 5.32 5.28 5.88 5.56 ...
##  $ Width                : num  4.29 4.71 4.74 4.88 5.16 5.13 5.08 5.25 5.64 5.36 ...
##  $ Depth                : num  2.91 3.16 3.29 3.32 3.57 3.58 3.53 3.65 3.5 3.88 ...
##  $ Ratio                : num  1 1.02 1.02 1.02 1.02 1.03 1.05 1.01 1.04 1.04 ...
##  $ Polish               : chr  "VG" "VG" "EX" "EX" ...
##  $ Symmetry             : chr  "VG" "VG" "VG" "VG" ...
##  $ FluorescenceIntensity: chr  "M" "N" "N" "N" ...

summary(diamonds_df)

##       No.            Shape               Weight         Color          
##  Min.   :   1.0   Length:2255        Min.   :0.300   Length:2255       
##  1st Qu.: 564.5   Class :character   1st Qu.:0.710   Class :character  
##  Median :1128.0   Mode  :character   Median :1.010   Mode  :character  
##  Mean   :1128.0                      Mean   :1.228                     
##  3rd Qu.:1691.5                      3rd Qu.:1.510                     
##  Max.   :2255.0                      Max.   :7.020                     
##                                                                        
##    Clarity             PricePC          Total               List       
##  Length:2255        Min.   :  795   Min.   :   247.5   Min.   :-64.24  
##  Class :character   1st Qu.: 1885   1st Qu.:  1357.2   1st Qu.:-40.00  
##  Mode  :character   Median : 3099   Median :  3342.8   Median :-31.04  
##                     Mean   : 3934   Mean   :  6842.7   Mean   :-32.35  
##                     3rd Qu.: 5110   3rd Qu.:  7688.2   3rd Qu.:-25.00  
##                     Max.   :36075   Max.   :201888.0   Max.   :  5.00  
##                     NA's   :17      NA's   :17         NA's   :18      
##    ListPrice         Lab               CertNum              Depth%     
##  Min.   : 1000   Length:2255        Min.   :8.372e+06   Min.   :47.30  
##  1st Qu.: 2800   Class :character   1st Qu.:2.225e+09   1st Qu.:61.50  
##  Median : 4700   Mode  :character   Median :5.211e+09   Median :62.90  
##  Mean   : 5973                      Mean   :4.322e+09   Mean   :64.51  
##  3rd Qu.: 7500                      3rd Qu.:6.433e+09   3rd Qu.:67.90  
##  Max.   :59000                      Max.   :1.008e+11   Max.   :82.90  
##  NA's   :7                                              NA's   :1      
##      Table%           Len             Width            Depth      
##  Min.   :51.00   Min.   : 4.200   Min.   : 2.970   Min.   :1.150  
##  1st Qu.:57.00   1st Qu.: 5.670   1st Qu.: 4.970   1st Qu.:3.230  
##  Median :60.00   Median : 6.760   Median : 5.770   Median :3.740  
##  Mean   :61.11   Mean   : 7.007   Mean   : 5.918   Mean   :3.804  
##  3rd Qu.:64.00   3rd Qu.: 8.018   3rd Qu.: 6.590   3rd Qu.:4.300  
##  Max.   :82.00   Max.   :17.780   Max.   :11.680   Max.   :9.900  
##  NA's   :1       NA's   :1        NA's   :1        NA's   :1      
##      Ratio          Polish            Symmetry         FluorescenceIntensity
##  Min.   :0.820   Length:2255        Length:2255        Length:2255          
##  1st Qu.:0.990   Class :character   Class :character   Class :character     
##  Median :1.020   Mode  :character   Mode  :character   Mode  :character     
##  Mean   :1.199                                                              
##  3rd Qu.:1.410                                                              
##  Max.   :2.650                                                              
##  NA's   :1

set.seed(1234)
# Check for missing values in PricePC and Total columns
sum(is.na(diamonds_df$PricePC))

## [1] 17

sum(is.na(diamonds_df$Total))

## [1] 17

# Impute missing values with the mean
cleaned_diamonds_df <- diamonds_df %>%
  mutate(PricePC = ifelse(is.na(PricePC), mean(PricePC, na.rm = TRUE), PricePC),
         Total = ifelse(is.na(Total), mean(Total, na.rm = TRUE), Total))

# Check for missing values in the imputed dataset
sum(is.na(cleaned_diamonds_df$PricePC))

## [1] 0

sum(is.na(cleaned_diamonds_df$Total))

## [1] 0

# Check for duplicates to identify and remove duplicated rows, ensuring each observation is unique.
duplicated_rows <- cleaned_diamonds_df[duplicated(cleaned_diamonds_df), ]

# Verify if categories in categorical columns make sense and cover all expected possibilities
unique(cleaned_diamonds_df$Shape)

##  [1] "AC" "BR" "CU" "EC" "HS" "MQ" "OV" "PR" "PS" "RA"

unique(cleaned_diamonds_df$Color)

##  [1] "F"   "K"   "G"   "J"   "D"   "H"   "E"   "I"   "L"   "M"   "N"   "Y-Z"
## [13] "U-V"

unique(cleaned_diamonds_df$Clarity)

##  [1] "VVS1" "VS1"  "IF"   "VS2"  "SI1"  "VVS2" "SI2"  "I1"   "I2"   "FL"

# Create a boxplot using ggplot
ggplot(data = cleaned_diamonds_df, aes(y = Total)) +
  geom_boxplot() +
  labs(title = "Boxplot of Value of Diamonds", y = "Total Value")

####*** Research question

How to manage diamonds inventory statistics and find the relationship between the physical dimensions (length, width, depth) of a diamond and its price or quality (cut, carat, color, clarity)?

Part 3 - Exploratory data analysis

# Create aggregated data frame that counts the number of diamonds in each subgroup.
diamonds_inventory <- cleaned_diamonds_df %>% 
  group_by(Shape, Color, Clarity) %>% 
  summarise(diamond_count = n(), .groups = "drop")
diamonds_inventory

## # A tibble: 413 × 4
##    Shape Color Clarity diamond_count
##    <chr> <chr> <chr>           <int>
##  1 AC    D     VS2                 1
##  2 AC    D     VVS2                1
##  3 AC    E     VS1                 1
##  4 AC    F     IF                  1
##  5 AC    F     VS1                 1
##  6 AC    F     VS2                 2
##  7 AC    F     VVS1                1
##  8 AC    G     VS1                 2
##  9 AC    G     VS2                 1
## 10 AC    G     VVS1                1
## # ℹ 403 more rows

# Visualize the total number of diamonds per subgroup in the inventory
ggplot(diamonds_inventory, aes(x = Shape, y = diamond_count, fill = Color)) +
  geom_bar(stat = "identity", position = "dodge") +
  facet_wrap(~ Clarity) +
  labs(title = "Number of Diamonds by Shape, Color, and Clarity", x = "Shape", y = "Diamond Count")

# Creating a bar chart of average values based on Clarity, Color, and Shape
ggplot(cleaned_diamonds_df, aes(x = Clarity, y = Total, fill = Color)) +
  geom_bar(stat = "summary", fun = "mean", position = "dodge") +
  facet_wrap(~ Shape) +
  labs(title = "Average Diamond Value by Clarity, Color, and Shape",
       x = "Clarity", y = "Average Value")

# Analyzing the value of diamonds
qplot(data = cleaned_diamonds_df, x = Total) # generate quick plot of of the 'Total' column

## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(cleaned_diamonds_df$Total) # descriptive statistics for numerical data

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##    247.5   1365.0   3376.8   6842.7   7661.8 201888.0

# Create scatterplot of Weight and Price with grouping by Clarity
cleaned_diamonds_df %>% 
  ggplot(aes(x = Weight, y = Total, color = Clarity)) +
  geom_point() +
  theme_minimal() +
    labs(title = "Scatter plot of Weight and Value of Diamonds",
       subtitle = "White Diamonds Inventory Summary Data ",
       caption = "Data Source: ggplot2 package",
       x = "Wieght and Shape",
       y = "Value of Diamonds")

# Subset data into variables of cut, color, clarity, carat, and price
corr_diamonds <- cleaned_diamonds_df[c(2:5,7)]

# Columns 'Cut', 'Color', 'Clarity', and 'Carat' are factors or characters
# Convert these columns to numeric or categorical codes if appropriate
corr_diamonds$Shape <- as.numeric(as.factor(corr_diamonds$Shape))
corr_diamonds$Color <- as.numeric(as.factor(corr_diamonds$Color))
corr_diamonds$Clarity <- as.numeric(as.factor(corr_diamonds$Clarity))
corr_diamonds$Weight <- as.numeric(corr_diamonds$Weight)  # Assuming 'Carat' is numeric
corr_diamonds$Total <- as.numeric(corr_diamonds$Total)

# Check if conversion is successful and then create the correlation matrix
correlation_matrix <- cor(corr_diamonds)
print(correlation_matrix)

##               Shape      Weight       Color     Clarity       Total
## Shape    1.00000000  0.13799367 -0.13444829  0.05797035  0.08645212
## Weight   0.13799367  1.00000000  0.12408800 -0.03038282  0.84369933
## Color   -0.13444829  0.12408800  1.00000000 -0.01110109 -0.04065172
## Clarity  0.05797035 -0.03038282 -0.01110109  1.00000000  0.03744927
## Total    0.08645212  0.84369933 -0.04065172  0.03744927  1.00000000

# Compute correlation matrix
correlation_matrix <- cor(corr_diamonds)

# Visualize correlation matrix using corrplot
corrplot(correlation_matrix, method = "circle")

EDA Results

By looking at the number to diamonds in the inventory in each subgroup based on ‘Shape’, ‘Color’, and ‘Clarity, we can make a decision of which type of diamonds need to refill to the inventory lists and how many diamonds need to buy from the suppliers. The histogram shows distribution of diamonds’ value is totally right-skewed. with most diamonds in the dataset value is less than $20K. It is understood that higher-value diamonds will generally be taller, wider, and deeper. Therefore, further exploration into the diamonds’ weight aspects is essential. When we find correlation between main characteristics and value of diamonds, the correlation matrix shows that weight has a very strong positive relationship with value, meaning that as value increases, we can likely see an increase in carat weight. The other qualitative factors that describe a diamond’s quality have very weak relationships with price. Within this dataset, the rating most closely related to the value of the diamonds is carat weight.

Part 4 - Inferential Statistical Analysis: Prediction by Linear Regression

The chi-square test is used to find the relationship a diamonds cut and clarity may have on its color. The linear model takes the following form, where the $\mathrm{x}$ values represent the predictors, while the beta values represent weights. \[ y=\beta_0+\beta_1 x_1+\beta_2 x_2+\cdots \beta_n x_n \]

For example, we could use a regression model to understand how the value of a diamond relates to three independent variables: its weight, length, width, and depth. In the model, we could define the value of a diamond as $\beta_{\text {weight }} \times$ weight $+\beta_{\text {length }} \times$ length $+\beta_{\text {width }} \times$ width $+\beta_{\text {depth }} \times$ depth. Where $\beta_{\text {weight }}$ indicates how much a diamond’s value changes as a function of its weight, $\beta_{\text {length }}, \beta_{\text {width }}$ and $\beta_{\text {depth }}$.

Once a regression object is created with lm(), we can summarize the results in an ANOVA table with aov():

# 'Clarity' and 'Color' are categorical variables in diamonds dataset
# Performing the chi-square test on the two categorical variables
chi_sq_test <- chisq.test(cleaned_diamonds_df$Clarity, cleaned_diamonds_df$Color)

## Warning in chisq.test(cleaned_diamonds_df$Clarity, cleaned_diamonds_df$Color):
## Chi-squared approximation may be incorrect

print(chi_sq_test)

## 
##  Pearson's Chi-squared test
## 
## data:  cleaned_diamonds_df$Clarity and cleaned_diamonds_df$Color
## X-squared = 206.33, df = 108, p-value = 3.864e-08

# Assuming 'Price', 'Carat', 'Length', 'Width', 'Depth' are columns in the dataset
# Creating a linear regression model
lm_model <- lm(Total ~ Weight + Len + Width + Depth, data = cleaned_diamonds_df)

# Summarizing the linear regression model
summary(lm_model)

## 
## Call:
## lm(formula = Total ~ Weight + Len + Width + Depth, data = cleaned_diamonds_df)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -29170  -1983    271   2004 115849 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  17247.8     1232.2   14.00  < 2e-16 ***
## Weight       22440.3      424.2   52.90  < 2e-16 ***
## Len          -1500.0      114.9  -13.05  < 2e-16 ***
## Width          742.4      267.0    2.78  0.00548 ** 
## Depth        -8371.4      481.3  -17.39  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 6054 on 2249 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.7686, Adjusted R-squared:  0.7682 
## F-statistic:  1867 on 4 and 2249 DF,  p-value: < 2.2e-16

# The coefficients in the diamond model
lm_model$coefficients

## (Intercept)      Weight         Len       Width       Depth 
##  17247.8286  22440.3442  -1499.9651    742.4117  -8371.4350

# draw a line using the beta parameters just calculated 
ggplot(cleaned_diamonds_df, aes(x = Weight, y = Total)) +
  geom_point() +  # Scatterplot of data points
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE, color = "blue") +
  labs(title = "Scatterplot with Fitted Regression Line", x = "Carat", y = "Total")

# Obtain the residuals from the simple linear regression model
residuals <- residuals(lm_model)

# Create a data frame combining residuals and fitted values
residuals_df <- data.frame(
  Fitted_Values = fitted(lm_model),
  Residuals = residuals
)

# Plot residuals versus fitted values
ggplot(residuals_df, aes(x = Fitted_Values, y = Residuals)) +
  geom_point(shape = 21, fill = "blue", size = 3) +  # Plot as round marks
  geom_abline(slope = 0, intercept = 0, color = "red") +  # Add a line at y = 0
  labs(title = "Residuals versus Fitted Prices", x = "Value of Diamonds [USD]", y = "Residuals [USD]") +
  theme_minimal()

# Create ANOVA object from regression
anova_results <- aov(lm_model)

# Print summary results
summary(anova_results)

##               Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Weight         1 2.536e+11 2.536e+11 6919.92  < 2e-16 ***
## Len            1 5.996e+08 5.996e+08   16.36 5.41e-05 ***
## Width          1 8.419e+09 8.419e+09  229.71  < 2e-16 ***
## Depth          1 1.109e+10 1.109e+10  302.56  < 2e-16 ***
## Residuals   2249 8.243e+10 3.665e+07                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 1 observation deleted due to missingness

# Creating two different linear regression models
lm_model_1 <- lm(Total ~ Weight + Len + Width, data = cleaned_diamonds_df)
lm_model_2 <- lm(Total ~ Weight + Len + Width + Depth, data = cleaned_diamonds_df)

# Comparing the models using ANOVA
anova_results <- anova(lm_model_1, lm_model_2)
print(anova_results)

## Analysis of Variance Table
## 
## Model 1: Total ~ Weight + Len + Width
## Model 2: Total ~ Weight + Len + Width + Depth
##   Res.Df        RSS Df  Sum of Sq      F    Pr(>F)    
## 1   2250 9.3515e+10                                   
## 2   2249 8.2426e+10  1 1.1089e+10 302.56 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Inference Result

The Chi-square test determines there is no association between a diamonds Clarity and Color.Therefore, we reject H0, null hypothesis since the p-value is significant at p < 0.05.

The regression statistics indicate that the relationship/correlation between Value and Weight is significant at p < 0.05, with a p-value of 0.000. When the Carat Weight is established, the equation of linear model is Value = 17247.8286 + 22440.3442 * Weight - 1499.9651 * Length + 742.4117 * Width -8371.4350 * Depth that defines a regression for predicting the diamonds value (in USD).

The ANOVA results strongly suggest that adding the ‘Depth’ variable to the model significantly improves the model’s ability to explain the variability in the ‘Total’ compared to the model without ‘Depth’. The low p-value (< 0.001) indicates that ‘Depth’ is a significant predictor in explaining the ‘Total Value of Diamond’.

Part 5 - Conclusion

Generally, diamonds with better colour are higher price, However, the most expensive diamonds are “Internally Flawless” from the D grade (the best colour) but D color is less expensive than E and F colors when carat is in between 1.5 and 2.0.

The weight of a diamond is the most important indicator of its price. A diamond is profitable to sell if the combination of size, cut, clarity, and color is above a certain threshold. The smallest diamonds are only profitable to sell if they are exceptional in all the other factors (cut, clarity, and color), so the small diamonds sold have similar characteristics. However, larger diamonds may be profitable regardless of the values of the other factors. Thus we will observe large diamonds with a wider variety of cut, clarity, and color and thus a large amount of variation in prices.

The data set has been analyzed to find any relationships between value of diamonds and the measurement (length x width x depth) that can help us improve business decisions and personal preferences when working with diamonds.

References

Hadley Wickham, Garrett Grolemund (2016). “R for Data Science”, Published by O’Reilly Media, Inc.
https://bookdown.org/ndphillips/YaRrr/anova.html
https://www.nature.com/articles/s41598-023-44326-w