Overview

This project will examine the significance of the four Cs (cut, color, clarity, carat) and depth in determining diamond price.

Introduction

The data for this project come from the diamonds dataset that ships with the ggplot2 package in R. The exact original publication date of the data is unknown. The dataset contains the prices and other attributes of almost 54,000 diamonds.

Exploring the Data

#Load necessary libraries and store dataset into environment
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.3     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
diamonds <- diamonds

#Look at the structure of the dataset
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
##  $ carat  : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

This dataset contains 53,940 observations of diamonds with a total of 10 variables: carat, cut, color, clarity, depth (total depth percentage), table (width of the top of the diamond relative to widest point), price (in US dollars), x (length in mm), y (width in mm), and z (depth in mm).
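
According to the ggplot2 documentation for this dataset, total depth percentage is defined as z / mean(x, y), that is 2 * z / (x + y), expressed as a percentage. The sketch below recomputes it from the recorded dimensions as a rough sanity check. The names measured, depth_recomputed, and depth_gap are illustrative and not part of the original analysis, and rows with a zero dimension are dropped on the assumption that the ratio is not meaningful for them.

```r
library(ggplot2)  # provides the diamonds dataset

# Keep only rows whose recorded dimensions are all positive
measured <- subset(diamonds, x > 0 & y > 0 & z > 0)

# Total depth percentage: z divided by the mean of x and y, times 100
depth_recomputed <- 100 * 2 * measured$z / (measured$x + measured$y)

# Absolute gap between the recomputed and the recorded depth
depth_gap <- abs(depth_recomputed - measured$depth)
summary(depth_gap)
```

Small gaps would be expected from rounding alone, since x, y, and z are recorded to two decimals and depth to one.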

#Summary of data
summary(diamonds)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

Regarding the variables of interest in this project: carat ranges from 0.2 to 5.01, cut ranges from Fair (worst) to Ideal (best), color ranges from D (best) to J (worst), clarity ranges from I1 (worst) to IF (best), depth ranges from 43% to 79%, and price ranges from 326 USD to 18,823 USD.

#Determine NA values in the dataset
sum(is.na(diamonds))
## [1] 0

The dataset contains 0 NA values, so no observations are explicitly coded as missing.
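
A count of NA values does not catch measurements that were entered as 0. The summary() output above shows a minimum of 0 for x, y, and z, which is not a physically possible dimension. The following is a minimal sketch of how such rows could be counted; na_per_column and zero_dims are illustrative names, and treating a zero as a missing measurement is an assumption rather than something the dataset documents.

```r
library(ggplot2)  # provides the diamonds dataset

# Missing values per column (each should be 0, matching the total above)
na_per_column <- colSums(is.na(diamonds))
na_per_column

# Flag rows in which at least one recorded dimension is exactly 0
zero_dims <- diamonds$x == 0 | diamonds$y == 0 | diamonds$z == 0

# Number of flagged rows
sum(zero_dims)
```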

#Create plot(s) to begin visualizing the data
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point()

#Create scatterplot of carat and price, by depth
ggplot(diamonds, aes(x=carat, y=price, color=depth)) + geom_point()

The first plot suggests that price rises faster than linearly with carat, in a roughly exponential pattern. The rise appears steeper for diamonds of higher clarity. Similarly, the second scatterplot suggests that depth may also influence price. However, depth varies so little that further analysis of it would likely add little. Looking back at the summary() output, depth has a median of 61.8, a mean of 61.75, a first quartile of 61, and a third quartile of 62.5. The values of this variable are all relatively close together, which is why the color scale in the second scatterplot shows almost no variation.
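
Both observations can be checked numerically and visually. The sketch below is one possible way to do so, not part of the original analysis: it redraws the first plot with a log-scaled price axis (on which an exponential relationship would look roughly like a straight line), then summarizes the spread of depth and its linear correlation with price. The names p_log, depth_quartiles, and depth_price_cor are illustrative, and the alpha value is an arbitrary choice to reduce overplotting.

```r
library(ggplot2)  # provides the diamonds dataset and plotting functions

# Price against carat with a log-scaled price axis
p_log <- ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
  geom_point(alpha = 0.3) +
  scale_y_log10()
p_log

# Quartiles and interquartile range of depth
depth_quartiles <- quantile(diamonds$depth, probs = c(0.25, 0.5, 0.75))
depth_quartiles
IQR(diamonds$depth)

# Linear correlation between depth and price
depth_price_cor <- cor(diamonds$depth, diamonds$price)
depth_price_cor
```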

#Create remaining scatterplots to examine further correlation
ggplot(diamonds, aes(x=carat, y=price, color=color)) + geom_point()

ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point()

As with clarity, color appears to affect price: at a given carat weight, diamonds of better color tend to reach higher prices. As expected, cut appears to follow the same trend, and the scatterplot shows that the Ideal and Premium cuts can fetch higher prices.
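
Because carat dominates price, comparing color grades is easier within a narrow carat range. The following sketch compares median price by color for stones close to one carat. The band of 0.9 to 1.1 carats and the names near_one_carat and price_by_color are assumptions chosen for illustration; the same approach could be applied to cut or clarity.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Restrict to stones close to one carat (band limits are an arbitrary choice)
near_one_carat <- diamonds %>%
  filter(carat >= 0.9, carat <= 1.1)

# Number of stones and median price per color grade within that band
price_by_color <- near_one_carat %>%
  group_by(color) %>%
  summarise(n = n(), median_price = median(price))
price_by_color
```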

Analysis

#To assess how strongly each variable relates to price, fit a separate linear model for each one and inspect the p-value from summary()

lm1c <- lm(price ~ carat, data = diamonds)
lm2c <- lm(price ~ cut, data = diamonds)
lm3c <- lm(price ~ color, data = diamonds)
lm4c <- lm(price ~ clarity, data = diamonds)

#Take a summary of each linear model to find the Multiple R-squared (proportion of price variation explained) and the p-value.
summary(lm1c)
## 
## Call:
## lm(formula = price ~ carat, data = diamonds)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18585.3   -804.8    -18.9    537.4  12731.7 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2256.36      13.06  -172.8   <2e-16 ***
## carat        7756.43      14.07   551.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared:  0.8493, Adjusted R-squared:  0.8493 
## F-statistic: 3.041e+05 on 1 and 53938 DF,  p-value: < 2.2e-16
summary(lm2c)
## 
## Call:
## lm(formula = price ~ cut, data = diamonds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4258  -2741  -1494   1360  15348 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4062.24      25.40 159.923  < 2e-16 ***
## cut.L        -362.73      68.04  -5.331  9.8e-08 ***
## cut.Q        -225.58      60.65  -3.719    2e-04 ***
## cut.C        -699.50      52.78 -13.253  < 2e-16 ***
## cut^4        -280.36      42.56  -6.588  4.5e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3964 on 53935 degrees of freedom
## Multiple R-squared:  0.01286,    Adjusted R-squared:  0.01279 
## F-statistic: 175.7 on 4 and 53935 DF,  p-value: < 2.2e-16
summary(lm3c)
## 
## Call:
## lm(formula = price ~ color, data = diamonds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4989  -2619  -1376   1374  15654 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  4124.73      18.64 221.294  < 2e-16 ***
## color.L      2126.73      57.02  37.295  < 2e-16 ***
## color.Q       200.50      54.26   3.695  0.00022 ***
## color.C      -254.36      51.08  -4.979 6.41e-07 ***
## color^4        40.88      46.92   0.871  0.38361    
## color^5      -228.88      44.36  -5.160 2.48e-07 ***
## color^6        87.92      40.22   2.186  0.02880 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3927 on 53933 degrees of freedom
## Multiple R-squared:  0.03128,    Adjusted R-squared:  0.03117 
## F-statistic: 290.2 on 6 and 53933 DF,  p-value: < 2.2e-16
summary(lm4c)
## 
## Call:
## lm(formula = price ~ clarity, data = diamonds)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -4737  -2727  -1429   1262  16254 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3677.42      25.88 142.086  < 2e-16 ***
## clarity.L   -1723.35      98.72 -17.457  < 2e-16 ***
## clarity.Q    -428.36      96.70  -4.430 9.45e-06 ***
## clarity.C     647.87      83.31   7.777 7.57e-15 ***
## clarity^4    -123.13      66.73  -1.845   0.0650 .  
## clarity^5     804.81      54.62  14.733  < 2e-16 ***
## clarity^6    -273.65      47.68  -5.739 9.55e-09 ***
## clarity^7      81.19      42.02   1.932   0.0533 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3935 on 53932 degrees of freedom
## Multiple R-squared:  0.02715,    Adjusted R-squared:  0.02702 
## F-statistic:   215 on 7 and 53932 DF,  p-value: < 2.2e-16

For each model, the overall F-test p-value is below 2.2e-16, far below the conventional 0.05 threshold, so we can confidently say that each relationship with price is statistically significant, even though a few individual contrast terms (for example color^4) are not. The Multiple R-squared values are as follows: carat 84.93%, cut 1.286%, color 3.128%, and clarity 2.715%. Taken individually, therefore, carat explains 84.93% of the variation in price, cut 1.286%, color 3.128%, and clarity 2.715%. It is important to note, however, that the low percentages for cut, color, and clarity may partly reflect their association with carat, which these single-predictor models do not account for.
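
R-squared values from separate models cannot simply be added when the predictors are related to one another. One way to measure combined explanatory power is to fit a single model containing all predictors. The sketch below is a possible extension rather than part of the original analysis: it fits a depth-only model in the same style as the four models above, then a joint model. The names lm_depth, lm_joint, and r2_joint are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset

# Single-predictor model for depth, in the same style as lm1c to lm4c
lm_depth <- lm(price ~ depth, data = diamonds)
summary(lm_depth)$r.squared

# One model containing all five predictors at once
lm_joint <- lm(price ~ carat + cut + color + clarity + depth, data = diamonds)

# Proportion of price variation explained by the joint model
r2_joint <- summary(lm_joint)$r.squared
r2_joint
```

Because the carat-only model is nested within the joint model, r2_joint cannot be lower than the carat-only value of 0.8493 reported above.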

Conclusions

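Based on the data used in this analysis, it appears that a diamond’s price depends on its carat, color, clarity, and cut. To be more specific, each variable has a statistically significant relationship with the price of a diamond. Summing the four individual R-squared values gives approximately 92%, but because the predictors are related to one another, this sum is only a rough indication of their combined explanatory power; a single model containing all four variables is needed to measure it properly.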

Limitations

One limitation of this study is its scope: the analysis is based on a single dataset and may fail to capture all the factors that influence diamond prices. It is also worth noting that variables such as the brand of the diamond, current market demand, and economic conditions could affect prices but are not included in this dataset. Because of these limitations, I believe that the statistical findings may be specific to this dataset and may not be generalizable to all diamond markets or populations.


This document was produced as a final project for MAT 143H - Introduction to Statistics (Honors) at North Shore Community College.
The course was led by Professor Billy Jackson.
Student Name: Michael V. Saraceni Semester: Spring 2024