This project will examine the significance of the four Cs (cut, color, clarity, carat) and depth in determining diamond price.
The data for this project were obtained from the ggplot2 package in R. The dataset is distributed with ggplot2, and its exact publication date is unknown. It contains the prices and other attributes of almost 54,000 diamonds.
#Load necessary libraries and store dataset into environment
library(openintro)
## Loading required package: airports
## Loading required package: cherryblossom
## Loading required package: usdata
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ lubridate 1.9.3 ✔ tibble 3.2.1
## ✔ purrr 1.0.2 ✔ tidyr 1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
diamonds <- diamonds
#Look at the structure of the dataset
str(diamonds)
## tibble [53,940 × 10] (S3: tbl_df/tbl/data.frame)
## $ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num [1:53940] 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num [1:53940] 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int [1:53940] 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num [1:53940] 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num [1:53940] 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num [1:53940] 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
This dataset contains 53,940 observations of diamonds with a total of 10 variables: carat, cut, color, clarity, depth (total depth percentage), table (width of the top of the diamond relative to widest point), price (in US dollars), x (length in mm), y (width in mm), and z (depth in mm).
#Summary of data
summary(diamonds)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
With regard to the variables of interest in this project: carat ranges from 0.2 to 5.01, cut ranges from Fair (worst) to Ideal (best), color ranges from D (best) to J (worst), clarity ranges from I1 (worst) to IF (best), depth ranges from 43% to 79%, and price ranges from 326 USD to 18,823 USD.
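The ranges quoted above can be read directly from the data rather than from the printed summary. The following is a minimal sketch, assuming only that ggplot2 is installed; the object names `num_ranges` and `factor_levels` are illustrative and not part of the original analysis.

```r
library(ggplot2)  # provides the diamonds dataset

# Minimum (row 1) and maximum (row 2) of each numeric variable of interest
num_ranges <- sapply(diamonds[c("carat", "depth", "price")], range)
num_ranges

# Ordered levels of each categorical variable, in the order R stores them
factor_levels <- lapply(diamonds[c("cut", "color", "clarity")], levels)
factor_levels
```

Note that the stored order runs from worst to best for cut and clarity, but from best (D) to worst (J) for color, which matters when reading the signs of the model coefficients later on.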
#Determine NA values in the dataset
sum(is.na(diamonds))
## [1] 0
The dataset contains 0 NA values, so we can conclude that there is no missing data.
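A count of zero NA values does not rule out implausible entries. The summary above shows a minimum of 0 for x, y, and z, which cannot be a real measurement in millimetres. As a hedged sketch (the names `na_per_column` and `zero_dims` are assumptions made for illustration), the check could be extended like this:

```r
library(ggplot2)  # provides the diamonds dataset

# Missing values counted column by column (all expected to be zero here)
na_per_column <- colSums(is.na(diamonds))
na_per_column

# Rows where at least one physical dimension is recorded as 0 mm
zero_dims <- subset(diamonds, x == 0 | y == 0 | z == 0)
nrow(zero_dims)
```

Because x, y, and z are not used in the models in this project, these rows do not affect the results below, but they would need attention in any analysis of the dimensions.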
#Create plot(s) to begin visualizing the data
ggplot(diamonds, aes(x=carat, y=price, color=clarity)) + geom_point()
#Create scatterplot of carat and price, by depth
ggplot(diamonds, aes(x=carat, y=price, color=depth)) + geom_point()
The first plot suggests that price rises faster than linearly as carat increases; the relationship looks roughly exponential. The curve also appears to steepen as clarity improves. The second scatterplot was intended to show whether depth influences price in a similar way. However, depth varies so little across the dataset that further analysis of it may add little. Looking back at the summary() output, depth has a median of 61.8, a mean of 61.75, a Q1 of 61, and a Q3 of 62.5. The values are all relatively close, which shows up in the second scatterplot as a lack of variation in color.
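The narrow spread of depth can be quantified instead of judged from the color scale. This is a small sketch under the assumption that quartiles and a simple linear correlation are adequate summaries here; `depth_quartiles` and `depth_price_cor` are illustrative names.

```r
library(ggplot2)  # provides the diamonds dataset

# Quartiles of depth: the middle half of the stones spans a narrow band
depth_quartiles <- quantile(diamonds$depth, probs = c(0.25, 0.5, 0.75))
depth_quartiles

# Width of that band (Q3 minus Q1)
IQR(diamonds$depth)

# Linear (Pearson) correlation between depth and price
depth_price_cor <- cor(diamonds$depth, diamonds$price)
depth_price_cor
```

A correlation close to zero would support dropping depth from further modelling, though it only measures linear association.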
#Create remaining scatterplots to examine further correlation
ggplot(diamonds, aes(x=carat, y=price, color=color)) + geom_point()
ggplot(diamonds, aes(x=carat, y=price, color=cut)) + geom_point()
As with clarity, color seems to affect price directly: as the color grade improves, so does the potential for a higher price. As expected, cut appears to follow the same trend, and the scatterplot shows that the Ideal and Premium cuts can fetch higher prices.
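With almost 54,000 overlapping points, trends by grade can be hard to read from a scatterplot alone. One possible complement, sketched here with the illustrative names `by_color` and `by_cut`, is a table of the median price and median carat within each grade:

```r
library(ggplot2)  # provides the diamonds dataset

# Median price and median carat within each color grade (D is best, J is worst)
by_color <- aggregate(cbind(price, carat) ~ color, data = diamonds, FUN = median)
by_color

# The same summary for each cut grade (Fair is worst, Ideal is best)
by_cut <- aggregate(cbind(price, carat) ~ cut, data = diamonds, FUN = median)
by_cut
```

Reporting carat next to price makes it possible to see whether a grade's typical price is driven by the grade itself or by the typical size of stones in that grade.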
#To assess how strongly each variable relates to price, fit a separate linear model for each and inspect its summary() for the p-value
lm1c <- lm(price ~ carat, data = diamonds)
lm2c <- lm(price ~ cut, data = diamonds)
lm3c <- lm(price ~ color, data = diamonds)
lm4c <- lm(price ~ clarity, data = diamonds)
#Take a summary of each linear model in order to find Multiple R-Squared (Explained Variation in regards to Price) and the p-value.
summary(lm1c)
##
## Call:
## lm(formula = price ~ carat, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18585.3 -804.8 -18.9 537.4 12731.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2256.36 13.06 -172.8 <2e-16 ***
## carat 7756.43 14.07 551.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
## F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
summary(lm2c)
##
## Call:
## lm(formula = price ~ cut, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4258 -2741 -1494 1360 15348
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4062.24 25.40 159.923 < 2e-16 ***
## cut.L -362.73 68.04 -5.331 9.8e-08 ***
## cut.Q -225.58 60.65 -3.719 2e-04 ***
## cut.C -699.50 52.78 -13.253 < 2e-16 ***
## cut^4 -280.36 42.56 -6.588 4.5e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3964 on 53935 degrees of freedom
## Multiple R-squared: 0.01286, Adjusted R-squared: 0.01279
## F-statistic: 175.7 on 4 and 53935 DF, p-value: < 2.2e-16
summary(lm3c)
##
## Call:
## lm(formula = price ~ color, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4989 -2619 -1376 1374 15654
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4124.73 18.64 221.294 < 2e-16 ***
## color.L 2126.73 57.02 37.295 < 2e-16 ***
## color.Q 200.50 54.26 3.695 0.00022 ***
## color.C -254.36 51.08 -4.979 6.41e-07 ***
## color^4 40.88 46.92 0.871 0.38361
## color^5 -228.88 44.36 -5.160 2.48e-07 ***
## color^6 87.92 40.22 2.186 0.02880 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3927 on 53933 degrees of freedom
## Multiple R-squared: 0.03128, Adjusted R-squared: 0.03117
## F-statistic: 290.2 on 6 and 53933 DF, p-value: < 2.2e-16
summary(lm4c)
##
## Call:
## lm(formula = price ~ clarity, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4737 -2727 -1429 1262 16254
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3677.42 25.88 142.086 < 2e-16 ***
## clarity.L -1723.35 98.72 -17.457 < 2e-16 ***
## clarity.Q -428.36 96.70 -4.430 9.45e-06 ***
## clarity.C 647.87 83.31 7.777 7.57e-15 ***
## clarity^4 -123.13 66.73 -1.845 0.0650 .
## clarity^5 804.81 54.62 14.733 < 2e-16 ***
## clarity^6 -273.65 47.68 -5.739 9.55e-09 ***
## clarity^7 81.19 42.02 1.932 0.0533 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3935 on 53932 degrees of freedom
## Multiple R-squared: 0.02715, Adjusted R-squared: 0.02702
## F-statistic: 215 on 7 and 53932 DF, p-value: < 2.2e-16
For each model, the overall p-value is below 2.2e-16, far smaller than .05, so we can confidently say that each variable has a statistically significant relationship with price. The Multiple R-squared values are as follows: carat 84.93%, cut 1.286%, color 3.128%, and clarity 2.715%. Taken on its own, then, carat explains 84.93% of the price variation, cut 1.286%, color 3.128%, and clarity 2.715%. It is important to note, however, that the low percentages for cut, color, and clarity may partly reflect their association with carat, since these single-variable models do not hold carat fixed.
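The R-squared values above can also be collected programmatically instead of being copied from four separate summaries. The sketch below refits the same four single-predictor models; the names `predictors` and `r_squared` are assumptions introduced for illustration.

```r
library(ggplot2)  # provides the diamonds dataset

predictors <- c("carat", "cut", "color", "clarity")

# Fit one single-predictor model per variable and keep its Multiple R-squared
r_squared <- sapply(predictors, function(v) {
  fit <- lm(reformulate(v, response = "price"), data = diamonds)
  summary(fit)$r.squared
})

# Named vector of R-squared values, one per predictor
round(r_squared, 5)
```

`reformulate()` builds the formula `price ~ <variable>` from a string, so the models are equivalent to lm1c through lm4c.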
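Based on the data used in this analysis, it appears that a diamond's price depends on its carat, color, clarity, and cut. To be more specific, each variable has a significant relationship with the price of a diamond. Summing the four individual R-squared values gives approximately 92%, but this figure is only a rough indication: because the four variables are related to one another, R-squared values from separate models are not strictly additive, and the share of price variation explained jointly would have to come from a single model containing all four variables.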
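A joint model is one way to measure how much of the price variation the four variables explain together. This is a hedged sketch rather than part of the original project; `fit_carat`, `fit_joint`, `r2_carat`, and `r2_joint` are hypothetical names, and a plain additive linear model is assumed.

```r
library(ggplot2)  # provides the diamonds dataset

# Carat alone, and all four Cs together in one additive model
fit_carat <- lm(price ~ carat, data = diamonds)
fit_joint <- lm(price ~ carat + cut + color + clarity, data = diamonds)

# Share of price variation explained by each model
r2_carat <- summary(fit_carat)$r.squared
r2_joint <- summary(fit_joint)$r.squared
c(carat_only = r2_carat, joint = r2_joint)

# F-test: do cut, color, and clarity add explanatory power beyond carat?
anova(fit_carat, fit_joint)
```

The gap between the two R-squared values shows what cut, color, and clarity contribute once carat is already accounted for, which the single-variable models cannot show.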
One possible limitation of this study is its scope: the analysis is based on a single dataset and may fail to capture all the factors influencing diamond prices. Further, it is worth noting that variables such as the brand of the diamond, current market demand, and economic conditions could also affect prices but are not included in this dataset. Because of limitations such as these, I believe that the statistical findings may be specific to this dataset and therefore may not be generalizable to all diamond markets or populations.
This document was produced as a final project for MAT 143H - Introduction to Statistics (Honors) at North Shore Community College.
The course was led by Professor Billy Jackson.
Student Name: Michael V. Saraceni Semester: Spring 2024