What makes a diamond valuable? This dataset contains descriptions of over 50,000 diamonds. It is used here to examine how different characteristics of a diamond relate to its price.

# Import the data from GitHub (it was already hosted there)

path <- 'https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/ggplot2/diamonds.csv'

df <- read.csv(path)

head(df)
##   X carat       cut color clarity depth table price    x    y    z
## 1 1  0.23     Ideal     E     SI2  61.5    55   326 3.95 3.98 2.43
## 2 2  0.21   Premium     E     SI1  59.8    61   326 3.89 3.84 2.31
## 3 3  0.23      Good     E     VS1  56.9    65   327 4.05 4.07 2.31
## 4 4  0.29   Premium     I     VS2  62.4    58   334 4.20 4.23 2.63
## 5 5  0.31      Good     J     SI2  63.3    58   335 4.34 4.35 2.75
## 6 6  0.24 Very Good     J    VVS2  62.8    57   336 3.94 3.96 2.48

First, the raw data was imported from a CSV file.

#1 Data Exploration.  


summary(df)
##        X             carat               cut        color        clarity     
##  Min.   :    1   Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:13486   1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :26971   Median :0.7000   Ideal    :21551   F: 9542   SI2    : 9194  
##  Mean   :26971   Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:40455   3rd Qu.:1.0400   Very Good:12082   H: 8304   VVS2   : 5066  
##  Max.   :53940   Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
## 

Here we can see the information provided. Included are an index (X), the weight (carat), the quality of the cut, the color, the clarity, the price, and several measurements of each diamond. Cut, clarity, and color are factors whose levels correspond to quality grades. Depth, table, x, y, and z are all dimensional measurements of the diamonds. Price is in USD. This is more data than will be used here, so some columns will be removed.
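One caveat for anyone reproducing this summary: the output in this report was produced under R 3.6, where `read.csv` converted text columns to factors by default. From R 4.0.0 onward the default is `stringsAsFactors = FALSE`, so `cut`, `color`, and `clarity` would load as character columns and `summary()` would not list level counts. A minimal sketch, using three rows copied from the `head(df)` output as an inline CSV in place of the real file:

```r
# Three rows copied from head(df), supplied inline so the example runs offline.
csv_text <- "X,carat,cut,color,clarity,depth,table,price,x,y,z
1,0.23,Ideal,E,SI2,61.5,55,326,3.95,3.98,2.43
2,0.21,Premium,E,SI1,59.8,61,326,3.89,3.84,2.31
3,0.23,Good,E,VS1,56.9,65,327,4.05,4.07,2.31"

# Request factors explicitly so the result is the same on R 3.x and R >= 4.0.
toy <- read.csv(text = csv_text, stringsAsFactors = TRUE)

sapply(toy, class)   # cut, color, and clarity are reported as "factor"
levels(toy$cut)      # levels are alphabetical until they are reordered
```

With the real file, the same argument would be passed as `read.csv(path, stringsAsFactors = TRUE)`.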

#2 Data Wrangling

diamonds <- df

# Drop the index, depth, table, and the x/y/z dimensions
diamonds[c("X", "depth", "table", "x", "y", "z")] <- NULL

names(diamonds) <- c("Carat", "Cut", "Color", "Clarity", "Price_USD")

diamonds$Cut <- factor(diamonds$Cut, levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"))
diamonds$Color <- factor(diamonds$Color, levels = c("J", "I", "H", "G", "F", "E", "D"))
diamonds$Clarity <- factor(diamonds$Clarity, levels = c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"))

head(diamonds)
##   Carat       Cut Color Clarity Price_USD
## 1  0.23     Ideal     E     SI2       326
## 2  0.21   Premium     E     SI1       326
## 3  0.23      Good     E     VS1       327
## 4  0.29   Premium     I     VS2       334
## 5  0.31      Good     J     SI2       335
## 6  0.24 Very Good     J    VVS2       336
summary(diamonds)
##      Carat               Cut        Color        Clarity        Price_USD    
##  Min.   :0.2000   Fair     : 1610   J: 2808   SI1    :13065   Min.   :  326  
##  1st Qu.:0.4000   Good     : 4906   I: 5422   VS2    :12258   1st Qu.:  950  
##  Median :0.7000   Very Good:12082   H: 8304   SI2    : 9194   Median : 2401  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   : 3933  
##  3rd Qu.:1.0400   Ideal    :21551   F: 9542   VVS2   : 5066   3rd Qu.: 5324  
##  Max.   :5.0100                     E: 9797   VVS1   : 3655   Max.   :18823  
##                                     D: 6775   (Other): 2531

The data was then simplified by removing a few variables: X (the index), depth, table, and the x, y, and z dimensions. The price column was renamed to Price_USD to make its units explicit. Factor levels were also reordered from worst to best so that later plots read in a sensible order. Above are the head and summary of the new dataset.
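To illustrate what reordering the factor levels changes, here is a small sketch on a made-up vector of cut grades (the values are illustrative, not drawn from the dataset). By default `factor()` sorts levels alphabetically; supplying `levels` fixes the order explicitly.

```r
# Toy vector of cut grades (illustrative values only).
cut_raw <- c("Ideal", "Fair", "Very Good", "Good", "Premium", "Fair")

# Default factor(): levels are sorted alphabetically.
levels(factor(cut_raw))

# Explicit levels, worst to best, as in the wrangling step above.
cut_ordered <- factor(cut_raw,
                      levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"))
levels(cut_ordered)

# table() follows level order, so the counts now read worst to best.
table(cut_ordered)
```

One thing to watch for: any value that is not listed in `levels` becomes `NA`, so a misspelled level name silently drops data.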

#3 Graphics
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.2
## 
## Attaching package: 'ggplot2'
## The following object is masked _by_ '.GlobalEnv':
## 
##     diamonds
qplot(Carat, Price_USD, data = diamonds)

Above is a comparison of price and carat. It makes sense that bigger diamonds are more valuable, right? While this generally proves true, the plot also yields an interesting insight: diamond sizes are not evenly distributed.

qplot(Carat, data = diamonds, binwidth = 0.01, xlim = c(0, 2))
## Warning: Removed 1889 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).

(Above) Looking at a histogram of sizes alone, diamonds cluster at certain weights (0.3, 0.4, 0.5, 0.7, 0.9, 1 and 1.01, 1.2, 1.5, 1.7, and 2). Potential reasons for this pattern include reporting bias, rounding error, or diamonds being cut specifically so that they do not fall under these ‘round’ weights. If the last explanation holds, are diamonds cut to ‘round’ weights more valuable per carat?
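The clustering can also be checked numerically by counting how many diamonds share each exact weight. The sketch below uses a made-up `carat` vector (illustrative values only); with the real data the same call would be applied to `diamonds$Carat`.

```r
# Toy carat values (illustrative only, not taken from the dataset).
carat <- c(0.30, 0.30, 0.31, 0.50, 0.50, 0.50, 0.70, 0.99, 1.00, 1.01, 1.01)

# Count observations at each distinct weight, most common first.
weight_counts <- sort(table(carat), decreasing = TRUE)
head(weight_counts, 3)

# Compare the counts just below and at or just above a 'round' weight.
sum(carat == 0.99)
sum(carat %in% c(1.00, 1.01))
```

A large gap between the count just below a ‘round’ weight and the count at or just above it would be consistent with stones being cut to clear that threshold.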

# Price per carat, in USD
diamonds$PriceCarat <- diamonds$Price_USD / diamonds$Carat
ggplot(diamonds, aes(x = Carat, y = PriceCarat)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

Above is a plot of price per carat against carat with a trendline. Here we see that diamonds at ‘round’ weights are not necessarily more valuable for their weight on the whole. On a per-weight basis, diamonds are most expensive around 1.7 carats (which is one of the ‘round’ weights). Hence, if you wanted to get the best price for your 4 carat diamond, you might be better off cutting it into two smaller ones.

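One other feature of the plot above is a clear upper limit on price per carat, which slopes downward beginning around 1 carat. This is probably an artifact of the data rather than of the market: no diamond in the dataset costs more than 18,823 USD, so price per carat can never exceed 18,823 divided by the carat weight.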
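The arithmetic behind such a ceiling can be sketched as follows, assuming it stems from the maximum price of 18,823 USD shown in the summary. The example weights are arbitrary choices for illustration.

```r
# Maximum price in the dataset, taken from summary(diamonds) above.
max_price <- 18823

# Upper bound on price per carat at a few example weights.
carat <- c(1, 1.5, 2, 4)
ceiling_per_carat <- max_price / carat

data.frame(Carat = carat, MaxPriceCarat = round(ceiling_per_carat, 2))
```

Because the bound falls as weight rises, some of the apparent decline in price per carat for large diamonds may reflect this cap rather than a real drop in value.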

There are several other characteristics to consider, however. Perhaps diamonds of certain sizes are more likely to have other characteristics that affect price (that is, confounding factors). Will the trend hold if we control for other variables?

sub1 <- subset(diamonds, Color == "F" & Cut == "Ideal")
ggplot(sub1, aes(x = Carat, y = PriceCarat)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

The subset above holds cut fixed at Ideal and color fixed at F, a mid-to-high color grade. This limits the amount of data available, but shows that the general trend holds up to around 2.5 carats (where we run out of data).
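Before fixing two variables like this, it can help to see how many rows each combination would leave. A minimal sketch on a made-up data frame that borrows the wrangled column names (the rows are illustrative only):

```r
# Toy data frame with the wrangled column names (values are illustrative).
toy <- data.frame(
  Cut   = c("Ideal", "Ideal", "Ideal", "Premium", "Premium", "Fair"),
  Color = c("F", "F", "G", "F", "D", "F")
)

# Cross-tabulate to see how many rows each Cut/Color combination contains.
combo_counts <- table(toy$Cut, toy$Color)
combo_counts

# Number of rows that the Ideal/F subset would keep.
combo_counts["Ideal", "F"]
nrow(subset(toy, Color == "F" & Cut == "Ideal"))
```

On the real data, `table(diamonds$Cut, diamonds$Color)` would give the same kind of overview and show which combinations are large enough to plot.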

sub2 <- subset(diamonds, Carat >= 1.5 & Carat <= 2)
ggplot(sub2, aes(x = Carat, y = PriceCarat)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

When a subset between two ‘round’ weights is examined (as above), there is no drop in price per carat on the approach to the next ‘round’ weight. There is still a tendency to cut diamonds to the ‘round’ weights, but it does not appear to make a diamond more valuable for its weight.

subC1 <- subset(diamonds, Carat == 0.3)
ggplot(subC1, aes(x = Cut, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Cut/Price, Size = 0.3 Carats")

subC2 <- subset(diamonds, Carat == 0.5)
ggplot(subC2, aes(x = Cut, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Cut/Price, Size = 0.5 Carats")

subC3 <- subset(diamonds, Carat == 0.7)
ggplot(subC3, aes(x = Cut, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Cut/Price, Size = 0.7 Carats")

# There are more observations at 1.01 carats than at 1.00 carats
subC4 <- subset(diamonds, Carat == 1.01)
ggplot(subC4, aes(x = Cut, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Cut/Price, Size = 1.01 Carats")

subC5 <- subset(diamonds, Carat == 1.5)
ggplot(subC5, aes(x = Cut, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Cut/Price, Size = 1.5 Carats")

subC6 <- subset(diamonds, Carat == 2.0)
ggplot(subC6, aes(x = Cut, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Cut/Price, Size = 2.0 Carats")

Above are some box plots of price against cut for several different ‘round’ sizes of diamond (the ones for which we have the most data). As one would expect, the general trend is that diamonds of better cut are more valuable. However, this does not hold true for the smallest diamonds (0.3 carats). One could theorize that the cut of a diamond is less noticeable at this small size.
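The box plots can be backed up with a table of median prices per cut grade at each weight. The sketch below runs on a made-up data frame that borrows the wrangled column names; the prices are invented purely to make the example executable and say nothing about the real data.

```r
# Toy data with the wrangled column names (prices are invented, not real).
toy <- data.frame(
  Carat     = c(0.3, 0.3, 0.3, 0.3, 0.5, 0.5, 0.5, 0.5),
  Cut       = factor(c("Fair", "Fair", "Ideal", "Ideal",
                       "Fair", "Fair", "Ideal", "Ideal"),
                     levels = c("Fair", "Ideal")),
  Price_USD = c(500, 520, 510, 530, 1000, 1100, 1500, 1700)
)

# Median price for each cut grade within each selected weight.
weights <- c(0.3, 0.5)
medians <- sapply(weights, function(w) {
  at_w <- toy[toy$Carat == w, ]
  tapply(at_w$Price_USD, at_w$Cut, median)
})
colnames(medians) <- paste(weights, "ct")
medians
```

Applied to `diamonds` with the six weights used above, each column of the result would correspond to one box plot, with one median per cut grade.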

ggplot(subC1, aes(x = Color, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Color/Price, Size = 0.3 Carats")

ggplot(subC2, aes(x = Color, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Color/Price, Size = 0.5 Carats")

ggplot(subC3, aes(x = Color, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Color/Price, Size = 0.7 Carats")

# There are more observations at 1.01 carats than at 1.00 carats

ggplot(subC4, aes(x = Color, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Color/Price, Size = 1.01 Carats")

ggplot(subC5, aes(x = Color, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Color/Price, Size = 1.5 Carats")

ggplot(subC6, aes(x = Color, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Color/Price, Size = 2.0 Carats")

(Above) Color runs from worst, J, to best, D. At 0.7 carats, the color of the diamond tends to correlate well with its price. However, at all other sizes, color appears to lose its impact on price beyond the “G” rating.

ggplot(subC1, aes(x = Clarity, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Clarity/Price, Size = 0.3 Carats")

ggplot(subC2, aes(x = Clarity, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Clarity/Price, Size = 0.5 Carats")

ggplot(subC3, aes(x = Clarity, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Clarity/Price, Size = 0.7 Carats")

# There are more observations at 1.01 carats than at 1.00 carats

ggplot(subC4, aes(x = Clarity, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Clarity/Price, Size = 1.01 Carats")

ggplot(subC5, aes(x = Clarity, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Clarity/Price, Size = 1.5 Carats")

ggplot(subC6, aes(x = Clarity, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Clarity/Price, Size = 2.0 Carats")

Above are plots of price against clarity (from “I1” at the worst to “IF” at the best). There is a general trend of price going up in response to better clarity. However, for larger diamonds (1.5 and 2.0 carats) higher grades of clarity bring less of an improvement in price.

This has been a quick analysis of how the different characteristics of a diamond affect its sale price. As has been shown, it’s not quite as straightforward as would be expected. Further analysis could be performed with some of the characteristics not included here. Comparisons between characteristics other than price (e.g. cut vs. color) may also prove insightful.