What makes a diamond valuable? This dataset describes over 50,000 diamonds. Below, several characteristics of diamonds are examined in relation to price.
# Import the data from GitHub (it was already hosted there)
path <- 'https://raw.github.com/vincentarelbundock/Rdatasets/master/csv/ggplot2/diamonds.csv'
df <- read.csv(path)
head(df)
## X carat cut color clarity depth table price x y z
## 1 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
First, the raw data was imported from a CSV file.
#1 Data Exploration.
summary(df)
## X carat cut color clarity
## Min. : 1 Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:13486 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :26971 Median :0.7000 Ideal :21551 F: 9542 SI2 : 9194
## Mean :26971 Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:40455 3rd Qu.:1.0400 Very Good:12082 H: 8304 VVS2 : 5066
## Max. :53940 Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
Here we can see the information we are provided with. Included are an index (X), the weight (carat), the quality of the cut, the color, the clarity, the price, and various measurements of the diamond. Cut, clarity, and color are factors whose levels correspond to quality grades. Depth and table are proportions derived from the diamond's shape, and x, y, and z are its length, width, and depth in millimetres. Price is in USD. This is more data than will be used here, so some of it will be removed.
#2 Data Wrangling
diamonds <- df
diamonds$X <- diamonds$depth <- diamonds$table <- diamonds$x <- diamonds$y <- diamonds$z <- NULL
names(diamonds) <- c("Carat", "Cut", "Color", "Clarity", "Price_USD")
diamonds$Cut <- factor(diamonds$Cut, levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"))
diamonds$Color <- factor(diamonds$Color, levels = c("J", "I", "H", "G", "F", "E", "D"))
diamonds$Clarity <- factor(diamonds$Clarity, levels = c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"))
head(diamonds)
## Carat Cut Color Clarity Price_USD
## 1 0.23 Ideal E SI2 326
## 2 0.21 Premium E SI1 326
## 3 0.23 Good E VS1 327
## 4 0.29 Premium I VS2 334
## 5 0.31 Good J SI2 335
## 6 0.24 Very Good J VVS2 336
summary(diamonds)
## Carat Cut Color Clarity Price_USD
## Min. :0.2000 Fair : 1610 J: 2808 SI1 :13065 Min. : 326
## 1st Qu.:0.4000 Good : 4906 I: 5422 VS2 :12258 1st Qu.: 950
## Median :0.7000 Very Good:12082 H: 8304 SI2 : 9194 Median : 2401
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean : 3933
## 3rd Qu.:1.0400 Ideal :21551 F: 9542 VVS2 : 5066 3rd Qu.: 5324
## Max. :5.0100 E: 9797 VVS1 : 3655 Max. :18823
## D: 6775 (Other): 2531
The data was then simplified by removing a few variables: X (the index), depth, table, and the dimensions (x, y, z). The unit was added to the price column name for clarity. The factor levels were also reordered from worst to best so that later plots read more clearly. Above are the head and summary of the new dataset.
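The same wrangling can also be sketched by selecting the columns to keep by name, which makes it easier to see what survives. This is only an illustration: the three-row data frame `toy` and the names `keep` and `small` are hypothetical and simply mirror the column layout shown in `head(df)`.

```r
# Hypothetical three-row stand-in for df, mirroring the columns of the raw CSV
toy <- data.frame(
  X = 1:3,
  carat = c(0.23, 0.21, 0.23),
  cut = c("Ideal", "Premium", "Good"),
  color = c("E", "E", "E"),
  clarity = c("SI2", "SI1", "VS1"),
  depth = c(61.5, 59.8, 56.9),
  table = c(55, 61, 65),
  price = c(326, 326, 327),
  x = c(3.95, 3.89, 4.05),
  y = c(3.98, 3.84, 4.07),
  z = c(2.43, 2.31, 2.31),
  stringsAsFactors = FALSE
)

# Keep only the columns used in the analysis, then rename them
keep <- c("carat", "cut", "color", "clarity", "price")
small <- toy[, keep]
names(small) <- c("Carat", "Cut", "Color", "Clarity", "Price_USD")

# Order the cut grades from worst to best
small$Cut <- factor(small$Cut, levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"))

levels(small$Cut)
```

Selecting by name has the small advantage that an unexpected extra column in the source file would not silently slip into the result.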
#3 Graphics
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.2
##
## Attaching package: 'ggplot2'
## The following object is masked _by_ '.GlobalEnv':
##
## diamonds
qplot(Carat, Price_USD, data = diamonds)
Above is a comparison of price and carat. It makes sense that bigger diamonds are more valuable, right? While this generally proves true, the plot also yields an interesting insight: diamond sizes are not evenly distributed.
qplot(Carat, data = diamonds, binwidth = 0.01, xlim = c(0, 2))
## Warning: Removed 1889 rows containing non-finite values (stat_bin).
## Warning: Removed 2 rows containing missing values (geom_bar).
(Above) Looking at a histogram of sizes alone, diamonds cluster at certain weights (0.3, 0.4, 0.5, 0.7, 0.9, 1 and 1.01, 1.2, 1.5, 1.7, and 2). Possible reasons for this pattern are reporting bias, rounding error, or that diamonds are specifically cut so as not to fall under these ‘round’ weights. In that case, might diamonds cut to ‘round’ weights be more valuable per carat?
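The clustering seen in the histogram can also be checked numerically by counting how often each exact weight occurs. The sketch below uses a short, made-up vector `carat` in place of `diamonds$Carat`, so the counts it produces say nothing about the real data.

```r
# Hypothetical carat values standing in for diamonds$Carat
carat <- c(0.30, 0.30, 0.31, 0.50, 0.50, 0.50, 0.70, 0.99, 1.00, 1.01, 1.01)

# Count how often each exact weight occurs, sorted from most to least common
counts <- sort(table(carat), decreasing = TRUE)

# The most common weights
head(counts, 3)
```

Applied to the full dataset, the same two lines would list the most frequent weights directly instead of reading them off the plot.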
diamonds$PriceCarat <- diamonds$Price_USD/diamonds$Carat
ggplot(diamonds, aes(x = Carat, y = PriceCarat)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
Above is a plot of price-per-carat against carat with a trendline. Here we see that diamonds at ‘round’ weights are not necessarily more valuable for their weight on the whole. On a per-weight basis, diamonds are most expensive around 1.7 carats (which is one of the ‘round’ weights). Hence, if you wanted to get the best price for your 4 carat diamond, the trendline suggests you would be better off cutting it into two smaller ones.
One other note above is the clear maximum limit for price-per-carat, which slopes downwards beginning around 1 carat.
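One plausible reading of this ceiling, offered as an inference from the `summary()` output rather than something the data source states, is that it reflects the highest price in the dataset (18,823 USD). If no diamond costs more than that, price-per-carat can never exceed 18823 / Carat, which traces a downward-sloping curve. The names `max_price` and `ceiling_ppc` below are illustrative only.

```r
# Highest price in the dataset, taken from the summary() output above
max_price <- 18823

# For a given weight, price-per-carat cannot exceed max_price / carat
carat <- c(1, 2, 3, 4, 5)
ceiling_ppc <- max_price / carat

data.frame(Carat = carat, MaxPricePerCarat = ceiling_ppc)
```

If this reading is right, the sloping upper edge is a property of how the data was capped rather than of the diamond market, so conclusions about the largest stones should be treated with some caution.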
There are several other characteristics to consider, however. Perhaps diamonds of certain sizes are more likely to have other characteristics that affect price (i.e. confounding factors). Will the trend hold true if we control for other variables?
sub1 <- subset(diamonds, Color == "F" & Cut == "Ideal")
ggplot(sub1,aes(x=Carat,y=PriceCarat))+geom_point()+geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
The above subset controls for ideal cut and midgrade color. This limits the amount of data that can be used, but shows that the general trend up to around 2.5 carats (where we run out of data) holds true.
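To see how much data a given combination leaves before plotting it, the cells can be counted with a cross-tabulation. This is a minimal sketch on a made-up data frame `toy`; with the real data, `diamonds$Color` and `diamonds$Cut` would take the place of the toy columns.

```r
# Hypothetical rows standing in for the diamonds data frame
toy <- data.frame(
  Cut = c("Ideal", "Ideal", "Premium", "Ideal", "Good"),
  Color = c("F", "F", "F", "G", "F"),
  stringsAsFactors = FALSE
)

# Cross-tabulate to see how many diamonds fall in each Color/Cut cell
cell_counts <- table(Color = toy$Color, Cut = toy$Cut)
cell_counts

# Number of rows the Color == "F" & Cut == "Ideal" subset would keep
n_sub <- nrow(subset(toy, Color == "F" & Cut == "Ideal"))
n_sub
```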
sub2 <- subset(diamonds, Carat >= 1.5 & Carat <= 2)
ggplot(sub2, aes(x = Carat, y = PriceCarat)) + geom_point() + geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
When a subset between two ‘round’ weights is examined (as above), there is no price-per-carat drop while approaching the next highest ‘round’ weight. There is still a tendency to cut diamonds to the ‘round’ weights, but it does not appear to make the diamond more valuable for its weight.
subC1 <- subset(diamonds, Carat == 0.3)
ggplot(subC1, aes(x = Cut, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Cut/Price, Size = 0.3 Carats")
subC2 <- subset(diamonds, Carat == 0.5)
ggplot(subC2, aes(x = Cut, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Cut/Price, Size = 0.5 Carats")
subC3 <- subset(diamonds, Carat == 0.7)
ggplot(subC3, aes(x = Cut, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Cut/Price, Size = 0.7 Carats")
# There are more observations at 1.01 carats than at 1.00 carats
subC4 <- subset(diamonds, Carat == 1.01)
ggplot(subC4, aes(x = Cut, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Cut/Price, Size = 1.01 Carats")
subC5 <- subset(diamonds, Carat == 1.5)
ggplot(subC5, aes(x = Cut, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Cut/Price, Size = 1.5 Carats")
subC6 <- subset(diamonds, Carat == 2.0)
ggplot(subC6, aes(x = Cut, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Cut/Price, Size = 2.0 Carats")
Above are box plots of price against cut for several ‘round’ sizes of diamond (the ones for which we have the most data). As one would expect, the general trend is that diamonds of better cut are more valuable. However, this does not hold true for the smallest diamonds (0.3 carats). One could theorize that the cut of a diamond is less noticeable at this small size.
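The comparison the box plots make by eye can be backed with numbers by computing the median price per cut grade at each size. The sketch below is a rough illustration: the data frame `toy`, its prices, and the helper `median_by_cut` are all hypothetical, and the made-up prices say nothing about the real dataset.

```r
# Hypothetical rows standing in for the diamonds data frame
toy <- data.frame(
  Carat = c(0.3, 0.3, 0.3, 0.3, 1.5, 1.5, 1.5, 1.5),
  Cut = factor(
    c("Fair", "Fair", "Ideal", "Ideal", "Fair", "Fair", "Ideal", "Ideal"),
    levels = c("Fair", "Good", "Very Good", "Premium", "Ideal")
  ),
  Price_USD = c(500, 600, 550, 650, 8000, 9000, 11000, 12000)
)

# Median price per cut grade for one exact carat weight (hypothetical helper)
median_by_cut <- function(data, size) {
  rows <- data[data$Carat == size, ]
  tapply(rows$Price_USD, rows$Cut, median)
}

# One named vector of medians per size; grades with no rows come back as NA
sizes <- c(0.3, 1.5)
medians <- lapply(sizes, function(s) median_by_cut(toy, s))
names(medians) <- as.character(sizes)
medians
```

Wrapping the subsetting in a helper also avoids repeating the same code once per size, which is how the plots above are produced.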
ggplot(subC1, aes(x = Color, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Color/Price, Size = 0.3 Carats")
ggplot(subC2, aes(x = Color, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Color/Price, Size = 0.5 Carats")
ggplot(subC3, aes(x = Color, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Color/Price, Size = 0.7 Carats")
# There are more observations at 1.01 carats than at 1.00 carats
ggplot(subC4, aes(x = Color, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Color/Price, Size = 1.01 Carats")
ggplot(subC5, aes(x = Color, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Color/Price, Size = 1.5 Carats")
ggplot(subC6, aes(x = Color, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Color/Price, Size = 2.0 Carats")
(Above) Color is ordered from worst, J, to best, D. At 0.7 carats, price tends to rise steadily with color grade. However, at all other sizes, color appears to lose its impact on price beyond the “G” rating.
ggplot(subC1, aes(x = Clarity, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Clarity/Price, Size = 0.3 Carats")
ggplot(subC2, aes(x = Clarity, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Clarity/Price, Size = 0.5 Carats")
ggplot(subC3, aes(x = Clarity, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Clarity/Price, Size = 0.7 Carats")
# There are more observations at 1.01 carats than at 1.00 carats
ggplot(subC4, aes(x = Clarity, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Clarity/Price, Size = 1.01 Carats")
ggplot(subC5, aes(x = Clarity, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Clarity/Price, Size = 1.5 Carats")
ggplot(subC6, aes(x = Clarity, y = Price_USD)) + geom_boxplot(outlier.color = "red") + ggtitle("Clarity/Price, Size = 2.0 Carats")
Above are plots of price against clarity (from “I1” at the worst to “IF” at the best). There is a general trend of price going up in response to better clarity. However, for larger diamonds (1.5 and 2.0 carats) higher grades of clarity bring less of an improvement in price.
This has been a quick analysis of how the different characteristics of a diamond affect its sale price. As has been shown, the relationship is not quite as straightforward as might be expected. Further analysis could be performed with some of the characteristics not included here. Comparisons between characteristics other than price (e.g. cut vs. color) may also prove insightful.