Investigating what influences diamond prices

1. Introduction

This script was created to investigate the relationship between diamonds’ prices and their physical properties as described in the diamonds dataset from ggplot2.

2. Exploratory Data Analysis

Data and summary statistics: the diamonds data has 10 variables that include price, carat (unit of mass equal to 200 mg), cut, color, clarity, and measures of x, y and z axis.

# Load ggplot2 and data
library(ggplot2)
data(diamonds)

# See variables names
names(diamonds)

##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"

# Get summary stats
summary(diamonds)

##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
##

Plotting price data in a histogram evidences that the prices have a long tailed distribution (left plot); the price of the most expensive diamond is over 50 times larger than the least expensive, which means the price data distribution is easier to see on a logarithmic scale (right plot).

Most diamonds cost less than 1000 dollars, but there is an interesting multipeak distribution above 1000 dollars.

library(gridExtra)

plot1 <- qplot(price, data = diamonds, binwidth = 100) + 
  ggtitle('Price')

plot2 <- qplot(price, data = diamonds, binwidth = 0.01) +
  ggtitle('Price (log10)') +
  scale_x_log10()

grid.arrange(plot1, plot2, ncol = 2)

A few measures can be investigated as potential price determinants: carat (weight), table (width of top of diamond relative to widest point), depth, width (y) and length (x).

Analysing the plots below, it is clear that carat has the closest to a linear relationship to price, so it will be the focus for the price modelling.

plot3 <- ggplot(aes(x = carat, y = price), data = diamonds) +
  geom_point() +
  ggtitle('Carat')

plot4 <- ggplot(aes(x = table, y = price), data = diamonds) +
  geom_point() +
  ggtitle('Table')

plot5 <- ggplot(aes(x = depth, y = price), data = diamonds) +
  geom_point() +
  ggtitle('Depth percentage')

plot6 <- ggplot(aes(x = y, y = price), data = diamonds) +
  geom_point() +
  ggtitle('Width (mm)')

plot7 <- ggplot(aes(x = x, y = price), data = diamonds) +
  geom_point() +
  ggtitle('Length (mm)')


grid.arrange(plot3, plot4, plot5, plot6, plot7, ncol = 3)

Price and carat relationship: when plotting all data, it is possible to see the highest priced diamonds do not follow the overall distribution. Trying to plot data again excluding the heaviest / most expensive 1%, the distribution is much better (see blue lines in both figures below). However, some fitting is still needed so it becomes a linear distribution.

# Scatter plot of carat versus price
plot8 <- ggplot(aes(x = carat, y = price), data = diamonds) +
  geom_point() +
  geom_smooth() +
  ggtitle('All data')

# Scatter plot of carat versus price excluding outliers
plot9 <- ggplot(aes(x = carat, y = price), data = diamonds) +
  geom_point() +
  coord_cartesian(xlim = c(0, quantile(diamonds$carat, 0.99)),
                  ylim = c(0, quantile(diamonds$price, 0.99))) +
  geom_smooth() +
  ggtitle('Excluding highest 1%')

grid.arrange(plot8, plot9, ncol = 2)

## `geom_smooth()` using method = 'gam'
## `geom_smooth()` using method = 'gam'

As noticed before, the price distribution is better represented when in logarithmic scale.

# Plot the price in log scale
ggplot(aes(x = carat, y = price), data = diamonds) +
  geom_point() +
  scale_y_continuous(trans = "log10", limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) + 
  geom_smooth() +
  ggtitle('Price (log10) versus carat')

## `geom_smooth()` using method = 'gam'

As the carat distribution presented resembles the x cube function distribution, a transformation can be attempted so data is plotted as the cube root of carat as well. A function can be defined to get the cube root, and then be applied to the x axis. Points are also made smaller for clarity.

# Load the ggplot graphics package
library(GGally)
library(scales)
library(memisc)

## Loading required package: lattice

## Loading required package: MASS

## 
## Attaching package: 'memisc'

## The following object is masked from 'package:scales':
## 
##     percent

## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts

## The following object is masked from 'package:base':
## 
##     as.array

# Function to extract the cube root
trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
                             inverse = function(x) x^3)

# Plot the cube root of carat
ggplot(aes(x = carat, y = price), data = diamonds) +
  geom_point(alpha = 0.5, size = 0.75, position = "jitter") +
  scale_x_continuous(trans = trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = "log10", limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) + 
  geom_smooth() +
  ggtitle('Price (log10) versus carat (cube root)')

## `geom_smooth()` using method = 'gam'

Within this distribution, the qualitative characteristics can be analysed to see if they impact on price.

# Load color brewer
library(RColorBrewer)

# Same plot as above with color by other diamond properties
ggplot(aes(x = carat, y = price), data = diamonds) + 
  geom_point(aes(color = clarity), alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Clarity', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = trans(), limits = c(0.2, 3),
    breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = "log10", limits = c(350, 15000),
    breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by cube root of carat and clarity')

ggplot(aes(x = carat, y = price), data = diamonds) + 
  geom_point(aes(color = color), alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Color',
    override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = trans(), limits = c(0.2, 3),
    breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = "log10", limits = c(350, 15000),
    breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by cube root of carat and color')

ggplot(aes(x = carat, y = price), data = diamonds) + 
  geom_point(aes(color = cut), alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Cut', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = trans(), limits = c(0.2, 3),
    breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = "log10", limits = c(350, 15000),
    breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by cube root of carat and Cut')

Cut seems to have the least impact on price; clarity and color do impact price, having the best color and clarity associated with highest prices within the same carat.

3. Price modelling

To model the price variation the following parameters are used: cube root of carat, carat, color and clarity.

m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + color)
m4 <- update(m3, ~ . + clarity)
mtable(m1, m2, m3, m4)

## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + color, 
##     data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + color + 
##     clarity, data = diamonds)
## 
## ==============================================================================
##                        m1             m2             m3             m4        
## ------------------------------------------------------------------------------
##   (Intercept)          2.821***       1.039***       1.096***       0.485***  
##                       (0.006)        (0.019)        (0.017)        (0.010)    
##   I(carat^(1/3))       5.558***       8.568***       8.303***       9.088***  
##                       (0.007)        (0.032)        (0.029)        (0.017)    
##   carat                              -1.137***      -0.966***      -1.078***  
##                                      (0.012)        (0.011)        (0.006)    
##   color: .L                                         -0.373***      -0.442***  
##                                                     (0.004)        (0.002)    
##   color: .Q                                         -0.132***      -0.094***  
##                                                     (0.003)        (0.002)    
##   color: .C                                         -0.002         -0.016***  
##                                                     (0.003)        (0.002)    
##   color: ^4                                          0.029***       0.012***  
##                                                     (0.003)        (0.002)    
##   color: ^5                                         -0.018***      -0.003*    
##                                                     (0.003)        (0.002)    
##   color: ^6                                         -0.027***       0.000     
##                                                     (0.002)        (0.001)    
##   clarity: .L                                                       0.941***  
##                                                                    (0.003)    
##   clarity: .Q                                                      -0.255***  
##                                                                    (0.003)    
##   clarity: .C                                                       0.144***  
##                                                                    (0.003)    
##   clarity: ^4                                                      -0.069***  
##                                                                    (0.002)    
##   clarity: ^5                                                       0.030***  
##                                                                    (0.002)    
##   clarity: ^6                                                      -0.005**   
##                                                                    (0.002)    
##   clarity: ^7                                                       0.034***  
##                                                                    (0.001)    
## ------------------------------------------------------------------------------
##   R-squared            0.924          0.935          0.947          0.983     
##   adj. R-squared       0.924          0.935          0.947          0.983     
##   sigma                0.280          0.259          0.233          0.133     
##   F               652012.063     387489.366     120932.024     205751.644     
##   p                    0.000          0.000          0.000          0.000     
##   Log-likelihood   -7962.499      -3631.319       2002.977      32297.142     
##   Deviance          4242.831       3613.360       2932.128        953.586     
##   AIC              15930.999       7270.637      -3985.953     -64560.284     
##   BIC              15957.685       7306.220      -3896.997     -64409.058     
##   N                53940          53940          53940          53940         
## ==============================================================================

4. Model testing: predictions

Now let’s test this model creating a hypothetic diamond with the following characteristics: carat of 1, color J, clarity IF and charted price of 5000 dollars (from https://www.creditdonkey.com/diamond-prices.html).

hypDiamond = data.frame(carat = 1.00, color = "J", clarity = "IF")
modelEstimate = predict(m4, newdata = hypDiamond,
                        interval = "prediction", level = .95)
print(exp(modelEstimate))

##        fit      lwr      upr
## 1 5469.189 4213.778 7098.626

The calculated price is 5469 dollars (95% confidence). The model predicts fairly well the chartered price a diamond with the above characteristics should have!

Investigating what influences diamond prices

Renata Barros

February 2018

1. Introduction

2. Exploratory Data Analysis

3. Price modelling

4. Model testing: predictions