“A diamond is forever” - Frances Gerety

1. Let’s consider the price of a diamond and it’s carat weight.Create a scatterplot of price (y) vs carat weight (x), and limit the x-axis and y-axis to omit the top 1% of values.

ggplot(diamonds,aes(x=carat,y=price))+
  geom_point(color='blue',fill='blue')+
  xlim(0,quantile(diamonds$carat,0.99))+
  ylim(0,quantile(diamonds$price,0.99))+
  ggtitle('Diamond price vs. carat')

it looks like the price increases kind of exponentially against carat, but it gets diverse when the carat increases.

2. Let’s sample 10,000 from diamonds data set and then use ggpairs to generate the pair-wise variables relationship

set.seed(20022012)
diamonds_samp <- diamonds[sample(1:length(diamonds$price),10000),]
ggpairs(diamonds_samp,params = c(shape=I('.'),outlier.shape=I('.')))

We notice the followings:

3. Create two histograms of the price variable, one is of price and the 2nd is log10 transformation of price, and place them side by side on one output image.

plot1 <- ggplot(diamonds,aes(x=price))+
  geom_histogram(color='blue',fill = 'blue',binwidth=100)+
  scale_x_continuous(breaks=seq(300,19000,1000),limit=c(300,19000))+
  ggtitle('Price')
plot2 <- ggplot(diamonds,aes(x=price))+
  geom_histogram(color='red',fill='red',binwidth=0.01)+
  scale_x_log10(breaks=seq(300,19000,1000),limit=c(300,19000))+
  ggtitle('Price(log10)')
grid.arrange(plot1,plot2,ncol=2)

It’s obviously that the price histogram is skewed to the left side while the log10(price) tends to bell curve distributed. Also, the two peaks in the log10(price) plot coincides with the 1st and 3rd quantile of price.

4. Create scatter plot by log10 transforming price on y axis and cuberoot transforming carat on x axis

### Create a new function to transform the carat variable
cuberoot_trans = function() trans_new('cuberoot',
                                      transform = function(x) x^(1/3),
                                      inverse = function(x) x^3)

### Use the cuberoot_trans function
ggplot(aes(carat, price), data = diamonds) + 
  geom_point(color='blue',fill='blue',alpha=1/2,size=1,position = 'jitter') + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')

Now the log10(price) is almost linear with cuberoot of carat - we can move on to the modeling.

5. Let’s see if other factors have some impact on price. Clarity first.

ggplot(aes(x = carat, y = price), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter',aes(color=clarity)) +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Clarity', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +                         
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
    breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
    breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Clarity')

It’s clear that Clarity factors into the diamond price - a better clarity almost always has higher price than lower end clarity.

6. Now let’s see if cut has similar impact on diamond price

ggplot(aes(x = carat, y = price), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter',aes(color=cut)) +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Cut', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +                         
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
    breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
    breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Cut')

While cut plot does not show as obvious pattern as Clarity, it’s still clear that with the smae carat the diamonds with the best cut are priced higher. Hence, I think cut should be also included in the price prediction algorithm.

7. Now let’s see if color accounts for any variance of diamond price

ggplot(aes(x = carat, y = price), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter',aes(color=color)) +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Color', reverse = F,
    override.aes = list(alpha = 1, size = 2))) +                         
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
    breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
    breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Color')

This looks similar with previous Clarity plot and Color should be also considered as an factor for price.

8. it’s time to build out the price prediction model!

m1 <- lm(I(log10(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1,~ . +carat)
m3 <- update(m2,~ . +cut)
m4 <- update(m3,~ . +color)
m5 <- update(m4,~ . +clarity)
mtable(m1,m2,m3,m4,m5)
## 
## Calls:
## m1: lm(formula = I(log10(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log10(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log10(price)) ~ I(carat^(1/3)) + carat + cut, 
##     data = diamonds)
## m4: lm(formula = I(log10(price)) ~ I(carat^(1/3)) + carat + cut + 
##     color, data = diamonds)
## m5: lm(formula = I(log10(price)) ~ I(carat^(1/3)) + carat + cut + 
##     color + clarity, data = diamonds)
## 
## ===========================================================================
##                     m1          m2          m3          m4          m5     
## ---------------------------------------------------------------------------
## (Intercept)       1.225***    0.451***    0.380***    0.405***    0.180*** 
##                  (0.003)     (0.008)     (0.008)     (0.007)     (0.004)   
## I(carat^(1/3))    2.414***    3.721***    3.780***    3.665***    3.971*** 
##                  (0.003)     (0.014)     (0.013)     (0.012)     (0.007)   
## carat                        -0.494***   -0.505***   -0.431***   -0.474*** 
##                              (0.005)     (0.005)     (0.004)     (0.003)   
## cut: .L                                   0.097***    0.097***    0.052*** 
##                                          (0.002)     (0.002)     (0.001)   
## cut: .Q                                  -0.027***   -0.027***   -0.013*** 
##                                          (0.002)     (0.001)     (0.001)   
## cut: .C                                   0.022***    0.022***    0.006*** 
##                                          (0.001)     (0.001)     (0.001)   
## cut: ^4                                   0.008***    0.008***   -0.001    
##                                          (0.001)     (0.001)     (0.001)   
## color: .L                                            -0.162***   -0.191*** 
##                                                      (0.001)     (0.001)   
## color: .Q                                            -0.056***   -0.040*** 
##                                                      (0.001)     (0.001)   
## color: .C                                             0.001      -0.006*** 
##                                                      (0.001)     (0.001)   
## color: ^4                                             0.012***    0.005*** 
##                                                      (0.001)     (0.001)   
## color: ^5                                            -0.007***   -0.001*   
##                                                      (0.001)     (0.001)   
## color: ^6                                            -0.010***    0.001    
##                                                      (0.001)     (0.001)   
## clarity: .L                                                       0.394*** 
##                                                                  (0.001)   
## clarity: .Q                                                      -0.104*** 
##                                                                  (0.001)   
## clarity: .C                                                       0.057*** 
##                                                                  (0.001)   
## clarity: ^4                                                      -0.027*** 
##                                                                  (0.001)   
## clarity: ^5                                                       0.011*** 
##                                                                  (0.001)   
## clarity: ^6                                                      -0.001    
##                                                                  (0.001)   
## clarity: ^7                                                       0.014*** 
##                                                                  (0.001)   
## ---------------------------------------------------------------------------
## R-squared            0.924       0.935       0.939       0.951        0.984
## adj. R-squared       0.924       0.935       0.939       0.951        0.984
## sigma                0.122       0.112       0.109       0.097        0.056
## F               652012.063  387489.366  138654.523   87959.467   173791.084
## p                    0.000       0.000       0.000       0.000        0.000
## Log-likelihood   37025.211   41356.392   43150.294   49222.951    79078.982
## Deviance           800.248     681.522     637.665     509.103      168.282
## AIC             -74044.422  -82704.783  -86284.589  -98417.901  -158115.964
## BIC             -74017.735  -82669.201  -86213.424  -98293.362  -157929.156
## N                53940       53940       53940       53940        53940    
## ===========================================================================

the linear model for the diamond price is:

\[Log(price) = 0.18+3.97carat^{1/3}-0.474carat+pc_5 * cut_{coef}+pc_7 * color_{coef}+pc_8 * clarity_{coef}\]

where \(pc_5\), \(pc_7\) and \(pc_8\) are polynomial contrast with n=5, 7, 8, respectively. You can check the detail by applying \(contr.poly(n)\).

9. Pick a diamond and predict the price by using our new model.

I randomly take this diamond in this example:

##       carat       cut color clarity depth table price    x    y    z
## 13696     1 Very Good     G     VS2  61.8    59  5600 6.29 6.37 3.91

Here is the code to get the predicted (fitted) price of this diamond by using the linear model:

thisDiamond <- data.frame(carat=1, cut='Very Good',
                          color='G',clarity='VS2')
modelEstimate <- predict(m5,newdata = thisDiamond,
                         interval = "prediction",level = .95)
10^modelEstimate
##        fit      lwr      upr
## 1 5232.111 4065.993 6732.668

The predicted price is $5,232.11 vs. actual price $5,600.