“A diamond is forever” - Frances Gerety
ggplot(diamonds,aes(x=carat,y=price))+
geom_point(color='blue',fill='blue')+
xlim(0,quantile(diamonds$carat,0.99))+
ylim(0,quantile(diamonds$price,0.99))+
ggtitle('Diamond price vs. carat')
it looks like the price increases kind of exponentially against carat, but it gets diverse when the carat increases.
set.seed(20022012)
diamonds_samp <- diamonds[sample(1:length(diamonds$price),10000),]
ggpairs(diamonds_samp,params = c(shape=I('.'),outlier.shape=I('.')))
We notice the followings:
plot1 <- ggplot(diamonds,aes(x=price))+
geom_histogram(color='blue',fill = 'blue',binwidth=100)+
scale_x_continuous(breaks=seq(300,19000,1000),limit=c(300,19000))+
ggtitle('Price')
plot2 <- ggplot(diamonds,aes(x=price))+
geom_histogram(color='red',fill='red',binwidth=0.01)+
scale_x_log10(breaks=seq(300,19000,1000),limit=c(300,19000))+
ggtitle('Price(log10)')
grid.arrange(plot1,plot2,ncol=2)
It’s obviously that the price histogram is skewed to the left side while the log10(price) tends to bell curve distributed. Also, the two peaks in the log10(price) plot coincides with the 1st and 3rd quantile of price.
### Create a new function to transform the carat variable
cuberoot_trans = function() trans_new('cuberoot',
transform = function(x) x^(1/3),
inverse = function(x) x^3)
### Use the cuberoot_trans function
ggplot(aes(carat, price), data = diamonds) +
geom_point(color='blue',fill='blue',alpha=1/2,size=1,position = 'jitter') +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
Now the log10(price) is almost linear with cuberoot of carat - we can move on to the modeling.
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter',aes(color=clarity)) +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
It’s clear that Clarity factors into the diamond price - a better clarity almost always has higher price than lower end clarity.
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter',aes(color=cut)) +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Cut', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Cut')
While cut plot does not show as obvious pattern as Clarity, it’s still clear that with the smae carat the diamonds with the best cut are priced higher. Hence, I think cut should be also included in the price prediction algorithm.
ggplot(aes(x = carat, y = price), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter',aes(color=color)) +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Color', reverse = F,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Color')
This looks similar with previous Clarity plot and Color should be also considered as an factor for price.
m1 <- lm(I(log10(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1,~ . +carat)
m3 <- update(m2,~ . +cut)
m4 <- update(m3,~ . +color)
m5 <- update(m4,~ . +clarity)
mtable(m1,m2,m3,m4,m5)
##
## Calls:
## m1: lm(formula = I(log10(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log10(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log10(price)) ~ I(carat^(1/3)) + carat + cut,
## data = diamonds)
## m4: lm(formula = I(log10(price)) ~ I(carat^(1/3)) + carat + cut +
## color, data = diamonds)
## m5: lm(formula = I(log10(price)) ~ I(carat^(1/3)) + carat + cut +
## color + clarity, data = diamonds)
##
## ===========================================================================
## m1 m2 m3 m4 m5
## ---------------------------------------------------------------------------
## (Intercept) 1.225*** 0.451*** 0.380*** 0.405*** 0.180***
## (0.003) (0.008) (0.008) (0.007) (0.004)
## I(carat^(1/3)) 2.414*** 3.721*** 3.780*** 3.665*** 3.971***
## (0.003) (0.014) (0.013) (0.012) (0.007)
## carat -0.494*** -0.505*** -0.431*** -0.474***
## (0.005) (0.005) (0.004) (0.003)
## cut: .L 0.097*** 0.097*** 0.052***
## (0.002) (0.002) (0.001)
## cut: .Q -0.027*** -0.027*** -0.013***
## (0.002) (0.001) (0.001)
## cut: .C 0.022*** 0.022*** 0.006***
## (0.001) (0.001) (0.001)
## cut: ^4 0.008*** 0.008*** -0.001
## (0.001) (0.001) (0.001)
## color: .L -0.162*** -0.191***
## (0.001) (0.001)
## color: .Q -0.056*** -0.040***
## (0.001) (0.001)
## color: .C 0.001 -0.006***
## (0.001) (0.001)
## color: ^4 0.012*** 0.005***
## (0.001) (0.001)
## color: ^5 -0.007*** -0.001*
## (0.001) (0.001)
## color: ^6 -0.010*** 0.001
## (0.001) (0.001)
## clarity: .L 0.394***
## (0.001)
## clarity: .Q -0.104***
## (0.001)
## clarity: .C 0.057***
## (0.001)
## clarity: ^4 -0.027***
## (0.001)
## clarity: ^5 0.011***
## (0.001)
## clarity: ^6 -0.001
## (0.001)
## clarity: ^7 0.014***
## (0.001)
## ---------------------------------------------------------------------------
## R-squared 0.924 0.935 0.939 0.951 0.984
## adj. R-squared 0.924 0.935 0.939 0.951 0.984
## sigma 0.122 0.112 0.109 0.097 0.056
## F 652012.063 387489.366 138654.523 87959.467 173791.084
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood 37025.211 41356.392 43150.294 49222.951 79078.982
## Deviance 800.248 681.522 637.665 509.103 168.282
## AIC -74044.422 -82704.783 -86284.589 -98417.901 -158115.964
## BIC -74017.735 -82669.201 -86213.424 -98293.362 -157929.156
## N 53940 53940 53940 53940 53940
## ===========================================================================
\[Log(price) = 0.18+3.97carat^{1/3}-0.474carat+pc_5 * cut_{coef}+pc_7 * color_{coef}+pc_8 * clarity_{coef}\]
where \(pc_5\), \(pc_7\) and \(pc_8\) are polynomial contrast with n=5, 7, 8, respectively. You can check the detail by applying \(contr.poly(n)\).
I randomly take this diamond in this example:
## carat cut color clarity depth table price x y z
## 13696 1 Very Good G VS2 61.8 59 5600 6.29 6.37 3.91
Here is the code to get the predicted (fitted) price of this diamond by using the linear model:
thisDiamond <- data.frame(carat=1, cut='Very Good',
color='G',clarity='VS2')
modelEstimate <- predict(m5,newdata = thisDiamond,
interval = "prediction",level = .95)
10^modelEstimate
## fit lwr upr
## 1 5232.111 4065.993 6732.668
The predicted price is $5,232.11 vs. actual price $5,600.