Diamond Prices

The price of a jewelry diamond depends on the 4 C’s: Carat weight, Cut, Clarity, and Color. We have collected a sample of diamond prices from a website, together with other related variables, and we want to investigate how the price of a diamond is determined by its attributes.

Let’s first read the dataset into R.
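
Since the original import call is not shown here, the line below is a minimal sketch; the file name diamonds.csv is a placeholder for wherever the data actually live:

x <- read.csv('diamonds.csv', stringsAsFactors = TRUE) # placeholder file name; factors for Color, Clarity, Cut, Report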

Exploring the data

str(x)
## 'data.frame':    2690 obs. of  8 variables:
##  $ Carat.Size: num  0.3 0.44 0.31 0.66 0.47 0.4 0.36 0.52 0.53 0.43 ...
##  $ Color     : Factor w/ 8 levels "D","E","F","G",..: 2 2 2 8 5 4 1 5 1 3 ...
##  $ Clarity   : Factor w/ 7 levels "IF","SI1","SI2",..: 6 5 6 2 5 4 5 3 3 5 ...
##  $ Depth     : num  60 61.9 61.3 62.8 59.1 62 61.3 61.7 59.4 61.5 ...
##  $ Table     : int  59 58 58 57 64 59 57 61 59 60 ...
##  $ Cut       : Factor w/ 4 levels "Excellent","Good",..: 1 1 1 1 4 1 1 4 4 1 ...
##  $ Report    : Factor w/ 2 levels "AGS","GIA": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Price     : int  1000 1000 1000 1000 1000 1000 1000 1000 1001 1001 ...

As we can see, there are 2690 observations of 8 variables. Some variables are integers (int) (e.g., Price), some are numeric (num) (e.g., Carat.Size), and some are categorical (Factor) (e.g., Color).

Let’s first consider modeling the relationship between Price and Carat Size. As the scatterplots below show, the relationship is noticeably curved on the original scale, so we will use \(\log_{10}(Price)\) as the response variable to make it more linear.

par(mfrow = c(1,2))
plot(x$Carat.Size, x$Price, xlab = 'Carat Size', ylab = 'Price')
plot(x$Carat.Size, log10(x$Price), xlab = 'Carat Size', ylab = 'Log10(Price)')

x$log10Price <- log10(x$Price) # add a new variable/column with the transformed values

We now want to add Color to the model as a predictor. To see whether simple indicator terms for Color are appropriate, we plot log10 Price against Carat Size with points labelled by their color levels:

library(ggplot2)
ggplot(x, aes(Carat.Size, log10Price, color = Color)) + geom_point()

As we can see, diamonds with different color levels have similar relationships between price and carat size (the slopes are similar). We can therefore use 7 indicators to represent this 8-level categorical predictor. Note that we do not need to construct the indicators ourselves, because R has already recognized Color as a factor variable.
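
If you are curious about the coding R will use, a quick optional check is to build the design matrix for Color alone:

head(model.matrix(~ Color, data = x)) # intercept plus one 0/1 indicator per non-baseline color level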

Let’s regress log10 Price on Carat Size and Color in R as follows:

m <- lm(log10Price ~ Carat.Size + Color, data = x)

and the regression output is given below:

summary(m)
## 
## Call:
## lm(formula = log10Price ~ Carat.Size + Color, data = x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41550 -0.07547 -0.00788  0.07027  0.41063 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  2.809895   0.008298 338.603  < 2e-16 ***
## Carat.Size   0.911400   0.007609 119.783  < 2e-16 ***
## ColorE      -0.032436   0.008164  -3.973 7.29e-05 ***
## ColorF      -0.045008   0.008432  -5.338 1.02e-07 ***
## ColorG      -0.057093   0.008724  -6.545 7.12e-11 ***
## ColorH      -0.105930   0.008858 -11.958  < 2e-16 ***
## ColorI      -0.166000   0.009468 -17.533  < 2e-16 ***
## ColorJ      -0.236116   0.010036 -23.526  < 2e-16 ***
## ColorK      -0.297170   0.012524 -23.729  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1091 on 2681 degrees of freedom
## Multiple R-squared:  0.8556, Adjusted R-squared:  0.8552 
## F-statistic:  1986 on 8 and 2681 DF,  p-value: < 2.2e-16

The model looks good in general: it has R² = 85.56%, and both the F-test and all the t-tests are significant at the 5% significance level. So the model is statistically useful, and each predictor contributes significantly given the other predictors already in the model.
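
If we want these quantities programmatically rather than reading them off the printout, they can be extracted from the summary object:

s <- summary(m)
s$r.squared  # 0.8556
s$fstatistic # F = 1986 on 8 and 2681 degrees of freedom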

Note that R takes Color D as the baseline level and therefore yields coefficient estimates for Colors E, F, G, H, I, J, and K.

How do we interpret the estimate (-0.297) for ColorK?

Holding the other predictors constant, the mean log10 Price of diamonds with K-level color is 0.297 units lower than that of diamonds with D-level color.

But, what does this 0.297 mean on the Price scale?

We can see it from the following equation:

\[\log_{10}(Price_{D}) - \log_{10}(Price_{K}) = 0.297\] \[\log_{10}\frac{Price_{D}}{Price_{K}} = 0.297\] \[\frac{Price_{D}}{Price_{K}} = 10^{0.297} \approx 1.982\]

We therefore conclude that diamonds with D-level color cost, on average, almost twice as much as those with K-level color, after accounting for the other predictors in the model.
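
As a quick sanity check of this back-transformation in R:

10^(-coef(m)['ColorK']) # about 1.98, the D-to-K price ratio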

It is easy to add interaction terms to the regression model in R. For example, we may add the interaction between Carat Size and Color to the previous model as follows:

summary(lm(log10Price ~ Carat.Size + Color + Carat.Size*Color, data = x))
## 
## Call:
## lm(formula = log10Price ~ Carat.Size + Color + Carat.Size * Color, 
##     data = x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.35334 -0.06863 -0.00336  0.06774  0.39073 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        2.7276231  0.0180705 150.943  < 2e-16 ***
## Carat.Size         1.0345056  0.0253369  40.830  < 2e-16 ***
## ColorE            -0.0262055  0.0224359  -1.168  0.24290    
## ColorF            -0.0266150  0.0232569  -1.144  0.25256    
## ColorG            -0.0007409  0.0247696  -0.030  0.97614    
## ColorH             0.0009724  0.0271573   0.036  0.97144    
## ColorI             0.0560934  0.0287253   1.953  0.05095 .  
## ColorJ            -0.0354357  0.0308309  -1.149  0.25051    
## ColorK             0.0820540  0.0413955   1.982  0.04756 *  
## Carat.Size:ColorE -0.0113356  0.0312522  -0.363  0.71685    
## Carat.Size:ColorF -0.0385720  0.0311848  -1.237  0.21624    
## Carat.Size:ColorG -0.0941868  0.0310566  -3.033  0.00245 ** 
## Carat.Size:ColorH -0.1485335  0.0324056  -4.584 4.78e-06 ***
## Carat.Size:ColorI -0.2548668  0.0324612  -7.851 5.90e-15 ***
## Carat.Size:ColorJ -0.2310598  0.0335320  -6.891 6.90e-12 ***
## Carat.Size:ColorK -0.3801792  0.0401497  -9.469  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.105 on 2674 degrees of freedom
## Multiple R-squared:  0.8667, Adjusted R-squared:  0.8659 
## F-statistic:  1159 on 15 and 2674 DF,  p-value: < 2.2e-16

The interaction terms do not seem to improve the model much: the R² is almost the same (0.8667 vs 0.8556), the t-tests for the Color main effects become insignificant, and although several of the interaction terms are individually significant, the overall fit gains little. The model does not benefit enough from the extra complexity, so it is not wise to add the interaction terms in this scenario.
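
As a side note on R’s formula syntax, the * operator already expands to both main effects plus their interaction, so the interaction model above can be written more compactly; the two calls fit the identical model:

summary(lm(log10Price ~ Carat.Size*Color, data = x)) # '*' expands to Carat.Size + Color + Carat.Size:Color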

Now we want to add more predictors (not interaction terms) to the model in order to improve its R². We have four variables left in the pool: Table, Depth, Clarity, and Cut. Let’s first see whether the two numeric variables are linearly related to log10 Price:

par(mfrow = c(1,2))
plot(log10Price ~ as.factor(Table), data = x, xlab = 'Table', ylab = 'Log10(Price)')
plot(log10Price ~ Depth, data = x, xlab = 'Depth', ylab = 'Log10(Price)')

As we can see, neither plot shows any linear pattern, so we do not need those variables in the model. How about the two categorical variables, Clarity and Cut? The new multiple regression model is fitted as follows:

m2 <- lm(log10Price ~ Carat.Size + Color + Clarity + Cut, data = x)
summary(m2)
## 
## Call:
## lm(formula = log10Price ~ Carat.Size + Color + Clarity + Cut, 
##     data = x)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.41313 -0.05609  0.00613  0.05505  0.26598 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.841742   0.009235 307.711  < 2e-16 ***
## Carat.Size    1.022604   0.006685 152.962  < 2e-16 ***
## ColorE       -0.027876   0.006428  -4.337 1.50e-05 ***
## ColorF       -0.042568   0.006640  -6.411 1.70e-10 ***
## ColorG       -0.061608   0.006871  -8.966  < 2e-16 ***
## ColorH       -0.097171   0.006978 -13.925  < 2e-16 ***
## ColorI       -0.169393   0.007451 -22.735  < 2e-16 ***
## ColorJ       -0.250721   0.007929 -31.621  < 2e-16 ***
## ColorK       -0.336113   0.009895 -33.969  < 2e-16 ***
## ClaritySI1   -0.165888   0.008363 -19.837  < 2e-16 ***
## ClaritySI2   -0.234767   0.008676 -27.059  < 2e-16 ***
## ClarityVS1   -0.077814   0.008536  -9.116  < 2e-16 ***
## ClarityVS2   -0.111030   0.008469 -13.110  < 2e-16 ***
## ClarityVVS1  -0.045498   0.008850  -5.141 2.93e-07 ***
## ClarityVVS2  -0.033691   0.008932  -3.772 0.000165 ***
## CutGood      -0.026568   0.007134  -3.724 0.000200 ***
## CutIdeal      0.020512   0.006785   3.023 0.002525 ** 
## CutVery Good -0.009481   0.003628  -2.613 0.009023 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08546 on 2672 degrees of freedom
## Multiple R-squared:  0.9118, Adjusted R-squared:  0.9112 
## F-statistic:  1625 on 17 and 2672 DF,  p-value: < 2.2e-16

As we can see, the R² improves substantially (from 0.8556 to 0.9118), the overall F-test is highly significant, and all the t-tests are significant.

Let’s not forget to check the regression assumptions, although the model “looks wonderful”. We examine the plot of the residuals against the fitted values:

plot(m2$fitted.values, m2$residuals, xlab = 'Fitted', ylab = 'Residuals')
abline(0,0)

We see an obvious bend that needs to be corrected. We will add a squared term of Carat Size to the model:

m2_tmp <- lm(log10Price ~ Carat.Size + I(Carat.Size^2) + Color + Clarity + Cut, data = x)
summary(m2_tmp)
## 
## Call:
## lm(formula = log10Price ~ Carat.Size + I(Carat.Size^2) + Color + 
##     Clarity + Cut, data = x)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.188910 -0.031594 -0.002267  0.028456  0.250823 
## 
## Coefficients:
##                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      2.428435   0.007639 317.918  < 2e-16 ***
## Carat.Size       2.175685   0.015916 136.695  < 2e-16 ***
## I(Carat.Size^2) -0.603975   0.008095 -74.613  < 2e-16 ***
## ColorE          -0.031737   0.003661  -8.669  < 2e-16 ***
## ColorF          -0.058101   0.003787 -15.342  < 2e-16 ***
## ColorG          -0.090627   0.003933 -23.044  < 2e-16 ***
## ColorH          -0.132862   0.004003 -33.192  < 2e-16 ***
## ColorI          -0.188967   0.004251 -44.448  < 2e-16 ***
## ColorJ          -0.260852   0.004518 -57.740  < 2e-16 ***
## ColorK          -0.334487   0.005635 -59.356  < 2e-16 ***
## ClaritySI1      -0.235769   0.004854 -48.573  < 2e-16 ***
## ClaritySI2      -0.297411   0.005012 -59.340  < 2e-16 ***
## ClarityVS1      -0.127754   0.004907 -26.034  < 2e-16 ***
## ClarityVS2      -0.176493   0.004902 -36.002  < 2e-16 ***
## ClarityVVS1     -0.045231   0.005041  -8.974  < 2e-16 ***
## ClarityVVS2     -0.077508   0.005121 -15.137  < 2e-16 ***
## CutGood         -0.038191   0.004066  -9.393  < 2e-16 ***
## CutIdeal         0.012236   0.003866   3.165  0.00157 ** 
## CutVery Good    -0.013193   0.002067  -6.383 2.04e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04867 on 2671 degrees of freedom
## Multiple R-squared:  0.9714, Adjusted R-squared:  0.9712 
## F-statistic:  5041 on 18 and 2671 DF,  p-value: < 2.2e-16

Is there anything wrong with the model?

Let’s check whether there is any collinearity in the model:

library(car)
## Loading required package: carData
vif(m2_tmp)
##                      GVIF Df GVIF^(1/(2*Df))
## Carat.Size      29.854176  1        5.463897
## I(Carat.Size^2) 27.362609  1        5.230928
## Color            1.495334  7        1.029156
## Clarity          1.583478  6        1.039045
## Cut              1.094026  3        1.015090

The VIFs for Carat Size and its squared term are much larger than those for the other variables. This is not surprising, because Carat Size is highly correlated with its squared term:

cor(x$Carat.Size, (x$Carat.Size)^2)
## [1] 0.9791708

To fix this issue, we first center Carat Size and then square it:

x$Carat.Size2 <- (x$Carat.Size - mean(x$Carat.Size))^2 #adding a new variable to my data frame
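
A quick check of the fix (the exact value depends on the data, but it should be far below the 0.979 we saw above):

cor(x$Carat.Size, x$Carat.Size2) # much weaker correlation after centering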

Let’s fit the model with the centered squared term and check its collinearity:

m3 <- lm(log10Price ~ Carat.Size + Carat.Size2 + Color + Clarity + Cut, data = x)
vif(m3)
##                 GVIF Df GVIF^(1/(2*Df))
## Carat.Size  1.928771  1        1.388802
## Carat.Size2 1.209910  1        1.099959
## Color       1.495334  7        1.029156
## Clarity     1.583478  6        1.039045
## Cut         1.094026  3        1.015090

Now the predictors are largely uncorrelated with each other. The regression output of m3 is given below:

summary(m3)
## 
## Call:
## lm(formula = log10Price ~ Carat.Size + Carat.Size2 + Color + 
##     Clarity + Cut, data = x)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.188910 -0.031594 -0.002267  0.028456  0.250823 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   2.885693   0.005292 545.249  < 2e-16 ***
## Carat.Size    1.124643   0.004046 277.992  < 2e-16 ***
## Carat.Size2  -0.603975   0.008095 -74.613  < 2e-16 ***
## ColorE       -0.031737   0.003661  -8.669  < 2e-16 ***
## ColorF       -0.058101   0.003787 -15.342  < 2e-16 ***
## ColorG       -0.090627   0.003933 -23.044  < 2e-16 ***
## ColorH       -0.132862   0.004003 -33.192  < 2e-16 ***
## ColorI       -0.188967   0.004251 -44.448  < 2e-16 ***
## ColorJ       -0.260852   0.004518 -57.740  < 2e-16 ***
## ColorK       -0.334487   0.005635 -59.356  < 2e-16 ***
## ClaritySI1   -0.235769   0.004854 -48.573  < 2e-16 ***
## ClaritySI2   -0.297411   0.005012 -59.340  < 2e-16 ***
## ClarityVS1   -0.127754   0.004907 -26.034  < 2e-16 ***
## ClarityVS2   -0.176493   0.004902 -36.002  < 2e-16 ***
## ClarityVVS1  -0.045231   0.005041  -8.974  < 2e-16 ***
## ClarityVVS2  -0.077508   0.005121 -15.137  < 2e-16 ***
## CutGood      -0.038191   0.004066  -9.393  < 2e-16 ***
## CutIdeal      0.012236   0.003866   3.165  0.00157 ** 
## CutVery Good -0.013193   0.002067  -6.383 2.04e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.04867 on 2671 degrees of freedom
## Multiple R-squared:  0.9714, Adjusted R-squared:  0.9712 
## F-statistic:  5041 on 18 and 2671 DF,  p-value: < 2.2e-16

It is interesting to see that the coefficient of Carat.Size is far more precisely estimated in m3 than in m2_tmp: its standard error is greatly reduced once the collinearity is gone. The intercept and the Carat.Size coefficient themselves change (centering redefines the squared term), but the other coefficients and the overall fit statistics remain exactly the same.
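
We can pull the two standard errors out directly to see the reduction:

summary(m2_tmp)$coefficients['Carat.Size', 'Std. Error'] # 0.0159
summary(m3)$coefficients['Carat.Size', 'Std. Error']     # 0.0040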

Again, don’t forget to check the regression assumptions, although the model seems to be fantastic. Let’s produce the residual plot and the histogram of residuals:

par(mfrow = c(1,2))
plot(m3$fitted.values, m3$residuals, xlab = 'fitted', ylab = 'Residuals')
hist(m3$residuals, main = 'Histogram of Residuals')

The residual plot looks reasonably random, and the histogram is unimodal and approximately symmetric. The regression assumptions therefore appear to be satisfied for this model.
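
As an optional, sharper check of normality than the histogram, we can also draw a normal Q-Q plot of the residuals:

qqnorm(m3$residuals)
qqline(m3$residuals) # points should hug this line if the residuals are roughly normal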

Let’s check if there is any outlier or high-leverage case in the dataset. We compute the standardized residuals and leverages, and produce their histograms as follows:

std.res <- rstandard(m3)
lev <- hatvalues(m3)

par(mfrow = c(1,2))
hist(std.res, xlab = 'Standardized Residuals', main = "")
hist(lev, xlab = 'Leverages', main = "")

We can see a few cases that stand out from the rest in the histograms. Let’s identify the outliers whose standardized residuals fall outside (-3, 3):

x[which(abs(std.res) > 3), c(9, 2, 1, 3, 6)]
##      log10Price Color Carat.Size Clarity       Cut
## 654    3.254790     K       0.80    VVS1 Excellent
## 672    3.255514     E       0.69     VS1      Good
## 1232   3.500785     K       0.90     VS2      Good
## 1257   3.500785     K       0.90     VS2      Good
## 1407   3.574957     E       0.56      IF Excellent
## 2039   3.753277     H       1.01     SI2 Excellent
## 2097   3.774298     H       1.06     SI2 Excellent
## 2123   3.777717     J       1.30      IF      Good
## 2553   3.940666     K       2.00     SI2      Good
## 2580   3.953083     F       1.02     VS2 Excellent
## 2596   3.964590     K       2.02     SI2 Very Good
## 2621   3.976258     J       1.51     SI1 Very Good
## 2637   3.976992     D       0.83      IF     Ideal
## 2655   3.988024     E       1.00     VS1 Very Good
## 2678   3.999305     J       2.02     SI2 Very Good

Regarding the leverages, a common rule is to flag any case whose leverage is more than 3 times the mean leverage value, i.e., \(h_{ii} > \frac{3(k+1)}{n}\). Let’s use this cutoff to identify the cases with large leverages:

lev.cut <- 3*length(m3$coefficients)/dim(x)[1]
x[which(lev > lev.cut), c(9, 2, 1, 3, 6)]
##      log10Price Color Carat.Size Clarity       Cut
## 2023   3.744058     K       1.16      IF      Good
## 2397   3.881955     K       1.79     SI2 Excellent
## 2402   3.882297     K       1.78     VS2 Very Good
## 2429   3.895975     K       1.82     SI1 Excellent
## 2553   3.940666     K       2.00     SI2      Good
## 2587   3.953421     K       1.76    VVS2 Very Good
## 2596   3.964590     K       2.02     SI2 Very Good
## 2678   3.999305     J       2.02     SI2 Very Good

It is interesting to note that three cases (2553, 2596, and 2678) are both outliers and high-leverage points. We should investigate those extreme cases further to understand what makes them special.

Are the extreme cases we found above influential for the regression model?

Let’s compute Cook’s distance and DFFITS measure for each case, and plot them in boxplots as follows:

cookd <- cooks.distance(m3)
dffit <- dffits(m3)

par(mfrow = c(1,2))
boxplot(cookd, xlab = "Cook's distance")
boxplot(dffit, xlab = 'DFFITS')

From the boxplot of Cook’s distance we see that three cases stand out from the others. Let’s identify them:

sort(cookd, decreasing = TRUE)[1:3]
##       2553       2596       2678 
## 0.06742104 0.06423002 0.04266862
x[cookd > 0.03, c(9, 2, 1, 3, 6)]
##      log10Price Color Carat.Size Clarity       Cut
## 2553   3.940666     K       2.00     SI2      Good
## 2596   3.964590     K       2.02     SI2 Very Good
## 2678   3.999305     J       2.02     SI2 Very Good

These are the same outliers identified by the standardized residuals. One common cutoff for large DFFITS values is \[2\sqrt{\frac{k+1}{n-k-1}}\]

dffit.cut <- 2*sqrt(length(m3$coefficients)/(dim(x)[1] - length(m3$coefficients)))

In this large dataset, however, this cutoff flags far too many cases as “influential” to be useful:

which(abs(dffit) > dffit.cut)
##    9   13  111  155  225  290  299  337  379  386  426  512  582  614  654  661 
##    9   13  111  155  225  290  299  337  379  386  426  512  582  614  654  661 
##  672  680  681  714  729  748  772  804  846  948  968  969  985 1005 1042 1049 
##  672  680  681  714  729  748  772  804  846  948  968  969  985 1005 1042 1049 
## 1060 1061 1094 1113 1129 1147 1174 1180 1183 1197 1232 1257 1259 1279 1284 1292 
## 1060 1061 1094 1113 1129 1147 1174 1180 1183 1197 1232 1257 1259 1279 1284 1292 
## 1293 1325 1338 1368 1375 1387 1401 1407 1412 1457 1529 1573 1581 1600 1609 1619 
## 1293 1325 1338 1368 1375 1387 1401 1407 1412 1457 1529 1573 1581 1600 1609 1619 
## 1626 1690 1813 1869 1939 1956 1973 1987 2015 2016 2039 2048 2082 2097 2123 2155 
## 1626 1690 1813 1869 1939 1956 1973 1987 2015 2016 2039 2048 2082 2097 2123 2155 
## 2161 2164 2191 2206 2220 2266 2268 2271 2276 2277 2278 2290 2304 2316 2320 2323 
## 2161 2164 2191 2206 2220 2266 2268 2271 2276 2277 2278 2290 2304 2316 2320 2323 
## 2327 2334 2341 2359 2369 2380 2382 2394 2397 2399 2408 2420 2424 2429 2440 2452 
## 2327 2334 2341 2359 2369 2380 2382 2394 2397 2399 2408 2420 2424 2429 2440 2452 
## 2488 2493 2505 2507 2511 2518 2521 2527 2529 2530 2531 2532 2533 2536 2544 2553 
## 2488 2493 2505 2507 2511 2518 2521 2527 2529 2530 2531 2532 2533 2536 2544 2553 
## 2561 2567 2570 2580 2583 2584 2585 2595 2596 2602 2606 2609 2617 2619 2621 2631 
## 2561 2567 2570 2580 2583 2584 2585 2595 2596 2602 2606 2609 2617 2619 2621 2631 
## 2634 2637 2639 2644 2645 2652 2655 2658 2672 2677 2678 2679 2680 
## 2634 2637 2639 2644 2645 2652 2655 2658 2672 2677 2678 2679 2680

We therefore just focus on the cases that stand out in the boxplot:

x[dffit > 0.5, c(9, 2, 1, 3, 6)]
##      log10Price Color Carat.Size Clarity       Cut
## 2553   3.940666     K       2.00     SI2      Good
## 2596   3.964590     K       2.02     SI2 Very Good
## 2678   3.999305     J       2.02     SI2 Very Good

These are the same influential cases reflected by Cook’s distance.

So, what is so special about those diamonds?

The residuals for those diamonds are positive, which means their prices are much higher than the estimates given by the model. After a closer look, we find that the price of diamond 2553 is about average, but its color is the worst, its cut is the worst, its clarity is mediocre, and its size is above average. Diamond 2596’s price is about average, its color is the worst, its cut is poor, its clarity is mediocre, and its size is above average. Diamond 2678 fits much the same description as the other two, except that its color is slightly better, though still poor.
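
A small sketch to back up the claim about positive residuals: line up the actual and fitted log10 prices for the three diamonds (row indices as identified above):

idx <- c(2553, 2596, 2678)
cbind(actual = x$log10Price[idx], fitted = fitted(m3)[idx], residual = resid(m3)[idx])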

Because of their influence, as reflected in Cook’s distance and the DFFITS measure, we will remove these outliers.

new_x <- x[-c(2553, 2596, 2678), ]

We are lazy and want to use stepwise regression to help us find a good model, given all the potential predictors. We first need to set up a “null” (simplest possible) model and a “full” (most complicated possible) model:

null <- lm(log10Price ~ 1, data = new_x)
full <- lm(log10Price ~ Carat.Size + Carat.Size2 + Color + Clarity + Cut + Depth + Table, data = new_x)

We should be aware that the stepwise regression implemented in R is a variant of the original stepwise method: at each step it uses AIC, rather than t-tests, to select the predictors.

It is thus a fully automatic procedure, and we do not need to choose the entry or removal significance levels required in the original method. Let’s run it in R:

step.reg <- step(null, scope=list(lower=null, upper=full), test = 'F')
## Start:  AIC=-6714.69
## log10Price ~ 1
## 
##               Df Sum of Sq     RSS      AIC    F value Pr(>F)    
## + Carat.Size   1   175.576  45.053 -10981.4 10463.6374 <2e-16 ***
## + Color        7    18.244 202.385  -6932.6    34.5002 <2e-16 ***
## + Clarity      6    15.562 205.068  -6899.2    33.8953 <2e-16 ***
## + Carat.Size2  1     0.211 220.419  -6715.3     2.5644 0.1094    
## <none>                     220.629  -6714.7                      
## + Cut          3     0.400 220.229  -6713.6     1.6241 0.1817    
## + Depth        1     0.024 220.605  -6713.0     0.2965 0.5861    
## + Table        1     0.009 220.620  -6712.8     0.1144 0.7352    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=-10981.36
## log10Price ~ Carat.Size
## 
##               Df Sum of Sq     RSS      AIC    F value    Pr(>F)    
## + Color        7    13.605  31.448 -11933.4   165.5104 < 2.2e-16 ***
## + Clarity      6     9.347  35.707 -11594.1   116.8784 < 2.2e-16 ***
## + Carat.Size2  1     7.573  37.480 -11473.9   542.3524 < 2.2e-16 ***
## + Cut          3     1.859  43.194 -11088.6    38.4863 < 2.2e-16 ***
## + Table        1     0.349  44.704 -11000.3    20.9543 4.918e-06 ***
## + Depth        1     0.135  44.918 -10987.4     8.0706  0.004533 ** 
## <none>                      45.053 -10981.4                         
## - Carat.Size   1   175.576 220.629  -6714.7 10463.6374 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=-11933.36
## log10Price ~ Carat.Size + Color
## 
##               Df Sum of Sq     RSS      AIC    F value    Pr(>F)    
## + Clarity      6    12.158  19.290 -13234.6   280.6670 < 2.2e-16 ***
## + Carat.Size2  1     7.381  24.067 -12650.1   821.0297 < 2.2e-16 ***
## + Cut          3     1.419  30.029 -12051.4    42.1288 < 2.2e-16 ***
## + Table        1     0.232  31.216 -11951.2    19.8868 8.555e-06 ***
## + Depth        1     0.054  31.394 -11936.0     4.5864   0.03232 *  
## <none>                      31.448 -11933.4                         
## - Color        7    13.605  45.053 -10981.4   165.5104 < 2.2e-16 ***
## - Carat.Size   1   170.937 202.385  -6932.6 14556.3632 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=-13234.57
## log10Price ~ Carat.Size + Color + Clarity
## 
##               Df Sum of Sq     RSS      AIC    F value    Pr(>F)    
## + Carat.Size2  1    12.819   6.472 -16167.2  5290.4341 < 2.2e-16 ***
## + Cut          3     0.211  19.079 -13258.2     9.8575 1.832e-06 ***
## + Table        1     0.060  19.231 -13240.9     8.2780  0.004045 ** 
## + Depth        1     0.036  19.255 -13237.5     4.9456  0.026241 *  
## <none>                      19.290 -13234.6                         
## - Clarity      6    12.158  31.448 -11933.4   280.6670 < 2.2e-16 ***
## - Color        7    16.416  35.707 -11594.1   324.8378 < 2.2e-16 ***
## - Carat.Size   1   172.692 191.982  -7062.4 23920.3139 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=-16167.19
## log10Price ~ Carat.Size + Color + Clarity + Carat.Size2
## 
##               Df Sum of Sq     RSS      AIC   F value    Pr(>F)    
## + Cut          3     0.335   6.137 -16304.0    48.547 < 2.2e-16 ***
## + Depth        1     0.061   6.411 -16190.6    25.350 5.101e-07 ***
## + Table        1     0.034   6.438 -16179.2    13.942 0.0001925 ***
## <none>                       6.472 -16167.2                        
## - Carat.Size2  1    12.819  19.290 -13234.6  5290.434 < 2.2e-16 ***
## - Color        7    16.565  23.037 -12769.6   976.674 < 2.2e-16 ***
## - Clarity      6    17.595  24.067 -12650.1  1210.285 < 2.2e-16 ***
## - Carat.Size   1   184.715 191.187  -7071.6 76234.532 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=-16304.01
## log10Price ~ Carat.Size + Color + Clarity + Carat.Size2 + Cut
## 
##               Df Sum of Sq     RSS      AIC    F value    Pr(>F)    
## + Depth        1     0.041   6.096 -16320.1    18.0606 2.213e-05 ***
## <none>                       6.137 -16304.0                         
## + Table        1     0.001   6.136 -16302.6     0.5565    0.4557    
## - Cut          3     0.335   6.472 -16167.2    48.5467 < 2.2e-16 ***
## - Carat.Size2  1    12.942  19.079 -13258.2  5626.6949 < 2.2e-16 ***
## - Clarity      6    15.827  21.964 -12889.8  1146.8382 < 2.2e-16 ***
## - Color        7    16.240  22.376 -12841.8  1008.6099 < 2.2e-16 ***
## - Carat.Size   1   183.124 189.261  -7092.8 79613.9365 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=-16320.14
## log10Price ~ Carat.Size + Color + Clarity + Carat.Size2 + Cut + 
##     Depth
## 
##               Df Sum of Sq     RSS      AIC    F value    Pr(>F)    
## + Table        1     0.005   6.090 -16320.4     2.2048    0.1377    
## <none>                       6.096 -16320.1                         
## - Depth        1     0.041   6.137 -16304.0    18.0606 2.213e-05 ***
## - Cut          3     0.315   6.411 -16190.6    46.0002 < 2.2e-16 ***
## - Carat.Size2  1    12.961  19.057 -13259.3  5670.9311 < 2.2e-16 ***
## - Clarity      6    15.854  21.950 -12889.5  1156.1483 < 2.2e-16 ***
## - Color        7    16.173  22.269 -12852.8  1010.9158 < 2.2e-16 ***
## - Carat.Size   1   183.127 189.223  -7091.3 80124.5309 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Step:  AIC=-16320.36
## log10Price ~ Carat.Size + Color + Clarity + Carat.Size2 + Cut + 
##     Depth + Table
## 
##               Df Sum of Sq     RSS      AIC    F value    Pr(>F)    
## <none>                       6.090 -16320.4                         
## - Table        1     0.005   6.096 -16320.1     2.2048    0.1377    
## - Depth        1     0.045   6.136 -16302.6    19.7132 9.364e-06 ***
## - Cut          3     0.221   6.312 -16230.6    32.2563 < 2.2e-16 ***
## - Carat.Size2  1    12.934  19.025 -13261.8  5661.7144 < 2.2e-16 ***
## - Clarity      6    15.855  21.946 -12888.1  1156.7094 < 2.2e-16 ***
## - Color        7    16.140  22.230 -12855.4  1009.2741 < 2.2e-16 ***
## - Carat.Size   1   183.127 189.217  -7089.4 80160.3819 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model selected by the stepwise regression is:

summary(step.reg)
## 
## Call:
## lm(formula = log10Price ~ Carat.Size + Color + Clarity + Carat.Size2 + 
##     Cut + Depth + Table, data = new_x)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.185304 -0.031594 -0.001351  0.029559  0.171255 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   3.1813410  0.0770819  41.272  < 2e-16 ***
## Carat.Size    1.1249198  0.0039732 283.126  < 2e-16 ***
## ColorE       -0.0312884  0.0035971  -8.698  < 2e-16 ***
## ColorF       -0.0578710  0.0037208 -15.553  < 2e-16 ***
## ColorG       -0.0901593  0.0038689 -23.304  < 2e-16 ***
## ColorH       -0.1322195  0.0039374 -33.581  < 2e-16 ***
## ColorI       -0.1877411  0.0041775 -44.941  < 2e-16 ***
## ColorJ       -0.2599352  0.0044427 -58.509  < 2e-16 ***
## ColorK       -0.3363519  0.0055585 -60.511  < 2e-16 ***
## ClaritySI1   -0.2366131  0.0047701 -49.603  < 2e-16 ***
## ClaritySI2   -0.2995138  0.0049284 -60.772  < 2e-16 ***
## ClarityVS1   -0.1276372  0.0048257 -26.450  < 2e-16 ***
## ClarityVS2   -0.1771119  0.0048187 -36.755  < 2e-16 ***
## ClarityVVS1  -0.0446977  0.0049515  -9.027  < 2e-16 ***
## ClarityVVS2  -0.0781713  0.0050311 -15.537  < 2e-16 ***
## Carat.Size2  -0.6265371  0.0083267 -75.244  < 2e-16 ***
## CutGood      -0.0378119  0.0043088  -8.775  < 2e-16 ***
## CutIdeal      0.0103358  0.0038644   2.675  0.00753 ** 
## CutVery Good -0.0116387  0.0021229  -5.483 4.59e-08 ***
## Depth        -0.0039360  0.0008865  -4.440 9.36e-06 ***
## Table        -0.0008845  0.0005957  -1.485  0.13770    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.0478 on 2666 degrees of freedom
## Multiple R-squared:  0.9724, Adjusted R-squared:  0.9722 
## F-statistic:  4696 on 20 and 2666 DF,  p-value: < 2.2e-16
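
Note that Table is retained by the AIC criterion even though its t-test is not significant (p = 0.138): in the final step of the trace above, dropping Table would raise the AIC slightly (-16320.1 vs -16320.4). As a sketch, we can confirm the comparison directly (AIC() reports values that differ from the extractAIC() numbers in the trace by a constant, but the difference between the two models is the same):

AIC(step.reg)                        # the selected model
AIC(update(step.reg, . ~ . - Table)) # the same model without Table; slightly larger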