We are interested in predicting the price of a diamond based on 9 predictors. We want to know if we are getting ripped off for the price we pay for a diamond compared to historical prices.
First we load the appropriate libraries (makes sure you have them installed).

## Loading required package: ISLR
## Loading required package: ggplot2
## Loading required package: GGally
## Loading required package: scales
## Loading required package: memisc
## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
## 
##     percent
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
## 
##     as.array
## Loading required package: dplyr
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:memisc':
## 
##     collect, query, rename
## The following object is masked from 'package:MASS':
## 
##     select
## The following object is masked from 'package:GGally':
## 
##     nasa
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

Then we read in the diamonds data and assign it to an object called mydata. We inspect the structure of the dataframe using str.

## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

The dataframe seems to have the factors in the correct order and numerical variables assigned as such.

#WARNING takes a 10 minutes to run!
#diasamp <- mydata[sample(1:length(mydata$price), 10000),]
#ggpairs(diasamp, params = c(shape = I("."), outlier.shape = I(".")))

What’s happening is that ggpairs is plotting each variable against the other in a pretty smart way. In the lower-triangle of the plot matrix, it uses grouped histograms for qualitative-qualitative pairs and scatterplots for quantitative-quantitative pairs. In the upper-triangle, it plots grouped histograms for qualitative-qualitative pairs (using the x-instead of y-variable as the grouping factor), boxplots for qualitative-quantitative pairs, and provides the correlation for quantitative-quantitative pairs.

Let’s define a function to inspect the cube root of the carat, it’s volume.

cubroot_trans = function() trans_new("cubroot", transform= function(x) x^(1/3), inverse = function(x) x^3 )

With so many data points overfitting is a concern, let’s jitter and make each point more transparent.

p1 <- ggplot( data=mydata, aes(carat, price)) +
    geom_point(alpha = 0.5, size = .75, position="jitter") +
    scale_x_continuous(trans=cubroot_trans(), limits = c(0.2,3),
        breaks = c(0.2, 0.5, 1, 2, 3)) +
    scale_y_continuous(trans=log10_trans(), limits = c(350,15000),
        breaks = c(350, 1000, 5000, 10000, 15000)) +
    theme_bw() +
    ggtitle("Price (log10) by Cubed-Root of Carat")
p1
## Warning: Removed 1691 rows containing missing values (geom_point).

Our 2008 data has some problems but let’s go ahead and model price using linear regression.

mydata$logprice <- log(mydata$price)
m1 <- lm(logprice~  I(carat^(1/3)), 
    data=mydata[mydata$price < 10000,])
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut )
m4 <- update(m3, ~ . + color + clarity)
 

mtable(m1, m2, m3, m4)
## 
## Calls:
## m1: lm(formula = logprice ~ I(carat^(1/3)), data = mydata[mydata$price < 
##     10000, ])
## m2: lm(formula = logprice ~ I(carat^(1/3)) + carat, data = mydata[mydata$price < 
##     10000, ])
## m3: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut, data = mydata[mydata$price < 
##     10000, ])
## m4: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = mydata[mydata$price < 10000, ])
## 
## ==============================================================
##                      m1         m2         m3         m4      
## --------------------------------------------------------------
##   (Intercept)      2.723***   0.653***   0.538***   0.340***  
##                   (0.007)    (0.027)    (0.026)    (0.014)    
##   I(carat^(1/3))   5.667***   9.327***   9.383***   9.311***  
##                   (0.008)    (0.046)    (0.045)    (0.025)    
##   carat                      -1.551***  -1.540***  -1.194***  
##                              (0.019)    (0.019)    (0.010)    
##   cut: .L                                0.206***   0.119***  
##                                         (0.004)    (0.002)    
##   cut: .Q                               -0.057***  -0.031***  
##                                         (0.004)    (0.002)    
##   cut: .C                                0.046***   0.012***  
##                                         (0.003)    (0.002)    
##   cut: ^4                                0.015***  -0.004**   
##                                         (0.003)    (0.001)    
##   color: .L                                        -0.425***  
##                                                    (0.002)    
##   color: .Q                                        -0.092***  
##                                                    (0.002)    
##   color: .C                                        -0.015***  
##                                                    (0.002)    
##   color: ^4                                         0.014***  
##                                                    (0.002)    
##   color: ^5                                         0.001     
##                                                    (0.002)    
##   color: ^6                                         0.001     
##                                                    (0.001)    
##   clarity: .L                                       0.874***  
##                                                    (0.004)    
##   clarity: .Q                                      -0.233***  
##                                                    (0.003)    
##   clarity: .C                                       0.122***  
##                                                    (0.003)    
##   clarity: ^4                                      -0.058***  
##                                                    (0.002)    
##   clarity: ^5                                       0.024***  
##                                                    (0.002)    
##   clarity: ^6                                      -0.003*    
##                                                    (0.002)    
##   clarity: ^7                                       0.031***  
##                                                    (0.001)    
## --------------------------------------------------------------
##   R-squared            0.9        0.9        0.9        1.0   
##   adj. R-squared       0.9        0.9        0.9        1.0   
##   sigma                0.3        0.2        0.2        0.1   
##   F               512971.7   293367.4   104526.2   122132.1   
##   p                    0.0        0.0        0.0        0.0   
##   Log-likelihood   -3713.5     -707.4      796.6    31356.3   
##   Deviance          3322.1     2936.4     2760.6      787.3   
##   AIC               7432.9     1422.9    -1577.3   -62670.7   
##   BIC               7459.3     1458.0    -1506.9   -62486.0   
##   N                48717      48717      48717      48717     
## ==============================================================

We can use the summary function on each of these models to assess how much variation within the data they explain.

We can use the model to see if we are being charged too much for a diamond we are interested in.

# Round 1.00 Very Good I VS1 $5,601
thisDiamond <- data.frame(carat = 1.00, cut = "Very Good", color = "I", clarity="VS1")
modEst <- predict(m4, newdata = thisDiamond, interval="prediction", level = .95)
exp(modEst)
##        fit      lwr      upr
## 1 4495.173 3503.397 5767.712

The predicted price from the model is lower than what they are asking for by $600 - hussle!