Diamond predictions

Setting up

Libraries and dataframes.

setwd("C:/Users/tomas/Downloads/eda-course-materials")

library(dslabs)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(readr)
library(gridExtra)

## 
## Attaching package: 'gridExtra'

## The following object is masked from 'package:dplyr':
## 
##     combine

library(reshape2)
library(quantreg)

## Loading required package: SparseM

## 
## Attaching package: 'SparseM'

## The following object is masked from 'package:base':
## 
##     backsolve

library(scales)

## 
## Attaching package: 'scales'

## The following object is masked from 'package:readr':
## 
##     col_factor

library(memisc)

## Loading required package: lattice

## Loading required package: MASS

## 
## Attaching package: 'MASS'

## The following object is masked from 'package:dplyr':
## 
##     select

## 
## Attaching package: 'memisc'

## The following object is masked from 'package:scales':
## 
##     percent

## The following object is masked from 'package:ggplot2':
## 
##     syms

## The following objects are masked from 'package:dplyr':
## 
##     collect, recode, rename, syms

## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts

## The following object is masked from 'package:base':
## 
##     as.array

library(lattice)
library(MASS)
library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:memisc':
## 
##     recode

## The following object is masked from 'package:dplyr':
## 
##     recode

data("diamonds")
diamondsbig <- read.csv("C:/Users/tomas/Downloads/eda-course-materials/lesson6/diamondsbig.csv")

First we can create a plot matrix, using a sample of the data, to see the relationships between the different variables and the price.
The upper and lower parameters define the aesthetics of the scatterplots and boxplots.
axisLabels = "internal sets the labels to the diagonal in the middle.

set.seed(16)

diamond_samp <- diamonds[sample.int(nrow(diamonds), 10000), ]

ggpairs(diamond_samp,
  lower = list(continuous = wrap("points", shape = I('.'))),
  upper = list(combo = wrap("box", outlier.shape = I('.'))), 
  axisLabels = "internal")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Price seems to be affected by cut, carat, clarity and color.
The histogram of price shows it is long-tailed, we can apply a log 10 transformation.
Carat is a product of xyz, so we should apply a cube root transformation. As no transformation exists we can create a function with a new transformation with the function function() trans_new(“NAME”, transform = function(x) TRANSFORMATION OF X, inverse = function(x) INVERSE FUNCTION).
The layer + ggtitle(“TITLE”) adds a title to the plot

cuberoot_trans <- function()trans_new('cuberoot', transform = function(x) x^(1/3),
                                      inverse = function(x) x^3)

ggplot(aes(carat, price), data = diamonds) + 
  geom_point(alpha=1/5, size=0.8, position = "jitter") + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')

## Warning: Removed 1693 rows containing missing values (geom_point).

Now we can add the other variables to the plot in the form of color.
With the package RColorBrewer we can add a color palette and a guide for the legend of the plot.
The modification is made with the layer scale_color_brewer(type = ‘COLOR PALETTE’, guide = guide_legend(title = ‘TITLE’, reverse = T OF F, override.aes = list(alpha = 1, size = SIZE OF THE LEGEND))).

p1 <- ggplot(aes(x = carat, y = price, color = cut), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Cut', reverse = T,
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Cut')

p2 <- ggplot(aes(x = carat, y = price, color = color), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = "Color", reverse = F,
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Color')
  
p3 <- ggplot(aes(x = carat, y = price), data = diamonds) + 
  geom_point(aes(color=clarity), alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Clarity', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
    breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
    breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Clarity')

grid.arrange(p1, p2, p3, ncol=1)

## Warning: Removed 1689 rows containing missing values (geom_point).

## Warning: Removed 1690 rows containing missing values (geom_point).

## Warning: Removed 1694 rows containing missing values (geom_point).

Now that we have the plots, we can create a linear model using a sample of a new bigger, dataframe.
The linear model is built with the function lm(data = DATAFRAME, I((DEPENDANT VARIABLE) ~ I(INDEPENDANT VARIABLE)), and then adding more independent variables with update(MODEL, ~ . + NEW INDEPENDANT VARIABLE).
The function mtable() displays the results from the model.

diamondsbig <- read.csv("C:/Users/tomas/Downloads/eda-course-materials/lesson6/diamondsbig.csv")


m1 <- lm(data = subset(diamondsbig[sample.int(nrow(diamondsbig), 10000), ], price<1000 & cert=="GIA"), 
         I(log(price) ~ I(carat^1/3)))
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + color)
m4 <- update(m3, ~ . + cut)
m5 <- update(m4, ~ . + clarity)
mtable(m5)

## 
## Calls:
## m5: lm(formula = log(price) ~ I(carat^1/3) + carat + color + cut + 
##     clarity, data = subset(diamondsbig[sample.int(nrow(diamondsbig), 
##     10000), ], price < 1000 & cert == "GIA"))
## 
## ==============================
##   (Intercept)       5.101***  
##                    (0.043)    
##   I(carat^1/3)      9.766***  
##                    (0.243)    
##   color: E/D       -0.045**   
##                    (0.014)    
##   color: F/D       -0.060***  
##                    (0.014)    
##   color: G/D       -0.091***  
##                    (0.015)    
##   color: H/D       -0.132***  
##                    (0.017)    
##   color: I/D       -0.220***  
##                    (0.018)    
##   color: J/D       -0.330***  
##                    (0.023)    
##   color: K/D       -0.431***  
##                    (0.029)    
##   color: L/D       -0.459***  
##                    (0.037)    
##   cut: Ideal        0.104***  
##                    (0.013)    
##   cut: V.Good       0.062***  
##                    (0.015)    
##   clarity: I2      -0.406***  
##                    (0.079)    
##   clarity: IF       0.551***  
##                    (0.030)    
##   clarity: SI1      0.397***  
##                    (0.025)    
##   clarity: SI2      0.343***  
##                    (0.026)    
##   clarity: VS1      0.445***  
##                    (0.027)    
##   clarity: VS2      0.429***  
##                    (0.026)    
##   clarity: VVS1     0.527***  
##                    (0.027)    
##   clarity: VVS2     0.486***  
##                    (0.027)    
## ------------------------------
##   R-squared         0.527     
##   N              1627         
## ==============================
##   Significance:   
##                 *** = p < 0.001;   
##                 ** = p < 0.01;   
##                 * = p < 0.05

Finally, to predict the price of a diamond, we create a new dataframe with the characteristics of the diamond to predict.
Then we use the function predict(MODEL, newdata = EXAMPLE OBJECT, interval = “prediction”, level = CONFIDENCE INTERVAL) and apply it to a new vector.

thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
                         color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
                        interval="prediction", level = .95)

## Warning in predict.lm(m5, newdata = thisDiamond, interval = "prediction", :
## prediction from a rank-deficient fit may be misleading

exp(modelEstimate)

##        fit      lwr      upr
## 1 5672.888 4025.212 7995.022