Linear Regression Model to predict Diamond Price

Loading The Diamond Data

library(ggplot2)
data(diamonds)
str(diamonds)

## Classes 'tbl_df', 'tbl' and 'data.frame':    53940 obs. of  10 variables:
##  $ carat  : num  0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
##  $ cut    : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
##  $ color  : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
##  $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
##  $ depth  : num  61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
##  $ table  : num  55 61 65 58 58 57 57 55 61 61 ...
##  $ price  : int  326 326 327 334 335 336 336 337 337 338 ...
##  $ x      : num  3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
##  $ y      : num  3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
##  $ z      : num  2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...

Scatter Plot Review

The scatterplot is a powerful tool to help you understand the relationship between two continuous variables.

# Let's start by examining two variables in the data set.
# We can quickly see if the relationship is linear or not.
# In this case, we can use a variety of diamond
# characteristics to help us figure out whether
# the price advertised for any given diamond is 
# reasonable or a rip-off.

# Let's consider the price of a diamond and it's carat weight.
# Create a scatterplot of price (y) vs carat weight (x).

# Limit the x-axis and y-axis to omit the top 1% of values.

# ENTER YOUR CODE BELOW THIS LINE
# ================================================================

ggplot(diamonds,(aes(x = carat, y = price)))+
  geom_point(fill = I('#F79420'),color = I('red'),shape= 21)+
  xlim(0,quantile(diamonds$carat,.99))+
  ylim(0,quantile(diamonds$price,.99))

## Warning: Removed 926 rows containing missing values (geom_point).

Price and Carat Relationship

Response:

there is non linear relationship between both variables, maybe its exponential or else.
Variance or dispersion increases as the carat size increases.

ggplot(diamonds,(aes(x = carat, y = price)))+
  geom_point(fill = I('#F79420'),color = I('red'),shape= 21)+
  stat_smooth(method = lm)+
  xlim(0,quantile(diamonds$carat,.99))+
  ylim(0,quantile(diamonds$price,.99))

## Warning: Removed 926 rows containing non-finite values (stat_smooth).

## Warning: Removed 926 rows containing missing values (geom_point).

## Warning: Removed 4 rows containing missing values (geom_smooth).

ggpairs Function

Notes:

https://s3.amazonaws.com/udacity-hosted-downloads/ud651/ggpairs_landscape.pdf

The Demand of Diamonds

Notes: using log scale for price

# Create two histograms of the price variable
# and place them side by side on one output image.

# The first plot should be a histogram of price
# and the second plot should transform
# the price variable using log10.

# Set appropriate bin widths for each plot.
# ggtitle() will add a title to each histogram.

library(gridExtra)

p1 <- ggplot(diamonds,(aes(x = price)))+
  geom_histogram(binwidth = 100, fill = I('#099DD9'))+
  ggtitle('Diamond prices')

p2 = ggplot(diamonds,(aes(x = log10(price))))+
  geom_histogram(binwidth = 0.05, fill = I('#F79420'))+
  ggtitle('log Price')


grid.arrange(p1,p2,ncol = 2)

The two peaks shows the type of buyer as expected. rich buyer and poor buyer. log 10 scale has bimodal and bell curve type of distribution, which is better in behavior. ***

Connecting Demand and Price Distributions

Notes: On the log scale, the prices look less dispersed at the high end of Carat size and price, but actually we can do better. Let’s try using the cube root of Carat in light of our speculation about flaws being exponentially more likely in diamonds with more volume ***

Scatterplot Transformation

ggplot(diamonds,(aes(x = carat, y = log10(price))))+
  geom_point(fill = I('#F79420'),color = I('brown'),shape= 21)+
  ggtitle('Carat vs (log10)Prices')

Create a new function to transform the carat variable

cuberoot_trans = function() trans_new('cuberoot', 
                                      transform = function(x) x^(1/3),
                                      inverse = function(x) x^3)

Use the cuberoot_trans function

ggplot(aes(carat, price), data = diamonds) + 
  geom_point()+
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')

## Warning: Removed 1683 rows containing missing values (geom_point).

With these transformations that we used to get our data on this nice scale. Things look almost linear. We can now move forward and see about modelling our data using just a linear model.

Overplotting Revisited

When multiple points are taken on same value. to correct them

add transperency
add jitter points, adjust the shape
additional info by adding layer of geom_smooth or geom_quantile or geom_density_2D

head(sort(table(diamonds$carat), decreasing = TRUE))

## 
##  0.3 0.31 1.01  0.7 0.32    1 
## 2604 2249 2242 1981 1840 1558

head(sort(table(diamonds$price), decreasing = TRUE))

## 
## 605 802 625 828 776 698 
## 132 127 126 125 124 121

# Add a layer to adjust the features of the
# scatterplot. Set the transparency to one half,
# the size to three-fourths, and jitter the points.

ggplot(aes(carat, price), data = diamonds) + 
  geom_point(alpha = 0.5, size = 0.75, position ='jitter') + 
  geom_count()+
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')

## Warning: Removed 1683 rows containing non-finite values (stat_sum).

## Warning: Removed 1690 rows containing missing values (geom_point).

Price vs. Carat and Cut

Alter the code below.

# Adjust the code below to color the points by clarity.

# A layer called scale_color_brewer() has 
# been added to adjust the legend and
# provide custom colors.

ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Clarity', reverse = T,
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Clarity')

## Warning: Removed 1694 rows containing missing values (geom_point).

Price vs. Carat and Color

Alter the code below.

# Adjust the code below to color the points by cut.
# Change any other parts of the code as needed.

ggplot(aes(x = carat, y = price, color = color), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Color', reverse = T,
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Color')

## Warning: Removed 1690 rows containing missing values (geom_point).

Color and Price

Response: yes . a pattern is observed in the color and price relation.

Linear Models in R

Notes:

Response:

Building the Linear Model

Notes: sdigits = 3 This will ensure that the output at the end of the table shows three significant digits like shown in the video.

m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5,sdigits = 3)

## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamonds)
## 
## ============================================================================================
##                        m1             m2             m3             m4            m5        
## --------------------------------------------------------------------------------------------
##   (Intercept)          2.821***       1.039***       0.874***      0.932***       0.415***  
##                       (0.006)        (0.019)        (0.019)       (0.017)        (0.010)    
##   I(carat^(1/3))       5.558***       8.568***       8.703***      8.438***       9.144***  
##                       (0.007)        (0.032)        (0.031)       (0.028)        (0.016)    
##   carat                              -1.137***      -1.163***     -0.992***      -1.093***  
##                                      (0.012)        (0.011)       (0.010)        (0.006)    
##   cut: .L                                            0.224***      0.224***       0.120***  
##                                                     (0.004)       (0.004)        (0.002)    
##   cut: .Q                                           -0.062***     -0.062***      -0.031***  
##                                                     (0.004)       (0.003)        (0.002)    
##   cut: .C                                            0.051***      0.052***       0.014***  
##                                                     (0.003)       (0.003)        (0.002)    
##   cut: ^4                                            0.018***      0.018***      -0.002     
##                                                     (0.003)       (0.002)        (0.001)    
##   color: .L                                                       -0.373***      -0.441***  
##                                                                   (0.003)        (0.002)    
##   color: .Q                                                       -0.129***      -0.093***  
##                                                                   (0.003)        (0.002)    
##   color: .C                                                        0.001         -0.013***  
##                                                                   (0.003)        (0.002)    
##   color: ^4                                                        0.029***       0.012***  
##                                                                   (0.003)        (0.002)    
##   color: ^5                                                       -0.016***      -0.003*    
##                                                                   (0.003)        (0.001)    
##   color: ^6                                                       -0.023***       0.001     
##                                                                   (0.002)        (0.001)    
##   clarity: .L                                                                     0.907***  
##                                                                                  (0.003)    
##   clarity: .Q                                                                    -0.240***  
##                                                                                  (0.003)    
##   clarity: .C                                                                     0.131***  
##                                                                                  (0.003)    
##   clarity: ^4                                                                    -0.063***  
##                                                                                  (0.002)    
##   clarity: ^5                                                                     0.026***  
##                                                                                  (0.002)    
##   clarity: ^6                                                                    -0.002     
##                                                                                  (0.002)    
##   clarity: ^7                                                                     0.032***  
##                                                                                  (0.001)    
## --------------------------------------------------------------------------------------------
##   R-squared            0.924          0.935          0.939         0.951          0.984     
##   adj. R-squared       0.924          0.935          0.939         0.951          0.984     
##   sigma                0.280          0.259          0.250         0.224          0.129     
##   F               652012.063     387489.366     138654.523     87959.467     173791.084     
##   p                    0.000          0.000          0.000         0.000          0.000     
##   Log-likelihood   -7962.499      -3631.319      -1837.416      4235.240      34091.272     
##   Deviance          4242.831       3613.360       3380.837      2699.212        892.214     
##   AIC              15930.999       7270.637       3690.832     -8442.481     -68140.544     
##   BIC              15957.685       7306.220       3761.997     -8317.942     -67953.736     
##   N                53940          53940          53940         53940          53940         
## ============================================================================================

Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.

Model Problems

Video Notes:

Research: (Take some time to come up with 2-4 problems for the model) (You should 10-20 min on this)

Response:

diamondsbig<- load(“BigDiamonds.Rda”) getwd()

Building a Model Using the Big Diamonds Data Set

Notes:

Building a Model Using the Big Diamonds Data Set}

diamondsbig$logprice <- log(diamondsbig$price) m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == “GIA”,]) m2 <- update(m1, ~ . + carat) m3 <- update(m2, ~ . + cut) m4 <- update(m3, ~ . + color) m5 <- update(m4, ~ . + clarity) mtable(m1, m2, m3, m4, m5,sdigits = 3)

suppressMessages(library(lattice)) suppressMessages(library(MASS)) suppressMessages(library(memisc)) models <- mtable(m1, m2, m3, m4, m5) ```

Predictions

Example Diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601

Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace.

thisDiamond = data.frame(carat = 1.00, cut = “V.Good”, color = “I”, clarity=“VS1”) modelEstimate = predict(m5, newdata = thisDiamond, interval=“prediction”, level = .95)

Evaluate how well the model predicts the BlueNile diamond’s price. Think about the fitted point estimate as well as the 95% CI.

Linear Regression Model to predict Diamond Price

Dr Alina

September 27, 2018

Loading The Diamond Data

Scatter Plot Review

Price and Carat Relationship

ggpairs Function

The Demand of Diamonds

Connecting Demand and Price Distributions

Scatterplot Transformation

Create a new function to transform the carat variable

Use the cuberoot_trans function

Overplotting Revisited

Price vs. Carat and Cut

Price vs. Carat and Color

Color and Price

Linear Models in R

Building the Linear Model

Model Problems

Building a Model Using the Big Diamonds Data Set

Building a Model Using the Big Diamonds Data Set}

Predictions

Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace.