Lesson 6

setwd("D:/R/Udacity/EDA_Course_Materials/lesson6")
library(ggplot2)

Welcome

Notes:


Scatterplot Review

ggplot(data = diamonds, aes(x = carat, y = price)) +
    scale_x_continuous(lim = c(0, quantile(diamonds$carat, 0.99))) +
    scale_y_continuous(lim = c(0, quantile(diamonds$price, 0.99))) +
    geom_point(fill= I('#F79420'), color = I('black'), shape = 21) +
    stat_smooth(method = 'lm')
## Warning: Removed 926 rows containing missing values (stat_smooth).
## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_path).


Price and Carat Relationship

Response: Price is growing with carat size and most of diamonds are concentrated in 1.0, 1.5, 2.0 carat sizes. Relationship between varialbes is not linear. ***

Frances Gerety

Notes:

A diamonds is


The Rise of Diamonds

Notes:


ggpairs Function

Notes:

# install these if necessary
#install.packages('GGally')
#install.packages('scales')
#install.packages('memisc')
#install.packages('lattice')
#install.packages('MASS')
#install.packages('car')
#install.packages('reshape')
#install.packages('plyr')

# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
## 
## Attaching package: 'memisc'
## 
## The following object is masked from 'package:scales':
## 
##     percent
## 
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## 
## The following object is masked from 'package:base':
## 
##     as.array
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp, params = c(shape = I('.'), outlier.shape = I('.')))
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.
## stat_bin: binwidth defaulted to range/30. Use 'binwidth = x' to adjust this.

What are some things you notice in the ggpairs output? Response: The value which is driving the price is carat of diamonds. ***

The Demand of Diamonds

Notes:

library(gridExtra)
## Loading required package: grid
plot1 <- qplot(data = diamonds, x = price, binwidth = 100, fill = I('#099DD9')) + 
  ggtitle('Price')

plot2 <- qplot(data = diamonds, x = price, binwidth = 0.01, fill = I('#F79420')) +
  scale_x_log10() +
  ggtitle('Price (log10)')

grid.arrange(plot1, plot2, ncol = 2)
## Warning: position_stack requires constant width: output may be incorrect


Connecting Demand and Price Distributions

Notes: Picks on Price(log10) histogram are the most frequent price poor buyers and rich buyers get diamonds for. ***

Scatterplot Transformation

Create a new function to transform the carat variable

cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
                                      inverse = function(x) x^3)

Use the cuberoot_trans function


Overplotting Revisited

head(sort(table(diamonds$carat), decreasing = T))
## 
##  0.3 0.31 1.01  0.7 0.32    1 
## 2604 2249 2242 1981 1840 1558
head(sort(table(diamonds$price), decreasing = T))
## 
## 605 802 625 828 776 698 
## 132 127 126 125 124 121
ggplot(aes(carat, price), data = diamonds) + 
  geom_point(alpha = 1/2, size = 3/4, position = 'jitter') + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1691 rows containing missing values (geom_point).


Other Qualitative Factors

Notes:


Price vs. Carat and Clarity

Alter the code below.

# install and load the RColorBrewer package
#install.packages('RColorBrewer')
library(RColorBrewer)

ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Clarity', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
    breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
    breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
## Warning: Removed 1693 rows containing missing values (geom_point).


Clarity and Price

Response: Diamonds of one size with higher clarity have higher price. ***

Price vs. Carat and Cut

Alter the code below.

ggplot(aes(x = carat, y = price, color = cut), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Cut', reverse = T,
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Cut')
## Warning: Removed 1696 rows containing missing values (geom_point).


Cut and Price

Response: We cannot say if cut accounts for variance in price. In this dataset we mostly have diamonds of ideal cut. ***

Price vs. Carat and Color

Alter the code below.

ggplot(aes(x = carat, y = price, color = color), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Color', 
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Color')
## Warning: Removed 1688 rows containing missing values (geom_point).


Color and Price

Response: Color influences price as we see diamonds of D color usually have higher price. ***

Linear Models in R

Notes:

Response:


Building the Linear Model

Notes:

m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamonds)
## 
## ======================================================================
##                     m1         m2         m3         m4         m5    
## ----------------------------------------------------------------------
## (Intercept)      2.821***   1.039***   0.874***   0.932***   0.415*** 
##                 (0.006)    (0.019)    (0.019)    (0.017)    (0.010)   
## I(carat^(1/3))   5.558***   8.568***   8.703***   8.438***   9.144*** 
##                 (0.007)    (0.032)    (0.031)    (0.028)    (0.016)   
## carat                      -1.137***  -1.163***  -0.992***  -1.093*** 
##                            (0.012)    (0.011)    (0.010)    (0.006)   
## cut: .L                                0.224***   0.224***   0.120*** 
##                                       (0.004)    (0.004)    (0.002)   
## cut: .Q                               -0.062***  -0.062***  -0.031*** 
##                                       (0.004)    (0.003)    (0.002)   
## cut: .C                                0.051***   0.052***   0.014*** 
##                                       (0.003)    (0.003)    (0.002)   
## cut: ^4                                0.018***   0.018***  -0.002    
##                                       (0.003)    (0.002)    (0.001)   
## color: .L                                        -0.373***  -0.441*** 
##                                                  (0.003)    (0.002)   
## color: .Q                                        -0.129***  -0.093*** 
##                                                  (0.003)    (0.002)   
## color: .C                                         0.001     -0.013*** 
##                                                  (0.003)    (0.002)   
## color: ^4                                         0.029***   0.012*** 
##                                                  (0.003)    (0.002)   
## color: ^5                                        -0.016***  -0.003*   
##                                                  (0.003)    (0.001)   
## color: ^6                                        -0.023***   0.001    
##                                                  (0.002)    (0.001)   
## clarity: .L                                                  0.907*** 
##                                                             (0.003)   
## clarity: .Q                                                 -0.240*** 
##                                                             (0.003)   
## clarity: .C                                                  0.131*** 
##                                                             (0.003)   
## clarity: ^4                                                 -0.063*** 
##                                                             (0.002)   
## clarity: ^5                                                  0.026*** 
##                                                             (0.002)   
## clarity: ^6                                                 -0.002    
##                                                             (0.002)   
## clarity: ^7                                                  0.032*** 
##                                                             (0.001)   
## ----------------------------------------------------------------------
## R-squared            0.924      0.935      0.939     0.951       0.984
## adj. R-squared       0.924      0.935      0.939     0.951       0.984
## sigma                0.280      0.259      0.250     0.224       0.129
## F               652012.063 387489.366 138654.523 87959.467  173791.084
## p                    0.000      0.000      0.000     0.000       0.000
## Log-likelihood   -7962.499  -3631.319  -1837.416  4235.240   34091.272
## Deviance          4242.831   3613.360   3380.837  2699.212     892.214
## AIC              15930.999   7270.637   3690.832 -8442.481  -68140.544
## BIC              15957.685   7306.220   3761.997 -8317.942  -67953.736
## N                53940      53940      53940     53940       53940    
## ======================================================================

Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.


Model Problems

Video Notes: We didn’t count inflation or any financial market changes.

Research: (Take some time to come up with 2-4 problems for the model) (You should 10-20 min on this)

Response:


A Bigger, Better Data Set

Notes:

#install.packages('bitops')
#install.packages('RCurl')
library('bitops')
library('RCurl')

#diamondsurl <- getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
#load(rawConnection(diamondsurl))
load("BigDiamonds.rda")
diamondsBig <- read.csv('diamondsBig.csv')

The code used to obtain the data is available here: https://github.com/solomonm/diamonds-data

Building a Model Using the Big Diamonds Data Set

Notes:

m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamondsBig)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamondsBig)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamondsBig)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamondsBig)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamondsBig)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamondsBig)
## 
## ===========================================================================
##                     m1          m2          m3          m4          m5     
## ---------------------------------------------------------------------------
## (Intercept)       3.096***    1.406***    1.218***    1.299***    0.578*** 
##                  (0.002)     (0.005)     (0.005)     (0.005)     (0.004)   
## I(carat^(1/3))    5.317***    7.911***    7.920***    8.170***    8.368*** 
##                  (0.002)     (0.008)     (0.008)     (0.007)     (0.005)   
## carat                        -0.767***   -0.779***   -0.782***   -0.815*** 
##                              (0.002)     (0.002)     (0.002)     (0.001)   
## cut: Ideal                                0.256***    0.222***    0.130*** 
##                                          (0.002)     (0.001)     (0.001)   
## cut: V.Good                               0.119***    0.092***    0.059*** 
##                                          (0.002)     (0.002)     (0.001)   
## color: E/D                                           -0.104***   -0.081*** 
##                                                      (0.002)     (0.001)   
## color: F/D                                           -0.171***   -0.135*** 
##                                                      (0.002)     (0.001)   
## color: G/D                                           -0.267***   -0.225*** 
##                                                      (0.002)     (0.001)   
## color: H/D                                           -0.377***   -0.327*** 
##                                                      (0.002)     (0.001)   
## color: I/D                                           -0.472***   -0.429*** 
##                                                      (0.002)     (0.001)   
## color: J/D                                           -0.592***   -0.561*** 
##                                                      (0.002)     (0.001)   
## color: K/D                                           -0.760***   -0.758*** 
##                                                      (0.002)     (0.002)   
## color: L/D                                           -0.894***   -0.886*** 
##                                                      (0.003)     (0.003)   
## clarity: I2                                                      -0.355*** 
##                                                                  (0.005)   
## clarity: IF                                                       0.992*** 
##                                                                  (0.002)   
## clarity: SI1                                                      0.479*** 
##                                                                  (0.002)   
## clarity: SI2                                                      0.329*** 
##                                                                  (0.002)   
## clarity: VS1                                                      0.713*** 
##                                                                  (0.002)   
## clarity: VS2                                                      0.625*** 
##                                                                  (0.002)   
## clarity: VVS1                                                     0.869*** 
##                                                                  (0.002)   
## clarity: VVS2                                                     0.790*** 
##                                                                  (0.002)   
## ---------------------------------------------------------------------------
## R-squared             0.893       0.911       0.915       0.940      0.968 
## adj. R-squared        0.893       0.911       0.915       0.940      0.968 
## sigma                 0.425       0.389       0.379       0.318      0.233 
## F               5000339.553 3039162.591 1608418.407  785005.644 903078.293 
## p                     0.000       0.000       0.000       0.000      0.000 
## Log-likelihood  -336293.722 -283694.777 -268204.909 -162478.620  23289.316 
## Deviance         107833.287   90420.162   85850.030   60255.534  32348.910 
## AIC              672593.443  567397.553  536421.818  324985.239 -46534.632 
## BIC              672627.344  567442.754  536489.619  325143.442 -46286.028 
## N                597311      597311      597311      597311     597311     
## ===========================================================================

Predictions

Example Diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601

#Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
                         color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
                        interval="prediction", level = .95)
exp(modelEstimate)
##        fit     lwr      upr
## 1 4786.053 3033.06 7552.207

Evaluate how well the model predicts the BlueNile diamond’s price. Think about the fitted point estimate as well as the 95% CI.


Final Thoughts

Notes: This model to realize if you’re overpaying for a diamond. But it doesn’t predict the exact price of the diamond. ***

Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!