Lesson 6

Welcome

Notes:


Scatterplot Review

library('ggplot2')
data(diamonds)
summary(diamonds)
##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z         
##  Min.   : 0.000   Min.   : 0.000  
##  1st Qu.: 4.720   1st Qu.: 2.910  
##  Median : 5.710   Median : 3.530  
##  Mean   : 5.735   Mean   : 3.539  
##  3rd Qu.: 6.540   3rd Qu.: 4.040  
##  Max.   :58.900   Max.   :31.800  
## 
qplot(x = carat, y = price, data = diamonds, xlim = c(0, quantile(diamonds$carat,0.99)), ylim = c(0, quantile(diamonds$price, 0.99))) + geom_point(fill = I('#F79420'), color = I('black'), shape = 21)
## Warning: Removed 926 rows containing missing values (geom_point).

## Warning: Removed 926 rows containing missing values (geom_point).

ggplot(aes(x = carat, y = price), data = diamonds) + scale_x_continuous(lim = c(0, quantile(diamonds$carat, 0.99))) + scale_y_continuous(lim = c(0, quantile(diamonds$price, 0.99))) + geom_point(color = I('#F79420'), alpha = 1/4) + stat_smooth(method = 'lm')
## Warning: Removed 926 rows containing non-finite values (stat_smooth).

## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_smooth).


Price and Carat Relationship

Response: nonlinear relationship, exponential or something else. dispersion or variance of the relationship also increases as carat size increases


Frances Gerety

Notes: The diamonds data set ships with ggplot2 and contains the prices and the specs for more than 50,000

A diamonds is forever


The Rise of Diamonds

Notes: a man’s sucess in life


ggpairs Function

Notes:

# install these if necessary
#install.packages('GGally')
#install.packages('scales')
#install.packages('memisc')
#install.packages('lattice')
#install.packages('MASS')
#install.packages('car')
#install.packages('reshape')
#install.packages('plyr')

# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 3.2.4
library(scales)
## Warning: package 'scales' was built under R version 3.2.4
library(memisc)
## Warning: package 'memisc' was built under R version 3.2.4
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.2.4
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.2.4
## 
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
## 
##     percent
## The following objects are masked from 'package:stats':
## 
##     contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
## 
##     as.array
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp, lower = list(continuous = wrap("points", shape = I('.'))), upper = list(combo = wrap("box", outlier.shape = I('.'))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

What are some things you notice in the ggpairs output? Response: the lower triangle of the plot matrix, it uses grouped histograms for qualitative pairs and scatter plots for quantitative pairs. In the upper triangle, it plots grouped histograms for qualitative pairs, this time using the x instead of the y variable as the grouping factor. Box plots for qualitative pairs, and it provides the correlation for quantitative pairs.


The Demand of Diamonds

Notes: On the demand side, customers in the market for a less expensive, smaller diamond are probably more sensitive to price than more well-to-do buyers. We shouldn’t expect the market for bigger diamonds to be as competitive as the one for smaller diamonds.

Create two histograms of the price variable and place them side by side on one output image.

We’ve put some code below to get you started.

The first plot should be a histogram of price and the second plot should transform the price variable using log10.

Set appropriate bin widths for each plot. ggtitle() will add a title to each histogram.

library(gridExtra)

plot1 <- qplot(x = price, data = diamonds, binwidth = 100, fill = I('#099DD9')) +
  ggtitle('Price')

plot2 <- qplot(x=log10(price+1), data = diamonds, binwidth = 0.01, fill = I('#F79420')) +
  ggtitle('Price (log10)')
  
plot2 <- qplot(x=price, data = diamonds, binwidth = 0.01, fill = I('#F79420')) + scale_x_log10() + ggtitle('Price (log10)')

grid.arrange(plot1, plot2, ncol = 2)


Connecting Demand and Price Distributions

Notes: the prices for diamonds are pretty heavily skewed, but when you put those prices on a log ten scale, they seem much better behaved. They are closer to the bell curve of a normal distribution. We can see little bit of evidence of bimodality on this log ten scale which is consistent with our two class rich buyer poor buyer speculation about the nature of customers for diamonds.

most people buy less expensive diamonds and their relationship is quite linear. Demand for expensive diamonds are very disperse. There is a chasm in price.


Scatterplot Transformation

qplot (carat, price, data = diamonds) + scale_y_continuous(trans = log10_trans()) + ggtitle('Price (log10) by Carat')

Create a new function to transform the carat variable

cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3), inverse = function(x) x^3)

Use the cuberoot_trans function

ggplot(aes(carat, price), data = diamonds) + geom_point() + scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1683 rows containing missing values (geom_point).


Overplotting Revisited

head(sort(table(diamonds$carat), desreasing = T))
## 
## 2.59 2.64 2.65 2.67  2.7 2.71 
##    1    1    1    1    1    1
head(sort(table(diamonds$price), desreasing = T))
## 
## 327 334 335 338 339 340 
##   1   1   1   1   1   1
ggplot(aes(carat, price), data = diamonds) + geom_point(alpha = 0.5, size = 0.75, position = 'jitter') + scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1691 rows containing missing values (geom_point).


Other Qualitative Factors

Notes:


Price vs. Carat and Clarity

Adjust the code below to color the points by clarity.

A layer called scale_color_brewer() has been added to adjust the legend and provide custom colors.

# install and load the RColorBrewer package
#install.packages('RColorBrewer')
library(RColorBrewer)

ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) + geom_point(alpha = 0.5, size = 1, position = 'jitter') + scale_color_brewer(type = 'div', guide = guide_legend(title = 'Clarity', reverse = T, override.aes = list(alpha = 1, size = 2))) +  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
## Warning: Removed 1693 rows containing missing values (geom_point).


Clarity and Price

Response: Clarity seems to explain a lot of the remaining variance in price, after adding color to our plot. Holding carat weight constant, we see diamonds with lower clarity are almost always cheaper than diamonds with better clarity.


Price vs. Carat and Cut

Adjust the code below to color the points by cut. Change any other parts of the code as needed.

ggplot(aes(x = carat, y = price, color = cut), data = diamonds) + geom_point(alpha = 0.5, size = 1, position = 'jitter') + scale_color_brewer(type = 'div', guide = guide_legend(title = 'Cut', reverse = T, override.aes = list(alpha = 1, size = 2))) + scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat and Cut')
## Warning: Removed 1696 rows containing missing values (geom_point).


Cut and Price

Response: We don’t see much variation on cut. Most of the diamonds in the data are ideal cut anyway. So, we’ve lost the color pattern that we saw before.


Price vs. Carat and Color

Adjust the code below to color the points by diamond colors and change the titles.

ggplot(aes(x = carat, y = price, color = color), data = diamonds) + geom_point(alpha = 0.5, size = 1, position = 'jitter') + scale_color_brewer(type = 'div', guide = guide_legend(title = 'Color', reverse = F, override.aes = list(alpha = 1, size = 2))) + scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat and Color')
## Warning: Removed 1688 rows containing missing values (geom_point).


Color and Price

Response: Color seems to explain a lot of the remaining variance in price, after adding color to our plot. However, blue states that the difference between all color grades from D to J are basically not noticeable to the naked eye. Yet, we do see the color difference in the price tag.


Linear Models in R

Notes: lm( y ~ x), where y is outcome variable and x is explanatory variable

Response: we applied the log transformaiton to our long tailed dollar variable, and we speculated that the flawless diamond should become exponentially rarer as diamond volume increases. so we should be interested in the cube root of carat weight.


Building the Linear Model for Price

Notes: I stands for as is

m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamonds)
## 
## =========================================================================
##                      m1         m2         m3         m4         m5      
## -------------------------------------------------------------------------
##   (Intercept)      2.821***   1.039***   0.874***   0.932***   0.415***  
##                   (0.006)    (0.019)    (0.019)    (0.017)    (0.010)    
##   I(carat^(1/3))   5.558***   8.568***   8.703***   8.438***   9.144***  
##                   (0.007)    (0.032)    (0.031)    (0.028)    (0.016)    
##   carat                      -1.137***  -1.163***  -0.992***  -1.093***  
##                              (0.012)    (0.011)    (0.010)    (0.006)    
##   cut: .L                                0.224***   0.224***   0.120***  
##                                         (0.004)    (0.004)    (0.002)    
##   cut: .Q                               -0.062***  -0.062***  -0.031***  
##                                         (0.004)    (0.003)    (0.002)    
##   cut: .C                                0.051***   0.052***   0.014***  
##                                         (0.003)    (0.003)    (0.002)    
##   cut: ^4                                0.018***   0.018***  -0.002     
##                                         (0.003)    (0.002)    (0.001)    
##   color: .L                                        -0.373***  -0.441***  
##                                                    (0.003)    (0.002)    
##   color: .Q                                        -0.129***  -0.093***  
##                                                    (0.003)    (0.002)    
##   color: .C                                         0.001     -0.013***  
##                                                    (0.003)    (0.002)    
##   color: ^4                                         0.029***   0.012***  
##                                                    (0.003)    (0.002)    
##   color: ^5                                        -0.016***  -0.003*    
##                                                    (0.003)    (0.001)    
##   color: ^6                                        -0.023***   0.001     
##                                                    (0.002)    (0.001)    
##   clarity: .L                                                  0.907***  
##                                                               (0.003)    
##   clarity: .Q                                                 -0.240***  
##                                                               (0.003)    
##   clarity: .C                                                  0.131***  
##                                                               (0.003)    
##   clarity: ^4                                                 -0.063***  
##                                                               (0.002)    
##   clarity: ^5                                                  0.026***  
##                                                               (0.002)    
##   clarity: ^6                                                 -0.002     
##                                                               (0.002)    
##   clarity: ^7                                                  0.032***  
##                                                               (0.001)    
## -------------------------------------------------------------------------
##   R-squared            0.9        0.9        0.9        1.0        1.0   
##   adj. R-squared       0.9        0.9        0.9        1.0        1.0   
##   sigma                0.3        0.3        0.3        0.2        0.1   
##   F               652012.1   387489.4   138654.5    87959.5   173791.1   
##   p                    0.0        0.0        0.0        0.0        0.0   
##   Log-likelihood   -7962.5    -3631.3    -1837.4     4235.2    34091.3   
##   Deviance          4242.8     3613.4     3380.8     2699.2      892.2   
##   AIC              15931.0     7270.6     3690.8    -8442.5   -68140.5   
##   BIC              15957.7     7306.2     3762.0    -8317.9   -67953.7   
##   N                53940      53940      53940      53940      53940     
## =========================================================================

Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.


Model Problems

Video Notes: To start, our data’s from 2008. So, not only do we need to account for inflation, but the diamond market is quite different now than it was. In fact, when I fit models to this data and predicted the price of the diamonds that I found off a market, I kept getting predictions that were way too low. It was found that global marker was poor. It turns out that prices plummeted in 2008 due to the global financial crisis.

Research: (Take some time to come up with 2-4 problems for the model) (You should 10-20 min on this)

Response:


A Bigger, Better Data Set

Notes:

#install.packages('bitops')
#install.packages('RCurl')
library('bitops')
library('RCurl')

#diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
#load(rawConnection(diamondsurl))

The code used to obtain the data is available here: https://github.com/solomonm/diamonds-data

Building a Model Using the Big Diamonds Data Set

Notes:

load("BigDiamonds.Rda")
diamondsbig$logprice = log(diamondsbig$price)
m1 <- lm(logprice ~ I(carat^(1/3)), data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == "GIA",])
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
## 
## Calls:
## m1: lm(formula = logprice ~ I(carat^(1/3)), data = diamondsbig[diamondsbig$price < 
##     10000 & diamondsbig$cert == "GIA", ])
## m2: lm(formula = logprice ~ I(carat^(1/3)) + carat, data = diamondsbig[diamondsbig$price < 
##     10000 & diamondsbig$cert == "GIA", ])
## m3: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut, data = diamondsbig[diamondsbig$price < 
##     10000 & diamondsbig$cert == "GIA", ])
## m4: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == 
##         "GIA", ])
## m5: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == 
##     "GIA", ])
## 
## =========================================================================
##                      m1         m2         m3         m4         m5      
## -------------------------------------------------------------------------
##   (Intercept)      2.671***   1.333***   0.949***   0.529***  -0.464***  
##                   (0.003)    (0.012)    (0.012)    (0.010)    (0.009)    
##   I(carat^(1/3))   5.839***   8.243***   8.633***   8.110***   8.320***  
##                   (0.004)    (0.022)    (0.021)    (0.017)    (0.012)    
##   carat                      -1.061***  -1.223***  -0.782***  -0.763***  
##                              (0.009)    (0.009)    (0.007)    (0.005)    
##   cut: V.Good                            0.120***   0.090***   0.071***  
##                                         (0.002)    (0.001)    (0.001)    
##   cut: Ideal                             0.211***   0.181***   0.131***  
##                                         (0.002)    (0.001)    (0.001)    
##   color: K/L                                        0.123***   0.117***  
##                                                    (0.004)    (0.003)    
##   color: J/L                                        0.312***   0.318***  
##                                                    (0.003)    (0.002)    
##   color: I/L                                        0.451***   0.469***  
##                                                    (0.003)    (0.002)    
##   color: H/L                                        0.569***   0.602***  
##                                                    (0.003)    (0.002)    
##   color: G/L                                        0.633***   0.665***  
##                                                    (0.003)    (0.002)    
##   color: F/L                                        0.687***   0.723***  
##                                                    (0.003)    (0.002)    
##   color: E/L                                        0.729***   0.756***  
##                                                    (0.003)    (0.002)    
##   color: D/L                                        0.812***   0.827***  
##                                                    (0.003)    (0.002)    
##   clarity: I1                                                  0.301***  
##                                                               (0.006)    
##   clarity: SI2                                                 0.607***  
##                                                               (0.006)    
##   clarity: SI1                                                 0.727***  
##                                                               (0.006)    
##   clarity: VS2                                                 0.836***  
##                                                               (0.006)    
##   clarity: VS1                                                 0.891***  
##                                                               (0.006)    
##   clarity: VVS2                                                0.935***  
##                                                               (0.006)    
##   clarity: VVS1                                                0.995***  
##                                                               (0.006)    
##   clarity: IF                                                  1.052***  
##                                                               (0.006)    
## -------------------------------------------------------------------------
##   R-squared             0.9        0.9       0.9        0.9         1.0  
##   adj. R-squared        0.9        0.9       0.9        0.9         1.0  
##   sigma                 0.3        0.3       0.3        0.2         0.2  
##   F               2700903.7  1406538.3  754405.4   423311.5    521161.4  
##   p                     0.0        0.0       0.0        0.0         0.0  
##   Log-likelihood   -60137.8   -53996.3  -43339.8    37830.4    154124.3  
##   Deviance          28298.7    27291.5   25628.3    15874.9      7992.7  
##   AIC              120281.6   108000.5   86691.6   -75632.8   -308204.5  
##   BIC              120313.8   108043.5   86756.0   -75482.6   -307968.4  
##   N                338946     338946    338946     338946      338946    
## =========================================================================

Predictions

Example Diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601

#Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
                         color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
                        interval="prediction", level = .95)
exp(modelEstimate)
##        fit     lwr      upr
## 1 5040.436 3730.34 6810.638

Evaluate how well the model predicts the BlueNile diamond’s price. Think about the fitted point estimate as well as the 95% CI.

   fit     lwr      upr

1 5040.436 3730.34 6810.638 ***

Final Thoughts

Notes: Even though we can predict the price of a diamnond based on a function for c’s. One thing you should not conclude with this exercise is that where you buy your diamond is irrelevant. You almost surely pay more for the same diamond at Tiffany’s compared to Costco. Regardless you can use a model like this to get a sense of whether you were overpaying. One last thing, data and models are never infallible and you can still get taken even equipped with this kind of analysis.


Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!