Lesson 6

Welcome

Notes:

# Colors
# #8224e3 -- Purple
# #fcae3a -- Orange
# #359bed -- Blue
# #81d742 -- Green
# #dd3333 -- Red

Scatterplot Review

library("ggplot2")
library(xkcd)
data(diamonds)
# Let's start by examining two variables in the data set.
# The scatterplot is a powerful tool to help you understand
# the relationship between two continuous variables.

# We can quickly see if the relationship is linear or not.
# In this case, we can use a variety of diamond
# characteristics to help us figure out whether
# the price advertised for any given diamond is 
# reasonable or a rip-off.

# Let's consider the price of a diamond and it's carat weight.
# Create a scatterplot of price (y) vs carat weight (x).

# Limit the x-axis and y-axis to omit the top 1% of values.
names(diamonds)

##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"

ggplot(data=diamonds, aes(x=carat, y=price)) +
  # get rid of top percentile
  scale_x_continuous(lim=c(0,quantile(diamonds$carat,0.99))) +
  scale_y_continuous(lim=c(0,quantile(diamonds$price,0.99))) +
  geom_point(fill=I('#dd3333'), color= I("black"), aes(alpha=1/10),shape=21) +
  stat_smooth(method='lm') +
  theme_xkcd()

Price and Carat Relationship

Response:

ggplot(data=diamonds, aes(x=carat, y=price)) +
  # get rid of top percentile
  scale_x_continuous(lim=c(0,quantile(diamonds$carat,0.99))) +
  scale_y_continuous(lim=c(0,quantile(diamonds$price,0.99))) +
  geom_point(color=I('#dd3333'),alpha=1/10) +
  stat_smooth(method='lm') +
  ggtitle("Linear Fit of Carat Weight to Price") + 
  theme_xkcd()

Frances Gerety

Notes: This is the woman that created the rise of the diamond.

A diamonds is

Forever!

The Rise of Diamonds

Notes: Frances Gerety wrote “A Diamond Is Forever” for an advertising campaign for De Beers in 1947, and wrote all of the company’s ads for 25 years.

ggpairs Function

Notes:

# install these if necessary
install.packages('GGally')
install.packages('scales')
install.packages('memisc')
install.packages('lattice')
install.packages('MASS')
install.packages('car')
install.packages('reshape')
install.packages('plyr')

# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)

# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp, outlier.shape = I('.')) # params = c(shape = I('.'),

What are some things you notice in the ggpairs output? Response:

The Demand of Diamonds

Notes: We speculated that there are two tiers of demands, those purchased by those who feel obligateddue to a woman’s expectation of a diamond for a marriage proposal who would otherwise never purchase a diamond, and those who are especially wealthy and purchase a large diamond because they have the means to afford it.

The log base 10 scale reveals this information of bimodality.

# Create two histograms of the price variable
# and place them side by side on one output image.

# We’ve put some code below to get you started.

# The first plot should be a histogram of price
# and the second plot should transform
# the price variable using log10.

# Set appropriate bin widths for each plot.
# ggtitle() will add a title to each histogram.

# You can self-assess your work with the plots
# in the solution video.

# ALTER THE CODE BELOW THIS LINE
# ==============================================

library(gridExtra)

plot1 <- ggplot(data=diamonds, aes(x=price)) + 
  geom_histogram(fill=I("#81d742"), color=I("black"),binwidth = 200) +
  ggtitle('Price') +
  theme_xkcd()
plot2 <- ggplot(data=diamonds, aes(x=price)) +
  geom_histogram(fill=I("#359bed"), color=I("black"),binwidth = 0.02) + 
  scale_x_log10() +
  ggtitle('Price - Log10 Tranformation') +
  xlab("Price Log 10") +
  theme_xkcd()
grid.arrange(plot1,plot2, ncol=2)

Connecting Demand and Price Distributions

Notes:

Scatterplot Transformation

Here we see the smaller variance in higher priced diamonds.

ggplot(data=diamonds, aes(x=carat, y=price)) +
  geom_point(color=I("#359bed"), alpha=1/50) + 
  scale_y_continuous(trans=log10_trans()) +
  ggtitle('Price - Log10 Tranformation') +
  xlab("Carat") + 
  ylab("Price Log 10") +
  theme_xkcd()

Create a new function to transform the carat variable

cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
                                      inverse = function(x) x^3)

Use the cuberoot_trans function

ggplot(aes(carat, price), data = diamonds) + 
  geom_point() + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price log10 by Cube-Root of Carat') +
  theme_xkcd()

Overplotting Revisited

head(sort(table(diamonds$carat),decreasing = TRUE))

## 
##  0.3 0.31 1.01  0.7 0.32    1 
## 2604 2249 2242 1981 1840 1558

head(sort(table(diamonds$price),decreasing = TRUE))

## 
## 605 802 625 828 776 698 
## 132 127 126 125 124 121

# Add a layer to adjust the features of the
# scatterplot. Set the transparency to one half,
# the size to three-fourths, and jitter the points.

# If you need hints, see the Instructor Notes.
# There are three hints so scroll down slowly if
# you don’t want all the hints at once.
ggplot(aes(carat, price), data = diamonds) + 
  geom_point(alpha=1/5,size=.75,color="#8224e3",position= "jitter") + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat') +
  theme_xkcd()

Other Qualitative Factors

Notes:

Price vs. Carat and Clarity

Alter the code below.

# install and load the RColorBrewer package
install.packages('RColorBrewer')

library(RColorBrewer)

ggplot(aes(x = carat, y = price,color=clarity), data = diamonds) + 
  geom_point(alpha = 0.2, size = .75, position = 'jitter') +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Clarity', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
    breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
    breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price log10 by Cube-Root of Carat and Clarity') +
  theme_xkcd()

Clarity and Price

Response: Damn! Look at the trends in the clarity. The better the clarity, the higher the price. There is a very clear correlation here. Filtering out each clarity level you could run a model on each level and get a very accurate outcome. ***

Price vs. Carat and Cut

Alter the code below.

ggplot(aes(x = carat, y = price, color = cut), data = diamonds) + 
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = 'Cut', reverse = T,
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Cut') +
  theme_xkcd()

Cut and Price

Response: There doesn’t seem to be that many poorly cut diamonds. I don’t think cut makes as big of a difference since so many still show how variance in price.

The ideals dominate the cut, and the variance is large amongst each category of cut. ***

Price vs. Carat and Color

Alter the code below.

ggplot(aes(x = carat, y = price, color = color), data = diamonds) + 
  geom_point(alpha = 0.5, size = .75, position = 'jitter') +
  scale_color_brewer(type = 'div',
                     guide = guide_legend(title = "Color", reverse = FALSE,
                                          override.aes = list(alpha = 1, size = 2))) +  
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat and Color') + 
  theme_xkcd()

Color and Price

Response: Here we see more of what we are looking for, a small variance in the price distribution of color.

Meaning in other words, the price and color have a strong correlation. ***

Linear Models in R

Notes:

We can create predictive models in R using the lm() function. Here is the help for the lm function:

?lm

In my explanation, derived from the EDA course, we need to provide a formula in the form of lm( y ~ x), where Y is the outcome variable, and X is the explanatory variable.

Which of the formulas would we ue inside the lm() function?

<td>price ~ carat</td>
<td>log(price) ~ carat^3</td>

<td>price^(1/3) ~ log(carat)</td>
<td>log(price) ~ carat^(1/3)</td>

I would say the last one because that is the relationship that was linear in nature.

As the course states: Nice! Price is the outcome and carat is the predictor variable. We used our domain knowledge of diamonds and carat weight to take the cube root of carat weight (volume).

Response:

Building the Linear Model

Notes:

m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds) 

# create our first model. The I() wrapper uses the expression inside the I function to transform the variable before using it in the regression, rather than instruction R to interpret these symbols as part of the formula to construct a design matrix for the regression. 
m2 <- update(m1, ~ . + carat)

# We can update the previous model by adding carat to the explanatory variables to consider more parameters and more accurately predict our outcome.

m3 <- update(m2, ~ . + cut) #lets add cut even though that doesn't do much. 
m4 <- update(m3, ~ . + color) #
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)

## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamonds)
## 
## ======================================================================
##                     m1         m2         m3         m4         m5    
## ----------------------------------------------------------------------
## (Intercept)      2.821***   1.039***   0.874***   0.932***   0.415*** 
##                 (0.006)    (0.019)    (0.019)    (0.017)    (0.010)   
## I(carat^(1/3))   5.558***   8.568***   8.703***   8.438***   9.144*** 
##                 (0.007)    (0.032)    (0.031)    (0.028)    (0.016)   
## carat                      -1.137***  -1.163***  -0.992***  -1.093*** 
##                            (0.012)    (0.011)    (0.010)    (0.006)   
## cut: .L                                0.224***   0.224***   0.120*** 
##                                       (0.004)    (0.004)    (0.002)   
## cut: .Q                               -0.062***  -0.062***  -0.031*** 
##                                       (0.004)    (0.003)    (0.002)   
## cut: .C                                0.051***   0.052***   0.014*** 
##                                       (0.003)    (0.003)    (0.002)   
## cut: ^4                                0.018***   0.018***  -0.002    
##                                       (0.003)    (0.002)    (0.001)   
## color: .L                                        -0.373***  -0.441*** 
##                                                  (0.003)    (0.002)   
## color: .Q                                        -0.129***  -0.093*** 
##                                                  (0.003)    (0.002)   
## color: .C                                         0.001     -0.013*** 
##                                                  (0.003)    (0.002)   
## color: ^4                                         0.029***   0.012*** 
##                                                  (0.003)    (0.002)   
## color: ^5                                        -0.016***  -0.003*   
##                                                  (0.003)    (0.001)   
## color: ^6                                        -0.023***   0.001    
##                                                  (0.002)    (0.001)   
## clarity: .L                                                  0.907*** 
##                                                             (0.003)   
## clarity: .Q                                                 -0.240*** 
##                                                             (0.003)   
## clarity: .C                                                  0.131*** 
##                                                             (0.003)   
## clarity: ^4                                                 -0.063*** 
##                                                             (0.002)   
## clarity: ^5                                                  0.026*** 
##                                                             (0.002)   
## clarity: ^6                                                 -0.002    
##                                                             (0.002)   
## clarity: ^7                                                  0.032*** 
##                                                             (0.001)   
## ----------------------------------------------------------------------
## R-squared            0.924      0.935      0.939     0.951       0.984
## adj. R-squared       0.924      0.935      0.939     0.951       0.984
## sigma                0.280      0.259      0.250     0.224       0.129
## F               652012.063 387489.366 138654.523 87959.467  173791.084
## p                    0.000      0.000      0.000     0.000       0.000
## Log-likelihood   -7962.499  -3631.319  -1837.416  4235.240   34091.272
## Deviance          4242.831   3613.360   3380.837  2699.212     892.214
## AIC              15930.999   7270.637   3690.832 -8442.481  -68140.544
## BIC              15957.685   7306.220   3761.997 -8317.942  -67953.736
## N                53940      53940      53940     53940       53940    
## ======================================================================

Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.

the R-squared coefficient is the key variable here, and notice how we are accounting for almost all of the price variance by using the relevant models.

Here is the output: R-squared 0.924 0.935 0.939 0.951 0.984

Model Problems

Video Notes:

Our model is: ln(price) = 0.415 + 9.144(carat^(1/3)) - 1.093carat + (…* cut + …* color + …* clarity) + error coefficient.

Research: (Take some time to come up with 2-4 problems for the model) (You should 10-20 min on this)

Response: This data is from 2008, we need to account for inflation. Additionally, we’re out of the global recession. -Diamond Market in china is heating up. -Uneven price increases across carat size, which means we need to retrain the model rather than simply use a multiplier to account for inflation / price increases. ***

A Bigger, Better Data Set

Notes:

install.packages('bitops')
install.packages('RCurl')

library('bitops')
library('RCurl')

#these don't actually work, why?
#-------------------------------
#diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
#load(rawConnection(diamondsurl))

#-------------------------------
# Edit, from http://stackoverflow.com/questions/24846120/importing-data-into-r-rdata-from-github
# Note that the https:// URL scheme is not supported except on Windows.
#-------------------------------

# we can use this instead
diamondsurl = "https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda"
load(url(diamondsurl))

# or this
#download.file(diamondsurl,"BigDiamonds.rda")
#load("BigDiamonds.rda")

The code used to obtain the data is available here: https://github.com/solomonm/diamonds-data

Building a Model Using the Big Diamonds Data Set

Notes:

# Your task is to build five linear models like Solomon
# did for the diamonds data set only this
# time you'll use a sample of diamonds from the
# diamondsbig data set.

# Be sure to make use of the same variables
# (logprice, carat, etc.) and model
# names (m1, m2, m3, m4, m5).

# To get the diamondsbig data into RStudio
# on your machine, copy, paste, and run the
# code in the Instructor Notes. There's
# 598,024 diamonds in this data set!

# Since the data set is so large,
# you are going to use a sample of the
# data set to compute the models. You can use
# the entire data set on your machine which
# will produce slightly different coefficients
# and statistics for the models.

# This exercise WILL BE automatically graded.

# You can leave off the code to load in the data.
# We've sampled the data for you.
# You also don't need code to create the table output of the models.
# We'll do that for you and check your model summaries (R^2 values, AIC, etc.)

# Your task is to write the code to create the models.

#m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data=diamondsbig)
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data=diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == "GIA",]) # lets remove outliers
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1,m2,m3,m4,m5)

## 
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamondsbig[diamondsbig$price < 
##     10000 & diamondsbig$cert == "GIA", ])
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamondsbig[diamondsbig$price < 
##     10000 & diamondsbig$cert == "GIA", ])
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamondsbig[diamondsbig$price < 
##     10000 & diamondsbig$cert == "GIA", ])
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color, 
##     data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == 
##         "GIA", ])
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + 
##     clarity, data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == 
##     "GIA", ])
## 
## ===========================================================================
##                     m1          m2          m3          m4          m5     
## ---------------------------------------------------------------------------
## (Intercept)       2.671***    1.333***    0.949***    0.529***   -0.464*** 
##                  (0.003)     (0.012)     (0.012)     (0.010)     (0.009)   
## I(carat^(1/3))    5.839***    8.243***    8.633***    8.110***    8.320*** 
##                  (0.004)     (0.022)     (0.021)     (0.017)     (0.012)   
## carat                        -1.061***   -1.223***   -0.782***   -0.763*** 
##                              (0.009)     (0.009)     (0.007)     (0.005)   
## cut: V.Good                               0.120***    0.090***    0.071*** 
##                                          (0.002)     (0.001)     (0.001)   
## cut: Ideal                                0.211***    0.181***    0.131*** 
##                                          (0.002)     (0.001)     (0.001)   
## color: K/L                                            0.123***    0.117*** 
##                                                      (0.004)     (0.003)   
## color: J/L                                            0.312***    0.318*** 
##                                                      (0.003)     (0.002)   
## color: I/L                                            0.451***    0.469*** 
##                                                      (0.003)     (0.002)   
## color: H/L                                            0.569***    0.602*** 
##                                                      (0.003)     (0.002)   
## color: G/L                                            0.633***    0.665*** 
##                                                      (0.003)     (0.002)   
## color: F/L                                            0.687***    0.723*** 
##                                                      (0.003)     (0.002)   
## color: E/L                                            0.729***    0.756*** 
##                                                      (0.003)     (0.002)   
## color: D/L                                            0.812***    0.827*** 
##                                                      (0.003)     (0.002)   
## clarity: I1                                                       0.301*** 
##                                                                  (0.006)   
## clarity: SI2                                                      0.607*** 
##                                                                  (0.006)   
## clarity: SI1                                                      0.727*** 
##                                                                  (0.006)   
## clarity: VS2                                                      0.836*** 
##                                                                  (0.006)   
## clarity: VS1                                                      0.891*** 
##                                                                  (0.006)   
## clarity: VVS2                                                     0.935*** 
##                                                                  (0.006)   
## clarity: VVS1                                                     0.995*** 
##                                                                  (0.006)   
## clarity: IF                                                       1.052*** 
##                                                                  (0.006)   
## ---------------------------------------------------------------------------
## R-squared             0.888       0.892      0.899       0.937        0.969
## adj. R-squared        0.888       0.892      0.899       0.937        0.969
## sigma                 0.289       0.284      0.275       0.216        0.154
## F               2700903.714 1406538.330 754405.425  423311.488   521161.443
## p                     0.000       0.000      0.000       0.000        0.000
## Log-likelihood   -60137.791  -53996.269 -43339.818   37830.414   154124.270
## Deviance          28298.689   27291.534  25628.285   15874.910     7992.720
## AIC              120281.582  108000.539  86691.636  -75632.827  -308204.540
## BIC              120313.783  108043.473  86756.037  -75482.557  -307968.400
## N                338946      338946     338946      338946       338946    
## ===========================================================================

Predictions

Example Diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601 ^^These are the true values, lets pop it into our model. We need to exponentiate our model results since we used the log10(price) rather than price itself.

#Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
                         color = "I", clarity="VS1")
# data.Frame creates a data frame, we created a dataframe with one value, thisDiamond
modelEstimate = predict(m5, newdata = thisDiamond,
                        interval="prediction", level = .95)
exp(modelEstimate) # this will give us the actual price because our model outputs log 10.

##        fit     lwr      upr
## 1 5040.436 3730.34 6810.638

# parameters
# level = gives us the lower and upper bounds based on our confidence level
#       -- ie, there is a 95% chance that the diamond price is within this range
#

Evaluate how well the model predicts the BlueNile diamond’s price. Think about the fitted point estimate as well as the 95% CI.

Final Thoughts

Notes:

Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!