========================================================
Notes:
The scatterplot is a powerful tool to help you understand the relationship between two continuous variables.
We can quickly see if the relationship is linear or not.
# Import Libraries
library(ggplot2)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Load the dataset
data("diamonds")
#Spot Check
head(diamonds)
## Source: local data frame [6 x 10]
##
## carat cut color clarity depth table price x y z
## (dbl) (fctr) (fctr) (fctr) (dbl) (dbl) (int) (dbl) (dbl) (dbl)
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.20 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
# Check the Names
names(diamonds)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
# Let's start by examining two variables in the data set.
# The scatterplot is a powerful tool to help you understand
# the relationship between two continuous variables.
# We can quickly see if the relationship is linear or not.
# In this case, we can use a variety of diamond
# characteristics to help us figure out whether
# the price advertised for any given diamond is
# reasonable or a rip-off.
# Let's consider the price of a diamond and it's carat weight.
# Create a scatterplot of price (y) vs carat weight (x).
# Limit the x-axis and y-axis to omit the top 1% of values.
# ENTER YOUR CODE BELOW THIS LINE
# ================================================================
# QPLOT Syntax
#
qplot(data = diamonds, x = carat, y = price,
xlim = c(0, quantile(diamonds$carat, 0.99)),
ylim = c(0, quantile(diamonds$price, 0.99))) +
geom_point(fill = I('#F79420'), color = I('black'), shape = 21)
## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 926 rows containing missing values (geom_point).
# GGPlot Syntax
#
ggplot(data = diamonds, aes(x = carat, y = price)) +
scale_x_continuous(lim = c(0, quantile(diamonds$carat, 0.99))) +
scale_y_continuous(lim = c(0, quantile(diamonds$price, 0.99))) +
geom_point(fill = I('#F79420'), color = I('black'), shape = 21) +
stat_smooth(method = 'lm')
## Warning: Removed 926 rows containing non-finite values (stat_smooth).
## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_smooth).
Response:
A few things pop our right away about the functional relationship between price and carat size.
Add in a smoothing function - which shows that the Linear Trendline doesn’t go through the center of the data at some key places. If we tried to use this to make predictions, we might be off in some key places in the data.
Notes:
Forever.
Notes:
De Beere Cartel …
Notes:
# install these if necessary
# install.packages('GGally')
# install.packages('scales')
# install.packages('memisc')
# install.packages('lattice')
# install.packages('MASS')
# install.packages('car')
# install.packages('reshape')
# install.packages('plyr')
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(scales)
library(memisc)
## Loading required package: lattice
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
##
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
##
## percent
## The following objects are masked from 'package:dplyr':
##
## collect, query, rename
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
##
## as.array
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
# ggpairs(diamond_samp, params = c(shape = I('.'), outlier.shape = I('.'))) <- DEPRECATED !!!
ggpairs(diamond_samp,
lower = list(continuous = wrap("points", shape = I('.'))),
upper = list(combo = wrap("box", outlier.shape = I('.'))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
** What’s happening?**
GGPairs is plotting each variable against the other in a pretty smart way.
In the lower triangle of the plot matrix, ggplot uses grouped histograms for qualitative - qualitative comparisons, and scatter plots for quantitative - quantitative pairs.
In the upper triangle, it plots group histograms for qualitative - qualitative pairs, this time using the X instead of the Y variable as the grouping factor, and box plots for qualitative - quantitative pairs.
It provides teh correlation for quantitative - quantitative pairs.
REMEMBER - our goal is to understand the price of diamonds, that’s the focus. Lets look at the relationships that correspond to price.
What are some things you notice in the ggpairs output? Response:
It is a positive non linear relationship
Price and carat are highly correlated - almost a 1 to 1.
There “seems”" to be relationships between price and clarity and price and color, which might come in handy as we model our data.
The critical factor driving price is the carat weight of the diamond. The relationship between price and diamond size is non-linear. What might explain this pattern?
NOTE - it’s often the case that leveraging substantive knowledge about your data like this can lead to especially fruitful transformations.
Diamond Carat Weight
** On the demand side** - customers in the market for a smaller less expensive diamond are probably more price sensitive to price that more well-to-do-buyers. Many less than 1 carat customers would surely never buy a diamond were it not for teh social norm for presenting one when proposing.
And, there are fewer customers who can afford a bigger diamond ( > 1 carat), hence, we shouldn’t expect the market for bigger diamonds to be as competitive as the one for smaller diamonds.
So it would make sence that the variance as well as price would increase with carat size.
Now often, the distribution of any monetary variable (like dollars) will be highly skewed and vary over orders of magnitude. This can result from path dependence (like the rich getting richer), or mulplicatitive processes (like year on year inflation), or some combination of both, hence it is a good idea to look into **compressing any such variable*, by putting it on a log scale.
Lets examine the distribution of price again…
Notes:
# Create two histograms of the price variable
# and place them side by side on one output image.
# We've put some code below to get you started.
# The first plot should be a histogram of price
# and the second plot should transform
# the price variable using log10.
# Set appropriate bin widths for each plot.
# ggtitle() will add a title to each histogram.
# You can self-assess your work with the plots
# in the solution video.
# ALTER THE CODE BELOW THIS LINE
# ==============================================
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# QPLOT VERSION
plot1 <- qplot(data = diamonds, x=price, binwidth = 100, fill = I('#099DD9')) +
ggtitle('Price')
plot2 <- qplot(data = diamonds, x=price, binwidth = 0.01, fill = I('#F79420')) +
scale_x_log10() +
ggtitle('Price (log10)')
grid.arrange(plot1, plot2, ncol = 2)
# GGPLOT VERSION
p1 <- ggplot(data=diamonds, aes(x = price)) +
geom_histogram(binwidth = 100, col="red", fill="green", alpha = .2) +
ggtitle('Price')
p2 <- ggplot(data=diamonds, aes(x = price)) +
geom_histogram(binwidth = 0.01, col="red", fill="green", alpha = .2) +
scale_x_log10() +
ggtitle('Price (log 10)')
grid.arrange(p1, p2, ncol = 2)
Notes:
We can see that the prices for diamonds are pretty heavily skewed. However, when you put them on a log scale, they behave much better, more along the normal distribution we’d like to see.
We can even see a little bit of evidence of bi-modality in the log 10 distribution, which is consistent with our two class (rich buyer / poor buyer) speculation about the nature of the customers for diamonds.
Sneak Preview Log 10
# Replot on a LOG 10 scale X: Carat vs Y: Price
ggplot(data=diamonds, aes(x = carat, y = price)) +
geom_point(position='jitter',size=0.6,alpha=1/2) +
scale_y_continuous(trans = log10_trans()) +
ggtitle('Price (log10 by Carat)')
# Create a CUBE'd function
# cuberoot_trans = function() trans_new('cuberoot',
# transform = function(x) x^(1/3),
# inverse = function(x) x^3)
# ggplot(data=diamonds, aes(x = carat, y = price)) +
# geom_point(position='jitter',size=0.6,alpha=1/2) +
# scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
# breaks = c(0.2, 0.5, 1, 2, 3)) +
# scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
# breaks = c(350, 1000, 5000, 10000, 15000)) +
# ggtitle('Price (log10 by Carat)')
NOTE: - Try using the cube root of carrot - in light of our speculation of of flaws being exponentially more likely in diamonds with more volume. Remember - volume is on a cubic scale.
Create a function to transform the carat variable - takes the cubes root of any input variable, and it has an inverse function to undo function, which we’ll need to display the plot correctly.
Then when we get to plotting, we’ll use the scale_x_continuous argument to transform the X-Axis with this cubed root transformation function. We are also transforming the Y-Axis with this **log10 transformation* from previously.
** Analysis** - things now look almost linear - therefore we can move forward and try to model our data using just a linear model.
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
inverse = function(x) x^3)
ggplot(aes(carat, price), data = diamonds) +
geom_point() +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1683 rows containing missing values (geom_point).
Notes:
Over-plotting - The case where multiple points take on the same value (often because of rounding).
Lets look at this by writing some code:
Run the table command on both price and carat, then we sort that so that the highest values appear first, then we look at the top 6 (the default for the HEAD Function).
The first line is the price and the second line is the count for each one of these values.
head(sort(table(diamonds$carat), decreasing=T))
##
## 0.3 0.31 1.01 0.7 0.32 1
## 2604 2249 2242 1981 1840 1558
head(sort(table(diamonds$price), decreasing=T))
##
## 605 802 625 828 776 698
## 132 127 126 125 124 121
Analysis - we can see that these are really high numbers which can result in a substantial amount of over plotting. When you have this much data, you are going to have serious overplotting, even when you are plotting the variables against each other.
This can really obscure some of the Density and Sparsity of our data at really key points.
You can deal with this by: (1) making your points smaller, by jittering your points and by adding transparancy (ggplot alpha parameter).
# Add a layer to adjust the features of the
# scatterplot. Set the transparency to one half,
# the size to three-fourths, and jitter the points.
# If you need hints, see the Instructor Notes.
# There are three hints so scroll down slowly if
# you don't want all the hints at once.
# ALTER THE CODE BELOW THIS LINE
# =======================================================================
ggplot(aes(carat, price), data = diamonds) +
# geom_point() +
geom_point(position='jitter',size=0.75, alpha=1/2) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1693 rows containing missing values (geom_point).
Notes:
Alter the code below.
# install and load the RColorBrewer package
# install.packages('RColorBrewer', dependencies = TRUE)
library(RColorBrewer)
# Start by visualizing CLARITY - Just add one parameter to our AES "color=clarity"
ggplot(aes(x = carat, y = price, color=clarity), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
## Warning: Removed 1696 rows containing missing values (geom_point).
Response:
Yes - because it creates a clear separation of the top tier diamonds from the rest.
Clarity does seem to explain an awful lot of the remaining variance int he price after adding color to our plot.
Holding carat weight constant while looking at one part of the plot, we see that diamonds with lower clarity are almost always cheaper than diamonds with better clarity.
Alter the code below.
ggplot(aes(x = carat, y = price, color = cut), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Cut', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Cut')
## Warning: Removed 1688 rows containing missing values (geom_point).
Response:
Alter the code below.
ggplot(aes(x = carat, y = price, color = color), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Color', reverse = F,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Color')
## Warning: Removed 1691 rows containing missing values (geom_point).
Response:
** Yes - because it creates a clear separation of price depending on the color. The better colors nearly always command a higher price, regardless of what the marketing people say.
Notes:
Log10(price) ~ carat^(1/3)
Nice! Price is the outcome and carat is the predictor variable. We used our domain knowledge of diamonds and carat weight to take the cube root of carat weight (volume).
Response:
Notes:
Lets build up our linear model for price:
Note the use of the I Wrapper around each of the variables. The I stands for AS IS, which tells R to use the expression inside of the I Function to transform a variable before using it in the regression. This is instead of instructing R to interpret these symbols as part of the formula to construct the design matrix for the regression.
I can also update the previous model to add the carat variable in the regression (using the syntax below). The real functional relationship is surely more complex than the cubed root of carat weight, so we add a simple linear function of carat in our model predicting price.
We can continue to make more complex models by adding more variables (cut, color, clarity).
The code run shows us some very nice R^2 values, we’re accounting for almost all of the variance in price using the 4 C’s.
If we want to know if the price for a diamond is reasonable, we might now use this model.
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
##
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color,
## data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color +
## clarity, data = diamonds)
##
## =========================================================================
## m1 m2 m3 m4 m5
## -------------------------------------------------------------------------
## (Intercept) 2.821*** 1.039*** 0.874*** 0.932*** 0.415***
## (0.006) (0.019) (0.019) (0.017) (0.010)
## I(carat^(1/3)) 5.558*** 8.568*** 8.703*** 8.438*** 9.144***
## (0.007) (0.032) (0.031) (0.028) (0.016)
## carat -1.137*** -1.163*** -0.992*** -1.093***
## (0.012) (0.011) (0.010) (0.006)
## cut: .L 0.224*** 0.224*** 0.120***
## (0.004) (0.004) (0.002)
## cut: .Q -0.062*** -0.062*** -0.031***
## (0.004) (0.003) (0.002)
## cut: .C 0.051*** 0.052*** 0.014***
## (0.003) (0.003) (0.002)
## cut: ^4 0.018*** 0.018*** -0.002
## (0.003) (0.002) (0.001)
## color: .L -0.373*** -0.441***
## (0.003) (0.002)
## color: .Q -0.129*** -0.093***
## (0.003) (0.002)
## color: .C 0.001 -0.013***
## (0.003) (0.002)
## color: ^4 0.029*** 0.012***
## (0.003) (0.002)
## color: ^5 -0.016*** -0.003*
## (0.003) (0.001)
## color: ^6 -0.023*** 0.001
## (0.002) (0.001)
## clarity: .L 0.907***
## (0.003)
## clarity: .Q -0.240***
## (0.003)
## clarity: .C 0.131***
## (0.003)
## clarity: ^4 -0.063***
## (0.002)
## clarity: ^5 0.026***
## (0.002)
## clarity: ^6 -0.002
## (0.002)
## clarity: ^7 0.032***
## (0.001)
## -------------------------------------------------------------------------
## R-squared 0.9 0.9 0.9 1.0 1.0
## adj. R-squared 0.9 0.9 0.9 1.0 1.0
## sigma 0.3 0.3 0.3 0.2 0.1
## F 652012.1 387489.4 138654.5 87959.5 173791.1
## p 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -7962.5 -3631.3 -1837.4 4235.2 34091.3
## Deviance 4242.8 3613.4 3380.8 2699.2 892.2
## AIC 15931.0 7270.6 3690.8 -8442.5 -68140.5
## BIC 15957.7 7306.2 3762.0 -8317.9 -67953.7
## N 53940 53940 53940 53940 53940
## =========================================================================
Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.
Video Notes:
Research: (Take some time to come up with 2-4 problems for the model) (You should spend 10-20 min on this)
Response:
Here’s the problem for the model.
The data is from 2008, so the model doesn’t follow the market trends or account for inflation or the 2008 recession. Maybe for some reason the price of diamonds drops or jump. We can do this by doing stochastic gradient and implementing real-live learn sample. Per the global diamonds report. Since then diamonds have grown at about 6%/year compound annual rate.
The rapidly growing demand in China could also explain the increase.
Diamond prices grew unevenly through carat sizes since 2008, meaning that the model initially estimated could not simply be adjusted for inflation.
We also need some other major factor rather than just bunch of description data about diamonds alone. Is it the diamonds will always have fixed price? Can we know all the diamonds out there in the market? Don’t forget the global recession in the 2008 that have an impact of the price of a diamonds.
Diamonds increase significantly from the mine to the market. Who knows in each step in the process, there’s will be a big jump of price.
Finally, as earlier we stated, there’s cartel of diamonds (e.g. De Beer) that maybe monopoly some price of diamonds in the market. Not just them, but also major players in the diamonds market. They can change the price, hence the price of diamonds also change in the market.
Notes:
# install.packages('bitops')
# install.packages('RCurl')
library('bitops')
library('RCurl')
diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
# load(rawConnection(diamondsurl))
load("BigDiamonds.rda")
The code used to obtain the data is available here: https://github.com/solomonm/diamonds-data
Notes:
# Your task is to build five linear models like Solomon
# did for the diamonds data set only this
# time you'll use a sample of diamonds from the
# diamondsbig data set.
# Be sure to make use of the same variables
# (logprice, carat, etc.) and model
# names (m1, m2, m3, m4, m5).
# To get the diamondsbig data into RStudio
# on your machine, copy, paste, and run the
# code in the Instructor Notes. There's
# 598,024 diamonds in this data set!
# Since the data set is so large,
# you are going to use a sample of the
# data set to compute the models. You can use
# the entire data set on your machine which
# will produce slightly different coefficients
# and statistics for the models.
# This exercise WILL BE automatically graded.
# You can leave off the code to load in the data.
# We've sampled the data for you.
# You also don't need code to create the table output of the models.
# We'll do that for you and check your model summaries (R^2 values, AIC, etc.)
# Your task is to write the code to create the models.
# DO NOT ALTER THE CODE BELOW THIS LINE (Reads in a sample of the diamondsbig data set)
#===========================================================================================
#
#diamondsBigSample <- read.csv('diamondsBigSample.csv')
# diamondsBigSample <- load("BigDiamonds.rda")
# diamondsBigSample <- data.frame(load("BigDiamonds.rda"))
# head(diamondsBigSample)
# ENTER YOUR CODE BELOW THIS LINE. (Create the five models)
#===========================================================================================
# diamondsBigSample$logprice = log(diamondsBigSample$price)
# m1 <- lm(logprice ~ I(carat^(1/3)),
# data = diamondsBigSample[diamondsBigSample$price < 10000 &
# diamondsBigSample$cert == "GIA",])
# m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = subset(diamondsBigSample, price < 10000 & cert == 'GIA' ))
# m2 <- update(m1, ~ . + carat)
# m3 <- update(m2, ~ . + cut)
# m4 <- update(m3, ~ . + color)
# m5 <- update(m4, ~ . + clarity)
# DO NOT ALTER THE CODE BELOW THIS LINE (Tables your models and pulls out the statistics)
#===========================================================================================
# suppressMessages(library(lattice))
# suppressMessages(library(MASS))
# suppressMessages(library(memisc))
# models <- mtable(m1, m2, m3, m4, m5)
Example Diamond from BlueNile:
# Be sure you've loaded the library memisc and have m5 saved as an object in your workspace.
str(m5)
## List of 13
## $ coefficients : Named num [1:20] 0.415 9.144 -1.093 0.12 -0.031 ...
## ..- attr(*, "names")= chr [1:20] "(Intercept)" "I(carat^(1/3))" "carat" "cut.L" ...
## $ residuals :Class 'AsIs' Named num [1:53940] 0.0828 0.0875 -0.2141 -0.2465 0.1509 ...
## .. ..- attr(*, "names")= chr [1:53940] "1" "2" "3" "4" ...
## $ effects :Class 'AsIs' Named num [1:53940] -1808.48 226.47 25.09 14.1 2.44 ...
## .. ..- attr(*, "names")= chr [1:53940] "(Intercept)" "I(carat^(1/3))" "carat" "cut.L" ...
## $ rank : int 20
## $ fitted.values:Class 'AsIs' Named num [1:53940] 5.7 5.7 6 6.06 5.66 ...
## .. ..- attr(*, "names")= chr [1:53940] "1" "2" "3" "4" ...
## $ assign : int [1:20] 0 1 2 3 3 3 3 4 4 4 ...
## $ qr :List of 5
## ..$ qr : num [1:53940, 1:20] -2.32e+02 4.31e-03 4.31e-03 4.31e-03 4.31e-03 ...
## .. ..- attr(*, "dimnames")=List of 2
## .. .. ..$ : chr [1:53940] "1" "2" "3" "4" ...
## .. .. ..$ : chr [1:20] "(Intercept)" "I(carat^(1/3))" "carat" "cut.L" ...
## .. ..- attr(*, "assign")= int [1:20] 0 1 2 3 3 3 3 4 4 4 ...
## .. ..- attr(*, "contrasts")=List of 3
## .. .. ..$ cut : chr "contr.poly"
## .. .. ..$ color : chr "contr.poly"
## .. .. ..$ clarity: chr "contr.poly"
## ..$ qraux: num [1:20] 1 1.01 1.01 1 1 ...
## ..$ pivot: int [1:20] 1 2 3 4 5 6 7 8 9 10 ...
## ..$ tol : num 1e-07
## ..$ rank : int 20
## ..- attr(*, "class")= chr "qr"
## $ df.residual : int 53920
## $ contrasts :List of 3
## ..$ cut : chr "contr.poly"
## ..$ color : chr "contr.poly"
## ..$ clarity: chr "contr.poly"
## $ xlevels :List of 3
## ..$ cut : chr [1:5] "Fair" "Good" "Very Good" "Premium" ...
## ..$ color : chr [1:7] "D" "E" "F" "G" ...
## ..$ clarity: chr [1:8] "I1" "SI2" "SI1" "VS2" ...
## $ call : language lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + clarity, data = diamonds)
## $ terms :Classes 'terms', 'formula' length 3 I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + clarity
## .. ..- attr(*, "variables")= language list(I(log(price)), I(carat^(1/3)), carat, cut, color, clarity)
## .. ..- attr(*, "factors")= int [1:6, 1:5] 0 1 0 0 0 0 0 0 1 0 ...
## .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. ..$ : chr [1:6] "I(log(price))" "I(carat^(1/3))" "carat" "cut" ...
## .. .. .. ..$ : chr [1:5] "I(carat^(1/3))" "carat" "cut" "color" ...
## .. ..- attr(*, "term.labels")= chr [1:5] "I(carat^(1/3))" "carat" "cut" "color" ...
## .. ..- attr(*, "order")= int [1:5] 1 1 1 1 1
## .. ..- attr(*, "intercept")= int 1
## .. ..- attr(*, "response")= int 1
## .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. ..- attr(*, "predvars")= language list(I(log(price)), I(carat^(1/3)), carat, cut, color, clarity)
## .. ..- attr(*, "dataClasses")= Named chr [1:6] "numeric" "numeric" "numeric" "ordered" ...
## .. .. ..- attr(*, "names")= chr [1:6] "I(log(price))" "I(carat^(1/3))" "carat" "cut" ...
## $ model :'data.frame': 53940 obs. of 6 variables:
## ..$ I(log(price)) :Class 'AsIs' num [1:53940] 5.79 5.79 5.79 5.81 5.81 ...
## ..$ I(carat^(1/3)):Class 'AsIs' num [1:53940] 0.613 0.594 0.613 0.662 0.677 ...
## ..$ carat : num [1:53940] 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## ..$ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## ..$ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## ..$ clarity : Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## ..- attr(*, "terms")=Classes 'terms', 'formula' length 3 I(log(price)) ~ I(carat^(1/3)) + carat + cut + color + clarity
## .. .. ..- attr(*, "variables")= language list(I(log(price)), I(carat^(1/3)), carat, cut, color, clarity)
## .. .. ..- attr(*, "factors")= int [1:6, 1:5] 0 1 0 0 0 0 0 0 1 0 ...
## .. .. .. ..- attr(*, "dimnames")=List of 2
## .. .. .. .. ..$ : chr [1:6] "I(log(price))" "I(carat^(1/3))" "carat" "cut" ...
## .. .. .. .. ..$ : chr [1:5] "I(carat^(1/3))" "carat" "cut" "color" ...
## .. .. ..- attr(*, "term.labels")= chr [1:5] "I(carat^(1/3))" "carat" "cut" "color" ...
## .. .. ..- attr(*, "order")= int [1:5] 1 1 1 1 1
## .. .. ..- attr(*, "intercept")= int 1
## .. .. ..- attr(*, "response")= int 1
## .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
## .. .. ..- attr(*, "predvars")= language list(I(log(price)), I(carat^(1/3)), carat, cut, color, clarity)
## .. .. ..- attr(*, "dataClasses")= Named chr [1:6] "numeric" "numeric" "numeric" "ordered" ...
## .. .. .. ..- attr(*, "names")= chr [1:6] "I(log(price))" "I(carat^(1/3))" "carat" "cut" ...
## - attr(*, "class")= chr "lm"
# Create a Diamond:
# thisDiamond = data.frame(carat = 1.00, cut = "Good", color = "I", clarity="VS1")
# modelEstimate = predict(m5, newdata = thisDiamond, interval="prediction", level = .95)
# Estimate / predict it's value:
# modelEstimate = predict(m5, newdata = thisDiamond, interval="prediction", level = .95)
# Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels) :
# factor cut has new level V.Good
# NOTE: The culpret is "cut = "V.Good".
# View(modelEstimate)
# modelEstimate
Evaluate how well the model predicts the BlueNile diamond’s price. Think about the fitted point estimate as well as the 95% CI.
Notes:
Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!