Notes:
library('ggplot2')
data(diamonds)
summary(diamonds)
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
##
qplot(x = carat, y = price, data = diamonds, xlim = c(0, quantile(diamonds$carat,0.99)), ylim = c(0, quantile(diamonds$price, 0.99))) + geom_point(fill = I('#F79420'), color = I('black'), shape = 21)
## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 926 rows containing missing values (geom_point).
ggplot(aes(x = carat, y = price), data = diamonds) + scale_x_continuous(lim = c(0, quantile(diamonds$carat, 0.99))) + scale_y_continuous(lim = c(0, quantile(diamonds$price, 0.99))) + geom_point(color = I('#F79420'), alpha = 1/4) + stat_smooth(method = 'lm')
## Warning: Removed 926 rows containing non-finite values (stat_smooth).
## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_smooth).
Response: nonlinear relationship, exponential or something else. dispersion or variance of the relationship also increases as carat size increases
Notes: The diamonds data set ships with ggplot2 and contains the prices and the specs for more than 50,000
Notes: a man’s sucess in life
Notes:
# install these if necessary
#install.packages('GGally')
#install.packages('scales')
#install.packages('memisc')
#install.packages('lattice')
#install.packages('MASS')
#install.packages('car')
#install.packages('reshape')
#install.packages('plyr')
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
## Warning: package 'GGally' was built under R version 3.2.4
library(scales)
## Warning: package 'scales' was built under R version 3.2.4
library(memisc)
## Warning: package 'memisc' was built under R version 3.2.4
## Loading required package: lattice
## Warning: package 'lattice' was built under R version 3.2.4
## Loading required package: MASS
## Warning: package 'MASS' was built under R version 3.2.4
##
## Attaching package: 'memisc'
## The following object is masked from 'package:scales':
##
## percent
## The following objects are masked from 'package:stats':
##
## contr.sum, contr.treatment, contrasts
## The following object is masked from 'package:base':
##
## as.array
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp, lower = list(continuous = wrap("points", shape = I('.'))), upper = list(combo = wrap("box", outlier.shape = I('.'))))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
What are some things you notice in the ggpairs output? Response: the lower triangle of the plot matrix, it uses grouped histograms for qualitative pairs and scatter plots for quantitative pairs. In the upper triangle, it plots grouped histograms for qualitative pairs, this time using the x instead of the y variable as the grouping factor. Box plots for qualitative pairs, and it provides the correlation for quantitative pairs.
Notes: On the demand side, customers in the market for a less expensive, smaller diamond are probably more sensitive to price than more well-to-do buyers. We shouldn’t expect the market for bigger diamonds to be as competitive as the one for smaller diamonds.
Create two histograms of the price variable and place them side by side on one output image.
We’ve put some code below to get you started.
The first plot should be a histogram of price and the second plot should transform the price variable using log10.
Set appropriate bin widths for each plot. ggtitle() will add a title to each histogram.
library(gridExtra)
plot1 <- qplot(x = price, data = diamonds, binwidth = 100, fill = I('#099DD9')) +
ggtitle('Price')
plot2 <- qplot(x=log10(price+1), data = diamonds, binwidth = 0.01, fill = I('#F79420')) +
ggtitle('Price (log10)')
plot2 <- qplot(x=price, data = diamonds, binwidth = 0.01, fill = I('#F79420')) + scale_x_log10() + ggtitle('Price (log10)')
grid.arrange(plot1, plot2, ncol = 2)
Notes: the prices for diamonds are pretty heavily skewed, but when you put those prices on a log ten scale, they seem much better behaved. They are closer to the bell curve of a normal distribution. We can see little bit of evidence of bimodality on this log ten scale which is consistent with our two class rich buyer poor buyer speculation about the nature of customers for diamonds.
most people buy less expensive diamonds and their relationship is quite linear. Demand for expensive diamonds are very disperse. There is a chasm in price.
qplot (carat, price, data = diamonds) + scale_y_continuous(trans = log10_trans()) + ggtitle('Price (log10) by Carat')
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3), inverse = function(x) x^3)
ggplot(aes(carat, price), data = diamonds) + geom_point() + scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1683 rows containing missing values (geom_point).
head(sort(table(diamonds$carat), desreasing = T))
##
## 2.59 2.64 2.65 2.67 2.7 2.71
## 1 1 1 1 1 1
head(sort(table(diamonds$price), desreasing = T))
##
## 327 334 335 338 339 340
## 1 1 1 1 1 1
ggplot(aes(carat, price), data = diamonds) + geom_point(alpha = 0.5, size = 0.75, position = 'jitter') + scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1691 rows containing missing values (geom_point).
Notes:
Adjust the code below to color the points by clarity.
A layer called scale_color_brewer() has been added to adjust the legend and provide custom colors.
# install and load the RColorBrewer package
#install.packages('RColorBrewer')
library(RColorBrewer)
ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) + geom_point(alpha = 0.5, size = 1, position = 'jitter') + scale_color_brewer(type = 'div', guide = guide_legend(title = 'Clarity', reverse = T, override.aes = list(alpha = 1, size = 2))) + scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
## Warning: Removed 1693 rows containing missing values (geom_point).
Response: Clarity seems to explain a lot of the remaining variance in price, after adding color to our plot. Holding carat weight constant, we see diamonds with lower clarity are almost always cheaper than diamonds with better clarity.
Adjust the code below to color the points by cut. Change any other parts of the code as needed.
ggplot(aes(x = carat, y = price, color = cut), data = diamonds) + geom_point(alpha = 0.5, size = 1, position = 'jitter') + scale_color_brewer(type = 'div', guide = guide_legend(title = 'Cut', reverse = T, override.aes = list(alpha = 1, size = 2))) + scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat and Cut')
## Warning: Removed 1696 rows containing missing values (geom_point).
Response: We don’t see much variation on cut. Most of the diamonds in the data are ideal cut anyway. So, we’ve lost the color pattern that we saw before.
Adjust the code below to color the points by diamond colors and change the titles.
ggplot(aes(x = carat, y = price, color = color), data = diamonds) + geom_point(alpha = 0.5, size = 1, position = 'jitter') + scale_color_brewer(type = 'div', guide = guide_legend(title = 'Color', reverse = F, override.aes = list(alpha = 1, size = 2))) + scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3), breaks = c(0.2, 0.5, 1, 2, 3)) + scale_y_continuous(trans = log10_trans(), limits = c(350, 15000), breaks = c(350, 1000, 5000, 10000, 15000)) + ggtitle('Price (log10) by Cube-Root of Carat and Color')
## Warning: Removed 1688 rows containing missing values (geom_point).
Response: Color seems to explain a lot of the remaining variance in price, after adding color to our plot. However, blue states that the difference between all color grades from D to J are basically not noticeable to the naked eye. Yet, we do see the color difference in the price tag.
Notes: lm( y ~ x), where y is outcome variable and x is explanatory variable
Response: we applied the log transformaiton to our long tailed dollar variable, and we speculated that the flawless diamond should become exponentially rarer as diamond volume increases. so we should be interested in the cube root of carat weight.
Notes: I stands for as is
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
##
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color,
## data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color +
## clarity, data = diamonds)
##
## =========================================================================
## m1 m2 m3 m4 m5
## -------------------------------------------------------------------------
## (Intercept) 2.821*** 1.039*** 0.874*** 0.932*** 0.415***
## (0.006) (0.019) (0.019) (0.017) (0.010)
## I(carat^(1/3)) 5.558*** 8.568*** 8.703*** 8.438*** 9.144***
## (0.007) (0.032) (0.031) (0.028) (0.016)
## carat -1.137*** -1.163*** -0.992*** -1.093***
## (0.012) (0.011) (0.010) (0.006)
## cut: .L 0.224*** 0.224*** 0.120***
## (0.004) (0.004) (0.002)
## cut: .Q -0.062*** -0.062*** -0.031***
## (0.004) (0.003) (0.002)
## cut: .C 0.051*** 0.052*** 0.014***
## (0.003) (0.003) (0.002)
## cut: ^4 0.018*** 0.018*** -0.002
## (0.003) (0.002) (0.001)
## color: .L -0.373*** -0.441***
## (0.003) (0.002)
## color: .Q -0.129*** -0.093***
## (0.003) (0.002)
## color: .C 0.001 -0.013***
## (0.003) (0.002)
## color: ^4 0.029*** 0.012***
## (0.003) (0.002)
## color: ^5 -0.016*** -0.003*
## (0.003) (0.001)
## color: ^6 -0.023*** 0.001
## (0.002) (0.001)
## clarity: .L 0.907***
## (0.003)
## clarity: .Q -0.240***
## (0.003)
## clarity: .C 0.131***
## (0.003)
## clarity: ^4 -0.063***
## (0.002)
## clarity: ^5 0.026***
## (0.002)
## clarity: ^6 -0.002
## (0.002)
## clarity: ^7 0.032***
## (0.001)
## -------------------------------------------------------------------------
## R-squared 0.9 0.9 0.9 1.0 1.0
## adj. R-squared 0.9 0.9 0.9 1.0 1.0
## sigma 0.3 0.3 0.3 0.2 0.1
## F 652012.1 387489.4 138654.5 87959.5 173791.1
## p 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -7962.5 -3631.3 -1837.4 4235.2 34091.3
## Deviance 4242.8 3613.4 3380.8 2699.2 892.2
## AIC 15931.0 7270.6 3690.8 -8442.5 -68140.5
## BIC 15957.7 7306.2 3762.0 -8317.9 -67953.7
## N 53940 53940 53940 53940 53940
## =========================================================================
Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.
Video Notes: To start, our data’s from 2008. So, not only do we need to account for inflation, but the diamond market is quite different now than it was. In fact, when I fit models to this data and predicted the price of the diamonds that I found off a market, I kept getting predictions that were way too low. It was found that global marker was poor. It turns out that prices plummeted in 2008 due to the global financial crisis.
Research: (Take some time to come up with 2-4 problems for the model) (You should 10-20 min on this)
Response:
Notes:
#install.packages('bitops')
#install.packages('RCurl')
library('bitops')
library('RCurl')
#diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
#load(rawConnection(diamondsurl))
The code used to obtain the data is available here: https://github.com/solomonm/diamonds-data
Notes:
load("BigDiamonds.Rda")
diamondsbig$logprice = log(diamondsbig$price)
m1 <- lm(logprice ~ I(carat^(1/3)), data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == "GIA",])
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
##
## Calls:
## m1: lm(formula = logprice ~ I(carat^(1/3)), data = diamondsbig[diamondsbig$price <
## 10000 & diamondsbig$cert == "GIA", ])
## m2: lm(formula = logprice ~ I(carat^(1/3)) + carat, data = diamondsbig[diamondsbig$price <
## 10000 & diamondsbig$cert == "GIA", ])
## m3: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut, data = diamondsbig[diamondsbig$price <
## 10000 & diamondsbig$cert == "GIA", ])
## m4: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut + color,
## data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert ==
## "GIA", ])
## m5: lm(formula = logprice ~ I(carat^(1/3)) + carat + cut + color +
## clarity, data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert ==
## "GIA", ])
##
## =========================================================================
## m1 m2 m3 m4 m5
## -------------------------------------------------------------------------
## (Intercept) 2.671*** 1.333*** 0.949*** 0.529*** -0.464***
## (0.003) (0.012) (0.012) (0.010) (0.009)
## I(carat^(1/3)) 5.839*** 8.243*** 8.633*** 8.110*** 8.320***
## (0.004) (0.022) (0.021) (0.017) (0.012)
## carat -1.061*** -1.223*** -0.782*** -0.763***
## (0.009) (0.009) (0.007) (0.005)
## cut: V.Good 0.120*** 0.090*** 0.071***
## (0.002) (0.001) (0.001)
## cut: Ideal 0.211*** 0.181*** 0.131***
## (0.002) (0.001) (0.001)
## color: K/L 0.123*** 0.117***
## (0.004) (0.003)
## color: J/L 0.312*** 0.318***
## (0.003) (0.002)
## color: I/L 0.451*** 0.469***
## (0.003) (0.002)
## color: H/L 0.569*** 0.602***
## (0.003) (0.002)
## color: G/L 0.633*** 0.665***
## (0.003) (0.002)
## color: F/L 0.687*** 0.723***
## (0.003) (0.002)
## color: E/L 0.729*** 0.756***
## (0.003) (0.002)
## color: D/L 0.812*** 0.827***
## (0.003) (0.002)
## clarity: I1 0.301***
## (0.006)
## clarity: SI2 0.607***
## (0.006)
## clarity: SI1 0.727***
## (0.006)
## clarity: VS2 0.836***
## (0.006)
## clarity: VS1 0.891***
## (0.006)
## clarity: VVS2 0.935***
## (0.006)
## clarity: VVS1 0.995***
## (0.006)
## clarity: IF 1.052***
## (0.006)
## -------------------------------------------------------------------------
## R-squared 0.9 0.9 0.9 0.9 1.0
## adj. R-squared 0.9 0.9 0.9 0.9 1.0
## sigma 0.3 0.3 0.3 0.2 0.2
## F 2700903.7 1406538.3 754405.4 423311.5 521161.4
## p 0.0 0.0 0.0 0.0 0.0
## Log-likelihood -60137.8 -53996.3 -43339.8 37830.4 154124.3
## Deviance 28298.7 27291.5 25628.3 15874.9 7992.7
## AIC 120281.6 108000.5 86691.6 -75632.8 -308204.5
## BIC 120313.8 108043.5 86756.0 -75482.6 -307968.4
## N 338946 338946 338946 338946 338946
## =========================================================================
Example Diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601
#Be sure youâve loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
interval="prediction", level = .95)
exp(modelEstimate)
## fit lwr upr
## 1 5040.436 3730.34 6810.638
Evaluate how well the model predicts the BlueNile diamond’s price. Think about the fitted point estimate as well as the 95% CI.
fit lwr upr
1 5040.436 3730.34 6810.638 ***
Notes: Even though we can predict the price of a diamnond based on a function for c’s. One thing you should not conclude with this exercise is that where you buy your diamond is irrelevant. You almost surely pay more for the same diamond at Tiffany’s compared to Costco. Regardless you can use a model like this to get a sense of whether you were overpaying. One last thing, data and models are never infallible and you can still get taken even equipped with this kind of analysis.
Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!