Notes:
# Colors
# #8224e3 -- Purple
# #fcae3a -- Orange
# #359bed -- Blue
# #81d742 -- Green
# #dd3333 -- Red
library("ggplot2")
library(xkcd)
data(diamonds)
# Let's start by examining two variables in the data set.
# The scatterplot is a powerful tool to help you understand
# the relationship between two continuous variables.
# We can quickly see if the relationship is linear or not.
# In this case, we can use a variety of diamond
# characteristics to help us figure out whether
# the price advertised for any given diamond is
# reasonable or a rip-off.
# Let's consider the price of a diamond and it's carat weight.
# Create a scatterplot of price (y) vs carat weight (x).
# Limit the x-axis and y-axis to omit the top 1% of values.
names(diamonds)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
ggplot(data=diamonds, aes(x=carat, y=price)) +
# get rid of top percentile
scale_x_continuous(lim=c(0,quantile(diamonds$carat,0.99))) +
scale_y_continuous(lim=c(0,quantile(diamonds$price,0.99))) +
geom_point(fill=I('#dd3333'), color= I("black"), aes(alpha=1/10),shape=21) +
stat_smooth(method='lm') +
theme_xkcd()
Response:
ggplot(data=diamonds, aes(x=carat, y=price)) +
# get rid of top percentile
scale_x_continuous(lim=c(0,quantile(diamonds$carat,0.99))) +
scale_y_continuous(lim=c(0,quantile(diamonds$price,0.99))) +
geom_point(color=I('#dd3333'),alpha=1/10) +
stat_smooth(method='lm') +
ggtitle("Linear Fit of Carat Weight to Price") +
theme_xkcd()
Notes: This is the woman that created the rise of the diamond.
Forever!
Notes: Frances Gerety wrote “A Diamond Is Forever” for an advertising campaign for De Beers in 1947, and wrote all of the company’s ads for 25 years.
Notes:
# install these if necessary
install.packages('GGally')
install.packages('scales')
install.packages('memisc')
install.packages('lattice')
install.packages('MASS')
install.packages('car')
install.packages('reshape')
install.packages('plyr')
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
library(scales)
library(memisc)
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp, outlier.shape = I('.')) # params = c(shape = I('.'),
What are some things you notice in the ggpairs output? Response:
Notes: We speculated that there are two tiers of demands, those purchased by those who feel obligateddue to a woman’s expectation of a diamond for a marriage proposal who would otherwise never purchase a diamond, and those who are especially wealthy and purchase a large diamond because they have the means to afford it.
The log base 10 scale reveals this information of bimodality.
# Create two histograms of the price variable
# and place them side by side on one output image.
# We’ve put some code below to get you started.
# The first plot should be a histogram of price
# and the second plot should transform
# the price variable using log10.
# Set appropriate bin widths for each plot.
# ggtitle() will add a title to each histogram.
# You can self-assess your work with the plots
# in the solution video.
# ALTER THE CODE BELOW THIS LINE
# ==============================================
library(gridExtra)
plot1 <- ggplot(data=diamonds, aes(x=price)) +
geom_histogram(fill=I("#81d742"), color=I("black"),binwidth = 200) +
ggtitle('Price') +
theme_xkcd()
plot2 <- ggplot(data=diamonds, aes(x=price)) +
geom_histogram(fill=I("#359bed"), color=I("black"),binwidth = 0.02) +
scale_x_log10() +
ggtitle('Price - Log10 Tranformation') +
xlab("Price Log 10") +
theme_xkcd()
grid.arrange(plot1,plot2, ncol=2)
Notes:
Here we see the smaller variance in higher priced diamonds.
ggplot(data=diamonds, aes(x=carat, y=price)) +
geom_point(color=I("#359bed"), alpha=1/50) +
scale_y_continuous(trans=log10_trans()) +
ggtitle('Price - Log10 Tranformation') +
xlab("Carat") +
ylab("Price Log 10") +
theme_xkcd()
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
inverse = function(x) x^3)
ggplot(aes(carat, price), data = diamonds) +
geom_point() +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price log10 by Cube-Root of Carat') +
theme_xkcd()
head(sort(table(diamonds$carat),decreasing = TRUE))
##
## 0.3 0.31 1.01 0.7 0.32 1
## 2604 2249 2242 1981 1840 1558
head(sort(table(diamonds$price),decreasing = TRUE))
##
## 605 802 625 828 776 698
## 132 127 126 125 124 121
# Add a layer to adjust the features of the
# scatterplot. Set the transparency to one half,
# the size to three-fourths, and jitter the points.
# If you need hints, see the Instructor Notes.
# There are three hints so scroll down slowly if
# you don’t want all the hints at once.
ggplot(aes(carat, price), data = diamonds) +
geom_point(alpha=1/5,size=.75,color="#8224e3",position= "jitter") +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat') +
theme_xkcd()
Notes:
Alter the code below.
# install and load the RColorBrewer package
install.packages('RColorBrewer')
library(RColorBrewer)
ggplot(aes(x = carat, y = price,color=clarity), data = diamonds) +
geom_point(alpha = 0.2, size = .75, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price log10 by Cube-Root of Carat and Clarity') +
theme_xkcd()
Response: Damn! Look at the trends in the clarity. The better the clarity, the higher the price. There is a very clear correlation here. Filtering out each clarity level you could run a model on each level and get a very accurate outcome. ***
Alter the code below.
ggplot(aes(x = carat, y = price, color = cut), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Cut', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Cut') +
theme_xkcd()
Response: There doesn’t seem to be that many poorly cut diamonds. I don’t think cut makes as big of a difference since so many still show how variance in price.
The ideals dominate the cut, and the variance is large amongst each category of cut. ***
Alter the code below.
ggplot(aes(x = carat, y = price, color = color), data = diamonds) +
geom_point(alpha = 0.5, size = .75, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = "Color", reverse = FALSE,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Color') +
theme_xkcd()
Response: Here we see more of what we are looking for, a small variance in the price distribution of color.
Meaning in other words, the price and color have a strong correlation. ***
Notes:
We can create predictive models in R using the lm() function. Here is the help for the lm function:
?lm
In my explanation, derived from the EDA course, we need to provide a formula in the form of lm( y ~ x), where Y is the
Which of the formulas would we ue inside the lm() function?
I would say the last one because that is the relationship that was linear in nature.
As the course states: Nice! Price is the outcome and carat is the predictor variable. We used our domain knowledge of diamonds and carat weight to take the cube root of carat weight (volume).
Response:
Notes:
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
# create our first model. The I() wrapper uses the expression inside the I function to transform the variable before using it in the regression, rather than instruction R to interpret these symbols as part of the formula to construct a design matrix for the regression.
m2 <- update(m1, ~ . + carat)
# We can update the previous model by adding carat to the explanatory variables to consider more parameters and more accurately predict our outcome.
m3 <- update(m2, ~ . + cut) #lets add cut even though that doesn't do much.
m4 <- update(m3, ~ . + color) #
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
##
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color,
## data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color +
## clarity, data = diamonds)
##
## ======================================================================
## m1 m2 m3 m4 m5
## ----------------------------------------------------------------------
## (Intercept) 2.821*** 1.039*** 0.874*** 0.932*** 0.415***
## (0.006) (0.019) (0.019) (0.017) (0.010)
## I(carat^(1/3)) 5.558*** 8.568*** 8.703*** 8.438*** 9.144***
## (0.007) (0.032) (0.031) (0.028) (0.016)
## carat -1.137*** -1.163*** -0.992*** -1.093***
## (0.012) (0.011) (0.010) (0.006)
## cut: .L 0.224*** 0.224*** 0.120***
## (0.004) (0.004) (0.002)
## cut: .Q -0.062*** -0.062*** -0.031***
## (0.004) (0.003) (0.002)
## cut: .C 0.051*** 0.052*** 0.014***
## (0.003) (0.003) (0.002)
## cut: ^4 0.018*** 0.018*** -0.002
## (0.003) (0.002) (0.001)
## color: .L -0.373*** -0.441***
## (0.003) (0.002)
## color: .Q -0.129*** -0.093***
## (0.003) (0.002)
## color: .C 0.001 -0.013***
## (0.003) (0.002)
## color: ^4 0.029*** 0.012***
## (0.003) (0.002)
## color: ^5 -0.016*** -0.003*
## (0.003) (0.001)
## color: ^6 -0.023*** 0.001
## (0.002) (0.001)
## clarity: .L 0.907***
## (0.003)
## clarity: .Q -0.240***
## (0.003)
## clarity: .C 0.131***
## (0.003)
## clarity: ^4 -0.063***
## (0.002)
## clarity: ^5 0.026***
## (0.002)
## clarity: ^6 -0.002
## (0.002)
## clarity: ^7 0.032***
## (0.001)
## ----------------------------------------------------------------------
## R-squared 0.924 0.935 0.939 0.951 0.984
## adj. R-squared 0.924 0.935 0.939 0.951 0.984
## sigma 0.280 0.259 0.250 0.224 0.129
## F 652012.063 387489.366 138654.523 87959.467 173791.084
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -7962.499 -3631.319 -1837.416 4235.240 34091.272
## Deviance 4242.831 3613.360 3380.837 2699.212 892.214
## AIC 15930.999 7270.637 3690.832 -8442.481 -68140.544
## BIC 15957.685 7306.220 3761.997 -8317.942 -67953.736
## N 53940 53940 53940 53940 53940
## ======================================================================
Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.
the R-squared coefficient is the key variable here, and notice how we are accounting for almost all of the price variance by using the relevant models.
Here is the output: R-squared 0.924 0.935 0.939 0.951 0.984
Video Notes:
Our model is: ln(price) = 0.415 + 9.144(carat^(1/3)) - 1.093carat + (…* cut + …* color + …* clarity) + error coefficient.
Research: (Take some time to come up with 2-4 problems for the model) (You should 10-20 min on this)
Response: This data is from 2008, we need to account for inflation. Additionally, we’re out of the global recession. -Diamond Market in china is heating up. -Uneven price increases across carat size, which means we need to retrain the model rather than simply use a multiplier to account for inflation / price increases. ***
Notes:
install.packages('bitops')
install.packages('RCurl')
library('bitops')
library('RCurl')
#these don't actually work, why?
#-------------------------------
#diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
#load(rawConnection(diamondsurl))
#-------------------------------
# Edit, from http://stackoverflow.com/questions/24846120/importing-data-into-r-rdata-from-github
# Note that the https:// URL scheme is not supported except on Windows.
#-------------------------------
# we can use this instead
diamondsurl = "https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda"
load(url(diamondsurl))
# or this
#download.file(diamondsurl,"BigDiamonds.rda")
#load("BigDiamonds.rda")
The code used to obtain the data is available here: https://github.com/solomonm/diamonds-data
Notes:
# Your task is to build five linear models like Solomon
# did for the diamonds data set only this
# time you'll use a sample of diamonds from the
# diamondsbig data set.
# Be sure to make use of the same variables
# (logprice, carat, etc.) and model
# names (m1, m2, m3, m4, m5).
# To get the diamondsbig data into RStudio
# on your machine, copy, paste, and run the
# code in the Instructor Notes. There's
# 598,024 diamonds in this data set!
# Since the data set is so large,
# you are going to use a sample of the
# data set to compute the models. You can use
# the entire data set on your machine which
# will produce slightly different coefficients
# and statistics for the models.
# This exercise WILL BE automatically graded.
# You can leave off the code to load in the data.
# We've sampled the data for you.
# You also don't need code to create the table output of the models.
# We'll do that for you and check your model summaries (R^2 values, AIC, etc.)
# Your task is to write the code to create the models.
#m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data=diamondsbig)
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data=diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert == "GIA",]) # lets remove outliers
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1,m2,m3,m4,m5)
##
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamondsbig[diamondsbig$price <
## 10000 & diamondsbig$cert == "GIA", ])
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamondsbig[diamondsbig$price <
## 10000 & diamondsbig$cert == "GIA", ])
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamondsbig[diamondsbig$price <
## 10000 & diamondsbig$cert == "GIA", ])
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color,
## data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert ==
## "GIA", ])
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color +
## clarity, data = diamondsbig[diamondsbig$price < 10000 & diamondsbig$cert ==
## "GIA", ])
##
## ===========================================================================
## m1 m2 m3 m4 m5
## ---------------------------------------------------------------------------
## (Intercept) 2.671*** 1.333*** 0.949*** 0.529*** -0.464***
## (0.003) (0.012) (0.012) (0.010) (0.009)
## I(carat^(1/3)) 5.839*** 8.243*** 8.633*** 8.110*** 8.320***
## (0.004) (0.022) (0.021) (0.017) (0.012)
## carat -1.061*** -1.223*** -0.782*** -0.763***
## (0.009) (0.009) (0.007) (0.005)
## cut: V.Good 0.120*** 0.090*** 0.071***
## (0.002) (0.001) (0.001)
## cut: Ideal 0.211*** 0.181*** 0.131***
## (0.002) (0.001) (0.001)
## color: K/L 0.123*** 0.117***
## (0.004) (0.003)
## color: J/L 0.312*** 0.318***
## (0.003) (0.002)
## color: I/L 0.451*** 0.469***
## (0.003) (0.002)
## color: H/L 0.569*** 0.602***
## (0.003) (0.002)
## color: G/L 0.633*** 0.665***
## (0.003) (0.002)
## color: F/L 0.687*** 0.723***
## (0.003) (0.002)
## color: E/L 0.729*** 0.756***
## (0.003) (0.002)
## color: D/L 0.812*** 0.827***
## (0.003) (0.002)
## clarity: I1 0.301***
## (0.006)
## clarity: SI2 0.607***
## (0.006)
## clarity: SI1 0.727***
## (0.006)
## clarity: VS2 0.836***
## (0.006)
## clarity: VS1 0.891***
## (0.006)
## clarity: VVS2 0.935***
## (0.006)
## clarity: VVS1 0.995***
## (0.006)
## clarity: IF 1.052***
## (0.006)
## ---------------------------------------------------------------------------
## R-squared 0.888 0.892 0.899 0.937 0.969
## adj. R-squared 0.888 0.892 0.899 0.937 0.969
## sigma 0.289 0.284 0.275 0.216 0.154
## F 2700903.714 1406538.330 754405.425 423311.488 521161.443
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -60137.791 -53996.269 -43339.818 37830.414 154124.270
## Deviance 28298.689 27291.534 25628.285 15874.910 7992.720
## AIC 120281.582 108000.539 86691.636 -75632.827 -308204.540
## BIC 120313.783 108043.473 86756.037 -75482.557 -307968.400
## N 338946 338946 338946 338946 338946
## ===========================================================================
Example Diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601 ^^These are the true values, lets pop it into our model. We need to exponentiate our model results since we used the log10(price) rather than price itself.
#Be sure you’ve loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
color = "I", clarity="VS1")
# data.Frame creates a data frame, we created a dataframe with one value, thisDiamond
modelEstimate = predict(m5, newdata = thisDiamond,
interval="prediction", level = .95)
exp(modelEstimate) # this will give us the actual price because our model outputs log 10.
## fit lwr upr
## 1 5040.436 3730.34 6810.638
# parameters
# level = gives us the lower and upper bounds based on our confidence level
# -- ie, there is a 95% chance that the diamond price is within this range
#
Evaluate how well the model predicts the BlueNile diamond’s price. Think about the fitted point estimate as well as the 95% CI.
Notes:
Click KnitHTML to see all of your hard work and to have an html page of this lesson, your answers, and your notes!