library(ggplot2)
data(diamonds)
str(diamonds)
## Classes 'tbl_df', 'tbl' and 'data.frame': 53940 obs. of 10 variables:
## $ carat : num 0.23 0.21 0.23 0.29 0.31 0.24 0.24 0.26 0.22 0.23 ...
## $ cut : Ord.factor w/ 5 levels "Fair"<"Good"<..: 5 4 2 4 2 3 3 3 1 3 ...
## $ color : Ord.factor w/ 7 levels "D"<"E"<"F"<"G"<..: 2 2 2 6 7 7 6 5 2 5 ...
## $ clarity: Ord.factor w/ 8 levels "I1"<"SI2"<"SI1"<..: 2 3 5 4 2 6 7 3 4 5 ...
## $ depth : num 61.5 59.8 56.9 62.4 63.3 62.8 62.3 61.9 65.1 59.4 ...
## $ table : num 55 61 65 58 58 57 57 55 61 61 ...
## $ price : int 326 326 327 334 335 336 336 337 337 338 ...
## $ x : num 3.95 3.89 4.05 4.2 4.34 3.94 3.95 4.07 3.87 4 ...
## $ y : num 3.98 3.84 4.07 4.23 4.35 3.96 3.98 4.11 3.78 4.05 ...
## $ z : num 2.43 2.31 2.31 2.63 2.75 2.48 2.47 2.53 2.49 2.39 ...
The scatterplot is a powerful tool to help you understand the relationship between two continuous variables.
# Let's start by examining two variables in the data set.
# We can quickly see if the relationship is linear or not.
# In this case, we can use a variety of diamond
# characteristics to help us figure out whether
# the price advertised for any given diamond is
# reasonable or a rip-off.
# Let's consider the price of a diamond and it's carat weight.
# Create a scatterplot of price (y) vs carat weight (x).
# Limit the x-axis and y-axis to omit the top 1% of values.
# ENTER YOUR CODE BELOW THIS LINE
# ================================================================
ggplot(diamonds,(aes(x = carat, y = price)))+
geom_point(fill = I('#F79420'),color = I('red'),shape= 21)+
xlim(0,quantile(diamonds$carat,.99))+
ylim(0,quantile(diamonds$price,.99))
## Warning: Removed 926 rows containing missing values (geom_point).
Response:
there is non linear relationship between both variables, maybe its exponential or else.
Variance or dispersion increases as the carat size increases.
ggplot(diamonds,(aes(x = carat, y = price)))+
geom_point(fill = I('#F79420'),color = I('red'),shape= 21)+
stat_smooth(method = lm)+
xlim(0,quantile(diamonds$carat,.99))+
ylim(0,quantile(diamonds$price,.99))
## Warning: Removed 926 rows containing non-finite values (stat_smooth).
## Warning: Removed 926 rows containing missing values (geom_point).
## Warning: Removed 4 rows containing missing values (geom_smooth).
Notes:
https://s3.amazonaws.com/udacity-hosted-downloads/ud651/ggpairs_landscape.pdf
Notes: using log scale for price
# Create two histograms of the price variable
# and place them side by side on one output image.
# The first plot should be a histogram of price
# and the second plot should transform
# the price variable using log10.
# Set appropriate bin widths for each plot.
# ggtitle() will add a title to each histogram.
library(gridExtra)
p1 <- ggplot(diamonds,(aes(x = price)))+
geom_histogram(binwidth = 100, fill = I('#099DD9'))+
ggtitle('Diamond prices')
p2 = ggplot(diamonds,(aes(x = log10(price))))+
geom_histogram(binwidth = 0.05, fill = I('#F79420'))+
ggtitle('log Price')
grid.arrange(p1,p2,ncol = 2)
The two peaks shows the type of buyer as expected. rich buyer and poor buyer. log 10 scale has bimodal and bell curve type of distribution, which is better in behavior. ***
Notes: On the log scale, the prices look less dispersed at the high end of Carat size and price, but actually we can do better. Let’s try using the cube root of Carat in light of our speculation about flaws being exponentially more likely in diamonds with more volume ***
ggplot(diamonds,(aes(x = carat, y = log10(price))))+
geom_point(fill = I('#F79420'),color = I('brown'),shape= 21)+
ggtitle('Carat vs (log10)Prices')
cuberoot_trans = function() trans_new('cuberoot',
transform = function(x) x^(1/3),
inverse = function(x) x^3)
ggplot(aes(carat, price), data = diamonds) +
geom_point()+
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1683 rows containing missing values (geom_point).
With these transformations that we used to get our data on this nice scale. Things look almost linear. We can now move forward and see about modelling our data using just a linear model.
When multiple points are taken on same value. to correct them
add transperency
add jitter points, adjust the shape
additional info by adding layer of geom_smooth or geom_quantile or geom_density_2D
head(sort(table(diamonds$carat), decreasing = TRUE))
##
## 0.3 0.31 1.01 0.7 0.32 1
## 2604 2249 2242 1981 1840 1558
head(sort(table(diamonds$price), decreasing = TRUE))
##
## 605 802 625 828 776 698
## 132 127 126 125 124 121
# Add a layer to adjust the features of the
# scatterplot. Set the transparency to one half,
# the size to three-fourths, and jitter the points.
ggplot(aes(carat, price), data = diamonds) +
geom_point(alpha = 0.5, size = 0.75, position ='jitter') +
geom_count()+
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
## Warning: Removed 1683 rows containing non-finite values (stat_sum).
## Warning: Removed 1690 rows containing missing values (geom_point).
Alter the code below.
# Adjust the code below to color the points by clarity.
# A layer called scale_color_brewer() has
# been added to adjust the legend and
# provide custom colors.
ggplot(aes(x = carat, y = price, color = clarity), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
## Warning: Removed 1694 rows containing missing values (geom_point).
Alter the code below.
# Adjust the code below to color the points by cut.
# Change any other parts of the code as needed.
ggplot(aes(x = carat, y = price, color = color), data = diamonds) +
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Color', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat and Color')
## Warning: Removed 1690 rows containing missing values (geom_point).
Response: yes . a pattern is observed in the color and price relation.
Notes:
Response:
Notes: sdigits = 3 This will ensure that the output at the end of the table shows three significant digits like shown in the video.
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5,sdigits = 3)
##
## Calls:
## m1: lm(formula = I(log(price)) ~ I(carat^(1/3)), data = diamonds)
## m2: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat, data = diamonds)
## m3: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut, data = diamonds)
## m4: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color,
## data = diamonds)
## m5: lm(formula = I(log(price)) ~ I(carat^(1/3)) + carat + cut + color +
## clarity, data = diamonds)
##
## ============================================================================================
## m1 m2 m3 m4 m5
## --------------------------------------------------------------------------------------------
## (Intercept) 2.821*** 1.039*** 0.874*** 0.932*** 0.415***
## (0.006) (0.019) (0.019) (0.017) (0.010)
## I(carat^(1/3)) 5.558*** 8.568*** 8.703*** 8.438*** 9.144***
## (0.007) (0.032) (0.031) (0.028) (0.016)
## carat -1.137*** -1.163*** -0.992*** -1.093***
## (0.012) (0.011) (0.010) (0.006)
## cut: .L 0.224*** 0.224*** 0.120***
## (0.004) (0.004) (0.002)
## cut: .Q -0.062*** -0.062*** -0.031***
## (0.004) (0.003) (0.002)
## cut: .C 0.051*** 0.052*** 0.014***
## (0.003) (0.003) (0.002)
## cut: ^4 0.018*** 0.018*** -0.002
## (0.003) (0.002) (0.001)
## color: .L -0.373*** -0.441***
## (0.003) (0.002)
## color: .Q -0.129*** -0.093***
## (0.003) (0.002)
## color: .C 0.001 -0.013***
## (0.003) (0.002)
## color: ^4 0.029*** 0.012***
## (0.003) (0.002)
## color: ^5 -0.016*** -0.003*
## (0.003) (0.001)
## color: ^6 -0.023*** 0.001
## (0.002) (0.001)
## clarity: .L 0.907***
## (0.003)
## clarity: .Q -0.240***
## (0.003)
## clarity: .C 0.131***
## (0.003)
## clarity: ^4 -0.063***
## (0.002)
## clarity: ^5 0.026***
## (0.002)
## clarity: ^6 -0.002
## (0.002)
## clarity: ^7 0.032***
## (0.001)
## --------------------------------------------------------------------------------------------
## R-squared 0.924 0.935 0.939 0.951 0.984
## adj. R-squared 0.924 0.935 0.939 0.951 0.984
## sigma 0.280 0.259 0.250 0.224 0.129
## F 652012.063 387489.366 138654.523 87959.467 173791.084
## p 0.000 0.000 0.000 0.000 0.000
## Log-likelihood -7962.499 -3631.319 -1837.416 4235.240 34091.272
## Deviance 4242.831 3613.360 3380.837 2699.212 892.214
## AIC 15930.999 7270.637 3690.832 -8442.481 -68140.544
## BIC 15957.685 7306.220 3761.997 -8317.942 -67953.736
## N 53940 53940 53940 53940 53940
## ============================================================================================
Notice how adding cut to our model does not help explain much of the variance in the price of diamonds. This fits with out exploration earlier.
Video Notes:
Research: (Take some time to come up with 2-4 problems for the model) (You should 10-20 min on this)
Response:
diamondsbig<- load(“BigDiamonds.Rda”) getwd()
Notes:
diamondsbig\(logprice <- log(diamondsbig\)price) m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamondsbig[diamondsbig\(price < 10000 & diamondsbig\)cert == “GIA”,]) m2 <- update(m1, ~ . + carat) m3 <- update(m2, ~ . + cut) m4 <- update(m3, ~ . + color) m5 <- update(m4, ~ . + clarity) mtable(m1, m2, m3, m4, m5,sdigits = 3)
suppressMessages(library(lattice)) suppressMessages(library(MASS)) suppressMessages(library(memisc)) models <- mtable(m1, m2, m3, m4, m5) ```
Example Diamond from BlueNile: Round 1.00 Very Good I VS1 $5,601
thisDiamond = data.frame(carat = 1.00, cut = “V.Good”, color = “I”, clarity=“VS1”) modelEstimate = predict(m5, newdata = thisDiamond, interval=“prediction”, level = .95)
Evaluate how well the model predicts the BlueNile diamond’s price. Think about the fitted point estimate as well as the 95% CI.