Initial Visualization

ggplot(diamonds, aes(cut,price)) + geom_boxplot()

ggplot(diamonds, aes(color,price)) + geom_boxplot()

ggplot(diamonds, aes(clarity,price)) + geom_boxplot()

ggplot(diamonds, aes(carat, price)) +
  geom_hex(bins=50)
## Warning: Computation failed in `stat_binhex()`:
## Package `hexbin` required for `stat_binhex`.
## Please install and try again.

Subset Data and replot

diamonds2 <- diamonds %>%
  filter(carat <= 2.5)  %>%
  mutate(lprice = log2(price), lcarat = log2(carat))

ggplot(diamonds2, aes(lcarat, lprice)) +
  geom_hex(bins=50)
## Warning: Computation failed in `stat_binhex()`:
## Package `hexbin` required for `stat_binhex`.
## Please install and try again.

Simple model and visualization

mod_diamond <- lm(lprice ~ lcarat, data = diamonds2)

grid <- diamonds2 %>%
  data_grid(carat = seq_range(carat, 20)) %>%
  mutate(lcarat = log2(carat)) %>%
  add_predictions(mod_diamond, "lprice") %>%
  mutate(price = 2 ^ lprice)

ggplot(diamonds2, aes(carat, price)) +
  geom_hex(bins = 50) +
  geom_line(data = grid, color = "green", size = 1)
## Warning: Computation failed in `stat_binhex()`:
## Package `hexbin` required for `stat_binhex`.
## Please install and try again.

Add residuals and plot

diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond, "lresid")

ggplot(diamonds2, aes(lcarat, lresid)) +
  geom_hex(bins = 50)
## Warning: Computation failed in `stat_binhex()`:
## Package `hexbin` required for `stat_binhex`.
## Please install and try again.

ggplot(diamonds2, aes(cut,lresid)) + geom_boxplot()

ggplot(diamonds2, aes(color,lresid)) + geom_boxplot()

ggplot(diamonds2, aes(clarity,lresid)) + geom_boxplot()

Four parameter model and visualization

mod_diamond2 <- lm(
  lprice ~ lcarat + color + cut + clarity, diamonds2
)

grid <- diamonds2 %>%
  data_grid(cut, .model = mod_diamond2) %>%
  add_predictions(mod_diamond2)
grid
## # A tibble: 5 x 5
##   cut       lcarat color clarity  pred
##   <ord>      <dbl> <chr> <chr>   <dbl>
## 1 Fair      -0.515 G     VS2      11.2
## 2 Good      -0.515 G     VS2      11.3
## 3 Very Good -0.515 G     VS2      11.4
## 4 Premium   -0.515 G     VS2      11.4
## 5 Ideal     -0.515 G     VS2      11.4
ggplot(grid, aes(cut, pred)) +
  geom_point()

Plot residuals of four parameter model

diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond2, "lresid2")

ggplot(diamonds2, aes(lcarat, lresid2)) +
  geom_hex(bins = 50)
## Warning: Computation failed in `stat_binhex()`:
## Package `hexbin` required for `stat_binhex`.
## Please install and try again.

diamonds2 %>%
  filter(abs(lresid2) > 1) %>%
  add_predictions(mod_diamond2) %>%
  mutate(pred = round(2^pred)) %>%
  select(price, pred, carat:table, x:z) %>%
  arrange(price)
## # A tibble: 16 x 11
##    price  pred carat cut       color clarity depth table     x     y     z
##    <int> <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1013   264 0.25  Fair      F     SI2      54.4    64  4.3   4.23  2.32
##  2  1186   284 0.25  Premium   G     SI2      59      60  5.33  5.28  3.12
##  3  1186   284 0.25  Premium   G     SI2      58.8    60  5.33  5.28  3.12
##  4  1262  2644 1.03  Fair      E     I1       78.2    54  5.72  5.59  4.42
##  5  1415   639 0.35  Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  6  1415   639 0.35  Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  7  1715   576 0.32  Fair      F     VS2      59.6    60  4.42  4.34  2.61
##  8  1776   412 0.290 Fair      F     SI1      55.8    60  4.48  4.41  2.48
##  9  2160   314 0.34  Fair      F     I1       55.8    62  4.72  4.6   2.6 
## 10  2366   774 0.3   Very Good D     VVS2     60.6    58  4.33  4.35  2.63
## 11  3360  1373 0.51  Premium   F     SI1      62.7    62  5.09  4.96  3.15
## 12  3807  1540 0.61  Good      F     SI2      62.5    65  5.36  5.29  3.33
## 13  3920  1705 0.51  Fair      F     VVS2     65.4    60  4.98  4.9   3.23
## 14  4368  1705 0.51  Fair      F     VVS2     60.7    66  5.21  5.11  3.13
## 15 10011  4048 1.01  Fair      D     SI2      64.6    58  6.25  6.2   4.02
## 16 10470 23622 2.46  Premium   E     SI2      59.7    59  8.82  8.76  5.25

Question #1

In the plot of lcarat vs. lprice, there are some bright vertical strips. What do they represent?

vertical strips means same x (or same carat), and bright means more counts; and in this case, bright vertical strips means that the diamonds in the market usually have fixed weight (or carat).

Question #2

If log(price) = a_0 + a_1 * log(carat), what does that say about the relationship between price and carat?

this linear relationship between log(price) and log(carat) indicats that there exists an exponential relationship betweeen price and carat. i.e., the price of the diamond increases exponentially with the increase of the carat of the diamond.

Question #3

Extract the diamonds that have very high and very low residuals. Is there anything unusual about these diamonds? Are they particularly bad or good, or do you think these are pricing errors?

# Use this chunk to place your code for extracting the high and low residuals
high_resid <- diamonds2 %>%
  filter(abs(lresid2) >= 1) %>%
  add_predictions(mod_diamond2) %>%
  mutate(pred = round(2^pred)) %>%
  select(carat, price, pred) %>%
  arrange(carat)
high_resid
## # A tibble: 16 x 3
##    carat price  pred
##    <dbl> <int> <dbl>
##  1 0.25   1013   264
##  2 0.25   1186   284
##  3 0.25   1186   284
##  4 0.290  1776   412
##  5 0.3    2366   774
##  6 0.32   1715   576
##  7 0.34   2160   314
##  8 0.35   1415   639
##  9 0.35   1415   639
## 10 0.51   3360  1373
## 11 0.51   3920  1705
## 12 0.51   4368  1705
## 13 0.61   3807  1540
## 14 1.01  10011  4048
## 15 1.03   1262  2644
## 16 2.46  10470 23622

seems most of the diamonds with extremely high or low residuals are over-priced.

Question #4

Does the final model, mod_diamonds2, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond and why?

# Use this chunk to place your code for assessing how well the model predicts diamond prices

summary(mod_diamond2)
## 
## Call:
## lm(formula = lprice ~ lcarat + color + cut + clarity, data = diamonds2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.17388 -0.12437 -0.00094  0.11920  2.78322 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 12.206978   0.001693 7211.806  < 2e-16 ***
## lcarat       1.886239   0.001124 1677.809  < 2e-16 ***
## color.L     -0.633998   0.002910 -217.872  < 2e-16 ***
## color.Q     -0.137580   0.002676  -51.409  < 2e-16 ***
## color.C     -0.022072   0.002503   -8.819  < 2e-16 ***
## color^4      0.016570   0.002297    7.213 5.54e-13 ***
## color^5     -0.002828   0.002169   -1.304    0.192    
## color^6      0.003533   0.001971    1.793    0.073 .  
## cut.L        0.173866   0.003386   51.349  < 2e-16 ***
## cut.Q       -0.050346   0.002980  -16.897  < 2e-16 ***
## cut.C        0.019129   0.002583    7.407 1.31e-13 ***
## cut^4       -0.002410   0.002066   -1.166    0.243    
## clarity.L    1.308155   0.005179  252.598  < 2e-16 ***
## clarity.Q   -0.334090   0.004839  -69.047  < 2e-16 ***
## clarity.C    0.178423   0.004140   43.093  < 2e-16 ***
## clarity^4   -0.088059   0.003298  -26.697  < 2e-16 ***
## clarity^5    0.035885   0.002680   13.389  < 2e-16 ***
## clarity^6   -0.001371   0.002327   -0.589    0.556    
## clarity^7    0.048221   0.002051   23.512  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1916 on 53795 degrees of freedom
## Multiple R-squared:  0.9828, Adjusted R-squared:  0.9828 
## F-statistic: 1.706e+05 on 18 and 53795 DF,  p-value: < 2.2e-16
mod2_stdres=rstandard(mod_diamond2)
qqnorm(mod2_stdres,
       ylab="Standardized Residuals",
       xlab="Normal Scores",
       main="Normal Q-Q")
qqline(mod2_stdres)

hist(mod2_stdres, xlim=c(-3,3), breaks=50, main="Histogram of Residuals")

From the summary, the adjusted R-squared is 0.983 with a p-value < 2.2e-16, we can tell the model is a good prediction; and from the qq plot and residual histogram we can tell that the residuals are approximately normally distributed.