Initial Visualization

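library(tidyverse)  # ggplot2 (with the diamonds data), dplyr, and the %>% pipe
library(modelr)     # data_grid(), seq_range(), add_predictions(), add_residuals()
# geom_hex() also needs the hexbin package to be installed

# Price against each categorical predictor
ggplot(diamonds, aes(cut, price)) + geom_boxplot()

ggplot(diamonds, aes(color, price)) + geom_boxplot()

ggplot(diamonds, aes(clarity, price)) + geom_boxplot()

# Price against carat, binned into hexagons to handle overplotting
ggplot(diamonds, aes(carat, price)) +
  geom_hex(bins = 50)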

Subset Data and replot

diamonds2 <- diamonds %>%
  filter(carat <= 2.5) %>%
  mutate(lprice = log2(price), lcarat = log2(carat))

ggplot(diamonds2, aes(lcarat, lprice)) +
  geom_hex(bins = 50)

Simple model and visualization

mod_diamond <- lm(lprice ~ lcarat, data = diamonds2)

grid <- diamonds2 %>%
  data_grid(carat = seq_range(carat, 20)) %>%
  mutate(lcarat = log2(carat)) %>%
  add_predictions(mod_diamond, "lprice") %>%
  mutate(price = 2 ^ lprice)

ggplot(diamonds2, aes(carat, price)) +
  geom_hex(bins = 50) +
  geom_line(data = grid, color = "green", linewidth = 1)  # linewidth replaces size for lines in ggplot2 >= 3.4.0

Add residuals and plot

diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond, "lresid")

ggplot(diamonds2, aes(lcarat, lresid)) +
  geom_hex(bins = 50)

ggplot(diamonds2, aes(cut, lresid)) + geom_boxplot()

ggplot(diamonds2, aes(color, lresid)) + geom_boxplot()

ggplot(diamonds2, aes(clarity, lresid)) + geom_boxplot()

Four-predictor model and visualization

mod_diamond2 <- lm(
  lprice ~ lcarat + color + cut + clarity, diamonds2
)

grid <- diamonds2 %>%
  data_grid(cut, .model = mod_diamond2) %>%
  add_predictions(mod_diamond2)
grid
## # A tibble: 5 x 5
##   cut       lcarat color clarity  pred
##   <ord>      <dbl> <chr> <chr>   <dbl>
## 1 Fair      -0.515 G     VS2      11.2
## 2 Good      -0.515 G     VS2      11.3
## 3 Very Good -0.515 G     VS2      11.4
## 4 Premium   -0.515 G     VS2      11.4
## 5 Ideal     -0.515 G     VS2      11.4
ggplot(grid, aes(cut, pred)) +
  geom_point()

Plot residuals of four-predictor model

diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond2, "lresid2")

ggplot(diamonds2, aes(lcarat, lresid2)) +
  geom_hex(bins = 50)

diamonds2 %>%
  filter(abs(lresid2) > 1) %>%
  add_predictions(mod_diamond2) %>%
  mutate(pred = round(2^pred)) %>%
  select(price, pred, carat:table, x:z) %>%
  arrange(price)
## # A tibble: 16 x 11
##    price  pred carat cut       color clarity depth table     x     y     z
##    <int> <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1013   264 0.25  Fair      F     SI2      54.4    64  4.3   4.23  2.32
##  2  1186   284 0.25  Premium   G     SI2      59      60  5.33  5.28  3.12
##  3  1186   284 0.25  Premium   G     SI2      58.8    60  5.33  5.28  3.12
##  4  1262  2644 1.03  Fair      E     I1       78.2    54  5.72  5.59  4.42
##  5  1415   639 0.35  Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  6  1415   639 0.35  Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  7  1715   576 0.32  Fair      F     VS2      59.6    60  4.42  4.34  2.61
##  8  1776   412 0.290 Fair      F     SI1      55.8    60  4.48  4.41  2.48
##  9  2160   314 0.34  Fair      F     I1       55.8    62  4.72  4.6   2.6 
## 10  2366   774 0.3   Very Good D     VVS2     60.6    58  4.33  4.35  2.63
## 11  3360  1373 0.51  Premium   F     SI1      62.7    62  5.09  4.96  3.15
## 12  3807  1540 0.61  Good      F     SI2      62.5    65  5.36  5.29  3.33
## 13  3920  1705 0.51  Fair      F     VVS2     65.4    60  4.98  4.9   3.23
## 14  4368  1705 0.51  Fair      F     VVS2     60.7    66  5.21  5.11  3.13
## 15 10011  4048 1.01  Fair      D     SI2      64.6    58  6.25  6.2   4.02
## 16 10470 23622 2.46  Premium   E     SI2      59.7    59  8.82  8.76  5.25

Question #1

In the plot of lcarat vs. lprice, there are some bright vertical strips. What do they represent?

The bright stripes mark the most popular carat sizes and the range of prices at which they sell. Carat weights are not uniformly distributed: diamonds cluster at round numbers because those are easier to sell, so many observations share exactly the same lcarat value and stack into a vertical stripe. Similarly, diamonds of similar carat size tend to have approximately the same rounded price.
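One way to check which carat weights produce the stripes is to count how often each exact weight occurs. This is a minimal sketch, assuming the tidyverse is installed; `carat_counts` is a name introduced here for illustration only.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds data set

# Count how often each exact carat weight occurs, most frequent first
carat_counts <- diamonds %>%
  count(carat, sort = TRUE)

# Show the ten most common carat weights
head(carat_counts, 10)
```

Weights that appear near the top of this table are the ones that should line up with the bright stripes in the hex plot.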

Question #2

If log(price) = a_0 + a_1 * log(carat), what does that say about the relationship between price and carat?

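It means that price follows a power law in carat: the relationship is linear on the log-log scale, so price is proportional to carat raised to the power a_1. A 1% increase in carat is associated with approximately an a_1% increase in price; the increase is 1% only when a_1 = 1.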
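The relationship can be made explicit by exponentiating both sides. The short derivation below assumes natural logarithms; with log2, as used in the code in this document, the constant becomes 2^{a_0} but the exponent a_1 is unchanged.

```latex
\begin{align*}
\log(\text{price}) &= a_0 + a_1 \log(\text{carat}) \\
\text{price} &= e^{a_0} \cdot \text{carat}^{a_1} \\
\frac{d\,\text{price}}{\text{price}} &= a_1 \, \frac{d\,\text{carat}}{\text{carat}}
\end{align*}
```

The last line shows that a_1 acts as an elasticity: a small relative change in carat produces roughly a_1 times that relative change in price.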

Question #3

Extract the diamonds that have very high and very low residuals. Is there anything unusual about these diamonds? Are they particularly bad or good, or do you think these are pricing errors?

In the table below, the first 20 rows (the most negative residuals) include several diamonds of about 3 carats that are priced far lower than the model predicts. The remaining 20 rows (the most positive residuals) include a few 1-carat diamonds that are priced far higher than predicted. Other factors need to be taken into consideration, such as clarity: price is affected by clarity as well as by carat, and most of the low-residual rows shown have clarity I1 or SI2. So these do not look like major errors; the price is simply affected by multiple factors.

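# Use this chunk to place your code for extracting the high and low residuals
# Note: diamonds2 is rebuilt here from the full data, without the carat <= 2.5 filter
diamonds2 <- diamonds %>% mutate(lprice = log2(price), lcarat = log2(carat))
model2 <- lm(lprice ~ lcarat + color + clarity + cut, data = diamonds2)
# 20 most negative residuals: priced far below the model's prediction
low <- diamonds2 %>% add_residuals(model2) %>% arrange(resid) %>% slice(1:20)
# 20 most positive residuals: priced far above the model's prediction
high <- diamonds2 %>% add_residuals(model2) %>% arrange(-resid) %>% slice(1:20)

bind_rows(low, high) %>% select(price, carat, resid, clarity)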
## # A tibble: 40 x 4
##    price carat  resid clarity
##    <int> <dbl>  <dbl> <ord>  
##  1  6512  3    -1.46  I1     
##  2 10470  2.46 -1.17  SI2    
##  3 10453  3.05 -1.14  I1     
##  4 14220  3.01 -1.12  SI2    
##  5  9925  3.01 -1.12  I1     
##  6 18701  3.51 -1.09  VS2    
##  7  1262  1.03 -1.04  I1     
##  8  8040  3.01 -1.02  I1     
##  9 12587  3.5  -0.990 I1     
## 10  8044  3    -0.985 I1     
## # … with 30 more rows

Question #4

Does the final model, mod_diamond2, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond and why?

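Based on the plot of lresid2 vs. lcarat, the model appears to predict diamond prices reliably, apart from a few outliers. Because the residuals are on the log2 scale, a residual of 1 means the actual price differs from the prediction by a factor of two; only 16 diamonds in the filtered data have abs(lresid2) > 1.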

# Use this chunk to place your code for assessing how well the model predicts diamond prices

summary(diamonds2)
##      carat               cut        color        clarity     
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655  
##                                     J: 2808   (Other): 2531  
##      depth           table           price             x         
##  Min.   :43.00   Min.   :43.00   Min.   :  326   Min.   : 0.000  
##  1st Qu.:61.00   1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710  
##  Median :61.80   Median :57.00   Median : 2401   Median : 5.700  
##  Mean   :61.75   Mean   :57.46   Mean   : 3933   Mean   : 5.731  
##  3rd Qu.:62.50   3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540  
##  Max.   :79.00   Max.   :95.00   Max.   :18823   Max.   :10.740  
##                                                                  
##        y                z              lprice           lcarat        
##  Min.   : 0.000   Min.   : 0.000   Min.   : 8.349   Min.   :-2.32193  
##  1st Qu.: 4.720   1st Qu.: 2.910   1st Qu.: 9.892   1st Qu.:-1.32193  
##  Median : 5.710   Median : 3.530   Median :11.229   Median :-0.51457  
##  Mean   : 5.735   Mean   : 3.539   Mean   :11.234   Mean   :-0.56982  
##  3rd Qu.: 6.540   3rd Qu.: 4.040   3rd Qu.:12.378   3rd Qu.: 0.05658  
##  Max.   :58.900   Max.   :31.800   Max.   :14.200   Max.   : 2.32481  
##
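The summary above describes the data rather than the model fit. A more direct check might look like the sketch below, which assumes the tidyverse and modelr are installed. The names `diamonds_fit`, `mod_check`, `rmse_log2`, `r2` and `fit_summary` are introduced here for illustration, and the 1.25 threshold is an arbitrary choice rather than part of the original analysis.

```r
library(tidyverse)
library(modelr)

# Rebuild the filtered data and the four-predictor model used earlier
diamonds_fit <- diamonds %>%
  filter(carat <= 2.5) %>%
  mutate(lprice = log2(price), lcarat = log2(carat))

mod_check <- lm(lprice ~ lcarat + color + cut + clarity, data = diamonds_fit)

# Fit quality on the log2 scale
rmse_log2 <- rmse(mod_check, diamonds_fit)   # typical residual size in log2(price) units
r2 <- rsquare(mod_check, diamonds_fit)       # share of variance in log2(price) explained

# A residual r means actual price / predicted price = 2^r
fit_summary <- diamonds_fit %>%
  add_residuals(mod_check, "lresid") %>%
  summarise(
    within_factor_1_25 = mean(abs(lresid) < log2(1.25)),  # share within a factor of 1.25
    beyond_factor_2    = sum(abs(lresid) > 1)             # count off by more than 2x
  )

rmse_log2
r2
fit_summary
```

Reporting the error as a price ratio keeps the assessment on the same scale the model was fitted on, while still being interpretable for a buyer.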