ANLY505 DiamondsDataModel

Initial Visualization

ggplot(diamonds, aes(cut,price)) + geom_boxplot()

ggplot(diamonds, aes(color,price)) + geom_boxplot()

ggplot(diamonds, aes(clarity,price)) + geom_boxplot()

ggplot(diamonds, aes(carat, price)) +
  geom_hex(bins=50)

## Warning: package 'hexbin' was built under R version 3.5.2

Subset Data and replot

diamonds2 <- diamonds %>%
  filter(carat <= 2.5)  %>%
  mutate(lprice = log2(price), lcarat = log2(carat))

## Warning: package 'bindrcpp' was built under R version 3.5.2

ggplot(diamonds2, aes(lcarat, lprice)) +
  geom_hex(bins=50)

Simple model and visualization

mod_diamond <- lm(lprice ~ lcarat, data = diamonds2)

grid <- diamonds2 %>%
  data_grid(carat = seq_range(carat, 20)) %>%
  mutate(lcarat = log2(carat)) %>%
  add_predictions(mod_diamond, "lprice") %>%
  mutate(price = 2 ^ lprice)

ggplot(diamonds2, aes(carat, price)) +
  geom_hex(bins = 50) +
  geom_line(data = grid, color = "green", size = 1)

Add residuals and plot

diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond, "lresid")

ggplot(diamonds2, aes(lcarat, lresid)) +
  geom_hex(bins = 50)

ggplot(diamonds2, aes(cut,lresid)) + geom_boxplot()

ggplot(diamonds2, aes(color,lresid)) + geom_boxplot()

ggplot(diamonds2, aes(clarity,lresid)) + geom_boxplot()

Four parameter model and visualization

mod_diamond2 <- lm(
  lprice ~ lcarat + color + cut + clarity, diamonds2
)

grid <- diamonds2 %>%
  data_grid(cut, .model = mod_diamond2) %>%
  add_predictions(mod_diamond2)
grid

## # A tibble: 5 x 5
##   cut       lcarat color clarity  pred
##   <ord>      <dbl> <chr> <chr>   <dbl>
## 1 Fair      -0.515 G     VS2      11.2
## 2 Good      -0.515 G     VS2      11.3
## 3 Very Good -0.515 G     VS2      11.4
## 4 Premium   -0.515 G     VS2      11.4
## 5 Ideal     -0.515 G     VS2      11.4

ggplot(grid, aes(cut, pred)) +
  geom_point()

Plot residuals of four parameter model

diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond2, "lresid2")

ggplot(diamonds2, aes(lcarat, lresid2)) +
  geom_hex(bins = 50)

diamonds2 %>%
  filter(abs(lresid2) > 1) %>%
  add_predictions(mod_diamond2) %>%
  mutate(pred = round(2^pred)) %>%
  select(price, pred, carat:table, x:z) %>%
  arrange(price)

## # A tibble: 16 x 11
##    price  pred carat cut       color clarity depth table     x     y     z
##    <int> <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1013   264 0.25  Fair      F     SI2      54.4    64  4.3   4.23  2.32
##  2  1186   284 0.25  Premium   G     SI2      59      60  5.33  5.28  3.12
##  3  1186   284 0.25  Premium   G     SI2      58.8    60  5.33  5.28  3.12
##  4  1262  2644 1.03  Fair      E     I1       78.2    54  5.72  5.59  4.42
##  5  1415   639 0.35  Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  6  1415   639 0.35  Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  7  1715   576 0.32  Fair      F     VS2      59.6    60  4.42  4.34  2.61
##  8  1776   412 0.290 Fair      F     SI1      55.8    60  4.48  4.41  2.48
##  9  2160   314 0.34  Fair      F     I1       55.8    62  4.72  4.6   2.6 
## 10  2366   774 0.3   Very Good D     VVS2     60.6    58  4.33  4.35  2.63
## 11  3360  1373 0.51  Premium   F     SI1      62.7    62  5.09  4.96  3.15
## 12  3807  1540 0.61  Good      F     SI2      62.5    65  5.36  5.29  3.33
## 13  3920  1705 0.51  Fair      F     VVS2     65.4    60  4.98  4.9   3.23
## 14  4368  1705 0.51  Fair      F     VVS2     60.7    66  5.21  5.11  3.13
## 15 10011  4048 1.01  Fair      D     SI2      64.6    58  6.25  6.2   4.02
## 16 10470 23622 2.46  Premium   E     SI2      59.7    59  8.82  8.76  5.25

Question #1

In the plot of lcarat vs. lprice, there are some bright vertical strips. What do they represent?

They represent the categories of carat which is in fact an integer variable. But because we took log of the initial variable, so we get results different from integers.

Question #2

If log(price) = a_0 + a_1 * log(carat), what does that say about the relationship between price and carat?

It says that the price of a diamond is completely dependent on the carat size iff the relationship is multiplicative or linear. Therefore, 1% increase in carat is associated with a 1% increase in price.

Question #3

Extract the diamonds that have very high and very low residuals. Is there anything unusual about these diamonds? Are they particularly bad or good, or do you think these are pricing errors?

# Use this chunk to place your code for extracting the high and low residuals
diamonds2 <-
  diamonds %>% 
  mutate(lprice = log2(price),
         lcarat = log2(carat))
mod <- lm(lprice ~ lcarat + color + clarity + cut, data = diamonds2)
top <-
  diamonds2 %>% 
  add_residuals(mod) %>% 
  arrange(resid) %>% 
  slice(1:10)
bot <-
  diamonds2 %>% 
  add_residuals(mod) %>% 
  arrange(-resid) %>% 
  slice(1:10)
bind_rows(top, bot) %>% 
  select(price, carat, resid)

## # A tibble: 20 x 3
##    price carat  resid
##    <int> <dbl>  <dbl>
##  1  6512 3     -1.46 
##  2 10470 2.46  -1.17 
##  3 10453 3.05  -1.14 
##  4 14220 3.01  -1.12 
##  5  9925 3.01  -1.12 
##  6 18701 3.51  -1.09 
##  7  1262 1.03  -1.04 
##  8  8040 3.01  -1.02 
##  9 12587 3.5   -0.990
## 10  8044 3     -0.985
## 11  2160 0.34   2.81 
## 12  1776 0.290  2.10 
## 13  1186 0.25   2.06 
## 14  1186 0.25   2.06 
## 15  1013 0.25   1.94 
## 16  2366 0.3    1.61 
## 17  1715 0.32   1.57 
## 18  4368 0.51   1.36 
## 19 10011 1.01   1.31 
## 20  3807 0.61   1.31

There is nothing unusual about these diamonds. And we don’t see any particularly bad or good at this point.

Question #4

Does the final model, mod_diamonds2, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond and why?

# Use this chunk to place your code for assessing how well the model predicts diamond prices
# Add error calculation
diamonds2 <- diamonds2 %>% 
    add_predictions(mod) %>% 
    mutate(pred = round(2 ^ pred),
           err = pred - price) 
# Visualize error
diamonds2 %>%
    ggplot(aes(err)) +
    geom_histogram(bins = 50)

We’ll try exam the difference between the actual and predicted price by histgram. It shows the predictions centered around zero, however, it’s difficult to see the range of errors. To better characterize the error, we can review the distribution using quantile().

p <- c(0.005, 0.025, 0.25, 0.5, 0.75, 0.975, 0.995)
diamonds2$err %>% quantile(probs = p)

##      0.5%      2.5%       25%       50%       75%     97.5%     99.5% 
## -2863.000 -1769.000  -198.000     0.000   155.000  1436.525  3559.915

median(diamonds2$price)

## [1] 2401

The model error (difference between actual and predicted price) is 95% of the time within [-$1769, $1436]. This is relatively high considering the median price is $2,401. Therefore, there is considerable variability in the predictions that should be weighed before using the model.