Initial Visualization

ggplot(diamonds, aes(cut,price)) + geom_boxplot()

ggplot(diamonds, aes(color,price)) + geom_boxplot()

ggplot(diamonds, aes(clarity,price)) + geom_boxplot()

ggplot(diamonds, aes(carat, price)) +
  geom_hex(bins=50)

Subset Data and replot

diamonds2 <- diamonds %>%
  filter(carat <= 2.5)  %>%
  mutate(lprice = log2(price), lcarat = log2(carat))

ggplot(diamonds2, aes(lcarat, lprice)) +
  geom_hex(bins=50)

Simple model and visualization

mod_diamond <- lm(lprice ~ lcarat, data = diamonds2)

grid <- diamonds2 %>%
  data_grid(carat = seq_range(carat, 20)) %>%
  mutate(lcarat = log2(carat)) %>%
  add_predictions(mod_diamond, "lprice") %>%
  mutate(price = 2 ^ lprice)

ggplot(diamonds2, aes(carat, price)) +
  geom_hex(bins = 50) +
  geom_line(data = grid, color = "green", size = 1)

Add residuals and plot

diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond, "lresid")

ggplot(diamonds2, aes(lcarat, lresid)) +
  geom_hex(bins = 50)

ggplot(diamonds2, aes(cut,lresid)) + geom_boxplot()

ggplot(diamonds2, aes(color,lresid)) + geom_boxplot()

ggplot(diamonds2, aes(clarity,lresid)) + geom_boxplot()

Four parameter model and visualization

mod_diamond2 <- lm(
  lprice ~ lcarat + color + cut + clarity, diamonds2
)

grid <- diamonds2 %>%
  data_grid(cut, .model = mod_diamond2) %>%
  add_predictions(mod_diamond2)
grid
## # A tibble: 5 x 5
##   cut       lcarat color clarity  pred
##   <ord>      <dbl> <chr> <chr>   <dbl>
## 1 Fair      -0.515 G     VS2      11.2
## 2 Good      -0.515 G     VS2      11.3
## 3 Very Good -0.515 G     VS2      11.4
## 4 Premium   -0.515 G     VS2      11.4
## 5 Ideal     -0.515 G     VS2      11.4
ggplot(grid, aes(cut, pred)) +
  geom_point()

Plot residuals of four parameter model

diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond2, "lresid2")

ggplot(diamonds2, aes(lcarat, lresid2)) +
  geom_hex(bins = 50)

diamonds2 %>%
  filter(abs(lresid2) > 1) %>%
  add_predictions(mod_diamond2) %>%
  mutate(pred = round(2^pred)) %>%
  select(price, pred, carat:table, x:z) %>%
  arrange(price)
## # A tibble: 16 x 11
##    price  pred carat cut       color clarity depth table     x     y     z
##    <int> <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1013   264 0.25  Fair      F     SI2      54.4    64  4.3   4.23  2.32
##  2  1186   284 0.25  Premium   G     SI2      59      60  5.33  5.28  3.12
##  3  1186   284 0.25  Premium   G     SI2      58.8    60  5.33  5.28  3.12
##  4  1262  2644 1.03  Fair      E     I1       78.2    54  5.72  5.59  4.42
##  5  1415   639 0.35  Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  6  1415   639 0.35  Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  7  1715   576 0.32  Fair      F     VS2      59.6    60  4.42  4.34  2.61
##  8  1776   412 0.290 Fair      F     SI1      55.8    60  4.48  4.41  2.48
##  9  2160   314 0.34  Fair      F     I1       55.8    62  4.72  4.6   2.6 
## 10  2366   774 0.3   Very Good D     VVS2     60.6    58  4.33  4.35  2.63
## 11  3360  1373 0.51  Premium   F     SI1      62.7    62  5.09  4.96  3.15
## 12  3807  1540 0.61  Good      F     SI2      62.5    65  5.36  5.29  3.33
## 13  3920  1705 0.51  Fair      F     VVS2     65.4    60  4.98  4.9   3.23
## 14  4368  1705 0.51  Fair      F     VVS2     60.7    66  5.21  5.11  3.13
## 15 10011  4048 1.01  Fair      D     SI2      64.6    58  6.25  6.2   4.02
## 16 10470 23622 2.46  Premium   E     SI2      59.7    59  8.82  8.76  5.25

Question #1

In the plot of lcarat vs. lprice, there are some bright vertical strips. What do they represent?

These bright vertical strips represent the points at which there are higher number of observations. In this case these strips represents the higher density of diamonds at around those points.

Question #2

If log(price) = a_0 + a_1 * log(carat), what does that say about the relationship between price and carat?

It represents the percentage increase in price for a 1% increase in carat.

However it can also be expressed in the following form:

Modelling the relationship

mod_log <- lm(log2(price) ~ log2(carat), data = diamonds)

diamonds3 <- diamonds %>%
  select(price, carat) %>%
  add_predictions(mod_log)

ggplot(diamonds3, aes(x = carat, y = 2^pred)) + geom_line() + labs(x = 'carat',
                                                                   y = 'price')

As we can see that the relationship between price and carat is not linear. Relationship can also be approximated as a t times increase in x resulting in a t^a1 increase in y.

2^coef(mod_log)[2]
## log2(carat) 
##    3.195002

A two times increase in x is likely to result in about 3.2 times increase in y

Confirming this relationship

# An increase in carat from 1 to 2
2^(predict(mod_log, newdata = tibble(carat = 2)) - predict(mod_log, newdata = tibble(carat = 1)))
##        1 
## 3.195002
# An increase in carat from 2 to 4
2^(predict(mod_log, newdata = tibble(carat = 4)) - predict(mod_log, newdata = tibble(carat = 2)))
##        1 
## 3.195002
# An increase in carat from 0.5 to 1
2^(predict(mod_log, newdata = tibble(carat = 1)) - predict(mod_log, newdata = tibble(carat = 0.5)))
##        1 
## 3.195002

In all the cases a two times increase led to the same answer 3.195002.

So there are two ways of showing a relationship between price and carat with the log equation. First as a percentage change in price with a 1% change in carat and second as a t times change in carat leading to t^a[1] times change in price.

Question #3

Extract the diamonds that have very high and very low residuals. Is there anything unusual about these diamonds? Are they particularly bad or good, or do you think these are pricing errors?

# Use this chunk to place your code for extracting the high and low residuals
diamonds2 %>%
  filter(abs(lresid2) > 1 | abs(lresid2) < -0.8) %>%
  add_predictions(mod_diamond2) %>%
  mutate(pred = round(2^pred)) %>%
  select(price, pred, carat:table, x:z, lresid2) %>%
  arrange(price)
## # A tibble: 16 x 12
##    price  pred carat cut   color clarity depth table     x     y     z
##    <int> <dbl> <dbl> <ord> <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1013   264 0.25  Fair  F     SI2      54.4    64  4.3   4.23  2.32
##  2  1186   284 0.25  Prem~ G     SI2      59      60  5.33  5.28  3.12
##  3  1186   284 0.25  Prem~ G     SI2      58.8    60  5.33  5.28  3.12
##  4  1262  2644 1.03  Fair  E     I1       78.2    54  5.72  5.59  4.42
##  5  1415   639 0.35  Fair  G     VS2      65.9    54  5.57  5.53  3.66
##  6  1415   639 0.35  Fair  G     VS2      65.9    54  5.57  5.53  3.66
##  7  1715   576 0.32  Fair  F     VS2      59.6    60  4.42  4.34  2.61
##  8  1776   412 0.290 Fair  F     SI1      55.8    60  4.48  4.41  2.48
##  9  2160   314 0.34  Fair  F     I1       55.8    62  4.72  4.6   2.6 
## 10  2366   774 0.3   Very~ D     VVS2     60.6    58  4.33  4.35  2.63
## 11  3360  1373 0.51  Prem~ F     SI1      62.7    62  5.09  4.96  3.15
## 12  3807  1540 0.61  Good  F     SI2      62.5    65  5.36  5.29  3.33
## 13  3920  1705 0.51  Fair  F     VVS2     65.4    60  4.98  4.9   3.23
## 14  4368  1705 0.51  Fair  F     VVS2     60.7    66  5.21  5.11  3.13
## 15 10011  4048 1.01  Fair  D     SI2      64.6    58  6.25  6.2   4.02
## 16 10470 23622 2.46  Prem~ E     SI2      59.7    59  8.82  8.76  5.25
## # ... with 1 more variable: lresid2 <dbl>

There does not appear to be anything too unusual about the data.

However one thing to note here is that the model is placing much more emphasis on carat weight than any other variable. For example observation 4 in this tibble has a carat weight of 1.03 but has a fair cut, E color and i1 clarity. But because of the carat weight the model priced it more than twice of the original price. Similarly observation 10 has a carat weight of 0.30 with a very good cut and D color and VVS2 clarity but still because of the lower carat weight it got priced at about 1/3rd of the original price by the model.

Graphing the extreme residuals with the full dataset to confirm

library(gridExtra)
## 
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
## 
##     combine
# Extracting data with for the top 20 and bottom 20 residuals
res_top10 <- diamonds2 %>%
  arrange(desc(lresid2)) %>%
  select(lprice, lcarat, lresid2) %>%
  head(20)

res_bottom10 <- diamonds2 %>%
  arrange(lresid2) %>%
  select(lprice, lcarat, lresid2) %>%
  head(20)

df1 <- rbind(res_top10, res_bottom10)

a <- ggplot(df1, aes(lcarat, lprice)) +
  geom_hex(bins = 50)

b <- ggplot(diamonds2, aes(lcarat, lprice)) +
  geom_hex(bins = 20)

grid.arrange(a, b, layout_matrix = cbind(1, 2))

From the graphs its clear that there doesn’t appear to be anything unusual with the data. The model appears to be heavily favoring carot weight over all other variables while predicting the price. While carat weight is the most important factor from our analysis we can say that giving more weight to other variables would be helpful when predicting prices.

Question #4

Does the final model, mod_diamonds2, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond and why?

# Use this chunk to place your code for assessing how well the model predicts diamond prices

diamonds4 <- diamonds2 %>%
  add_predictions(mod_diamond2) %>%
  mutate(pred = round(2^pred)) %>%
  select(price, pred, carat:table, x:z, lresid2, lcarat, lresid2) %>%
  arrange(price)


c <- ggplot(diamonds4, aes(lcarat, lresid2)) + 
  geom_hex(bins = 50)

d <- ggplot(diamonds4, aes(pred, price)) +
  geom_hex(bins = 50) + geom_abline(intercept = 0, slope = 1, color="blue")
grid.arrange(c, d, layout_matrix = cbind(1, 2)) 

Model metrics

library(Metrics)
## 
## Attaching package: 'Metrics'
## The following objects are masked from 'package:modelr':
## 
##     mae, mape, mse, rmse
print(paste('Root means squared error: ', rmse(diamonds4$price, diamonds4$pred)))
## [1] "Root means squared error:  735.999947660903"
print(paste('Mean absolute error: ', mae(diamonds4$price, diamonds4$pred)))
## [1] "Mean absolute error:  391.585163712045"
range(diamonds4$price)
## [1]   326 18823

In terms of statistical analysis the model seems to be doing well as indicated by the residual plot and the actual vs prediction plot.

The roor mean squared error is 735, which is extrmely good considering the range of diamond prices are from 326 till 18823. Since the root mean squared error can be heavily influenced by some residuals as we saw earlier we have some extreme residual values we can use the mean absolute error, which is 391.

So these metrics indicate the model seems to be doing well in pricing diamonds.

However, we saw earlier that there are large errors in many cases due to the model heavily favoring carat weight over all other variables. Thus in order to buy and sell diamonds you would need a better model.

Having a model that does take into account the effect of variables like cut, clarity, color while pricing diamonds could be helpful. Also we saw using the log model some presence of non-linearity. Usage of other non-linear models could alsobe helpful. One good approach could be decision trees regression.