Initial Visualization

ggplot(diamonds, aes(cut,price)) + geom_boxplot()

ggplot(diamonds, aes(color,price)) + geom_boxplot()

ggplot(diamonds, aes(clarity,price)) + geom_boxplot()

ggplot(diamonds, aes(carat, price)) +
  geom_hex(bins=50)

Subset Data and replot

diamonds2 <- diamonds %>%
  filter(carat <= 2.5)  %>%
  mutate(lprice = log2(price), lcarat = log2(carat))

ggplot(diamonds2, aes(lcarat, lprice)) +
  geom_hex(bins=50)

Simple model and visualization

mod_diamond <- lm(lprice ~ lcarat, data = diamonds2)

grid <- diamonds2 %>%
  data_grid(carat = seq_range(carat, 20)) %>%
  mutate(lcarat = log2(carat)) %>%
  add_predictions(mod_diamond, "lprice") %>%
  mutate(price = 2 ^ lprice)

ggplot(diamonds2, aes(carat, price)) +
  geom_hex(bins = 50) +
  geom_line(data = grid, color = "green", size = 1)

Add residuals and plot

diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond, "lresid")

ggplot(diamonds2, aes(lcarat, lresid)) +
  geom_hex(bins = 50)

ggplot(diamonds2, aes(cut,lresid)) + geom_boxplot()

ggplot(diamonds2, aes(color,lresid)) + geom_boxplot()

ggplot(diamonds2, aes(clarity,lresid)) + geom_boxplot()

Four parameter model and visualization

mod_diamond2 <- lm(
  lprice ~ lcarat + color + cut + clarity, diamonds2
)

grid <- diamonds2 %>%
  data_grid(cut, .model = mod_diamond2) %>%
  add_predictions(mod_diamond2)
grid

## # A tibble: 5 x 5
##   cut       lcarat color clarity  pred
##   <ord>      <dbl> <chr> <chr>   <dbl>
## 1 Fair      -0.515 G     VS2      11.2
## 2 Good      -0.515 G     VS2      11.3
## 3 Very Good -0.515 G     VS2      11.4
## 4 Premium   -0.515 G     VS2      11.4
## 5 Ideal     -0.515 G     VS2      11.4

ggplot(grid, aes(cut, pred)) +
  geom_point()

Plot residuals of four parameter model

diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond2, "lresid2")

ggplot(diamonds2, aes(lcarat, lresid2)) +
  geom_hex(bins = 50)

diamonds2 %>%
  filter(abs(lresid2) > 1) %>%
  add_predictions(mod_diamond2) %>%
  mutate(pred = round(2^pred)) %>%
  select(price, pred, carat:table, x:z) %>%
  arrange(price)

## # A tibble: 16 x 11
##    price  pred carat cut       color clarity depth table     x     y     z
##    <int> <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1013   264 0.25  Fair      F     SI2      54.4    64  4.3   4.23  2.32
##  2  1186   284 0.25  Premium   G     SI2      59      60  5.33  5.28  3.12
##  3  1186   284 0.25  Premium   G     SI2      58.8    60  5.33  5.28  3.12
##  4  1262  2644 1.03  Fair      E     I1       78.2    54  5.72  5.59  4.42
##  5  1415   639 0.35  Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  6  1415   639 0.35  Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  7  1715   576 0.32  Fair      F     VS2      59.6    60  4.42  4.34  2.61
##  8  1776   412 0.290 Fair      F     SI1      55.8    60  4.48  4.41  2.48
##  9  2160   314 0.34  Fair      F     I1       55.8    62  4.72  4.6   2.6 
## 10  2366   774 0.3   Very Good D     VVS2     60.6    58  4.33  4.35  2.63
## 11  3360  1373 0.51  Premium   F     SI1      62.7    62  5.09  4.96  3.15
## 12  3807  1540 0.61  Good      F     SI2      62.5    65  5.36  5.29  3.33
## 13  3920  1705 0.51  Fair      F     VVS2     65.4    60  4.98  4.9   3.23
## 14  4368  1705 0.51  Fair      F     VVS2     60.7    66  5.21  5.11  3.13
## 15 10011  4048 1.01  Fair      D     SI2      64.6    58  6.25  6.2   4.02
## 16 10470 23622 2.46  Premium   E     SI2      59.7    59  8.82  8.76  5.25

Question #1

In the plot of lcarat vs. lprice, there are some bright vertical strips. What do they represent?

lcarat is the log of the carat value of the diamond and lprice is the log of the price amount, to make the two more comparable and smoothen the distribution.

The bright blue vertical strips show a higher concentration of the number of observations for a certain logarithm of carat values and prices. As shown on the plot, for each carat values, there is a price range associated. With the light blue it shows that either there seems to be a higher “agreement” on the price for certain carats, or that carats values have these specific values are more popular (higher count of diamonds).

Question #2

If log(price) = a_0 + a_1 * log(carat), what does that say about the relationship between price and carat?

According to this equation, the log of price of a diamond is linearily dependant from the value of the log of the carat, with a certain coefficient a_1. If this coefficient is negative, it shows a negative relation between the carat and the price. If it is positive, it shows a positive relation between the carat and the price of a diamond. Here, there is also “a_0”, the error of the equation, which captures everything influencing the price that is not the log of the carat.

To get the actual price of the diamond, we would use the “exp” function: price=exp(a_0 + a_1*log(carat))

Question #3

Extract the diamonds that have very high and very low residuals.
Is there anything unusual about these diamonds? Are they particularly bad or good, or do you think these are pricing errors?

First, we want to identify the range of the values of the residuals for the 4 parameters model.

diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond2, "lresid2")

summary(diamonds2$lresid2)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## -1.17388 -0.12437 -0.00094  0.00000  0.11920  2.78322

The residuals range from -1.17 to 2.78.

The code to filter the extreme residuals were already given previously, but an alternative can be found below:

diamondsX <- diamonds2 %>% 
  filter(lresid2 > 1| lresid2 < -1) %>%
  arrange(lresid2)

diamondsX

## # A tibble: 16 x 14
##    carat cut   color clarity depth table price     x     y     z lprice
##    <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>  <dbl>
##  1 2.46  Prem… E     SI2      59.7    59 10470  8.82  8.76  5.25  13.4 
##  2 1.03  Fair  E     I1       78.2    54  1262  5.72  5.59  4.42  10.3 
##  3 0.35  Fair  G     VS2      65.9    54  1415  5.57  5.53  3.66  10.5 
##  4 0.35  Fair  G     VS2      65.9    54  1415  5.57  5.53  3.66  10.5 
##  5 0.51  Fair  F     VVS2     65.4    60  3920  4.98  4.9   3.23  11.9 
##  6 0.51  Prem… F     SI1      62.7    62  3360  5.09  4.96  3.15  11.7 
##  7 0.61  Good  F     SI2      62.5    65  3807  5.36  5.29  3.33  11.9 
##  8 1.01  Fair  D     SI2      64.6    58 10011  6.25  6.2   4.02  13.3 
##  9 0.51  Fair  F     VVS2     60.7    66  4368  5.21  5.11  3.13  12.1 
## 10 0.32  Fair  F     VS2      59.6    60  1715  4.42  4.34  2.61  10.7 
## 11 0.3   Very… D     VVS2     60.6    58  2366  4.33  4.35  2.63  11.2 
## 12 0.25  Fair  F     SI2      54.4    64  1013  4.3   4.23  2.32   9.98
## 13 0.25  Prem… G     SI2      59      60  1186  5.33  5.28  3.12  10.2 
## 14 0.25  Prem… G     SI2      58.8    60  1186  5.33  5.28  3.12  10.2 
## 15 0.290 Fair  F     SI1      55.8    60  1776  4.48  4.41  2.48  10.8 
## 16 0.34  Fair  F     I1       55.8    62  2160  4.72  4.6   2.6   11.1 
## # … with 3 more variables: lcarat <dbl>, lresid <dbl>, lresid2 <dbl>

table(diamondsX$cut)

## 
##      Fair      Good Very Good   Premium     Ideal 
##        10         1         1         4         0

ggplot(diamondsX, aes(cut,price)) + geom_boxplot()

Amongst the diamonds with the most extreme prices, that does not fit the model well, can be found within the premium (highest on the cut scale) and the fair (lowest on the cut scale) diamonds.

table(diamondsX$clarity)

## 
##   I1  SI2  SI1  VS2  VS1 VVS2 VVS1   IF 
##    2    6    2    3    0    3    0    0

ggplot(diamondsX, aes(clarity,price)) + geom_boxplot()

Surprisingly, most unbalanced price distribution (the most extreme prices) is amongst the “second to last clarity”SI2" clarity category (second to last).

table(diamondsX$carat)

## 
## 0.25 0.29  0.3 0.32 0.34 0.35 0.51 0.61 1.01 1.03 2.46 
##    3    1    1    1    1    2    3    1    1    1    1

Extreme residuals can be found within 0.25 carat diamonds,

Conclusion: if the quality of a diamond is defined by its cut and its clarity:

diamondsX %>% 
  ggplot(aes(clarity,price))+
  geom_boxplot()+
  facet_grid(~cut)

Interestingly, it seems that amongst the extreme residuals, there is a lot of diamonds with a “fair” cut quality (the lowest of the scale) and lower clarity (SI2). On average, this type of diamond have a price higher than diamonds that are in the premium/SI2 cut category. Also, the plot of residuals (see “r plot residuals of four parameter model”) done shows that the highest residuals are amongst smaller sized diamonds (measured by the carat value) and lowest residuals are amongst the largest diamonds.

Question #4

Does the final model, mod_diamonds2, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond and why?

# Use this chunk to place your code for assessing how well the model predicts diamond prices

mod_diamond2 <- lm(
  lprice ~ lcarat + color + cut + clarity, diamonds2
)

summary(mod_diamond2)

## 
## Call:
## lm(formula = lprice ~ lcarat + color + cut + clarity, data = diamonds2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.17388 -0.12437 -0.00094  0.11920  2.78322 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 12.206978   0.001693 7211.806  < 2e-16 ***
## lcarat       1.886239   0.001124 1677.809  < 2e-16 ***
## color.L     -0.633998   0.002910 -217.872  < 2e-16 ***
## color.Q     -0.137580   0.002676  -51.409  < 2e-16 ***
## color.C     -0.022072   0.002503   -8.819  < 2e-16 ***
## color^4      0.016570   0.002297    7.213 5.54e-13 ***
## color^5     -0.002828   0.002169   -1.304    0.192    
## color^6      0.003533   0.001971    1.793    0.073 .  
## cut.L        0.173866   0.003386   51.349  < 2e-16 ***
## cut.Q       -0.050346   0.002980  -16.897  < 2e-16 ***
## cut.C        0.019129   0.002583    7.407 1.31e-13 ***
## cut^4       -0.002410   0.002066   -1.166    0.243    
## clarity.L    1.308155   0.005179  252.598  < 2e-16 ***
## clarity.Q   -0.334090   0.004839  -69.047  < 2e-16 ***
## clarity.C    0.178423   0.004140   43.093  < 2e-16 ***
## clarity^4   -0.088059   0.003298  -26.697  < 2e-16 ***
## clarity^5    0.035885   0.002680   13.389  < 2e-16 ***
## clarity^6   -0.001371   0.002327   -0.589    0.556    
## clarity^7    0.048221   0.002051   23.512  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1916 on 53795 degrees of freedom
## Multiple R-squared:  0.9828, Adjusted R-squared:  0.9828 
## F-statistic: 1.706e+05 on 18 and 53795 DF,  p-value: < 2.2e-16

qqnorm(mod_diamond2$residuals)
qqline(mod_diamond2$residuals)

It seems to be a good explanatory model (98% of the variance explained by the model). However, looking at the residuals plot, it shows that the model is not good with extreme prices (looking at the tails), which is not surprising given the results found on question 3. In most of the cases, the model seems to be satisfying, but errors in prices seems to be increasing as the quality and size of the diamond decreases.

DiamondsDataModel - Assignment 2

Morgane Cariou

2019-10-01