Initial Visualization

ggplot(diamonds, aes(cut,price)) + geom_boxplot()

ggplot(diamonds, aes(color,price)) + geom_boxplot()

ggplot(diamonds, aes(clarity,price)) + geom_boxplot()

ggplot(diamonds, aes(carat, price)) +
  geom_hex(bins=50)

Subset Data and replot

diamonds2 <- diamonds %>%
  filter(carat <= 2.5)  %>%
  mutate(lprice = log2(price), lcarat = log2(carat))

ggplot(diamonds2, aes(lcarat, lprice)) +
  geom_hex(bins=50)

Simple model and visualization

mod_diamond <- lm(lprice ~ lcarat, data = diamonds2, na.action = na.warn)

grid <- diamonds2 %>%
  data_grid(carat = seq_range(carat, 20)) %>%
  mutate(lcarat = log2(carat)) %>%
  add_predictions(mod_diamond, "lprice") %>%
  mutate(price = 2 ^ lprice)

ggplot(diamonds2, aes(carat, price)) +
  geom_hex(bins = 50) +
  geom_line(data = grid, color = "green", size = 1)

Add residuals and plot

diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond, "lresid")

ggplot(diamonds2, aes(lcarat, lresid)) +
  geom_hex(bins = 50)

ggplot(diamonds2, aes(cut,lresid)) + geom_boxplot()

ggplot(diamonds2, aes(color,lresid)) + geom_boxplot()

ggplot(diamonds2, aes(clarity,lresid)) + geom_boxplot()

Four parameter model and visualization

mod_diamond2 <- lm(
  lprice ~ lcarat + color + cut + clarity, diamonds2, na.action = na.warn
)

grid <- diamonds2 %>%
  data_grid(cut, .model = mod_diamond2) %>%
  add_predictions(mod_diamond2)
grid

## # A tibble: 5 x 5
##   cut       lcarat color clarity  pred
##   <ord>      <dbl> <chr> <chr>   <dbl>
## 1 Fair      -0.515 G     VS2      11.2
## 2 Good      -0.515 G     VS2      11.3
## 3 Very Good -0.515 G     VS2      11.4
## 4 Premium   -0.515 G     VS2      11.4
## 5 Ideal     -0.515 G     VS2      11.4

ggplot(grid, aes(cut, pred)) +
  geom_point()

Plot residuals of four parameter model

diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond2, "lresid2")

ggplot(diamonds2, aes(lcarat, lresid2)) +
  geom_hex(bins = 50)

diamonds2 %>%
  filter(abs(lresid2) > 1) %>%
  add_predictions(mod_diamond2) %>%
  mutate(pred = round(2^pred)) %>%
  select(price, pred, carat:table, x:z) %>%
  arrange(price)

## # A tibble: 16 x 11
##    price  pred carat cut       color clarity depth table     x     y     z
##    <int> <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1013   264 0.25  Fair      F     SI2      54.4    64  4.3   4.23  2.32
##  2  1186   284 0.25  Premium   G     SI2      59      60  5.33  5.28  3.12
##  3  1186   284 0.25  Premium   G     SI2      58.8    60  5.33  5.28  3.12
##  4  1262  2644 1.03  Fair      E     I1       78.2    54  5.72  5.59  4.42
##  5  1415   639 0.35  Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  6  1415   639 0.35  Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  7  1715   576 0.32  Fair      F     VS2      59.6    60  4.42  4.34  2.61
##  8  1776   412 0.290 Fair      F     SI1      55.8    60  4.48  4.41  2.48
##  9  2160   314 0.34  Fair      F     I1       55.8    62  4.72  4.6   2.6 
## 10  2366   774 0.3   Very Good D     VVS2     60.6    58  4.33  4.35  2.63
## 11  3360  1373 0.51  Premium   F     SI1      62.7    62  5.09  4.96  3.15
## 12  3807  1540 0.61  Good      F     SI2      62.5    65  5.36  5.29  3.33
## 13  3920  1705 0.51  Fair      F     VVS2     65.4    60  4.98  4.9   3.23
## 14  4368  1705 0.51  Fair      F     VVS2     60.7    66  5.21  5.11  3.13
## 15 10011  4048 1.01  Fair      D     SI2      64.6    58  6.25  6.2   4.02
## 16 10470 23622 2.46  Premium   E     SI2      59.7    59  8.82  8.76  5.25

Question #1

In the plot of lcarat vs. lprice, there are some bright vertical strips. What do they represent?

Answer: The bright vertical strips comprise of bright dots, which represent high counts of diamond. It suggests that those specific carat numbers are popular on the market, with the most numbers being sold, which could be a useful indicator to jewelery makers. At the same carat, the diamonds’ price would vary because of the differences on other factors such as clarity, color, and cut, thus forming the vertical strips.

Question #2

If log(price) = a_0 + a_1 * log(carat), what does that say about the relationship between price and carat?

Answer: It suggests that the log of carat and log of price have a linear relationship. The positive value of a_1 suggests that the correlation is positive, i.e., price increases when carat increases.

Question #3

Extract the diamonds that have very high and very low residuals. Is there anything unusual about these diamonds? Are they particularly bad or good, or do you think these are pricing errors?

Answer: When comparing the residual plots of lresid vs. lcarat, cut, clarity, and color, we can see that cut has the smallest impact on residual distribution, with all cut types of “fair”, “good”, “Very Good”, “Premium”, and “Ideal” sharing almost the same threshold of residual distribution, and outliers concentrating on lresid >0.5 or <-0.5 ranges.

We thereby extract the data with lresid value >0.5 or <-0.5, eliminate the “cut” factor, and plot the price against color and clarity to see if the pricing makes sense.

The plot clearly shows that the higher the clarity is, the more likely the diamonds are priced high (with high residuals); vice versa. It makes logical sense since high clarity diamonds are highly sought after.

# Use this chunk to place your code for extracting the high and low residuals and answer question 3

diamonds_high <- diamonds2 %>% 
    filter(lresid > 0.5)

diamonds_low <- diamonds2 %>% 
    filter(lresid < -0.5)

diamonds_high %>% 
  ggplot(aes(color, price))+
  geom_boxplot()+
  ggtitle("Diamonds with high residuals")+
  facet_grid(~clarity)

diamonds_low %>% 
  ggplot(aes(color, price))+
  geom_boxplot()+
  ggtitle("Diamonds with low residuals")+
  facet_grid(~clarity)

Question #4

Does the final model, mod_diamonds2, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond and why?

Answer: In order to test if mod_diamonds2 does a good job of predicting diamond prices, we summarize the model and conclude that the model has significant statistical significance with p-values <0.01, and R-square >0.98, which suggests that more than 98% of data are explained by the model.

We then compare the predicted and actual values using below plot and find that the actual prices are evenly distributed along both sides of the predicted line. It looks to me that it is a reliable and trustworthy model on determining diamond price.

# Use this chunk to place your code for assessing how well the model predicts diamond prices and answer question 4

summary(mod_diamond2)

## 
## Call:
## lm(formula = lprice ~ lcarat + color + cut + clarity, data = diamonds2, 
##     na.action = na.warn)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.17388 -0.12437 -0.00094  0.11920  2.78322 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 12.206978   0.001693 7211.806  < 2e-16 ***
## lcarat       1.886239   0.001124 1677.809  < 2e-16 ***
## color.L     -0.633998   0.002910 -217.872  < 2e-16 ***
## color.Q     -0.137580   0.002676  -51.409  < 2e-16 ***
## color.C     -0.022072   0.002503   -8.819  < 2e-16 ***
## color^4      0.016570   0.002297    7.213 5.54e-13 ***
## color^5     -0.002828   0.002169   -1.304    0.192    
## color^6      0.003533   0.001971    1.793    0.073 .  
## cut.L        0.173866   0.003386   51.349  < 2e-16 ***
## cut.Q       -0.050346   0.002980  -16.897  < 2e-16 ***
## cut.C        0.019129   0.002583    7.407 1.31e-13 ***
## cut^4       -0.002410   0.002066   -1.166    0.243    
## clarity.L    1.308155   0.005179  252.598  < 2e-16 ***
## clarity.Q   -0.334090   0.004839  -69.047  < 2e-16 ***
## clarity.C    0.178423   0.004140   43.093  < 2e-16 ***
## clarity^4   -0.088059   0.003298  -26.697  < 2e-16 ***
## clarity^5    0.035885   0.002680   13.389  < 2e-16 ***
## clarity^6   -0.001371   0.002327   -0.589    0.556    
## clarity^7    0.048221   0.002051   23.512  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1916 on 53795 degrees of freedom
## Multiple R-squared:  0.9828, Adjusted R-squared:  0.9828 
## F-statistic: 1.706e+05 on 18 and 53795 DF,  p-value: < 2.2e-16

diamonds3 <- diamonds2 %>%
  add_predictions(mod_diamond2)


ggplot(diamonds3, aes(lprice, pred)) +
  geom_point() +
  geom_abline(slope=1, color="green")

ANLY 505 - DiamondsDataModel

Week 3

Yiwen Zhao

01/28/2020