Initial Visualization

ggplot(diamonds, aes(cut,price)) + geom_boxplot()

ggplot(diamonds, aes(color,price)) + geom_boxplot()

ggplot(diamonds, aes(clarity,price)) + geom_boxplot()

ggplot(diamonds, aes(carat, price)) +
  geom_hex(bins=50)

Subset Data and replot

diamonds2 <- diamonds %>%
  filter(carat <= 2.5)  %>%
  mutate(lprice = log2(price), lcarat = log2(carat))

ggplot(diamonds2, aes(lcarat, lprice)) +
  geom_hex(bins=50)

Simple model and visualization

mod_diamond <- lm(lprice ~ lcarat, data = diamonds2, na.action = na.warn)

grid <- diamonds2 %>%
  data_grid(carat = seq_range(carat, 20)) %>%
  mutate(lcarat = log2(carat)) %>%
  add_predictions(mod_diamond, "lprice") %>%
  mutate(price = 2 ^ lprice)

ggplot(diamonds2, aes(carat, price)) +
  geom_hex(bins = 50) +
  geom_line(data = grid, color = "green", size = 1)

Add residuals and plot

diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond, "lresid")

ggplot(diamonds2, aes(lcarat, lresid)) +
  geom_hex(bins = 50)

ggplot(diamonds2, aes(cut,lresid)) + geom_boxplot()

ggplot(diamonds2, aes(color,lresid)) + geom_boxplot()

ggplot(diamonds2, aes(clarity,lresid)) + geom_boxplot()

Four parameter model and visualization

mod_diamond2 <- lm(
  lprice ~ lcarat + color + cut + clarity, diamonds2, na.action = na.warn
)

grid <- diamonds2 %>%
  data_grid(cut, .model = mod_diamond2) %>%
  add_predictions(mod_diamond2)
grid

## # A tibble: 5 x 5
##   cut       lcarat color clarity  pred
##   <ord>      <dbl> <chr> <chr>   <dbl>
## 1 Fair      -0.515 G     VS2      11.2
## 2 Good      -0.515 G     VS2      11.3
## 3 Very Good -0.515 G     VS2      11.4
## 4 Premium   -0.515 G     VS2      11.4
## 5 Ideal     -0.515 G     VS2      11.4

ggplot(grid, aes(cut, pred)) +
  geom_point()

Plot residuals of four parameter model

diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond2, "lresid2")

ggplot(diamonds2, aes(lcarat, lresid2)) +
  geom_hex(bins = 50)

diamonds2 %>%
  filter(abs(lresid2) > 1) %>%
  add_predictions(mod_diamond2) %>%
  mutate(pred = round(2^pred)) %>%
  select(price, pred, carat:table, x:z) %>%
  arrange(price)

## # A tibble: 16 x 11
##    price  pred carat cut       color clarity depth table     x     y     z
##    <int> <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1013   264 0.25  Fair      F     SI2      54.4    64  4.3   4.23  2.32
##  2  1186   284 0.25  Premium   G     SI2      59      60  5.33  5.28  3.12
##  3  1186   284 0.25  Premium   G     SI2      58.8    60  5.33  5.28  3.12
##  4  1262  2644 1.03  Fair      E     I1       78.2    54  5.72  5.59  4.42
##  5  1415   639 0.35  Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  6  1415   639 0.35  Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  7  1715   576 0.32  Fair      F     VS2      59.6    60  4.42  4.34  2.61
##  8  1776   412 0.290 Fair      F     SI1      55.8    60  4.48  4.41  2.48
##  9  2160   314 0.34  Fair      F     I1       55.8    62  4.72  4.6   2.6 
## 10  2366   774 0.3   Very Good D     VVS2     60.6    58  4.33  4.35  2.63
## 11  3360  1373 0.51  Premium   F     SI1      62.7    62  5.09  4.96  3.15
## 12  3807  1540 0.61  Good      F     SI2      62.5    65  5.36  5.29  3.33
## 13  3920  1705 0.51  Fair      F     VVS2     65.4    60  4.98  4.9   3.23
## 14  4368  1705 0.51  Fair      F     VVS2     60.7    66  5.21  5.11  3.13
## 15 10011  4048 1.01  Fair      D     SI2      64.6    58  6.25  6.2   4.02
## 16 10470 23622 2.46  Premium   E     SI2      59.7    59  8.82  8.76  5.25

Question #1

In the plot of lcarat vs. lprice, there are some bright vertical strips. What do they represent?

# Use this chunk to answer question 1

The bright blue strips represent the count of diamonds in the same carat with different prices.

Question #2

If log(price) = a_0 + a_1 * log(carat), what does that say about the relationship between price and carat?

As we all know the relationship between price and carot is positive; however, we’re not sure about if this relationship is linear. According to the formula and the code below, the relationship between price and carot is not linear.

# Use this chunk to answer question 2
mod_log <- lm(log2(price) ~ log2(carat), data = diamonds)
mod_log

## 
## Call:
## lm(formula = log2(price) ~ log2(carat), data = diamonds)
## 
## Coefficients:
## (Intercept)  log2(carat)  
##      12.189        1.676

plot(mod_log)

Question #3

Extract the diamonds that have very high and very low residuals. Is there anything unusual about these diamonds? Are they particularly bad or good, or do you think these are pricing errors?

Using cook.distance to see the outliears, which are sbove 5 or less than -5. If we subset the data using risiduals with within 25% or above 75% and compare the charts with the orginal charts, we can tell if those outliers are different from other factors that are with 25% and 75% risiduals. From the charts we found that they are not very different from each other. In conclusion, some of them are probably over- or under- price.

# Use this chunk to place your code for extracting the high and low residuals and answer question 3
lm1<-lm(carat~price, diamonds)
plot(lm1)

diamonds3 <- diamonds2 %>%
  filter(lresid > quantile(diamonds2$lresid, 0.25) | quantile(diamonds2$lresid, 0.75) )

summary(diamonds3)

##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1589   D: 6771   SI1    :13055   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4889   E: 9792   VS2    :12256   1st Qu.:61.00  
##  Median :0.7000   Very Good:12063   F: 9538   SI2    : 9116   Median :61.80  
##  Mean   :0.7932   Premium  :13745   G:11280   VS1    : 8169   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21528   H: 8269   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :2.5000                     I: 5389   VVS1   : 3655   Max.   :79.00  
##                                     J: 2775   (Other): 2497                  
##      table           price             x               y         
##  Min.   :43.00   Min.   :  326   Min.   :0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  948   1st Qu.:4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2396   Median :5.690   Median : 5.710  
##  Mean   :57.46   Mean   : 3906   Mean   :5.724   Mean   : 5.727  
##  3rd Qu.:59.00   3rd Qu.: 5293   3rd Qu.:6.540   3rd Qu.: 6.530  
##  Max.   :95.00   Max.   :18823   Max.   :8.890   Max.   :58.900  
##                                                                  
##        z              lprice           lcarat             lresid         
##  Min.   : 0.000   Min.   : 8.349   Min.   :-2.32193   Min.   :-1.964068  
##  1st Qu.: 2.910   1st Qu.: 9.889   1st Qu.:-1.32193   1st Qu.:-0.245488  
##  Median : 3.520   Median :11.226   Median :-0.51457   Median :-0.008442  
##  Mean   : 3.534   Mean   :11.228   Mean   :-0.57460   Mean   : 0.000000  
##  3rd Qu.: 4.030   3rd Qu.:12.370   3rd Qu.: 0.05658   3rd Qu.: 0.239301  
##  Max.   :31.800   Max.   :14.200   Max.   : 1.32193   Max.   : 1.934855  
##                                                                          
##     lresid2        
##  Min.   :-1.17388  
##  1st Qu.:-0.12437  
##  Median :-0.00094  
##  Mean   : 0.00000  
##  3rd Qu.: 0.11920  
##  Max.   : 2.78322  
##

ggplot(diamonds3, aes(lresid,price)) + geom_boxplot()

## Warning: Continuous x aesthetic -- did you forget aes(group=...)?

ggplot(diamonds3, aes(lresid,color)) + geom_boxplot()

ggplot(diamonds3, aes(lresid,cut)) + geom_boxplot()

mod_diamond3 <- lm(
  lprice ~ lcarat + color + cut + clarity, diamonds3, na.action = na.warn
)

plot(mod_diamond3)

mod_diamond2 <- lm(
  lprice ~ lcarat + color + cut + clarity, diamonds2, na.action = na.warn
)

plot(mod_diamond2)

Question #4

Does the final model, mod_diamonds2, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond and why?

# Use this chunk to place your code for assessing how well the model predicts diamond prices and answer question 4

summary(mod_diamond2)

## 
## Call:
## lm(formula = lprice ~ lcarat + color + cut + clarity, data = diamonds2, 
##     na.action = na.warn)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.17388 -0.12437 -0.00094  0.11920  2.78322 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) 12.206978   0.001693 7211.806  < 2e-16 ***
## lcarat       1.886239   0.001124 1677.809  < 2e-16 ***
## color.L     -0.633998   0.002910 -217.872  < 2e-16 ***
## color.Q     -0.137580   0.002676  -51.409  < 2e-16 ***
## color.C     -0.022072   0.002503   -8.819  < 2e-16 ***
## color^4      0.016570   0.002297    7.213 5.54e-13 ***
## color^5     -0.002828   0.002169   -1.304    0.192    
## color^6      0.003533   0.001971    1.793    0.073 .  
## cut.L        0.173866   0.003386   51.349  < 2e-16 ***
## cut.Q       -0.050346   0.002980  -16.897  < 2e-16 ***
## cut.C        0.019129   0.002583    7.407 1.31e-13 ***
## cut^4       -0.002410   0.002066   -1.166    0.243    
## clarity.L    1.308155   0.005179  252.598  < 2e-16 ***
## clarity.Q   -0.334090   0.004839  -69.047  < 2e-16 ***
## clarity.C    0.178423   0.004140   43.093  < 2e-16 ***
## clarity^4   -0.088059   0.003298  -26.697  < 2e-16 ***
## clarity^5    0.035885   0.002680   13.389  < 2e-16 ***
## clarity^6   -0.001371   0.002327   -0.589    0.556    
## clarity^7    0.048221   0.002051   23.512  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1916 on 53795 degrees of freedom
## Multiple R-squared:  0.9828, Adjusted R-squared:  0.9828 
## F-statistic: 1.706e+05 on 18 and 53795 DF,  p-value: < 2.2e-16

mod_diamond = lm(price ~ carat + color + cut + clarity, diamonds, na.action = na.warn)
summary(mod_diamond)

## 
## Call:
## lm(formula = price ~ carat + color + cut + clarity, data = diamonds, 
##     na.action = na.warn)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16813.5   -680.4   -197.6    466.4  10394.9 
## 
## Coefficients:
##              Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -3710.603     13.980 -265.414  < 2e-16 ***
## carat        8886.129     12.034  738.437  < 2e-16 ***
## color.L     -1910.288     17.712 -107.853  < 2e-16 ***
## color.Q      -627.954     16.121  -38.952  < 2e-16 ***
## color.C      -171.960     15.070  -11.410  < 2e-16 ***
## color^4        21.678     13.840    1.566    0.117    
## color^5       -85.943     13.076   -6.572 5.00e-11 ***
## color^6       -49.986     11.889   -4.205 2.62e-05 ***
## cut.L         698.907     20.335   34.369  < 2e-16 ***
## cut.Q        -327.686     17.911  -18.295  < 2e-16 ***
## cut.C         180.565     15.557   11.607  < 2e-16 ***
## cut^4          -1.207     12.458   -0.097    0.923    
## clarity.L    4217.535     30.831  136.794  < 2e-16 ***
## clarity.Q   -1832.406     28.827  -63.565  < 2e-16 ***
## clarity.C     923.273     24.679   37.411  < 2e-16 ***
## clarity^4    -361.995     19.739  -18.339  < 2e-16 ***
## clarity^5     216.616     16.109   13.447  < 2e-16 ***
## clarity^6       2.105     14.037    0.150    0.881    
## clarity^7     110.340     12.383    8.910  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1157 on 53921 degrees of freedom
## Multiple R-squared:  0.9159, Adjusted R-squared:  0.9159 
## F-statistic: 3.264e+04 on 18 and 53921 DF,  p-value: < 2.2e-16

mod_diamond Estimate Std. Error t value Pr(>|t|)
(Intercept) -3710.603 13.980 -265.414 < 2e-16 ***

mod_diamonds2 Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.206978 0.001693 7211.806 < 2e-16 ***

Please see above two models, one for the original data, which is mod_diamond, and another one for diamond2 to compare and see if mod_diamond2 did a good job. Take the intercept row as an example, they’re pretty different from each other except for P-value. However, except from the different p-value for color 4, 5, and 6, mod_diamonds2 has categorized the same important factors as the original model.

ANLY 505 - DiamondsDataModel

Week 3

Cheng An Yang

2020-02-22