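# Packages used throughout: tidyverse (ggplot2, dplyr) and modelr; geom_hex() also needs hexbin installed
library(tidyverse)
library(modelr)
# Price against each categorical predictor
ggplot(diamonds, aes(cut, price)) + geom_boxplot()
ggplot(diamonds, aes(color, price)) + geom_boxplot()
ggplot(diamonds, aes(clarity, price)) + geom_boxplot()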
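# Price against carat on the raw scale
ggplot(diamonds, aes(carat, price)) +
  geom_hex(bins = 50)
# Keep diamonds of at most 2.5 carats and log-transform both variables
diamonds2 <- diamonds %>%
  filter(carat <= 2.5) %>%
  mutate(lprice = log2(price), lcarat = log2(carat))
ggplot(diamonds2, aes(lcarat, lprice)) +
  geom_hex(bins = 50)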
# Fit a linear model on the log-log scale
mod_diamond <- lm(lprice ~ lcarat, data = diamonds2)
# Predict over an evenly spaced carat grid, then back-transform to price
grid <- diamonds2 %>%
  data_grid(carat = seq_range(carat, 20)) %>%
  mutate(lcarat = log2(carat)) %>%
  add_predictions(mod_diamond, "lprice") %>%
  mutate(price = 2 ^ lprice)
ggplot(diamonds2, aes(carat, price)) +
  geom_hex(bins = 50) +
  geom_line(data = grid, color = "green", size = 1)
# Residuals of the carat-only model, on the log2 scale
diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond, "lresid")
ggplot(diamonds2, aes(lcarat, lresid)) +
  geom_hex(bins = 50)
# Residuals against the remaining categorical predictors
ggplot(diamonds2, aes(cut, lresid)) + geom_boxplot()
ggplot(diamonds2, aes(color, lresid)) + geom_boxplot()
ggplot(diamonds2, aes(clarity, lresid)) + geom_boxplot()
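# Add color, cut, and clarity to the model
mod_diamond2 <- lm(
  lprice ~ lcarat + color + cut + clarity, data = diamonds2
)
# Vary cut while data_grid() fills the other predictors with typical values
grid <- diamonds2 %>%
  data_grid(cut, .model = mod_diamond2) %>%
  add_predictions(mod_diamond2)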
grid
## # A tibble: 5 x 5
##   cut       lcarat color clarity  pred
##   <ord>      <dbl> <chr> <chr>   <dbl>
## 1 Fair      -0.515 G     VS2      11.2
## 2 Good      -0.515 G     VS2      11.3
## 3 Very Good -0.515 G     VS2      11.4
## 4 Premium   -0.515 G     VS2      11.4
## 5 Ideal     -0.515 G     VS2      11.4
ggplot(grid, aes(cut, pred)) +
  geom_point()
# Residuals of the full model, on the log2 scale
diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond2, "lresid2")
ggplot(diamonds2, aes(lcarat, lresid2)) +
  geom_hex(bins = 50)
# Diamonds whose price is off by more than a factor of two (|log2 residual| > 1)
diamonds2 %>%
  filter(abs(lresid2) > 1) %>%
  add_predictions(mod_diamond2) %>%
  mutate(pred = round(2 ^ pred)) %>%
  select(price, pred, carat:table, x:z) %>%
  arrange(price)
## # A tibble: 16 x 11
##    price  pred carat cut       color clarity depth table     x     y     z
##    <int> <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1013   264 0.25  Fair      F     SI2      54.4    64  4.3   4.23  2.32
##  2  1186   284 0.25  Premium   G     SI2      59      60  5.33  5.28  3.12
##  3  1186   284 0.25  Premium   G     SI2      58.8    60  5.33  5.28  3.12
##  4  1262  2644 1.03  Fair      E     I1       78.2    54  5.72  5.59  4.42
##  5  1415   639 0.35  Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  6  1415   639 0.35  Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  7  1715   576 0.32  Fair      F     VS2      59.6    60  4.42  4.34  2.61
##  8  1776   412 0.290 Fair      F     SI1      55.8    60  4.48  4.41  2.48
##  9  2160   314 0.34  Fair      F     I1       55.8    62  4.72  4.6   2.6
## 10  2366   774 0.3   Very Good D     VVS2     60.6    58  4.33  4.35  2.63
## 11  3360  1373 0.51  Premium   F     SI1      62.7    62  5.09  4.96  3.15
## 12  3807  1540 0.61  Good      F     SI2      62.5    65  5.36  5.29  3.33
## 13  3920  1705 0.51  Fair      F     VVS2     65.4    60  4.98  4.9   3.23
## 14  4368  1705 0.51  Fair      F     VVS2     60.7    66  5.21  5.11  3.13
## 15 10011  4048 1.01  Fair      D     SI2      64.6    58  6.25  6.2   4.02
## 16 10470 23622 2.46  Premium   E     SI2      59.7    59  8.82  8.76  5.25
In the plot of lcarat vs. lprice, there are some bright vertical strips. What do they represent?
The bright strips mark the popular carat sizes, where many diamonds pile up at the same lcarat value across a range of prices. Carat does not follow a uniform distribution because round numbers are easier to sell. Similarly, diamonds of similar carat size tend to sell at approximately the same rounded prices.
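One way to check the claim about popular carat sizes is to count how often each carat value occurs. This is a minimal sketch, assuming ggplot2 and dplyr are installed; the name top_carats is introduced here for illustration only.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)

# Count diamonds at each recorded carat value, most common first
top_carats <- diamonds %>%
  count(carat, sort = TRUE) %>%
  head(10)

top_carats
```

If the explanation holds, values at or just above round sizes should dominate this list.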
If log(price) = a_0 + a_1 * log(carat), what does that say about the relationship between price and carat?
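It means that price and carat follow a power law: price = c * carat^a_1 for some positive constant c, so the relationship is linear only on the log-log scale. In this model, carat is the only predictor of price. A 1% increase in carat is associated with roughly an a_1% increase in price, which equals 1% only when a_1 = 1.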
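The algebra behind the log-log relationship can be written out as follows, assuming natural logarithms. With log2, as used in the code above, the constant becomes 2^{a_0} but the exponent a_1 is unchanged.

```latex
\begin{aligned}
\log(\text{price}) &= a_0 + a_1 \log(\text{carat}) \\
\text{price} &= e^{a_0}\,\text{carat}^{a_1} \\
\frac{d\,\text{price}}{\text{price}} &= a_1\,\frac{d\,\text{carat}}{\text{carat}}
\end{aligned}
```

The last line says that a small relative change in carat produces a_1 times that relative change in price, so a_1 acts as an elasticity.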
Extract the diamonds that have very high and very low residuals. Is there anything unusual about these diamonds? Are they particularly bad or good, or do you think these are pricing errors?
In the table below, the first 20 rows hold the most negative residuals. Several of them are diamonds of about 3 carats priced much lower than the model predicts. Rows 21-40 hold the most positive residuals, and there are also a few 1-carat diamonds priced much higher than predicted. However, other factors such as clarity need to be taken into consideration: most of the underpriced diamonds shown have the lowest clarity grade (I1), and price is affected by clarity as well as carat. So there are no big errors here. The price is simply affected by multiple factors.
# Use this chunk to place your code for extracting the high and low residuals
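# Note: this redefines diamonds2 without the carat <= 2.5 filter, so larger diamonds are included
diamonds2 <- diamonds %>% mutate(lprice = log2(price), lcarat = log2(carat))
model2 <- lm(lprice ~ lcarat + color + clarity + cut, data = diamonds2)
# Most negative residuals: priced far below the model's prediction
low <- diamonds2 %>% add_residuals(model2) %>% arrange(resid) %>% slice(1:20)
# Most positive residuals: priced far above the model's prediction
high <- diamonds2 %>% add_residuals(model2) %>% arrange(desc(resid)) %>% slice(1:20)
bind_rows(low, high) %>% select(price, carat, resid, clarity)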
## # A tibble: 40 x 4
##    price carat  resid clarity
##    <int> <dbl>  <dbl> <ord>
##  1  6512  3    -1.46  I1
##  2 10470  2.46 -1.17  SI2
##  3 10453  3.05 -1.14  I1
##  4 14220  3.01 -1.12  SI2
##  5  9925  3.01 -1.12  I1
##  6 18701  3.51 -1.09  VS2
##  7  1262  1.03 -1.04  I1
##  8  8040  3.01 -1.02  I1
##  9 12587  3.5  -0.990 I1
## 10  8044  3    -0.985 I1
## # … with 30 more rows
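Because the model is fit to log2(price), a residual is easier to read once it is converted to a ratio of actual to predicted price. This small sketch uses the first and last residuals shown in the table above, plus -1 and 1 for reference. The names resid_log2 and price_ratio are illustrative and not part of the original analysis.

```r
# log2-scale residuals: rows 1 and 10 of the table above, plus -1 and 1 for reference
resid_log2 <- c(-1.46, -0.985, -1, 1)

# 2^resid is the actual price divided by the predicted price
price_ratio <- 2 ^ resid_log2

data.frame(resid_log2, price_ratio)
```

A residual of -1 means the diamond sold for half the predicted price and +1 means double, so the diamonds shown in the table sold for roughly half their predicted price or less.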
Does the final model, mod_diamond2, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond and why?
Based on the plot of lresid2 vs. lcarat, the model appears to predict diamond prices reliably, apart from a few outliers.
# Use this chunk to place your code for assessing how well the model predicts diamond prices
summary(diamonds2)
## carat cut color clarity
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066
## Max. :5.0100 I: 5422 VVS1 : 3655
## J: 2808 (Other): 2531
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
##
## y z lprice lcarat
## Min. : 0.000 Min. : 0.000 Min. : 8.349 Min. :-2.32193
## 1st Qu.: 4.720 1st Qu.: 2.910 1st Qu.: 9.892 1st Qu.:-1.32193
## Median : 5.710 Median : 3.530 Median :11.229 Median :-0.51457
## Mean : 5.735 Mean : 3.539 Mean :11.234 Mean :-0.56982
## 3rd Qu.: 6.540 3rd Qu.: 4.040 3rd Qu.:12.378 3rd Qu.: 0.05658
## Max. :58.900 Max. :31.800 Max. :14.200 Max. : 2.32481
##
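summary(diamonds2) describes the data, not the quality of the fit. A more direct check is sketched below, on the assumption that the final model is the one fit earlier to diamonds of at most 2.5 carats. It uses modelr's rsquare(), rmse(), and mae() helpers; the names fit_r2, fit_rmse, and fit_mae are illustrative.

```r
library(ggplot2)  # provides the diamonds dataset
library(dplyr)
library(modelr)

# Rebuild the filtered data and the final model so the block runs on its own
diamonds2 <- diamonds %>%
  filter(carat <= 2.5) %>%
  mutate(lprice = log2(price), lcarat = log2(carat))
mod_diamond2 <- lm(lprice ~ lcarat + color + cut + clarity, data = diamonds2)

fit_r2   <- rsquare(mod_diamond2, diamonds2)  # share of variance in log2(price) explained
fit_rmse <- rmse(mod_diamond2, diamonds2)     # root-mean-squared error, in log2 units
fit_mae  <- mae(mod_diamond2, diamonds2)      # mean absolute error, in log2 units

# Back-transform the RMSE to a multiplicative error on the price scale
c(r2 = fit_r2, rmse = fit_rmse, mae = fit_mae, price_factor = 2 ^ fit_rmse)
```

A price_factor of 1.15, for example, would mean that predictions are typically off by about 15% in either direction.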