ggplot(diamonds, aes(cut,price)) + geom_boxplot()
ggplot(diamonds, aes(color,price)) + geom_boxplot()
ggplot(diamonds, aes(clarity,price)) + geom_boxplot()
ggplot(diamonds, aes(carat, price)) +
geom_hex(bins=50)
diamonds2 <- diamonds %>%
filter(carat <= 2.5) %>%
mutate(lprice = log2(price), lcarat = log2(carat))
ggplot(diamonds2, aes(lcarat, lprice)) +
geom_hex(bins=50)
mod_diamond <- lm(lprice ~ lcarat, data = diamonds2, na.action = na.warn)
grid <- diamonds2 %>%
data_grid(carat = seq_range(carat, 20)) %>%
mutate(lcarat = log2(carat)) %>%
add_predictions(mod_diamond, "lprice") %>%
mutate(price = 2 ^ lprice)
ggplot(diamonds2, aes(carat, price)) +
geom_hex(bins = 50) +
geom_line(data = grid, color = "green", size = 1)
diamonds2 <- diamonds2 %>%
add_residuals(mod_diamond, "lresid")
ggplot(diamonds2, aes(lcarat, lresid)) +
geom_hex(bins = 50)
ggplot(diamonds2, aes(cut,lresid)) + geom_boxplot()
ggplot(diamonds2, aes(color,lresid)) + geom_boxplot()
ggplot(diamonds2, aes(clarity,lresid)) + geom_boxplot()
mod_diamond2 <- lm(
lprice ~ lcarat + color + cut + clarity, diamonds2, na.action = na.warn
)
grid <- diamonds2 %>%
data_grid(cut, .model = mod_diamond2) %>%
add_predictions(mod_diamond2)
grid
## # A tibble: 5 x 5
## cut lcarat color clarity pred
## <ord> <dbl> <chr> <chr> <dbl>
## 1 Fair -0.515 G VS2 11.2
## 2 Good -0.515 G VS2 11.3
## 3 Very Good -0.515 G VS2 11.4
## 4 Premium -0.515 G VS2 11.4
## 5 Ideal -0.515 G VS2 11.4
ggplot(grid, aes(cut, pred)) +
geom_point()
diamonds2 <- diamonds2 %>%
add_residuals(mod_diamond2, "lresid2")
ggplot(diamonds2, aes(lcarat, lresid2)) +
geom_hex(bins = 50)
diamonds2 %>%
filter(abs(lresid2) > 1) %>%
add_predictions(mod_diamond2) %>%
mutate(pred = round(2^pred)) %>%
select(price, pred, carat:table, x:z) %>%
arrange(price)
## # A tibble: 16 x 11
## price pred carat cut color clarity depth table x y z
## <int> <dbl> <dbl> <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1013 264 0.25 Fair F SI2 54.4 64 4.3 4.23 2.32
## 2 1186 284 0.25 Premium G SI2 59 60 5.33 5.28 3.12
## 3 1186 284 0.25 Premium G SI2 58.8 60 5.33 5.28 3.12
## 4 1262 2644 1.03 Fair E I1 78.2 54 5.72 5.59 4.42
## 5 1415 639 0.35 Fair G VS2 65.9 54 5.57 5.53 3.66
## 6 1415 639 0.35 Fair G VS2 65.9 54 5.57 5.53 3.66
## 7 1715 576 0.32 Fair F VS2 59.6 60 4.42 4.34 2.61
## 8 1776 412 0.290 Fair F SI1 55.8 60 4.48 4.41 2.48
## 9 2160 314 0.34 Fair F I1 55.8 62 4.72 4.6 2.6
## 10 2366 774 0.3 Very Good D VVS2 60.6 58 4.33 4.35 2.63
## 11 3360 1373 0.51 Premium F SI1 62.7 62 5.09 4.96 3.15
## 12 3807 1540 0.61 Good F SI2 62.5 65 5.36 5.29 3.33
## 13 3920 1705 0.51 Fair F VVS2 65.4 60 4.98 4.9 3.23
## 14 4368 1705 0.51 Fair F VVS2 60.7 66 5.21 5.11 3.13
## 15 10011 4048 1.01 Fair D SI2 64.6 58 6.25 6.2 4.02
## 16 10470 23622 2.46 Premium E SI2 59.7 59 8.82 8.76 5.25
In the plot of lcarat vs. lprice, there are some bright vertical strips. What do they represent?
# Use this chunk to answer question 1
The bright blue strips represent the count of diamonds in the same carat with different prices.
If log(price) = a_0 + a_1 * log(carat), what does that say about the relationship between price and carat?
As we all know the relationship between price and carot is positive; however, we’re not sure about if this relationship is linear. According to the formula and the code below, the relationship between price and carot is not linear.
# Use this chunk to answer question 2
mod_log <- lm(log2(price) ~ log2(carat), data = diamonds)
mod_log
##
## Call:
## lm(formula = log2(price) ~ log2(carat), data = diamonds)
##
## Coefficients:
## (Intercept) log2(carat)
## 12.189 1.676
plot(mod_log)
Extract the diamonds that have very high and very low residuals. Is there anything unusual about these diamonds? Are they particularly bad or good, or do you think these are pricing errors?
Using cook.distance to see the outliears, which are sbove 5 or less than -5. If we subset the data using risiduals with within 25% or above 75% and compare the charts with the orginal charts, we can tell if those outliers are different from other factors that are with 25% and 75% risiduals. From the charts we found that they are not very different from each other. In conclusion, some of them are probably over- or under- price.
# Use this chunk to place your code for extracting the high and low residuals and answer question 3
lm1<-lm(carat~price, diamonds)
plot(lm1)
diamonds3 <- diamonds2 %>%
filter(lresid > quantile(diamonds2$lresid, 0.25) | quantile(diamonds2$lresid, 0.75) )
summary(diamonds3)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1589 D: 6771 SI1 :13055 Min. :43.00
## 1st Qu.:0.4000 Good : 4889 E: 9792 VS2 :12256 1st Qu.:61.00
## Median :0.7000 Very Good:12063 F: 9538 SI2 : 9116 Median :61.80
## Mean :0.7932 Premium :13745 G:11280 VS1 : 8169 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21528 H: 8269 VVS2 : 5066 3rd Qu.:62.50
## Max. :2.5000 I: 5389 VVS1 : 3655 Max. :79.00
## J: 2775 (Other): 2497
## table price x y
## Min. :43.00 Min. : 326 Min. :0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 948 1st Qu.:4.710 1st Qu.: 4.720
## Median :57.00 Median : 2396 Median :5.690 Median : 5.710
## Mean :57.46 Mean : 3906 Mean :5.724 Mean : 5.727
## 3rd Qu.:59.00 3rd Qu.: 5293 3rd Qu.:6.540 3rd Qu.: 6.530
## Max. :95.00 Max. :18823 Max. :8.890 Max. :58.900
##
## z lprice lcarat lresid
## Min. : 0.000 Min. : 8.349 Min. :-2.32193 Min. :-1.964068
## 1st Qu.: 2.910 1st Qu.: 9.889 1st Qu.:-1.32193 1st Qu.:-0.245488
## Median : 3.520 Median :11.226 Median :-0.51457 Median :-0.008442
## Mean : 3.534 Mean :11.228 Mean :-0.57460 Mean : 0.000000
## 3rd Qu.: 4.030 3rd Qu.:12.370 3rd Qu.: 0.05658 3rd Qu.: 0.239301
## Max. :31.800 Max. :14.200 Max. : 1.32193 Max. : 1.934855
##
## lresid2
## Min. :-1.17388
## 1st Qu.:-0.12437
## Median :-0.00094
## Mean : 0.00000
## 3rd Qu.: 0.11920
## Max. : 2.78322
##
ggplot(diamonds3, aes(lresid,price)) + geom_boxplot()
## Warning: Continuous x aesthetic -- did you forget aes(group=...)?
ggplot(diamonds3, aes(lresid,color)) + geom_boxplot()
ggplot(diamonds3, aes(lresid,cut)) + geom_boxplot()
mod_diamond3 <- lm(
lprice ~ lcarat + color + cut + clarity, diamonds3, na.action = na.warn
)
plot(mod_diamond3)
mod_diamond2 <- lm(
lprice ~ lcarat + color + cut + clarity, diamonds2, na.action = na.warn
)
plot(mod_diamond2)
Does the final model, mod_diamonds2, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond and why?
# Use this chunk to place your code for assessing how well the model predicts diamond prices and answer question 4
summary(mod_diamond2)
##
## Call:
## lm(formula = lprice ~ lcarat + color + cut + clarity, data = diamonds2,
## na.action = na.warn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.17388 -0.12437 -0.00094 0.11920 2.78322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.206978 0.001693 7211.806 < 2e-16 ***
## lcarat 1.886239 0.001124 1677.809 < 2e-16 ***
## color.L -0.633998 0.002910 -217.872 < 2e-16 ***
## color.Q -0.137580 0.002676 -51.409 < 2e-16 ***
## color.C -0.022072 0.002503 -8.819 < 2e-16 ***
## color^4 0.016570 0.002297 7.213 5.54e-13 ***
## color^5 -0.002828 0.002169 -1.304 0.192
## color^6 0.003533 0.001971 1.793 0.073 .
## cut.L 0.173866 0.003386 51.349 < 2e-16 ***
## cut.Q -0.050346 0.002980 -16.897 < 2e-16 ***
## cut.C 0.019129 0.002583 7.407 1.31e-13 ***
## cut^4 -0.002410 0.002066 -1.166 0.243
## clarity.L 1.308155 0.005179 252.598 < 2e-16 ***
## clarity.Q -0.334090 0.004839 -69.047 < 2e-16 ***
## clarity.C 0.178423 0.004140 43.093 < 2e-16 ***
## clarity^4 -0.088059 0.003298 -26.697 < 2e-16 ***
## clarity^5 0.035885 0.002680 13.389 < 2e-16 ***
## clarity^6 -0.001371 0.002327 -0.589 0.556
## clarity^7 0.048221 0.002051 23.512 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1916 on 53795 degrees of freedom
## Multiple R-squared: 0.9828, Adjusted R-squared: 0.9828
## F-statistic: 1.706e+05 on 18 and 53795 DF, p-value: < 2.2e-16
mod_diamond = lm(price ~ carat + color + cut + clarity, diamonds, na.action = na.warn)
summary(mod_diamond)
##
## Call:
## lm(formula = price ~ carat + color + cut + clarity, data = diamonds,
## na.action = na.warn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16813.5 -680.4 -197.6 466.4 10394.9
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3710.603 13.980 -265.414 < 2e-16 ***
## carat 8886.129 12.034 738.437 < 2e-16 ***
## color.L -1910.288 17.712 -107.853 < 2e-16 ***
## color.Q -627.954 16.121 -38.952 < 2e-16 ***
## color.C -171.960 15.070 -11.410 < 2e-16 ***
## color^4 21.678 13.840 1.566 0.117
## color^5 -85.943 13.076 -6.572 5.00e-11 ***
## color^6 -49.986 11.889 -4.205 2.62e-05 ***
## cut.L 698.907 20.335 34.369 < 2e-16 ***
## cut.Q -327.686 17.911 -18.295 < 2e-16 ***
## cut.C 180.565 15.557 11.607 < 2e-16 ***
## cut^4 -1.207 12.458 -0.097 0.923
## clarity.L 4217.535 30.831 136.794 < 2e-16 ***
## clarity.Q -1832.406 28.827 -63.565 < 2e-16 ***
## clarity.C 923.273 24.679 37.411 < 2e-16 ***
## clarity^4 -361.995 19.739 -18.339 < 2e-16 ***
## clarity^5 216.616 16.109 13.447 < 2e-16 ***
## clarity^6 2.105 14.037 0.150 0.881
## clarity^7 110.340 12.383 8.910 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1157 on 53921 degrees of freedom
## Multiple R-squared: 0.9159, Adjusted R-squared: 0.9159
## F-statistic: 3.264e+04 on 18 and 53921 DF, p-value: < 2.2e-16
mod_diamond Estimate Std. Error t value Pr(>|t|)
(Intercept) -3710.603 13.980 -265.414 < 2e-16 ***
mod_diamonds2 Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.206978 0.001693 7211.806 < 2e-16 ***
Please see above two models, one for the original data, which is mod_diamond, and another one for diamond2 to compare and see if mod_diamond2 did a good job. Take the intercept row as an example, they’re pretty different from each other except for P-value. However, except from the different p-value for color 4, 5, and 6, mod_diamonds2 has categorized the same important factors as the original model.