ggplot(diamonds, aes(cut,price)) + geom_boxplot()
ggplot(diamonds, aes(color,price)) + geom_boxplot()
ggplot(diamonds, aes(clarity,price)) + geom_boxplot()
ggplot(diamonds, aes(carat, price)) +
geom_hex(bins=50)
diamonds2 <- diamonds %>%
filter(carat <= 2.5) %>%
mutate(lprice = log2(price), lcarat = log2(carat))
ggplot(diamonds2, aes(lcarat, lprice)) +
geom_hex(bins=50)
mod_diamond <- lm(lprice ~ lcarat, data = diamonds2)
grid <- diamonds2 %>%
data_grid(carat = seq_range(carat, 20)) %>%
mutate(lcarat = log2(carat)) %>%
add_predictions(mod_diamond, "lprice") %>%
mutate(price = 2 ^ lprice)
ggplot(diamonds2, aes(carat, price)) +
geom_hex(bins = 50) +
geom_line(data = grid, color = "green", size = 1)
diamonds2 <- diamonds2 %>%
add_residuals(mod_diamond, "lresid")
ggplot(diamonds2, aes(lcarat, lresid)) +
geom_hex(bins = 50)
ggplot(diamonds2, aes(cut,lresid)) + geom_boxplot()
ggplot(diamonds2, aes(color,lresid)) + geom_boxplot()
ggplot(diamonds2, aes(clarity,lresid)) + geom_boxplot()
mod_diamond2 <- lm(
lprice ~ lcarat + color + cut + clarity, diamonds2
)
grid <- diamonds2 %>%
data_grid(cut, .model = mod_diamond2) %>%
add_predictions(mod_diamond2)
grid
## # A tibble: 5 x 5
## cut lcarat color clarity pred
## <ord> <dbl> <chr> <chr> <dbl>
## 1 Fair -0.515 G VS2 11.2
## 2 Good -0.515 G VS2 11.3
## 3 Very Good -0.515 G VS2 11.4
## 4 Premium -0.515 G VS2 11.4
## 5 Ideal -0.515 G VS2 11.4
ggplot(grid, aes(cut, pred)) +
geom_point()
diamonds2 <- diamonds2 %>%
add_residuals(mod_diamond2, "lresid2")
ggplot(diamonds2, aes(lcarat, lresid2)) +
geom_hex(bins = 50)
diamonds2 %>%
filter(abs(lresid2) > 1) %>%
add_predictions(mod_diamond2) %>%
mutate(pred = round(2^pred)) %>%
select(price, pred, carat:table, x:z) %>%
arrange(price)
## # A tibble: 16 x 11
## price pred carat cut color clarity depth table x y z
## <int> <dbl> <dbl> <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1013 264 0.25 Fair F SI2 54.4 64 4.3 4.23 2.32
## 2 1186 284 0.25 Premium G SI2 59 60 5.33 5.28 3.12
## 3 1186 284 0.25 Premium G SI2 58.8 60 5.33 5.28 3.12
## 4 1262 2644 1.03 Fair E I1 78.2 54 5.72 5.59 4.42
## 5 1415 639 0.35 Fair G VS2 65.9 54 5.57 5.53 3.66
## 6 1415 639 0.35 Fair G VS2 65.9 54 5.57 5.53 3.66
## 7 1715 576 0.32 Fair F VS2 59.6 60 4.42 4.34 2.61
## 8 1776 412 0.290 Fair F SI1 55.8 60 4.48 4.41 2.48
## 9 2160 314 0.34 Fair F I1 55.8 62 4.72 4.6 2.6
## 10 2366 774 0.3 Very Good D VVS2 60.6 58 4.33 4.35 2.63
## 11 3360 1373 0.51 Premium F SI1 62.7 62 5.09 4.96 3.15
## 12 3807 1540 0.61 Good F SI2 62.5 65 5.36 5.29 3.33
## 13 3920 1705 0.51 Fair F VVS2 65.4 60 4.98 4.9 3.23
## 14 4368 1705 0.51 Fair F VVS2 60.7 66 5.21 5.11 3.13
## 15 10011 4048 1.01 Fair D SI2 64.6 58 6.25 6.2 4.02
## 16 10470 23622 2.46 Premium E SI2 59.7 59 8.82 8.76 5.25
In the plot of lcarat vs. lprice, there are some bright vertical strips. What do they represent?
These bright vertical strips represent the points at which there are higher number of observations. In this case these strips represents the higher density of diamonds at around those points.
If log(price) = a_0 + a_1 * log(carat), what does that say about the relationship between price and carat?
It represents the percentage increase in price for a 1% increase in carat.
However it can also be expressed in the following form:
Modelling the relationship
mod_log <- lm(log2(price) ~ log2(carat), data = diamonds)
diamonds3 <- diamonds %>%
select(price, carat) %>%
add_predictions(mod_log)
ggplot(diamonds3, aes(x = carat, y = 2^pred)) + geom_line() + labs(x = 'carat',
y = 'price')
As we can see that the relationship between price and carat is not linear. Relationship can also be approximated as a t times increase in x resulting in a t^a1 increase in y.
2^coef(mod_log)[2]
## log2(carat)
## 3.195002
A two times increase in x is likely to result in about 3.2 times increase in y
Confirming this relationship
# An increase in carat from 1 to 2
2^(predict(mod_log, newdata = tibble(carat = 2)) - predict(mod_log, newdata = tibble(carat = 1)))
## 1
## 3.195002
# An increase in carat from 2 to 4
2^(predict(mod_log, newdata = tibble(carat = 4)) - predict(mod_log, newdata = tibble(carat = 2)))
## 1
## 3.195002
# An increase in carat from 0.5 to 1
2^(predict(mod_log, newdata = tibble(carat = 1)) - predict(mod_log, newdata = tibble(carat = 0.5)))
## 1
## 3.195002
In all the cases a two times increase led to the same answer 3.195002.
So there are two ways of showing a relationship between price and carat with the log equation. First as a percentage change in price with a 1% change in carat and second as a t times change in carat leading to t^a[1] times change in price.
Extract the diamonds that have very high and very low residuals. Is there anything unusual about these diamonds? Are they particularly bad or good, or do you think these are pricing errors?
# Use this chunk to place your code for extracting the high and low residuals
diamonds2 %>%
filter(abs(lresid2) > 1 | abs(lresid2) < -0.8) %>%
add_predictions(mod_diamond2) %>%
mutate(pred = round(2^pred)) %>%
select(price, pred, carat:table, x:z, lresid2) %>%
arrange(price)
## # A tibble: 16 x 12
## price pred carat cut color clarity depth table x y z
## <int> <dbl> <dbl> <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1013 264 0.25 Fair F SI2 54.4 64 4.3 4.23 2.32
## 2 1186 284 0.25 Prem~ G SI2 59 60 5.33 5.28 3.12
## 3 1186 284 0.25 Prem~ G SI2 58.8 60 5.33 5.28 3.12
## 4 1262 2644 1.03 Fair E I1 78.2 54 5.72 5.59 4.42
## 5 1415 639 0.35 Fair G VS2 65.9 54 5.57 5.53 3.66
## 6 1415 639 0.35 Fair G VS2 65.9 54 5.57 5.53 3.66
## 7 1715 576 0.32 Fair F VS2 59.6 60 4.42 4.34 2.61
## 8 1776 412 0.290 Fair F SI1 55.8 60 4.48 4.41 2.48
## 9 2160 314 0.34 Fair F I1 55.8 62 4.72 4.6 2.6
## 10 2366 774 0.3 Very~ D VVS2 60.6 58 4.33 4.35 2.63
## 11 3360 1373 0.51 Prem~ F SI1 62.7 62 5.09 4.96 3.15
## 12 3807 1540 0.61 Good F SI2 62.5 65 5.36 5.29 3.33
## 13 3920 1705 0.51 Fair F VVS2 65.4 60 4.98 4.9 3.23
## 14 4368 1705 0.51 Fair F VVS2 60.7 66 5.21 5.11 3.13
## 15 10011 4048 1.01 Fair D SI2 64.6 58 6.25 6.2 4.02
## 16 10470 23622 2.46 Prem~ E SI2 59.7 59 8.82 8.76 5.25
## # ... with 1 more variable: lresid2 <dbl>
There does not appear to be anything too unusual about the data.
However one thing to note here is that the model is placing much more emphasis on carat weight than any other variable. For example observation 4 in this tibble has a carat weight of 1.03 but has a fair cut, E color and i1 clarity. But because of the carat weight the model priced it more than twice of the original price. Similarly observation 10 has a carat weight of 0.30 with a very good cut and D color and VVS2 clarity but still because of the lower carat weight it got priced at about 1/3rd of the original price by the model.
Graphing the extreme residuals with the full dataset to confirm
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
# Extracting data with for the top 20 and bottom 20 residuals
res_top10 <- diamonds2 %>%
arrange(desc(lresid2)) %>%
select(lprice, lcarat, lresid2) %>%
head(20)
res_bottom10 <- diamonds2 %>%
arrange(lresid2) %>%
select(lprice, lcarat, lresid2) %>%
head(20)
df1 <- rbind(res_top10, res_bottom10)
a <- ggplot(df1, aes(lcarat, lprice)) +
geom_hex(bins = 50)
b <- ggplot(diamonds2, aes(lcarat, lprice)) +
geom_hex(bins = 20)
grid.arrange(a, b, layout_matrix = cbind(1, 2))
From the graphs its clear that there doesn’t appear to be anything unusual with the data. The model appears to be heavily favoring carot weight over all other variables while predicting the price. While carat weight is the most important factor from our analysis we can say that giving more weight to other variables would be helpful when predicting prices.
Does the final model, mod_diamonds2, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond and why?
# Use this chunk to place your code for assessing how well the model predicts diamond prices
diamonds4 <- diamonds2 %>%
add_predictions(mod_diamond2) %>%
mutate(pred = round(2^pred)) %>%
select(price, pred, carat:table, x:z, lresid2, lcarat, lresid2) %>%
arrange(price)
c <- ggplot(diamonds4, aes(lcarat, lresid2)) +
geom_hex(bins = 50)
d <- ggplot(diamonds4, aes(pred, price)) +
geom_hex(bins = 50) + geom_abline(intercept = 0, slope = 1, color="blue")
grid.arrange(c, d, layout_matrix = cbind(1, 2))
Model metrics
library(Metrics)
##
## Attaching package: 'Metrics'
## The following objects are masked from 'package:modelr':
##
## mae, mape, mse, rmse
print(paste('Root means squared error: ', rmse(diamonds4$price, diamonds4$pred)))
## [1] "Root means squared error: 735.999947660903"
print(paste('Mean absolute error: ', mae(diamonds4$price, diamonds4$pred)))
## [1] "Mean absolute error: 391.585163712045"
range(diamonds4$price)
## [1] 326 18823
In terms of statistical analysis the model seems to be doing well as indicated by the residual plot and the actual vs prediction plot.
The roor mean squared error is 735, which is extrmely good considering the range of diamond prices are from 326 till 18823. Since the root mean squared error can be heavily influenced by some residuals as we saw earlier we have some extreme residual values we can use the mean absolute error, which is 391.
So these metrics indicate the model seems to be doing well in pricing diamonds.
However, we saw earlier that there are large errors in many cases due to the model heavily favoring carat weight over all other variables. Thus in order to buy and sell diamonds you would need a better model.
Having a model that does take into account the effect of variables like cut, clarity, color while pricing diamonds could be helpful. Also we saw using the log model some presence of non-linearity. Usage of other non-linear models could alsobe helpful. One good approach could be decision trees regression.