Initial Visualization

ggplot(diamonds, aes(cut,price)) + geom_boxplot()

ggplot(diamonds, aes(color,price)) + geom_boxplot()

ggplot(diamonds, aes(clarity,price)) + geom_boxplot()

ggplot(diamonds, aes(carat, price)) +
  geom_hex(bins=50)

Subset Data and replot

diamonds2 <- diamonds %>%
  filter(carat <= 2.5)  %>%
  mutate(lprice = log2(price), lcarat = log2(carat))

ggplot(diamonds2, aes(lcarat, lprice)) +
  geom_hex(bins=50)

Simple model and visualization

mod_diamond <- lm(lprice ~ lcarat, data = diamonds2)

grid <- diamonds2 %>%
  data_grid(carat = seq_range(carat, 20)) %>%
  mutate(lcarat = log2(carat)) %>%
  add_predictions(mod_diamond, "lprice") %>%
  mutate(price = 2 ^ lprice)

ggplot(diamonds2, aes(carat, price)) +
  geom_hex(bins = 50) +
  geom_line(data = grid, color = "green", size = 1)

Add residuals and plot

diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond, "lresid")

ggplot(diamonds2, aes(lcarat, lresid)) +
  geom_hex(bins = 50)

ggplot(diamonds2, aes(cut,lresid)) + geom_boxplot()

ggplot(diamonds2, aes(color,lresid)) + geom_boxplot()

ggplot(diamonds2, aes(clarity,lresid)) + geom_boxplot()

Four parameter model and visualization

mod_diamond2 <- lm(
  lprice ~ lcarat + color + cut + clarity, diamonds2
)

grid <- diamonds2 %>%
  data_grid(cut, .model = mod_diamond2) %>%
  add_predictions(mod_diamond2)
grid
## # A tibble: 5 x 5
##   cut       lcarat color clarity  pred
##   <ord>      <dbl> <chr> <chr>   <dbl>
## 1 Fair      -0.515 G     VS2      11.2
## 2 Good      -0.515 G     VS2      11.3
## 3 Very Good -0.515 G     VS2      11.4
## 4 Premium   -0.515 G     VS2      11.4
## 5 Ideal     -0.515 G     VS2      11.4
ggplot(grid, aes(cut, pred)) +
  geom_point()

Plot residuals of four parameter model

diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond2, "lresid2")

ggplot(diamonds2, aes(lcarat, lresid2)) +
  geom_hex(bins = 50)

diamonds2 %>%
  filter(abs(lresid2) > 1) %>%
  add_predictions(mod_diamond2) %>%
  mutate(pred = round(2^pred)) %>%
  select(price, pred, carat:table, x:z) %>%
  arrange(price)
## # A tibble: 16 x 11
##    price  pred carat cut       color clarity depth table     x     y     z
##    <int> <dbl> <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1013   264 0.25  Fair      F     SI2      54.4    64  4.3   4.23  2.32
##  2  1186   284 0.25  Premium   G     SI2      59      60  5.33  5.28  3.12
##  3  1186   284 0.25  Premium   G     SI2      58.8    60  5.33  5.28  3.12
##  4  1262  2644 1.03  Fair      E     I1       78.2    54  5.72  5.59  4.42
##  5  1415   639 0.35  Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  6  1415   639 0.35  Fair      G     VS2      65.9    54  5.57  5.53  3.66
##  7  1715   576 0.32  Fair      F     VS2      59.6    60  4.42  4.34  2.61
##  8  1776   412 0.290 Fair      F     SI1      55.8    60  4.48  4.41  2.48
##  9  2160   314 0.34  Fair      F     I1       55.8    62  4.72  4.6   2.6 
## 10  2366   774 0.3   Very Good D     VVS2     60.6    58  4.33  4.35  2.63
## 11  3360  1373 0.51  Premium   F     SI1      62.7    62  5.09  4.96  3.15
## 12  3807  1540 0.61  Good      F     SI2      62.5    65  5.36  5.29  3.33
## 13  3920  1705 0.51  Fair      F     VVS2     65.4    60  4.98  4.9   3.23
## 14  4368  1705 0.51  Fair      F     VVS2     60.7    66  5.21  5.11  3.13
## 15 10011  4048 1.01  Fair      D     SI2      64.6    58  6.25  6.2   4.02
## 16 10470 23622 2.46  Premium   E     SI2      59.7    59  8.82  8.76  5.25

Question #1

In the plot of lcarat vs. lprice, there are some bright vertical strips. What do they represent?

Answer 1:

The vertical lines are corresponding to relatively higher count of carat values on specific points 0.3, 0, 0.75, 1, 1.5 and 2. The vertical line reflects higher count of datapoints around these specific carat values. This can be illustrated using varplots of carat and lcarat as as shown.

ggplot(data=diamonds2, aes(x=carat))+geom_bar()

#bar plot of lcarat
ggplot(data=diamonds2, aes(x=lcarat))+geom_bar()

Question #2

If log(price) = a_0 + a_1 * log(carat), what does that say about the relationship between price and carat?

Answer 2: If we solve the equation mathematically it can be transformed to the following expression (assuming base of log is e) price= e^a_0 * carat^a_1

This clearly shows that price is exponentially related to carat. price is proportional to a_1 power of carat with a multiplicative factor of a_0

Question #3

Extract the diamonds that have very high and very low residuals. Is there anything unusual about these diamonds? Are they particularly bad or good, or do you think these are pricing errors?

#build a model for getting the residuals

diamonds3 <-diamonds %>% 
  mutate(lprice = log(price),
         lcarat = log(carat))

lm1<- lm(lprice ~ lcarat + color + clarity + cut, data = diamonds3)

diamonds3<-diamonds3 %>% 
  add_residuals(lm1)

#Lets take the data for  1% and lowest 1% values of residuals
#computing 1% of total rows present in the dataset

n_1_percent<-round(nrow(diamonds3)*1/100)

# use order to get the top 10 and bottom 10 rows of data by values or residuals  
lowest_10<-diamonds3[order(diamonds3$resid),][1:10,]
highest_10<-diamonds3[order(-diamonds3$resid),][1:n_1_percent,]

df1<-rbind(lowest_10,highest_10 )
#Lets compare the pattern of these subsets with the whole dataset



#compare the price v/s carat plots for highest/lowest  and whole dataset

p1<-ggplot(df1, aes(carat, price))+geom_hex(bins=20)+ggtitle("only high and low residual values")
  
  
p2<-ggplot(diamonds2, aes(carat, price))+geom_hex(bins=50)+ggtitle("whole dataset")
  

grid.arrange(p1,p2, layout_matrix = cbind(1,2))

#compare the log transformed values plots
p3<-ggplot(df1, aes(lcarat, lprice)) +
  geom_hex(bins=20)+ggtitle("only high and low residual values")

p4<-ggplot(diamonds2, aes(lcarat, lprice)) +
  geom_hex(bins=50)+ggtitle("whole dataset")

grid.arrange(p3,p4, layout_matrix = cbind(1,2))

Comments:

We compared the price v/s carat plots for original and log transformed vlaues. There does not seem any particular difference between the relation ship of price and carat for very high and very low residuals and the whole dataset

Question #4

Does the final model, mod_diamonds2, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond and why?

# Use this chunk to place your code for assessing how well the model predicts diamond prices

mod_diamond2 <- lm(
  lprice ~ lcarat + color + cut + clarity, diamonds2
)

df2 <- diamonds2 %>%
  add_predictions(mod_diamond2)%>%
  mutate(predicted_price = 2 ^pred, diff=predicted_price-price)


#plot ot compare the differenc ebetween price (actual) and predicted price
ggplot(df2, aes(predicted_price, price))+geom_point()+geom_abline(intercept = 0, slope = 1, color="blue")+ggtitle("plot ot compare the difference between price (actual) and predicted price")

mean(df2$diff)
## [1] -49.172
max(df2$diff)
## [1] 13152.21
sd(df2$diff)
## [1] 734.3606

**Comments:* If we just see the graph of actual price v/s predicted price then it seems like the model is doing a decent job of predicting the price as the prices for most of the diamonds stick around the actual price. so on an avergae it does a good job. When we compute some important statistics like the mean differenc ebwetween actual and predicted price it is -49.172 which is no t very high considering th price of diamonds, however the maximum difference is 13152.21 which is pretty high. also the standard devation value is $734.3606 which is considerably high. So in summary, on avrage the model does a satisfactory job if we cosider for all the diamonds in the datset but the deviation is prediction is very flixible which leads to considerable difference in precited and ac tual value which could be a risk in case of high value diamonds