ANLY 505 - Problem Set #3

============================================================================================================
Reference: Wickham, H., & Grolemund, G. (2016). R for data science: import, tidy, transform, visualize, and model data. “O’Reilly Media, Inc.”. 376-384.
Data dictionary: https://ggplot2.tidyverse.org/reference/diamonds.html

Exploratory data analysis (EDA)

1. Initial visualization

p1 <- ggplot(diamonds, aes(cut,price)) + geom_boxplot() + theme(axis.text.x=element_text(angle=45, hjust=1)) + labs(x="Cut", y="Price (USD)") + ggtitle("Boxplot 1")
p2 <- ggplot(diamonds, aes(color,price)) + geom_boxplot() + labs(x="Color", y="Price (USD)") + ggtitle("Boxplot 2") + scale_x_discrete(limits=rev(levels(diamonds$color)))
p3 <- ggplot(diamonds, aes(clarity,price)) + geom_boxplot() + theme(axis.text.x=element_text(angle=45, hjust=1)) + labs(x="Clarity", y="Price (USD)") + ggtitle("Boxplot 3")
p4 <- ggplot(diamonds, aes(carat, price)) + geom_hex(bins=50) + labs(x="Carat", y="Price (USD)") + ggtitle("Scatter plot 1")

gridExtra::grid.arrange(p1,p2,p3,p4, nrow=2, bottom="Initial Visualization")

2. Subset data and replot

diamonds2 <- diamonds %>%
  filter(carat <= 2.5)  %>%
  mutate(lprice = log2(price), lcarat = log2(carat),
         color = factor(color, ordered = F),
         cut = factor(cut, ordered = F),
         clarity = factor(clarity, ordered = F))

ggplot(diamonds2, aes(lcarat, lprice)) + geom_hex(bins=50) + labs(x="log2(Carat)", y="log2(Price)") + ggtitle("Scatter plot 2")

3. Simple model and visualization

mod_diamond <- lm(lprice ~ lcarat, data = diamonds2) #linear model of log2(price) and log2(carat)

grid <- diamonds2 %>%
  data_grid(carat = seq_range(carat, 20)) %>%
  mutate(lcarat = log2(carat)) %>%
  add_predictions(mod_diamond, "lprice") %>%
  mutate(price = 2 ^ lprice) #back transformation of logarithm

ggplot(diamonds2, aes(carat, price)) +
  geom_hex(bins = 50) +
  geom_line(data = grid, color = "green", size = 1) +
  labs(x="Carat", y="Price (USD)") + ggtitle("Scatter plot and regression")

4. Add residuals and plot

diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond, "lresid")

f1 <- ggplot(diamonds2, aes(lcarat, lresid)) + geom_hex(bins = 50) + labs(x="log2(Carat)", y="Residuals") + ggtitle("Scatter plot 3")
f2 <- ggplot(diamonds2, aes(cut, lresid)) + geom_boxplot() + theme(axis.text.x=element_text(angle=45, hjust=1)) + labs(x="Cut", y="Residuals") + ggtitle("Boxplot 4")
f3 <- ggplot(diamonds2, aes(color, lresid)) + geom_boxplot() + labs(x="Color", y="Residuals") + ggtitle("Boxplot 5") + scale_x_discrete(limits=rev(levels(diamonds$color)))
f4 <- ggplot(diamonds2, aes(clarity, lresid)) + geom_boxplot() + theme(axis.text.x=element_text(angle=45, hjust=1)) + labs(x="Clarity", y="Residuals") + ggtitle("Boxplot 6")

gridExtra::grid.arrange(f1,f2,f3,f4, nrow=2, bottom="Visualization")

5. Four parameters model and visualization

mod_diamond2 <- lm(lprice ~ lcarat + color + cut + clarity, diamonds2) #linear model of log2(price) and log2(carat), color, cut, and clarity

#take "cut" as an example for prediction in log2(price)
grid2 <- diamonds2 %>%
  data_grid(cut, .model = mod_diamond2) %>%
  add_predictions(mod_diamond2)
grid2 %>% kable() %>% kable_styling()

cut	lcarat	color	clarity	pred
Fair	-0.5145732	G	SI1	10.98985
Good	-0.5145732	G	SI1	11.10479
Very Good	-0.5145732	G	SI1	11.15824
Premium	-0.5145732	G	SI1	11.19055
Ideal	-0.5145732	G	SI1	11.22187

ggplot(grid2, aes(cut, pred)) + geom_point() + labs(x="Cut", y="Predicted log2(Price)") + ggtitle("Scatter plot 4")

6. Plot residuals of four parameters model

diamonds2 <- diamonds2 %>%
  add_residuals(mod_diamond2, "lresid2")

ggplot(diamonds2, aes(lcarat, lresid2)) + geom_hex(bins = 50) + labs(x="log2(Carat)", y="Residuals") + ggtitle("Scatter plot 5")

diamonds2b <- diamonds2 %>%
  filter(abs(lresid2) > 1) %>%
  add_predictions(mod_diamond2, "pred2") %>%
  mutate(pred2=round(2^pred2)) %>%
  select(price, pred2, lresid2, carat:table, x:z) %>%
  arrange(price)
diamonds2b %>% kable() %>% kable_styling()

price	pred2	lresid2	carat	cut	color	clarity	depth	table	x	y	z
1013	264	1.938203	0.25	Fair	F	SI2	54.4	64	4.30	4.23	2.32
1186	284	2.059699	0.25	Premium	G	SI2	59.0	60	5.33	5.28	3.12
1186	284	2.059699	0.25	Premium	G	SI2	58.8	60	5.33	5.28	3.12
1262	2644	-1.067148	1.03	Fair	E	I1	78.2	54	5.72	5.59	4.42
1415	639	1.147258	0.35	Fair	G	VS2	65.9	54	5.57	5.53	3.66
1415	639	1.147258	0.35	Fair	G	VS2	65.9	54	5.57	5.53	3.66
1715	576	1.573796	0.32	Fair	F	VS2	59.6	60	4.42	4.34	2.61
1776	412	2.107816	0.29	Fair	F	SI1	55.8	60	4.48	4.41	2.48
2160	314	2.783220	0.34	Fair	F	I1	55.8	62	4.72	4.60	2.60
2366	774	1.611117	0.30	Very Good	D	VVS2	60.6	58	4.33	4.35	2.63
3360	1373	1.290708	0.51	Premium	F	SI1	62.7	62	5.09	4.96	3.15
3807	1540	1.305923	0.61	Good	F	SI2	62.5	65	5.36	5.29	3.33
3920	1705	1.201301	0.51	Fair	F	VVS2	65.4	60	4.98	4.90	3.23
4368	1705	1.357420	0.51	Fair	F	VVS2	60.7	66	5.21	5.11	3.13
10011	4048	1.306158	1.01	Fair	D	SI2	64.6	58	6.25	6.20	4.02
10470	23622	-1.173882	2.46	Premium	E	SI2	59.7	59	8.82	8.76	5.25

Questions

Question #1

In the plot of lcarat vs. lprice, there are some bright vertical strips. What do they represent?

ANSWER: ① The brighter color (lighter blue) is the hexagon, the more counts are there. ② The vertical line means that for the same x-value of “lcarat”, there are numerous y-values of “lprice”. The latter finding might be due to logarithmic transformation since the price range [$326, $18,823] narrows down to [8.348728, 14.200209]. ③ The x-values of these bright vertical stripes indicate most preferred carat cuts, e.g., one carat. In all, many similar priced diamonds have the same carat.

Hence, these bright vertical stripes show that “lcarat” may not be the single explanatory variable for “lprice”. For example, as for the same carat of 0.25, the price can go as the highest (with two records) as $1,186 when cut, color, and clarity is respectively Premium, G, and SI2, while as the lowest (with one record) as $357 when cut, color and clarity is respectively Ideal, H, and SI1.

Moreover, if only one variable is taken into consideration, then these bright vertical stripes may represent an issue of oversampling.

Question #2

If $\log_2 (price) = a_0 + a_1 \cdot \log_2 (carat)$, what does that say about the relationship between price and carat?

summary(mod_diamond)

## 
## Call:
## lm(formula = lprice ~ lcarat, data = diamonds2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.96407 -0.24549 -0.00844  0.23930  1.93486 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 12.193863   0.001969  6194.5   <2e-16 ***
## lcarat       1.681371   0.001936   868.5   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3767 on 53812 degrees of freedom
## Multiple R-squared:  0.9334, Adjusted R-squared:  0.9334 
## F-statistic: 7.542e+05 on 1 and 53812 DF,  p-value: < 2.2e-16

ANSWER: As $\log_2 (carat)$ increase one unit, then $\log_2 (price)$ will change $a_1$. Based on the subset data, the estimated $a_0$ and $a_1$ is respectively 12.193863 and 1.681371 while the $p$-values are both statistically significant at an alpha level of 0.05. The adjusted R-squared is 0.9334 indicating the simple model may fit very well.

After back transformation, $price = 2^{a_0} \cdot carat^{a_1}$. A 1% increase in carat is associated with a change of $1.01^{a_1}-1=1.687093\%$ in price, given $\hat{a_1}=1.681371$ .

Question #3

Extract the diamonds that have very high and very low residuals. Is there anything unusual about these diamonds? Are they particularly bad or good, or do you think these are pricing errors?

diamonds2a <- diamonds2 %>%
  filter(abs(lresid) > 1) %>%
  add_predictions(mod_diamond, "pred") %>%
  mutate(pred=round(2^pred)) %>%
  select(price, pred, lresid, carat:table, x:z) %>%
  arrange(price)

ggplot(diamonds2a, aes(carat, price, color=lresid)) + geom_point() + labs(x="Carat", y="Price (USD)") + ggtitle("Scatterplot 6")

cut <- rbind(table(diamonds2a[diamonds2a$lresid>0,]$cut),table(diamonds2a[diamonds2a$lresid<0,]$cut))
rownames(cut) <- c("Positive residuals", "Negative residuals")

color <- rbind(table(diamonds2a[diamonds2a$lresid>0,]$color),table(diamonds2a[diamonds2a$lresid<0,]$color))

clarity <- rbind(table(diamonds2a[diamonds2a$lresid>0,]$clarity),table(diamonds2a[diamonds2a$lresid<0,]$clarity))

cbind(cut, color, clarity) %>% kable() %>% kable_styling()

	Fair	Good	Very Good	Premium	Ideal	D	E	F	G	H	I	J	I1	SI2	SI1	VS2	VS1	VVS2	VVS1	IF
Positive residuals	9	14	100	68	226	199	113	98	7	0	0	0	1	4	2	3	3	127	138	139
Negative residuals	168	41	22	83	12	3	17	46	89	76	51	44	303	20	3	0	0	0	0	0

ANSWER: Extract the diamonds that have high or low residuals (from the simple model) whose absolute values are greater than 1. Given $residual_i=\ log_2 (price_i) - \log_2 (prediction_i)$, then after back logarithm, $2^{residual_i}=\frac{price_i}{prediction_i}$ is present. A residual value of -1 indicates that $\log_2 (price)$ is one unit lower than the prediction. In other words, the price with the residual value of -1 is half of the predicted price while the price with the residual value of 1 is twice the predicted price.

As Scatterplot 6 shows that larger carat diamonds with the extreme residuals might be cheaper than the predictions, especially when the carat is greater than 1.5.

For those 743 observations with extreme residuals, the group of the positive residuals mainly has more diamonds with ideal cut, better color (D), and better clarity (IF). In contrast, the group of the negative residuals mainly has more diamonds with fair cut, worse color (J), and worse clarity (I1).

In conclusion, there is not much unusual or pricing error. “Nothing for nothing and very little for a halfpenny.” The higher is the price, the better is the quality of the merchandise.

Question #4

Does the final model, mod_diamond2, do a good job of predicting diamond prices? Would you trust it to tell you how much to spend if you were buying a diamond and why?

summary(mod_diamond2)

## 
## Call:
## lm(formula = lprice ~ lcarat + color + cut + clarity, data = diamonds2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.17388 -0.12437 -0.00094  0.11920  2.78322 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.366696   0.008403 1352.76   <2e-16 ***
## lcarat        1.886239   0.001124 1677.81   <2e-16 ***
## colorE       -0.078489   0.003034  -25.87   <2e-16 ***
## colorF       -0.137370   0.003068  -44.78   <2e-16 ***
## colorG       -0.232097   0.003003  -77.29   <2e-16 ***
## colorH       -0.362063   0.003188 -113.56   <2e-16 ***
## colorI       -0.537257   0.003573 -150.36   <2e-16 ***
## colorJ       -0.737525   0.004417 -166.99   <2e-16 ***
## cutGood       0.114935   0.005596   20.54   <2e-16 ***
## cutVery Good  0.168393   0.005206   32.34   <2e-16 ***
## cutPremium    0.200701   0.005150   38.97   <2e-16 ***
## cutIdeal      0.232023   0.005105   45.45   <2e-16 ***
## claritySI2    0.589367   0.007572   77.84   <2e-16 ***
## claritySI1    0.825861   0.007525  109.75   <2e-16 ***
## clarityVS2    1.041576   0.007564  137.71   <2e-16 ***
## clarityVS1    1.142960   0.007674  148.94   <2e-16 ***
## clarityVVS2   1.338362   0.007898  169.45   <2e-16 ***
## clarityVVS1   1.441974   0.008117  177.65   <2e-16 ***
## clarityIF     1.579102   0.008758  180.30   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1916 on 53795 degrees of freedom
## Multiple R-squared:  0.9828, Adjusted R-squared:  0.9828 
## F-statistic: 1.706e+05 on 18 and 53795 DF,  p-value: < 2.2e-16

ANSWER: The estimated coefficients for the four parameters model are shown above. All the coefficients’ $p$-values are statistically significant at an alpha level of 0.05. The adjusted R-squared is 0.9828, which is better than 0.9334 from the simple model. In general, the four parameters model may fit very well and do a good job of predicting diamond prices. However, overfitting can still exist. I do not fully trust the predicted prices. The model only tells that “you get what you pay for”. Cluster analysis may improve the model.