library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

Horizontal Stripes

This is an exploration of the horizontal stripes problem in the diamonds dataset. First replicate the problem and note that there is an apparent range of undesirable values of price per carat for many combinations of cut, color and clarity.

diamonds %>% 
  sample_frac(size=.1) %>% 
  mutate(ppc = price/carat) -> lild 
  
lild %>% 
  ggplot(aes(x=cut,y=ppc)) + 
  geom_jitter(alpha=.1) +
  facet_grid(color~clarity)

Density of ppc

Examine the distribution of ppc for some combinations of the categorical variables.

lild %>% 
  filter( color=="G",clarity=="VS1") %>% 
  ggplot(aes(x=ppc)) + geom_density() + facet_wrap(~cut,ncol =1, scales="free_y") 

Exercise

Try this for a different slice of the categorical variables. Fix 2 of the three and facet_wrap on the third.

Carat-Price Relationship

We have been assuming that all other things being the same, price is proportional to carat. This assumption says that the influence of carat can be ignored if we think of ppc as the target to be explained. Is there a possible non-linear relationship between price and carat.

Focus on one categorical cell.

lild %>% 
  filter(cut=="Ideal",color=="G",clarity=="VS1") %>%
  ggplot(aes(x=carat,y=price)) + 
  geom_point(alpha=.5) +
  geom_smooth(color="red") +
  geom_smooth(method = "lm",color="blue")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

Exercise

Try this for a few different cells.

A Polynomial Model

lild %>% 
  filter(cut=="Ideal",color=="G",clarity=="VS1") -> cell
cellmod2 = lm(price~poly(carat,2),data=cell)
summary(cellmod2)
## 
## Call:
## lm(formula = price ~ poly(carat, 2), data = cell)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2981.59  -199.09   -13.34   284.11  1807.66 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       4451.5       66.8  66.635  < 2e-16 ***
## poly(carat, 2)1  36610.8      630.2  58.091  < 2e-16 ***
## poly(carat, 2)2   5497.2      630.2   8.722 1.81e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 630.2 on 86 degrees of freedom
## Multiple R-squared:  0.9757, Adjusted R-squared:  0.9751 
## F-statistic:  1725 on 2 and 86 DF,  p-value: < 2.2e-16
fv = cellmod2$fitted.values
cellplus = cbind(cell,fv)

Repeat the Graphics

above and add the fitted values from the quadratic model.

cellplus %>% 
  ggplot(aes(x=carat,y=price)) + 
  geom_point(alpha=.5) +
  geom_smooth(color="red") +
  geom_smooth(method = "lm",color="blue") +
  geom_point(aes(x=carat,y=fv),color = "yellow")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'