library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
This is an exploration of the horizontal stripes problem in the diamonds dataset. First replicate the problem and note that there is an apparent range of undesirable values of price per carat for many combinations of cut, color and clarity.
diamonds %>%
sample_frac(size=.1) %>%
mutate(ppc = price/carat) -> lild
lild %>%
ggplot(aes(x=cut,y=ppc)) +
geom_jitter(alpha=.1) +
facet_grid(color~clarity)
Examine the distribution of ppc for some combinations of the categorical variables.
lild %>%
filter( color=="G",clarity=="VS1") %>%
ggplot(aes(x=ppc)) + geom_density() + facet_wrap(~cut,ncol =1, scales="free_y")
Try this for a different slice of the categorical variables. Fix 2 of the three and facet_wrap on the third.
We have been assuming that all other things being the same, price is proportional to carat. This assumption says that the influence of carat can be ignored if we think of ppc as the target to be explained. Is there a possible non-linear relationship between price and carat.
Focus on one categorical cell.
lild %>%
filter(cut=="Ideal",color=="G",clarity=="VS1") %>%
ggplot(aes(x=carat,y=price)) +
geom_point(alpha=.5) +
geom_smooth(color="red") +
geom_smooth(method = "lm",color="blue")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
Try this for a few different cells.
lild %>%
filter(cut=="Ideal",color=="G",clarity=="VS1") -> cell
cellmod2 = lm(price~poly(carat,2),data=cell)
summary(cellmod2)
##
## Call:
## lm(formula = price ~ poly(carat, 2), data = cell)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2981.59 -199.09 -13.34 284.11 1807.66
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4451.5 66.8 66.635 < 2e-16 ***
## poly(carat, 2)1 36610.8 630.2 58.091 < 2e-16 ***
## poly(carat, 2)2 5497.2 630.2 8.722 1.81e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 630.2 on 86 degrees of freedom
## Multiple R-squared: 0.9757, Adjusted R-squared: 0.9751
## F-statistic: 1725 on 2 and 86 DF, p-value: < 2.2e-16
fv = cellmod2$fitted.values
cellplus = cbind(cell,fv)
above and add the fitted values from the quadratic model.
cellplus %>%
ggplot(aes(x=carat,y=price)) +
geom_point(alpha=.5) +
geom_smooth(color="red") +
geom_smooth(method = "lm",color="blue") +
geom_point(aes(x=carat,y=fv),color = "yellow")
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'