We want to explore the diamonds dataset graphically to get some ideas about what makes a diamond valuable.
We will work with a 10% sample to avoid long pauses as we complete tasks.
First load the tidyverse.
library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag(): dplyr, stats
Take the sample and examine it.
diamonds %>%
  sample_frac(size = .1) %>%
  mutate(ppc = price / carat) %>%
  select(price, carat, ppc, cut,
         color, clarity) ->
  lild
glimpse(lild)
## Observations: 5,394
## Variables: 6
## $ price <int> 1258, 644, 566, 14103, 2579, 1183, 2683, 3321, 12811, ...
## $ carat <dbl> 0.56, 0.35, 0.27, 1.54, 0.76, 0.43, 0.75, 0.52, 2.29, ...
## $ ppc <dbl> 2246.429, 1840.000, 2096.296, 9157.792, 3393.421, 2751...
## $ cut <ord> Good, Premium, Ideal, Ideal, Very Good, Ideal, Premium...
## $ color <ord> D, D, H, G, G, G, F, D, J, F, H, D, F, E, F, G, H, D, ...
## $ clarity <ord> SI2, SI1, IF, VS2, SI1, VVS1, SI1, VVS1, SI1, VS2, SI1...
summary(lild)
## price carat ppc cut
## Min. : 326.0 Min. :0.2100 Min. : 1139 Fair : 141
## 1st Qu.: 980.2 1st Qu.:0.4000 1st Qu.: 2534 Good : 506
## Median : 2479.0 Median :0.7100 Median : 3514 Very Good:1202
## Mean : 3953.9 Mean :0.8025 Mean : 4034 Premium :1360
## 3rd Qu.: 5292.0 3rd Qu.:1.0400 3rd Qu.: 4980 Ideal :2185
## Max. :18818.0 Max. :4.0100 Max. :16726
##
## color clarity
## D: 687 SI1 :1340
## E: 996 VS2 :1197
## F: 947 SI2 : 966
## G:1132 VS1 : 760
## H: 822 VVS2 : 510
## I: 542 VVS1 : 371
## J: 268 (Other): 250
table(lild$clarity)
##
## I1 SI2 SI1 VS2 VS1 VVS2 VVS1 IF
## 77 966 1340 1197 760 510 371 173
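The same counts can be produced in tidyverse style with count(); a minimal sketch that also adds each grade's share of the sample:
# Counts and proportions of each clarity grade
lild %>%
  count(clarity) %>%
  mutate(prop = n / sum(n))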
How do you examine the relationship between a categorical explanatory variable and a quantitative response?
lild %>% ggplot(aes(x=cut,y=ppc)) + geom_boxplot()
lild %>% ggplot(aes(x=cut,y=ppc)) + geom_jitter()
lild %>% ggplot(aes(x=cut,y=ppc)) + geom_jitter(alpha=.1)
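Another option is to layer the two geoms so we see both the summary and the raw points; a minimal sketch (the width and outlier settings are just illustrative choices):
lild %>% ggplot(aes(x = cut, y = ppc)) +
  geom_boxplot(outlier.shape = NA) +   # hide the boxplot outliers so those points are not drawn twice
  geom_jitter(alpha = .1, width = .2)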
How do we incorporate the other categorical variables?
Use facet_grid().
lild %>% ggplot(aes(x = cut, y = ppc)) +
  geom_jitter(alpha = .1) +
  facet_grid(color ~ clarity)
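A numeric companion to the faceted plot is a table of group medians; a minimal sketch with group_by() and summarize():
# Median price per carat and sample size for each cut/color/clarity combination
lild %>%
  group_by(cut, color, clarity) %>%
  summarize(med_ppc = median(ppc), n = n())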
Use the mtcars dataset to build regression models which predict mpg.
First, a simple model with a single predictor.
m1 = lm(mpg~disp,data=mtcars)
summary(m1)
##
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.8922 -2.2022 -0.9631 1.6272 7.2305
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.599855 1.229720 24.070 < 2e-16 ***
## disp -0.041215 0.004712 -8.747 9.38e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared: 0.7183, Adjusted R-squared: 0.709
## F-statistic: 76.51 on 1 and 30 DF, p-value: 9.38e-10
The model we get out of this can be written as \[mpg = 29.599855 - 0.041215 \times disp\]
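The two coefficients in this equation can also be pulled directly from the fitted model object with coef(), which avoids retyping rounded values:
# Extract the intercept and slope from the fitted model
coef(m1)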
We could use this equation manually in R to compute the predicted mpg for a 200 cubic inch engine.
mpg200 = 29.599855 -0.041215 * 200
mpg200
## [1] 21.35686
R provides us with a better way to do this. We can use the predict() function and supply a data frame of one or more new values of the independent variable.
MyVals = data.frame(disp=c(100,200,300))
MyResults = predict(m1,MyVals)
MyResults
## 1 2 3
## 25.47834 21.35683 17.23532
MyTable = cbind(MyVals,MyResults)
MyTable
## disp MyResults
## 1 100 25.47834
## 2 200 21.35683
## 3 300 17.23532
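predict() can also attach interval estimates to each prediction. As a minimal sketch, we could request 95% prediction intervals for the same displacements:
# Fitted values with lower and upper prediction bounds
predict(m1, MyVals, interval = "prediction")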
The slope is the most important result from the model. This model says that an increase of 100 cubic inches in displacement will reduce mpg by about 4.12.
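We can verify that arithmetic directly from the stored slope rather than retyping it:
# Predicted change in mpg for a 100 cubic inch increase in displacement
100 * coef(m1)["disp"]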
What happens if we include other variables in the model? Let’s include wt, the weight of the vehicle.
m2 = lm(mpg~disp+wt,data=mtcars)
summary(m2)
##
## Call:
## lm(formula = mpg ~ disp + wt, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4087 -2.3243 -0.7683 1.7721 6.3484
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 34.96055 2.16454 16.151 4.91e-16 ***
## disp -0.01773 0.00919 -1.929 0.06362 .
## wt -3.35082 1.16413 -2.878 0.00743 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.917 on 29 degrees of freedom
## Multiple R-squared: 0.7809, Adjusted R-squared: 0.7658
## F-statistic: 51.69 on 2 and 29 DF, p-value: 2.744e-10
This new model says that an increase of 100 cubic inches in displacement will only reduce mpg by about 1.8. This is less than half of the impact predicted by m1. Why?
What does the relationship between wt and disp look like?
mtcars %>% ggplot(aes(x=disp,y=wt)) + geom_point()
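The strength of this relationship can also be summarized numerically with the correlation coefficient:
# Correlation between displacement and weight
cor(mtcars$disp, mtcars$wt)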
Let’s build a model of these two variables.
m3 = lm(wt~disp,data=mtcars)
summary(m3)
##
## Call:
## lm(formula = wt ~ disp, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.89044 -0.29775 -0.00684 0.33428 0.66525
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.5998146 0.1729964 9.248 2.74e-10 ***
## disp 0.0070103 0.0006629 10.576 1.22e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.4574 on 30 degrees of freedom
## Multiple R-squared: 0.7885, Adjusted R-squared: 0.7815
## F-statistic: 111.8 on 1 and 30 DF, p-value: 1.222e-11
This says that when we increase disp by 100 cubic inches, the weight of the vehicle tends to increase by about 100(.007) = .7. Since wt is measured in thousands of pounds, this is about 700 pounds.
If we only include disp in our model, the disp coefficient carries information about wt as well as about displacement itself.
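One way to see this is to compare the disp slope in the one-variable and two-variable models; a minimal sketch:
# The disp slope shrinks once wt enters the model, because in m1 it was also absorbing the effect of weight
c(m1 = coef(m1)["disp"], m2 = coef(m2)["disp"])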