We want to explore the diamonds dataset graphically to get some ideas about what makes a diamond valuable.

We will work with a 10% sample to avoid long pauses as we complete tasks.

First load the tidyverse.

library(tidyverse)
## Loading tidyverse: ggplot2
## Loading tidyverse: tibble
## Loading tidyverse: tidyr
## Loading tidyverse: readr
## Loading tidyverse: purrr
## Loading tidyverse: dplyr
## Conflicts with tidy packages ----------------------------------------------
## filter(): dplyr, stats
## lag():    dplyr, stats

Take the sample and examine it.

diamonds %>%
   sample_frac(size=.1) %>% 
   mutate(ppc = price/carat) %>% 
   select(price, carat, ppc, cut, 
   color,clarity) ->
   lild
glimpse(lild)   
## Observations: 5,394
## Variables: 6
## $ price   <int> 1258, 644, 566, 14103, 2579, 1183, 2683, 3321, 12811, ...
## $ carat   <dbl> 0.56, 0.35, 0.27, 1.54, 0.76, 0.43, 0.75, 0.52, 2.29, ...
## $ ppc     <dbl> 2246.429, 1840.000, 2096.296, 9157.792, 3393.421, 2751...
## $ cut     <ord> Good, Premium, Ideal, Ideal, Very Good, Ideal, Premium...
## $ color   <ord> D, D, H, G, G, G, F, D, J, F, H, D, F, E, F, G, H, D, ...
## $ clarity <ord> SI2, SI1, IF, VS2, SI1, VVS1, SI1, VVS1, SI1, VS2, SI1...
summary(lild)
##      price             carat             ppc               cut      
##  Min.   :  326.0   Min.   :0.2100   Min.   : 1139   Fair     : 141  
##  1st Qu.:  980.2   1st Qu.:0.4000   1st Qu.: 2534   Good     : 506  
##  Median : 2479.0   Median :0.7100   Median : 3514   Very Good:1202  
##  Mean   : 3953.9   Mean   :0.8025   Mean   : 4034   Premium  :1360  
##  3rd Qu.: 5292.0   3rd Qu.:1.0400   3rd Qu.: 4980   Ideal    :2185  
##  Max.   :18818.0   Max.   :4.0100   Max.   :16726                   
##                                                                     
##  color       clarity    
##  D: 687   SI1    :1340  
##  E: 996   VS2    :1197  
##  F: 947   SI2    : 966  
##  G:1132   VS1    : 760  
##  H: 822   VVS2   : 510  
##  I: 542   VVS1   : 371  
##  J: 268   (Other): 250
table(lild$clarity)
## 
##   I1  SI2  SI1  VS2  VS1 VVS2 VVS1   IF 
##   77  966 1340 1197  760  510  371  173

How do you examine the relationship between a categorical explanatory variable and a quantitative response?

lild %>% ggplot(aes(x=cut,y=ppc)) + geom_boxplot()

lild %>% ggplot(aes(x=cut,y=ppc)) + geom_jitter()

lild %>% ggplot(aes(x=cut,y=ppc)) + geom_jitter(alpha=.1)

How do we incorporate the other categorical variables?

Use facet_grid()

lild %>% ggplot(aes(x=cut,y=ppc)) + 
  geom_jitter(alpha=.1) +
  facet_grid(color~clarity)

Basic Regression Models

Use the mtcars dataset to build regression models which predict mpg.

First simple model.

m1 = lm(mpg~disp,data=mtcars)
summary(m1)
## 
## Call:
## lm(formula = mpg ~ disp, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.8922 -2.2022 -0.9631  1.6272  7.2305 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 29.599855   1.229720  24.070  < 2e-16 ***
## disp        -0.041215   0.004712  -8.747 9.38e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.251 on 30 degrees of freedom
## Multiple R-squared:  0.7183, Adjusted R-squared:  0.709 
## F-statistic: 76.51 on 1 and 30 DF,  p-value: 9.38e-10

The model we get out of this can be written as \[mpg = 29.599855 -0.041215 * disp\]

We could use this equation manually in R to compute the mpg for a 200 Cubic inch engine.

mpg200 = 29.599855 -0.041215 * 200
mpg200
## [1] 21.35686

R provides us with a better way to do this. We can use the predict function and supply a dataframe of 1 or more new values of the independent variable.

MyVals = data.frame(disp=c(100,200,300))

MyResults = predict(m1,MyVals)
MyResults
##        1        2        3 
## 25.47834 21.35683 17.23532
MyTable = cbind(MyVals,MyResults)
MyTable
##   disp MyResults
## 1  100  25.47834
## 2  200  21.35683
## 3  300  17.23532

The slope is the most important result from the model. This model says that an increase of 100 cubic inches in displace will reduce mpg by 4.12.

What happens if we include other variables in the model? Let’s include wt, the weight of the vehicle.

m2 = lm(mpg~disp+wt,data=mtcars)
summary(m2)
## 
## Call:
## lm(formula = mpg ~ disp + wt, data = mtcars)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4087 -2.3243 -0.7683  1.7721  6.3484 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 34.96055    2.16454  16.151 4.91e-16 ***
## disp        -0.01773    0.00919  -1.929  0.06362 .  
## wt          -3.35082    1.16413  -2.878  0.00743 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.917 on 29 degrees of freedom
## Multiple R-squared:  0.7809, Adjusted R-squared:  0.7658 
## F-statistic: 51.69 on 2 and 29 DF,  p-value: 2.744e-10

This new model says that an increase of 100 cubic inches in displacement will only reduce mpg by about 1.7. This is less than half of the impact predicted by m1. Why?

What does the relatiohsip between wt and disp look like?

mtcars %>% ggplot(aes(x=disp,y=wt)) + geom_point()

Let’s build a model of these two variables.

m3 = lm(wt~disp,data=mtcars)
summary(m3)
## 
## Call:
## lm(formula = wt ~ disp, data = mtcars)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.89044 -0.29775 -0.00684  0.33428  0.66525 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.5998146  0.1729964   9.248 2.74e-10 ***
## disp        0.0070103  0.0006629  10.576 1.22e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.4574 on 30 degrees of freedom
## Multiple R-squared:  0.7885, Adjusted R-squared:  0.7815 
## F-statistic: 111.8 on 1 and 30 DF,  p-value: 1.222e-11

This says that when we increase the disp by 100, we would tend to simultaneously increase the weight of the vehicle by 100(.007) = .7. Since variable wt is measured in thousands of pounds, this is about 700 pounds.

If we only include disp in our model, disp carries information about wt as well as its own value.