In this course you’ll take your skills with simple linear regression to the next level. By learning multiple and logistic regression techniques you will gain the skills to model and predict both numeric and categorical outcomes using multiple input variables. You’ll also learn how to fit, visualize, and interpret these models. Then you’ll apply your skills to learn about Italian restaurants in New York City!
In this chapter you’ll learn about the class of linear models called “parallel slopes models.” These include one numeric and one categorical explanatory variable.
We use the lm() function to fit linear models to data. In this case, we want to understand how the price of MarioKart games sold at auction varies as a function of not only the number of wheels included in the package, but also whether the item is new or used. It seems reasonable to expect a premium for buying the game new. But how much is that premium? Can we estimate its value after controlling for the number of wheels?
We will fit a parallel slopes model using lm(). In addition to the data argument, lm() needs to know which variables you want to include in your regression model, and how you want to include them. It accomplishes this using a formula argument. A simple linear regression formula looks like y ~ x, where y is the name of the response variable, and x is the name of the explanatory variable. Here, we will simply extend this formula to include multiple explanatory variables. A parallel slopes model has the form y ~ x + z, where z is a categorical explanatory variable, and x is a numerical explanatory variable.
The output from lm() is a model object, which when printed, will show the fitted coefficients.
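The code that produced the output below is not shown; here is a minimal sketch, assuming the mario_kart auction data is already loaded:
# inspect the auction data, then fit the parallel slopes model
library(dplyr)
glimpse(mario_kart)
str(mario_kart)
mod <- lm(total_pr ~ wheels + cond, data = mario_kart)
mod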
## Rows: 141
## Columns: 12
## $ id <dbl> 150377422259, 260483376854, 320432342985, 280405224677,...
## $ duration <int> 3, 7, 3, 3, 1, 3, 1, 1, 3, 7, 1, 1, 1, 1, 7, 7, 3, 3, 1...
## $ n_bids <int> 20, 13, 16, 18, 20, 19, 13, 15, 29, 8, 15, 15, 13, 16, ...
## $ cond <fct> new, used, new, new, new, new, used, new, used, used, n...
## $ start_pr <dbl> 0.99, 0.99, 0.99, 0.99, 0.01, 0.99, 0.01, 1.00, 0.99, 1...
## $ ship_pr <dbl> 4.00, 3.99, 3.50, 0.00, 0.00, 4.00, 0.00, 2.99, 4.00, 4...
## $ total_pr <dbl> 51.55, 37.04, 45.50, 44.00, 71.00, 45.00, 37.02, 53.99,...
## $ ship_sp <fct> standard, firstClass, firstClass, standard, media, stan...
## $ seller_rate <int> 1580, 365, 998, 7, 820, 270144, 7284, 4858, 27, 201, 48...
## $ stock_photo <fct> yes, yes, no, yes, yes, yes, yes, yes, yes, no, yes, ye...
## $ wheels <int> 1, 1, 1, 1, 2, 0, 0, 2, 1, 1, 2, 2, 2, 2, 1, 0, 1, 1, 2...
## $ title <fct> "~~ Wii MARIO KART & WHEEL ~ NINTENDO Wii ~ BRAND N...
## tibble [141 x 12] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:141] 1.5e+11 2.6e+11 3.2e+11 2.8e+11 1.7e+11 ...
## $ duration : int [1:141] 3 7 3 3 1 3 1 1 3 7 ...
## $ n_bids : int [1:141] 20 13 16 18 20 19 13 15 29 8 ...
## $ cond : Factor w/ 2 levels "new","used": 1 2 1 1 1 1 2 1 2 2 ...
## $ start_pr : num [1:141] 0.99 0.99 0.99 0.99 0.01 ...
## $ ship_pr : num [1:141] 4 3.99 3.5 0 0 4 0 2.99 4 4 ...
## $ total_pr : num [1:141] 51.5 37 45.5 44 71 ...
## $ ship_sp : Factor w/ 8 levels "firstClass","media",..: 6 1 1 6 2 6 6 8 5 1 ...
## $ seller_rate: int [1:141] 1580 365 998 7 820 270144 7284 4858 27 201 ...
## $ stock_photo: Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
## $ wheels : int [1:141] 1 1 1 1 2 0 0 2 1 1 ...
## $ title : Factor w/ 80 levels " Mario Kart Wii with Wii Wheel for Wii (New)",..: 80 60 22 7 4 19 34 5 79 70 ...
##
## Call:
## lm(formula = total_pr ~ wheels + cond, data = mario_kart)
##
## Coefficients:
## (Intercept) wheels condused
## 42.370 7.233 -5.585
Parallel slopes models are so-named because we can visualize these models in the data space as not one line, but two parallel lines. To do this, we’ll draw two things: the observed data points, colored by the categorical variable, and the two parallel lines formed by connecting the fitted values within each group.
Our plotting strategy is to compute the fitted values, plot these, and connect the points to form a line. The augment() function from the broom package provides an easy way to add the fitted values to our data frame, and the geom_line() function can then use that data frame to plot the points and connect them.
Note that this approach has the added benefit of automatically coloring the lines appropriately to match the data.
You already know how to use ggplot() and geom_point() to make the scatterplot. The only twist is that now you’ll pass your augment()-ed model as the data argument in your ggplot() call. When you add your geom_line(), instead of letting the y aesthetic inherit its values from the ggplot() call, you can set it to the .fitted column of the augment()-ed model. This has the advantage of automatically coloring the lines for you.
The parallel slopes model mod relating total price to the number of wheels and condition is already in your workspace.
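The augmented data frame used in the plotting code is not created in the text; a minimal sketch, assuming mod is the parallel slopes model above, is:
# add fitted values, residuals, and diagnostics to the data
library(broom)
augmented_mod <- augment(mod)
glimpse(augmented_mod)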
## Rows: 141
## Columns: 9
## $ total_pr <dbl> 51.55, 37.04, 45.50, 44.00, 71.00, 45.00, 37.02, 53.99, ...
## $ wheels <int> 1, 1, 1, 1, 2, 0, 0, 2, 1, 1, 2, 2, 2, 2, 1, 0, 1, 1, 2,...
## $ cond <fct> new, used, new, new, new, new, used, new, used, used, ne...
## $ .fitted <dbl> 49.60260, 44.01777, 49.60260, 49.60260, 56.83544, 42.369...
## $ .resid <dbl> 1.9473995, -6.9777674, -4.1026005, -5.6026005, 14.164559...
## $ .std.resid <dbl> 0.40270893, -1.43671086, -0.84838977, -1.15857953, 2.926...
## $ .hat <dbl> 0.02103158, 0.01250410, 0.02103158, 0.02103158, 0.019156...
## $ .sigma <dbl> 4.902339, 4.868399, 4.892414, 4.881308, 4.750591, 4.8998...
## $ .cooksd <dbl> 1.161354e-03, 8.712334e-03, 5.154337e-03, 9.612441e-03, ...
# scatterplot, with color
data_space <- ggplot(augmented_mod, aes(x = wheels, y = total_pr, color = cond)) +
geom_point()
# single call to geom_line()
data_space +
geom_line(aes(y = .fitted))
Intercept interpretation
Recall that the cond variable is either new or used. Here are the fitted coefficients from your model:
##
## Call:
## lm(formula = total_pr ~ wheels + cond, data = mario_kart)
##
## Coefficients:
## (Intercept) wheels condused
## 42.370 7.233 -5.585
Choose the correct interpretation of the coefficient on condused:
Answer: The expected price of a used MarioKart is $5.58 less than that of a new one with the same number of wheels.
The babies data set contains observations about the birthweight and other characteristics of children born in the San Francisco Bay area from 1960–1967.
We would like to build a model for birthweight as a function of the mother’s age and whether this child was her first (parity == 0). Use the mathematical specification below to code the model in R.
\(birthweight = \beta_0 + \beta_1 \cdot age + \beta_2 \cdot parity + \epsilon\)
The birthweight variable is recorded in the column bwt.
Use lm() to build the parallel slopes model specified above. It’s not necessary to use factor() in this case as the variable parity is coded using binary numeric values.
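A sketch of the call that produces the output below, assuming the babies data frame is loaded:
# parallel slopes model: birthweight vs. mother's age and parity
lm(bwt ~ age + parity, data = babies)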
##
## Call:
## lm(formula = bwt ~ age + parity, data = babies)
##
## Coefficients:
## (Intercept) age parity
## 118.27782 0.06315 -1.65248
This time, we’d like to build a model for birthweight as a function of the length of gestation and the mother’s smoking status. Use the plot to inform your model specification.
Use lm() to build a parallel slopes model implied by the plot. It’s not necessary to use factor() in this case either.
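A sketch of the corresponding call, again assuming babies is loaded:
# parallel slopes model: birthweight vs. gestation and smoking status
lm(bwt ~ gestation + smoke, data = babies)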
##
## Call:
## lm(formula = bwt ~ gestation + smoke, data = babies)
##
## Coefficients:
## (Intercept) gestation smoke
## -0.9317 0.4429 -8.0883
This chapter covers model evaluation. By looking at different properties of the model, including the adjusted R-squared, you’ll learn to compare models so that you can select the best one. You’ll also learn about interaction terms in linear models.
Two common measures of how well a model fits to data are \(R^2\) (the coefficient of determination) and the adjusted \(R^2\). The former measures the percentage of the variability in the response variable that is explained by the model. To compute this, we define
\(R^2 = 1 - \frac{SSE}{SST} \,,\)
where \(SSE\) and \(SST\) are the sum of the squared residuals, and the total sum of the squares, respectively. One issue with this measure is that the \(SSE\) can only decrease as new variables are added to the model, while the \(SST\) depends only on the response variable and therefore is not affected by changes to the model. This means that you can increase \(R^2\) by adding any additional variable to your model—even random noise.
The adjusted \(R^2\) includes a term that penalizes a model for each additional explanatory variable (where \(p\) is the number of explanatory variables).
\(R^2_{adj} = 1 - \frac{SSE}{SST} \cdot \frac{n-1}{n-p-1} \,,\)
We can see both measures in the output of the summary() function on our model object.
Some Problems with R-squared
Problem 1: Every time you add a predictor to a model, the R-squared increases, even if due to chance alone. It never decreases. Consequently, a model with more terms may appear to have a better fit simply because it has more terms.
Problem 2: If a model has too many predictors and higher order polynomials, it begins to model the random noise in the data. This condition is known as overfitting the model and it produces misleadingly high R-squared values and a lessened ability to make predictions.
The adjusted R-squared compares the explanatory power of regression models that contain different numbers of predictors.
Suppose you compare a five-predictor model with a higher R-squared to a one-predictor model. Does the five predictor model have a higher R-squared because it’s better? Or is the R-squared higher because it has more predictors? Simply compare the adjusted R-squared values to find out!
The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, but it’s usually not. It is always lower than the R-squared.
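Both measures appear near the bottom of the summary() output. A minimal sketch, assuming mod is the parallel slopes model fit earlier:
# R-squared and adjusted R-squared are reported by summary()
summary(mod)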
##
## Call:
## lm(formula = total_pr ~ wheels + cond, data = mario_kart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0078 -3.0754 -0.8254 2.9822 14.1646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.3698 1.0651 39.780 < 2e-16 ***
## wheels 7.2328 0.5419 13.347 < 2e-16 ***
## condused -5.5848 0.9245 -6.041 1.35e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.887 on 138 degrees of freedom
## Multiple R-squared: 0.7165, Adjusted R-squared: 0.7124
## F-statistic: 174.4 on 2 and 138 DF, p-value: < 2.2e-16
# add random noise
mario_kart_noisy <- mario_kart %>%
mutate(noise = rnorm(nrow(mario_kart)))
# compute new model
mod2 <- lm(total_pr ~ wheels + cond + noise, data = mario_kart_noisy)
# new R^2 and adjusted R^2
summary(mod2)
##
## Call:
## lm(formula = total_pr ~ wheels + cond + noise, data = mario_kart_noisy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4572 -3.2610 -0.7703 2.7461 14.5646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.3915 1.0677 39.703 < 2e-16 ***
## wheels 7.2160 0.5436 13.276 < 2e-16 ***
## condused -5.5949 0.9265 -6.039 1.38e-08 ***
## noise 0.2692 0.4008 0.672 0.503
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.897 on 137 degrees of freedom
## Multiple R-squared: 0.7174, Adjusted R-squared: 0.7113
## F-statistic: 116 on 3 and 137 DF, p-value: < 2.2e-16
Once we have fit a regression model, we can use it to make predictions for unseen observations or retrieve the fitted values. Here, we explore two methods for doing the latter.
A traditional way to return the fitted values (i.e. the \(\hat{y}'s\)) is to run the predict() function on the model object. This will return a vector of the fitted values. Note that predict() will take an optional newdata argument that will allow you to make predictions for observations that are not in the original data.
A newer alternative is the augment() function from the broom package, which returns a data.frame with the response variable \((y)\), the relevant explanatory variables (the \(x's\)), the fitted value \((\hat{y})\) and some information about the residuals \((e)\). augment() will also take a newdata argument that allows you to make predictions.
The fitted model mod is already in your environment.
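The calls themselves are not shown; a minimal sketch is:
# fitted values as a numeric vector
predict(mod)
# fitted values, residuals, and diagnostics in a data frame
augment(mod)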
## 1 2 3 4 5 6 7 8
## 49.60260 44.01777 49.60260 49.60260 56.83544 42.36976 36.78493 56.83544
## 9 10 11 12 13 14 15 16
## 44.01777 44.01777 56.83544 56.83544 56.83544 56.83544 44.01777 36.78493
## 17 18 19 20 21 22 23 24
## 49.60260 49.60260 56.83544 36.78493 56.83544 56.83544 56.83544 44.01777
## 25 26 27 28 29 30 31 32
## 56.83544 36.78493 36.78493 36.78493 49.60260 36.78493 36.78493 44.01777
## 33 34 35 36 37 38 39 40
## 51.25061 44.01777 44.01777 36.78493 44.01777 56.83544 56.83544 49.60260
## 41 42 43 44 45 46 47 48
## 44.01777 51.25061 56.83544 56.83544 44.01777 56.83544 36.78493 36.78493
## 49 50 51 52 53 54 55 56
## 44.01777 56.83544 36.78493 44.01777 42.36976 36.78493 36.78493 44.01777
## 57 58 59 60 61 62 63 64
## 44.01777 36.78493 36.78493 56.83544 36.78493 56.83544 36.78493 51.25061
## 65 66 67 68 69 70 71 72
## 56.83544 44.01777 58.48345 51.25061 49.60260 44.01777 49.60260 56.83544
## 73 74 75 76 77 78 79 80
## 56.83544 51.25061 44.01777 36.78493 36.78493 36.78493 44.01777 56.83544
## 81 82 83 84 85 86 87 88
## 44.01777 65.71629 44.01777 56.83544 36.78493 49.60260 49.60260 36.78493
## 89 90 91 92 93 94 95 96
## 44.01777 36.78493 51.25061 44.01777 36.78493 51.25061 42.36976 56.83544
## 97 98 99 100 101 102 103 104
## 51.25061 44.01777 51.25061 56.83544 56.83544 56.83544 36.78493 49.60260
## 105 106 107 108 109 110 111 112
## 51.25061 44.01777 56.83544 49.60260 36.78493 44.01777 51.25061 56.83544
## 113 114 115 116 117 118 119 120
## 64.06828 44.01777 49.60260 44.01777 49.60260 51.25061 42.36976 44.01777
## 121 122 123 124 125 126 127 128
## 56.83544 44.01777 49.60260 44.01777 51.25061 56.83544 56.83544 49.60260
## 129 130 131 132 133 134 135 136
## 56.83544 36.78493 44.01777 44.01777 36.78493 56.83544 36.78493 44.01777
## 137 138 139 140 141
## 36.78493 51.25061 49.60260 36.78493 56.83544
## # A tibble: 141 x 9
## total_pr wheels cond .fitted .resid .std.resid .hat .sigma .cooksd
## <dbl> <int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 51.6 1 new 49.6 1.95 0.403 0.0210 4.90 0.00116
## 2 37.0 1 used 44.0 -6.98 -1.44 0.0125 4.87 0.00871
## 3 45.5 1 new 49.6 -4.10 -0.848 0.0210 4.89 0.00515
## 4 44 1 new 49.6 -5.60 -1.16 0.0210 4.88 0.00961
## 5 71 2 new 56.8 14.2 2.93 0.0192 4.75 0.0557
## 6 45 0 new 42.4 2.63 0.551 0.0475 4.90 0.00505
## 7 37.0 0 used 36.8 0.235 0.0486 0.0209 4.91 0.0000168
## 8 54.0 2 new 56.8 -2.85 -0.588 0.0192 4.90 0.00225
## 9 47 1 used 44.0 2.98 0.614 0.0125 4.90 0.00159
## 10 50 1 used 44.0 5.98 1.23 0.0125 4.88 0.00640
## # ... with 131 more rows
Thought experiments
Suppose that after going apple picking you have 12 apples left over. You decide to conduct an experiment to investigate how quickly they will rot under certain conditions. You place six apples in a cool spot in your basement, and leave the other six on the window sill in the kitchen. Every week, you estimate the percentage of the surface area of the apple that is rotten or moldy.
Consider the following models:
\(rot = \beta_0 + \beta_1 \cdot t + \beta_2 \cdot temp \,,\)
and
\(rot = \beta_0 + \beta_1 \cdot t + \beta_2 \cdot temp + \beta_3 \cdot temp \cdot t \,,\)
where \(t\) is time, measured in weeks, and \(temp\) is a binary variable indicating either cool or warm.
If you decide to keep the interaction term present in the second model, you are implicitly assuming that:
Answer: The rate at which apples rot will vary based on the temperature.
The presence of a significant interaction indicates that the effect of one predictor variable on the response variable is different at different values of the other predictor variable. It is tested by adding a term to the model in which the two predictor variables are multiplied. Adding an interaction term to a model drastically changes the interpretation of all the coefficients.
Including an interaction term in a model is easy: we just have to tell lm() that we want to include that new variable. An expression of the form y ~ x + z + x:z will do the trick. The use of the colon (:) here means that the interaction between \(x\) and \(z\) will be a third term in the model.
The data frame mario_kart is already loaded in your workspace.
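A sketch of the interaction model whose output appears below:
# include the interaction between condition and duration
lm(total_pr ~ cond + duration + cond:duration, data = mario_kart)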
##
## Call:
## lm(formula = total_pr ~ cond + duration + cond:duration, data = mario_kart)
##
## Coefficients:
## (Intercept) condused duration condused:duration
## 58.268 -17.122 -1.966 2.325
Interaction allows the slope of the regression line in each group to vary. In this case, this means that the relationship between the final price and the length of the auction is moderated by the condition of each item.
Interaction models are easy to visualize in the data space with ggplot2 because they have the same coefficients as if the models were fit independently to each group defined by the level of the categorical variable. In this case, new and used MarioKarts each get their own regression line. To see this, we can set an aesthetic (e.g. color) to the categorical variable, and then add a geom_smooth() layer to overlay the regression line for each color.
The dataset mario_kart is already loaded in your workspace.
# interaction plot
ggplot(mario_kart, aes(y = total_pr, x = duration, color = cond)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
In the simple linear regression model for average SAT score (total) as a function of average teacher salary (salary), the fitted coefficient was -5.02 points per thousand dollars. This suggests that for every additional thousand dollars of salary for teachers in a particular state, the expected SAT score for a student from that state is about 5 points lower.
In the model that includes the percentage of students taking the SAT, the coefficient on salary becomes 1.84 points per thousand dollars. Choose the correct interpretation of this slope coefficient.
Answer: For every additional thousand dollars of salary for teachers in a particular state, the expected SAT score for a student from that state is about 2 points higher, after controlling for the percentage of students taking the SAT.
Simpson’s Paradox occurs when trends that appear when a dataset is separated into groups reverse when the data are aggregated.
Simpson’s Paradox: How to Prove Opposite Arguments with the Same Data
A mild version of Simpson’s paradox can be observed in the MarioKart auction data. Consider the relationship between the final auction price and the length of the auction. It seems reasonable to assume that longer auctions would result in higher prices, since—other things being equal—a longer auction gives more bidders more time to see the auction and bid on the item.
However, a simple linear regression model reveals the opposite: longer auctions are associated with lower final prices. The problem is that all other things are not equal. In this case, the new MarioKarts—which people pay a premium for—were mostly sold in one-day auctions, while a plurality of the used MarioKarts were sold in the standard seven-day auctions.
Our simple linear regression model is misleading, in that it suggests a negative relationship between final auction price and duration. However, for the used MarioKarts, the relationship is positive.
The object slr is already defined for you.
slr <- ggplot(mario_kart, aes(y = total_pr, x = duration)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
# model with one slope
lm(total_pr ~ duration, data = mario_kart)
##
## Call:
## lm(formula = total_pr ~ duration, data = mario_kart)
##
## Coefficients:
## (Intercept) duration
## 52.374 -1.317
## `geom_smooth()` using formula 'y ~ x'
This chapter will show you how to add two, three, and even more numeric explanatory variables to a linear model.
In terms of the R code, fitting a multiple linear regression model is easy: simply add variables to the model formula you specify in the lm() command.
In a parallel slopes model, we had two explanatory variables: one was numeric and one was categorical. Here, we will allow both explanatory variables to be numeric.
The dataset mario_kart is already loaded in your workspace.
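A minimal sketch of the fit used in the visualizations that follow (the call matches the output shown later in this chapter):
# multiple regression with two numeric explanatory variables
mod <- lm(total_pr ~ duration + start_pr, data = mario_kart)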
One method for visualizing a multiple linear regression model is to create a heatmap of the fitted values in the plane defined by the two explanatory variables. This heatmap will illustrate how the model output changes over different combinations of the explanatory variables.
This is a multistep process: first, create a grid of values for the two explanatory variables; next, use augment() to compute the fitted value at each grid point; finally, overlay those fitted values on the data space. The sketches below walk through these steps.
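The grid itself is not constructed in the code below; one hedged way to build it, assuming the modelr package and an arbitrary grid resolution, is:
# create a grid covering the observed values of the explanatory variables
library(modelr)
grid <- mario_kart %>%
  data_grid(
    duration = seq_range(duration, by = 1),
    start_pr = seq_range(start_pr, by = 1)
  )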
# add predictions to grid
price_hats <- broom::augment(mod, newdata = grid)
data_space <- ggplot(data = mario_kart,
                     aes(x = duration, y = start_pr)) +
  geom_point(aes(color = total_pr)) +
  theme_bw()
data_space
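To complete the heatmap described above, one option (a sketch; the fill and transparency choices are assumptions) is to tile the fitted values from price_hats over the data space:
# overlay the fitted values as a heatmap
data_space +
  geom_tile(data = price_hats, aes(fill = .fitted), alpha = 0.5)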
An alternative way to visualize a multiple regression model with two numeric explanatory variables is as a plane in three dimensions. This is possible in R using the plotly package.
To draw the plane, you will need three objects: the vectors x and y, which span the observed values of duration and start_pr, and the matrix plane, which contains the fitted values over that grid. All three are constructed in the code below.
Much like ggplot(), the plot_ly() function will allow you to create a plot object with variables mapped to x, y, and z aesthetics. The add_markers() function is similar to geom_point() in that it allows you to add points to your 3D plot.
library(plotly)
# draw the 3D scatterplot
p <- plot_ly(data = mario_kart, z = ~total_pr, x = ~duration, y = ~start_pr, opacity = 0.6) %>%
add_markers()
x <- seq(1, 10, length = 70)
y <- seq(0.010, 59.950, length = 70)
plane <- outer(x, y, function(a, b) {
  summary(mod)$coef[1, 1] + summary(mod)$coef[2, 1] * a + summary(mod)$coef[3, 1] * b
})
# draw the plane
p %>%
add_surface(x = ~x, y = ~y, z = ~plane, showscale = FALSE)
Coefficient magnitude
The coefficients from our model for the total auction price of MarioKarts as a function of auction duration and starting price are shown below.
##
## Call:
## lm(formula = total_pr ~ duration + start_pr, data = mario_kart)
##
## Coefficients:
## (Intercept) duration start_pr
## 51.030 -1.508 0.233
A colleague claims that these results imply that the duration of the auction is a more important determinant of final price than starting price, because the coefficient is larger. This interpretation is false because:
Answer: The coefficients have different units (dollars per day and dollars per dollar, respectively) and so they are not directly comparable.
Practicing interpretation
Fit a multiple regression model for the total auction price of an item in the mario_kart data set as a function of the starting price and the duration of the auction. Compute the coefficients and choose the correct interpretation of the duration variable.
##
## Call:
## lm(formula = total_pr ~ start_pr + duration, data = mario_kart)
##
## Coefficients:
## (Intercept) start_pr duration
## 51.030 0.233 -1.508
Answer: For each additional day the auction lasts, the expected final price declines by $1.51, after controlling for starting price.
Interpreting Coefficients
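The model behind the output below can be reproduced with (a sketch, assuming babies is loaded):
# birthweight vs. gestation, mother's age, and smoking status
lm(bwt ~ gestation + age + smoke, data = babies)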
##
## Call:
## lm(formula = bwt ~ gestation + age + smoke, data = babies)
##
## Coefficients:
## (Intercept) gestation age smoke
## -4.6037 0.4455 0.1069 -8.0143
Gestation Coefficient: Each additional day of gestation is associated with an increase in expected birthweight of 0.45 ounces after controlling for mother’s age and smoking status.
Age Coefficient: Each additional year of the mother’s age is associated with an increase in expected birthweight of about 0.11 ounces, after controlling for gestational length and the mother’s smoking status.
Smoke Coefficient: Babies born to mothers who smoke are expected to weigh about 8 ounces less, on average, than babies born to non-smoking mothers of the same age and gestational length.
By including the duration, starting price, and condition variables in our model, we now have two numeric explanatory variables and one categorical explanatory variable. Our model now takes the geometric form of two parallel planes!
The first plane corresponds to the model output when the condition of the item is new, while the second plane corresponds to the model output when the condition of the item is used. The planes have the same slopes along both the duration and starting price axes—it is the \(z\)-intercept that is different.
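A sketch of the parallel planes fit; its coefficients appear in the plane0 and plane1 definitions below and in the output later in this section:
# two numeric explanatory variables plus one categorical explanatory variable
lm(total_pr ~ duration + start_pr + cond, data = mario_kart)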
Once again we have stored the x and y vectors for you. Since we now have two planes, there are matrix objects plane0 and plane1 stored for you as well.
plane0 <- outer(x, y, function(a, b){53.3447530 -0.6559841*a + 0.1981653*b})
plane1 <- outer(x, y, function(a, b){53.3447530 -0.6559841*a + 0.1981653*b - 8.9493214})
# draw the 3D scatterplot
p <- plot_ly(data = mario_kart, z = ~total_pr, x = ~duration, y = ~start_pr, opacity = 0.6) %>%
add_markers(color = ~cond)
# draw two planes
p %>%
add_surface(x = ~x, y = ~y, z = ~plane0, showscale = FALSE) %>%
add_surface(x = ~x, y = ~y, z = ~plane1, showscale = FALSE)
Parallel plane interpretation
The coefficients from our parallel planes model are shown below.
##
## Call:
## lm(formula = total_pr ~ duration + start_pr + cond, data = mario_kart)
##
## Coefficients:
## (Intercept) duration start_pr condused
## 53.3448 -0.6560 0.1982 -8.9493
Choose the right interpretation of \(\beta_3\):
Answer: The expected premium for new (relative to used) MarioKarts is $8.95, after controlling for the duration and starting price of the auction.
Interpretation of coefficient in a big model
This time we have thrown even more variables into our model, including the number of bids in each auction (n_bids) and the number of wheels. Unfortunately this makes a full visualization of our model impossible, but we can still interpret the coefficients.
##
## Call:
## lm(formula = total_pr ~ duration + start_pr + cond + wheels +
## n_bids, data = mario_kart)
##
## Coefficients:
## (Intercept) duration start_pr condused wheels n_bids
## 39.3741 -0.2752 0.1796 -4.7720 6.7216 0.1909
Choose the correct interpretation of the coefficient on the number of wheels:
Answer: Each additional wheel is associated with an increase in the expected auction price of $6.72, after controlling for auction duration, starting price, number of bids, and the condition of the item.
In this chapter you’ll learn about using logistic regression, a generalized linear model (GLM), to predict a binary outcome and classify observations.
Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, the logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
When our response variable is binary, a linear regression model has several limitations. Among the more obvious—and logically incongruous—is that the regression line extends infinitely in either direction. This means that even though our response variable \(y\) only takes on the values 0 and 1, our fitted values \(\hat{y}\) can range anywhere from \(-\infty\) to \(\infty\). This doesn’t make sense.
To see this in action, we’ll fit a linear regression model to data about 55 students who applied to medical school. We want to understand how their undergraduate \(GPA\) relates to the probability they will be accepted by a particular school \((Acceptance)\).
The medical school acceptance data is loaded in your workspace as MedGPA.
library(Stat2Data)
data(MedGPA)
# scatterplot with jitter
data_space <- ggplot(data = MedGPA, aes(y = Acceptance, x = GPA)) +
geom_jitter(width = 0, height = 0.05, alpha = 0.5)
# linear regression line
data_space +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
In the previous exercise, we identified a major limitation to fitting a linear regression model when we have a binary response variable. However, it is not always inappropriate to do so. Note that our regression line only makes illogical predictions for students with very high or very low GPAs. For GPAs closer to average, the predictions seem fine.
Moreover, the alternative logistic regression model — which we will fit next — is very similar to the linear regression model for observations near the average of the explanatory variable. It just so happens that the logistic curve is very straight near its middle. Thus, in these cases a linear regression model may still be acceptable, even for a binary response.
# filter
MedGPA_middle <- MedGPA %>%
filter(GPA >= 3.375 & GPA <= 3.77)
# scatterplot with jitter
data_space <- ggplot(MedGPA_middle, aes(x = GPA, y = Acceptance)) +
geom_jitter(width = 0, height = 0.05, alpha = 0.5)
# linear regression line
data_space + geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Logistic regression is a special case of a broader class of generalized linear models, often known as GLMs. Specifying a logistic regression model is very similar to specifying a linear regression model, with two important differences: we use the glm() function instead of lm(), and we have to specify the error distribution with the family argument (here, family = binomial).
Note that the mathematical model is now:
\(\log{ \left( \frac{y}{1-y} \right) } = \beta_0 + \beta_1 \cdot x + \epsilon \,,\)
where \(ϵ\) is the error term.
Use glm() to fit a logistic regression model for Acceptance as a function of GPA.
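A sketch of the glm() call that produces the output below:
# logistic regression: binomial family with the default logit link
mod <- glm(Acceptance ~ GPA, data = MedGPA, family = binomial)
mod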
##
## Call: glm(formula = Acceptance ~ GPA, family = binomial, data = MedGPA)
##
## Coefficients:
## (Intercept) GPA
## -19.207 5.454
##
## Degrees of Freedom: 54 Total (i.e. Null); 53 Residual
## Null Deviance: 75.79
## Residual Deviance: 56.84 AIC: 60.84
Our logistic regression model can be visualized in the data space by overlaying the appropriate logistic curve. We can use the geom_smooth() function to do this. Recall that geom_smooth() takes a method argument that allows you to specify what type of smoother you want to see. In our case, we need to specify that we want to use the glm() function to do the smoothing.
However we also need to tell the glm() function which member of the GLM family we want to use. To do this, we will pass the family argument to glm() as a list using the method.args argument to geom_smooth(). This mechanism is common in R, and allows one function to pass a list of arguments to another function.
# scatterplot with jitter
data_space <- ggplot(data = MedGPA, aes(y = Acceptance, x = GPA)) +
geom_jitter(width = 0, height = 0.05, alpha = 0.5)
# add logistic curve
data_space +
geom_smooth(method = "glm", se = FALSE, method.args = list(family = "binomial"))
## `geom_smooth()` using formula 'y ~ x'
One of the difficulties in working with a binary response variable is understanding how it “changes.” The response itself \((y)\) is either 0 or 1, while the fitted values \((\hat{y})\) — which are interpreted as probabilities — are between 0 and 1. But if every medical school applicant is either admitted or not, what does it mean to talk about the probability of being accepted?
What we’d like is a larger sample of students, so that for each GPA value (e.g. 3.54) we had many observations (say \(n\)), and we could then take the average of those \(n\) observations to arrive at the estimated probability of acceptance. Unfortunately, since the explanatory variable is continuous, this is hopeless—it would take an infinite amount of data to make these estimates robust.
Instead, what we can do is put the observations into bins based on their GPA value. Within each bin, we can compute the proportion of accepted students, and we can visualize our model as a smooth logistic curve through those binned values.
We have created a data.frame called MedGPA_binned that aggregates the original data into six bins based on the sextiles of GPA, recording the mean GPA and the acceptance rate within each bin.
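The binning code itself is not shown; a minimal sketch that is consistent with the output below (the break points are the GPA sextiles) is:
# sextile break points of GPA
gpa_breaks <- quantile(MedGPA$GPA, probs = (0:6) / 6)
gpa_breaks
# cut GPA into six bins and inspect the result
MedGPA <- MedGPA %>%
  mutate(bins = cut(GPA, breaks = gpa_breaks, include.lowest = TRUE))
head(MedGPA)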
## 0% 16.66667% 33.33333% 50% 66.66667% 83.33333% 100%
## 2.72 3.30 3.44 3.58 3.70 3.87 3.97
## Accept Acceptance Sex BCPM GPA VR PS WS BS MCAT Apps bins
## 1 D 0 F 3.59 3.62 11 9 9 9 38 5 (3.58,3.7]
## 2 A 1 M 3.75 3.84 12 13 8 12 45 3 (3.7,3.87]
## 3 A 1 F 3.24 3.23 9 10 5 9 33 19 [2.72,3.3]
## 4 A 1 F 3.74 3.69 12 11 7 10 40 5 (3.58,3.7]
## 5 A 1 F 3.53 3.38 9 11 4 11 35 11 (3.3,3.44]
## 6 A 1 M 3.59 3.72 10 9 7 10 36 5 (3.7,3.87]
MedGPA_binned <- MedGPA %>%
group_by(bins) %>%
summarize(mean_GPA = mean(GPA), acceptance_rate = mean(Acceptance))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 6 x 3
## bins mean_GPA acceptance_rate
## <fct> <dbl> <dbl>
## 1 [2.72,3.3] 3.11 0.2
## 2 (3.3,3.44] 3.39 0.2
## 3 (3.44,3.58] 3.54 0.75
## 4 (3.58,3.7] 3.65 0.333
## 5 (3.7,3.87] 3.79 0.889
## 6 (3.87,3.97] 3.91 1
Here we are plotting \(y\) as a function of \(x\), where that function is
\(y = \frac{\exp{( \hat{\beta}_0 + \hat{\beta}_1 \cdot x )}}{1 + \exp( \hat{\beta}_0 + \hat{\beta}_1 \cdot x ) } \,.\)
Note that the left hand side is the expected probability \(y\) of being accepted to medical school.
# binned points and line
data_space <- ggplot(data = MedGPA_binned, aes(x = mean_GPA, y = acceptance_rate)) +
geom_point() + geom_line()
mod <- glm(Acceptance ~ GPA, data = MedGPA, family = binomial)
MedGPA_plus <- mod %>%
augment(type.predict = "response")
MedGPA_plus <- MedGPA_plus %>%
mutate(odds_hat = .fitted / (1 - .fitted))
# logistic model on probability scale
data_space +
geom_line(data = MedGPA_plus, aes(x = GPA, y = .fitted), color = "red")
Everything starts with the concept of probability. Let’s say that the probability of success of some event is .8. Then the probability of failure is 1 – .8 = .2. The odds of success are defined as the ratio of the probability of success over the probability of failure. In our example, the odds of success are .8/.2 = 4. That is to say that the odds of success are 4 to 1. If the probability of success is .5, i.e., 50-50 percent chance, then the odds of success is 1 to 1.
The transformation from probability to odds is a monotonic transformation, meaning the odds increase as the probability increases, and vice versa. Probability ranges from 0 to 1. Odds range from 0 to positive infinity.
Why do we take all the trouble doing the transformation from probability to log odds? One reason is that it is usually difficult to model a variable which has restricted range, such as probability. This transformation is an attempt to get around the restricted range problem. It maps probability ranging between 0 and 1 to log odds ranging from negative infinity to positive infinity. Another reason is that among all of the infinitely many choices of transformation, the log of odds is one of the easiest to understand and interpret. This transformation is called logit transformation. The other common choice is the probit transformation.
For most people, the idea that we could estimate the probability of being admitted to medical school based on undergraduate GPA is fairly intuitive. However, thinking about how the probability changes as a function of GPA is complicated by the non-linear logistic curve. By translating the response from the probability scale to the odds scale, we make the right hand side of our equation easier to understand.
If the probability of getting accepted is \(y\), then the odds are \(y / (1-y)\). Expressions of probabilities in terms of odds are common in many situations, perhaps most notably gambling.
Here we are plotting \(y / (1-y)\) as a function of \(x\), where that function is
\(odds(\hat{y}) = \frac{\hat{y}}{1-\hat{y}} = \exp{( \hat{\beta}_0 + \hat{\beta}_1 \cdot x ) }\)
Note that the left hand side is the expected odds of being accepted to medical school. The right hand side is now a familiar exponential function of \(x\).
The MedGPA_binned data frame contains the data for each GPA bin, while the MedGPA_plus data frame records the original observations after being augment()-ed by mod.
# compute odds for bins
MedGPA_binned <- MedGPA_binned %>%
mutate(odds = acceptance_rate / (1 - acceptance_rate))
# plot binned odds
data_space <- ggplot(data = MedGPA_binned, aes(x = mean_GPA, y = odds)) +
geom_point() + geom_line()
# compute odds for observations
MedGPA_plus <- MedGPA_plus %>%
mutate(odds_hat = .fitted / (1 - .fitted))
# logistic model on odds scale
data_space +
geom_line(data = MedGPA_plus, aes(x = GPA, y = odds_hat), color = "red")
Previously, we considered two formulations of the logistic regression model: on the probability scale, where the fitted values are probabilities, and on the odds scale, where the fitted values are odds.
We’ll now add a third formulation: the log-odds (logit) scale, on which the model is a linear function of the explanatory variable.
As you can see, none of these three is uniformly superior. Most people tend to interpret the fitted values on the probability scale and the function on the log-odds scale. The interpretation of the coefficients is most commonly done on the odds scale. Recall that we interpreted our slope coefficient \(\beta_1\) in linear regression as the expected change in \(y\) given a one unit change in \(x\). On the probability scale, the function is non-linear and so this approach won’t work. On the log-odds scale, the function is linear, but the units are not interpretable (what does the \(\log\) of the odds mean??). However, on the odds scale, a one unit change in \(x\) leads to the odds being multiplied by a factor of \(e^{\beta_1}\). To see why, we form the odds ratio:
\(OR = \frac{odds(\hat{y} \mid x + 1)}{odds(\hat{y} \mid x)} = e^{\beta_1}\)
Thus, the exponentiated coefficient \(e^{\beta_1}\) tells us how the expected odds change for a one unit increase in the explanatory variable. It is tempting to interpret this as a change in the expected probability, but this is wrong and can lead to nonsensical predictions (e.g. expected probabilities greater than 1).
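As a quick check in R (mod is the fitted logistic regression from earlier), exponentiating the GPA coefficient gives the estimated odds ratio:
# odds ratio associated with a one-point increase in GPA
exp(coef(mod)["GPA"])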
# compute log odds for bins
MedGPA_binned <- MedGPA_binned %>%
mutate(log_odds = log(acceptance_rate / (1 - acceptance_rate)))
# plot binned log odds
data_space <- ggplot(data = MedGPA_binned, aes(x = mean_GPA, y = log_odds)) +
geom_point() + geom_line()
# compute log odds for observations
MedGPA_plus <- MedGPA_plus %>%
mutate(log_odds_hat = log(.fitted / (1 - .fitted)))
# logistic model on log odds scale
data_space +
geom_line(data = MedGPA_plus, aes(x = GPA, y = log_odds_hat), color = "red")
Interpretation of logistic regression
The fitted coefficient \(\hat{\beta}_1\) from the medical school logistic regression model is 5.45. The exponential of this is 233.73.
Donald’s GPA is 2.9, and thus the model predicts that the probability of him getting into medical school is 3.26%. The odds of Donald getting into medical school are 0.0337, or—phrased in gambling terms—29.6:1. If Donald hacks the school’s registrar and changes his GPA to 3.9, then which of the following statements is FALSE:
Just as we did with linear regression, we can use our logistic regression model to make predictions about new observations. In this exercise, we will use the newdata argument to the augment() function from the broom package to make predictions about students who were not in our original data set. These predictions are sometimes called out-of-sample.
Following our previous discussion about scales, with logistic regression it is important that we specify on which scale we want the predicted values. The default is the scale of the linear predictor (the log-odds), but we want our predictions on the probability scale, which is the scale of the response variable. The type.predict argument to augment() controls this behavior.
A logistic regression model object, mod, has been defined for you.
# create new data frame
new_data <- data.frame(GPA = 3.51)
# make predictions
augment(mod, newdata = new_data, type.predict = "response")
## # A tibble: 1 x 2
## GPA .fitted
## <dbl> <dbl>
## 1 3.51 0.484
Naturally, we want to know how well our model works. Did it predict acceptance for the students who were actually accepted to medical school? Did it predict rejections for the students who were not admitted? These types of predictions are called in-sample. One common way to evaluate models with a binary response is with a confusion matrix. [Yes, that is actually what it is called!]
However, note that while our response variable is binary, our fitted values are probabilities. Thus, we have to round them somehow into binary predictions. While the probabilities convey more information, we might ultimately have to make a decision, and so this rounding is common in practice. There are many different ways to round, but for simplicity we will predict admission if the fitted probability is greater than 0.5, and rejection otherwise.
First, we’ll use augment() to make the predictions, and then mutate() and round() to convert these probabilities into binary decisions. Then we will form the confusion matrix using the table() function. table() will compute a 2-way table when given a data frame with two categorical variables, so we will first use select() to grab only those variables.
You will find that this model made only 15 mistakes on these 55 observations, so it is nearly 73% accurate.
The model object mod is already in your workspace.
# data frame with binary predictions
tidy_mod <- augment(mod, type.predict = "response") %>%
mutate(Acceptance_hat = round(.fitted))
# confusion matrix
tidy_mod %>%
select(Acceptance, Acceptance_hat) %>%
table()
## Acceptance_hat
## Acceptance 0 1
## 0 16 9
## 1 6 24
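One way to compute the accuracy directly from these predictions (a short sketch using the objects above):
# overall in-sample accuracy
tidy_mod %>%
  summarize(accuracy = mean(Acceptance == Acceptance_hat))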
Explore the relationship between price and the quality of food, service, and decor for Italian restaurants in NYC.
Multiple regression can be an effective technique for understanding how a response variable changes as a result of changes to more than one explanatory variable. But it is not magic – understanding the relationships among the explanatory variables is also necessary, and will help us build a better model. This process is often called exploratory data analysis (EDA) and is covered in another DataCamp course.
One quick technique for jump-starting EDA is to examine all of the pairwise scatterplots in your data. This can be achieved using the pairs() function. Look for variables in the nyc data set that are strongly correlated, as those relationships will help us check for multicollinearity later on.
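A minimal pairs() sketch; the column selection is an assumption about which variables in nyc are of interest:
# pairwise scatterplots of price and the rating variables
nyc <- read.csv("nyc.csv")
pairs(nyc[, c("Price", "Food", "Decor", "Service", "East")])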
Which pairs of variables appear to be strongly correlated?
Based on your knowledge of the restaurant industry, do you think that the quality of the food in a restaurant is an important determinant of the price of a meal at that restaurant? It would be hard to imagine that it wasn’t. We’ll start our modeling process by plotting and fitting a model for Price as a function of Food.
On your own, interpret these coefficients and examine the fit of the model. What does the coefficient of Food mean in plain English? “Each additional rating point of food quality is associated with a…”
# Price by Food plot
nyc <- read.csv("nyc.csv")
ggplot(data = nyc, aes(x = Food, y = Price)) +
geom_point()
##
## Call:
## lm(formula = Price ~ Food, data = nyc)
##
## Coefficients:
## (Intercept) Food
## -17.832 2.939
Interpretation
Each additional rating point of food quality is associated with a price increase of about $2.94.
In real estate, a common mantra is that the three most important factors in determining the price of a property are “location, location, and location.” If location drives up property values and rents, then we might imagine that location would increase a restaurant’s costs, which would result in them having higher prices. In many parts of New York, the east side (east of 5th Avenue) is more developed and perhaps more expensive. [This is increasingly less true, but was more true at the time these data were collected.]
Let’s expand our model into a parallel slopes model by including the East variable in addition to Food.
Use lm() to fit a parallel slopes model for Price as a function of Food and East. Interpret the coefficients and the fit of the model. Can you explain the meaning of the coefficient on East in simple terms? Did the coefficient on Food change from the previous model? If so, why? Did it change by a lot or just a little?
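A sketch of the parallel slopes fit described above (its output is not reproduced here):
# price vs. food quality and location
lm(Price ~ Food + East, data = nyc)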
Identify the statement that is FALSE:
Each additional rating point of food quality is associated with a $2.88 increase in the expected price of a meal, after controlling for location.
The premium for an Italian restaurant in NYC associated with being on the east side of 5th Avenue is $1.46, after controlling for the quality of the food.
The change in the coefficient of food from 2.94 in the simple linear model to 2.88 in this model has profound practical implications for restaurant owners.
One reason that many people go to a restaurant—apart from the food—is that they don’t have to cook or clean up. Many people appreciate the experience of being waited upon, and we can all agree that the quality of the service at restaurants varies widely. Are people willing to pay more for better restaurant Service? More interestingly, are they willing to pay more for better service, after controlling for the quality of the food?
Multiple regression gives us a way to reason about these questions. Fit the model with Food and Service and interpret the coefficients and fit. Did the coefficient on Food change from the previous model? What do the coefficients on Food and Service tell you about how these restaurants set prices?
Next, let’s visually assess our model using plotly. The x and y vectors, as well as the plane matrix, have been created for you.
##
## Call:
## lm(formula = Price ~ Food + Service, data = nyc)
##
## Coefficients:
## (Intercept) Food Service
## -21.159 1.495 1.704
# draw 3D scatterplot
p <- plot_ly(data = nyc, z = ~Price, x = ~Food, y = ~Service, opacity = 0.6) %>%
add_markers()
# draw a plane
p %>%
add_surface(x = ~x, y = ~y, z = ~plane, showscale = FALSE)
Collinearity
In statistics, multicollinearity (also collinearity) is a phenomenon in which one feature variable in a regression model is highly linearly correlated with another feature variable.
Perfect collinearity is the special case in which two or more variables are exactly linearly related. When this happens, the regression coefficients are not uniquely determined, which hurts the interpretability of the model: each coefficient picks up influence from the other correlated features.
We have explored models that included the quality of both food and service, as well as location, but we haven’t put these variables all into the same model. Let’s now build a parallel planes model that incorporates all three variables.
Examine the coefficients closely. Do they make sense based on what you understand about these data so far? How did the coefficients change from the previous models that you fit?
Use lm() to fit a parallel planes model for Price as a function of Food, Service, and East.
##
## Call:
## lm(formula = Price ~ Food + Service + East, data = nyc)
##
## Coefficients:
## (Intercept) Food Service East
## -20.8155 1.4863 1.6647 0.9649
Interpretation of location coefficient
Which of the following statements is FALSE of the fitted coefficients from the parallel planes model?
The premium for being on the East side of 5th Avenue is just less than a dollar, after controlling for the quality of food and service.
The impact of location is relatively small, since one additional rating point of either food or service would result in a higher expected price than moving a restaurant from the West side to the East side.
The expected price of a meal on the East side is about 96% of the cost of a meal on the West side, after controlling for the quality of food and service.
The impact of location brings us to a modeling question: should we keep this variable in our model? In a later course, you will learn how we can conduct formal hypothesis tests to help us answer that question. In this course, we will focus on the size of the effect. Is the impact of location big or small?
One way to think about this would be in terms of the practical significance. Is the value of the coefficient large enough to make a difference to your average person? The units are in dollars so in this case this question is not hard to grasp.
Another way is to examine the impact of location in the context of the variability of the other variables. We can do this by building our parallel planes in 3D and seeing how far apart they are. Are the planes close together or far apart? Does the East variable clearly separate the data into two distinct groups? Or are the points all mixed up together?
# draw 3D scatterplot
p <- plot_ly(data = nyc, z = ~Price, x = ~Food, y = ~Service, opacity = 0.6) %>%
add_markers(color = ~factor(East))
# draw two planes
p %>%
add_surface(x = ~x, y = ~y, z = ~plane0, showscale = FALSE) %>%
add_surface(x = ~x, y = ~y, z = ~plane1, showscale = FALSE)
Full model
One variable we haven’t considered is Decor. Do people, on average, pay more for a meal in a restaurant with nicer decor? If so, does it still matter after controlling for the quality of food, service, and location?
By adding a third numeric explanatory variable to our model, we lose the ability to visualize the model in even three dimensions. Our model is now a hyperplane – or rather, parallel hyperplanes – and while we won’t go any further with the geometry, know that we can continue to add as many variables to our model as we want. As humans, our spatial visualization ability taps out after three numeric variables (maybe you could argue for four, but certainly no further), but neither the mathematical equation for the regression model, nor the formula specification for the model in R, is bothered by the higher dimensionality.
Use lm() to fit a parallel planes model for Price as a function of Food, Service, Decor, and East.
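A sketch of the full parallel planes fit whose coefficients appear below:
# price vs. food, service, decor, and location
lm(Price ~ Food + Service + Decor + East, data = nyc)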
##
## Call:
## lm(formula = Price ~ Food + Service + Decor + East, data = nyc)
##
## Coefficients:
## (Intercept) Food Service Decor East
## -24.023800 1.538120 -0.002727 1.910087 2.068050
Notice the dramatic change in the value of the Service coefficient.
Which of the following interpretations is invalid?
Since the quality of food, decor, and service were all strongly correlated, multicollinearity is the likely explanation.
Once we control for the quality of food, decor, and location, the additional information conveyed by service is negligible.
Service is not an important factor in determining the price of a meal.
None of the above.