In this course you’ll take your skills with simple linear regression to the next level. By learning multiple and logistic regression techniques you will gain the skills to model and predict both numeric and categorical outcomes using multiple input variables. You’ll also learn how to fit, visualize, and interpret these models. Then you’ll apply your skills to learn about Italian restaurants in New York City!
In this chapter you’ll learn about the class of linear models called “parallel slopes models.” These include one numeric and one categorical explanatory variable.
We use the lm() function to fit linear models to data. In this case, we want to understand how the price of MarioKart games sold at auction varies as a function of not only the number of wheels included in the package, but also whether the item is new or used. It seems reasonable to expect a premium for buying the game new. But how much is that premium? Can we estimate its value after controlling for the number of wheels?
We will fit a parallel slopes model using lm(). In addition to the data argument, lm() needs to know which variables you want to include in your regression model, and how you want to include them. It accomplishes this using a formula argument. A simple linear regression formula looks like y ~ x, where y is the name of the response variable, and x is the name of the explanatory variable. Here, we will simply extend this formula to include multiple explanatory variables. A parallel slopes model has the form y ~ x + z, where z is a categorical explanatory variable, and x is a numerical explanatory variable.
The output from lm() is a model object, which when printed, will show the fitted coefficients.
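The code that produced the output below is not shown; here is a minimal sketch, assuming the mario_kart auction data is already loaded:
# inspect the auction data, then fit the parallel slopes model
library(dplyr)
glimpse(mario_kart)
str(mario_kart)
mod <- lm(total_pr ~ wheels + cond, data = mario_kart)
mod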
## Rows: 141
## Columns: 12
## $ id <dbl> 150377422259, 260483376854, 320432342985, 280405224677,...
## $ duration <int> 3, 7, 3, 3, 1, 3, 1, 1, 3, 7, 1, 1, 1, 1, 7, 7, 3, 3, 1...
## $ n_bids <int> 20, 13, 16, 18, 20, 19, 13, 15, 29, 8, 15, 15, 13, 16, ...
## $ cond <fct> new, used, new, new, new, new, used, new, used, used, n...
## $ start_pr <dbl> 0.99, 0.99, 0.99, 0.99, 0.01, 0.99, 0.01, 1.00, 0.99, 1...
## $ ship_pr <dbl> 4.00, 3.99, 3.50, 0.00, 0.00, 4.00, 0.00, 2.99, 4.00, 4...
## $ total_pr <dbl> 51.55, 37.04, 45.50, 44.00, 71.00, 45.00, 37.02, 53.99,...
## $ ship_sp <fct> standard, firstClass, firstClass, standard, media, stan...
## $ seller_rate <int> 1580, 365, 998, 7, 820, 270144, 7284, 4858, 27, 201, 48...
## $ stock_photo <fct> yes, yes, no, yes, yes, yes, yes, yes, yes, no, yes, ye...
## $ wheels <int> 1, 1, 1, 1, 2, 0, 0, 2, 1, 1, 2, 2, 2, 2, 1, 0, 1, 1, 2...
## $ title <fct> "~~ Wii MARIO KART & WHEEL ~ NINTENDO Wii ~ BRAND N...
## tibble [141 x 12] (S3: tbl_df/tbl/data.frame)
## $ id : num [1:141] 1.5e+11 2.6e+11 3.2e+11 2.8e+11 1.7e+11 ...
## $ duration : int [1:141] 3 7 3 3 1 3 1 1 3 7 ...
## $ n_bids : int [1:141] 20 13 16 18 20 19 13 15 29 8 ...
## $ cond : Factor w/ 2 levels "new","used": 1 2 1 1 1 1 2 1 2 2 ...
## $ start_pr : num [1:141] 0.99 0.99 0.99 0.99 0.01 ...
## $ ship_pr : num [1:141] 4 3.99 3.5 0 0 4 0 2.99 4 4 ...
## $ total_pr : num [1:141] 51.5 37 45.5 44 71 ...
## $ ship_sp : Factor w/ 8 levels "firstClass","media",..: 6 1 1 6 2 6 6 8 5 1 ...
## $ seller_rate: int [1:141] 1580 365 998 7 820 270144 7284 4858 27 201 ...
## $ stock_photo: Factor w/ 2 levels "no","yes": 2 2 1 2 2 2 2 2 2 1 ...
## $ wheels : int [1:141] 1 1 1 1 2 0 0 2 1 1 ...
## $ title : Factor w/ 80 levels " Mario Kart Wii with Wii Wheel for Wii (New)",..: 80 60 22 7 4 19 34 5 79 70 ...
##
## Call:
## lm(formula = total_pr ~ wheels + cond, data = mario_kart)
##
## Coefficients:
## (Intercept) wheels condused
## 42.370 7.233 -5.585
Parallel slopes models are so-named because we can visualize these models in the data space as not one line, but two parallel lines. To do this, we’ll draw two things: the observed data points, colored by the categorical variable, and the two parallel lines formed by connecting the fitted values within each group.
Our plotting strategy is to compute the fitted values, plot these, and connect the points to form a line. The augment() function from the broom package provides an easy way to add the fitted values to our data frame, and the geom_line() function can then use that data frame to plot the points and connect them.
Note that this approach has the added benefit of automatically coloring the lines appropriately to match the data.
You already know how to use ggplot() and geom_point() to make the scatterplot. The only twist is that now you’ll pass your augment()-ed model as the data argument in your ggplot() call. When you add your geom_line(), instead of letting the y aesthetic inherit its values from the ggplot() call, you can set it to the .fitted column of the augment()-ed model. This has the advantage of automatically coloring the lines for you.
The parallel slopes model mod relating total price to the number of wheels and condition is already in your workspace.
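The augmented data frame used in the plotting code is not created in the text; a minimal sketch, assuming mod is the parallel slopes model above, is:
# add fitted values, residuals, and diagnostics to the data
library(broom)
augmented_mod <- augment(mod)
glimpse(augmented_mod)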
## Rows: 141
## Columns: 9
## $ total_pr <dbl> 51.55, 37.04, 45.50, 44.00, 71.00, 45.00, 37.02, 53.99, ...
## $ wheels <int> 1, 1, 1, 1, 2, 0, 0, 2, 1, 1, 2, 2, 2, 2, 1, 0, 1, 1, 2,...
## $ cond <fct> new, used, new, new, new, new, used, new, used, used, ne...
## $ .fitted <dbl> 49.60260, 44.01777, 49.60260, 49.60260, 56.83544, 42.369...
## $ .resid <dbl> 1.9473995, -6.9777674, -4.1026005, -5.6026005, 14.164559...
## $ .std.resid <dbl> 0.40270893, -1.43671086, -0.84838977, -1.15857953, 2.926...
## $ .hat <dbl> 0.02103158, 0.01250410, 0.02103158, 0.02103158, 0.019156...
## $ .sigma <dbl> 4.902339, 4.868399, 4.892414, 4.881308, 4.750591, 4.8998...
## $ .cooksd <dbl> 1.161354e-03, 8.712334e-03, 5.154337e-03, 9.612441e-03, ...
# scatterplot, with color
data_space <- ggplot(augmented_mod, aes(x = wheels, y = total_pr, color = cond)) +
geom_point()
# single call to geom_line()
data_space +
geom_line(aes(y = .fitted))
Intercept interpretation
Recall that the cond variable is either new or used. Here are the fitted coefficients from your model:
##
## Call:
## lm(formula = total_pr ~ wheels + cond, data = mario_kart)
##
## Coefficients:
## (Intercept) wheels condused
## 42.370 7.233 -5.585
Choose the correct interpretation of the coefficient on condused:
Answer: The expected price of a used MarioKart is $5.58 less than that of a new one with the same number of wheels.
The babies data set contains observations about the birthweight and other characteristics of children born in the San Francisco Bay area from 1960–1967.
We would like to build a model for birthweight as a function of the mother’s age and whether this child was her first (parity == 0). Use the mathematical specification below to code the model in R.
\(birthweight = \beta_0 + \beta_1 \cdot age + \beta_2 \cdot parity + \epsilon\)
The birthweight variable is recorded in the column bwt.
Use lm() to build the parallel slopes model specified above. It’s not necessary to use factor() in this case as the variable parity is coded using binary numeric values.
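A sketch of the call that produces the output below, assuming the babies data frame is loaded:
# parallel slopes model: birthweight vs. mother's age and parity
lm(bwt ~ age + parity, data = babies)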
##
## Call:
## lm(formula = bwt ~ age + parity, data = babies)
##
## Coefficients:
## (Intercept) age parity
## 118.27782 0.06315 -1.65248
This time, we’d like to build a model for birthweight as a function of the length of gestation and the mother’s smoking status. Use the plot to inform your model specification.
Use lm() to build a parallel slopes model implied by the plot. It’s not necessary to use factor() in this case either.
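A sketch of the corresponding call, again assuming babies is loaded:
# parallel slopes model: birthweight vs. gestation and smoking status
lm(bwt ~ gestation + smoke, data = babies)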
##
## Call:
## lm(formula = bwt ~ gestation + smoke, data = babies)
##
## Coefficients:
## (Intercept) gestation smoke
## -0.9317 0.4429 -8.0883
This chapter covers model evaluation. By looking at different properties of the model, including the adjusted R-squared, you’ll learn to compare models so that you can select the best one. You’ll also learn about interaction terms in linear models.
Two common measures of how well a model fits to data are \(R^2\) (the coefficient of determination) and the adjusted \(R^2\). The former measures the percentage of the variability in the response variable that is explained by the model. To compute this, we define
\(R^2 = 1 - \frac{SSE}{SST} \,,\)
where \(SSE\) and \(SST\) are the sum of the squared residuals, and the total sum of the squares, respectively. One issue with this measure is that the \(SSE\) can only decrease as new variables are added to the model, while the \(SST\) depends only on the response variable and therefore is not affected by changes to the model. This means that you can increase \(R^2\) by adding any additional variable to your model—even random noise.
The adjusted \(R^2\) includes a term that penalizes a model for each additional explanatory variable (where \(p\) is the number of explanatory variables).
\(R^2_{adj} = 1 - \frac{SSE}{SST} \cdot \frac{n-1}{n-p-1} \,,\)
We can see both measures in the output of the summary() function on our model object.
Some Problems with R-squared
Problem 1: Every time you add a predictor to a model, the R-squared increases, even if due to chance alone. It never decreases. Consequently, a model with more terms may appear to have a better fit simply because it has more terms.
Problem 2: If a model has too many predictors and higher order polynomials, it begins to model the random noise in the data. This condition is known as overfitting the model and it produces misleadingly high R-squared values and a lessened ability to make predictions.
The adjusted R-squared compares the explanatory power of regression models that contain different numbers of predictors.
Suppose you compare a five-predictor model with a higher R-squared to a one-predictor model. Does the five predictor model have a higher R-squared because it’s better? Or is the R-squared higher because it has more predictors? Simply compare the adjusted R-squared values to find out!
The adjusted R-squared is a modified version of R-squared that has been adjusted for the number of predictors in the model. The adjusted R-squared increases only if the new term improves the model more than would be expected by chance. It decreases when a predictor improves the model by less than expected by chance. The adjusted R-squared can be negative, but it’s usually not. It is always lower than the R-squared.
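Both measures appear near the bottom of the summary() output. A minimal sketch, assuming mod is the parallel slopes model fit earlier:
# R-squared and adjusted R-squared are reported by summary()
summary(mod)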
##
## Call:
## lm(formula = total_pr ~ wheels + cond, data = mario_kart)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.0078 -3.0754 -0.8254 2.9822 14.1646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.3698 1.0651 39.780 < 2e-16 ***
## wheels 7.2328 0.5419 13.347 < 2e-16 ***
## condused -5.5848 0.9245 -6.041 1.35e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.887 on 138 degrees of freedom
## Multiple R-squared: 0.7165, Adjusted R-squared: 0.7124
## F-statistic: 174.4 on 2 and 138 DF, p-value: < 2.2e-16
# add random noise
mario_kart_noisy <- mario_kart %>%
mutate(noise = rnorm(nrow(mario_kart)))
# compute new model
mod2 <- lm(total_pr ~ wheels + cond + noise, data = mario_kart_noisy)
# new R^2 and adjusted R^2
summary(mod2)
##
## Call:
## lm(formula = total_pr ~ wheels + cond + noise, data = mario_kart_noisy)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.4572 -3.2610 -0.7703 2.7461 14.5646
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 42.3915 1.0677 39.703 < 2e-16 ***
## wheels 7.2160 0.5436 13.276 < 2e-16 ***
## condused -5.5949 0.9265 -6.039 1.38e-08 ***
## noise 0.2692 0.4008 0.672 0.503
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.897 on 137 degrees of freedom
## Multiple R-squared: 0.7174, Adjusted R-squared: 0.7113
## F-statistic: 116 on 3 and 137 DF, p-value: < 2.2e-16
Once we have fit a regression model, we can use it to make predictions for unseen observations or retrieve the fitted values. Here, we explore two methods for doing the latter.
A traditional way to return the fitted values (i.e. the \(\hat{y}'s\)) is to run the predict() function on the model object. This will return a vector of the fitted values. Note that predict() will take an optional newdata argument that will allow you to make predictions for observations that are not in the original data.
A newer alternative is the augment() function from the broom package, which returns a data.frame with the response variable \((y)\), the relevant explanatory variables (the \(x's\)), the fitted value \((\hat{y})\) and some information about the residuals \((e)\). augment() will also take a newdata argument that allows you to make predictions.
The fitted model mod is already in your environment.
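The calls themselves are not shown; a minimal sketch is:
# fitted values as a numeric vector
predict(mod)
# fitted values, residuals, and diagnostics in a data frame
augment(mod)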
## 1 2 3 4 5 6 7 8
## 49.60260 44.01777 49.60260 49.60260 56.83544 42.36976 36.78493 56.83544
## 9 10 11 12 13 14 15 16
## 44.01777 44.01777 56.83544 56.83544 56.83544 56.83544 44.01777 36.78493
## 17 18 19 20 21 22 23 24
## 49.60260 49.60260 56.83544 36.78493 56.83544 56.83544 56.83544 44.01777
## 25 26 27 28 29 30 31 32
## 56.83544 36.78493 36.78493 36.78493 49.60260 36.78493 36.78493 44.01777
## 33 34 35 36 37 38 39 40
## 51.25061 44.01777 44.01777 36.78493 44.01777 56.83544 56.83544 49.60260
## 41 42 43 44 45 46 47 48
## 44.01777 51.25061 56.83544 56.83544 44.01777 56.83544 36.78493 36.78493
## 49 50 51 52 53 54 55 56
## 44.01777 56.83544 36.78493 44.01777 42.36976 36.78493 36.78493 44.01777
## 57 58 59 60 61 62 63 64
## 44.01777 36.78493 36.78493 56.83544 36.78493 56.83544 36.78493 51.25061
## 65 66 67 68 69 70 71 72
## 56.83544 44.01777 58.48345 51.25061 49.60260 44.01777 49.60260 56.83544
## 73 74 75 76 77 78 79 80
## 56.83544 51.25061 44.01777 36.78493 36.78493 36.78493 44.01777 56.83544
## 81 82 83 84 85 86 87 88
## 44.01777 65.71629 44.01777 56.83544 36.78493 49.60260 49.60260 36.78493
## 89 90 91 92 93 94 95 96
## 44.01777 36.78493 51.25061 44.01777 36.78493 51.25061 42.36976 56.83544
## 97 98 99 100 101 102 103 104
## 51.25061 44.01777 51.25061 56.83544 56.83544 56.83544 36.78493 49.60260
## 105 106 107 108 109 110 111 112
## 51.25061 44.01777 56.83544 49.60260 36.78493 44.01777 51.25061 56.83544
## 113 114 115 116 117 118 119 120
## 64.06828 44.01777 49.60260 44.01777 49.60260 51.25061 42.36976 44.01777
## 121 122 123 124 125 126 127 128
## 56.83544 44.01777 49.60260 44.01777 51.25061 56.83544 56.83544 49.60260
## 129 130 131 132 133 134 135 136
## 56.83544 36.78493 44.01777 44.01777 36.78493 56.83544 36.78493 44.01777
## 137 138 139 140 141
## 36.78493 51.25061 49.60260 36.78493 56.83544
## # A tibble: 141 x 9
## total_pr wheels cond .fitted .resid .std.resid .hat .sigma .cooksd
## <dbl> <int> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 51.6 1 new 49.6 1.95 0.403 0.0210 4.90 0.00116
## 2 37.0 1 used 44.0 -6.98 -1.44 0.0125 4.87 0.00871
## 3 45.5 1 new 49.6 -4.10 -0.848 0.0210 4.89 0.00515
## 4 44 1 new 49.6 -5.60 -1.16 0.0210 4.88 0.00961
## 5 71 2 new 56.8 14.2 2.93 0.0192 4.75 0.0557
## 6 45 0 new 42.4 2.63 0.551 0.0475 4.90 0.00505
## 7 37.0 0 used 36.8 0.235 0.0486 0.0209 4.91 0.0000168
## 8 54.0 2 new 56.8 -2.85 -0.588 0.0192 4.90 0.00225
## 9 47 1 used 44.0 2.98 0.614 0.0125 4.90 0.00159
## 10 50 1 used 44.0 5.98 1.23 0.0125 4.88 0.00640
## # ... with 131 more rows
Thought experiments
Suppose that after going apple picking you have 12 apples left over. You decide to conduct an experiment to investigate how quickly they will rot under certain conditions. You place six apples in a cool spot in your basement, and leave the other six on the window sill in the kitchen. Every week, you estimate the percentage of the surface area of the apple that is rotten or moldy.
Consider the following models:
\(rot = \beta_0 + \beta_1 \cdot t + \beta_2 \cdot temp \,,\)
and
\(rot = \beta_0 + \beta_1 \cdot t + \beta_2 \cdot temp + \beta_3 \cdot temp \cdot t \,,\)
where \(t\) is time, measured in weeks, and \(temp\) is a binary variable indicating either cool or warm.
If you decide to keep the interaction term present in the second model, you are implicitly assuming that:
Answer: The rate at which apples rot will vary based on the temperature.
The presence of a significant interaction indicates that the effect of one predictor variable on the response variable is different at different values of the other predictor variable. It is tested by adding a term to the model in which the two predictor variables are multiplied. Adding an interaction term to a model drastically changes the interpretation of all the coefficients.
Including an interaction term in a model is easy: we just have to tell lm() that we want to include that new variable. An expression of the form y ~ x + z + x:z will do the trick. The use of the colon (:) here means that the interaction between \(x\) and \(z\) will be a third term in the model.
The data frame mario_kart is already loaded in your workspace.
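A sketch of the interaction model whose output appears below:
# include the interaction between condition and duration
lm(total_pr ~ cond + duration + cond:duration, data = mario_kart)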
##
## Call:
## lm(formula = total_pr ~ cond + duration + cond:duration, data = mario_kart)
##
## Coefficients:
## (Intercept) condused duration condused:duration
## 58.268 -17.122 -1.966 2.325
Interaction allows the slope of the regression line in each group to vary. In this case, this means that the relationship between the final price and the length of the auction is moderated by the condition of each item.
Interaction models are easy to visualize in the data space with ggplot2 because they have the same coefficients as if the models were fit independently to each group defined by the level of the categorical variable. In this case, new and used MarioKarts each get their own regression line. To see this, we can set an aesthetic (e.g. color) to the categorical variable, and then add a geom_smooth() layer to overlay the regression line for each color.
The dataset mario_kart is already loaded in your workspace.
# interaction plot
ggplot(mario_kart, aes(y = total_pr, x = duration, color = cond)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
In the simple linear regression model for average SAT score (total) as a function of average teacher salary (salary), the fitted coefficient was -5.02 points per thousand dollars. This suggests that for every additional thousand dollars of salary for teachers in a particular state, the expected SAT score for a student from that state is about 5 points lower.
In the model that includes the percentage of students taking the SAT, the coefficient on salary becomes 1.84 points per thousand dollars. Choose the correct interpretation of this slope coefficient.
Answer: For every additional thousand dollars of salary for teachers in a particular state, the expected SAT score for a student from that state is about 2 points higher, after controlling for the percentage of students taking the SAT.
Simpson’s Paradox occurs when trends that appear when a dataset is separated into groups reverse when the data are aggregated.
Simpson’s Paradox: How to Prove Opposite Arguments with the Same Data
A mild version of Simpson’s paradox can be observed in the MarioKart auction data. Consider the relationship between the final auction price and the length of the auction. It seems reasonable to assume that longer auctions would result in higher prices, since—other things being equal—a longer auction gives more bidders more time to see the auction and bid on the item.
However, a simple linear regression model reveals the opposite: longer auctions are associated with lower final prices. The problem is that all other things are not equal. In this case, the new MarioKarts—which people pay a premium for—were mostly sold in one-day auctions, while a plurality of the used MarioKarts were sold in the standard seven-day auctions.
Our simple linear regression model is misleading, in that it suggests a negative relationship between final auction price and duration. However, for the used MarioKarts, the relationship is positive.
The object slr is already defined for you.
slr <- ggplot(mario_kart, aes(y = total_pr, x = duration)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE)
# model with one slope
lm(total_pr ~ duration, data = mario_kart)
##
## Call:
## lm(formula = total_pr ~ duration, data = mario_kart)
##
## Coefficients:
## (Intercept) duration
## 52.374 -1.317
## `geom_smooth()` using formula 'y ~ x'
This chapter will show you how to add two, three, and even more numeric explanatory variables to a linear model.
In terms of the R code, fitting a multiple linear regression model is easy: simply add variables to the model formula you specify in the lm() command.
In a parallel slopes model, we had two explanatory variables: one was numeric and one was categorical. Here, we will allow both explanatory variables to be numeric.
The dataset mario_kart is already loaded in your workspace.
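A minimal sketch of the fit used in the visualizations that follow (the call matches the output shown later in this chapter):
# multiple regression with two numeric explanatory variables
mod <- lm(total_pr ~ duration + start_pr, data = mario_kart)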
One method for visualizing a multiple linear regression model is to create a heatmap of the fitted values in the plane defined by the two explanatory variables. This heatmap will illustrate how the model output changes over different combinations of the explanatory variables.
This is a multistep process: first, create a grid of values for the two explanatory variables; next, use augment() to compute the fitted value at each grid point; finally, overlay those fitted values on the data space. The sketches below walk through these steps.
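The grid itself is not constructed in the code below; one hedged way to build it, assuming the modelr package and an arbitrary grid resolution, is:
# create a grid covering the observed values of the explanatory variables
library(modelr)
grid <- mario_kart %>%
  data_grid(
    duration = seq_range(duration, by = 1),
    start_pr = seq_range(start_pr, by = 1)
  )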
# add predictions to grid
price_hats <- broom::augment(mod, newdata = grid)
data_space <- ggplot(data = mario_kart,
                     aes(x = duration, y = start_pr)) +
  geom_point(aes(color = total_pr)) +
  theme_bw()
data_space
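To complete the heatmap described above, one option (a sketch; the fill and transparency choices are assumptions) is to tile the fitted values from price_hats over the data space:
# overlay the fitted values as a heatmap
data_space +
  geom_tile(data = price_hats, aes(fill = .fitted), alpha = 0.5)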
An alternative way to visualize a multiple regression model with two numeric explanatory variables is as a plane in three dimensions. This is possible in R using the plotly package.
To draw the plane, you will need three objects: the vectors x and y, which span the observed values of duration and start_pr, and the matrix plane, which contains the fitted values over that grid. All three are constructed in the code below.
Much like ggplot(), the plot_ly() function will allow you to create a plot object with variables mapped to x, y, and z aesthetics. The add_markers() function is similar to geom_point() in that it allows you to add points to your 3D plot.
library(plotly)
# draw the 3D scatterplot
p <- plot_ly(data = mario_kart, z = ~total_pr, x = ~duration, y = ~start_pr, opacity = 0.6) %>%
add_markers()
x <- seq(1, 10, length = 70)
y <- seq(0.010, 59.950, length = 70)
plane <- outer(x, y, function(a, b) {
  summary(mod)$coef[1, 1] + summary(mod)$coef[2, 1] * a + summary(mod)$coef[3, 1] * b
})
# draw the plane
p %>%
add_surface(x = ~x, y = ~y, z = ~plane, showscale = FALSE)
Coefficient magnitude
The coefficients from our model for the total auction price of MarioKarts as a function of auction duration and starting price are shown below.
##
## Call:
## lm(formula = total_pr ~ duration + start_pr, data = mario_kart)
##
## Coefficients:
## (Intercept) duration start_pr
## 51.030 -1.508 0.233
A colleague claims that these results imply that the duration of the auction is a more important determinant of final price than starting price, because the coefficient is larger. This interpretation is false because:
Answer: The coefficients have different units (dollars per day and dollars per dollar, respectively) and so they are not directly comparable.
Practicing interpretation
Fit a multiple regression model for the total auction price of an item in the mario_kart data set as a function of the starting price and the duration of the auction. Compute the coefficients and choose the correct interpretation of the duration variable.
##
## Call:
## lm(formula = total_pr ~ start_pr + duration, data = mario_kart)
##
## Coefficients:
## (Intercept) start_pr duration
## 51.030 0.233 -1.508
Answer: For each additional day the auction lasts, the expected final price declines by $1.51, after controlling for starting price.
Interpreting Coefficients
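The model behind the output below can be reproduced with (a sketch, assuming babies is loaded):
# birthweight vs. gestation, mother's age, and smoking status
lm(bwt ~ gestation + age + smoke, data = babies)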
##
## Call:
## lm(formula = bwt ~ gestation + age + smoke, data = babies)
##
## Coefficients:
## (Intercept) gestation age smoke
## -4.6037 0.4455 0.1069 -8.0143
Gestation Coefficient: Each additional day of gestation is associated with an increase in expected birthweight of 0.45 ounces after controlling for mother’s age and smoking status.
Age Coefficient: Each additional year of the mother’s age is associated with an increase in expected birthweight of about 0.11 ounces, after controlling for gestational length and the mother’s smoking status.
Smoke Coefficient: Babies born to mothers who smoke are expected to weigh about 8 ounces less, on average, than babies born to non-smoking mothers of the same age and gestational length.
By including the duration, starting price, and condition variables in our model, we now have two numeric explanatory variables and one categorical explanatory variable. Our model now takes the geometric form of two parallel planes!
The first plane corresponds to the model output when the condition of the item is new, while the second plane corresponds to the model output when the condition of the item is used. The planes have the same slopes along both the duration and starting price axes—it is the \(z\)-intercept that is different.
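A sketch of the parallel planes fit; its coefficients appear in the plane0 and plane1 definitions below and in the output later in this section:
# two numeric explanatory variables plus one categorical explanatory variable
lm(total_pr ~ duration + start_pr + cond, data = mario_kart)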
Once again we have stored the x and y vectors for you. Since we now have two planes, there are matrix objects plane0 and plane1 stored for you as well.
plane0 <- outer(x, y, function(a, b){53.3447530 -0.6559841*a + 0.1981653*b})
plane1 <- outer(x, y, function(a, b){53.3447530 -0.6559841*a + 0.1981653*b - 8.9493214})
# draw the 3D scatterplot
p <- plot_ly(data = mario_kart, z = ~total_pr, x = ~duration, y = ~start_pr, opacity = 0.6) %>%
add_markers(color = ~cond)
# draw two planes
p %>%
add_surface(x = ~x, y = ~y, z = ~plane0, showscale = FALSE) %>%
add_surface(x = ~x, y = ~y, z = ~plane1, showscale = FALSE)
Parallel plane interpretation
The coefficients from our parallel planes model are shown below.
##
## Call:
## lm(formula = total_pr ~ duration + start_pr + cond, data = mario_kart)
##
## Coefficients:
## (Intercept) duration start_pr condused
## 53.3448 -0.6560 0.1982 -8.9493
Choose the right interpretation of \(\beta_3\):
Answer: The expected premium for new (relative to used) MarioKarts is $8.95, after controlling for the duration and starting price of the auction.
Interpretation of coefficient in a big model
This time we have thrown even more variables into our model, including the number of bids in each auction (n_bids) and the number of wheels. Unfortunately this makes a full visualization of our model impossible, but we can still interpret the coefficients.
##
## Call:
## lm(formula = total_pr ~ duration + start_pr + cond + wheels +
## n_bids, data = mario_kart)
##
## Coefficients:
## (Intercept) duration start_pr condused wheels n_bids
## 39.3741 -0.2752 0.1796 -4.7720 6.7216 0.1909
Choose the correct interpretation of the coefficient on the number of wheels:
Answer: Each additional wheel is associated with an increase in the expected auction price of $6.72, after controlling for auction duration, starting price, number of bids, and the condition of the item.
In this chapter you’ll learn about using logistic regression, a generalized linear model (GLM), to predict a binary outcome and classify observations.
Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary). Like all regression analyses, the logistic regression is a predictive analysis. Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.
When our response variable is binary, a linear regression model has several limitations. Among the more obvious—and logically incongruous—is that the regression line extends infinitely in either direction. This means that even though our response variable \(y\) only takes on the values 0 and 1, our fitted values \(\hat{y}\) can range anywhere from \(-\infty\) to \(\infty\). This doesn’t make sense.
To see this in action, we’ll fit a linear regression model to data about 55 students who applied to medical school. We want to understand how their undergraduate \(GPA\) relates to the probability they will be accepted by a particular school \((Acceptance)\).
The medical school acceptance data is loaded in your workspace as MedGPA.
library(Stat2Data)
data(MedGPA)
# scatterplot with jitter
data_space <- ggplot(data = MedGPA, aes(y = Acceptance, x = GPA)) +
geom_jitter(width = 0, height = 0.05, alpha = 0.5)
# linear regression line
data_space +
geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
In the previous exercise, we identified a major limitation to fitting a linear regression model when we have a binary response variable. However, it is not always inappropriate to do so. Note that our regression line only makes illogical predictions for students with very high or very low GPAs. For GPAs closer to average, the predictions seem fine.
Moreover, the alternative logistic regression model — which we will fit next — is very similar to the linear regression model for observations near the average of the explanatory variable. It just so happens that the logistic curve is very straight near its middle. Thus, in these cases a linear regression model may still be acceptable, even for a binary response.
# filter
MedGPA_middle <- MedGPA %>%
filter(GPA >= 3.375 & GPA <= 3.77)
# scatterplot with jitter
data_space <- ggplot(MedGPA_middle, aes(x = GPA, y = Acceptance)) +
geom_jitter(width = 0, height = 0.05, alpha = 0.5)
# linear regression line
data_space + geom_smooth(method = "lm", se = FALSE)
## `geom_smooth()` using formula 'y ~ x'
Logistic regression is a special case of a broader class of generalized linear models, often known as GLMs. Specifying a logistic regression model is very similar to specifying a linear regression model, with two important differences: we use the glm() function instead of lm(), and we have to specify the error distribution with the family argument (here, family = binomial).
Note that the mathematical model is now:
\(\log{ \left( \frac{y}{1-y} \right) } = \beta_0 + \beta_1 \cdot x + \epsilon \,,\)
where \(ϵ\) is the error term.
Use glm() to fit a logistic regression model for Acceptance as a function of GPA.
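A sketch of the glm() call that produces the output below:
# logistic regression: binomial family with the default logit link
mod <- glm(Acceptance ~ GPA, data = MedGPA, family = binomial)
mod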
##
## Call: glm(formula = Acceptance ~ GPA, family = binomial, data = MedGPA)
##
## Coefficients:
## (Intercept) GPA
## -19.207 5.454
##
## Degrees of Freedom: 54 Total (i.e. Null); 53 Residual
## Null Deviance: 75.79
## Residual Deviance: 56.84 AIC: 60.84
Our logistic regression model can be visualized in the data space by overlaying the appropriate logistic curve. We can use the geom_smooth() function to do this. Recall that geom_smooth() takes a method argument that allows you to specify what type of smoother you want to see. In our case, we need to specify that we want to use the glm() function to do the smoothing.
However we also need to tell the glm() function which member of the GLM family we want to use. To do this, we will pass the family argument to glm() as a list using the method.args argument to geom_smooth(). This mechanism is common in R, and allows one function to pass a list of arguments to another function.
# scatterplot with jitter
data_space <- ggplot(data = MedGPA, aes(y = Acceptance, x = GPA)) +
geom_jitter(width = 0, height = 0.05, alpha = 0.5)
# add logistic curve
data_space +
geom_smooth(method = "glm", se = FALSE, method.args = list(family = "binomial"))
## `geom_smooth()` using formula 'y ~ x'
One of the difficulties in working with a binary response variable is understanding how it “changes.” The response itself \((y)\) is either 0 or 1, while the fitted values \((\hat{y})\) — which are interpreted as probabilities — are between 0 and 1. But if every medical school applicant is either admitted or not, what does it mean to talk about the probability of being accepted?
What we’d like is a larger sample of students, so that for each GPA value (e.g. 3.54) we had many observations (say \(n\)), and we could then take the average of those \(n\) observations to arrive at the estimated probability of acceptance. Unfortunately, since the explanatory variable is continuous, this is hopeless—it would take an infinite amount of data to make these estimates robust.
Instead, what we can do is put the observations into bins based on their GPA value. Within each bin, we can compute the proportion of accepted students, and we can visualize our model as a smooth logistic curve through those binned values.
We have created a data.frame called MedGPA_binned that aggregates the original data into six bins based on the sextiles of GPA, recording the mean GPA and the acceptance rate within each bin.
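The binning code itself is not shown; a minimal sketch that is consistent with the output below (the break points are the GPA sextiles) is:
# sextile break points of GPA
gpa_breaks <- quantile(MedGPA$GPA, probs = (0:6) / 6)
gpa_breaks
# cut GPA into six bins and inspect the result
MedGPA <- MedGPA %>%
  mutate(bins = cut(GPA, breaks = gpa_breaks, include.lowest = TRUE))
head(MedGPA)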
## 0% 16.66667% 33.33333% 50% 66.66667% 83.33333% 100%
## 2.72 3.30 3.44 3.58 3.70 3.87 3.97
## Accept Acceptance Sex BCPM GPA VR PS WS BS MCAT Apps bins
## 1 D 0 F 3.59 3.62 11 9 9 9 38 5 (3.58,3.7]
## 2 A 1 M 3.75 3.84 12 13 8 12 45 3 (3.7,3.87]
## 3 A 1 F 3.24 3.23 9 10 5 9 33 19 [2.72,3.3]
## 4 A 1 F 3.74 3.69 12 11 7 10 40 5 (3.58,3.7]
## 5 A 1 F 3.53 3.38 9 11 4 11 35 11 (3.3,3.44]
## 6 A 1 M 3.59 3.72 10 9 7 10 36 5 (3.7,3.87]
MedGPA_binned <- MedGPA %>%
group_by(bins) %>%
summarize(mean_GPA = mean(GPA), acceptance_rate = mean(Acceptance))
## `summarise()` ungrouping output (override with `.groups` argument)
## # A tibble: 6 x 3
## bins mean_GPA acceptance_rate
## <fct> <dbl> <dbl>
## 1 [2.72,3.3] 3.11 0.2
## 2 (3.3,3.44] 3.39 0.2
## 3 (3.44,3.58] 3.54 0.75
## 4 (3.58,3.7] 3.65 0.333
## 5 (3.7,3.87] 3.79 0.889
## 6 (3.87,3.97] 3.91 1
Here we are plotting \(y\) as a function of \(x\), where that function is
\(y = \frac{\exp{( \hat{\beta}_0 + \hat{\beta}_1 \cdot x )}}{1 + \exp( \hat{\beta}_0 + \hat{\beta}_1 \cdot x ) } \,.\)
Note that the left hand side is the expected probability \(y\) of being accepted to medical school.
# binned points and line
data_space <- ggplot(data = MedGPA_binned, aes(x = mean_GPA, y = acceptance_rate)) +
geom_point() + geom_line()
mod <- glm(Acceptance ~ GPA, data = MedGPA, family = binomial)
MedGPA_plus <- mod %>%
augment(type.predict = "response")
MedGPA_plus <- MedGPA_plus %>%
mutate(odds_hat = .fitted / (1 - .fitted))
# logistic model on probability scale
data_space +
geom_line(data = MedGPA_plus, aes(x = GPA, y = .fitted), color = "red")
Everything starts with the concept of probability. Let’s say that the probability of success of some event is .8. Then the probability of failure is 1 – .8 = .2. The odds of success are defined as the ratio of the probability of success over the probability of failure. In our example, the odds of success are .8/.2 = 4. That is to say that the odds of success are 4 to 1. If the probability of success is .5, i.e., 50-50 percent chance, then the odds of success is 1 to 1.
The transformation from probability to odds is a monotonic transformation, meaning the odds increase as the probability increases, and vice versa. Probability ranges from 0 to 1. Odds range from 0 to positive infinity.
Why do we take all the trouble doing the transformation from probability to log odds? One reason is that it is usually difficult to model a variable which has restricted range, such as probability. This transformation is an attempt to get around the restricted range problem. It maps probability ranging between 0 and 1 to log odds ranging from negative infinity to positive infinity. Another reason is that among all of the infinitely many choices of transformation, the log of odds is one of the easiest to understand and interpret. This transformation is called logit transformation. The other common choice is the probit transformation.
For most people, the idea that we could estimate the probability of being admitted to medical school based on undergraduate GPA is fairly intuitive. However, thinking about how the probability changes as a function of GPA is complicated by the non-linear logistic curve. By translating the response from the probability scale to the odds scale, we make the right hand side of our equation easier to understand.
If the probability of getting accepted is \(y\), then the odds are \(y / (1-y)\). Expressions of probabilities in terms of odds are common in many situations, perhaps most notably gambling.
Here we are plotting \(y / (1-y)\) as a function of \(x\), where that function is
\(odds(\hat{y}) = \frac{\hat{y}}{1-\hat{y}} = \exp{( \hat{\beta}_0 + \hat{\beta}_1 \cdot x ) }\)
Note that the left hand side is the expected odds of being accepted to medical school. The right hand side is now a familiar exponential function of \(x\).
The MedGPA_binned data frame contains the data for each GPA bin, while the MedGPA_plus data frame records the original observations after being augment()-ed by mod.
# compute odds for bins
MedGPA_binned <- MedGPA_binned %>%
mutate(odds = acceptance_rate / (1 - acceptance_rate))
# plot binned odds
data_space <- ggplot(data = MedGPA_binned, aes(x = mean_GPA, y = odds)) +
geom_point() + geom_line()
# compute odds for observations
MedGPA_plus <- MedGPA_plus %>%
mutate(odds_hat = .fitted / (1 - .fitted))
# logistic model on odds scale
data_space +
geom_line(data = MedGPA_plus, aes(x = GPA, y = odds_hat), color = "red")
Previously, we considered two formulations of the logistic regression model: on the probability scale, where the fitted values are probabilities, and on the odds scale, where the fitted values are odds.
We’ll now add a third formulation: the log-odds (logit) scale, on which the model is a linear function of the explanatory variable.
As you can see, none of these three is uniformly superior. Most people tend to interpret the fitted values on the probability scale and the function on the log-odds scale. The interpretation of the coefficients is most commonly done on the odds scale. Recall that we interpreted our slope coefficient \(\beta_1\) in linear regression as the expected change in \(y\) given a one unit change in \(x\). On the probability scale, the function is non-linear and so this approach won’t work. On the log-odds scale, the function is linear, but the units are not interpretable (what does the \(\log\) of the odds mean??). However, on the odds scale, a one unit change in \(x\) leads to the odds being multiplied by a factor of \(e^{\beta_1}\). To see why, we form the odds ratio:
\(OR = \frac{odds(\hat{y} \mid x + 1)}{odds(\hat{y} \mid x)} = e^{\beta_1}\)
Thus, the exponentiated coefficient \(e^{\beta_1}\) tells us how the expected odds change for a one unit increase in the explanatory variable. It is tempting to interpret this as a change in the expected probability, but this is wrong and can lead to nonsensical predictions (e.g. expected probabilities greater than 1).
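As a quick check in R (mod is the fitted logistic regression from earlier), exponentiating the GPA coefficient gives the estimated odds ratio:
# odds ratio associated with a one-point increase in GPA
exp(coef(mod)["GPA"])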
# compute log odds for bins
MedGPA_binned <- MedGPA_binned %>%
mutate(log_odds = log(acceptance_rate / (1 - acceptance_rate)))
# plot binned log odds
data_space <- ggplot(data = MedGPA_binned, aes(x = mean_GPA, y = log_odds)) +
geom_point() + geom_line()
# compute log odds for observations
MedGPA_plus <- MedGPA_plus %>%
mutate(log_odds_hat = log(.fitted / (1 - .fitted)))
# logistic model on log odds scale
data_space +
geom_line(data = MedGPA_plus, aes(x = GPA, y = log_odds_hat), color = "red")
Interpretation of logistic regression
The fitted coefficient \(\hat{\beta}_1\) from the medical school logistic regression model is 5.45. The exponential of this is 233.73.
Donald’s GPA is 2.9, and thus the model predicts that the probability of him getting into medical school is 3.26%. The odds of Donald getting into medical school are 0.0337, or—phrased in gambling terms—29.6:1. If Donald hacks the school’s registrar and changes his GPA to 3.9, then which of the following statements is FALSE:
Just as we did with linear regression, we can use our logistic regression model to make predictions about new observations. In this exercise, we will use the newdata argument to the augment() function from the broom package to make predictions about students who were not in our original data set. These predictions are sometimes called out-of-sample.
Following our previous discussion about scales, with logistic regression it is important that we specify on which scale we want the predicted values. The default is the scale of the linear predictor (the log-odds), but we want our predictions on the probability scale, which is the scale of the response variable. The type.predict argument to augment() controls this behavior.
A logistic regression model object, mod, has been defined for you.
# create new data frame
new_data <- data.frame(GPA = 3.51)
# make predictions
augment(mod, newdata = new_data, type.predict = "response")
## # A tibble: 1 x 2
## GPA .fitted
## <dbl> <dbl>
## 1 3.51 0.484
Naturally, we want to know how well our model works. Did it predict acceptance for the students who were actually accepted to medical school? Did it predict rejections for the students who were not admitted? These types of predictions are called in-sample. One common way to evaluate models with a binary response is with a confusion matrix. [Yes, that is actually what it is called!]
However, note that while our response variable is binary, our fitted values are probabilities. Thus, we have to round them somehow into binary predictions. While the probabilities convey more information, we might ultimately have to make a decision, and so this rounding is common in practice. There are many different ways to round, but for simplicity we will predict admission if the fitted probability is greater than 0.5, and rejection otherwise.
First, we’ll use augment() to make the predictions, and then mutate() and round() to convert these probabilities into binary decisions. Then we will form the confusion matrix using the table() function. table() will compute a 2-way table when given a data frame with two categorical variables, so we will first use select() to grab only those variables.
You will find that this model made only 15 mistakes on these 55 observations, so it is nearly 73% accurate.
The model object mod is already in your workspace.
# data frame with binary predictions
tidy_mod <- augment(mod, type.predict = "response") %>%
mutate(Acceptance_hat = round(.fitted))
# confusion matrix
tidy_mod %>%
select(Acceptance, Acceptance_hat) %>%
table()
## Acceptance_hat
## Acceptance 0 1
## 0 16 9
## 1 6 24
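One way to compute the accuracy directly from these predictions (a short sketch using the objects above):
# overall in-sample accuracy
tidy_mod %>%
  summarize(accuracy = mean(Acceptance == Acceptance_hat))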
Explore the relationship between price and the quality of food, service, and decor for Italian restaurants in NYC.
Multiple regression can be an effective technique for understanding how a response variable changes as a result of changes to more than one explanatory variable. But it is not magic – understanding the relationships among the explanatory variables is also necessary, and will help us build a better model. This process is often called exploratory data analysis (EDA) and is covered in another DataCamp course.
One quick technique for jump-starting EDA is to examine all of the pairwise scatterplots in your data. This can be achieved using the pairs() function. Look for variables in the nyc data set that are strongly correlated, as those relationships will help us check for multicollinearity later on.
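A minimal pairs() sketch; the column selection is an assumption about which variables in nyc are of interest:
# pairwise scatterplots of price and the rating variables
nyc <- read.csv("nyc.csv")
pairs(nyc[, c("Price", "Food", "Decor", "Service", "East")])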
Which pairs of variables appear to be strongly correlated?
Based on your knowledge of the restaurant industry, do you think that the quality of the food in a restaurant is an important determinant of the price of a meal at that restaurant? It would be hard to imagine that it wasn’t. We’ll start our modeling process by plotting and fitting a model for Price as a function of Food.
On your own, interpret these coefficients and examine the fit of the model. What does the coefficient of Food mean in plain English? “Each additional rating point of food quality is associated with a…”
# Price by Food plot
nyc <- read.csv("nyc.csv")
ggplot(data = nyc, aes(x = Food, y = Price)) +
geom_point()
##
## Call:
## lm(formula = Price ~ Food, data = nyc)
##
## Coefficients:
## (Intercept) Food
## -17.832 2.939
Interpretation
Each additional rating point of food quality is associated with a price increase of about $2.94.
In real estate, a common mantra is that the three most important factors in determining the price of a property are “location, location, and location.” If location drives up property values and rents, then we might imagine that location would increase a restaurant’s costs, which would result in them having higher prices. In many parts of New York, the east side (east of 5th Avenue) is more developed and perhaps more expensive. [This is increasingly less true, but was more true at the time these data were collected.]
Let’s expand our model into a parallel slopes model by including the East variable in addition to Food.
Use lm() to fit a parallel slopes model for Price as a function of Food and East. Interpret the coefficients and the fit of the model. Can you explain the meaning of the coefficient on East in simple terms? Did the coefficient on Food change from the previous model? If so, why? Did it change by a lot or just a little?
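A sketch of the parallel slopes fit described above (its output is not reproduced here):
# price vs. food quality and location
lm(Price ~ Food + East, data = nyc)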
Identify the statement that is FALSE:
Each additional rating point of food quality is associated with a $2.88 increase in the expected price of a meal, after controlling for location.
The premium for an Italian restaurant in NYC associated with being on the east side of 5th Avenue is $1.46, after controlling for the quality of the food.
The change in the coefficient of food from 2.94 in the simple linear model to 2.88 in this model has profound practical implications for restaurant owners.
One reason that many people go to a restaurant—apart from the food—is that they don’t have to cook or clean up. Many people appreciate the experience of being waited upon, and we can all agree that the quality of the service at restaurants varies widely. Are people willing to pay more for better restaurant Service? More interestingly, are they willing to pay more for better service, after controlling for the quality of the food?
Multiple regression gives us a way to reason about these questions. Fit the model with Food and Service and interpret the coefficients and fit. Did the coefficient on Food change from the previous model? What do the coefficients on Food and Service tell you about how these restaurants set prices?
Next, let’s visually assess our model using plotly. The x and y vectors, as well as the plane matrix, have been created for you.
##
## Call:
## lm(formula = Price ~ Food + Service, data = nyc)
##
## Coefficients:
## (Intercept) Food Service
## -21.159 1.495 1.704
# draw 3D scatterplot
p <- plot_ly(data = nyc, z = ~Price, x = ~Food, y = ~Service, opacity = 0.6) %>%
add_markers()
# draw a plane
p %>%
add_surface(x = ~x, y = ~y, z = ~plane, showscale = FALSE)
Collinearity
In statistics, multicollinearity (also collinearity) is a phenomenon in which one feature variable in a regression model is highly linearly correlated with another feature variable.
Perfect collinearity is the special case in which two or more variables are exactly linearly related. When this happens, the regression coefficients are not uniquely determined, which hurts the interpretability of the model: each coefficient picks up influence from the other correlated features.
We have explored models that included the quality of both food and service, as well as location, but we haven’t put these variables all into the same model. Let’s now build a parallel planes model that incorporates all three variables.
Examine the coefficients closely. Do they make sense based on what you understand about these data so far? How did the coefficients change from the previous models that you fit?
Use lm() to fit a parallel planes model for Price as a function of Food, Service, and East.
##
## Call:
## lm(formula = Price ~ Food + Service + East, data = nyc)
##
## Coefficients:
## (Intercept) Food Service East
## -20.8155 1.4863 1.6647 0.9649
Interpretation of location coefficient
Which of the following statements is FALSE of the fitted coefficients from the parallel planes model?
The premium for being on the East side of 5th Avenue is just less than a dollar, after controlling for the quality of food and service.
The impact of location is relatively small, since one additional rating point of either food or service would result in a higher expected price than moving a restaurant from the West side to the East side.
The expected price of a meal on the East side is about 96% of the cost of a meal on the West side, after controlling for the quality of food and service.
The impact of location brings us to a modeling question: should we keep this variable in our model? In a later course, you will learn how we can conduct formal hypothesis tests to help us answer that question. In this course, we will focus on the size of the effect. Is the impact of location big or small?
One way to think about this would be in terms of the practical significance. Is the value of the coefficient large enough to make a difference to your average person? The units are in dollars so in this case this question is not hard to grasp.
Another way is to examine the impact of location in the context of the variability of the other variables. We can do this by building our parallel planes in 3D and seeing how far apart they are. Are the planes close together or far apart? Does the East variable clearly separate the data into two distinct groups? Or are the points all mixed up together?
# draw 3D scatterplot
p <- plot_ly(data = nyc, z = ~Price, x = ~Food, y = ~Service, opacity = 0.6) %>%
add_markers(color = ~factor(East))
# draw two planes
p %>%
add_surface(x = ~x, y = ~y, z = ~plane0, showscale = FALSE) %>%
add_surface(x = ~x, y = ~y, z = ~plane1, showscale = FALSE)
Full model
One variable we haven’t considered is Decor. Do people, on average, pay more for a meal in a restaurant with nicer decor? If so, does it still matter after controlling for the quality of food, service, and location?
By adding a third numeric explanatory variable to our model, we lose the ability to visualize the model in even three dimensions. Our model is now a hyperplane – or rather, parallel hyperplanes – and while we won’t go any further with the geometry, know that we can continue to add as many variables to our model as we want. As humans, our spatial visualization ability taps out after three numeric variables (maybe you could argue for four, but certainly no further), but neither the mathematical equation for the regression model, nor the formula specification for the model in R, is bothered by the higher dimensionality.
Use lm() to fit a parallel planes model for Price as a function of Food, Service, Decor, and East.
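A sketch of the full parallel planes fit whose coefficients appear below:
# price vs. food, service, decor, and location
lm(Price ~ Food + Service + Decor + East, data = nyc)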
##
## Call:
## lm(formula = Price ~ Food + Service + Decor + East, data = nyc)
##
## Coefficients:
## (Intercept) Food Service Decor East
## -24.023800 1.538120 -0.002727 1.910087 2.068050
Notice the dramatic change in the value of the Service coefficient.
Which of the following interpretations is invalid?
Since the quality of food, decor, and service were all strongly correlated, multicollinearity is the likely explanation.
Once we control for the quality of food, decor, and location, the additional information conveyed by service is negligible.
Service is not an important factor in determining the price of a meal.
None of the above.