Multiple regression is one of the most widely used statistical analysis techniques. Like simple bivariate regression, the idea is to estimate (or predict) a dependent variable. The difference is that multiple regression uses more than one explanatory variable (a.k.a. independent variable) to estimate (predict) the dependent variable. In R, multiple regression takes the following form:
lm(DepVariable ~ ExplanVariable1 + ExplanVariable2 + ExplanVariable…, data = DataSet)
Here we are trying to predict the dependent variable with an equation that uses the explanatory variables. In the following we will use R’s lm() and glm() functions to create these equations.
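As a minimal sketch of the general form above, the following fits a two-predictor model to R’s built-in mtcars data. The data set and the variables mpg, wt, and hp are assumptions chosen only for illustration; they are not the NBA data used in the rest of this tutorial.

```r
# Illustration only: built-in mtcars data, not the NBA data set
# Predict fuel economy (mpg) from weight (wt) and horsepower (hp)
fit <- lm(mpg ~ wt + hp, data = mtcars)

# One coefficient per explanatory variable, plus the intercept
coef(fit)

# Predicted mpg for a hypothetical car with wt = 3 (3,000 lbs) and hp = 110
predict(fit, newdata = data.frame(wt = 3, hp = 110))
```

The fitted object plays the role of the "equation": coef() returns its terms and predict() evaluates it for new values of the explanatory variables.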
##Data
We will be using National Basketball Association (NBA) data for our examples of multiple regression and multiple logistic regression. The data come from Basketball Reference (https://www.basketball-reference.com/) and cover the 2018-19 season. The variables used in the examples below are Team, Wins, Playoff, TwoPointPctTeam, ThreePointPctTeam, TwoPointPctOpponent, and ThreePointPctOpponent.
The idea here is simple: a team wants its own three-point and two-point percentages to be high, and it wants to limit its opponents’ three-point and two-point percentages.
Below is the data set:
#view data (datatable() comes from the DT package)
library(DT)
datatable(NBA1819)
2018-19 NBA Season
Next we will examine the correlations between the numeric variables in the data set.
library(GGally)  # provides ggpairs()
ggpairs(NBA1819[, 2:6])
Correlation Matrix
Our goal is to estimate (predict) Wins during the 2018-19 season. The correlation matrix shows that Wins has relatively strong positive correlations with ThreePointPctTeam and TwoPointPctTeam and relatively strong negative correlations with ThreePointPctOpponent and TwoPointPctOpponent. We will use all four of these variables to estimate (predict) Wins.
Tommy = lm(Wins ~ scale(TwoPointPctTeam) + scale(ThreePointPctTeam) +
scale(TwoPointPctOpponent) + scale(ThreePointPctOpponent),
data = NBA1819)
summary(Tommy)
##
## Call:
## lm(formula = Wins ~ scale(TwoPointPctTeam) + scale(ThreePointPctTeam) +
## scale(TwoPointPctOpponent) + scale(ThreePointPctOpponent),
## data = NBA1819)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.0922 -3.9831 -0.8699 3.4566 12.0016
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 41.000 1.039 39.448 < 2e-16 ***
## scale(TwoPointPctTeam) 4.221 1.208 3.495 0.001790 **
## scale(ThreePointPctTeam) 4.393 1.130 3.888 0.000659 ***
## scale(TwoPointPctOpponent) -4.029 1.231 -3.272 0.003114 **
## scale(ThreePointPctOpponent) -3.454 1.185 -2.914 0.007408 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.693 on 25 degrees of freedom
## Multiple R-squared: 0.8069, Adjusted R-squared: 0.776
## F-statistic: 26.12 on 4 and 25 DF, p-value: 1.308e-08
#Standard Deviations
sd(NBA1819$TwoPointPctTeam); sd(NBA1819$ThreePointPctTeam)
## [1] 0.02018822
## [1] 0.01532127
sd(NBA1819$TwoPointPctOpponent); sd(NBA1819$ThreePointPctOpponent)
## [1] 0.01750534
## [1] 0.01131655
In this multiple regression model (equation), all four coefficients are statistically significant (i.e., their p-values are all less than 0.05), so we reject the hypothesis that they are actually zero (0), which would have meant that the variables had no association with winning basketball games.
The regression coefficients are interpreted a bit differently in multiple regression than in simple bivariate regression. Because there is more than one explanatory variable in the model (the equation), each coefficient must be interpreted with the other explanatory variables held constant. And because the explanatory variables were standardized with scale(), each coefficient is the change in wins associated with a one standard deviation increase in that variable. So, controlling for the other explanatory variables in the model, a one standard deviation increase in a team’s two-point percentage (an increase of 0.020) is associated with 4.22 more wins over a season; a one standard deviation increase in a team’s three-point percentage (an increase of 0.015) is associated with 4.39 more wins; a one standard deviation increase in its opponents’ two-point percentage (an increase of 0.018) is associated with 4.03 fewer wins; and a one standard deviation increase in its opponents’ three-point percentage (an increase of 0.011) is associated with 3.45 fewer wins. ThreePointPctTeam has the largest standardized coefficient, which suggests that the most important component in winning is the percentage of three-point shots that a team makes, although its margin over two-point percentage (4.39 versus 4.22) is small relative to the standard errors. If you follow the NBA, you know that this is the current theory of winning.
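The link between a coefficient on scale(x) and the coefficient on the raw variable can be checked directly: the standardized slope equals the raw slope multiplied by sd(x). The sketch below shows this with R’s built-in mtcars data, which is an assumption made only so the example runs on its own; it is not the NBA data set.

```r
# Illustration only: built-in mtcars data, not the NBA data set
raw <- lm(mpg ~ wt + hp, data = mtcars)                # predictors in original units
std <- lm(mpg ~ scale(wt) + scale(hp), data = mtcars)  # predictors in SD units

# A one-SD change in x is a change of sd(x) original units, so:
# standardized slope = raw slope * sd(x)
raw_times_sd <- coef(raw)[c("wt", "hp")] * c(sd(mtcars$wt), sd(mtcars$hp))

# The two columns should agree row by row
cbind(standardized = unname(coef(std)[-1]), raw_times_sd)

# With centered predictors, the intercept equals the mean of the outcome
c(intercept = unname(coef(std)[1]), mean_mpg = mean(mtcars$mpg))
```

The last line is also why the intercept in the NBA model is 41.000: with every predictor centered by scale(), the intercept is the average number of wins.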
plot(predict(Tommy), NBA1819$Wins,
xlab="Predicted Wins",ylab="Actual Wins")
abline(a=0,b=1)
Plot of predicted wins vs. actual wins
The plot of actual wins versus predicted wins shows that the predicted wins do not deviate very far from the actual wins (the points are close to the 45-degree line). This is also reflected in the model’s R² (0.807), which shows that the model accounts for 80.7 percent of the variation in wins. In multiple regression models the adjusted R² (0.776) is a better measure of fit because it penalizes the model for each additional explanatory variable. Both measures show that this is a strong model.
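The adjusted figure can be reproduced by hand from the numbers in the summary output. This is a small sketch using the standard adjusted R-squared formula; the values of n and p are read off the output (25 residual degrees of freedom plus 5 estimated coefficients gives 30 teams).

```r
# Values taken from the summary(Tommy) output above
r2 <- 0.8069   # multiple R-squared
n  <- 30       # teams: 25 residual df + 5 estimated coefficients
p  <- 4        # explanatory variables

# Adjusted R-squared shrinks R-squared according to the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 3)   # 0.776, matching the reported adjusted R-squared
```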
##Multiple Logistic Regression
Hilary Parker, a data scientist for the Biden campaign: “We did a lot of logistic regression. As you can imagine there are a lot of binary outcomes.”
Roger Peng, Professor, Department of Biostatistics, Johns Hopkins University: “It is the bread and butter of data science.”
The Not So Standard Deviations podcast, episode 119
Going to the NBA playoffs is a binary outcome: teams either make the playoffs or they do not. Every season, 16 NBA teams make the playoffs while the other 14 stay at home. We are going to build a multiple logistic regression model to estimate (predict) which 16 teams made the 2019 NBA playoffs. The dependent variable is Playoff (1 = made it, 0 = did not), and the explanatory variables are the same ones used in the multiple regression model above, because to make the playoffs you have to win games during the regular season.
Heinsohn = glm(Playoff ~ TwoPointPctTeam + ThreePointPctTeam + TwoPointPctOpponent +
ThreePointPctOpponent, family = binomial, data = NBA1819)
summary(Heinsohn)
##
## Call:
## glm(formula = Playoff ~ TwoPointPctTeam + ThreePointPctTeam +
## TwoPointPctOpponent + ThreePointPctOpponent, family = binomial,
## data = NBA1819)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.71146 -0.15790 0.05172 0.40288 1.76705
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 86.51 83.53 1.036 0.3004
## TwoPointPctTeam 13.03 45.93 0.284 0.7767
## ThreePointPctTeam 108.62 52.32 2.076 0.0379 *
## TwoPointPctOpponent -104.73 66.75 -1.569 0.1166
## ThreePointPctOpponent -219.48 130.04 -1.688 0.0915 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 41.455 on 29 degrees of freedom
## Residual deviance: 15.353 on 25 degrees of freedom
## AIC: 25.353
##
## Number of Fisher Scoring iterations: 7
#McFadden's pseudo R2 (r2_mcfadden() comes from the performance package)
library(performance)
r2_mcfadden(Heinsohn)
## # R2 for Generalized Linear Regression
##
## R2: 0.630
## adj. R2: 0.581
#Change model coefficients into percent changes
a = exp(coef(Heinsohn))
round(100 * (a - 1))
## (Intercept) TwoPointPctTeam ThreePointPctTeam
## 3.726220e+39 4.559684e+07 1.486835e+49
## TwoPointPctOpponent ThreePointPctOpponent
## -1.000000e+02 -1.000000e+02
The model shows that only the coefficient for ThreePointPctTeam is statistically significant at the 0.05 level. Despite there being only one statistically significant explanatory variable in the model, the pseudo R² and the adjusted pseudo R² are rather high (0.630 and 0.581). This suggests that this is a strong model.
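The unadjusted pseudo R-squared can be recovered from the two deviances printed in the glm summary. The sketch below assumes the standard McFadden definition, 1 - logLik(model) / logLik(null); it does not attempt to reproduce the adjusted value, whose exact penalty depends on the package.

```r
# Deviances taken from the summary(Heinsohn) output above
null_dev  <- 41.455   # intercept-only model
resid_dev <- 15.353   # model with the four shooting percentages

# For a 0/1 outcome, deviance = -2 * log-likelihood, so McFadden's
# 1 - logLik(model) / logLik(null) reduces to a ratio of deviances
mcfadden <- 1 - resid_dev / null_dev
round(mcfadden, 3)   # 0.63, matching the unadjusted R2 reported above
```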
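The transformed multiple logistic regression coefficients are extreme because the shooting percentages are stored as proportions, so a one-unit increase means going from 0 to 1 (from 0 percent to 100 percent of shots made). Controlling for the other explanatory variables in the model, such a one-unit increase in a team’s three-point percentage would raise the odds of making the playoffs by about 1.5 × 10^49 percent, which is not a realistic change. On a realistic scale, a one percentage point increase (0.01) multiplies the odds by exp(108.62 × 0.01) ≈ 2.96, so it roughly triples the odds of making the playoffs. Once again, the most important thing in the modern NBA is making 3-pointers.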
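To put every coefficient on a more realistic scale, the percent change in the odds can be computed for a one percentage point (0.01) change instead of a full unit. This is a sketch that copies the rounded coefficients from the printed summary rather than calling coef(Heinsohn), so that it runs without the data set.

```r
# Coefficients copied from the summary(Heinsohn) output above (log-odds scale)
b <- c(TwoPointPctTeam       =   13.03,
       ThreePointPctTeam     =  108.62,
       TwoPointPctOpponent   = -104.73,
       ThreePointPctOpponent = -219.48)

# The percentages are stored as proportions, so one percentage point = 0.01
odds_ratio <- exp(b * 0.01)

# Percent change in the odds of making the playoffs per percentage point
round(100 * (odds_ratio - 1))
```

With the fitted model available, replacing b with coef(Heinsohn)[-1] gives the same calculation at full precision.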
probabilities = predict(Heinsohn, type = "response")
PlayoffsEstimates = ifelse(probabilities > 0.5, "Yes", "No")
PlayoffsReal = ifelse(NBA1819$Playoff == 1, "Yes", "No")
table(PlayoffsReal, PlayoffsEstimates)
## PlayoffsEstimates
## PlayoffsReal No Yes
## No 12 2
## Yes 2 14
cbind(NBA1819$Team, PlayoffsReal, PlayoffsEstimates)
## PlayoffsReal PlayoffsEstimates
## 1 "Atlanta Hawks" "No" "No"
## 2 "Boston Celtics" "Yes" "Yes"
## 3 "Brooklyn Nets" "Yes" "Yes"
## 4 "Charlotte Hornets" "No" "No"
## 5 "Chicago Bulls" "No" "No"
## 6 "Cleveland Cavaliers" "No" "No"
## 7 "Dallas Mavericks" "No" "No"
## 8 "Denver Nuggets" "Yes" "Yes"
## 9 "Detroit Pistons" "Yes" "No"
## 10 "Golden State Warriors" "Yes" "Yes"
## 11 "Houston Rockets" "Yes" "Yes"
## 12 "Indiana Pacers" "Yes" "Yes"
## 13 "Los Angeles Clippers" "Yes" "Yes"
## 14 "Los Angeles Lakers" "No" "No"
## 15 "Memphis Grizzlies" "No" "No"
## 16 "Miami Heat" "No" "Yes"
## 17 "Milwaukee Bucks" "Yes" "Yes"
## 18 "Minnesota Timberwolves" "No" "No"
## 19 "New Orleans Pelicans" "No" "No"
## 20 "New York Knicks" "No" "No"
## 21 "Oklahoma City Thunder" "Yes" "No"
## 22 "Orlando Magic" "Yes" "Yes"
## 23 "Philadelphia 76ers" "Yes" "Yes"
## 24 "Phoenix Suns" "No" "No"
## 25 "Portland Trail Blazers" "Yes" "Yes"
## 26 "Sacramento Kings" "No" "Yes"
## 27 "San Antonio Spurs" "Yes" "Yes"
## 28 "Toronto Raptors" "Yes" "Yes"
## 29 "Utah Jazz" "Yes" "Yes"
## 30 "Washington Wizards" "No" "No"
The model incorrectly estimated (predicted) that the Pistons and the Thunder would not make the playoffs when in fact they did, and it estimated (predicted) that the Heat and the Kings would make the playoffs when in fact they did not. The model got 14 of the 16 playoff teams right and 12 of the 14 non-playoff teams right (26 of 30 overall), which is not too bad.
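The counts in the classification table can be summarized with a few standard rates. The sketch below rebuilds the table from the printed counts (so it runs without the data set) and computes accuracy, sensitivity, and specificity; the object name conf is a hypothetical chosen for this example.

```r
# Counts copied from the table(PlayoffsReal, PlayoffsEstimates) output above
conf <- matrix(c(12,  2,
                  2, 14),
               nrow = 2, byrow = TRUE,
               dimnames = list(PlayoffsReal      = c("No", "Yes"),
                               PlayoffsEstimates = c("No", "Yes")))

accuracy    <- sum(diag(conf)) / sum(conf)               # correct calls / all teams
sensitivity <- conf["Yes", "Yes"] / sum(conf["Yes", ])   # playoff teams predicted "Yes"
specificity <- conf["No", "No"]   / sum(conf["No", ])    # other teams predicted "No"

round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity), 3)
```

With the original objects in memory, the same matrix is returned by table(PlayoffsReal, PlayoffsEstimates), so the three rates can be computed from it directly.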
If this model considered which conference a team was in (East or West), would it make even better predictions?