Regression is the workhorse of statistical analysis. At its very simplest, bivariate regression is the fitting of a line through the middle of a set of points defined by two variables, one on the y-axis and the other on the x-axis. The bivariate regression line can be thought of as a two-dimensional mean, representing the middle of the points.
The idea behind simple bivariate regression is to estimate (or predict) a dependent variable using an explanatory variable (also known as an independent variable). As you can probably already see, regression is related to correlation, but it is not correlation. In R, a simple regression model takes the form:
lm(DependentVariable ~ ExplanatoryVariable, data = DataSet)
The dependent variable (the thing that you are researching) is always on the left and the explanatory variable is always on the right. (Later we will see that you can have more than one explanatory variable on the right side.) Put simply, you are trying to predict the dependent variable with an equation that uses the explanatory variable. The regression procedure (in this case R's lm() function) creates this equation.
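In equation form, what lm() estimates is simply the intercept and slope of a line (plus an error term for what the line misses):
DependentVariable = Intercept + Slope * ExplanatoryVariable + Error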
We will be using National Basketball Association (NBA) data for our examples of regression analysis. The data come from Basketball Reference (https://www.basketball-reference.com/) and cover the 2018-2019 season. The variables used below include the team names (Teams), Wins, AvgPoints, TwoPointPct, and Playoff (coded 1 if the team made the playoffs).
Below is the data set:
# view the data (datatable() comes from the DT package)
library(DT)
datatable(NBA)
(Interactive table: 2018-19 NBA Season)
First we will examine the correlations between the numeric variables in the data set.
library(GGally)  # ggpairs() comes from the GGally package
ggpairs(NBA[, 2:6])
(Figure: a correlation/scatter plot matrix of the numeric variables)
We want to estimate (predict) Wins, and the correlation/scatter plot matrix suggests that TwoPointPct is the best variable for doing this (rather oddly, this goes against the modern theory of the game, which emphasizes 3-pointers). To make the results more interpretable, we will scale the explanatory variable (TwoPointPct) to have a mean of 0 and a standard deviation of 1.
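As a side note, scale() is just standardizing the variable; here is a minimal sketch of the same computation done by hand (z is only an illustrative name):
# standardize by hand: subtract the mean, then divide by the standard deviation
z = (NBA$TwoPointPct - mean(NBA$TwoPointPct)) / sd(NBA$TwoPointPct)
mean(z)  # essentially 0
sd(z)    # 1
Here are the results of the simple regression analysis: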
Jabbar = lm(Wins ~ scale(TwoPointPct), data = NBA)
summary(Jabbar)
##
## Call:
## lm(formula = Wins ~ scale(TwoPointPct), data = NBA)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -28.461  -6.459   3.316   7.707  15.927
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)          41.000      2.043   20.07   <2e-16 ***
## scale(TwoPointPct)    4.882      2.078    2.35    0.026 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 11.19 on 28 degrees of freedom
## Multiple R-squared: 0.1648, Adjusted R-squared: 0.1349
## F-statistic: 5.523 on 1 and 28 DF, p-value: 0.02605
Starting with the coefficients: when the explanatory variable is not scaled the intercept is meaningless, but because it is scaled here the intercept represents the average number of games the 30 teams won during the 2018-19 season (41). The estimate for TwoPointPct tells us that if a team increased its 2-point percentage by one standard deviation (here 0.022), it would be estimated to win 4.8 more games in the 2018-19 season. The p-value for this estimate is 0.026, which is less than 0.05, so we accept it as statistically significant, which is another way of saying that we trust that the estimate (4.8 more wins) is not actually 0 more wins.
Lastly, we look at the R². It is 0.164, which means the model accounts for (explains) 16.4 percent of the variation in Wins. The relationship between Wins and TwoPointPct is a weak one. An interesting fact: if you go back up to the correlation matrix, look at the correlation between Wins and TwoPointPct (0.406), and square this number (take it to the 2nd power), you will get 0.164. As noted, regression is related to correlation, but it is not correlation.
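You can check this relationship in R with one line; a quick verification, assuming the NBA data set from above:
# the squared Pearson correlation equals the Multiple R-squared from summary(Jabbar)
cor(NBA$Wins, NBA$TwoPointPct)^2  # roughly 0.164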
Simple bivariate logistic regression differs from simple bivariate regression in several ways. The primary difference is that it attempts to predict categories or classes: pass vs. fail, woman vs. man, freshman vs. senior, TTU students vs. UT students, and so on. In our example we will try to predict whether a team made the NBA playoffs in 2019, using AvgPoints as the explanatory variable.
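Under the hood, logistic regression does this by modeling the probability of class membership with the logistic (sigmoid) function rather than a straight line; in equation form for our example:
Probability(Playoff) = 1 / (1 + exp(-(Intercept + Slope * AvgPoints)))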
Bird = glm(Playoff ~ AvgPoints, data = NBA, family = binomial)
summary(Bird)
##
## Call:
## glm(formula = Playoff ~ AvgPoints, family = binomial, data = NBA)
##
## Deviance Residuals:
##     Min      1Q  Median      3Q     Max
## -1.6100 -1.0726  0.7356  1.1043  1.3594
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) -16.3340    11.3821  -1.435    0.151
## AvgPoints     0.1550     0.1071   1.447    0.148
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 41.455 on 29 degrees of freedom
## Residual deviance: 39.168 on 28 degrees of freedom
## AIC: 43.168
##
## Number of Fisher Scoring iterations: 4
round((exp(coef(Bird)) - 1) * 100, 3)
## (Intercept) AvgPoints
## -100.000 16.763
# convert the predicted probabilities to Yes/No labels using a 0.5 cutoff
probabilities = predict(Bird, type = "response")
PlayoffsEstimates = ifelse(probabilities > 0.5, "Yes", "No")
PlayoffsReal = ifelse(NBA$Playoff == 1, "Yes", "No")
table(PlayoffsReal, PlayoffsEstimates)
##             PlayoffsEstimates
## PlayoffsReal No Yes
##          No   8   6
##          Yes  5  11
cbind(NBA$Teams, PlayoffsReal, PlayoffsEstimates)
## PlayoffsReal PlayoffsEstimates
## 1 "Atlanta Hawks" "No" "No"
## 2 "Boston Celtics" "Yes" "No"
## 3 "Brooklyn Nets" "Yes" "Yes"
## 4 "Charlotte Hornets" "No" "Yes"
## 5 "Chicago Bulls" "No" "No"
## 6 "Cleveland Cavaliers" "No" "Yes"
## 7 "Dallas Mavericks" "No" "No"
## 8 "Denver Nuggets" "Yes" "Yes"
## 9 "Detroit Pistons" "Yes" "No"
## 10 "Golden State Warriors" "Yes" "Yes"
## 11 "Houston Rockets" "Yes" "Yes"
## 12 "Indiana Pacers" "Yes" "Yes"
## 13 "Los Angeles Clippers" "Yes" "Yes"
## 14 "Los Angeles Lakers" "No" "Yes"
## 15 "Memphis Grizzlies" "No" "No"
## 16 "Miami Heat" "No" "No"
## 17 "Milwaukee Bucks" "Yes" "Yes"
## 18 "Minnesota Timberwolves" "No" "Yes"
## 19 "New Orleans Pelicans" "No" "Yes"
## 20 "New York Knicks" "No" "No"
## 21 "Oklahoma City Thunder" "Yes" "Yes"
## 22 "Orlando Magic" "Yes" "No"
## 23 "Philadelphia 76ers" "Yes" "Yes"
## 24 "Phoenix Suns" "No" "No"
## 25 "Portland Trail Blazers" "Yes" "Yes"
## 26 "Sacramento Kings" "No" "No"
## 27 "San Antonio Spurs" "Yes" "No"
## 28 "Toronto Raptors" "Yes" "Yes"
## 29 "Utah Jazz" "Yes" "No"
## 30 "Washington Wizards" "No" "Yes"
Instead of the lm() function we use the glm() function with a binomial distribution to estimate the logistic regression equation. Looking at the estimates, we see that they are in log-odds units, so below we transform them into something a bit easier to interpret. If we look at the p-value for the AvgPoints estimate, we see that it is 0.148, which is greater than 0.05, so we are going to assume that AvgPoints has no effect on teams reaching the playoffs in 2019.
Even though we do not trust that the AvgPoints estimate (0.155) is statistically significant, just as an example we are going to interpret it anyway. We transform it first, with round((exp(coef(Bird)) - 1) * 100, 3), to turn it into a percent change in the odds. The transformed estimate is 16.7 percent, which means that if a team's scoring average increases by one point, its odds of making the playoffs in 2019 are estimated to increase by 16.7 percent. But this estimate is insignificant, so we assume that the odds will actually increase by 0 percent.
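To see where that number comes from, here is the arithmetic done by hand on the AvgPoints coefficient (b is just an illustrative name):
# exponentiating a log-odds coefficient gives an odds ratio; subtracting 1
# and multiplying by 100 turns it into a percent change in the odds
b = coef(Bird)["AvgPoints"]  # 0.1550
(exp(b) - 1) * 100           # 16.763, the value shown above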
How accurately did the model predict playoff and non-playoff teams? It correctly predicted 8 non-playoff teams and 11 playoff teams. Of course, only 16 teams can make the playoffs, and the model predicted that 17 (6 + 11) would, so that is a problem. In general this is not a very good model; it is 63.3 percent accurate (19/30), which might seem good until you realize that just by chance you should be at least 50 percent accurate.
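The accuracy can be computed directly from the prediction vectors created above:
# proportion of teams whose predicted playoff status matches the real one
mean(PlayoffsEstimates == PlayoffsReal)  # (8 + 11) / 30 = 0.6333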
The last table shows the actual playoff and non-playoff teams alongside their predictions. Of the six teams that were estimated to make the playoffs but actually did not, the defensive rankings were 17 (Charlotte), 21 (LA Lakers), 23 (Minnesota), 24 (Cleveland), 27 (New Orleans), and 29 (Washington). Evidently, making the playoffs takes more than just scoring lots of points.