Prompt

Using R, build a regression model for data that interests you. Conduct residual analysis. Was the linear model appropriate? Why or why not?

Regression Model

I chose to use some NBA team data easily accessed using the \(NBAloveR\) package. The goal is to determine the relationship between the Offensive Rating, Defensive Rating, and Rating Difference as compared to the team’s win total in a given season. Rating Difference is a simple derived variable.

Load Libraries

library(tidyverse)
library(ggplot2)
library(NBAloveR)

Load Data

As the package only allows to pull data from a single franchise, I selected five just to keep the code readable. I selected the Boston Celtics, Los Angeles Lakers, Chicago Bulls, San Antonio Spurs, and Philadelphia 76ers.

bos <- getTeamHistory(team_code = "bos")
lal <- getTeamHistory(team_code = "lal")
chi <- getTeamHistory(team_code = "chi")
sas <- getTeamHistory(team_code = "sas")
phi <- getTeamHistory(team_code = "phi")

nba <- rbind(bos, lal)
nba <- rbind(nba, chi)
nba <- rbind(nba, sas)
nba <- rbind(nba, phi)

Data Derivation

Derive the attribute Rating Difference.

nba$RtgDif <- nba$ORtg - nba$DRtg

head(nba)
##    Season  Lg           Team  W  L  W/L%   Finish  SRS Pace RelativePace  ORtg
## 1 2020-21 NBA Boston Celtics 30 26 0.536 3rd of 5 2.40 97.9         -1.3 114.3
## 2 2019-20 NBA Boston Celtics 48 24 0.667 2nd of 5 5.83 99.5         -0.8 113.3
## 3 2018-19 NBA Boston Celtics 49 33 0.598 3rd of 5 3.90 99.6         -0.4 112.2
## 4 2017-18 NBA Boston Celtics 55 27 0.671 2nd of 5 3.23 96.0         -1.3 107.6
## 5 2016-17 NBA Boston Celtics 53 29 0.646 1st of 5 2.25 96.8          0.4 111.2
## 6 2015-16 NBA Boston Celtics 48 34 0.585 2nd of 5 2.84 98.5          2.7 106.8
##   RelativeORtg  DRtg RelativeDRtg               Playoffs            Coaches
## 1          2.2 112.2          0.1                        B. Stevens (30-26)
## 2          2.7 107.0         -3.6   Lost E. Conf. Finals B. Stevens (48-24)
## 3          1.8 107.8         -2.6    Lost E. Conf. Semis B. Stevens (49-33)
## 4         -1.0 103.9         -4.7   Lost E. Conf. Finals B. Stevens (55-27)
## 5          2.4 108.4         -0.4   Lost E. Conf. Finals B. Stevens (53-29)
## 6          0.4 103.6         -2.8 Lost E. Conf. 1st Rnd. B. Stevens (48-34)
##        TopWinShare RtgDif
## 1   J. Tatum (4.9)    2.1
## 2   J. Tatum (6.9)    6.3
## 3  K. Irving (9.1)    4.4
## 4  K. Irving (8.9)    3.7
## 5 I. Thomas (12.5)    2.8
## 6  I. Thomas (9.7)    3.2

EDA Plots

Plot each rating attribute against team wins.

nba %>%
  ggplot(aes(x=ORtg, y=W)) + 
  geom_point() + 
  labs(title = 'Offensive Rating vs Wins') + geom_smooth(method='lm', formula= y~x)

nba %>%
  ggplot(aes(x=DRtg, y=W)) + 
  geom_point() + 
  labs(title = 'Defensive Rating vs Wins') + geom_smooth(method='lm', formula= y~x)

nba %>%
  ggplot(aes(x=RtgDif, y=W)) + 
  geom_point() + 
  labs(title = 'Rating Difference vs Wins') + geom_smooth(method='lm', formula= y~x)

For Offensive Rating, a slight positive linear trend is visible, yet the scatterplot indicates wide variability in the relationship. Similar for Defensive Rating, except the linear relationship is negative. As for Rating Difference, the scatterplot shows a strong positive relationship between the rating difference and the team wins.

Correlation

corr <- cor(nba$RtgDif, nba$W, use = "complete.obs")
corr
## [1] 0.9151873

The correlation between Rating Difference and team wins is above 90%, indicating a high correlation between the two variables.

Linear Regression Models

model_ortg <- lm(W ~ ORtg, data=nba)
summary(model_ortg)
## 
## Call:
## lm(formula = W ~ ORtg, data = nba)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.914  -8.165   1.713   9.013  26.528 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -20.34924    9.76658  -2.084    0.038 *  
## ORtg          0.63019    0.09424   6.687 1.02e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.14 on 320 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.1226, Adjusted R-squared:  0.1199 
## F-statistic: 44.71 on 1 and 320 DF,  p-value: 1.017e-10
model_drtg <- lm(W ~ DRtg, data=nba)
summary(model_drtg)
## 
## Call:
## lm(formula = W ~ DRtg, data = nba)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -36.239  -9.704   0.741  10.228  27.284 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 89.06285    9.78860   9.099  < 2e-16 ***
## DRtg        -0.43562    0.09609  -4.533 8.21e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.56 on 320 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.06035,    Adjusted R-squared:  0.05741 
## F-statistic: 20.55 on 1 and 320 DF,  p-value: 8.214e-06
model_rtgdif <- lm(W ~ RtgDif, data=nba)
summary(model_rtgdif)
## 
## Call:
## lm(formula = W ~ RtgDif, data = nba)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -25.9646  -2.2960   0.7784   3.3895   9.8133 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 40.33265    0.31117  129.62   <2e-16 ***
## RtgDif       2.51466    0.06191   40.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.223 on 320 degrees of freedom
##   (7 observations deleted due to missingness)
## Multiple R-squared:  0.8376, Adjusted R-squared:  0.8371 
## F-statistic:  1650 on 1 and 320 DF,  p-value: < 2.2e-16

After generating the linear regression models for each rating variable against the team wins, a clear relationship is identified between the rating difference and the team wins with an R-squared value of greater than 83%. This result indicates the Rating Difference does explain most of the variation in the Wins variable. I do believe for the derived rating difference value, the linear model is appropriate. Alone, the Offensive Rating, nor the Defensive Rating, indicates whether a team is good or not. A high-scoring team may also be very bad at defensive. By taking the difference, teams with a good offense and a good defense would expect to be considered “good”, and thus win more games. Also, by taking the difference, the pace of play will not impact the model quite as much. A high scoring team in the 1990s may not be the same as a high-scoring team in the 2010s. By taking the difference, the style and pace of games for a certain era is largely masked.