Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?

I chose to use the ‘nba-player-advanced-metrics’ dataset from FiveThirtyEight.

Load Data

#load data from Github path
nba_playerdata <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/nba-player-advanced-metrics/master/nba-data-historical.csv", header = TRUE, sep = ",")
#I will be performing analysis on a subset of the entire raw data
subset_data <-  nba_playerdata[c(2,3,5,12,13,25)]

Data Transformation

#I am interested in players who are either Shooting Guards or Centers, and only in those who averaged more than 25 minutes per game in the given season
library(dplyr) #provides filter()
conditions <- filter(subset_data, pos %in% c("SG","C"), MPG > 25)
#Sorting by highest points scored per 36 minutes descending
df <- conditions[with(conditions, order(-P.36)), ]
#Column Names
names (df) <- c("Player","Year","Position","Minutes","Points", "Usage Rate") 
#Top 50 Scorers
df [1:50,1:6]
##                Player Year Position Minutes Points Usage Rate
## 2108   Michael Jordan 1987       SG    40.0   34.8       38.3
## 1023      Kobe Bryant 2006       SG    41.0   34.2       38.7
## 2057   Michael Jordan 1988       SG    40.4   32.7       34.1
## 2155   Michael Jordan 1986       SG    25.1   32.7       38.6
## 27       James Harden 2020       SG    36.7   32.6       36.4
## 1790   Michael Jordan 1993       SG    39.3   32.3       34.7
## 1889   Michael Jordan 1991       SG    37.0   32.0       32.9
## 1952   Michael Jordan 1990       SG    39.0   32.0       33.7
## 1613   Michael Jordan 1996       SG    37.7   31.9       33.3
## 2385    George Gervin 1982       SG    35.7   31.8       35.0
## 185      James Harden 2018       SG    35.4   31.7       36.1
## 1219    Tracy McGrady 2003       SG    39.4   31.5       35.2
## 818       Dwyane Wade 2009       SG    38.6   31.3       36.2
## 1553   Michael Jordan 1997       SG    37.9   31.3       33.2
## 295  DeMarcus Cousins 2017        C    34.4   30.7       37.5
## 1504 Shaquille O'Neal 1998        C    36.3   30.1       32.9
## 1481   Michael Jordan 1998       SG    38.8   30.0       33.7
## 1708 Shaquille O'Neal 1995        C    37.0   30.0       31.9
## 2011   Michael Jordan 1989       SG    40.2   30.0       32.1
## 956       Kobe Bryant 2007       SG    40.8   29.8       33.6
## 73       Bradley Beal 2020       SG    36.0   29.7       34.4
## 673       Kobe Bryant 2011       SG    33.9   29.7       35.1
## 1444 Shaquille O'Neal 1999        C    34.8   29.7       32.4
## 1839   Michael Jordan 1992       SG    38.8   29.6       31.7
## 746       Dwyane Wade 2010       SG    36.3   29.4       34.9
## 950          Yao Ming 2007        C    33.8   29.4       33.5
## 1263 Shaquille O'Neal 2002        C    36.1   29.4       31.8
## 1338    Allen Iverson 2001       SG    42.0   29.4       35.9
## 1774   David Robinson 1994        C    40.5   29.4       32.0
## 301     DeMar DeRozan 2017       SG    35.4   29.3       34.3
## 1918     Ricky Pierce 1991       SG    28.8   29.3       30.7
## 1979     Ricky Pierce 1990       SG    29.0   29.2       31.3
## 288       Joel Embiid 2017        C    25.4   29.1       36.0
## 1281    Allen Iverson 2002       SG    43.7   29.1       37.8
## 142       Joel Embiid 2019        C    33.7   29.0       33.3
## 960       Dwyane Wade 2007       SG    37.9   29.0       34.7
## 2491    George Gervin 1980       SG    37.6   29.0       31.7
## 1646 Shaquille O'Neal 1996        C    36.0   28.9       32.8
## 600       Kobe Bryant 2012       SG    38.5   28.8       35.7
## 1382 Shaquille O'Neal 2000        C    40.0   28.6       31.2
## 279     Anthony Davis 2017        C    36.1   28.5       32.6
## 1327 Shaquille O'Neal 2001        C    39.5   28.5       31.6
## 814       Kobe Bryant 2009       SG    36.1   28.4       32.2
## 57        Joel Embiid 2020        C    30.2   28.3       32.6
## 1202 Shaquille O'Neal 2003        C    37.8   28.3       30.2
## 614       Brook Lopez 2012        C    27.2   28.2       32.7
## 1201      Kobe Bryant 2003       SG    41.5   28.2       32.9
## 2494    World B. Free 1980       SG    38.0   28.2       32.7
## 1314 Jerry Stackhouse 2001       SG    40.2   28.1       35.2
## 2439    George Gervin 1981       SG    33.7   28.1       32.3

Linear Model

The model will predict the points a player will score in 36 minutes.

Position will be the dichotomous variable. In this case, I will assign a value of ‘1’ if the player is a shooting guard and a ‘0’ if the player is a center.

Minutes Per Game will be another predictor.

Usage Rate will be the quadratic predictor (its square enters the model) because a player is likely to score more when relied on to carry more of the team's offensive load. In some seasons a star player had little offensive talent around him, or the team's other scorers were injured. For these reasons, the star players in the above list had to carry the offensive load single-handedly.

#dichotomous: 1 = shooting guard (SG), 0 = center (C)
x <- as.numeric(df$Position == "SG")
df2 <- cbind(df, x)
#quadratic
usg <- (df$`Usage Rate`)^2
final_df <- cbind (df2, usg)
#data frame for regression
head (final_df)
##              Player Year Position Minutes Points Usage Rate x     usg
## 2108 Michael Jordan 1987       SG    40.0   34.8       38.3 1 1466.89
## 1023    Kobe Bryant 2006       SG    41.0   34.2       38.7 1 1497.69
## 2057 Michael Jordan 1988       SG    40.4   32.7       34.1 1 1162.81
## 2155 Michael Jordan 1986       SG    25.1   32.7       38.6 1 1489.96
## 27     James Harden 2020       SG    36.7   32.6       36.4 1 1324.96
## 1790 Michael Jordan 1993       SG    39.3   32.3       34.7 1 1204.09
regression_model <- lm (Points ~ x+Minutes+usg, data=final_df)
summary(regression_model)
## 
## Call:
## lm(formula = Points ~ x + Minutes + usg, data = final_df)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.3860 -0.9326  0.0246  1.0460  5.5605 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  6.2908008  0.2372435  26.516   <2e-16 ***
## x           -0.0426340  0.0619928  -0.688    0.492    
## Minutes      0.0678619  0.0081233   8.354   <2e-16 ***
## usg          0.0189545  0.0001506 125.900   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.555 on 2655 degrees of freedom
## Multiple R-squared:  0.8884, Adjusted R-squared:  0.8883 
## F-statistic:  7045 on 3 and 2655 DF,  p-value: < 2.2e-16
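
The formula above contains no dichotomous vs. quantitative interaction term. A minimal sketch of how one could be written in R's formula syntax follows. It uses 14 rows copied from the top-50 table purely to show the syntax, so the resulting coefficients are not meaningful. The names `toy`, `sg`, `Usage` and `fit_int` are illustrative, and including the linear `Usage` term alongside its square is an assumption that differs from the model fitted above.

```r
# Toy data: 14 rows copied from the top-50 table (illustration only)
toy <- data.frame(
  Position = c("SG","SG","SG","SG","SG","SG","SG","C","C","C","C","C","C","C"),
  Minutes  = c(40.0, 41.0, 36.7, 35.7, 39.4, 38.6, 28.8,
               34.4, 36.3, 33.8, 40.5, 25.4, 36.1, 27.2),
  Points   = c(34.8, 34.2, 32.6, 31.8, 31.5, 31.3, 29.3,
               30.7, 30.1, 29.4, 29.4, 29.1, 28.5, 28.2),
  Usage    = c(38.3, 38.7, 36.4, 35.0, 35.2, 36.2, 30.7,
               37.5, 32.9, 33.5, 32.0, 36.0, 32.6, 32.7)
)
# Dichotomous term: 1 = shooting guard, 0 = center
toy$sg <- as.numeric(toy$Position == "SG")
# I(Usage^2) is the quadratic term; sg:Minutes is the
# dichotomous vs. quantitative interaction
fit_int <- lm(Points ~ sg + Minutes + Usage + I(Usage^2) + sg:Minutes, data = toy)
coef(fit_int)
```

In such a model the `sg:Minutes` coefficient would be read as the difference in the minutes slope between shooting guards and centers.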

Model Summary

\[ \hat{y} = 6.2908 - 0.0426 \times \text{position} + 0.0679 \times \text{minutes} + 0.0190 \times \text{usage}^2 \]
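
As a check on the fitted equation, the coefficients from the summary output can be applied by hand to one row of the table. The sketch below uses Michael Jordan's 1987 season. Note that the usage rate enters squared, matching the `usg` column. The variable names are illustrative.

```r
# Coefficients copied from the summary() output
b0    <- 6.2908008   # intercept
b_pos <- -0.0426340  # position (1 = SG, 0 = C)
b_min <- 0.0678619   # minutes per game
b_usg <- 0.0189545   # squared usage rate

# Michael Jordan, 1987: shooting guard, 40.0 minutes, usage rate 38.3
position <- 1
minutes  <- 40.0
usage    <- 38.3

predicted <- b0 + b_pos * position + b_min * minutes + b_usg * usage^2
observed  <- 34.8
residual  <- observed - predicted
round(c(predicted = predicted, residual = residual), 2)
```

With these inputs the prediction comes out near 36.8 points per 36 minutes against an observed 34.8, a residual of about -2.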

Position

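The coefficient is -0.043: holding minutes and squared usage rate constant, a shooting guard is predicted to score about 0.04 fewer points per 36 minutes than a center. With a p-value of 0.492, this difference is not statistically distinguishable from zero, so position adds little to the model.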

Minutes Per Game

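The coefficient is 0.068: each additional minute per game is associated with about 0.07 more points per 36 minutes, holding the other predictors constant. The p-value is below 2e-16, so the term is statistically significant.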

Usage Rate

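The coefficient on squared usage rate is 0.0190, with a p-value below 2e-16. Because the term is squared, the effect of one more usage point grows with the usage level: the slope is 2 × 0.0190 × usage, or about 1.1 points per 36 minutes at a usage rate of 30.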

\(R^2\) is 0.888, so the model explains about 89% of the variance in points per 36 minutes.

Inference & Residual Analysis

qqnorm(regression_model$residuals)
qqline(regression_model$residuals)
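
A Q-Q plot alone checks only normality of the residuals. A fuller residual analysis would usually also look at residuals against fitted values (for non-linearity and non-constant variance) and at the residual distribution. The sketch below shows these checks on 14 rows copied from the top-50 table, so its output says nothing about the full model. The names `toy`, `fit` and `res` are illustrative, and on the real data `regression_model` would take the place of `fit`.

```r
# Toy data: 14 rows copied from the top-50 table (illustration only)
toy <- data.frame(
  x       = c(1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0),  # 1 = SG, 0 = C
  Minutes = c(40.0, 41.0, 36.7, 35.7, 39.4, 38.6, 28.8,
              34.4, 36.3, 33.8, 40.5, 25.4, 36.1, 27.2),
  Points  = c(34.8, 34.2, 32.6, 31.8, 31.5, 31.3, 29.3,
              30.7, 30.1, 29.4, 29.4, 29.1, 28.5, 28.2),
  Usage   = c(38.3, 38.7, 36.4, 35.0, 35.2, 36.2, 30.7,
              37.5, 32.9, 33.5, 32.0, 36.0, 32.6, 32.7)
)
toy$usg <- toy$Usage^2
fit <- lm(Points ~ x + Minutes + usg, data = toy)
res <- residuals(fit)

# Residuals vs. fitted: a curve or funnel shape would suggest
# non-linearity or non-constant variance
plot(fitted(fit), res, xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)

# Histogram of residuals: should look roughly symmetric around zero
hist(res, main = "Histogram of residuals", xlab = "Residuals")

# Shapiro-Wilk test of normality (valid for 3 to 5000 observations)
shapiro.test(res)
```

If the residuals scatter evenly around zero with no visible pattern and the Q-Q points stay close to the line, the linear model assumptions are plausibly met.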