Using R, build a multiple regression model for data that interests you. Include in this model at least one quadratic term, one dichotomous term, and one dichotomous vs. quantitative interaction term. Interpret all coefficients. Conduct residual analysis. Was the linear model appropriate? Why or why not?
I chose to use the ‘nba-player-advanced-metrics’ dataset from FiveThirtyEight.
# Load the data from the GitHub path
nba_playerdata <- read.csv("https://raw.githubusercontent.com/fivethirtyeight/nba-player-advanced-metrics/master/nba-data-historical.csv", header = TRUE, sep = ",")
#I will be performing analysis on a subset of the entire raw data
subset_data <- nba_playerdata[c(2,3,5,12,13,25)]
# I am interested in players who are either shooting guards (SG) or centers (C),
# and only in seasons where the player averaged more than 25 minutes per game
library(dplyr)  # provides filter()
conditions <- filter(subset_data, pos %in% c("SG", "C"), MPG > 25)
#Sorting by highest points scored per 36 minutes descending
df <- conditions[with(conditions, order(-P.36)), ]
#Column Names
names (df) <- c("Player","Year","Position","Minutes","Points", "Usage Rate")
#Top 50 Scorers
df [1:50,1:6]
## Player Year Position Minutes Points Usage Rate
## 2108 Michael Jordan 1987 SG 40.0 34.8 38.3
## 1023 Kobe Bryant 2006 SG 41.0 34.2 38.7
## 2057 Michael Jordan 1988 SG 40.4 32.7 34.1
## 2155 Michael Jordan 1986 SG 25.1 32.7 38.6
## 27 James Harden 2020 SG 36.7 32.6 36.4
## 1790 Michael Jordan 1993 SG 39.3 32.3 34.7
## 1889 Michael Jordan 1991 SG 37.0 32.0 32.9
## 1952 Michael Jordan 1990 SG 39.0 32.0 33.7
## 1613 Michael Jordan 1996 SG 37.7 31.9 33.3
## 2385 George Gervin 1982 SG 35.7 31.8 35.0
## 185 James Harden 2018 SG 35.4 31.7 36.1
## 1219 Tracy McGrady 2003 SG 39.4 31.5 35.2
## 818 Dwyane Wade 2009 SG 38.6 31.3 36.2
## 1553 Michael Jordan 1997 SG 37.9 31.3 33.2
## 295 DeMarcus Cousins 2017 C 34.4 30.7 37.5
## 1504 Shaquille O'Neal 1998 C 36.3 30.1 32.9
## 1481 Michael Jordan 1998 SG 38.8 30.0 33.7
## 1708 Shaquille O'Neal 1995 C 37.0 30.0 31.9
## 2011 Michael Jordan 1989 SG 40.2 30.0 32.1
## 956 Kobe Bryant 2007 SG 40.8 29.8 33.6
## 73 Bradley Beal 2020 SG 36.0 29.7 34.4
## 673 Kobe Bryant 2011 SG 33.9 29.7 35.1
## 1444 Shaquille O'Neal 1999 C 34.8 29.7 32.4
## 1839 Michael Jordan 1992 SG 38.8 29.6 31.7
## 746 Dwyane Wade 2010 SG 36.3 29.4 34.9
## 950 Yao Ming 2007 C 33.8 29.4 33.5
## 1263 Shaquille O'Neal 2002 C 36.1 29.4 31.8
## 1338 Allen Iverson 2001 SG 42.0 29.4 35.9
## 1774 David Robinson 1994 C 40.5 29.4 32.0
## 301 DeMar DeRozan 2017 SG 35.4 29.3 34.3
## 1918 Ricky Pierce 1991 SG 28.8 29.3 30.7
## 1979 Ricky Pierce 1990 SG 29.0 29.2 31.3
## 288 Joel Embiid 2017 C 25.4 29.1 36.0
## 1281 Allen Iverson 2002 SG 43.7 29.1 37.8
## 142 Joel Embiid 2019 C 33.7 29.0 33.3
## 960 Dwyane Wade 2007 SG 37.9 29.0 34.7
## 2491 George Gervin 1980 SG 37.6 29.0 31.7
## 1646 Shaquille O'Neal 1996 C 36.0 28.9 32.8
## 600 Kobe Bryant 2012 SG 38.5 28.8 35.7
## 1382 Shaquille O'Neal 2000 C 40.0 28.6 31.2
## 279 Anthony Davis 2017 C 36.1 28.5 32.6
## 1327 Shaquille O'Neal 2001 C 39.5 28.5 31.6
## 814 Kobe Bryant 2009 SG 36.1 28.4 32.2
## 57 Joel Embiid 2020 C 30.2 28.3 32.6
## 1202 Shaquille O'Neal 2003 C 37.8 28.3 30.2
## 614 Brook Lopez 2012 C 27.2 28.2 32.7
## 1201 Kobe Bryant 2003 SG 41.5 28.2 32.9
## 2494 World B. Free 1980 SG 38.0 28.2 32.7
## 1314 Jerry Stackhouse 2001 SG 40.2 28.1 35.2
## 2439 George Gervin 1981 SG 33.7 28.1 32.3
The model will predict the points a player will score in 36 minutes.
Position will be the dichotomous variable. In this case, I will assign a value of ‘1’ if the player is a shooting guard and a ‘0’ if the player is a center.
Minutes Per Game will be another predictor.
Usage Rate will enter the model as the quadratic (squared) predictor, because a player is likely to score more when the team relies on them to carry more of the offensive load. In some seasons a star player had little offensive talent around them, or the team's other scorers were injured, so the players in the list above had to carry the offense single-handedly.
# Dichotomous term: 1 = shooting guard (SG), 0 = center (C)
x <- as.numeric(df$Position == "SG")
x2 <- as.data.frame (x)
df2 <- cbind(df,x2)
#quadratic
usg <- (df$`Usage Rate`)^2
final_df <- cbind (df2, usg)
#data frame for regression
head (final_df)
## Player Year Position Minutes Points Usage Rate x usg
## 2108 Michael Jordan 1987 SG 40.0 34.8 38.3 1 1466.89
## 1023 Kobe Bryant 2006 SG 41.0 34.2 38.7 1 1497.69
## 2057 Michael Jordan 1988 SG 40.4 32.7 34.1 1 1162.81
## 2155 Michael Jordan 1986 SG 25.1 32.7 38.6 1 1489.96
## 27 James Harden 2020 SG 36.7 32.6 36.4 1 1324.96
## 1790 Michael Jordan 1993 SG 39.3 32.3 34.7 1 1204.09
regression_model <- lm (Points ~ x+Minutes+usg, data=final_df)
summary(regression_model)
##
## Call:
## lm(formula = Points ~ x + Minutes + usg, data = final_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -7.3860 -0.9326 0.0246 1.0460 5.5605
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.2908008 0.2372435 26.516 <2e-16 ***
## x -0.0426340 0.0619928 -0.688 0.492
## Minutes 0.0678619 0.0081233 8.354 <2e-16 ***
## usg 0.0189545 0.0001506 125.900 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.555 on 2655 degrees of freedom
## Multiple R-squared: 0.8884, Adjusted R-squared: 0.8883
## F-statistic: 7045 on 3 and 2655 DF, p-value: < 2.2e-16
\[ \hat{y} = 6.2908 - 0.0426 \times position + 0.0679 \times minutes + 0.0190 \times usage^2 \]
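As a worked check of the fitted equation, the sketch below plugs one row of the data frame (Michael Jordan, 1987) into the coefficients reported by `summary()`. The helper `predict_points` and the vector `b` are hypothetical names introduced only for this illustration; the coefficients are copied from the output above at full precision.

```r
# Coefficients copied from the summary() output above
b <- c(intercept = 6.2908008, position = -0.0426340,
       minutes = 0.0678619, usage_sq = 0.0189545)

# Hypothetical helper (not part of the original analysis):
# position is 1 for SG and 0 for C; usage rate is squared inside the function
predict_points <- function(position, minutes, usage) {
  unname(b["intercept"] + b["position"] * position +
           b["minutes"] * minutes + b["usage_sq"] * usage^2)
}

# Michael Jordan, 1987: SG, 40.0 minutes per game, usage rate 38.3
fitted_mj   <- predict_points(position = 1, minutes = 40.0, usage = 38.3)
observed_mj <- 34.8
residual_mj <- observed_mj - fitted_mj

round(c(fitted = fitted_mj, residual = residual_mj), 2)
```

The fitted value comes out at roughly 36.8 points per 36 minutes against an observed 34.8, so the model over-predicts this season by about 2 points, which is within the range of residuals reported in the summary.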
Position
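The p-value is high (0.492), so the position coefficient is not statistically distinguishable from zero. Taken at face value, the estimate says a shooting guard scores about 0.04 fewer points per 36 minutes than a center with the same minutes and usage rate, but this estimate should not be relied on.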
Minutes Per Game
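The p-value is low (< 2e-16). Holding position and usage rate fixed, each additional minute per game is associated with about 0.07 more points per 36 minutes.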
Usage Rate
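The p-value is low (< 2e-16). The coefficient applies to the squared usage rate: each one-unit increase in \(usage^2\) is associated with about 0.019 more points per 36 minutes, so the effect of usage rate on predicted scoring grows as usage rate rises.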
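Because usage rate enters only through its square, its marginal effect is not constant. A short derivation, assuming the fitted equation above and treating usage rate as continuous:

\[ \frac{\partial \hat{y}}{\partial\, usage} = 2 \times 0.0190 \times usage \]

At a usage rate of 20 this is about \(2 \times 0.0190 \times 20 \approx 0.76\) points per 36 minutes for each additional point of usage rate, and at a usage rate of 30 it is about \(2 \times 0.0190 \times 30 \approx 1.14\).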
\(R^2\) is high at 0.888: the model explains about 89% of the variance in points per 36 minutes.
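The assignment prompt also asks for a dichotomous vs. quantitative interaction term, which the fitted model does not include. The sketch below shows one way such a term could be specified in an `lm()` formula. It runs on synthetic data generated inside the block, not on the FiveThirtyEight data, and the data frame `toy`, the column `Usage`, and the object `interaction_model` are names assumed only for this illustration.

```r
set.seed(1)  # reproducible synthetic data, NOT the FiveThirtyEight data
n <- 200
toy <- data.frame(
  x       = rbinom(n, 1, 0.5),   # 1 = SG, 0 = C
  Minutes = runif(n, 25, 43),
  Usage   = runif(n, 12, 38)
)

# Made-up data-generating process, chosen only so the model has something to fit
toy$Points <- 5 + 0.05 * toy$Minutes + 0.3 * toy$Usage + 0.01 * toy$Usage^2 +
  0.5 * toy$x + 0.03 * toy$x * toy$Minutes + rnorm(n, sd = 1.5)

# x * Minutes expands to x + Minutes + x:Minutes (the interaction term);
# I(Usage^2) adds the quadratic term while keeping the linear Usage term
interaction_model <- lm(Points ~ x * Minutes + Usage + I(Usage^2), data = toy)

names(coef(interaction_model))
```

In a model of this form, the `x:Minutes` coefficient is the difference in the Minutes slope between shooting guards and centers, and the `x` coefficient is the position gap at zero minutes rather than an overall gap.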
qqnorm(regression_model$residuals)
qqline(regression_model$residuals)
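The Q-Q plot checks only the normality of the residuals. A fuller residual analysis usually also looks at residuals against fitted values (for curvature or non-constant variance) and at their distribution. The sketch below wraps those checks in a hypothetical helper, `residual_checks`, and demonstrates it on a small synthetic model so that it runs on its own; the toy data and variable names are assumptions for illustration.

```r
# Hypothetical helper: standard residual checks for any fitted lm object
residual_checks <- function(model) {
  res <- resid(model)
  fit <- fitted(model)

  # Residuals vs fitted values: look for curvature or a funnel shape
  plot(fit, res, xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, lty = 2)

  # Histogram of residuals: look for skew or heavy tails
  hist(res, main = "Histogram of residuals", xlab = "Residuals")

  # Return a few numeric summaries alongside the plots
  c(mean = mean(res), sd = sd(res), min = min(res), max = max(res))
}

# Demonstration on synthetic data (not the NBA data)
set.seed(42)
toy <- data.frame(u = runif(100, 12, 38))
toy$y <- 6 + 0.02 * toy$u^2 + rnorm(100, sd = 1.5)
toy_model <- lm(y ~ I(u^2), data = toy)

checks <- residual_checks(toy_model)
```

Calling `residual_checks(regression_model)` would apply the same checks to the model fitted above. A roughly even band of points around zero in the first plot would support the linear specification, while a curve or funnel would argue against it.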