library(readr)
library(Stat2Data)
library(bestglm)
## Loading required package: leaps
nbaData <- read_csv("nba_logreg.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Name = col_character()
## )
## See spec(...) for full column specifications.

#A

mod1 = glm(TARGET_5Yrs ~ GP + PTS + MIN, family = binomial, data = nbaData)

summary(mod1)
## 
## Call:
## glm(formula = TARGET_5Yrs ~ GP + PTS + MIN, family = binomial, 
##     data = nbaData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4972  -1.0137   0.5776   0.8769   1.9326  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.422568   0.231676 -10.457  < 2e-16 ***
## GP           0.039893   0.004439   8.987  < 2e-16 ***
## PTS          0.132485   0.041251   3.212  0.00132 ** 
## MIN         -0.016147   0.020097  -0.803  0.42171    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1779.5  on 1339  degrees of freedom
## Residual deviance: 1531.2  on 1336  degrees of freedom
## AIC: 1539.2
## 
## Number of Fisher Scoring iterations: 4
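# Extract the fitted coefficients by name
b = coef(mod1)
# Plot against GP only, so the curve below is drawn on the matching axis
plot(jitter(TARGET_5Yrs, amount = 0.1) ~ GP, data = nbaData)

# Fitted probability as a function of GP, with PTS and MIN held at their sample means
curve(plogis(b["(Intercept)"] + b["GP"]*x + b["PTS"]*mean(nbaData$PTS) + b["MIN"]*mean(nbaData$MIN)), add = TRUE, col = "red")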

My predictor variables are the number of games played (GP), the average number of points per game (PTS), and the average number of minutes played per game (MIN). My response variable is whether a player lasts 5 years in the league (TARGET_5Yrs), with 0 meaning career years < 5 and 1 meaning career years >= 5.
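To make the fitted coefficients easier to read, the minimal sketch below converts the estimates reported in the summary above into odds ratios and one predicted probability. The coefficient values are copied from the output; the example player (70 games, 10 points and 25 minutes per game) is hypothetical and chosen only for illustration.

```r
# Estimates copied from summary(mod1) above
est <- c(Intercept = -2.422568, GP = 0.039893, PTS = 0.132485, MIN = -0.016147)

# Odds ratio for a one-unit increase in each predictor, holding the others fixed
odds_ratios <- exp(est[c("GP", "PTS", "MIN")])
round(odds_ratios, 3)

# Hypothetical player (these values are an assumption, not taken from the data);
# the leading 1 multiplies the intercept
player <- c(1, GP = 70, PTS = 10, MIN = 25)

# Linear predictor (log-odds), then the inverse logit to get a probability
log_odds <- sum(est * player)
prob_5yrs <- plogis(log_odds)
round(prob_5yrs, 3)
```

An odds ratio above 1 (GP, PTS) means the odds of lasting 5 years rise with that predictor, while a value below 1 (MIN) means they fall, in each case with the other two predictors held fixed.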

#B

G1 <- 1779.5 - 1531.2
G1
## [1] 248.3
1-pchisq(G1, 3)
## [1] 0

H0: βGP = βPTS = βMIN = 0 ; Ha: at least one of βGP, βPTS, βMIN != 0. The G-statistic for mod1 (TARGET_5Yrs ~ GP + PTS + MIN) is 248.3. I calculated the G-statistic by subtracting the residual deviance from the null deviance. To test the overall effectiveness of the model, I compared G to a chi-square distribution with 3 degrees of freedom (one per predictor) and took the upper-tail probability, 1 - pchisq(G1, 3). The output prints as 0 because the p-value is too small to survive the subtraction from 1, so it is far below 0.05. I can therefore reject the null hypothesis that the coefficients of all three predictors equal 0: at least one of the predictors is useful for predicting whether a player lasts 5 years.
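The printed p-value of exactly 0 is a rounding artifact rather than a true zero. As a small sketch using only the deviances reported above, asking pchisq for the upper tail directly (lower.tail = FALSE) avoids the subtraction and returns the tiny but positive tail probability. The variable names here are illustrative.

```r
# Deviances and degrees of freedom copied from summary(mod1) above
null_dev  <- 1779.5
resid_dev <- 1531.2
G  <- null_dev - resid_dev     # likelihood ratio (G) statistic
df <- 1339 - 1336              # difference in degrees of freedom = 3 predictors

p_subtract <- 1 - pchisq(G, df)                  # the subtraction rounds to exactly 0
p_upper    <- pchisq(G, df, lower.tail = FALSE)  # keeps the tiny tail probability
c(p_subtract = p_subtract, p_upper = p_upper)
```

Both versions lead to the same decision here; the second simply reports how small the p-value is.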

#C

summary(mod1)
## 
## Call:
## glm(formula = TARGET_5Yrs ~ GP + PTS + MIN, family = binomial, 
##     data = nbaData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4972  -1.0137   0.5776   0.8769   1.9326  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.422568   0.231676 -10.457  < 2e-16 ***
## GP           0.039893   0.004439   8.987  < 2e-16 ***
## PTS          0.132485   0.041251   3.212  0.00132 ** 
## MIN         -0.016147   0.020097  -0.803  0.42171    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1779.5  on 1339  degrees of freedom
## Residual deviance: 1531.2  on 1336  degrees of freedom
## AIC: 1539.2
## 
## Number of Fisher Scoring iterations: 4

H0: βi = 0 ; Ha: βi != 0, tested separately for each predictor i (GP, PTS, MIN). For each coefficient, the null hypothesis states that, with the other predictors in the model, there is no relationship between that predictor and whether a player lasts 5 years in the league, while the alternative hypothesis states that such a relationship exists. The p-values for GP (< 2e-16) and PTS (0.00132) are both below 0.05, so I can reject the null hypothesis for those two predictors and conclude that games played and points per game are each related to the log-odds of lasting 5 years. The p-value for MIN is 0.42171, so I cannot reject the null hypothesis for MIN: once GP and PTS are in the model, minutes per game does not add significant information. Additionally, the residual deviance is 1531.2 on 1336 degrees of freedom, compared with a null deviance of 1779.5 on 1339 degrees of freedom.
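The z values and p-values in the coefficient table can be reproduced by hand. The sketch below recomputes them from the estimates and standard errors reported above and adds approximate 95% Wald confidence intervals on the log-odds scale. The intervals are an addition for illustration and were not part of the original output.

```r
# Estimates and standard errors copied from summary(mod1) above
est <- c(GP = 0.039893, PTS = 0.132485, MIN = -0.016147)
se  <- c(GP = 0.004439, PTS = 0.041251, MIN = 0.020097)

z <- est / se              # Wald z statistic for H0: coefficient = 0
p <- 2 * pnorm(-abs(z))    # two-sided p-value from the standard normal

# Approximate 95% Wald confidence intervals on the log-odds scale
ci <- cbind(lower = est - qnorm(0.975) * se,
            upper = est + qnorm(0.975) * se)

round(cbind(z = z, p = p, ci), 4)
```

An interval that contains 0, as the one for MIN does, agrees with a p-value above 0.05 for that coefficient.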

#D

mod2 = glm(TARGET_5Yrs ~ GP, family = binomial, data = nbaData)

mod3 = glm(TARGET_5Yrs ~ PTS, family = binomial, data = nbaData)

summary(mod2)
## 
## Call:
## glm(formula = TARGET_5Yrs ~ GP, family = binomial, data = nbaData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9163  -1.0413   0.6176   0.8635   1.9361  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.524465   0.226275  -11.16   <2e-16 ***
## GP           0.051059   0.003749   13.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1779.5  on 1339  degrees of freedom
## Residual deviance: 1561.3  on 1338  degrees of freedom
## AIC: 1565.3
## 
## Number of Fisher Scoring iterations: 4
summary(mod3)
## 
## Call:
## glm(formula = TARGET_5Yrs ~ PTS, family = binomial, data = nbaData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7285  -1.1412   0.6146   1.0190   1.4300  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.78108    0.12283  -6.359 2.03e-10 ***
## PTS          0.20452    0.01897  10.778  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1779.5  on 1339  degrees of freedom
## Residual deviance: 1620.3  on 1338  degrees of freedom
## AIC: 1624.3
## 
## Number of Fisher Scoring iterations: 4
G1 <- 1779.5 - 1531.2
G1
## [1] 248.3
1-pchisq(G1, 3)
## [1] 0
G2 <- 1779.5 - 1561.3
G2
## [1] 218.2
1-pchisq(G2, 1)
## [1] 0
G3 <- 1779.5 - 1620.3
G3
## [1] 159.2
1-pchisq(G3, 1)
## [1] 0

The p-values for all three models (TARGET_5Yrs ~ GP, TARGET_5Yrs ~ PTS, and TARGET_5Yrs ~ GP + PTS + MIN) print as 0, meaning each is far below 0.05, so I can reject the null hypothesis for all three models. Each model as a whole predicts whether a player makes it 5 years in the league significantly better than an intercept-only model, with the predictors related linearly to the log-odds of that outcome. Additionally, the G-statistics for all three models are quite large. The largest value (248.3) belongs to the model with three predictors, but G cannot decrease when predictors are added to a model, so this alone does not show that the larger model is better. The AIC values give a fairer comparison: 1539.2 for the three-predictor model against 1565.3 for GP alone and 1624.3 for PTS alone, which favors the three-predictor model. Together, these results suggest that neither the number of games played nor the average number of points per game is, on its own, the only useful predictor of whether a player makes it 5 years in the league. Furthermore, these values encourage me to test more models with additional predictors via a best-subsets method to find the most effective model.
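Because the GP-only model is nested inside the three-predictor model, the two can also be compared directly with a drop-in-deviance (nested likelihood ratio) test. The sketch below uses only the residual deviances and degrees of freedom printed in the summaries above; the variable names are illustrative.

```r
# Residual deviances and degrees of freedom copied from the summaries above
dev_full <- 1531.2; df_full <- 1336   # TARGET_5Yrs ~ GP + PTS + MIN
dev_gp   <- 1561.3; df_gp   <- 1338   # TARGET_5Yrs ~ GP

G_nested  <- dev_gp - dev_full        # drop in deviance from adding PTS and MIN
df_nested <- df_gp - df_full          # number of added predictors
p_nested  <- pchisq(G_nested, df_nested, lower.tail = FALSE)

c(G = G_nested, df = df_nested, p = p_nested)
```

With the fitted objects available, anova(mod2, mod1, test = "Chisq") performs the same comparison without copying numbers by hand.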

#E

# Drop the player name and the FGM, FGA, FTM and FTA columns in one step.
# bestglm expects a plain data frame with the response in the last column.
nba.5 = as.data.frame(subset(nbaData, select = -c(Name, FGM, FGA, FTM, FTA)))

head(nba.5)

bestglm(nba.5, family = binomial)
## Morgan-Tatar search since family is non-gaussian.
## BIC
## BICq equivalent for q in (0.00305141446401269, 0.611960977328964)
## Best Model:
##                 Estimate  Std. Error    z value     Pr(>|z|)
## (Intercept) -2.711132054 0.242679611 -11.171652 5.612743e-29
## GP           0.041109423 0.004016242  10.235794 1.370665e-24
## `3P%`        0.006520932 0.004172437   1.562859 1.180857e-01
## OREB         0.720709171 0.114522624   6.293160 3.110679e-10

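Using bestglm, which selected by BIC here, it appears that the best model to predict whether a player lasts 5 years in the league is one with the predictors games played (GP), 3-point percentage (3P%), and offensive rebounds (OREB). Note that within this model the coefficient for 3P% has a p-value of about 0.118, so it is not individually significant at the 0.05 level.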
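The bestglm output above reports BIC as its selection criterion. As a rough sketch of how that criterion differs from AIC, the code below computes both for the three hand-built models from their reported deviances. It assumes the usual definitions (deviance plus a penalty per coefficient, valid here because the response is 0/1) and may differ from the values bestglm uses internally by a constant.

```r
# Residual deviances copied from the summaries above
dev <- c(mod1 = 1531.2, mod2 = 1561.3, mod3 = 1620.3)
k   <- c(mod1 = 4, mod2 = 2, mod3 = 2)   # coefficients, including the intercept
n   <- 1339 + 1                          # null deviance df + 1 = number of players

aic <- dev + 2 * k         # matches the AIC lines in the summaries
bic <- dev + log(n) * k    # BIC penalizes each coefficient by log(n) instead of 2

round(rbind(aic, bic), 1)
names(which.min(bic))      # model with the smallest BIC among these three
```

Because log(1340) is larger than 2, BIC penalizes extra predictors more heavily than AIC, which is one reason a BIC search tends to return a fairly small model.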