I found my data on data.world, where it was already available as a CSV file. No cleaning was needed, so the data could be loaded directly into RStudio for analysis.

library(readr)
library(Stat2Data)
nbaData <- read_csv("nba_logreg.csv")
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Name = col_character()
## )
## See spec(...) for full column specifications.

Part 1A - Choose a single quantitative predictor and construct a logistic regression model.

mod1 = glm(TARGET_5Yrs ~ GP, family = binomial, data = nbaData)

summary(mod1)
## 
## Call:
## glm(formula = TARGET_5Yrs ~ GP, family = binomial, data = nbaData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9163  -1.0413   0.6176   0.8635   1.9361  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.524465   0.226275  -11.16   <2e-16 ***
## GP           0.051059   0.003749   13.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1779.5  on 1339  degrees of freedom
## Residual deviance: 1561.3  on 1338  degrees of freedom
## AIC: 1565.3
## 
## Number of Fisher Scoring iterations: 4

My predictor variable is the number of games played (GP), and my response variable is whether a player lasts at least 5 years in the league (TARGET_5Yrs), with 0 meaning career years < 5 and 1 meaning career years >= 5. The model predicts the log-odds of the response, not the response itself, so the fitted logistic regression equation is: log(odds of TARGET_5Yrs = 1) = -2.524465 + 0.051059(GP).
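To make the fitted equation concrete, here is a minimal sketch that converts the log-odds into predicted probabilities. The coefficients are copied from the summary output above; the three GP values are hypothetical inputs chosen only for illustration.

```r
# Coefficients copied from summary(mod1)
b0 <- -2.524465   # intercept
b1 <- 0.051059    # slope for GP

# Hypothetical numbers of games played (illustration only)
gp <- c(20, 50, 82)

# Linear predictor on the log-odds scale
log_odds <- b0 + b1 * gp

# plogis() is the inverse logit: exp(x) / (1 + exp(x))
prob <- plogis(log_odds)

print(data.frame(gp, log_odds, prob))
```

The predicted probability rises with GP, which is what the positive slope implies.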

Part 1B - Plot the raw data and the logistic curve on the same axes.

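# Intercept and slope of the fitted model
B0 = coef(mod1)[1]
B1 = coef(mod1)[2]
# Jitter the 0/1 response so overlapping points are visible
plot(jitter(TARGET_5Yrs, amount = 0.1) ~ GP, data = nbaData)
# Overlay the fitted probability curve
curve(exp(B0+B1*x)/(1+exp(B0+B1*x)), add = TRUE, col = "red")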

#### The above plot shows the jittered raw data and the fitted logistic curve for mod1 (TARGET_5Yrs ~ GP). Because the response is binary, the data points can only sit near 0 (bottom) or 1 (top), so the curve shows the estimated probability rather than passing through the points. We can see that as the number of games played increases, the estimated probability of a player lasting at least 5 years in the league increases.

Part 1C - Construct an empirical logit plot and comment on the linearity of the data.

emplogitplot1(TARGET_5Yrs ~ GP, data = nbaData)

#### The above plot shows the empirical logit plot for TARGET_5Yrs ~ GP. The points are not exactly on the line but are close to it, so the relationship between GP and the log-odds appears roughly linear. With only three points, it is difficult to tell whether linearity holds across the full range of GP. Because GP is quantitative and the response is binary, the empirical logits are computed from groups of players with similar GP rather than from individual data points.
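As a rough illustration of what an empirical logit plot computes, the sketch below builds one by hand in base R with more than three groups. It uses simulated stand-in data, not the NBA data, and the group count of 8 and the 0.5 adjustment are assumptions made for this example.

```r
set.seed(1)

# Simulated stand-in data (NOT the NBA data): x plays the role of GP
x <- round(runif(500, 10, 82))
y <- rbinom(500, 1, plogis(-2.5 + 0.05 * x))

ngroups <- 8  # assumed number of groups for this sketch

# Cut x into groups of roughly equal size using quantiles
breaks <- unique(quantile(x, probs = seq(0, 1, length.out = ngroups + 1)))
grp <- cut(x, breaks = breaks, include.lowest = TRUE)

n_g   <- tapply(y, grp, length)  # players per group
yes_g <- tapply(y, grp, sum)     # successes per group
x_bar <- tapply(x, grp, mean)    # mean predictor value per group

# Adjusted log-odds; the 0.5 avoids log(0) if a group is all 0s or all 1s
emp_logit <- log((yes_g + 0.5) / (n_g - yes_g + 0.5))

plot(x_bar, emp_logit, xlab = "Group mean of x", ylab = "Empirical logit")
abline(lm(emp_logit ~ x_bar), col = "blue")
```

With more groups, curvature in the log-odds is easier to spot than with three points.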

Part 1D - Use the summary of your logistic model to perform a hypothesis test to determine if there is significant evidence of a relationship between the response and predictor variable. State your hypotheses and conclusion.

summary(mod1)
## 
## Call:
## glm(formula = TARGET_5Yrs ~ GP, family = binomial, data = nbaData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9163  -1.0413   0.6176   0.8635   1.9361  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.524465   0.226275  -11.16   <2e-16 ***
## GP           0.051059   0.003749   13.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1779.5  on 1339  degrees of freedom
## Residual deviance: 1561.3  on 1338  degrees of freedom
## AIC: 1565.3
## 
## Number of Fisher Scoring iterations: 4

H0: β1 = 0 ; Ha: β1 != 0, where β1 is the coefficient of GP in the model

The null hypothesis states that there is no relationship between the number of games played and whether a player lasts 5 years in the league, while the alternative hypothesis states that there is a relationship. The p-value for GP is less than 2e-16, far below 0.05, so I reject the null hypothesis and conclude that there is significant evidence of a relationship between GP and the log-odds of TARGET_5Yrs = 1. This shows an association between GP and TARGET_5Yrs, not that GP causes longer careers. Additionally, the residual deviance is 1561.3 on 1338 degrees of freedom, compared with a null deviance of 1779.5.
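The z value and p-value in the coefficient table can be reproduced by hand, which may help show where the test comes from. This is a minimal sketch using the estimate and standard error for GP copied from the summary output.

```r
# Estimate and standard error for GP, copied from summary(mod1)
b1    <- 0.051059
se_b1 <- 0.003749

# Wald test statistic: estimate divided by its standard error
z <- b1 / se_b1

# Two-sided p-value from the standard normal distribution
p_value <- 2 * pnorm(-abs(z))

print(z)
print(p_value)
```

The p-value is compared with the significance level (0.05 here); it does not need to equal 0 or differ from 0 to reach a conclusion.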

Part 1E - Construct a confidence interval for the odds ratio and include a sentence interpreting the interval in the context.

exp(confint.default(mod1))
##                  2.5 %    97.5 %
## (Intercept) 0.05140824 0.1248086
## GP          1.04467976 1.0601467

I am 95% confident that the odds ratio for games played falls between 1.04467976 and 1.0601467. The whole interval is above 1, which means that each additional game played multiplies the odds of a player making it 5 years in the league by a factor between about 1.045 and 1.060. The interval for the intercept (0.05140824 to 0.1248086) is not an odds ratio; it estimates the odds of lasting 5 years for a player with 0 games played.
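The interval for GP can be checked by hand from the coefficient table. The sketch below assumes the usual Wald form, estimate plus or minus a normal quantile times the standard error, exponentiated to move from the log-odds scale to the odds-ratio scale.

```r
# Estimate and standard error for GP, copied from summary(mod1)
b1    <- 0.051059
se_b1 <- 0.003749

# 97.5th percentile of the standard normal (about 1.96)
z_crit <- qnorm(0.975)

# Wald interval on the log-odds scale, then exponentiate
ci_log_odds   <- b1 + c(-1, 1) * z_crit * se_b1
ci_odds_ratio <- exp(ci_log_odds)

print(ci_odds_ratio)
```

Small differences from the reported interval in the last decimal places would come from the rounded inputs.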

Part 1F - Compute the G-statistic and use it to test the effectiveness of your model.

G1 <- 1779.5 - 1561.3
G1
## [1] 218.2
1-pchisq(G1, 1)
## [1] 0

H0: β1 = 0 (the model with GP is no better than the model with no predictors) ; Ha: β1 != 0

The G-statistic for mod1 (TARGET_5Yrs ~ GP) is 218.2, the difference between the null deviance and the residual deviance. To test the effectiveness of the model, I compared G to a chi-squared distribution with 1 degree of freedom (1339 - 1338) and computed the upper-tail probability as 1 - pchisq(G1, 1). The output is 0 because the p-value is too small to display, so I reject the null hypothesis and conclude that the model with GP is significantly more effective than the model with no predictors. This test does not tell me whether GP is the only useful predictor; other variables may still improve the model.
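The printed 0 is a rounding effect: pchisq() returns a value indistinguishable from 1, so subtracting it from 1 gives exactly 0. A small sketch, using the deviances reported above, shows how to get the upper tail directly.

```r
# Deviances copied from summary(mod1)
null_dev  <- 1779.5
resid_dev <- 1561.3

# G-statistic: drop in deviance from adding GP to the null model
G <- null_dev - resid_dev

# Degrees of freedom: 1339 - 1338 = 1 added parameter
df <- 1339 - 1338

# Upper-tail probability computed directly, avoiding 1 - (value near 1)
p_upper <- pchisq(G, df = df, lower.tail = FALSE)

print(G)
print(p_upper)
```

The result is a very small positive number rather than 0, which reads more clearly as strong evidence against the null hypothesis.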

Part G - Repeat (a)-(f) for a second model with a different single quantitative predictor.

Part 2A - Choose a single quantitative predictor and construct a logistic regression model.

mod2 = glm(TARGET_5Yrs ~ PTS, family = binomial, data = nbaData)

summary(mod2)
## 
## Call:
## glm(formula = TARGET_5Yrs ~ PTS, family = binomial, data = nbaData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7285  -1.1412   0.6146   1.0190   1.4300  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.78108    0.12283  -6.359 2.03e-10 ***
## PTS          0.20452    0.01897  10.778  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1779.5  on 1339  degrees of freedom
## Residual deviance: 1620.3  on 1338  degrees of freedom
## AIC: 1624.3
## 
## Number of Fisher Scoring iterations: 4

My predictor variable is the average number of points scored per game (PTS), and my response variable is whether a player lasts at least 5 years in the NBA (TARGET_5Yrs), with 0 meaning career years < 5 and 1 meaning career years >= 5. The model predicts the log-odds of the response, so the fitted logistic regression equation is: log(odds of TARGET_5Yrs = 1) = -0.78108 + 0.20452(PTS).

Part 2B - Plot the raw data and the logistic curve on the same axes.

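# Intercept and slope of the fitted model
B0 = coef(mod2)[1]
B1 = coef(mod2)[2]
# Jitter the 0/1 response so overlapping points are visible
plot(jitter(TARGET_5Yrs, amount = 0.1) ~ PTS, data = nbaData)
# Overlay the fitted probability curve
curve(exp(B0+B1*x)/(1+exp(B0+B1*x)), add = TRUE, col = "red")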

#### The above plot shows the jittered raw data and the fitted logistic curve for mod2 (TARGET_5Yrs ~ PTS). Because the response is binary, the data points can only sit near 0 (bottom) or 1 (top), so the curve shows the estimated probability rather than passing through the points. We can see that as the average number of points scored per game increases, the estimated probability of a player lasting at least 5 years in the league increases.

Part 2C - Construct an empirical logit plot and comment on the linearity of the data.

emplogitplot1(TARGET_5Yrs ~ PTS, data = nbaData)

#### The above plot shows the empirical logit plot for TARGET_5Yrs ~ PTS. The points are not exactly on the line but are close to it, so the relationship between PTS and the log-odds appears roughly linear. With only three points, it is difficult to tell whether linearity holds across the full range of PTS. Because PTS is quantitative and the response is binary, the empirical logits are computed from groups of players with similar PTS rather than from individual data points.

Part 2D - Use the summary of your logistic model to perform a hypothesis test to determine if there is significant evidence of a relationship between the response and predictor variable. State your hypotheses and conclusion.

summary(mod2)
## 
## Call:
## glm(formula = TARGET_5Yrs ~ PTS, family = binomial, data = nbaData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7285  -1.1412   0.6146   1.0190   1.4300  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.78108    0.12283  -6.359 2.03e-10 ***
## PTS          0.20452    0.01897  10.778  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1779.5  on 1339  degrees of freedom
## Residual deviance: 1620.3  on 1338  degrees of freedom
## AIC: 1624.3
## 
## Number of Fisher Scoring iterations: 4

H0: β1 = 0 ; Ha: β1 != 0, where β1 is the coefficient of PTS in the model

The null hypothesis states that there is no relationship between the average number of points scored per game and whether a player lasts 5 years in the league, while the alternative hypothesis states that there is a relationship. The p-value for PTS is less than 2e-16, far below 0.05, so I reject the null hypothesis and conclude that there is significant evidence of a relationship between PTS and the log-odds of TARGET_5Yrs = 1. This shows an association between PTS and TARGET_5Yrs, not that PTS causes longer careers. Additionally, the residual deviance is 1620.3 on 1338 degrees of freedom, compared with a null deviance of 1779.5.

Part 2E - Construct a confidence interval for the odds ratio and include a sentence interpreting the interval in the context.

exp(confint.default(mod2))
##                 2.5 %    97.5 %
## (Intercept) 0.3599377 0.5825534
## PTS         1.1821452 1.2734262

I am 95% confident that the odds ratio for points per game falls between 1.1821452 and 1.2734262. The whole interval is above 1, which means that each additional point scored per game multiplies the odds of a player making it 5 years in the league by a factor between about 1.18 and 1.27. The interval for the intercept (0.3599377 to 0.5825534) is not an odds ratio; it estimates the odds of lasting 5 years for a player averaging 0 points per game.

Part 2F - Compute the G-statistic and use it to test the effectiveness of your model.

G2 <- 1779.5 - 1620.3
G2
## [1] 159.2
1-pchisq(G2, 1)
## [1] 0

H0: β1 = 0 (the model with PTS is no better than the model with no predictors) ; Ha: β1 != 0

The G-statistic for mod2 (TARGET_5Yrs ~ PTS) is 159.2, the difference between the null deviance and the residual deviance. To test the effectiveness of the model, I compared G to a chi-squared distribution with 1 degree of freedom (1339 - 1338) and computed the upper-tail probability as 1 - pchisq(G2, 1). The output is 0 because the p-value is too small to display, so I reject the null hypothesis and conclude that the model with PTS is significantly more effective than the model with no predictors. This test does not tell me whether PTS is the only useful predictor; other variables may still improve the model.

Part H - Compare the effectiveness of your two models to each other.

summary(mod1)
## 
## Call:
## glm(formula = TARGET_5Yrs ~ GP, family = binomial, data = nbaData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.9163  -1.0413   0.6176   0.8635   1.9361  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.524465   0.226275  -11.16   <2e-16 ***
## GP           0.051059   0.003749   13.62   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1779.5  on 1339  degrees of freedom
## Residual deviance: 1561.3  on 1338  degrees of freedom
## AIC: 1565.3
## 
## Number of Fisher Scoring iterations: 4
summary(mod2)
## 
## Call:
## glm(formula = TARGET_5Yrs ~ PTS, family = binomial, data = nbaData)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7285  -1.1412   0.6146   1.0190   1.4300  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -0.78108    0.12283  -6.359 2.03e-10 ***
## PTS          0.20452    0.01897  10.778  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1779.5  on 1339  degrees of freedom
## Residual deviance: 1620.3  on 1338  degrees of freedom
## AIC: 1624.3
## 
## Number of Fisher Scoring iterations: 4
G1 <- 1779.5 - 1561.3
G1
## [1] 218.2
1-pchisq(G1, 1)
## [1] 0
G2 <- 1779.5 - 1620.3
G2
## [1] 159.2
1-pchisq(G2, 1)
## [1] 0

The coefficient p-values for both models (TARGET_5Yrs ~ GP and TARGET_5Yrs ~ PTS) are less than 2e-16, so in both cases I reject the null hypothesis. I conclude that there is significant evidence of a relationship between the number of games played and whether a player makes it 5 years in the league, and between the average number of points scored per game and whether a player makes it 5 years in the league. The G-tests agree: both p-values print as 0 because they are too small to display, so both models are significantly more effective than the model with no predictors. Comparing the two, mod1 (GP) has the larger G-statistic (218.2 vs. 159.2), the smaller residual deviance (1561.3 vs. 1620.3), and the smaller AIC (1565.3 vs. 1624.3), so games played is the more effective single predictor. These tests do not show that either variable is the only useful predictor, which encourages me to test models that use both predictors together, or that add even more via a subset method, to find the most effective model.
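One way to line up the two models is a small comparison table. The sketch below only rearranges the deviance and AIC values already reported in the summaries above; the column names are chosen for this example.

```r
# Values copied from summary(mod1) and summary(mod2)
cmp <- data.frame(
  model     = c("mod1: GP", "mod2: PTS"),
  null_dev  = c(1779.5, 1779.5),
  resid_dev = c(1561.3, 1620.3),
  AIC       = c(1565.3, 1624.3)
)

# G-statistic: drop in deviance relative to the null model
cmp$G <- cmp$null_dev - cmp$resid_dev

# AIC difference from the best (lowest-AIC) model
cmp$delta_AIC <- cmp$AIC - min(cmp$AIC)

print(cmp)
```

Both models have the same number of parameters and the same null deviance, so the larger G and the lower AIC point to the same model.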