Week 8 Data Dive - Regression

Loading Data

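library(tidyverse)  # dplyr verbs, the %>% pipe, and ggplot2
# Keep one row per player, season, and team
nba <- nba %>%
  distinct(Year, Player, Tm, .keep_all = TRUE)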

Create PPG column

nba <- nba %>%
  mutate(PPG = PTS/G, .after = PTS)

Select Variables

Continuous variable - PPG

Categorical variable - Position

Hypotheses:

Ho: The mean PPG is equal across all positions.

Ha: At least one position has a different mean PPG.

nba %>%
  group_by(Pos) %>%
  summarise(meanPPG = mean(PPG),
            sdPPG = sd(PPG))

## # A tibble: 5 × 3
##   Pos   meanPPG sdPPG
##   <chr>   <dbl> <dbl>
## 1 C        6.94  5.55
## 2 PF       7.72  5.85
## 3 PG       8.10  5.65
## 4 SF       8.75  6.32
## 5 SG       9.00  6.33

ANOVA Test

m <- aov(PPG ~ Pos, data = nba)
summary(m)

##                Df Sum Sq Mean Sq F value Pr(>F)    
## Pos             4  11745  2936.2   82.94 <2e-16 ***
## Residuals   21506 761342    35.4                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the F test (F = 82.94 on 4 and 21506 degrees of freedom, p < 2e-16), we reject the null hypothesis: there is strong evidence that at least one position has a different mean PPG than another. The pairwise comparisons below, adjusted with the Bonferroni correction, show which positions differ significantly.
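As a quick check on the ANOVA table, the F statistic can be rebuilt from the sums of squares and degrees of freedom printed above. This is only a sketch that reuses the rounded values from the output, so the last digits may differ slightly; the variable names are our own.

```r
# Values copied (rounded) from the ANOVA table above
ss_between <- 11745;  df_between <- 4
ss_within  <- 761342; df_within  <- 21506

ms_between <- ss_between / df_between   # mean square for Pos
ms_within  <- ss_within / df_within     # residual mean square
f_stat     <- ms_between / ms_within    # about 82.94, matching the table

# Upper-tail probability of the F distribution at the observed statistic
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)
```

A large F means the variation between position means is large relative to the variation within positions, which is what drives the tiny p-value.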

Pairwise Comparisons

pairwise.t.test(nba$PPG, nba$Pos, p.adjust.method = "bonferroni")

## 
##  Pairwise comparisons using t tests with pooled SD 
## 
## data:  nba$PPG and nba$Pos 
## 
##    C       PF      PG      SF   
## PF 5.9e-09 -       -       -    
## PG < 2e-16 0.026   -       -    
## SF < 2e-16 1.1e-14 7.3e-06 -    
## SG < 2e-16 < 2e-16 2.6e-11 0.529
## 
## P value adjustment method: bonferroni

After the Bonferroni correction, the only pair of positions without a significantly different mean PPG is SF/SG (adjusted p = 0.529). Every other pair differs significantly at the 0.05 level, although the evidence is weakest for PF/PG (adjusted p = 0.026).
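The Bonferroni adjustment itself is simple: each raw p-value is multiplied by the number of comparisons (here choose(5, 2) = 10 position pairs) and capped at 1. The sketch below illustrates this with `p.adjust()`. The raw p-values are hypothetical, back-calculated from the adjusted table purely for illustration.

```r
n_pairs <- choose(5, 2)   # 10 pairwise comparisons among 5 positions

# Hypothetical raw p-values (adjusted values from the table divided by 10)
raw_p <- c(SF_SG = 0.0529, PF_PG = 0.0026)

# Bonferroni: multiply by the number of comparisons, cap at 1
adj_p <- p.adjust(raw_p, method = "bonferroni", n = n_pairs)
adj_p                     # same as pmin(1, raw_p * n_pairs)
```

This is why SF/SG fails to reach significance here even though its raw p-value would sit close to 0.05.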

Linear Regression

m_data <- nba %>%
  filter(G > 10)
m_data %>%
  ggplot() +
  geom_point(mapping = aes(x = `USG%`, y = PPG)) +
  labs(title = "USG% vs PPG (players with more than 10 games)")

m <- lm(m_data$PPG ~ m_data$`USG%`)
summary(m)

## 
## Call:
## lm(formula = m_data$PPG ~ m_data$`USG%`)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.3585  -2.6696   0.1286   2.9744  15.8407 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -6.793641   0.124948  -54.37   <2e-16 ***
## m_data$`USG%`  0.830660   0.006428  129.23   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.321 on 18835 degrees of freedom
## Multiple R-squared:   0.47,  Adjusted R-squared:  0.4699 
## F-statistic: 1.67e+04 on 1 and 18835 DF,  p-value: < 2.2e-16

The regression output gives strong evidence that USG% is a significant predictor of PPG (p < 2e-16), and the model explains about 47% of the variance in PPG. Each additional percentage point of USG% is associated with an expected increase of about 0.83 PPG. This positive relationship makes sense: a player who is featured more in the offense is likely to score more points.
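To make the slope concrete, here is a small sketch that plugs usage rates into the fitted line using the coefficients printed above. The helper `predict_ppg()` is our own name, not part of the analysis.

```r
# Coefficients copied from the regression output above
intercept <- -6.793641
slope     <-  0.830660

# Fitted line: expected PPG for a given usage rate (in percentage points)
predict_ppg <- function(usg) intercept + slope * usg

predict_ppg(20)                     # expected PPG at USG% = 20 (about 9.82)
predict_ppg(21) - predict_ppg(20)   # one extra point of USG% adds the slope, 0.83
```

Because the intercept is negative, the line predicts negative PPG for USG% below roughly 8.2, so it should not be read literally at very low usage rates.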

m_data %>%
  ggplot() +
  geom_point(mapping = aes(x=`USG%`, y=PPG)) +
  geom_abline(slope = m$coefficients[2], intercept = m$coefficients[1], color = "red") +
  labs(title = "Linear Regression - Using USG% to Predict PPG")

Diagnostic Plots

library(ggfortify)  # provides autoplot() for lm objects
autoplot(m, which = 1:6, ncol = 2, label.size = 3)

Residuals vs Fitted - Here we can see some mild non-linearity which is also seen in the basic scatter plot. This is caused by the points with high USG% and low PPG.

Normal QQ - At the lower left end of the plot the dots appear to fall off the line. The rest of the plot looks good.

Scale-Location - The residuals do not appear to be spread evenly across all fitted values. As the fitted value increases, the spread of the residuals appears to increase slightly as well.

Cook’s distance - A couple of points have a larger influence on the model than the rest. However, their values are much lower than the level that would be needed to cause concern.

Residuals vs Leverage - As in a couple of the other plots, a few points appear to have more influence on the model than the rest, most notably the point labeled 7560.

Cook’s dist vs Leverage - Observation 7560 appears to have a much larger influence on the model than any other point.
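The influence plots can also be checked numerically with `cooks.distance()`. The sketch below uses a small simulated data set as a stand-in for the real model (the data, the planted outlier, and all names are made up for illustration), and shows how the most influential observation can be pulled out by label.

```r
# Toy data standing in for the real model: 30 ordinary points plus one planted outlier
set.seed(42)
toy <- data.frame(x = c(rnorm(30), 8))
toy$y <- 2 * toy$x + rnorm(31)
toy$y[31] <- -10                       # far from the trend and from the other x values

fit <- lm(y ~ x, data = toy)
cd  <- cooks.distance(fit)             # one Cook's distance per observation

names(which.max(cd))                   # label of the most influential row
head(sort(cd, decreasing = TRUE), 3)   # the three largest values
```

On the real model the same call would be `cooks.distance(m)`. Because `m` was fitted on columns of `m_data`, the label 7560 should correspond to row 7560 of `m_data`.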

Linear Regression Pt. 2

Make Position variable a factor

m_data$Pos <- factor(m_data$Pos)
m_data <- m_data %>%
  mutate(MPG = MP/G)

Create model with Position and PER (Player Efficiency Rating)

m <- lm(m_data$PPG ~ m_data$`USG%` + m_data$Pos + m_data$PER)
summary(m)

## 
## Call:
## lm(formula = m_data$PPG ~ m_data$`USG%` + m_data$Pos + m_data$PER)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -30.2695  -1.9038   0.2031   2.2095  10.0455 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   -10.046326   0.104542 -96.099   <2e-16 ***
## m_data$`USG%`   0.404079   0.006148  65.721   <2e-16 ***
## m_data$PosPF    0.738244   0.074796   9.870   <2e-16 ***
## m_data$PosPG    0.748242   0.077172   9.696   <2e-16 ***
## m_data$PosSF    1.966518   0.078064  25.191   <2e-16 ***
## m_data$PosSG    2.046430   0.078425  26.094   <2e-16 ***
## m_data$PER      0.774682   0.006565 117.994   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.27 on 18830 degrees of freedom
## Multiple R-squared:  0.6966, Adjusted R-squared:  0.6965 
## F-statistic:  7204 on 6 and 18830 DF,  p-value: < 2.2e-16
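Because Pos is a factor, R uses its first level (C, alphabetically) as the baseline, so each Pos coefficient is a shift relative to centers with USG% and PER held fixed. Here is a small sketch with the coefficients copied from the output above. The helper name and the example player (USG% = 20, PER = 15) are hypothetical.

```r
# Coefficients copied from the output above; C is the baseline level (shift of 0)
b0    <- -10.046326
b_usg <-   0.404079
b_per <-   0.774682
b_pos <- c(C = 0, PF = 0.738244, PG = 0.748242, SF = 1.966518, SG = 2.046430)

# Expected PPG for a given position, usage rate, and PER
predict_ppg2 <- function(pos, usg, per) {
  unname(b0 + b_usg * usg + b_pos[pos] + b_per * per)
}

predict_ppg2("C", 20, 15)                                # hypothetical center
predict_ppg2("SG", 20, 15) - predict_ppg2("C", 20, 15)   # the SG shift, about 2.05
```

Read this way, a shooting guard with the same usage rate and efficiency rating as a center is expected to score about 2 more points per game.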

Adding PER and Pos (position) to the model increased the R-squared from 0.47 to 0.6966 and, more importantly, the adjusted R-squared from 0.4699 to 0.6965, which is a substantial increase. All terms in the model also have highly significant p-values, suggesting that they are all significant predictors of PPG.
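Adjusted R-squared penalizes R-squared for the number of predictors, which is why it is the fairer number to compare across the two models. Below is a sketch of the formula using values from the second model's output, where n is recovered as the residual degrees of freedom plus the seven estimated coefficients.

```r
r2     <- 0.6966    # Multiple R-squared from the output (rounded)
df_res <- 18830     # residual degrees of freedom
n_coef <- 7         # intercept + 6 slope terms
n      <- df_res + n_coef

# Adjusted R-squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1), with p = n_coef - 1,
# so the denominator n - p - 1 equals the residual degrees of freedom
adj_r2 <- 1 - (1 - r2) * (n - 1) / df_res
round(adj_r2, 4)    # about 0.6965; rounding of r2 can shift the last digit
```

With almost 19,000 observations the penalty for six predictors is tiny, so nearly all of the jump in adjusted R-squared reflects a real improvement in fit.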

autoplot(m, which = 1:6, ncol = 2, label.size = 3)

Some of the diagnostic plots for this model look similar to those of the previous model. For example, the Normal QQ plot still has a group of points that fall off the line at the lower quantiles, which suggests that the residuals are likely skewed rather than normally distributed. Observation 7560 still has the largest Cook’s distance and therefore still has more influence on the model than the average observation. The biggest improvements over the previous model can be seen in the Scale-Location plot and the Residuals vs Leverage plot. In the Scale-Location plot the line is much closer to horizontal, suggesting that the residuals are spread more evenly across the fitted values.