Objectives 🎯✅

In the following assignment, our objectives are to:

  1. Predict PTS

  2. Determine the most important features

  3. Compare models

  4. Report findings

Subroutines)

As discussed in the textbook, the following sequential subroutines are required:

Basics 📚🔤

Variable : Definition
Team : The team's abbreviation.
Match Up : The teams involved in the game.
Game Date : The date the game was played.
W/L : Whether the team won or lost the game.
MIN : Total minutes played by the team.
PTS : Total points scored by the team.
FGM : Field goals made.
FGA : Field goals attempted.
FG% : Field goal percentage (FGM/FGA * 100).
3PM : Three-point shots made.
3PA : Three-point shots attempted.
3P% : Three-point shooting percentage (3PM/3PA * 100).
FTM : Free throws made.
FTA : Free throws attempted.
FT% : Free throw percentage (FTM/FTA * 100).
OREB : Offensive rebounds.
DREB : Defensive rebounds.
REB : Total rebounds (OREB + DREB).
AST : Assists made by the team.
STL : Steals made by the team.
BLK : Blocks made by the team.
TOV : Turnovers committed.
PF : Personal fouls committed.
+/- : The point differential when this team was on the court.

Facts About the Data :

  • 30 Unique Teams

  • Top 3 Winning Teams :

    • BOS โ†’ Boston Celtics

    • DEN โ†’ Denver Nuggets

    • OKC โ†’ Oklahoma City Thunder

Analysis 🔍📊

Scatter Plot Analysis)

PTS vs. MIN


  • Data Clumping / Stacked Data

  • Categorical Variable :

    • There are clear groups; therefore, for this sort of variable, it is helpful to encode it as a categorical variable
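As a rough sketch of how such a scatter could be drawn (assuming the game-level data frame is named `df`, with the MIN and PTS columns from the Basics table above):

library(ggplot2)

# one point per team-game; jitter helps reveal stacked values (team minutes cluster at 240 in regulation)
ggplot(df, aes(x = MIN, y = PTS)) +
  geom_jitter(width = 0.3, alpha = 0.4) +
  labs(x = "Minutes played (MIN)", y = "Points scored (PTS)")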

PTS vs. FGM


  • Clear Positive Correlation

  • Data Clumping / Stacked Data

PTS vs. FGA


  • Randomly Scattered

  • Data Clumping / Stacked Data

[Scatter plot]

  • Positive Correlation
[Scatter plot]

  • Positive correlation

  • High Variance

  • Data Clumping / Stacked Data

[Scatter plot]

  • Randomly Scattered

  • Centered around (35, 110)

  • Data Clumping / Stacked Data

[Scatter plot]

  • Positive Correlation

  • High Variance

  • Centralized around average value (38, 110)

[Scatter plot]

  • Slight positive correlation

  • Randomly scattered

  • Data Clumping / Stacked Data

Who wins? [ 30 Teams ]

## # A tibble: 10 × 2
##    Team  who_wins
##    <chr>    <dbl>
##  1 BOS         64
##  2 DEN         57
##  3 OKC         57
##  4 MIN         56
##  5 LAC         51
##  6 DAL         50
##  7 NYK         50
##  8 MIL         49
##  9 NOP         49
## 10 PHX         49
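A minimal sketch of how a win count like this could be produced (assuming `df` as above, and that the W/L column stores "W" for wins; both the data-frame name and that coding are assumptions):

library(dplyr)

# number of wins per team, top 10 (the column name W/L needs backticks in R)
df %>%
  filter(`W/L` == "W") %>%
  count(Team, name = "who_wins", sort = TRUE) %>%
  slice_head(n = 10)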

Wins over time [ 2023-10-24 to 2024-04-14 ]

Notice the general trend that teams that win tend to keep winning. So if we include the team as a categorical predictor, which team is playing will likely matter a great deal for predicting points.

Predicted Distribution )

In our analysis of the data, we place an emphasis on analyzing and modeling the variable we are trying to predict, so that we know the "ins and outs" that make up the essence of that variable.

POINTS : \(Y\) : Discrete Dist.

This is our MOST IMPORTANT variable. This is what we want to predict! So how is it distributed? And what is clear about this distribution?

Prepare your S.O.C.S. – they will be blown off 🧦!

Shape :

Note that the following diagram is a density (pdf), not a raw-count histogram.

In other words, the total area sums to 1.
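A minimal sketch of such a density-scaled histogram (assuming `df` as above; `after_stat(density)` is the current ggplot2 spelling that replaced the deprecated `..density..`):

library(ggplot2)

# histogram rescaled so the bar areas sum to 1, with a kernel density overlaid
ggplot(df, aes(x = PTS)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30) +
  geom_density(linewidth = 1)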


  • The data appears to be roughly normally distributed: \(Y \sim N(\mu_{y} = 114, \sigma_{y} = 13)\)
  • The normal approximation appears to fit poorly in the 90-110 range

Although normality of the response is not strictly required for linear regression, it helps ensure valid statistical inference: it leads to more reliable estimates, accurate confidence intervals, and valid hypothesis tests when using parametric methods like t-tests or F-tests, and with large samples the Central Limit Theorem provides similar protection even without it.

Outliers :

Consider a simplistic, IQR-based measure of outliers:
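As a sketch, the usual 1.5 × IQR fence rule applied to PTS (assuming `df` as above) would be:

# 1.5 * IQR fences for PTS; games outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] get flagged
q1  <- quantile(df$PTS, 0.25, na.rm = TRUE)
q3  <- quantile(df$PTS, 0.75, na.rm = TRUE)
iqr <- q3 - q1
sum(df$PTS < q1 - 1.5 * iqr | df$PTS > q3 + 1.5 * iqr, na.rm = TRUE)  # number of flagged games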

However, like any tool there are assumptions we make to say it is a good measure. For IQR to work well, we want :

  • Symmetrically Distributed Data

  • Uni-Modal Data

  • A sufficiently large sample size

  • Minimal Skew

Centrality :

  • Mean = 114.2

  • Median = 114

Spread :

The standard deviation ( \(\hat{\sigma}\) ) is about 13; on average, a game's score falls roughly 13 points from the mean.

Predictor Distributions )

We will analyze these only roughly; the objective of the assignment is to use regression methods to predict PTS and to develop the "best" model for that.

Field Goals : \(F_g\) : \(X_1\) :


Shape

  • FGM : Dist. is roughly normal; there is data stacking for certain values.

  • FGA : Looks bi-modal

Outliers

boxplot(df$FGM)

boxplot(df$FGA)

Centrality

Spread

Three Pointers : \(T_p\) : \(X_2\) :

Free Throws : \(F_t\) : \(X_3\) :

## Warning: Removed 1 row containing non-finite outside the scale range
## (`stat_bin()`).

Rebounds : \(R_b\) : \(X_4\) :

Defense & Assists : \(X_5\) :

Negative Performance Metrics : \(N_p\) : \(X_6\) :

Modeling 🤖

Stepwise regression is a model-selection technique that iteratively adds or removes predictors based on a chosen goodness-of-fit criterion and on diagnostics, improving model fit while controlling complexity. In this project, I will use forward stepwise regression to build a predictive model, starting with no predictors and sequentially adding the most statistically significant variables.
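A minimal sketch of forward selection by AIC with base R's `step()`, assuming the training split `train_data` created in the next subsection and the candidate predictors compared later in this report:

null_model <- lm(PTS ~ 1, data = train_data)
full_model <- lm(PTS ~ AST + TOV + DREB + OREB + PF + BLK + STL, data = train_data)

# start from the intercept-only model and add one predictor at a time while AIC improves
forward_fit <- step(null_model,
                    scope = list(lower = null_model, upper = full_model),
                    direction = "forward")
summary(forward_fit)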

First and foremost, consider the underlying model for Predicting Points :
\[ PTS = 3 *\text{ Three Pointers } + 2 * \text{ Two Pointers} + \text{Free Throws} \]
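Since FGM counts both two-point and three-point makes, this identity can be rewritten directly in terms of the columns above:

\[ PTS = 2\,(\text{FGM} - \text{3PM}) + 3 \cdot \text{3PM} + \text{FTM} = 2 \cdot \text{FGM} + \text{3PM} + \text{FTM} \]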

As a result, I will drop these columns (they determine PTS exactly and would leak the answer into the model), along with other irrelevant columns.

Testing Vs. Training Data)

First and foremost, we want to be able to test the predictive power of our model, so we will split our data into training data and testing data: 80% training and 20% testing:

set.seed(123)  # for reproducibility
sample_indices <- sample(seq_len(nrow(df)), size = 0.8 * nrow(df))

train_data <- df[sample_indices, ]
test_data <- df[-sample_indices, ]

Suppose we only consider the intercept, i.e., the average:

## 
## Call:
## lm(formula = PTS ~ 1, data = test_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.972  -8.972  -0.972   8.028  43.028 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 113.9715     0.5564   204.8   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.34 on 491 degrees of freedom

The average (intercept-only) model on the test set predicts a constant mean score of 113.97 points per game. It is statistically significant (p < 2e-16), but the residual standard error of 12.34 indicates large variation in actual scores not captured by this model. It serves as a simple baseline for evaluating future predictive models. Note that the RSE is approximately the standard deviation of points; in fact, for an intercept-only model they are mathematically equivalent, since \(\hat{y}_i = \bar{y}\) and \(p = 0\):

\[ \text{SD} = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n}(y_i - \bar{y})^2} \]

\[ \text{RSE} = \sqrt{\frac{1}{n - p - 1} \sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \]

Consider the typical points per team and other summary statistics:

## # A tibble: 10 × 3
##    Team  avg_pts sd_pts
##    <chr>   <dbl>  <dbl>
##  1 IND      123.   13.6
##  2 BOS      121.   12.4
##  3 OKC      120.   11.9
##  4 MIL      119.   13.4
##  5 ATL      118.   12.7
##  6 LAL      118.   12.4
##  7 DAL      118.   14.3
##  8 GSW      118.   11.7
##  9 SAC      117.   12.9
## 10 PHX      116.   10.6
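A minimal sketch of how a per-team summary like this could be computed (again assuming `df` as above):

library(dplyr)

# average and game-to-game standard deviation of points per team, top 10 by average
df %>%
  group_by(Team) %>%
  summarise(avg_pts = mean(PTS, na.rm = TRUE),
            sd_pts  = sd(PTS, na.rm = TRUE)) %>%
  arrange(desc(avg_pts)) %>%
  slice_head(n = 10)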

So a typical team scores well over 100 points per game, with a game-to-game standard deviation of roughly 10 to 14 points.

Variable Selection ✅

SLR)

Diagnostics, Goodness-Of-Fit Measures & Influential Points) \[ r_{i} \notin [-2, 2] \\ h_{ii} \geq \frac{2(p+1)}{n} \\ D_{i} \geq \frac{4}{n} \]
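The violation percentages reported below could be computed along these lines; this is a sketch, with `diagnostic_rates` an illustrative helper name, assuming a fitted `lm` object like the single-predictor models compared here:

# share (%) of observations violating each influence criterion for a fitted lm
diagnostic_rates <- function(m) {
  n <- nobs(m)
  p <- length(coef(m)) - 1           # number of predictors
  r <- rstandard(m)                  # standardized residuals
  h <- hatvalues(m)                  # leverage values
  d <- cooks.distance(m)             # Cook's distances
  100 * c(residuals = mean(abs(r) > 2),
          leverage  = mean(h >= 2 * (p + 1) / n),
          cooks     = mean(d >= 4 / n))
}

# e.g. diagnostic_rates(lm(PTS ~ AST, data = train_data))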

##   model_name    r_squared      AIC      BIC
## 1  model_ast 3.592584e-01 14800.88 14817.64
## 2  model_tov 3.532159e-02 15606.13 15622.88
## 3 model_dreb 2.852684e-02 15619.94 15636.69
## 4   model_pf 2.493109e-02 15627.21 15643.96
## 5  model_blk 9.655196e-03 15657.80 15674.56
## 6  model_stl 1.579014e-03 15673.79 15690.54
## 7 model_oreb 2.187032e-05 15676.85 15693.61
##   model_name violates_standardized_residuals violates_standardized_leverage
## 1   model_pf                            4.57                           7.72
## 2  model_tov                            4.47                           6.71
## 3 model_dreb                            4.88                           7.88
## 4  model_stl                            4.67                           7.01
## 5 model_oreb                            4.78                           5.39
## 6  model_blk                            4.78                           7.27
## 7  model_ast                            4.27                           7.72
##   violates_cooks
## 1           5.84
## 2           5.74
## 3           5.28
## 4           5.23
## 5           4.83
## 6           4.78
## 7           4.67

Plot)

Based upon the goodness-of-fit measures, diagnostics, and analysis of influential points, I believe the best model is model_ast. Specifically, it has the highest \(R^2\) (i.e., it describes the most variation in \(y_i\): about 35%). Furthermore, consider the influential-points analysis: it has the lowest percentage of violations for both standardized residuals and Cook's distance (so minimal outliers and overall influential points: about 4% and 5%, respectively). However, it does have the second-highest percentage of leverage violations (about 8%), so there are many observations with extreme x-values (i.e., far from the average AST value, \(\mu_x = 27\)). Lastly, diagnostically, the residuals appear roughly normally distributed (QQ-plot) apart from a few high-leverage points, and the variance looks approximately constant, with values somewhat more spread out around fitted values near 115 and less so above and below.

## 
## Call:
## lm(formula = PTS ~ AST, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -29.666  -7.211  -0.231   6.743  41.386 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 73.91870    1.23774   59.72   <2e-16 ***
## AST          1.51299    0.04557   33.20   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.39 on 1966 degrees of freedom
## Multiple R-squared:  0.3593, Adjusted R-squared:  0.3589 
## F-statistic:  1102 on 1 and 1966 DF,  p-value: < 2.2e-16

The model shows that assists (AST) are a strong predictor of points (PTS), with each additional assist contributing approximately 1.51 more points on average; the relationship is highly significant (p < 2e-16) and explains about 36% of the variation in scoring, which aligns with the basketball logic that assists directly facilitate scoring.

MLR – 2 Variables)

##       model_name r_squared      AIC      BIC    F_stat      p_value
## 1   model_ast_pf 0.3860397 14718.86 14741.20 85.714458 5.267015e-20
## 2  model_ast_tov 0.3812884 14734.03 14756.37 69.966361 1.127900e-16
## 3 model_ast_dreb 0.3651016 14784.85 14807.19 18.084534 2.212248e-05
## 4 model_ast_oreb 0.3635426 14789.68 14812.02 13.227156 2.830307e-04
## 5  model_ast_blk 0.3622656 14793.62 14815.96  9.265864 2.365607e-03
## 6  model_ast_stl 0.3596542 14801.67 14824.01  1.214740 2.705303e-01
##       model_name violates_standardized_residuals violates_standardized_leverage
## 1 model_ast_dreb                            4.07                           7.67
## 2  model_ast_tov                            4.12                           8.08
## 3  model_ast_stl                            4.22                           8.03
## 4  model_ast_blk                            4.07                           7.77
## 5 model_ast_oreb                            3.91                           6.81
## 6   model_ast_pf                            3.76                           8.33
##   violates_cooks
## 1           5.28
## 2           4.88
## 3           4.78
## 4           4.67
## 5           4.52
## 6           4.42

I believe the best-performing model is model_ast_blk. Its partial F-test p-value (about 2.4e-03) indicates that including the additional variable is significant. Furthermore, only about 4% of observations violate the outlier condition ( \(r_i\) ), about 8% violate the extreme-x-value condition ( \(h_{ii}\) ), and about 5% are overall influential points ( \(D_i\) ); the candidate models show approximately similar values on all of these. Lastly, diagnostically, the residual plot appears patternless with roughly constant variance, and the residuals appear normally distributed (QQ-plot), with notable exceptions near the extremes.

Visualizing 3D Model)

## 
## Call:
## lm(formula = PTS ~ AST + BLK, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -30.055  -7.096  -0.007   6.712  41.092 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 72.78171    1.29039  56.403  < 2e-16 ***
## AST          1.50290    0.04560  32.962  < 2e-16 ***
## BLK          0.27397    0.09001   3.044  0.00237 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.36 on 1965 degrees of freedom
## Multiple R-squared:  0.3623, Adjusted R-squared:  0.3616 
## F-statistic: 558.1 on 2 and 1965 DF,  p-value: < 2.2e-16
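The interactive figure itself is not reproduced here. A minimal sketch of how the fitted plane could be drawn with plotly, assuming the fitted object is named `model_ast_blk` (matching the model names used above):

library(plotly)

# grid of AST and BLK values spanning the training data
ast_seq <- seq(min(train_data$AST), max(train_data$AST), length.out = 30)
blk_seq <- seq(min(train_data$BLK), max(train_data$BLK), length.out = 30)

# predicted PTS over the grid; plotly surfaces index z as [y, x], hence the transpose
plane <- t(outer(ast_seq, blk_seq,
                 function(a, b) predict(model_ast_blk, data.frame(AST = a, BLK = b))))

plot_ly() %>%
  add_markers(data = train_data, x = ~AST, y = ~BLK, z = ~PTS,
              marker = list(size = 2)) %>%
  add_surface(x = ast_seq, y = blk_seq, z = plane, showscale = FALSE)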

The model shows that Assists (AST) and Blocks (BLK) are both significant predictors of team points. Each additional assist adds about 1.50 points, while each block adds about 0.27 points. Assists directly lead to made shots, and blocks can create fast-break opportunities or shift momentum, indirectly leading to more scoring. So, it makes sense that both contribute to higher point totals.


So, based on the standard deviation, we will now see which model has the best accuracy, counting a prediction as accurate when it lands within one standard deviation (about 13 points) of the actual score.
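A sketch of this accuracy calculation on the held-out test set; the object names `model_ast` and `model_ast_blk` are assumed to match the model names used in the tables above:

# percent of test games predicted within 13 points (about one SD) of the actual score
within_13 <- function(m, newdata = test_data) {
  100 * mean(abs(newdata$PTS - predict(m, newdata = newdata)) <= 13)
}

within_13(model_ast)       # SLR: PTS ~ AST
within_13(model_ast_blk)   # MLR: PTS ~ AST + BLK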

##                Model Accuracy_within_13pts
## 1      2D: PTS ~ AST                81.504
## 2 3D: PTS ~ AST + PF                81.504

When the improvement in model accuracy is minimal (such as the 0.4% gain observed with the more complex 3D model), it may not justify the added complexity. Considering that one standard deviation in points is around 13 PTS, this small increase in accuracy falls well within expected variability. As the textbook emphasizes, adding predictors that do not significantly improve performance can lead to overfitting, reducing the model's generalizability. Given this, it may be best to choose the simpler model for its interpretability and stability.


Report 📄

Assists! As it turns out, teamwork has a strong relationship to success: how many ASSISTS a team records impacts the PTS the team scores. In other words, consider that an assist can lead to either a three-pointer or a two-pointer, the heaviest-weighted shots, indicating that it is likely infrequent for teams to get most of their points from free throws.