In the following assignment, our objectives are to:
Predict PTS
Determine the most important features
Compare models
Report findings
As discussed in the textbook, this requires a sequence of subroutines. The data set contains the following variables:
Variable | Definition |
---|---|
Team | The team (three-letter abbreviation). |
Match Up | The teams involved in the game |
Game Date | The date the game was played. |
W/L | Whether the team won or lost the game. |
MIN | Total minutes played by the team. |
PTS | Total points scored by the team. |
FGM | Field goals made. |
FGA | Field goals attempted. |
FG% | Field goal percentage (FGM/FGA * 100). |
3PM | Three-point shots made. |
3PA | Three-point shots attempted. |
3P% | Three-point shooting percentage (3PM/3PA * 100). |
FTM | Free throws made. |
FTA | Free throws attempted. |
FT% | Free throw percentage (FTM/FTA * 100). |
OREB | Offensive rebounds. |
DREB | Defensive rebounds. |
REB | Total rebounds (OREB + DREB). |
AST | Assists made by the team. |
STL | Steals made by the team. |
BLK | Blocks made by the team. |
TOV | Turnovers committed. |
PF | Personal fouls committed. |
+/- | The point differential when this team was on the court. |
Facts About the Data :
30 Unique Teams
Top 3 Winning Teams :
BOS (Boston Celtics)
DEN (Denver Nuggets)
OKC (Oklahoma City Thunder)
Observations from the exploratory plots (one line per plot; figures not reproduced here):
Categorical variable; data clumping / stacked data
Clear positive correlation; data clumping / stacked data
Randomly scattered; data clumping / stacked data
Positive correlation; high variance; data clumping / stacked data
Randomly scattered; centered around (35, 110); data clumping / stacked data
Positive correlation; high variance; centered around the average value (38, 110)
Slight positive correlation; randomly scattered; data clumping / stacked data
## # A tibble: 10 × 2
## Team who_wins
## <chr> <dbl>
## 1 BOS 64
## 2 DEN 57
## 3 OKC 57
## 4 MIN 56
## 5 LAC 51
## 6 DAL 50
## 7 NYK 50
## 8 MIL 49
## 9 NOP 49
## 10 PHX 49
Notice the general trend that teams that win tend to keep winning. So, if we include Team as a categorical predictor, which team is playing will likely matter a great deal for predicting points.
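For illustration, here is a minimal sketch of treating Team as a categorical predictor (assuming the game-level data frame is called df, as in the train/test split later on):

```r
# Sketch: Team as a categorical (factor) predictor of PTS.
# `df` is assumed to be the game-level data frame used throughout this report.
team_fit <- lm(PTS ~ Team, data = df)  # R expands Team into one dummy variable per team
anova(team_fit)                        # tests whether which team is playing explains PTS
```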
In our analysis of the data, we put an emphasis on analyzing and modeling the variable we are trying to predict, so we know the "ins and outs" that make up the essence of that variable.
This is our MOST IMPORTANT variable. This is what we want to predict! So how is it distributed? And what is clear about this distribution?
Prepare your S.O.C.S. (Shape, Outliers, Center, Spread); they will be blown off 🧦!
Note that the following diagram is a probability density function (a density-scaled histogram), not a raw-count histogram. In other words, the area under it sums to 1.
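A minimal sketch of how such a density-scaled histogram could be drawn, assuming the game-level data frame df and ggplot2 (using after_stat(density) rather than the deprecated ..density.. notation):

```r
library(ggplot2)

# Density-scaled histogram of PTS: bar areas sum to 1, with a density curve overlaid.
ggplot(df, aes(x = PTS)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30) +
  geom_density() +
  labs(x = "PTS", y = "Density")
```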
Although normality of the variable is not strictly required for linear regression (the formal normality assumption applies to the residuals), it helps ensure valid statistical inference: more reliable estimates, accurate confidence intervals, and valid hypothesis tests when using parametric methods like t-tests or F-tests, aided by the Central Limit Theorem.
Consider a simple measure of outliers based on the IQR:
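A common version of this rule is the 1.5 × IQR fence (stated here in my own notation, which may differ slightly from the textbook's):

\[
y_i \text{ is flagged as an outlier if } \; y_i < Q_1 - 1.5\,\mathrm{IQR} \;\;\text{ or }\;\; y_i > Q_3 + 1.5\,\mathrm{IQR}, \qquad \mathrm{IQR} = Q_3 - Q_1 .
\]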
However, like any tool, there are assumptions we make for it to be a good measure. For the IQR rule to work well, we want:
Symmetrically distributed data
Uni-modal data
A sufficiently large sample size
Minimal skew
Mean = 114.2
Median = 114
The standard deviation ( \(\hat{\sigma}\) ) is about 13, meaning that on average games deviate from the mean by about 13 points.
We will only analyze these roughly; the objective of the assignment is to use regression methods to predict PTS and develop the "best" model for that.
FGM: the distribution is roughly normal; there is data stacking at certain values.
FGA: looks bimodal.
Stepwise regression is a model selection technique that iteratively adds or removes predictors based on a chosen goodness-of-fit criterion and diagnostics, improving model fit while controlling for complexity. In this project, I will use forward stepwise regression to build a predictive model, starting with no predictors and sequentially adding the most statistically significant variables.
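As a reference point, forward selection can also be automated with R's step(). This is only a sketch of the idea (step() selects by AIC, whereas below I compare candidate models manually using several criteria), and it assumes a training data frame like the train_data created in the split below:

```r
# Forward stepwise selection by AIC, from the intercept-only model up to the
# full set of candidate predictors considered later in this report.
null_model <- lm(PTS ~ 1, data = train_data)

forward_fit <- step(
  null_model,
  scope     = PTS ~ AST + TOV + DREB + OREB + STL + BLK + PF,  # upper scope
  direction = "forward",
  trace     = FALSE   # suppress the per-step printout
)
summary(forward_fit)
```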
First and foremost, consider the underlying model for predicting points:

\[
\text{PTS} = 3 \times \text{Three Pointers} + 2 \times \text{Two Pointers} + \text{Free Throws}
\]
As a result, I will be throwing these and other irrelevant columns out.
First, we want to be able to test the predictive power of our model, so we will split our data into training and testing sets: 80% training and 20% testing.
set.seed(123) # for reproducibility
sample_indices <- sample(seq_len(nrow(df)), size = floor(0.8 * nrow(df)))  # 80% of rows for training
train_data <- df[sample_indices, ]
test_data <- df[-sample_indices, ]
Suppose we only consider the intercept, i.e. the average:
##
## Call:
## lm(formula = PTS ~ 1, data = test_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.972 -8.972 -0.972 8.028 43.028
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 113.9715 0.5564 204.8 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.34 on 491 degrees of freedom
The average (intercept-only) model on the test set predicts a constant mean score of 113.97 points per game. It is statistically significant (p < 2e-16), but the residual standard error of 12.34 indicates large variation in actual scores not captured by this model. It serves as a simple baseline for evaluating future predictive models. Note that the RSE is approximately the standard deviation of points; in fact, for an intercept-only model the two are mathematically identical, since \(\hat{y}_i = \bar{y}\) and there are \(p = 0\) predictors:
\[ \text{SD} = \sqrt{\frac{1}{n - 1} \sum_{i=1}^{n}(y_i - \bar{y})^2} \]
\[ \text{RSE} = \sqrt{\frac{1}{n - p - 1} \sum_{i=1}^{n}(y_i - \hat{y}_i)^2} \]
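A quick sketch to verify this numerically, using sigma() to extract the RSE of the intercept-only fit on the test set:

```r
# For an intercept-only model fit on the test set, the RSE equals the sample SD of PTS.
null_fit <- lm(PTS ~ 1, data = test_data)
sigma(null_fit)    # residual standard error of the intercept-only model
sd(test_data$PTS)  # sample standard deviation of PTS; should match
```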
Consider the typical points per team and other summary statistics:
## # A tibble: 10 × 3
## Team avg_pts sd_pts
## <chr> <dbl> <dbl>
## 1 IND 123. 13.6
## 2 BOS 121. 12.4
## 3 OKC 120. 11.9
## 4 MIL 119. 13.4
## 5 ATL 118. 12.7
## 6 LAL 118. 12.4
## 7 DAL 118. 14.3
## 8 GSW 118. 11.7
## 9 SAC 117. 12.9
## 10 PHX 116. 10.6
So typically a team scores at least 100 points per game, with a standard deviation of at least 10.
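For reference, a minimal sketch of how a per-team summary like the one above could be computed, assuming dplyr and the game-level data frame df:

```r
library(dplyr)

# Average and spread of points per team, top 10 highest-scoring teams first.
df %>%
  group_by(Team) %>%
  summarise(avg_pts = mean(PTS), sd_pts = sd(PTS)) %>%
  arrange(desc(avg_pts)) %>%
  slice_head(n = 10)
```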
## model_name r_squared AIC BIC
## 1 model_ast 3.592584e-01 14800.88 14817.64
## 2 model_tov 3.532159e-02 15606.13 15622.88
## 3 model_dreb 2.852684e-02 15619.94 15636.69
## 4 model_pf 2.493109e-02 15627.21 15643.96
## 5 model_blk 9.655196e-03 15657.80 15674.56
## 6 model_stl 1.579014e-03 15673.79 15690.54
## 7 model_oreb 2.187032e-05 15676.85 15693.61
## model_name violates_standardized_residuals violates_standardized_leverage
## 1 model_pf 4.57 7.72
## 2 model_tov 4.47 6.71
## 3 model_dreb 4.88 7.88
## 4 model_stl 4.67 7.01
## 5 model_oreb 4.78 5.39
## 6 model_blk 4.78 7.27
## 7 model_ast 4.27 7.72
## violates_cooks
## 1 5.84
## 2 5.74
## 3 5.28
## 4 5.23
## 5 4.83
## 6 4.78
## 7 4.67
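For reference, a minimal sketch of how the goodness-of-fit table above could be produced (assuming train_data holds PTS and the candidate predictors; the diagnostic-violation percentages come from the influence measures sketched further below):

```r
# Fit each single-predictor model and collect R^2, AIC and BIC.
predictors <- c("AST", "TOV", "DREB", "PF", "BLK", "STL", "OREB")

fit_stats <- do.call(rbind, lapply(predictors, function(p) {
  fit <- lm(reformulate(p, response = "PTS"), data = train_data)
  data.frame(model_name = paste0("model_", tolower(p)),
             r_squared  = summary(fit)$r.squared,
             AIC        = AIC(fit),
             BIC        = BIC(fit))
}))

fit_stats[order(-fit_stats$r_squared), ]  # best-fitting model first
```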
Based upon the goodness-of-fit measures, diagnostics, and analysis of influential points, I believe the best model is model_ast. Specifically, it has the highest \(R^2\) (i.e., it describes the most variation in \(y_i\): about 35%). Furthermore, consider the influential-points analysis: it has the lowest percentage of violations for both standardized residuals and Cook's distance (so minimal outliers and overall influential points, about 4% and 5% respectively). However, it does have the second-most leverage violations (about 8%), so there are many observations with extreme x-values (i.e., far from the average AST value, \(\mu_x = 27\)). Lastly, diagnostically, the residuals appear roughly normally distributed (QQ-plot) apart from a high-leverage point, and the variance appears approximately constant, with values somewhat more spread out around 115 and less so further above and below.
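The violation percentages above can be computed from standard influence measures. This sketch uses common rule-of-thumb cutoffs (|standardized residual| > 2, leverage > 2p/n, Cook's distance > 4/n), which may differ slightly from the exact thresholds behind the tables:

```r
# Influence diagnostics for one candidate model (here the AST-only model).
fit <- lm(PTS ~ AST, data = train_data)
n <- nrow(train_data)
p <- length(coef(fit))

mean(abs(rstandard(fit)) > 2)         * 100  # % of games flagged as outliers
mean(hatvalues(fit)      > 2 * p / n) * 100  # % with high leverage (extreme AST values)
mean(cooks.distance(fit) > 4 / n)     * 100  # % influential by Cook's distance
```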
##
## Call:
## lm(formula = PTS ~ AST, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.666 -7.211 -0.231 6.743 41.386
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 73.91870 1.23774 59.72 <2e-16 ***
## AST 1.51299 0.04557 33.20 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.39 on 1966 degrees of freedom
## Multiple R-squared: 0.3593, Adjusted R-squared: 0.3589
## F-statistic: 1102 on 1 and 1966 DF, p-value: < 2.2e-16
The model shows that assists (AST) are a strong predictor of points (PTS), with each additional assist contributing approximately 1.51 more points on average; the relationship is highly significant (p < 2e-16) and explains about 36% of the variation in scoring, which aligns with the basketball logic that assists directly facilitate scoring.
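As a quick worked example using the fitted coefficients above, a hypothetical game with 25 assists would be predicted to score roughly \(73.92 + 1.513 \times 25 \approx 111.7\) points.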
## model_name r_squared AIC BIC F_stat p_value
## 1 model_ast_pf 0.3860397 14718.86 14741.20 85.714458 5.267015e-20
## 2 model_ast_tov 0.3812884 14734.03 14756.37 69.966361 1.127900e-16
## 3 model_ast_dreb 0.3651016 14784.85 14807.19 18.084534 2.212248e-05
## 4 model_ast_oreb 0.3635426 14789.68 14812.02 13.227156 2.830307e-04
## 5 model_ast_blk 0.3622656 14793.62 14815.96 9.265864 2.365607e-03
## 6 model_ast_stl 0.3596542 14801.67 14824.01 1.214740 2.705303e-01
## model_name violates_standardized_residuals violates_standardized_leverage
## 1 model_ast_dreb 4.07 7.67
## 2 model_ast_tov 4.12 8.08
## 3 model_ast_stl 4.22 8.03
## 4 model_ast_blk 4.07 7.77
## 5 model_ast_oreb 3.91 6.81
## 6 model_ast_pf 3.76 8.33
## violates_cooks
## 1 5.28
## 2 4.88
## 3 4.78
## 4 4.67
## 5 4.52
## 6 4.42
I believe the best performing model is model_ast_blk. It has the second most significant partial F-test p-value, indicating that including the new variable is significant. Furthermore, only about 4% of observations violate the outlier condition ( \(r_i\) ), about 8% violate the extreme x-value condition ( \(h_{ii}\) ), and about 5% are overall influential points ( \(D_i\) ); each candidate model shows approximately similar values. Lastly, diagnostically, the residual plot appears patternless with constant variance, and the residuals appear normally distributed (QQ-plot) with notable exceptions near the extremes.
##
## Call:
## lm(formula = PTS ~ AST + BLK, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -30.055 -7.096 -0.007 6.712 41.092
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 72.78171 1.29039 56.403 < 2e-16 ***
## AST 1.50290 0.04560 32.962 < 2e-16 ***
## BLK 0.27397 0.09001 3.044 0.00237 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.36 on 1965 degrees of freedom
## Multiple R-squared: 0.3623, Adjusted R-squared: 0.3616
## F-statistic: 558.1 on 2 and 1965 DF, p-value: < 2.2e-16
The model shows that Assists (AST) and Blocks (BLK) are both significant predictors of team points. Each additional assist adds about 1.50 points, while each block adds about 0.27 points. Assists directly lead to made shots, and blocks can create fast-break opportunities or shift momentum, indirectly leading to more scoring. So, it makes sense that both contribute to higher point totals.
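As a quick sketch of using the fitted model for prediction (assuming the fitted object is named model_ast_blk, matching the table above, and using hypothetical inputs of 25 assists and 5 blocks):

```r
# Predicted points for a hypothetical game with 25 assists and 5 blocks.
model_ast_blk <- lm(PTS ~ AST + BLK, data = train_data)
predict(model_ast_blk, newdata = data.frame(AST = 25, BLK = 5))
# roughly 72.78 + 1.503 * 25 + 0.274 * 5, about 111.7 points
```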
So, based on the standard deviation, we will see which model has the best accuracy:
## Model Accuracy_within_13pts
## 1 2D: PTS ~ AST 81.504
## 2 3D: PTS ~ AST + PF 81.504
When the improvement in model accuracy is minimal (such as the 0.4% gain observed with the more complex 3D model), it may not justify the added complexity. Considering that one standard deviation in points is around 13 PTS, this small increase in accuracy falls well within expected variability. As the textbook emphasizes, adding predictors that do not significantly improve performance can lead to overfitting, reducing the model's generalizability. Given this, it may be best to choose the simpler model for its interpretability and stability.
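A minimal sketch of how the accuracy-within-13-points comparison could be computed on the held-out test set, assuming the two fitted models are named as in the table above:

```r
model_ast    <- lm(PTS ~ AST,      data = train_data)
model_ast_pf <- lm(PTS ~ AST + PF, data = train_data)

# Percentage of test games whose actual PTS falls within 13 points of the prediction.
within_13 <- function(fit, newdata) {
  mean(abs(newdata$PTS - predict(fit, newdata)) <= 13) * 100
}

within_13(model_ast,    test_data)  # 2D model: PTS ~ AST
within_13(model_ast_pf, test_data)  # 3D model: PTS ~ AST + PF
```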
Assists! As it turns out, teamwork has a strong relationship to success: how many ASSISTS a team has impacts the PTS a team scores. In other words, assists can lead to either a 3-pointer or a 2-pointer, the highest-value shots, which suggests it is likely infrequent that teams get most of their points from free throws.