Model Critique

Goal 1: Business Scenario

Customer or Audience:
A professional basketball team’s front office (e.g., General Manager and Data Analytics team) that needs insights into player performance factors to make better contract and roster decisions.

Problem Statement:
Due to budget constraints and increasing player salary demands, the front office needs to identify which key performance metrics (e.g., Points, Steals, Blocks) most significantly impact player performance (as measured by Game Score). This will allow the team to optimize spending and prioritize contract renewals based on reliable predictors of value.

Scope:
We will focus on the variables available in the dataset: Points (PTS), Steals (STL), Blocks (BLK), and Playoff participation (Playoffs). The linear model currently predicts Game Score (GmSc), a comprehensive player performance metric. Assumptions: there is no major missing data, and metrics are recorded consistently across players.

Objective:
Determine which performance statistics significantly affect a player’s Game Score and whether playoff experience provides an additional predictive boost. Success will be measured by building a model with strong adjusted R², minimal multicollinearity, and satisfying linear regression assumptions.
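
One way to make the "minimal multicollinearity" criterion concrete is the variance inflation factor (VIF). The following is a minimal sketch in base R, not the analysis itself: the data frame `sim` is simulated, its column names only mirror the dataset, and `vif_manual` is a hypothetical helper rather than a package function.

```r
# Simulated stand-in for the real data (values are made up for illustration)
set.seed(1)
n <- 500
sim <- data.frame(
  PTS = rpois(n, 15),
  STL = rpois(n, 1),
  BLK = rpois(n, 1),
  Playoffs = runif(n) < 0.1
)

# VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing predictor j
# on all remaining predictors (hypothetical helper, not from a package)
vif_manual <- function(formula, data) {
  X <- model.matrix(formula, data)[, -1, drop = FALSE]  # drop the intercept column
  sapply(colnames(X), function(j) {
    others <- X[, colnames(X) != j, drop = FALSE]
    r2 <- summary(lm(X[, j] ~ others))$r.squared
    1 / (1 - r2)
  })
}

vifs <- vif_manual(~ PTS + STL + BLK + Playoffs, sim)
print(round(vifs, 2))
```

A common rule of thumb treats VIF values above roughly 5 to 10 as a sign of problematic collinearity; the exact cutoff is a judgment call.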

Goal 2: Model Critique

Critique Points:

  1. Limited Set of Predictors

    • Issue: Only PTS, STL, BLK, and Playoffs were included. Other box-score statistics, such as Assists (AST), Rebounds (TRB in this dataset), or Turnovers (TOV, where recorded), might also have strong explanatory power.

    • Improvement: Add more predictors and use stepwise selection (AIC or BIC) to choose among candidate models. As a first step, the extended model below adds AST.

    library(readr)
    nba_data <- read_csv("C:/Statistics/nba.csv")
    ## Rows: 1703 Columns: 19
    ## ── Column specification ────────────────────────────────────────────────────────
    ## Delimiter: ","
    ## chr   (4): bbrID, Tm, Opp, Season
    ## dbl  (12): TRB, AST, STL, BLK, PTS, GmSc, Year, GameIndex, GmScMovingZ, GmSc...
    ## lgl   (1): Playoffs
    ## date  (2): Date, Date2
    ## 
    ## ℹ Use `spec()` to retrieve the full column specification for this data.
    ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    # Extended model with more predictors
    model_full <- lm(GmSc ~ PTS + STL + BLK + AST + Playoffs, data = nba_data)
    summary(model_full)
    ## 
    ## Call:
    ## lm(formula = GmSc ~ PTS + STL + BLK + AST + Playoffs, data = nba_data)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -8.7964 -1.5966  0.0059  1.6047 10.3137 
    ## 
    ## Coefficients:
    ##              Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)  2.353000   0.193071   12.19   <2e-16 ***
    ## PTS          0.723468   0.006205  116.60   <2e-16 ***
    ## STL          0.838953   0.044174   18.99   <2e-16 ***
    ## BLK          0.976275   0.048930   19.95   <2e-16 ***
    ## AST          0.442468   0.020268   21.83   <2e-16 ***
    ## PlayoffsTRUE 0.023031   0.381139    0.06    0.952    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 2.586 on 1697 degrees of freedom
    ## Multiple R-squared:  0.907,  Adjusted R-squared:  0.9067 
    ## F-statistic:  3311 on 5 and 1697 DF,  p-value: < 2.2e-16
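
In the output above, AST is highly significant while Playoffs is not (p = 0.952). The stepwise selection suggested in the improvement could be sketched with base R's `step()` as follows. This runs on simulated data: the data frame `sim`, its coefficients, and the candidate set are assumptions for illustration, not results from nba.csv.

```r
# Simulated stand-in for the real data (values are made up for illustration)
set.seed(42)
n <- 400
sim <- data.frame(
  PTS = rpois(n, 15), TRB = rpois(n, 5), AST = rpois(n, 3),
  STL = rpois(n, 1), BLK = rpois(n, 1),
  Playoffs = runif(n) < 0.1
)
# Assumed data-generating process: Playoffs has no effect on the response
sim$GmSc <- 2 + 0.7 * sim$PTS + 0.3 * sim$TRB + 0.5 * sim$AST +
  0.9 * sim$STL + 0.9 * sim$BLK + rnorm(n, sd = 2.5)

# Start from the intercept-only model and search up to the full candidate set
null_model <- lm(GmSc ~ 1, data = sim)
candidates <- ~ PTS + TRB + AST + STL + BLK + Playoffs

# k = 2 (the default) penalizes by AIC; k = log(n) penalizes by BIC
step_aic <- step(null_model, scope = candidates, direction = "both", trace = 0)
step_bic <- step(null_model, scope = candidates, direction = "both",
                 k = log(n), trace = 0)

print(formula(step_aic))
print(formula(step_bic))
```

Stepwise search is a heuristic and does not guarantee the best model, so the selected formula is better treated as a candidate to check against diagnostics and domain knowledge.
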
  2. Potential Nonlinearity and Heteroscedasticity

    • Issue: Diagnostic plots indicated slight non-linearity and non-constant variance.

    • Improvement: Apply a log transformation to the dependent variable (GmSc) to stabilize variance.

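    # Game Score can be zero or negative, so shift it to be at least 1 before
    # taking logs; this equals the original "+ 1" when no Game Score is negative
    shift <- max(1, 1 - min(nba_data$GmSc, na.rm = TRUE))
    nba_data$log_GmSc <- log(nba_data$GmSc + shift)
    model_log <- lm(log_GmSc ~ PTS + STL + BLK + Playoffs, data = nba_data)
    summary(model_log)
    par(mfrow = c(2, 2))  # 2 x 2 grid for the four diagnostic plots
    plot(model_log)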
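
Diagnostic plots are read by eye, so a formal check of non-constant variance can complement them. The sketch below hand-computes a studentized Breusch-Pagan statistic in base R. It uses simulated data whose noise grows with PTS by construction; the data frame `sim` and the names `bp_stat`, `bp_df`, and `bp_p` are illustrative assumptions.

```r
# Simulated stand-in for the real data (values are made up for illustration)
set.seed(7)
n <- 500
sim <- data.frame(PTS = rpois(n, 15), STL = rpois(n, 1), BLK = rpois(n, 1))
# Noise SD increases with PTS, so these data are heteroscedastic by design
sim$GmSc <- 2 + 0.7 * sim$PTS + 0.9 * sim$STL + 0.9 * sim$BLK +
  rnorm(n, sd = 0.5 + 0.2 * sim$PTS)

fit <- lm(GmSc ~ PTS + STL + BLK, data = sim)

# Auxiliary regression: squared residuals on the same predictors
sim$res2 <- resid(fit)^2
aux <- lm(res2 ~ PTS + STL + BLK, data = sim)

# LM statistic n * R^2 is compared with a chi-squared distribution whose
# degrees of freedom equal the number of predictors
bp_stat <- n * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)

cat("BP statistic:", round(bp_stat, 2), " df:", bp_df, " p-value:", signif(bp_p, 3), "\n")
```

A small p-value suggests the residual variance depends on the predictors. Running the same check before and after the log transformation would show whether the transformation actually helped.
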
  3. Missing Interaction Terms

    • Issue: Only main effects were modeled. But a player’s scoring (PTS) could interact with their defensive contributions (STL/BLK) to affect GmSc.

    • Improvement: Introduce interaction terms into the model.

    model_interaction <- lm(GmSc ~ PTS * STL + PTS * BLK + Playoffs, data = nba_data)
    summary(model_interaction)
    ## 
    ## Call:
    ## lm(formula = GmSc ~ PTS * STL + PTS * BLK + Playoffs, data = nba_data)
    ## 
    ## Residuals:
    ##     Min      1Q  Median      3Q     Max 
    ## -8.7043 -1.9256 -0.1106  1.7405 11.4574 
    ## 
    ## Coefficients:
    ##               Estimate Std. Error t value Pr(>|t|)    
    ## (Intercept)   3.320287   0.328168  10.118  < 2e-16 ***
    ## PTS           0.741215   0.011743  63.121  < 2e-16 ***
    ## STL           1.116740   0.129814   8.603  < 2e-16 ***
    ## BLK           0.598702   0.139184   4.302 1.79e-05 ***
    ## PlayoffsTRUE -0.039047   0.431672  -0.090    0.928    
    ## PTS:STL      -0.001876   0.004518  -0.415    0.678    
    ## PTS:BLK       0.007906   0.004977   1.589    0.112    
    ## ---
    ## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
    ## 
    ## Residual standard error: 2.925 on 1696 degrees of freedom
    ## Multiple R-squared:  0.8811, Adjusted R-squared:  0.8807 
    ## F-statistic:  2095 on 6 and 1696 DF,  p-value: < 2.2e-16
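
In the output above, neither PTS:STL (p = 0.678) nor PTS:BLK (p = 0.112) is individually significant at the 5% level. A joint check of both interaction terms can be done with a nested-model F-test via `anova()`. The sketch below runs on simulated data with no true interaction; the data frame `sim` and the model names are assumptions for illustration.

```r
# Simulated stand-in for the real data (values are made up for illustration)
set.seed(3)
n <- 500
sim <- data.frame(
  PTS = rpois(n, 15), STL = rpois(n, 1), BLK = rpois(n, 1),
  Playoffs = runif(n) < 0.1
)
# Assumed data-generating process: main effects only, no interactions
sim$GmSc <- 3 + 0.75 * sim$PTS + 1.0 * sim$STL + 0.8 * sim$BLK +
  rnorm(n, sd = 2.9)

# Nested pair: the interaction model contains every term of the main-effects model
main_fit <- lm(GmSc ~ PTS + STL + BLK + Playoffs, data = sim)
int_fit <- lm(GmSc ~ PTS * STL + PTS * BLK + Playoffs, data = sim)

# F-test of the null hypothesis that both interaction coefficients are zero
cmp <- anova(main_fit, int_fit)
print(cmp)

p_joint <- cmp[["Pr(>F)"]][2]
```

If the joint p-value is large, the simpler main-effects model is usually the better choice, since the extra terms add complexity without a measurable gain in fit.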

Goal 3: Ethical and Epistemological Concerns

1. Potential Bias in Player Evaluation

Players who don’t traditionally “stuff the stat sheet” (like defensive specialists) might be undervalued if we rely only on GmSc predictors like PTS or STL.

2. Data Limitations and Context Ignorance

The data might not fully account for intangible factors (e.g., leadership, defensive pressure that doesn’t show up in box scores) that contribute to success, creating an incomplete picture.

3. Risk of Overemphasizing Predictive Metrics

If the model guides financial decisions (contracts, trades), there’s a risk of overlooking player roles that aren’t easily quantifiable, leading to unfair treatment or missed opportunities.

4. Epistemological Concern

The model treats performance as purely a statistical outcome without considering human factors (injuries, teamwork, morale), which challenges the completeness of “knowing” player value solely through numbers.

Final Thoughts