Customer or Audience:
A professional basketball team’s front office (e.g., General Manager and
Data Analytics team) that needs insights into player performance factors
to make better contract and roster decisions.
Problem Statement:
Due to budget constraints and increasing player salary demands, the
front office needs to identify which key performance metrics (e.g.,
Points, Steals, Blocks) most significantly impact player performance (as
measured by Game Score). This will allow the team to optimize spending
and prioritize contract renewals based on reliable predictors of
value.
Scope:
We will focus on the variables available in the dataset: Points (PTS),
Steals (STL), Blocks (BLK), and Playoff participation (Playoffs). The
linear model currently predicts Game Score (GmSc), a comprehensive
player performance metric. Assumptions: no major missing data, metrics
are recorded consistently across players.
Objective:
Determine which performance statistics significantly affect a player’s
Game Score and whether playoff experience provides an additional
predictive boost. Success will be measured by building a model with
strong adjusted R², minimal multicollinearity, and satisfying linear
regression assumptions.
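The success criteria above can be checked directly on a fitted model. The following is a minimal base-R sketch; `check_fit` is a hypothetical helper name and not part of the original assignment. It reports the adjusted R² and a variance inflation factor (VIF) for each design-matrix column, computed as 1 / (1 - R²_j), where R²_j comes from regressing column j on the remaining columns.

```r
# Hypothetical helper (an assumption, not from the original assignment):
# summarise adjusted R-squared and per-column VIFs for a fitted lm object.
check_fit <- function(model) {
  # Design matrix without the intercept column
  X <- model.matrix(model)
  X <- X[, colnames(X) != "(Intercept)", drop = FALSE]

  if (ncol(X) < 2) {
    # With a single predictor there is nothing to be collinear with
    vif <- setNames(rep(1, ncol(X)), colnames(X))
  } else {
    # VIF_j = 1 / (1 - R2_j), R2_j from regressing column j on the others
    vif <- vapply(seq_len(ncol(X)), function(j) {
      r2 <- summary(lm(X[, j] ~ X[, -j, drop = FALSE]))$r.squared
      1 / (1 - r2)
    }, numeric(1))
    names(vif) <- colnames(X)
  }

  list(adj_r2 = summary(model)$adj.r.squared, vif = vif)
}
```

The helper can be applied to any fitted `lm` object. VIFs well above roughly 5 to 10 are commonly read as a sign of problematic multicollinearity, though that threshold is a rule of thumb rather than a formal test.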
Critique Points:
Limited Set of Predictors
Issue: Only PTS, STL, BLK, and Playoffs were included. Other box-score statistics such as Assists (AST), Rebounds (TRB in this dataset), or Turnovers (TOV), where recorded, might also have strong explanatory power.
Improvement: Add more predictors and use stepwise selection (by AIC or BIC) to find a parsimonious model.
library(readr)
nba_data <- read_csv("C:/Statistics/nba.csv")
## Rows: 1703 Columns: 19
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (4): bbrID, Tm, Opp, Season
## dbl (12): TRB, AST, STL, BLK, PTS, GmSc, Year, GameIndex, GmScMovingZ, GmSc...
## lgl (1): Playoffs
## date (2): Date, Date2
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# Extended model with more predictors
model_full <- lm(GmSc ~ PTS + STL + BLK + AST + Playoffs, data = nba_data)
summary(model_full)
##
## Call:
## lm(formula = GmSc ~ PTS + STL + BLK + AST + Playoffs, data = nba_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.7964 -1.5966  0.0059  1.6047 10.3137
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   2.353000   0.193071   12.19   <2e-16 ***
## PTS           0.723468   0.006205  116.60   <2e-16 ***
## STL           0.838953   0.044174   18.99   <2e-16 ***
## BLK           0.976275   0.048930   19.95   <2e-16 ***
## AST           0.442468   0.020268   21.83   <2e-16 ***
## PlayoffsTRUE  0.023031   0.381139    0.06    0.952
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.586 on 1697 degrees of freedom
## Multiple R-squared: 0.907, Adjusted R-squared: 0.9067
## F-statistic: 3311 on 5 and 1697 DF, p-value: < 2.2e-16
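The model above adds AST by hand; the stepwise selection suggested in the improvement could be sketched as follows. This is a minimal example using `step()` from base R's stats package. The helper name `select_stepwise` is hypothetical, and the candidate set (including TRB) is an assumption based on the column listing printed by `read_csv()`.

```r
# Hypothetical helper (an assumption, not from the original assignment):
# backward stepwise selection starting from a full candidate model.
# Adjust the formula to the columns actually present in the data.
select_stepwise <- function(data, criterion = c("AIC", "BIC")) {
  criterion <- match.arg(criterion)

  # Full model with every candidate predictor
  full <- lm(GmSc ~ PTS + STL + BLK + AST + TRB + Playoffs, data = data)

  # step() penalises each parameter by k: k = 2 gives AIC, k = log(n) gives BIC
  k <- if (criterion == "BIC") log(nobs(full)) else 2

  # Drop terms one at a time while the criterion keeps improving
  step(full, direction = "backward", k = k, trace = 0)
}
```

With the data loaded as above, `select_stepwise(nba_data, "BIC")` would return the reduced `lm` fit, which can then be passed to `summary()`. BIC penalises extra terms more heavily than AIC on a dataset of this size, so it tends to keep fewer predictors.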
Potential Non-linearity and Heteroscedasticity
Issue: Diagnostic plots indicated slight non-linearity and non-constant variance.
Improvement: Apply a log transformation to the dependent variable (GmSc) to stabilize variance.
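# GmSc can be negative, so shift it so the minimum maps to log(1) = 0
# (adding 1 alone yields NaN for any GmSc <= -1)
gmsc_shift <- 1 - min(nba_data$GmSc, na.rm = TRUE)
nba_data$log_GmSc <- log(nba_data$GmSc + gmsc_shift)
model_log <- lm(log_GmSc ~ PTS + STL + BLK + Playoffs, data = nba_data)
summary(model_log)
par(mfrow = c(2, 2)) # 2x2 grid for the four diagnostic plots
plot(model_log)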
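Diagnostic plots give only a visual impression of non-constant variance, so a formal test can complement them. The sketch below implements Koenker's studentized variant of the Breusch-Pagan test in base R: squared residuals are regressed on the model's predictors, and n times the R² of that auxiliary regression is compared to a chi-squared distribution. The helper name `bp_test` is hypothetical; the `lmtest` package offers a packaged `bptest()` if installing it is an option.

```r
# Hypothetical helper (an assumption, not from the original assignment):
# Koenker's studentized Breusch-Pagan test for a fitted lm object.
bp_test <- function(model) {
  # Predictors of the original model, without the intercept column
  Z <- model.matrix(model)
  Z <- Z[, colnames(Z) != "(Intercept)", drop = FALSE]

  # Auxiliary regression of squared residuals on the predictors
  e2 <- residuals(model)^2
  aux <- lm(e2 ~ Z)

  # Test statistic n * R^2, chi-squared with one df per predictor column
  stat <- length(e2) * summary(aux)$r.squared
  df <- ncol(Z)
  list(statistic = stat, df = df,
       p_value = pchisq(stat, df = df, lower.tail = FALSE))
}
```

A small p-value is evidence against constant variance. Running the helper on both the untransformed and the log-transformed model shows whether the transformation actually helped.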
Missing Interaction Terms
Issue: Only main effects were modeled. But a player’s scoring (PTS) could interact with their defensive contributions (STL/BLK) to affect GmSc.
Improvement: Introduce interaction terms into the model.
model_interaction <- lm(GmSc ~ PTS * STL + PTS * BLK + Playoffs, data = nba_data)
summary(model_interaction)
##
## Call:
## lm(formula = GmSc ~ PTS * STL + PTS * BLK + Playoffs, data = nba_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -8.7043 -1.9256 -0.1106  1.7405 11.4574
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.320287   0.328168  10.118  < 2e-16 ***
## PTS           0.741215   0.011743  63.121  < 2e-16 ***
## STL           1.116740   0.129814   8.603  < 2e-16 ***
## BLK           0.598702   0.139184   4.302 1.79e-05 ***
## PlayoffsTRUE -0.039047   0.431672  -0.090    0.928
## PTS:STL      -0.001876   0.004518  -0.415    0.678
## PTS:BLK       0.007906   0.004977   1.589    0.112
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.925 on 1696 degrees of freedom
## Multiple R-squared: 0.8811, Adjusted R-squared: 0.8807
## F-statistic: 2095 on 6 and 1696 DF, p-value: < 2.2e-16
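The summary above tests each interaction term separately (PTS:STL and PTS:BLK). Whether the two interactions jointly improve on the main-effects model can be checked with a nested-model F-test via `anova()`. The following is a minimal sketch; `compare_interaction` is a hypothetical helper name.

```r
# Hypothetical helper (an assumption, not from the original assignment):
# joint F-test of both interaction terms against the main-effects model.
compare_interaction <- function(data) {
  # Main effects only
  main <- lm(GmSc ~ PTS + STL + BLK + Playoffs, data = data)

  # Same model plus the PTS:STL and PTS:BLK interactions
  inter <- lm(GmSc ~ PTS * STL + PTS * BLK + Playoffs, data = data)

  # Row 2 of the returned table holds the F statistic and its p-value
  anova(main, inter)
}
```

Calling `compare_interaction(nba_data)` would return a two-row table. A large p-value in the second row would suggest the interaction terms add little beyond the main effects.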
1. Potential Bias in Player Evaluation
Players who don’t traditionally “stuff the stat sheet” (like defensive specialists) might be undervalued if we rely only on GmSc predictors like PTS or STL.
2. Data Limitations and Context Ignorance
The data might not fully account for intangible factors (e.g., leadership, defensive pressure that doesn’t show up in box scores) that contribute to success, creating an incomplete picture.
3. Risk of Overemphasizing Predictive Metrics
If the model guides financial decisions (contracts, trades), there’s a risk of overlooking player roles that aren’t easily quantifiable, leading to unfair treatment or missed opportunities.
4. Epistemological Concern
The model treats performance as purely a statistical outcome without considering human factors (injuries, teamwork, morale), which challenges the completeness of “knowing” player value solely through numbers.
The assignment does a great job addressing multicollinearity and basic regression diagnostics.
For a business-grade analysis, extending the feature set, addressing heteroscedasticity, and modeling potential interactions would strengthen your insights.
Ethical awareness is crucial when player livelihoods and careers are influenced by data-driven decisions.