Learning Module 6

NBA Dataset

Author

Robin Chavez

FURTHER INSTRUCTIONS

Use the dataset assigned to you in the LAB Instructions.

Use the depvar indicated for your dataset.

Select 5-8 independent variables from your dataset as predictors, and estimate a linear regression model.

  • Note that not all variables in your dataset make sense as predictors. For example, categorical variables with many levels or all components of a derived variable are unreasonable.

  • Work through the EDA to identified good candidates. Your model should be linear in the parameters. Consider alternative functional forms for your depvar or continuous variables.

Include interpretations and comments after each chunk.

Conclude with a final assessment of the model and a reflection of the value of data visualization to aid model development.


Example Code for FIFA with depvar=Wage

DATA

options(scipen=999)
library(socviz) # use for the round_df() function
library(psych)
load("NBA_viz.RData")

Selecting the variables

describe(NBA_viz)
         vars   n    mean     sd  median trimmed     mad min    max  range
PLAYER*     1 530  265.50 153.14  265.50  265.50  196.44   1  530.0  529.0
FORWARD*    2 530    1.49   0.50    1.00    1.49    0.00   1    2.0    1.0
CENTER*     3 530    1.19   0.39    1.00    1.11    0.00   1    2.0    1.0
GUARD*      4 530    1.49   0.50    1.00    1.49    0.00   1    2.0    1.0
ROOKIE*     5 530    1.20   0.40    1.00    1.12    0.00   1    2.0    1.0
TEAM*       6 530   15.41   8.60   15.00   15.38   10.38   1   30.0   29.0
AGE         7 530   26.14   4.18   25.00   25.85    4.45  19   42.0   23.0
GP          8 530   49.25  26.05   56.00   51.00   28.17   1   82.0   81.0
W           9 530   24.67  15.89   24.00   24.30   20.76   0   60.0   60.0
MIN        10 530 1121.63 838.67 1069.00 1078.70 1115.66   1 3028.0 3027.0
PTS        11 530   19.84   7.85   19.05   19.46    5.86   0  100.5  100.5
FGM        12 530    7.42   3.26    7.10    7.30    2.37   0   50.3   50.3
FGA        13 530   16.78   5.37   16.10   16.47    4.30   0   52.7   52.7
FTM        14 530    3.00   2.05    2.70    2.79    1.63   0   17.2   17.2
FTA        15 530    4.08   2.62    3.65    3.85    2.15   0   18.2   18.2
REB        16 530    9.06   5.14    7.90    8.52    3.85   0   52.7   52.7
AST        17 530    4.38   2.85    3.60    4.06    2.22   0   17.1   17.1
TOV        18 530    2.53   1.40    2.30    2.43    1.04   0   12.8   12.8
          skew kurtosis    se
PLAYER*   0.00    -1.21  6.65
FORWARD*  0.05    -2.00  0.02
CENTER*   1.59     0.52  0.02
GUARD*    0.02    -2.00  0.02
ROOKIE*   1.51     0.28  0.02
TEAM*     0.03    -1.18  0.37
AGE       0.65     0.08  0.18
GP       -0.48    -1.13  1.13
W         0.16    -1.12  0.69
MIN       0.28    -1.16 36.43
PTS       2.20    20.63  0.34
FGM       4.31    55.11  0.14
FGA       1.29     6.82  0.23
FTM       1.57     5.55  0.09
FTA       1.26     3.18  0.11
REB       2.20    11.90  0.22
AST       1.09     1.16  0.12
TOV       1.82     9.15  0.06
colnames(NBA_viz)
 [1] "PLAYER"  "FORWARD" "CENTER"  "GUARD"   "ROOKIE"  "TEAM"    "AGE"    
 [8] "GP"      "W"       "MIN"     "PTS"     "FGM"     "FGA"     "FTM"    
[15] "FTA"     "REB"     "AST"     "TOV"    
str(NBA_viz)
'data.frame':   530 obs. of  18 variables:
 $ PLAYER : Factor w/ 530 levels "Aaron Gordon",..: 1 2 3 5 4 6 7 8 9 10 ...
 $ FORWARD: Factor w/ 2 levels "No","Yes": 2 1 2 2 2 2 1 1 1 1 ...
 $ CENTER : Factor w/ 2 levels "No","Yes": 1 1 1 2 1 2 1 1 1 2 ...
 $ GUARD  : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 2 2 2 1 ...
 $ ROOKIE : Factor w/ 2 levels "No","Yes": 1 2 1 1 1 1 1 1 1 1 ...
 $ TEAM   : Factor w/ 30 levels "ATL","BKN","BOS",..: 22 12 21 3 25 2 26 21 14 1 ...
 $ AGE    : int  23 22 25 32 28 26 27 25 25 25 ...
 $ GP     : int  78 50 61 68 81 5 64 31 25 77 ...
 $ W      : int  40 31 38 41 52 1 19 21 8 28 ...
 $ MIN    : int  2633 646 694 1973 2292 26 1375 588 531 1544 ...
 $ PTS    : num  22.7 21.9 16.7 22.5 15.9 33.8 19.6 13.5 20.7 26.5 ...
 $ FGM    : num  8.6 7.8 6.3 9.4 5.4 15 6.7 4.6 7 9.9 ...
 $ FGA    : num  19.1 19.5 14.9 17.6 12.4 24.4 16.5 12.8 15.6 20.1 ...
 $ FTM    : num  3.4 3 1.9 1.9 3.1 3.8 4 1 4.6 4.4 ...
 $ FTA    : num  4.6 3.7 2.5 2.3 3.6 7.5 4.9 1.1 5.8 6.7 ...
 $ REB    : num  10.5 5 8 11.1 12.8 35.7 8.2 3.9 6.1 13.2 ...
 $ AST    : num  5.3 6.5 1.4 6.9 2.2 5.6 4.5 1.6 7 2.7 ...
 $ TOV    : num  3 3 1.8 2.5 1.5 1.9 2.3 1.1 3.8 3 ...
cat ("Data Identified for linear Regression model: ")
Data Identified for linear Regression model: 
model <- lm(PTS ~ AGE + GP + W + MIN + REB + AST + TOV, data = NBA_viz)
summary(model)

Call:
lm(formula = PTS ~ AGE + GP + W + MIN + REB + AST + TOV, data = NBA_viz)

Residuals:
    Min      1Q  Median      3Q     Max 
-23.653  -3.750  -0.206   3.275  86.083 

Coefficients:
              Estimate Std. Error t value          Pr(>|t|)    
(Intercept) 15.5128520  2.1371849   7.259 0.000000000001427 ***
AGE         -0.0425469  0.0744910  -0.571          0.568131    
GP          -0.1230499  0.0332959  -3.696          0.000242 ***
W            0.0353093  0.0388266   0.909          0.363553    
MIN          0.0061347  0.0008352   7.345 0.000000000000796 ***
REB          0.1094741  0.0617765   1.772          0.076961 .  
AST          0.3179386  0.1244043   2.556          0.010880 *  
TOV          0.5398697  0.2453718   2.200          0.028229 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.99 on 522 degrees of freedom
Multiple R-squared:  0.2173,    Adjusted R-squared:  0.2068 
F-statistic:  20.7 on 7 and 522 DF,  p-value: < 0.00000000000000022

In this chucnk, it shows the main idea and variables of the dataset.

Partition 60% Train / 40% Test

library(caret)
set.seed(1)
depvar <- NBA_viz$PTS
index <- createDataPartition(depvar, p = 0.6, list = FALSE)
train <- NBA_viz[index, ]
test  <- NBA_viz[-index, ]

Training set number of observations: 320
Test dataset number of observations: 210

The partitioning of the dataset into training (60%) and test (40%) sets is a crucial step in developing and evaluating the predictive model. In this case, the NBA dataset was split, with 320 observations allocated to the training set and 210 observations to the test set. This strategy allows the model to learn patterns and relationships from the training data, and the test set serves as an independent sample to assess the model’s generalization performance. The choice of a 60-40 split strikes a balance between providing sufficient data for training and maintaining an adequately sized test set for robust evaluation. It’s important to note that the random seed is set to 1, ensuring reproducibility in subsequent analyses and model comparisons.


EDA

Descriptive Statistics

# Load necessary libraries
library(psych)
library(tidyverse)
library(flextable)
psych::describe(train, fast = TRUE)
        vars   n    mean     sd min    max  range    se
PLAYER     1 320     NaN     NA Inf   -Inf   -Inf    NA
FORWARD    2 320     NaN     NA Inf   -Inf   -Inf    NA
CENTER     3 320     NaN     NA Inf   -Inf   -Inf    NA
GUARD      4 320     NaN     NA Inf   -Inf   -Inf    NA
ROOKIE     5 320     NaN     NA Inf   -Inf   -Inf    NA
TEAM       6 320     NaN     NA Inf   -Inf   -Inf    NA
AGE        7 320   26.33   4.36  19   42.0   23.0  0.24
GP         8 320   50.27  26.45   1   82.0   81.0  1.48
W          9 320   24.58  16.06   0   60.0   60.0  0.90
MIN       10 320 1154.53 842.56   1 3028.0 3027.0 47.10
PTS       11 320   19.70   6.98   0   42.8   42.8  0.39
FGM       12 320    7.36   2.70   0   17.1   17.1  0.15
FGA       13 320   16.63   4.88   0   39.5   39.5  0.27
FTM       14 320    2.94   1.84   0   10.2   10.2  0.10
FTA       15 320    4.01   2.38   0   12.3   12.3  0.13
REB       16 320    8.97   4.73   0   35.7   35.7  0.26
AST       17 320    4.30   2.78   0   17.1   17.1  0.16
TOV       18 320    2.47   1.47   0   12.8   12.8  0.08

These are the findings of the dataset: The descriptive statistics offer useful insights into the key variables in the training data set by summarizing the central tendency, variability, and distribution of each feature.

Specifically, the average age of NBA players is 26.33 years with a standard deviation of 4.36 years, indicating a wide range from 19 to 42 years. The average number of games played is 50.27 with substantial variability (standard deviation of 26.45 games). There is also considerable variation across players in terms of wins, minutes played, points scored, field goals, free throws, rebounds, and assists. The metrics provide a quantitative overview of performance in fundamental aspects of basketball.

The wide-ranging distributions and standard deviations showcase the diversity among NBA players for most variables. Visualizing these summary statistics and distributions would further contextualize the spread, skew, outliers, and other characteristics essential for thorough exploratory data analysis. Overall, these measures offer a preliminary numerical synopsis of key input features that influence the output target - points scored. Analyzing the descriptive statistics facilitates initial data comprehension while also informing subsequent modeling decisions regarding transformations, assumptions, and interpretability.

Boxplots – All Numeric

data_long <- train %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(data_long, aes(x = Variable, y = Value)) +
  geom_boxplot() +
  coord_flip() + 
  facet_wrap(~ Variable, scales = "free", ncol=3) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  labs(title = "Horizontal Boxplot for Each Numeric Variable")

Each of the boxplots shows a single variable. It is easy to see that each of the variables presents outliers. AGE: The distribution of player ages is slightly skewed to the right, with a few outliers on the higher end.
AST: The distribution of assists is also slightly skewed to the right, with a few outliers on the higher end.
FGA: The distribution of field goal attempts is fairly symmetrical, with a few outliers on both the lower and higher ends.
FGM: The distribution of field goals made is similar to the distribution of field goal attempts, with a few outliers on both ends.
FTA: The distribution of free throw attempts is slightly skewed to the right, with a few outliers on the higher end.
FTM: The distribution of free throws made is similar to the distribution of free throw attempts, with a few outliers on the higher end.
GP: The distribution of games played is skewed to the right, with a few outliers on the higher end.
MIN: The distribution of minutes played is skewed to the right, with a few outliers on the higher end.
PTS: The distribution of points scored is skewed to the right, with a few outliers on the higher end.
REB: The distribution of rebounds is skewed to the right, with a few outliers on the higher end.
TOV: The distribution of turnovers is skewed to the right, with a few outliers on the higher end.
W: The distribution of wins is skewed to the right, with a few outliers on the higher end.

library(DataExplorer)
plot_boxplot(train, by="AGE") 

These are some observations that can be seen from the boxplots when usingg age as the group variable.
The distribution of points per game is skewed to the right for all age groups, with more players scoring fewer points than higher points.
There is a general trend of increasing median points per game with increasing age, up to around 28 years old. After that, the median points per game appears to level off or even decrease slightly.
There is a fairly wide range of points scored within each age group, as indicated by the length of the boxes. This suggests that there is a lot of variability in scoring ability among players of all ages.
There are a few outliers, represented by points beyond the whiskers of the boxes. These outliers represent players who scored significantly more or fewer points than the majority of players in their age group.
### Histograms

plot_histogram(train)

Observations:
Player Age: The majority of players fall within the young-to-mid twenties age range, with a slight skew towards younger players.

Right-skewed Distributions: Several key statistics, including points scored, rebounds, assists, field goal attempts/makes, free throw attempts/makes, minutes played, and games played, exhibit right-skewed distributions. This indicates that most players fall towards the lower end of the spectrum, with a smaller number of players exceeding the average in these categories.

Wins Distribution: The distribution of team wins is also right-skewed, suggesting that most teams win a moderate number of games, while a few teams stand out with significantly more wins.

Scatterplots (depvar ~ all x)

Look at the shape of the relationship between the dependent variable and all of the continuous potential independent variables.

plot_scatterplot(test, by = "PTS") 

Observations:
There is a weak positive correlation between age and PTS. This means that as players get older, they tend to score slightly more points on average. However, there is a lot of variability in the data, and many young players score more points than some older players.

Players who are forwards tend to score more points than players who are centers or guards. This is likely because forwards typically play a more offensive role than centers or guards.

Correlation Matrix

plot_correlation(train, type="continuous")

Observations:
-Strong Positive Correlations:
PTS (Points Scored) and FGM (Field Goals Made): 0.96. This indicates a very strong positive correlation, meaning players with higher point totals tend to have correspondingly higher numbers of field goals made.
FGA (Field Goal Attempts) and FTM (Free Throws Made): 0.66. This suggests a moderately strong positive correlation, implying players who attempt more field goals generally make more free throws.
MIN (Minutes Played) and GP (Games Played): 0.89. This indicates a strong positive correlation, signifying that players who play more minutes tend to participate in more games.
W (Wins) and GP (Games Played): 0.76. This suggests a moderately strong positive correlation, implying teams with more wins typically play more games.

Strong Negative Correlations:

AGE and MIN (Minutes Played): -0.89. This indicates a very strong negative correlation, meaning younger players tend to play more minutes on average compared to older players.

Moderate Positive Correlations:

PTS (Points Scored) and FGA (Field Goal Attempts): 0.74. This suggests a moderate positive correlation, implying players with higher scores generally attempt more field goals.
REB (Rebounds) and MIN (Minutes Played): 0.37. This indicates a moderate positive correlation, meaning players who play more minutes tend to grab more rebounds on average.


MODEL

Linear Regression Model

From the above EDA, we chose the following variables to start. Note that, given the shape of the relationship between Wage and Age, we entered Age as a quadratic.
From experience and prior research, it is common to specify a dependent variable that is a currency variable (e.g., sales, revenue, wages) in log form.

Estimate the following model:
\(model <- lm(PTS ~ AGE + GP + W + MIN + REB + AST + TOV, data = NBA_viz)\)

Estimate Coefficients and show coefficients table

# Check for zero or negative values in the dependent variable
sum(train$PTS <= 0)
[1] 7
train <- train[train$PTS > 0, ]

sum(is.na(train$PTS))
[1] 0
train <- train[complete.cases(train$PTS), ]

# Fit the linear regression model
model <- lm(log(PTS) ~ AGE + I(AGE^2) + GP + W + MIN + FGM + FGA + FTM + FTA + REB + AST + TOV, data = train)
summary(model)

Call:
lm(formula = log(PTS) ~ AGE + I(AGE^2) + GP + W + MIN + FGM + 
    FGA + FTM + FTA + REB + AST + TOV, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.68816 -0.03753  0.00980  0.05225  0.24029 

Coefficients:
               Estimate  Std. Error t value             Pr(>|t|)    
(Intercept)  2.06422801  0.17994391  11.472 < 0.0000000000000002 ***
AGE         -0.00761248  0.01267464  -0.601              0.54856    
I(AGE^2)     0.00016352  0.00022549   0.725              0.46891    
GP           0.00192779  0.00059887   3.219              0.00143 ** 
W            0.00003584  0.00064962   0.055              0.95604    
MIN         -0.00003855  0.00001576  -2.446              0.01502 *  
FGM          0.10867147  0.00440791  24.654 < 0.0000000000000002 ***
FGA          0.00702982  0.00249933   2.813              0.00524 ** 
FTM          0.05660245  0.00933876   6.061        0.00000000405 ***
FTA         -0.01424028  0.00759438  -1.875              0.06175 .  
REB         -0.00874151  0.00140013  -6.243        0.00000000146 ***
AST         -0.00773600  0.00236237  -3.275              0.00118 ** 
TOV         -0.00774367  0.00513955  -1.507              0.13294    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09309 on 300 degrees of freedom
Multiple R-squared:  0.9248,    Adjusted R-squared:  0.9218 
F-statistic: 307.5 on 12 and 300 DF,  p-value: < 0.00000000000000022
model %>% as_flextable

Estimate

Standard Error

t value

Pr(>|t|)

(Intercept)

2.064

0.180

11.472

0.0000

***

AGE

-0.008

0.013

-0.601

0.5486

I(AGE^2)

0.000

0.000

0.725

0.4689

GP

0.002

0.001

3.219

0.0014

**

W

0.000

0.001

0.055

0.9560

MIN

-0.000

0.000

-2.446

0.0150

*

FGM

0.109

0.004

24.654

0.0000

***

FGA

0.007

0.002

2.813

0.0052

**

FTM

0.057

0.009

6.061

0.0000

***

FTA

-0.014

0.008

-1.875

0.0617

.

REB

-0.009

0.001

-6.243

0.0000

***

AST

-0.008

0.002

-3.275

0.0012

**

TOV

-0.008

0.005

-1.507

0.1329

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Residual standard error: 0.09309 on 300 degrees of freedom

Multiple R-squared: 0.9248, Adjusted R-squared: 0.9218

F-statistic: 307.5 on 300 and 12 DF, p-value: 0.0000

Observations:
Multiple R-squared: It indicates the proportion of variance in the dependent variable (log-transformed PTS) that is explained by the independent variables. In this case, approximately 92.48% of the variance is explained.
Adjusted R-squared: It adjusts the R-squared value based on the number of predictors in the model. It penalizes the addition of predictors that do not improve the model significantly.
F-statistic: It tests the overall significance of the model. A low p-value (< 0.05) suggests that at least one predictor variable is significantly related to the response variable.

Coefficient Magnitude Plot

# note this simple code using coefplot package produces chart like Figure 6.6 in one short line. 
library(coefplot)
coefplot(model, soft = "magnitude", intercept = FALSE)

Observations:
The visualization confirms the findings from the model summary.
Variables like FGM, FTM, REB, AST, GP, and MIN have coefficients with the same direction and relative strength as indicated by their p-values in the summary.
The error bars provide a visual sense of the uncertainty around the coefficient estimates. For instance, the error bar for the AGE^2 coefficient is relatively large, suggesting that this estimate might be less precise compared to others with shorter bars.

Check for predictor independence

Using Variance Inflation Factors (VIF)

library(car)
vif(model)
       AGE   I(AGE^2)         GP          W        MIN        FGM        FGA 
111.154997 110.494789   8.539649   3.810795   6.219076   4.335330   4.444139 
       FTM        FTA        REB        AST        TOV 
 10.288880  11.275894   1.483570   1.501982   1.600125 

Observations:

The VIF values reveal potential multicollinearity concerns within the regression model. Notably, AGE and its squared term exhibit high VIF, suggesting a strong correlation. Minutes Played (MIN) and Free Throws Attempted (FTA) also show moderate to high VIF, indicating possible collinearity. This could impact coefficient precision. Variables with acceptable VIF include Games Played (GP), Wins (W), Field Goals Attempted (FGA), Field Goals Made (FGM), Free Throws Made (FTM), Rebounds (REB), Assists (AST), and Turnovers (TOV). Managing multicollinearity is vital for stable coefficients and model robustness, prompting consideration of variable transformations or selection.

Residual Analysis

Residual Range

quantile(round(residuals(model, type = "deviance"),2))
   0%   25%   50%   75%  100% 
-0.69 -0.04  0.01  0.05  0.24 

Observations:

The residual range, as shown by quantiles, indicates that the majority of residuals are concentrated within a narrow interval, with the central 50% falling between -0.04 and 0.05. This suggests a relatively balanced distribution around the fitted values, supported by a median close to zero. The interquartile range (IQR) is compact, implying consistent variability in the residuals.

Residual Plots

We are looking for: - Random distribution of residuals vs fitted values - Normally distributed residuals : Normal Q-Q plot with values along line - Homoskedasticity with a Scale-Location line that is horizontal and no residual pattern - Minimal influential obs - that is, those outside the borders of Cook’s distance

plot(model)

Observations:
The plot of residuals versus fitted values reveals revealing patterns in the data. Although randomness is observed in the residuals, a discernible trend, particularly for the higher fitted values, suggests that the model could benefit from additional complexity or adjustments. Deviation from a perfect normal distribution raises considerations about the reliability of certain statistical tests, leading to further examination of the possible non-normality of the data. Heteroscedasticity, indicated by variation in the variance of the residuals, underscores the need to address the assumption of constant variance at different levels of fitted values. These data highlight the potential for refining the model and improving its overall predictive performance. The Q-Q residuals and scale location plots offer complementary insights, which guide further investigations for a more nuanced understanding of the patterns in the data.

Plot Fitted Value by Actual Value

library(broom)
train2 <- augment(model, data=train)  # this appends predicted to original dataset to build plots on your own

p <- ggplot(train2, mapping = aes(y=.fitted, x=PTS))
p  + geom_point()

Observations: the model’s ability to foresee higher scores for prolific scorers. Despite some dispersion around the diagonal line, implying imprecise predictions for individual players, the model generally performs well. Outliers, representing players with substantial deviations from predicted points, warrant further investigation to uncover underlying factors influencing these discrepancies.

Plot Residuals by Fitted Values

p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p  + geom_point()

Observations:

The scatter plot of residuals versus fitted values reveals a scattered distribution around the zero line, indicating that the residuals are generally centered. However, the presence of curvature suggests a potential non-linearity in the model’s relationship with higher fitted values.

Performance Evaluation

Use Model to Score test dataset (Display First 10 values - depvar and fitted values only)

pred <- predict(model, newdata = test, interval="predict")  #score test dataset wtih model
test_w_predict <- cbind(test, pred)  # append score to original test dataset

test_w_predict %>%
  head(10) %>%
    select(PLAYER, AGE, fit, lwr,upr) %>% 
  as_flextable(show_coltype = FALSE) 

PLAYER

AGE

fit

lwr

upr

Al Horford

32

3.1

2.9

3.3

Alex Abrines

25

2.6

2.4

2.8

Alex Poythress

25

2.7

2.5

2.9

Alfonzo McKinnie

26

2.8

2.6

3.0

Allen Crabbe

27

2.7

2.6

2.9

Amile Jefferson

25

2.9

2.7

3.1

Andre Iguodala

35

2.5

2.4

2.7

Andre Ingram

33

2.1

1.9

2.3

Andrew Bogut

34

2.6

2.4

2.8

Anfernee Simons

19

3.2

3.0

3.4

n: 10

Observations:
The model was applied to the test dataset, generating predicted values and corresponding confidence intervals for the dependent variable (PTS). The results for the first 10 observations include player names (PLAYER), ages (AGE), fitted values (fit), and the lower (lwr) and upper (upr) bounds of the confidence intervals. For instance, Al Horford, aged 32, has a fitted PTS value of 3.1 with a confidence interval ranging from 2.9 to 3.3. Similar information is provided for the other players in the displayed subset.

Plot Actual vs Fitted (test)

plot(test_w_predict$PTS, test_w_predict$fit, pch = 16, cex = 1)

Observations:
In the plot of actual versus fitted values for the test set, a positive correlation is observed, indicating that the model tends to predict higher scores for players who actually score more points and vice versa.

Performance Metrics

library(Metrics)
metric_label <- c("MAE", "RMSE", "MAPE")
metrics <- c(
  round(mae(test_w_predict$PTS, test_w_predict$fit), 4),
  round(rmse(test_w_predict$PTS, test_w_predict$fit), 4),
  round(mape(test_w_predict$PTS, test_w_predict$fit), 4)
)
pmtable <- data.frame(Metric = metric_label, Value = metrics)
flextable(pmtable)

Metric

Value

MAE

17.1601

RMSE

19.1095

MAPE

Inf

Observations:
These metrics provide insights into the accuracy and precision of the model’s predictions. The MAE represents the average absolute difference between the predicted and actual values, with a lower value indicating better performance. The RMSE gives a measure of the model’s prediction error, considering both the bias and variance. The MAPE, in this case, is reported as Inf, which might suggest potential issues in the predictions, such as zero values in the actual PTS.

Model Fit by Team

unique(NBA_viz$TEAM)
 [1] ORL IND OKC BOS POR BKN SAC LAL ATL GSW NYK PHI DET NOP MIN LAC CLE CHI HOU
[20] MEM MIA CHA WAS MIL DEN SAS TOR DAL UTA PHX
30 Levels: ATL BKN BOS CHA CHI CLE DAL DEN DET GSW HOU IND LAC LAL MEM ... WAS
# more in depth plots of performance 
subset_data <- subset(test_w_predict, TEAM %in% c("LAL", "BOS", "NOP"))

# Create the plot
p <- ggplot(data = subset_data,
            aes(x = PTS,
                y = fit, ymin = lwr, ymax = upr,
                color = TEAM, fill = TEAM, group = TEAM))

p + geom_point(alpha = 0.5) + 
  geom_line() + geom_ribbon(alpha = 0.2, color = FALSE) +
  labs(title = "Actual vs Fitted with Upper and Lower CI",
       subtitle = "Teams: LAL and BOS",
       caption = "NBA_viz") +
  xlab("Actual PTS") + ylab("Fitted PTS") +
  theme(legend.position = "bottom")

Observations:
The plot displays the actual versus fitted points, including upper and lower confidence intervals. This visualization helps assess potential team effects, identifying any systematic differences in the model’s prediction accuracy for players from different teams.


ggplot2 explore (Intro Section)

This is just replication of the first section within Chapter 6 (as FYI)

pip <- lm(log(PTS) ~ AGE, data = train)
summary(pip)

Call:
lm(formula = log(PTS) ~ AGE, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.46794 -0.19955  0.01565  0.22861  0.82338 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 2.878515   0.114986  25.034 <0.0000000000000002 ***
AGE         0.002732   0.004303   0.635               0.526    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3332 on 311 degrees of freedom
Multiple R-squared:  0.001295,  Adjusted R-squared:  -0.001917 
F-statistic: 0.4031 on 1 and 311 DF,  p-value: 0.526
library(ggplot2)

p +  geom_point(alpha = 0.2) + 
       geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS"))

p + geom_point(alpha=0.1) +
    geom_smooth(color = "tomato", fill="tomato", method = MASS::rlm) +
    geom_smooth(color = "steelblue", fill="steelblue", method = "lm")

p + geom_point(alpha=0.1) +
    geom_smooth(color = "tomato", method = "lm", linewidth = 1.2, 
                formula = y ~ splines::bs(x, 3), se = FALSE)

p + geom_point(alpha=0.1) +
    geom_quantile(color = "tomato", size = 1.2, method = "rqss",
                  lambda = 1, quantiles = c(0.20, 0.5, 0.85))

Observations:
The visualizations explore modeling the relationship between player age (AGE) and log-transformed points scored (log(PTS)) in basketball. The initial scatterplot displays the linear association while robust, spline and quantile regression models account for non-linear patterns and outliers. Together, these graphics reveal subtleties and intricacies within the data that enable selecting the most appropriate model specification based on visible data patterns rather than assumptions alone. They highlight the need for both linear and non-linear considerations when relating age to scoring


SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL

The linear regression model was constructed to predict points (PTS) in the NBA dataset using several independent variables, including age (AGE), games played (GP), wins (W), minutes played (MIN), rebounds (REB), assists (AST), and turnovers (TOV). The model assumptions and performance were assessed through exploratory data analysis (EDA), visualization, and statistical metrics.

The model’s R-squared value is 0.9248, indicating that approximately 92.48% of the variability in the log-transformed points (log(PTS)) can be explained by the chosen independent variables. This high R-squared value suggests a strong relationship between the predictors and the log-transformed points, supporting the model’s overall explanatory power.

Model Performance

The model’s performance was evaluated using various metrics on the test dataset. Important key performance metrics include:

  • Mean Absolute Error (MAE): 17.1601
  • Root Mean Squared Error (RMSE): 19.1095
  • Mean Absolute Percentage Error (MAPE): inf

The model’s performance metrics reveal valuable insights into its predictive accuracy. The Mean Absolute Error (MAE) of 17.1601 implies an average deviation of approximately 17.16 points between the predicted and actual NBA player points (PTS). The Root Mean Squared Error (RMSE) at 19.1095 indicates the average magnitude of prediction errors, considering both the squared differences. However, the Mean Absolute Percentage Error (MAPE) being infinite suggests potential challenges, possibly arising from instances with zero or near-zero actual PTS values.

Visualization Impact

The visualizations, including boxplots, histograms, scatterplots, and performance plots, played an important role in model development and evaluation. They helped in:

  1. Identifying Relationships: Visualizations allowed for the exploration of relationships between variables, helping in the selection of meaningful predictors.

  2. Assessing Assumptions: Residual plots and diagnostic plots were used to check the model assumptions, ensuring the validity of linear regression.

  3. Performance Evaluation: Plots depicting actual vs. fitted values provided a clear view of the model’s predictive accuracy.

  4. Team Analysis: Visualizing model performance by team helped in understanding variations and potential biases in the dataset.

END