DAT-4313 - Data Viz in Model Development

NBA Data

Author

Aaron Devine

FURTHER INSTRUCTIONS

Use the dataset assigned to you in the LAB Instructions.

Use the depvar indicated for your dataset.

Select 5-8 independent variables from your dataset as predictors, and estimate a linear regression model.

Note that not all variables in your dataset make sense as predictors. For example, categorical variables with many levels or all components of a derived variable are unreasonable.
Work through the EDA to identified good candidates. Your model should be linear in the parameters. Consider alternative functional forms for your depvar or continuous variables.

Include interpretations and comments after each chunk.

Conclude with a final assessment of the model and a reflection of the value of data visualization to aid model development.

Code for NBA Data with depvar=GP(games played)

DATA

options(scipen=999)

library(socviz) # use for the round_df() function
library(datasetsICR)

#load("Catholic_viz.RData")
load("NBA_viz.RData")
#load("Murders96_viz.RData")

Here the data is loaded.

Partition 60% Train / 40% Test

library(caret)
set.seed(1)
index <- createDataPartition(NBA_viz$GP, p=0.6, list=FALSE)
train <- NBA_viz[index,]
test  <- NBA_viz[-index,]

Here we are partitioning the data into a training set and a testing set.

Training set number of observations: 319
Test dataset number of observations: 211

EDA

Descriptive Statistics

library(psych)
library(tidyverse)
library(flextable)
psych::describe(train, fast=TRUE)  # (class note --> remove flextable)

        vars   n    mean     sd min    max  range    se
PLAYER     1 319     NaN     NA Inf   -Inf   -Inf    NA
FORWARD    2 319     NaN     NA Inf   -Inf   -Inf    NA
CENTER     3 319     NaN     NA Inf   -Inf   -Inf    NA
GUARD      4 319     NaN     NA Inf   -Inf   -Inf    NA
ROOKIE     5 319     NaN     NA Inf   -Inf   -Inf    NA
TEAM       6 319     NaN     NA Inf   -Inf   -Inf    NA
AGE        7 319   26.16   4.27  19   39.0   20.0  0.24
GP         8 319   49.33  25.76   1   82.0   81.0  1.44
W          9 319   25.46  16.08   0   60.0   60.0  0.90
MIN       10 319 1098.37 814.22   1 2848.0 2847.0 45.59
PTS       11 319   19.39   6.42   0   42.8   42.8  0.36
FGM       12 319    7.27   2.48   0   17.1   17.1  0.14
FGA       13 319   16.54   4.93   0   52.7   52.7  0.28
FTM       14 319    2.82   1.74   0    9.1    9.1  0.10
FTA       15 319    3.85   2.36   0   18.2   18.2  0.13
REB       16 319    8.81   4.94   0   52.7   52.7  0.28
AST       17 319    4.24   2.84   0   17.1   17.1  0.16
TOV       18 319    2.43   1.19   0    7.2    7.2  0.07

Here are all the key descriptive statistics for our data. Clearly this doesn’t teach us much about the categorical variables but can be helpful for the player characteristics. Based on the means for GP and W it looks like players win about half of their games on average. Additionally, MIN has much higher variability than the other variables.

Boxplots – All Numeric

data_long <- train %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(data_long, aes(x = Variable, y = Value)) +
  geom_boxplot() +
  coord_flip() + 
  facet_wrap(~ Variable, scales = "free", ncol=3) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  labs(title = "Horizontal Boxplot for Each Numeric Variable")

Here we see boxplots of all the numeric variables. Age tends to be low but there are some old outliers. GP doesn’t have any visible outliers and the mean is shifted to the right but the left whisker extends further to the left. This seems to indicate that players play a lot of games but that a few players haven’t played nearly as many games.The player skill attributes tend to have many more top end outliers than the other variables, indicating that a few top level players excel at these traits. Interestingly the W and MIN boxplots looks like the inverse of the GP boxplot.

library(DataExplorer)
plot_boxplot(train, by="ROOKIE")

Here we have boxplots with the value on the x axis and the ROOKIE variable on the y axis. As we would expect, the ROOKIE numbers are shifted to the left for every variable.

Histograms

plot_histogram(train)

Visualized above are histograms of all the continuous variables.There are very few extremely young players but then players become bunched up around 21-25 and then quickly taper off. FGA, FGM, PTS, and TOV all tend to be pretty symmetrical. Rebounds tend to range from 0-20 but there is an extremely high player over the 45 mark. GP appears to slowly increase as we move to the right.

Scatterplots (depvar ~ all x)

Look at the shape of the relationship between the dependent variable and all of the continuous potential independent variables.

plot_scatterplot(test, by="GP")

sqrt_W <- sqrt(NBA_viz$W)
plot(x = sqrt_W, y = NBA_viz$GP)

Wins looks like a square root function. There is clearly a pattern visible in the scatter plot for MIN. It is very difficult to discern a a pattern in the player skills. Explored the relationship between the sqrt(W) and GP further.

Correlation Matrix

plot_correlation(train, type="continuous")

Here we see high correlation for MIN, W, and GP which is to be expected. There is also moderate to high correlation among field goal and free throw variables.

MODEL

Linear Regression Model

Estimate Coefficients and show coefficients table

model <- lm(GP ~ sqrt(W) + MIN + FGA + GUARD, data=train)
model %>% as_flextable

	Estimate	Standard Error	t value	Pr(>\|t\|)
(Intercept)	5.197	2.015	2.579	0.0104	*
sqrt(W)	7.277	0.367	19.815	0.0000	***
MIN	0.016	0.001	18.612	0.0000	***
FGA	-0.393	0.090	-4.383	0.0000	***
GUARDYes	-1.808	0.847	-2.133	0.0337	*
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05

Residual standard error: 7.54 on 314 degrees of freedom
Multiple R-squared: 0.9154, Adjusted R-squared: 0.9143
F-statistic: 849.1 on 314 and 4 DF, p-value: 0.0000

Ended up using a model with 4 predictors the square root of wins, minutes played, field goals attempted, and whether or not the player was a guard. This model explains 91.54% of the variability in games played and all the predictors are significant at a confidence level of 95%. However because MINS and GP measure such similar things, this model is not the most interesting or useful. Addionally, sqrt(W) and MINS have possitive coefficients while FGA and GUARD have negative ones.

Coefficient Magnitude Plot

# note this simple code using coefplot package produces chart like Figure 6.6 in one short line. 
library(coefplot)
coefplot(model, soft="magnitude", intercept=FALSE)

Here we are able to visualize the magnitude of the coefficients of each predictor. As we can see, sqrt(W) had the largest impact of any of the individual predictors in the model.

Check for predictor independence

Using Variance Inflation Factors (VIF)

library(car)
vif(model)

 sqrt(W)      MIN      FGA    GUARD 
2.572557 2.673363 1.092382 1.007445

This is a test of the variance inflation factors. All predictors are well below a value of 5 meaning we don’t have to worry about multicollinearity with this model.

Residual Analysis

Residual Range

quantile(round(residuals(model, type = "deviance"),2))

     0%     25%     50%     75%    100% 
-16.960  -5.405  -0.430   4.380  22.800

The median residual is -0.430. This means that, on average, the difference between observed values and predicted values is approximately -0.430 units. Since the median is close to zero, the model tends to predict values accurately.The IQR is -5.405 to 4.380. This range indicates that 50% of the residuals fall within this interval.The range from -16.960 to 22.800 indicates larger errors between the observed and predicted values.These values may warrant further investigation to determine if they are influential points or are errors in the model.

Residual Plots

We are looking for: - Random distribution of residuals vs fitted values - Normally distributed residuals : Normal Q-Q plot with values along line - Homoskedasticity with a Scale-Location line that is horizontal and no residual pattern - Minimal influential obs - that is, those outside the borders of Cook’s distance

plot(model)

In the first plot there does appear to be some non random pattern to the residuals. In the second plot, other than at the extremes, the standardized residuals seem to track the Q-Q line closely.The third plot looks good as there does not appear to be a residual pattern. In the final plot there seems to be minimal influential observations with the vast majority falling well within Cook’s distance.

Plot Fitted Value by Actual Value

library(broom)
train2 <- augment(model, data=train)  # this appends predicted to original dataset to build plots on your own

p <- ggplot(train2, mapping = aes(y=.fitted, x=GP))
p  + geom_point()

This plot visualizes the relationship between GP and the fitted values. The relationship appears very linear.

Plot Residuals by Fitted Values

p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p  + geom_point()

This plot depicts the residual vs. the fitted values and it is very hard to discern if there is a random or non random pattern in the plot.

Performance Evaluation

Use Model to Score `test` dataset (Display First 10 values - depvar and fitted values only)

pred <- predict(model, newdata = test, interval="predict")  #score test dataset wtih model
test_w_predict <- cbind(test, pred)  # append score to original test dataset

test_w_predict %>%
  head(10) %>%
    select(PLAYER, GP, fit, lwr,upr) %>% 
  as_flextable(show_coltype = FALSE)

PLAYER	GP	fit	lwr	upr
Alan Williams	5	3.3	-11.8	18.3
Alec Burks	64	50.3	35.5	65.2
Alex Poythress	21	24.2	9.3	39.2
Alize Johnson	14	23.4	8.4	38.3
Allen Crabbe	43	46.8	31.9	61.7
Allonzo Trier	64	46.0	31.1	60.9
Amile Jefferson	12	23.7	8.7	38.6
Amir Johnson	51	49.1	34.2	64.1
Andrew Harrison	17	18.8	3.8	33.7
Andrew Wiggins	73	75.1	60.1	90.1
n: 10

Here we score the test dataset and display the first 10 players with their GP and fitted value.

Plot Actual vs Fitted (`test`)

plot(test_w_predict$GP, test_w_predict$fit, pch=16, cex=1)

Here is the actual vs fitted values for the test data. The points seem to track very closely which is exactly what we want to see.

Performance Metrics

library(Metrics)
metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$GP, test_w_predict$fit),4),
                         round(rmse(test_w_predict$GP, test_w_predict$fit),4),
                         round(mape(test_w_predict$GP, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)

Metric	Value
MAE	6.4067
RMSE	8.1538
MAPE	0.4155

The Mean Absolute Error measures the average magnitude of errors between predicted and actual values. In this case, on average, the model’s predictions deviate from the actual values by 6.4067 units. The Root Mean Squared Error (RMSE) is similar to MAE but penalizes larger errors more heavily due to the squaring operation. The RMSE estimates that the deviation of the model’s predictions from the actual values is 8.1538 units, while the Mean Absolute Percentage Error (MAPE) suggests that the model’s predictions deviate from the actual values by approximately 41.55%. MAPE measures the percentage difference between predicted and actual values relative to the actual values (used ChatGPT here).

Model Fit for 2 Teams

# more in depth plots of performance 
library(summarytools)
library(psych)
table(NBA_viz$TEAM)


ATL BKN BOS CHA CHI CLE DAL DEN DET GSW HOU IND LAC LAL MEM MIA MIL MIN NOP NYK 
 18  17  16  17  18  21  16  18  17  17  20  17  19  20  22  15  18  18  17  17 
OKC ORL PHI PHX POR SAC SAS TOR UTA WAS 
 17  17  19  17  16  17  15  18  17  19

p <- ggplot(data = subset(test_w_predict, TEAM %in% c("CLE", "MEM")),
            aes(x = GP,
                y = fit, ymin = lwr, ymax = upr,
                color = TEAM,
                fill = TEAM,
                group = TEAM))
p +  geom_point(alpha = 0.5) + 
       geom_line() + geom_ribbon(alpha = 0.2, color = FALSE) +
       labs(title="Actual vs Fitted with Upper and Lower CI",
              subtitle="BOS and LAL",
              caption="NBA{datasetsICR}") +
       xlab(NULL) + ylab("fitted") +
       theme(legend.position = "bottom")

Above is a plot of the actual vs. fitted values for Cleveland and Memphis which were the two teams with the most players in the datset. The two teams have a very similar pattern.

SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL

Data visualization definitely improved my understanding of modeling in this lab. I found the scatter plots to be particularly useful. The scatterplot is what allowed me to see that the relationship between W and GP looked like a square root function. That visual observation led to a direct improvement in the explanatory power of my model.