options(scipen=999)
library(socviz) # use for the round_df() function
library(datasetsICR)
#load("Catholic_viz.RData")
load("NBA_viz.RData")
#load("Murders96_viz.RData")DAT-4313 - Data Viz in Model Development
NBA Data
FURTHER INSTRUCTIONS
Use the dataset assigned to you in the LAB Instructions.
Use the depvar indicated for your dataset.
Select 5-8 independent variables from your dataset as predictors, and estimate a linear regression model.
Note that not all variables in your dataset make sense as predictors. For example, categorical variables with many levels or all components of a derived variable are unreasonable.
Work through the EDA to identified good candidates. Your model should be linear in the parameters. Consider alternative functional forms for your depvar or continuous variables.
Include interpretations and comments after each chunk.
Conclude with a final assessment of the model and a reflection of the value of data visualization to aid model development.
Code for NBA Data with depvar=GP(games played)
DATA
Here the data is loaded.
Partition 60% Train / 40% Test
library(caret)
set.seed(1)
index <- createDataPartition(NBA_viz$GP, p=0.6, list=FALSE)
train <- NBA_viz[index,]
test <- NBA_viz[-index,]Here we are partitioning the data into a training set and a testing set.
Training set number of observations: 319
Test dataset number of observations: 211
EDA
Descriptive Statistics
library(psych)
library(tidyverse)
library(flextable)
psych::describe(train, fast=TRUE) # (class note --> remove flextable) vars n mean sd min max range se
PLAYER 1 319 NaN NA Inf -Inf -Inf NA
FORWARD 2 319 NaN NA Inf -Inf -Inf NA
CENTER 3 319 NaN NA Inf -Inf -Inf NA
GUARD 4 319 NaN NA Inf -Inf -Inf NA
ROOKIE 5 319 NaN NA Inf -Inf -Inf NA
TEAM 6 319 NaN NA Inf -Inf -Inf NA
AGE 7 319 26.16 4.27 19 39.0 20.0 0.24
GP 8 319 49.33 25.76 1 82.0 81.0 1.44
W 9 319 25.46 16.08 0 60.0 60.0 0.90
MIN 10 319 1098.37 814.22 1 2848.0 2847.0 45.59
PTS 11 319 19.39 6.42 0 42.8 42.8 0.36
FGM 12 319 7.27 2.48 0 17.1 17.1 0.14
FGA 13 319 16.54 4.93 0 52.7 52.7 0.28
FTM 14 319 2.82 1.74 0 9.1 9.1 0.10
FTA 15 319 3.85 2.36 0 18.2 18.2 0.13
REB 16 319 8.81 4.94 0 52.7 52.7 0.28
AST 17 319 4.24 2.84 0 17.1 17.1 0.16
TOV 18 319 2.43 1.19 0 7.2 7.2 0.07
Here are all the key descriptive statistics for our data. Clearly this doesn’t teach us much about the categorical variables but can be helpful for the player characteristics. Based on the means for GP and W it looks like players win about half of their games on average. Additionally, MIN has much higher variability than the other variables.
Boxplots – All Numeric
data_long <- train %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(data_long, aes(x = Variable, y = Value)) +
geom_boxplot() +
coord_flip() +
facet_wrap(~ Variable, scales = "free", ncol=3) +
theme_minimal() +
theme(axis.text.x = element_blank()) +
labs(title = "Horizontal Boxplot for Each Numeric Variable")Here we see boxplots of all the numeric variables. Age tends to be low but there are some old outliers. GP doesn’t have any visible outliers and the mean is shifted to the right but the left whisker extends further to the left. This seems to indicate that players play a lot of games but that a few players haven’t played nearly as many games.The player skill attributes tend to have many more top end outliers than the other variables, indicating that a few top level players excel at these traits. Interestingly the W and MIN boxplots looks like the inverse of the GP boxplot.
library(DataExplorer)
plot_boxplot(train, by="ROOKIE") Here we have boxplots with the value on the x axis and the ROOKIE variable on the y axis. As we would expect, the ROOKIE numbers are shifted to the left for every variable.
Histograms
plot_histogram(train)Visualized above are histograms of all the continuous variables.There are very few extremely young players but then players become bunched up around 21-25 and then quickly taper off. FGA, FGM, PTS, and TOV all tend to be pretty symmetrical. Rebounds tend to range from 0-20 but there is an extremely high player over the 45 mark. GP appears to slowly increase as we move to the right.
Scatterplots (depvar ~ all x)
Look at the shape of the relationship between the dependent variable and all of the continuous potential independent variables.
plot_scatterplot(test, by="GP")sqrt_W <- sqrt(NBA_viz$W)
plot(x = sqrt_W, y = NBA_viz$GP)Wins looks like a square root function. There is clearly a pattern visible in the scatter plot for MIN. It is very difficult to discern a a pattern in the player skills. Explored the relationship between the sqrt(W) and GP further.
Correlation Matrix
plot_correlation(train, type="continuous")Here we see high correlation for MIN, W, and GP which is to be expected. There is also moderate to high correlation among field goal and free throw variables.
MODEL
Linear Regression Model
Estimate Coefficients and show coefficients table
model <- lm(GP ~ sqrt(W) + MIN + FGA + GUARD, data=train)
model %>% as_flextableEstimate | Standard Error | t value | Pr(>|t|) | ||
|---|---|---|---|---|---|
(Intercept) | 5.197 | 2.015 | 2.579 | 0.0104 | * |
sqrt(W) | 7.277 | 0.367 | 19.815 | 0.0000 | *** |
MIN | 0.016 | 0.001 | 18.612 | 0.0000 | *** |
FGA | -0.393 | 0.090 | -4.383 | 0.0000 | *** |
GUARDYes | -1.808 | 0.847 | -2.133 | 0.0337 | * |
Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05 | |||||
Residual standard error: 7.54 on 314 degrees of freedom | |||||
Multiple R-squared: 0.9154, Adjusted R-squared: 0.9143 | |||||
F-statistic: 849.1 on 314 and 4 DF, p-value: 0.0000 | |||||
Ended up using a model with 4 predictors the square root of wins, minutes played, field goals attempted, and whether or not the player was a guard. This model explains 91.54% of the variability in games played and all the predictors are significant at a confidence level of 95%. However because MINS and GP measure such similar things, this model is not the most interesting or useful. Addionally, sqrt(W) and MINS have possitive coefficients while FGA and GUARD have negative ones.
Coefficient Magnitude Plot
# note this simple code using coefplot package produces chart like Figure 6.6 in one short line.
library(coefplot)
coefplot(model, soft="magnitude", intercept=FALSE)Here we are able to visualize the magnitude of the coefficients of each predictor. As we can see, sqrt(W) had the largest impact of any of the individual predictors in the model.
Check for predictor independence
Using Variance Inflation Factors (VIF)
library(car)
vif(model) sqrt(W) MIN FGA GUARD
2.572557 2.673363 1.092382 1.007445
This is a test of the variance inflation factors. All predictors are well below a value of 5 meaning we don’t have to worry about multicollinearity with this model.
Residual Analysis
Residual Range
quantile(round(residuals(model, type = "deviance"),2)) 0% 25% 50% 75% 100%
-16.960 -5.405 -0.430 4.380 22.800
The median residual is -0.430. This means that, on average, the difference between observed values and predicted values is approximately -0.430 units. Since the median is close to zero, the model tends to predict values accurately.The IQR is -5.405 to 4.380. This range indicates that 50% of the residuals fall within this interval.The range from -16.960 to 22.800 indicates larger errors between the observed and predicted values.These values may warrant further investigation to determine if they are influential points or are errors in the model.
Residual Plots
We are looking for: - Random distribution of residuals vs fitted values - Normally distributed residuals : Normal Q-Q plot with values along line - Homoskedasticity with a Scale-Location line that is horizontal and no residual pattern - Minimal influential obs - that is, those outside the borders of Cook’s distance
plot(model)In the first plot there does appear to be some non random pattern to the residuals. In the second plot, other than at the extremes, the standardized residuals seem to track the Q-Q line closely.The third plot looks good as there does not appear to be a residual pattern. In the final plot there seems to be minimal influential observations with the vast majority falling well within Cook’s distance.
Plot Fitted Value by Actual Value
library(broom)
train2 <- augment(model, data=train) # this appends predicted to original dataset to build plots on your own
p <- ggplot(train2, mapping = aes(y=.fitted, x=GP))
p + geom_point()This plot visualizes the relationship between GP and the fitted values. The relationship appears very linear.
Plot Residuals by Fitted Values
p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p + geom_point()This plot depicts the residual vs. the fitted values and it is very hard to discern if there is a random or non random pattern in the plot.
Performance Evaluation
Use Model to Score test dataset (Display First 10 values - depvar and fitted values only)
pred <- predict(model, newdata = test, interval="predict") #score test dataset wtih model
test_w_predict <- cbind(test, pred) # append score to original test dataset
test_w_predict %>%
head(10) %>%
select(PLAYER, GP, fit, lwr,upr) %>%
as_flextable(show_coltype = FALSE) PLAYER | GP | fit | lwr | upr |
|---|---|---|---|---|
Alan Williams | 5 | 3.3 | -11.8 | 18.3 |
Alec Burks | 64 | 50.3 | 35.5 | 65.2 |
Alex Poythress | 21 | 24.2 | 9.3 | 39.2 |
Alize Johnson | 14 | 23.4 | 8.4 | 38.3 |
Allen Crabbe | 43 | 46.8 | 31.9 | 61.7 |
Allonzo Trier | 64 | 46.0 | 31.1 | 60.9 |
Amile Jefferson | 12 | 23.7 | 8.7 | 38.6 |
Amir Johnson | 51 | 49.1 | 34.2 | 64.1 |
Andrew Harrison | 17 | 18.8 | 3.8 | 33.7 |
Andrew Wiggins | 73 | 75.1 | 60.1 | 90.1 |
n: 10 | ||||
Here we score the test dataset and display the first 10 players with their GP and fitted value.
Plot Actual vs Fitted (test)
plot(test_w_predict$GP, test_w_predict$fit, pch=16, cex=1)Here is the actual vs fitted values for the test data. The points seem to track very closely which is exactly what we want to see.
Performance Metrics
library(Metrics)
metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$GP, test_w_predict$fit),4),
round(rmse(test_w_predict$GP, test_w_predict$fit),4),
round(mape(test_w_predict$GP, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)Metric | Value |
|---|---|
MAE | 6.4067 |
RMSE | 8.1538 |
MAPE | 0.4155 |
The Mean Absolute Error measures the average magnitude of errors between predicted and actual values. In this case, on average, the model’s predictions deviate from the actual values by 6.4067 units. The Root Mean Squared Error (RMSE) is similar to MAE but penalizes larger errors more heavily due to the squaring operation. The RMSE estimates that the deviation of the model’s predictions from the actual values is 8.1538 units, while the Mean Absolute Percentage Error (MAPE) suggests that the model’s predictions deviate from the actual values by approximately 41.55%. MAPE measures the percentage difference between predicted and actual values relative to the actual values (used ChatGPT here).
Model Fit for 2 Teams
# more in depth plots of performance
library(summarytools)
library(psych)
table(NBA_viz$TEAM)
ATL BKN BOS CHA CHI CLE DAL DEN DET GSW HOU IND LAC LAL MEM MIA MIL MIN NOP NYK
18 17 16 17 18 21 16 18 17 17 20 17 19 20 22 15 18 18 17 17
OKC ORL PHI PHX POR SAC SAS TOR UTA WAS
17 17 19 17 16 17 15 18 17 19
p <- ggplot(data = subset(test_w_predict, TEAM %in% c("CLE", "MEM")),
aes(x = GP,
y = fit, ymin = lwr, ymax = upr,
color = TEAM,
fill = TEAM,
group = TEAM))
p + geom_point(alpha = 0.5) +
geom_line() + geom_ribbon(alpha = 0.2, color = FALSE) +
labs(title="Actual vs Fitted with Upper and Lower CI",
subtitle="BOS and LAL",
caption="NBA{datasetsICR}") +
xlab(NULL) + ylab("fitted") +
theme(legend.position = "bottom")Above is a plot of the actual vs. fitted values for Cleveland and Memphis which were the two teams with the most players in the datset. The two teams have a very similar pattern.
SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL
Data visualization definitely improved my understanding of modeling in this lab. I found the scatter plots to be particularly useful. The scatterplot is what allowed me to see that the relationship between W and GP looked like a square root function. That visual observation led to a direct improvement in the explanatory power of my model.