Lab 6 - Data Viz in Model Development

NBA Dataset

Author

Elise Bosma

Loading the Data

options(scipen=999)   # suppress scientific notation in printed output

library(socviz)

load("NBA_viz.RData")   # loads the NBA_viz data frame

Partitioning the Dataset into 60% Train / 40% Test

library(caret)
set.seed(1)   # for a reproducible partition
index <- createDataPartition(NBA_viz$GP, p=0.6, list=FALSE)
train <- NBA_viz[index,]
test  <- NBA_viz[-index,]

Training set number of observations: 319
Test dataset number of observations: 211
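
As a quick check, the split sizes can be confirmed directly (a small sketch using the objects created above):

nrow(train)   # 319
nrow(test)    # 211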


EDA

Descriptive Statistics

library(psych)
library(tidyverse)
psych::describe(train, fast=TRUE)
        vars   n    mean     sd min    max  range    se
PLAYER     1 319     NaN     NA Inf   -Inf   -Inf    NA
FORWARD    2 319     NaN     NA Inf   -Inf   -Inf    NA
CENTER     3 319     NaN     NA Inf   -Inf   -Inf    NA
GUARD      4 319     NaN     NA Inf   -Inf   -Inf    NA
ROOKIE     5 319     NaN     NA Inf   -Inf   -Inf    NA
TEAM       6 319     NaN     NA Inf   -Inf   -Inf    NA
AGE        7 319   26.16   4.27  19   39.0   20.0  0.24
GP         8 319   49.33  25.76   1   82.0   81.0  1.44
W          9 319   25.46  16.08   0   60.0   60.0  0.90
MIN       10 319 1098.37 814.22   1 2848.0 2847.0 45.59
PTS       11 319   19.39   6.42   0   42.8   42.8  0.36
FGM       12 319    7.27   2.48   0   17.1   17.1  0.14
FGA       13 319   16.54   4.93   0   52.7   52.7  0.28
FTM       14 319    2.82   1.74   0    9.1    9.1  0.10
FTA       15 319    3.85   2.36   0   18.2   18.2  0.13
REB       16 319    8.81   4.94   0   52.7   52.7  0.28
AST       17 319    4.24   2.84   0   17.1   17.1  0.16
TOV       18 319    2.43   1.19   0    7.2    7.2  0.07
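
The NaN/Inf rows above come from the non-numeric columns (PLAYER, the position flags, and TEAM), which psych::describe() cannot summarize. A hedged fix is to restrict the summary to the numeric columns:

train %>%
  select(where(is.numeric)) %>%
  psych::describe(fast=TRUE)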

Boxplots – All Numeric

data_long <- train %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(data_long, aes(x = Variable, y = Value)) +
  geom_boxplot() +
  coord_flip() + 
  facet_wrap(~ Variable, scales = "free", ncol=3) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  labs(title = "Horizontal Boxplot for Each Numeric Variable")

The horizontal boxplots show the range of each numeric variable in the dataset, along with the interquartile range and any outliers. For instance, games played (GP) has no outliers because every team follows the same league schedule, which caps the regular season at 82 games; no player can appear in more games than the league allows.

library(DataExplorer)
plot_boxplot(train, by="FORWARD") 

I originally created this boxplot to capture the stats for each team, but decided to focus on a specific position because grouping by all 30 teams flooded the y-axis and made the graph illegible. Grouping by position instead compares a FORWARD's stats with the rest of the positions in the league. For instance, a FORWARD has a higher rebound average (REB) than players at other positions in the NBA.

Histograms

plot_histogram(train)

This plot shows the distribution of each numeric variable in the NBA dataset, such as age, assists, and field goals made.

Scatterplots (depvar ~ all x)

plot_scatterplot(train, by="GP")

There is little visible linear correlation between the independent variables and games played, though there is some clustering in assists, free throw attempts, turnovers, and points.
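
To put numbers behind that visual impression, a quick sketch computes each numeric variable's correlation with GP on the training set:

num_train <- select(train, where(is.numeric))
round(cor(num_train, use="pairwise.complete.obs")[, "GP"], 2)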

Correlation Matrix

plot_correlation(train, type="continuous")

This correlation matrix compares pairs of variables within the dataset. Cells shaded a deeper red are highly positively correlated, such as games played, wins, and minutes. In contrast, blue cells indicate a negative correlation, such as assists with rebounds.


MODEL

Linear Regression Model

Estimate Coefficients and show coefficients table

model <- lm(log(GP) ~ AGE + I(AGE^2) + MIN + PTS + FGM + FGA + AST +
              TEAM + FORWARD, data=train)
model

Call:
lm(formula = log(GP) ~ AGE + I(AGE^2) + MIN + PTS + FGM + FGA + 
    AST + TEAM + FORWARD, data = train)

Coefficients:
(Intercept)          AGE     I(AGE^2)          MIN          PTS          FGM  
  3.8897228   -0.0994578    0.0019052    0.0008554    0.0101215    0.0356402  
        FGA          AST      TEAMBKN      TEAMBOS      TEAMCHA      TEAMCHI  
 -0.0481303    0.0175352    0.2823618    0.0735260    0.5184827    0.0537738  
    TEAMCLE      TEAMDAL      TEAMDEN      TEAMDET      TEAMGSW      TEAMHOU  
  0.1349207    0.2467046    0.1137331    0.6054708    0.4763176   -0.1752353  
    TEAMIND      TEAMLAC      TEAMLAL      TEAMMEM      TEAMMIA      TEAMMIL  
  0.4424053    0.4684311    0.0622111    0.1542639   -0.0635683    0.3873676  
    TEAMMIN      TEAMNOP      TEAMNYK      TEAMOKC      TEAMORL      TEAMPHI  
  0.3752244    0.4493629    0.1980194    0.5970255    0.0243561    0.1330337  
    TEAMPHX      TEAMPOR      TEAMSAC      TEAMSAS      TEAMTOR      TEAMUTA  
  0.3498505    0.3422611    0.5003870    0.4408121    0.2035228    0.2747049  
    TEAMWAS   FORWARDYes  
  0.0846763    0.0890878  

The overall F-test p-value is below 0.001, which means the model is statistically significant. The multiple R-squared is 0.6325, meaning 63.25% of the variance in log(GP) is accounted for by the model. This suggests a reasonable in-sample fit, though it does not by itself guarantee good predictive performance on new data.
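
The printed model above shows only the coefficients; the R-squared and p-value cited here come from the full model summary:

summary(model)   # displays R-squared, adjusted R-squared, and the overall F-test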

Coefficient Magnitude Plot

library(coefplot)
library(ggplot2)
coefplot(model, sort="magnitude", intercept=FALSE)

This coefficient magnitude plot is difficult to interpret. I personally would not use it to visualize these results: the 30 team dummies crowd the axis, so the outcome is not clear and it is hard to see what is being communicated.

Check for predictor independence

library(car)
vif(model)
               GVIF Df GVIF^(1/(2*Df))
AGE      134.157039  1       11.582618
I(AGE^2) 133.901306  1       11.571573
MIN        1.345889  1        1.160125
PTS       15.933336  1        3.991658
FGM       14.652115  1        3.827808
FGA        1.747796  1        1.322042
AST        1.501471  1        1.225345
TEAM       1.965150 29        1.011716
FORWARD    1.344579  1        1.159560

AGE has an extremely high adjusted GVIF of 11.58, an outlier compared to the other variables, and I(AGE^2) is nearly identical at 11.57. MIN, FGA, AST, TEAM, and FORWARD all have low multicollinearity, with values near 1. PTS and FGM sit closer to 4, which is moderate rather than low.
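
The collinearity between AGE and I(AGE^2) is expected because the quadratic term is computed directly from AGE. A common remedy, sketched here rather than part of the original model, is to center AGE before squaring and then recheck the VIFs:

train_c <- train
train_c$AGE_c <- train_c$AGE - mean(train_c$AGE)   # centered age
model_c <- lm(log(GP) ~ AGE_c + I(AGE_c^2) + MIN + PTS + FGM + FGA + AST +
                TEAM + FORWARD, data=train_c)
vif(model_c)   # the GVIFs for the age terms should drop substantially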

Residual Analysis

Residual Range

quantile(round(residuals(model, type = "deviance"),2))
   0%   25%   50%   75%  100% 
-2.80 -0.29  0.07  0.39  1.11 

When analyzing the quantiles of the residuals, 25% fall below -0.29, the median is 0.07, and 75% fall below 0.39 (so only 25% exceed it); the full range runs from -2.80 to 1.11.

Residual Plots

plot(model)

In the residuals-vs-fitted plot, the residuals scatter fairly randomly, though their spread changes as the fitted values increase. The normal Q-Q plot follows the reference line for most values but drops off at the largest theoretical quantiles. The scale-location plot shows signs of heteroscedasticity. No observation stands out as highly influential in the residuals-vs-leverage plot.
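
plot(model) steps through its four diagnostic plots one at a time; arranging them in a 2x2 grid makes them easier to compare (a small sketch):

par(mfrow=c(2, 2))   # 2x2 grid of diagnostic panels
plot(model)
par(mfrow=c(1, 1))   # restore the default single-panel layout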

Plot Fitted Value by Actual Value

library(broom)
library(ggplot2)
train2 <- augment(model, data=train)  

p <- ggplot(train2, mapping = aes(y=.fitted, x=GP))
p  + geom_point()

The fitted values rise with actual games played. Because the model predicts log(GP), the fitted values are on the log scale, so the trend flattens at higher game counts rather than following a straight line in raw units.
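
Back-transforming with exp() puts both axes in games played, and a y = x reference line then marks perfect predictions (a sketch using the augmented training data from above):

ggplot(train2, aes(x=GP, y=exp(.fitted))) +
  geom_point(alpha=0.5) +
  geom_abline(slope=1, intercept=0, linetype="dashed") +   # perfect-prediction line
  labs(x="Actual games played", y="Back-transformed fitted values")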

Plot Residuals by Fitted Values

p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p  + geom_point()

This plot checks the residuals against the fitted values. Residuals should scatter randomly around zero with roughly constant spread; a funnel shape or a visible trend would indicate heteroscedasticity or a misspecified model.
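
Adding a zero line and a smoother makes the pattern easier to judge (a sketch reusing the plot object p):

p + geom_point(alpha=0.5) +
  geom_hline(yintercept=0, linetype="dashed") +   # residuals should center on zero
  geom_smooth(se=FALSE)                           # reveals any trend in the residuals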

Performance Evaluation

Use Model to Score test dataset (Display First 10 values - depvar and fitted values only)

pred <- predict(model, newdata = test, interval="prediction")   # adds fit, lwr, upr
test_w_predict <- cbind(test, pred)

library(flextable)   # as_flextable() lives in flextable; the package was likely
                     # missing on my mac, which would explain why this chunk
                     # failed locally but worked in the data lab

test_w_predict %>%
  head(10) %>%
  select(GP, fit) %>%   # dependent variable and fitted values only, per the heading
  as_flextable(show_coltype = FALSE)

Plot Actual vs Fitted (test)

plot(test_w_predict$GP, test_w_predict$fit, pch=16, cex=1)

The fitted values increase with games played, with slightly uneven spread. Keep in mind that fit is on the log scale, so the trend flattens at higher game counts rather than following a straight line.

Performance Metrics

library(Metrics)
library(flextable)   # the same likely cause as above: flextable was missing on my
                     # mac, though this chunk worked in the data lab

metric_label <- c("MAE","RMSE","MAPE")
metrics <- c(round(mae(test_w_predict$GP, test_w_predict$fit),4),
             round(rmse(test_w_predict$GP, test_w_predict$fit),4),
             round(mape(test_w_predict$GP, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value=metrics)
flextable(pmtable)

The mean absolute error is 45.6305, which is very poor given that GP tops out at 82. The root mean squared error is also high, and the mean absolute percentage error of 0.8965 means the predictions are off by 89.65% on average. These numbers indicate a poor model that should not be used. Part of the error, though, comes from a scale mismatch: fit is a prediction of log(GP), while GP is in raw games, so the metrics above compare different units.
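
A hedged sketch back-transforms the predictions before scoring, so that the metrics are in games-played units:

pred_games <- exp(test_w_predict$fit)   # back to games-played units
round(c(MAE  = mae(test_w_predict$GP, pred_games),
        RMSE = rmse(test_w_predict$GP, pred_games),
        MAPE = mape(test_w_predict$GP, pred_games)), 4)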

Model Fit by Games Played and Win Rate

p <- ggplot(data = subset(test_w_predict, ROOKIE %in% c("Yes", "No")),
            aes(x = GP,
                y = fit, ymin = lwr, ymax = upr,
                color = TEAM,
                fill = TEAM,
                group = TEAM))
p + geom_point(aes(y = fit), alpha = 0.5) + 
    geom_line() + 
    geom_ribbon(alpha = 0.2, aes(color = NULL)) +
    labs(title="Actual vs Fitted with Upper and Lower CI",
         subtitle="Games Played and Win Rate",
         caption="NBA{datasetsICR}") +
    xlab("Games Played") + ylab("Fitted") +
    theme(legend.position = "bottom")

SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL

The final plot organizes the fitted values by games played for each NBA team, but the chart is extremely difficult to read: with 30 overlapping colors, the individual teams cannot be distinguished. It can be assumed they all played a similar number of games, but that cannot be confirmed from this plot. In addition, the performance metrics indicate the model predicts poorly and does not accurately represent the data. Several of the later graphs were hard to interpret, most likely because of mismatched variable pairings, the log-scale dependent variable, or visualizations that were not appropriate for the question being asked.
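
One way to address the color overlap (a sketch reusing the plot object p from the previous chunk) is to facet by team so each fit gets its own panel:

p + geom_point(alpha=0.5) +
  geom_line() +
  geom_ribbon(alpha=0.2, aes(color=NULL)) +
  facet_wrap(~ TEAM) +            # one small panel per team
  theme(legend.position="none")   # the facet labels replace the legend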