options(scipen=999)
library(socviz)
load("NBA_viz.RData")Lab 6 - Data Viz in Model Development
NBA Dataset
Loading the Data
Partitioning the Dataset into 60% Train / 40% Test
library(caret)
set.seed(1)
index <- createDataPartition(NBA_viz$GP, p=0.6, list=FALSE)
train <- NBA_viz[index,]
test <- NBA_viz[-index,]Training set number of observations: 319
Test dataset number of observations: 211
EDA
Descriptive Statistics
library(psych)
library(tidyverse)
psych::describe(train, fast=TRUE) vars n mean sd min max range se
PLAYER 1 319 NaN NA Inf -Inf -Inf NA
FORWARD 2 319 NaN NA Inf -Inf -Inf NA
CENTER 3 319 NaN NA Inf -Inf -Inf NA
GUARD 4 319 NaN NA Inf -Inf -Inf NA
ROOKIE 5 319 NaN NA Inf -Inf -Inf NA
TEAM 6 319 NaN NA Inf -Inf -Inf NA
AGE 7 319 26.16 4.27 19 39.0 20.0 0.24
GP 8 319 49.33 25.76 1 82.0 81.0 1.44
W 9 319 25.46 16.08 0 60.0 60.0 0.90
MIN 10 319 1098.37 814.22 1 2848.0 2847.0 45.59
PTS 11 319 19.39 6.42 0 42.8 42.8 0.36
FGM 12 319 7.27 2.48 0 17.1 17.1 0.14
FGA 13 319 16.54 4.93 0 52.7 52.7 0.28
FTM 14 319 2.82 1.74 0 9.1 9.1 0.10
FTA 15 319 3.85 2.36 0 18.2 18.2 0.13
REB 16 319 8.81 4.94 0 52.7 52.7 0.28
AST 17 319 4.24 2.84 0 17.1 17.1 0.16
TOV 18 319 2.43 1.19 0 7.2 7.2 0.07
Boxplots – All Numeric
data_long <- train %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(data_long, aes(x = Variable, y = Value)) +
geom_boxplot() +
coord_flip() +
facet_wrap(~ Variable, scales = "free", ncol=3) +
theme_minimal() +
theme(axis.text.x = element_blank()) +
labs(title = "Horizontal Boxplot for Each Numeric Variable")The horizontal boxplot shows the range of each of the numeric variables within the dataset. It identifies the outliers as well as the interquartile range of the data. For instance, games played (GP) had no outliers because the teams fall under a single league who regulates each of their games. A singular team would not play more games than other teams.
library(DataExplorer)
plot_boxplot(train, by="FORWARD") I originally created this boxplot to capture the stats for each team, but decided to focus on a specific position due to the lack of legibility when interpreting the graph. There are 30 basketball teams, which flooded the y-axis with data. By focusing on a position, it compares a Forward’s stats with the rest of the positions in the league. For instance, a FORWARD has a higher rebound average (REB) than other positions in the NBA.
Histograms
plot_histogram(train)This plot shows the frequency of each variable in the NBA dataset, such as the frequency of age, assists, field goals, etc.
Scatterplots (depvar ~ all x)
plot_scatterplot(test, by="GP")There is no correlation between the independent variables and the dependent variable. However, there is a small amount of clustering in assists, free throw attempts, turnovers, and points.
Correlation Matrix
plot_correlation(train, type="continuous")This correlation matrix compares variables within the dataset. Those that are a deeper red are highly correlated such as games played, wins, and minutes. In contrast, those who are blue share a negative correlation such as assists and rebounds.
MODEL
Linear Regression Model
Estimate Coefficients and show coefficients table
model <- lm(log(GP) ~ AGE + I(AGE^2) + MIN + PTS + FGM + FGA + AST +
TEAM + FORWARD, data=train)
model
Call:
lm(formula = log(GP) ~ AGE + I(AGE^2) + MIN + PTS + FGM + FGA +
AST + TEAM + FORWARD, data = train)
Coefficients:
(Intercept) AGE I(AGE^2) MIN PTS FGM
3.8897228 -0.0994578 0.0019052 0.0008554 0.0101215 0.0356402
FGA AST TEAMBKN TEAMBOS TEAMCHA TEAMCHI
-0.0481303 0.0175352 0.2823618 0.0735260 0.5184827 0.0537738
TEAMCLE TEAMDAL TEAMDEN TEAMDET TEAMGSW TEAMHOU
0.1349207 0.2467046 0.1137331 0.6054708 0.4763176 -0.1752353
TEAMIND TEAMLAC TEAMLAL TEAMMEM TEAMMIA TEAMMIL
0.4424053 0.4684311 0.0622111 0.1542639 -0.0635683 0.3873676
TEAMMIN TEAMNOP TEAMNYK TEAMOKC TEAMORL TEAMPHI
0.3752244 0.4493629 0.1980194 0.5970255 0.0243561 0.1330337
TEAMPHX TEAMPOR TEAMSAC TEAMSAS TEAMTOR TEAMUTA
0.3498505 0.3422611 0.5003870 0.4408121 0.2035228 0.2747049
TEAMWAS FORWARDYes
0.0846763 0.0890878
This model has a p-value of 0.000, which means the model is significant. The multiple r-squared is 0.6325, which means that 63.25% of the variance in the dependent variable is accounted for in the model. Therefore the model fits the dataset and properly represents the data.
Coefficient Magnitude Plot
library(coefplot)
library(ggplot2)
coefplot(model, soft="magnitude", intercept=FALSE)This coefficient Magnitude Plot is difficult to interpret. I personally would not use this to visualize the data because the outcome is not clear, and it is difficult to understand what is being communicated.
Check for predictor independence
library(car)
vif(model) GVIF Df GVIF^(1/(2*Df))
AGE 134.157039 1 11.582618
I(AGE^2) 133.901306 1 11.571573
MIN 1.345889 1 1.160125
PTS 15.933336 1 3.991658
FGM 14.652115 1 3.827808
FGA 1.747796 1 1.322042
AST 1.501471 1 1.225345
TEAM 1.965150 29 1.011716
FORWARD 1.344579 1 1.159560
Age has a incredibly high multicollinearity of 11.58, which is an outlier compared to the other variables. MIN, FGA, AST, TEAM, and FORWARD all have a low multicollinearity surrounding 1. In addition, PTS and FGM still have a low multicollinearity nearing 4, but they are not as low as the others.
Residual Analysis
Residual Range
quantile(round(residuals(model, type = "deviance"),2)) 0% 25% 50% 75% 100%
-2.80 -0.29 0.07 0.39 1.11
When analyzing the quanitles of the dataset, 25% of the data falls below -0.29, 50% falls within 0.07, and 75% is above 0.39.
Residual Plots
plot(model)There is a random distribution of residuals that are fitted as the values increase. The normal Q-Q plot has values along the line, but it drops off at the larger theoretical quantiles. There is heteroscedasticity in the scale-location line. The minimal influential observation is not influential.
Plot Fitted Value by Actual Value
library(broom)
library(ggplot2)
train2 <- augment(model, data=train)
p <- ggplot(train2, mapping = aes(y=.fitted, x=GP))
p + geom_point()This model follows a linear trend line and is homoscedastic. However, I am unsure of what I am analyzing.
Plot Residuals by Fitted Values
p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p + geom_point()I am not sure what I am evaluating. The graph is unclear and difficult to interpret.
Performance Evaluation
Use Model to Score test dataset (Display First 10 values - depvar and fitted values only)
pred <- predict(model, newdata = test, interval="predict")
test_w_predict <- cbind(test, pred)
#test_w_predict %>%
# head(10) %>%
#select(MIN, GP, PTS, FGM, FGA, AST, TEAM) %>%
#as_flextable(show_coltype = FALSE)
# code was not working on my mac but it worked in the data labPlot Actual vs Fitted (test)
plot(test_w_predict$GP, test_w_predict$fit, pch=16, cex=1)This plot shows that with more games played, there is a higher fit. It is slightly heteroscedastic but follows a linear trend line.
Performance Metrics
library(Metrics)
#metric_label <- c("MAE","RMSE", "MAPE")
#metrics <- c(round(mae(test_w_predict$GP, test_w_predict$fit),4),
#round(rmse(test_w_predict$GP, test_w_predict$fit),4),
#round(mape(test_w_predict$GP, test_w_predict$fit),4))
#pmtable <- data.frame(Metric=metric_label, Value = metrics)
#flextable(pmtable)
# The same issue as above. The code works in the data lab but not on my macThe mean absolute error is 45.6305, which means that it is on the end of poor performance out of 80. The root mean squared error is also high, indicating a poor performance of the model. The mean absolute percentage error is 0.8965, which means the predictions are off by 89.65%. The model is poor and should not be used.
Model Fit by Games Played and Win Rate
p <- ggplot(data = subset(test_w_predict, ROOKIE %in% c("Yes", "No")),
aes(x = GP,
y = fit, ymin = lwr, ymax = upr,
color = TEAM,
fill = TEAM,
group = TEAM))
p + geom_point(aes(y = fit), alpha = 0.5) +
geom_line() +
geom_ribbon(alpha = 0.2, aes(color = NULL)) +
labs(title="Actual vs Fitted with Upper and Lower CI",
subtitle="Games Played and Win Rate",
caption="NBA{datasetsICR}") +
xlab("Games Played") + ylab("Fitted") +
theme(legend.position = "bottom")SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL
The model is organized by games played by NBA teams. However, when analyzing the actual chart, it is incredibly difficult to distinguish each team. It can be assumed they all played the same number of games, but this cannot be determined by the model due to the overlap of colors. In addition, the performance metrics indicate the model is poor and does not accurately represent the data. There was a point in the lab where it became difficult to analyze each graph. This is most likely due to the wrong pairing of variables or the tests not being appropriate for what was being conducted.