options(scipen=999)
library(socviz)
load("NBA_viz.RData")Lab 9
NBA
FURTHER INSTRUCTIONS
Use the dataset assigned to you in the LAB Instructions.
Use the depvar indicated for your dataset.
Select 5-8 independent variables from your dataset as predictors, and estimate a linear regression model.
Note that not all variables in your dataset make sense as predictors. For example, categorical variables with many levels or all components of a derived variable are unreasonable.
Work through the EDA to identified good candidates. Your model should be linear in the parameters. Consider alternative functional forms for your depvar or continuous variables.
Include interpretations and comments after each chunk.
Conclude with a final assessment of the model and a reflection of the value of data visualization to aid model development.
DATA
Partition 60% Train / 40% Test
library(caret)
set.seed(1)
index <- createDataPartition(NBA_viz$GP, p=0.6, list=FALSE)
train <- NBA_viz[-index,]
test <- NBA_viz[-index,]Training set number of observations: 211 Test dataset number of observations: 211
EDA
Descriptive Statistics
library(psych)
library(tidyverse)
psych::describe(train, fast=TRUE) vars n mean sd min max range se
PLAYER 1 211 NaN NA Inf -Inf -Inf NA
FORWARD 2 211 NaN NA Inf -Inf -Inf NA
CENTER 3 211 NaN NA Inf -Inf -Inf NA
GUARD 4 211 NaN NA Inf -Inf -Inf NA
ROOKIE 5 211 NaN NA Inf -Inf -Inf NA
TEAM 6 211 NaN NA Inf -Inf -Inf NA
AGE 7 211 26.10 4.05 19 42.0 23.0 0.28
GP 8 211 49.13 26.55 1 82.0 81.0 1.83
W 9 211 23.48 15.56 0 56.0 56.0 1.07
MIN 10 211 1156.79 875.12 1 3028.0 3027.0 60.25
PTS 11 211 20.51 9.59 0 100.5 100.5 0.66
FGM 12 211 7.65 4.17 0 50.3 50.3 0.29
FGA 13 211 17.15 5.97 0 50.3 50.3 0.41
FTM 14 211 3.26 2.44 0 17.2 17.2 0.17
FTA 15 211 4.44 2.94 0 17.2 17.2 0.20
REB 16 211 9.44 5.41 0 37.6 37.6 0.37
AST 17 211 4.59 2.85 0 14.3 14.3 0.20
TOV 18 211 2.68 1.65 0 12.8 12.8 0.11
Boxplots – All Numeric
data_long <- train %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(data_long, aes(x = Variable, y = Value)) +
geom_boxplot() +
coord_flip() +
facet_wrap(~ Variable, scales = "free", ncol=3) +
theme_minimal() +
theme(axis.text.x = element_blank()) +
labs(title = "Horizontal Boxplot for Each Numeric Variable")library(DataExplorer)
plot_boxplot(train, by="ROOKIE") Histograms
plot_histogram(train)Scatterplots (depvar ~ all x)
Look at the shape of the relationship between the dependent variable and all of the continuous potential independent variables.
plot_scatterplot(test, by="GP")Correlation Matrix
plot_correlation(train, type="continuous")MODEL
Linear Regression Model
From the above EDA, we chose the following variables to start. Note that, given the shape of the relationship between Wage and Age, we entered Age as a quadratic.
From experience and prior research, it is common to specify a dependent variable that is a currency variable (e.g., sales, revenue, wages) in log form.
Estimate the following model:
\(log(Wage.n) = Age + Age^2 + Potential + International.Reputation + Value.n + Special + Height + RightFoot + Skill.Moves\)
Estimate Coefficients and show coefficients table
options(repos = c(CRAN = "https://cran.r-project.org/"))
# Install the flextable package
install.packages("flextable")package 'flextable' successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\colli\AppData\Local\Temp\RtmpMRRXFE\downloaded_packages
# Load the required libraries
library(flextable)
library(dplyr)
# Assuming 'train' is your data and 'model' is your linear regression model
model <- lm(log(MIN) ~ AGE + I(AGE^2) + GP + W + PTS + FGM + FGA + FTM + FTA, data=train)
# Create a flextable from the model summary
flex_table <- flextable::qflextable(broom::tidy(model))
# Print the flextable
flex_tableterm | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
(Intercept) | 3.2424400119 | 1.510910472 | 2.1460173 | 0.033070399779137414630891811384572065435349941254 |
AGE | 0.0273567729 | 0.109632109 | 0.2495325 | 0.803203737388073157887902198126539587974548339844 |
I(AGE^2) | -0.0003583642 | 0.001954248 | -0.1833771 | 0.854686997233498280124308621452655643224716186523 |
GP | 0.0600872190 | 0.003449075 | 17.4212585 | 0.000000000000000000000000000000000000000004924728 |
W | -0.0135684463 | 0.006017271 | -2.2549171 | 0.025216547351271982602138876927710953168570995331 |
PTS | 0.1462322334 | 0.040670585 | 3.5955282 | 0.000407361219513735007321381109690605626383330673 |
FGM | -0.3044851468 | 0.078308476 | -3.8882783 | 0.000137134159463737562286919646759031365945702419 |
FGA | -0.0234473054 | 0.015334033 | -1.5291023 | 0.127811270588357756006203658216691110283136367798 |
FTM | -0.0889587733 | 0.078681999 | -1.1306115 | 0.259565906267686774544500849515316076576709747314 |
FTA | 0.0094562788 | 0.047656607 | 0.1984253 | 0.842912836300407253276034680311568081378936767578 |
Coefficient Magnitude Plot
library(coefplot)
coefplot(model, soft="magnitude", intercept=FALSE)Check for predictor independence
Using Variance Inflation Factors (VIF)
library(car)
vif(model) AGE I(AGE^2) GP W PTS FGM FGA FTM
91.216945 90.172728 3.885533 4.062463 70.535138 49.320031 3.886213 17.046846
FTA
9.124164
Residual Analysis
Residual Range
quantile(round(residuals(model, type = "deviance"),2)) 0% 25% 50% 75% 100%
-3.030 -0.315 -0.010 0.425 1.690
Residual Plots
We are looking for: - Random distribution of residuals vs fitted values - Normally distributed residuals : Normal Q-Q plot with values along line - Homoskedasticity with a Scale-Location line that is horizontal and no residual pattern - Minimal influential obs - that is, those outside the borders of Cook’s distance
plot(model)Plot Fitted Value by Actual Value
library(broom)
train2 <- augment(model, data=train) # this appends predicted to original dataset to build plots on your own
p <- ggplot(train2, mapping = aes(y=.fitted, x=GP))
p + geom_point()Plot Residuals by Fitted Values
p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p + geom_point()Performance Evaluation
Use Model to Score test dataset (Display First 10 values - depvar and fitted values only)
install.packages("flextable")
library(flextable)
pred <- predict(model, newdata = test, interval="predict") #score test dataset wtih model
test_w_predict <- cbind(test, pred) # append score to original test dataset
test_w_predict %>%
head(10) %>%
select(TEAM, GP, AGE, PTS, AST) %>%
as_flextable(show_coltype = FALSE) TEAM | GP | AGE | PTS | AST |
|---|---|---|---|---|
BKN | 5 | 26 | 33.8 | 5.6 |
SAC | 64 | 27 | 19.6 | 4.5 |
ATL | 21 | 25 | 16.8 | 2.7 |
IND | 14 | 22 | 9.7 | 0.7 |
BKN | 43 | 27 | 17.5 | 1.9 |
NYK | 64 | 23 | 22.9 | 3.9 |
ORL | 12 | 25 | 19.0 | 2.1 |
PHI | 51 | 31 | 18.2 | 5.4 |
NOP | 17 | 24 | 14.0 | 6.2 |
MIN | 73 | 24 | 24.9 | 3.5 |
n: 10 | ||||
Plot Actual vs Fitted (test)
plot(test_w_predict$GP, test_w_predict$fit, pch=16, cex=1)Performance Metrics
library(Metrics)
metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$GP, test_w_predict$fit),4),
round(rmse(test_w_predict$GP, test_w_predict$fit),4),
round(mape(test_w_predict$GP, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)Metric | Value |
|---|---|
MAE | 43.0012 |
RMSE | 49.5295 |
MAPE | 0.8591 |
Model Fit by Team
selected_teams <- c("ATL", "OKC0", "UTA") # Replace with your actual team names
# Create a ggplot with specified data for selected teams
p <- ggplot(data = subset(test_w_predict, TEAM %in% selected_teams),
aes(x = GP,
y = fit, ymin = lwr, ymax = upr,
color = TEAM,
fill = TEAM,
group = TEAM))
p + geom_point(alpha = 0.5) +
# Connect points with lines
geom_line() +
# Add a ribbon for confidence intervals
geom_ribbon(alpha = 0.2, color = FALSE) +
# Set plot titles and captions
labs(title="Actual vs Fitted with Upper and Lower CI",
subtitle="ATL vs. UTA",
caption="NBA Analysis") +
# Remove x-axis label
xlab(NULL) +
# Set y-axis label
ylab("fitted") +
# Adjust legend position
theme(legend.position = "bottom")ggplot2 explore (Intro Section)
This is just replication of the first section within Chapter 6 (as FYI)
pip <- lm(log(GP) ~ FGA, data=train)
summary(pip)
Call:
lm(formula = log(GP) ~ FGA, data = train)
Residuals:
Min 1Q Median 3Q Max
-3.6092 -0.2697 0.4903 0.7341 0.8825
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.505687 0.235548 14.883 <0.0000000000000002 ***
FGA 0.002057 0.012975 0.159 0.874
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.123 on 209 degrees of freedom
Multiple R-squared: 0.0001203, Adjusted R-squared: -0.004664
F-statistic: 0.02514 on 1 and 209 DF, p-value: 0.8742
library(ggplot2)
p <- ggplot(data=train, mapping=aes(y=log(GP), x=FGM))
p + geom_point(alpha = 0.2) +
geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS"))# Scatter Plot with Robust Regression Lines
p + geom_point(alpha=0.1) +
geom_smooth(color = "tomato", fill="tomato", method = MASS::rlm) +
geom_smooth(color = "steelblue", fill="steelblue", method = "lm")# Scatter Plot with Polynomial Regression Line
p + geom_point(alpha=0.1) +
geom_smooth(color = "tomato", method = "lm", linewidth = 1.2,
formula = y ~ splines::bs(x, 3), se = FALSE)# Scatter Plot with Quantile Regression Lines
p + geom_point(alpha=0.1) +
geom_quantile(color = "tomato", size = 1.2, method = "rqss",
lambda = 1, quantiles = c(0.20, 0.5, 0.85))SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL
The more games played, the more their stats increase. As a players age increases, normally the games they play increases as well. They get more playing time, they make more goals, and they have more assists. The amount of games a player plays does correlate with the other variables. There is a positive linear relationship with minutes and the other variables.