Lab 9

NBA

Author

Katlyn Collins

FURTHER INSTRUCTIONS

Use the dataset assigned to you in the LAB Instructions.

Use the depvar indicated for your dataset.

Select 5-8 independent variables from your dataset as predictors, and estimate a linear regression model.

Note that not all variables in your dataset make sense as predictors. For example, categorical variables with many levels or all components of a derived variable are unreasonable.
Work through the EDA to identified good candidates. Your model should be linear in the parameters. Consider alternative functional forms for your depvar or continuous variables.

Include interpretations and comments after each chunk.

Conclude with a final assessment of the model and a reflection of the value of data visualization to aid model development.

DATA

options(scipen=999)

library(socviz) 


load("NBA_viz.RData")

Partition 60% Train / 40% Test

library(caret)
set.seed(1)
index <- createDataPartition(NBA_viz$GP, p=0.6, list=FALSE)
train <- NBA_viz[-index,]
test  <- NBA_viz[-index,]

Training set number of observations: 211 Test dataset number of observations: 211

EDA

Descriptive Statistics

library(psych)
library(tidyverse)
psych::describe(train, fast=TRUE)

        vars   n    mean     sd min    max  range    se
PLAYER     1 211     NaN     NA Inf   -Inf   -Inf    NA
FORWARD    2 211     NaN     NA Inf   -Inf   -Inf    NA
CENTER     3 211     NaN     NA Inf   -Inf   -Inf    NA
GUARD      4 211     NaN     NA Inf   -Inf   -Inf    NA
ROOKIE     5 211     NaN     NA Inf   -Inf   -Inf    NA
TEAM       6 211     NaN     NA Inf   -Inf   -Inf    NA
AGE        7 211   26.10   4.05  19   42.0   23.0  0.28
GP         8 211   49.13  26.55   1   82.0   81.0  1.83
W          9 211   23.48  15.56   0   56.0   56.0  1.07
MIN       10 211 1156.79 875.12   1 3028.0 3027.0 60.25
PTS       11 211   20.51   9.59   0  100.5  100.5  0.66
FGM       12 211    7.65   4.17   0   50.3   50.3  0.29
FGA       13 211   17.15   5.97   0   50.3   50.3  0.41
FTM       14 211    3.26   2.44   0   17.2   17.2  0.17
FTA       15 211    4.44   2.94   0   17.2   17.2  0.20
REB       16 211    9.44   5.41   0   37.6   37.6  0.37
AST       17 211    4.59   2.85   0   14.3   14.3  0.20
TOV       18 211    2.68   1.65   0   12.8   12.8  0.11

Boxplots – All Numeric

data_long <- train %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(data_long, aes(x = Variable, y = Value)) +
  geom_boxplot() +
  coord_flip() + 
  facet_wrap(~ Variable, scales = "free", ncol=3) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  labs(title = "Horizontal Boxplot for Each Numeric Variable")

library(DataExplorer)
plot_boxplot(train, by="ROOKIE")

Histograms

plot_histogram(train)

Scatterplots (depvar ~ all x)

Look at the shape of the relationship between the dependent variable and all of the continuous potential independent variables.

plot_scatterplot(test, by="GP")

Correlation Matrix

plot_correlation(train, type="continuous")

MODEL

Linear Regression Model

From the above EDA, we chose the following variables to start. Note that, given the shape of the relationship between Wage and Age, we entered Age as a quadratic.
From experience and prior research, it is common to specify a dependent variable that is a currency variable (e.g., sales, revenue, wages) in log form.

Estimate the following model:
\(log(Wage.n) = Age + Age^2 + Potential + International.Reputation + Value.n + Special + Height + RightFoot + Skill.Moves\)

Estimate Coefficients and show coefficients table

options(repos = c(CRAN = "https://cran.r-project.org/"))

# Install the flextable package
install.packages("flextable")

package 'flextable' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\colli\AppData\Local\Temp\RtmpMRRXFE\downloaded_packages

# Load the required libraries
library(flextable)
library(dplyr)

# Assuming 'train' is your data and 'model' is your linear regression model
model <- lm(log(MIN) ~ AGE + I(AGE^2) + GP + W + PTS + FGM + FGA + FTM + FTA, data=train)

# Create a flextable from the model summary
flex_table <- flextable::qflextable(broom::tidy(model))

# Print the flextable
flex_table

term	estimate	std.error	statistic	p.value
(Intercept)	3.2424400119	1.510910472	2.1460173	0.033070399779137414630891811384572065435349941254
AGE	0.0273567729	0.109632109	0.2495325	0.803203737388073157887902198126539587974548339844
I(AGE^2)	-0.0003583642	0.001954248	-0.1833771	0.854686997233498280124308621452655643224716186523
GP	0.0600872190	0.003449075	17.4212585	0.000000000000000000000000000000000000000004924728
W	-0.0135684463	0.006017271	-2.2549171	0.025216547351271982602138876927710953168570995331
PTS	0.1462322334	0.040670585	3.5955282	0.000407361219513735007321381109690605626383330673
FGM	-0.3044851468	0.078308476	-3.8882783	0.000137134159463737562286919646759031365945702419
FGA	-0.0234473054	0.015334033	-1.5291023	0.127811270588357756006203658216691110283136367798
FTM	-0.0889587733	0.078681999	-1.1306115	0.259565906267686774544500849515316076576709747314
FTA	0.0094562788	0.047656607	0.1984253	0.842912836300407253276034680311568081378936767578

Coefficient Magnitude Plot

library(coefplot)
coefplot(model, soft="magnitude", intercept=FALSE)

Check for predictor independence

Using Variance Inflation Factors (VIF)

library(car)
vif(model)

      AGE  I(AGE^2)        GP         W       PTS       FGM       FGA       FTM 
91.216945 90.172728  3.885533  4.062463 70.535138 49.320031  3.886213 17.046846 
      FTA 
 9.124164

Residual Analysis

Residual Range

quantile(round(residuals(model, type = "deviance"),2))

    0%    25%    50%    75%   100% 
-3.030 -0.315 -0.010  0.425  1.690

Residual Plots

We are looking for: - Random distribution of residuals vs fitted values - Normally distributed residuals : Normal Q-Q plot with values along line - Homoskedasticity with a Scale-Location line that is horizontal and no residual pattern - Minimal influential obs - that is, those outside the borders of Cook’s distance

plot(model)

Plot Fitted Value by Actual Value

library(broom)
train2 <- augment(model, data=train)  # this appends predicted to original dataset to build plots on your own

p <- ggplot(train2, mapping = aes(y=.fitted, x=GP))
p  + geom_point()

Plot Residuals by Fitted Values

p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p  + geom_point()

Performance Evaluation

Use Model to Score `test` dataset (Display First 10 values - depvar and fitted values only)

install.packages("flextable")
library(flextable)


pred <- predict(model, newdata = test, interval="predict")  #score test dataset wtih model
test_w_predict <- cbind(test, pred)  # append score to original test dataset

test_w_predict %>%
  head(10) %>%
    select(TEAM, GP, AGE, PTS, AST) %>% 
  as_flextable(show_coltype = FALSE)

TEAM	GP	AGE	PTS	AST
BKN	5	26	33.8	5.6
SAC	64	27	19.6	4.5
ATL	21	25	16.8	2.7
IND	14	22	9.7	0.7
BKN	43	27	17.5	1.9
NYK	64	23	22.9	3.9
ORL	12	25	19.0	2.1
PHI	51	31	18.2	5.4
NOP	17	24	14.0	6.2
MIN	73	24	24.9	3.5
n: 10

Plot Actual vs Fitted (`test`)

plot(test_w_predict$GP, test_w_predict$fit, pch=16, cex=1)

Performance Metrics

library(Metrics)
metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$GP, test_w_predict$fit),4),
                         round(rmse(test_w_predict$GP, test_w_predict$fit),4),
                         round(mape(test_w_predict$GP, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)

Metric	Value
MAE	43.0012
RMSE	49.5295
MAPE	0.8591

Model Fit by Team

selected_teams <- c("ATL", "OKC0", "UTA")  # Replace with your actual team names

# Create a ggplot with specified data for selected teams
p <- ggplot(data = subset(test_w_predict, TEAM %in% selected_teams),
            aes(x = GP,
                y = fit, ymin = lwr, ymax = upr,
                color = TEAM,
                fill = TEAM,
                group = TEAM))

p +  geom_point(alpha = 0.5) + 
    # Connect points with lines
    geom_line() + 
    # Add a ribbon for confidence intervals
    geom_ribbon(alpha = 0.2, color = FALSE) +
    # Set plot titles and captions
    labs(title="Actual vs Fitted with Upper and Lower CI",
         subtitle="ATL vs. UTA",
         caption="NBA Analysis") +
    # Remove x-axis label
    xlab(NULL) + 
    # Set y-axis label
    ylab("fitted") +
    # Adjust legend position
    theme(legend.position = "bottom")

`ggplot2` explore (Intro Section)

This is just replication of the first section within Chapter 6 (as FYI)

pip <- lm(log(GP) ~ FGA, data=train)
summary(pip)


Call:
lm(formula = log(GP) ~ FGA, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6092 -0.2697  0.4903  0.7341  0.8825 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 3.505687   0.235548  14.883 <0.0000000000000002 ***
FGA         0.002057   0.012975   0.159               0.874    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.123 on 209 degrees of freedom
Multiple R-squared:  0.0001203, Adjusted R-squared:  -0.004664 
F-statistic: 0.02514 on 1 and 209 DF,  p-value: 0.8742

library(ggplot2)
p <- ggplot(data=train, mapping=aes(y=log(GP), x=FGM))
p + geom_point(alpha = 0.2) + 
    geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS"))

# Scatter Plot with Robust Regression Lines
p + geom_point(alpha=0.1) +
    geom_smooth(color = "tomato", fill="tomato", method = MASS::rlm) +
    geom_smooth(color = "steelblue", fill="steelblue", method = "lm")

# Scatter Plot with Polynomial Regression Line
p + geom_point(alpha=0.1) +
    geom_smooth(color = "tomato", method = "lm", linewidth = 1.2, 
                formula = y ~ splines::bs(x, 3), se = FALSE)

# Scatter Plot with Quantile Regression Lines
p + geom_point(alpha=0.1) +
    geom_quantile(color = "tomato", size = 1.2, method = "rqss",
                  lambda = 1, quantiles = c(0.20, 0.5, 0.85))

SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL

The more games played, the more their stats increase. As a players age increases, normally the games they play increases as well. They get more playing time, they make more goals, and they have more assists. The amount of games a player plays does correlate with the other variables. There is a positive linear relationship with minutes and the other variables.