Lab 9

NBA

Author

Katlyn Collins

FURTHER INSTRUCTIONS

Use the dataset assigned to you in the LAB Instructions.

Use the depvar indicated for your dataset.

Select 5-8 independent variables from your dataset as predictors, and estimate a linear regression model.

  • Note that not all variables in your dataset make sense as predictors. For example, categorical variables with many levels or all components of a derived variable are unreasonable.

  • Work through the EDA to identified good candidates. Your model should be linear in the parameters. Consider alternative functional forms for your depvar or continuous variables.

Include interpretations and comments after each chunk.

Conclude with a final assessment of the model and a reflection of the value of data visualization to aid model development.


DATA

options(scipen=999)

library(socviz) 


load("NBA_viz.RData")

Partition 60% Train / 40% Test

library(caret)
set.seed(1)
index <- createDataPartition(NBA_viz$GP, p=0.6, list=FALSE)
train <- NBA_viz[-index,]
test  <- NBA_viz[-index,]

Training set number of observations: 211 Test dataset number of observations: 211


EDA

Descriptive Statistics

library(psych)
library(tidyverse)
psych::describe(train, fast=TRUE) 
        vars   n    mean     sd min    max  range    se
PLAYER     1 211     NaN     NA Inf   -Inf   -Inf    NA
FORWARD    2 211     NaN     NA Inf   -Inf   -Inf    NA
CENTER     3 211     NaN     NA Inf   -Inf   -Inf    NA
GUARD      4 211     NaN     NA Inf   -Inf   -Inf    NA
ROOKIE     5 211     NaN     NA Inf   -Inf   -Inf    NA
TEAM       6 211     NaN     NA Inf   -Inf   -Inf    NA
AGE        7 211   26.10   4.05  19   42.0   23.0  0.28
GP         8 211   49.13  26.55   1   82.0   81.0  1.83
W          9 211   23.48  15.56   0   56.0   56.0  1.07
MIN       10 211 1156.79 875.12   1 3028.0 3027.0 60.25
PTS       11 211   20.51   9.59   0  100.5  100.5  0.66
FGM       12 211    7.65   4.17   0   50.3   50.3  0.29
FGA       13 211   17.15   5.97   0   50.3   50.3  0.41
FTM       14 211    3.26   2.44   0   17.2   17.2  0.17
FTA       15 211    4.44   2.94   0   17.2   17.2  0.20
REB       16 211    9.44   5.41   0   37.6   37.6  0.37
AST       17 211    4.59   2.85   0   14.3   14.3  0.20
TOV       18 211    2.68   1.65   0   12.8   12.8  0.11

Boxplots – All Numeric

data_long <- train %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(data_long, aes(x = Variable, y = Value)) +
  geom_boxplot() +
  coord_flip() + 
  facet_wrap(~ Variable, scales = "free", ncol=3) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  labs(title = "Horizontal Boxplot for Each Numeric Variable")

library(DataExplorer)
plot_boxplot(train, by="ROOKIE") 

Histograms

plot_histogram(train)

Scatterplots (depvar ~ all x)

Look at the shape of the relationship between the dependent variable and all of the continuous potential independent variables.

plot_scatterplot(test, by="GP")

Correlation Matrix

plot_correlation(train, type="continuous")


MODEL

Linear Regression Model

From the above EDA, we chose the following variables to start. Note that, given the shape of the relationship between Wage and Age, we entered Age as a quadratic.
From experience and prior research, it is common to specify a dependent variable that is a currency variable (e.g., sales, revenue, wages) in log form.

Estimate the following model:
\(log(Wage.n) = Age + Age^2 + Potential + International.Reputation + Value.n + Special + Height + RightFoot + Skill.Moves\)

Estimate Coefficients and show coefficients table

options(repos = c(CRAN = "https://cran.r-project.org/"))

# Install the flextable package
install.packages("flextable")
package 'flextable' successfully unpacked and MD5 sums checked

The downloaded binary packages are in
    C:\Users\colli\AppData\Local\Temp\RtmpMRRXFE\downloaded_packages
# Load the required libraries
library(flextable)
library(dplyr)

# Assuming 'train' is your data and 'model' is your linear regression model
model <- lm(log(MIN) ~ AGE + I(AGE^2) + GP + W + PTS + FGM + FGA + FTM + FTA, data=train)

# Create a flextable from the model summary
flex_table <- flextable::qflextable(broom::tidy(model))

# Print the flextable
flex_table

term

estimate

std.error

statistic

p.value

(Intercept)

3.2424400119

1.510910472

2.1460173

0.033070399779137414630891811384572065435349941254

AGE

0.0273567729

0.109632109

0.2495325

0.803203737388073157887902198126539587974548339844

I(AGE^2)

-0.0003583642

0.001954248

-0.1833771

0.854686997233498280124308621452655643224716186523

GP

0.0600872190

0.003449075

17.4212585

0.000000000000000000000000000000000000000004924728

W

-0.0135684463

0.006017271

-2.2549171

0.025216547351271982602138876927710953168570995331

PTS

0.1462322334

0.040670585

3.5955282

0.000407361219513735007321381109690605626383330673

FGM

-0.3044851468

0.078308476

-3.8882783

0.000137134159463737562286919646759031365945702419

FGA

-0.0234473054

0.015334033

-1.5291023

0.127811270588357756006203658216691110283136367798

FTM

-0.0889587733

0.078681999

-1.1306115

0.259565906267686774544500849515316076576709747314

FTA

0.0094562788

0.047656607

0.1984253

0.842912836300407253276034680311568081378936767578

Coefficient Magnitude Plot

library(coefplot)
coefplot(model, soft="magnitude", intercept=FALSE)

Check for predictor independence

Using Variance Inflation Factors (VIF)

library(car)
vif(model)
      AGE  I(AGE^2)        GP         W       PTS       FGM       FGA       FTM 
91.216945 90.172728  3.885533  4.062463 70.535138 49.320031  3.886213 17.046846 
      FTA 
 9.124164 

Residual Analysis

Residual Range

quantile(round(residuals(model, type = "deviance"),2))
    0%    25%    50%    75%   100% 
-3.030 -0.315 -0.010  0.425  1.690 

Residual Plots

We are looking for: - Random distribution of residuals vs fitted values - Normally distributed residuals : Normal Q-Q plot with values along line - Homoskedasticity with a Scale-Location line that is horizontal and no residual pattern - Minimal influential obs - that is, those outside the borders of Cook’s distance

plot(model)

Plot Fitted Value by Actual Value

library(broom)
train2 <- augment(model, data=train)  # this appends predicted to original dataset to build plots on your own

p <- ggplot(train2, mapping = aes(y=.fitted, x=GP))
p  + geom_point()

Plot Residuals by Fitted Values

p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p  + geom_point()

Performance Evaluation

Use Model to Score test dataset (Display First 10 values - depvar and fitted values only)

install.packages("flextable")
library(flextable)


pred <- predict(model, newdata = test, interval="predict")  #score test dataset wtih model
test_w_predict <- cbind(test, pred)  # append score to original test dataset

test_w_predict %>%
  head(10) %>%
    select(TEAM, GP, AGE, PTS, AST) %>% 
  as_flextable(show_coltype = FALSE) 

TEAM

GP

AGE

PTS

AST

BKN

5

26

33.8

5.6

SAC

64

27

19.6

4.5

ATL

21

25

16.8

2.7

IND

14

22

9.7

0.7

BKN

43

27

17.5

1.9

NYK

64

23

22.9

3.9

ORL

12

25

19.0

2.1

PHI

51

31

18.2

5.4

NOP

17

24

14.0

6.2

MIN

73

24

24.9

3.5

n: 10

Plot Actual vs Fitted (test)

plot(test_w_predict$GP, test_w_predict$fit, pch=16, cex=1)

Performance Metrics

library(Metrics)
metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$GP, test_w_predict$fit),4),
                         round(rmse(test_w_predict$GP, test_w_predict$fit),4),
                         round(mape(test_w_predict$GP, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)

Metric

Value

MAE

43.0012

RMSE

49.5295

MAPE

0.8591

Model Fit by Team

selected_teams <- c("ATL", "OKC0", "UTA")  # Replace with your actual team names

# Create a ggplot with specified data for selected teams
p <- ggplot(data = subset(test_w_predict, TEAM %in% selected_teams),
            aes(x = GP,
                y = fit, ymin = lwr, ymax = upr,
                color = TEAM,
                fill = TEAM,
                group = TEAM))

p +  geom_point(alpha = 0.5) + 
    # Connect points with lines
    geom_line() + 
    # Add a ribbon for confidence intervals
    geom_ribbon(alpha = 0.2, color = FALSE) +
    # Set plot titles and captions
    labs(title="Actual vs Fitted with Upper and Lower CI",
         subtitle="ATL vs. UTA",
         caption="NBA Analysis") +
    # Remove x-axis label
    xlab(NULL) + 
    # Set y-axis label
    ylab("fitted") +
    # Adjust legend position
    theme(legend.position = "bottom")


ggplot2 explore (Intro Section)

This is just replication of the first section within Chapter 6 (as FYI)

pip <- lm(log(GP) ~ FGA, data=train)
summary(pip)

Call:
lm(formula = log(GP) ~ FGA, data = train)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.6092 -0.2697  0.4903  0.7341  0.8825 

Coefficients:
            Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 3.505687   0.235548  14.883 <0.0000000000000002 ***
FGA         0.002057   0.012975   0.159               0.874    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.123 on 209 degrees of freedom
Multiple R-squared:  0.0001203, Adjusted R-squared:  -0.004664 
F-statistic: 0.02514 on 1 and 209 DF,  p-value: 0.8742
library(ggplot2)
p <- ggplot(data=train, mapping=aes(y=log(GP), x=FGM))
p + geom_point(alpha = 0.2) + 
    geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS"))

# Scatter Plot with Robust Regression Lines
p + geom_point(alpha=0.1) +
    geom_smooth(color = "tomato", fill="tomato", method = MASS::rlm) +
    geom_smooth(color = "steelblue", fill="steelblue", method = "lm")

# Scatter Plot with Polynomial Regression Line
p + geom_point(alpha=0.1) +
    geom_smooth(color = "tomato", method = "lm", linewidth = 1.2, 
                formula = y ~ splines::bs(x, 3), se = FALSE)

# Scatter Plot with Quantile Regression Lines
p + geom_point(alpha=0.1) +
    geom_quantile(color = "tomato", size = 1.2, method = "rqss",
                  lambda = 1, quantiles = c(0.20, 0.5, 0.85))


SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL

The more games played, the more their stats increase. As a players age increases, normally the games they play increases as well. They get more playing time, they make more goals, and they have more assists. The amount of games a player plays does correlate with the other variables. There is a positive linear relationship with minutes and the other variables.

END