DAT-4313 - Data Viz in Model Development

NBA viz Group

Author

Andy Koerner

FURTHER INSTRUCTIONS

Use the dataset assigned to you in the LAB Instructions.

Use the depvar indicated for your dataset.

Select at least 5 independent variables from your dataset as predictors, and estimate a linear regression model.

Note that not all variables in your dataset make sense as predictors. For example, categorical variables with many levels or all components of a derived variable are unreasonable.
Work through the EDA to identified good candidates. Your model should be linear in the parameters. Consider alternative functional forms for your depvar or continuous variables.

Include interpretations and comments after each chunk.

Conclude with a final assessment of the model and a reflection of the value of data visualization to aid model development.

Example Code for FIFA with depvar=Wage

DATA

options(scipen=999)

library(socviz)

load("FIFA_viz.RData")
#load("Catholic_viz.RData")
load("NBA_viz.RData")
#load("Murders96_viz.RData")
# NBA_viz

Partition 60% Train / 40% Test

library(caret)
set.seed(1)
index <- createDataPartition(NBA_viz$GP, p=0.6, list=FALSE) #dependent var, 60%
train <- NBA_viz[index,]
test  <- NBA_viz[-index,]

Training set number of observations: 319
Test dataset number of observations: 211

EDA

Descriptive Statistics

library(psych)
library(tidyverse)
library(flextable)
describe(train, fast=FALSE)   # descriptive statistics

         vars   n    mean     sd median trimmed     mad min    max  range  skew
PLAYER*     1 319  274.47 155.16  276.0  275.49  207.56   1  529.0  528.0 -0.04
FORWARD*    2 319    1.48   0.50    1.0    1.47    0.00   1    2.0    1.0  0.08
CENTER*     3 319    1.18   0.39    1.0    1.11    0.00   1    2.0    1.0  1.62
GUARD*      4 319    1.50   0.50    2.0    1.50    0.00   1    2.0    1.0 -0.01
ROOKIE*     5 319    1.19   0.39    1.0    1.12    0.00   1    2.0    1.0  1.56
TEAM*       6 319   15.79   8.65   16.0   15.85   11.86   1   30.0   29.0 -0.02
AGE         7 319   26.16   4.27   25.0   25.87    4.45  19   39.0   20.0  0.55
GP          8 319   49.33  25.76   55.0   50.98   29.65   1   82.0   81.0 -0.46
W           9 319   25.46  16.08   24.0   25.18   20.76   0   60.0   60.0  0.14
MIN        10 319 1098.37 814.22 1063.0 1059.37 1082.30   1 2848.0 2847.0  0.26
PTS        11 319   19.39   6.42   19.1   19.36    5.78   0   42.8   42.8 -0.02
FGM        12 319    7.27   2.48    7.1    7.27    2.22   0   17.1   17.1 -0.06
FGA        13 319   16.54   4.93   16.0   16.30    4.00   0   52.7   52.7  1.49
FTM        14 319    2.82   1.74    2.7    2.71    1.63   0    9.1    9.1  0.74
FTA        15 319    3.85   2.36    3.6    3.69    2.08   0   18.2   18.2  1.20
REB        16 319    8.81   4.94    7.7    8.34    3.71   0   52.7   52.7  2.60
AST        17 319    4.24   2.84    3.5    3.91    2.08   0   17.1   17.1  1.24
TOV        18 319    2.43   1.19    2.2    2.37    1.04   0    7.2    7.2  0.68
         kurtosis    se
PLAYER*     -1.29  8.69
FORWARD*    -2.00  0.03
CENTER*      0.61  0.02
GUARD*      -2.01  0.03
ROOKIE*      0.44  0.02
TEAM*       -1.19  0.48
AGE         -0.35  0.24
GP          -1.15  1.44
W           -1.16  0.90
MIN         -1.19 45.59
PTS          1.02  0.36
FGM          1.10  0.14
FGA          9.73  0.28
FTM          0.75  0.10
FTA          4.06  0.13
REB         18.17  0.28
AST          1.70  0.16
TOV          1.49  0.07

Boxplots – All Numeric

data_long <- train %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(data_long, aes(x = Variable, y = Value)) +
  geom_boxplot() +
  coord_flip() + 
  facet_wrap(~ Variable, scales = "free", ncol=3) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  labs(title = "Horizontal Boxplot for Each Numeric Variable")

Boxplots by Categorical

library(DataExplorer) # select cat
plot_boxplot(train, by="FORWARD")

Histograms

plot_histogram(train)

Scatterplots (depvar ~ all x)

Look at the shape of the relationship between the dependent variable and all of the continuous potential independent variables.

plot_scatterplot(test, by="GP")

Correlation Matrix

plot_correlation(train, type="continuous")

MODEL

Linear Regression Model

From the above EDA, we chose the following variables to start. Note that, given the shape of the relationship between Wage and Age, we entered Age as a quadratic.
From experience and prior research, it is common to specify a dependent variable that is a currency variable (e.g., sales, revenue, wages) in log form.

I use the following model:
\(log(GP) = MIN + sqrt(W) +FTA + FGA + FGM\)

Estimate Coefficients and show coefficients table

model <- lm(log(GP) ~ MIN + sqrt(W) +FTA + FGA + FGM, train )
model$coefficients

  (Intercept)           MIN       sqrt(W)           FTA           FGA 
 1.7144146272  0.0002348041  0.3752481578  0.0393318232 -0.0237239660 
          FGM 
 0.0151895529

summary(model)


Call:
lm(formula = log(GP) ~ MIN + sqrt(W) + FTA + FGA + FGM, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.99416 -0.23000  0.01698  0.30089  1.20114 

Coefficients:
               Estimate  Std. Error t value             Pr(>|t|)    
(Intercept)  1.71441463  0.12448105  13.772 < 0.0000000000000002 ***
MIN          0.00023480  0.00005277   4.449             0.000012 ***
sqrt(W)      0.37524816  0.02254041  16.648 < 0.0000000000000002 ***
FTA          0.03933182  0.01237430   3.179             0.001628 ** 
FGA         -0.02372397  0.00650097  -3.649             0.000308 ***
FGM          0.01518955  0.01421066   1.069             0.285945    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4617 on 313 degrees of freedom
Multiple R-squared:  0.7854,    Adjusted R-squared:  0.782 
F-statistic: 229.1 on 5 and 313 DF,  p-value: < 0.00000000000000022

model %>% as_flextable

	Estimate	Standard Error	t value	Pr(>\|t\|)
(Intercept)	1.714	0.124	13.772	0.0000	***
MIN	0.000	0.000	4.449	0.0000	***
sqrt(W)	0.375	0.023	16.648	0.0000	***
FTA	0.039	0.012	3.179	0.0016	**
FGA	-0.024	0.007	-3.649	0.0003	***
FGM	0.015	0.014	1.069	0.2859
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05

Residual standard error: 0.4617 on 313 degrees of freedom
Multiple R-squared: 0.7854, Adjusted R-squared: 0.782
F-statistic: 229.1 on 313 and 5 DF, p-value: 0.0000

#*** means significant, r squared % expalins

I got the R squared pretty high.

Coefficient Magnitude Plot

# note this simple code using coefplot package produces chart like Figure 6.6 in one short line. 
library(coefplot)
coefplot(model, soft="magnitude", intercept=FALSE)

Glad that I got them so close togeather.

Check for predictor independence

Using Variance Inflation Factors (VIF)

library(car)
vif(model)

     MIN  sqrt(W)      FTA      FGA      FGM 
2.754062 2.584590 1.271833 1.530154 1.858678

They are all below 5!

Residual Analysis

Residual Range

quantile(round(residuals(model, type = "deviance"),2))

   0%   25%   50%   75%  100% 
-1.99 -0.23  0.02  0.30  1.20

They are all in between -2 and 2.

Residual Plots

We are looking for: - Random distribution of residuals vs fitted values - Normally distributed residuals : Normal Q-Q plot with values along line - Homoskedasticity with a Scale-Location line that is horizontal and no residual pattern - Minimal influential obs - that is, those outside the borders of Cook’s distance

plot(model)

Plot Fitted Value by Actual Value

library(broom)
train2 <- augment(model, data=train)  # this appends predicted to original dataset to build plots on your own

p <- ggplot(train2, mapping = aes(y=.fitted, x=GP))
p  + geom_point()

Great, linear relationship, with the going up and to the top right corner.

Plot Residuals by Fitted Values

p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p  + geom_point()

Well this goes up, but then goes down, but it gets more clustered together near the end. But then again, it is residuals, so I don’t focus much on this one.

Performance Evaluation

Use Model to Score `test` dataset (Display First 10 values - depvar and fitted values only)

pred <- predict(model, newdata = test, interval="predict")  #score test dataset wtih model
test_w_predict <- cbind(test, pred)  # append score to original test dataset

test_w_predict %>%
  head(10) %>%
    select( GP, upr, fit, lwr) %>% 
  as_flextable(show_coltype = FALSE)

GP	upr	fit	lwr
5	3.0	2.0	1.1
64	4.5	3.6	2.7
21	3.7	2.8	1.8
14	3.8	2.9	1.9
43	4.3	3.4	2.5
64	4.3	3.4	2.5
12	3.8	2.9	2.0
51	4.8	3.9	3.0
17	3.5	2.6	1.7
73	5.1	4.2	3.3
n: 10

They are all very close together with upper, average, and lower.

Plot Actual vs Fitted (`test`)

plot(test_w_predict$GP, test_w_predict$fit, pch=16, cex=1)

Glad to see another linear relationship, with the upwards trajectory.

Performance Metrics

library(Metrics)
metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$GP, test_w_predict$fit),4),
                         round(rmse(test_w_predict$GP, test_w_predict$fit),4),
                         round(mape(test_w_predict$GP, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)

Metric	Value
MAE	45.6339
RMSE	52.3077
MAPE	0.8669

MAE- mean of the absolute error - 45.6

RMSE- Standard Deviation 52.3077

MAPE- Mean Absolute Percentage Error- 0.8669

Model Fit by Position

# more in depth plots of performance 
p <- ggplot(data = subset(test_w_predict, FORWARD %in% c("Yes", "No")),
            aes(x = GP,
                y = fit, ymin = lwr, ymax = upr,
                color = FORWARD,
                fill = FORWARD,
                group = FORWARD))
p +  geom_point(alpha = 0.5) + 
       geom_line() + geom_ribbon(alpha = 0.2, color = FALSE) +
       labs(title="Actual vs Fitted with Upper and Lower CI",
              subtitle="FORWARD POSITION",
              caption="NBA_viz") +
       xlab(NULL) + ylab("fitted") +
       theme(legend.position = "bottom")

q <- ggplot(data = subset(test_w_predict, GUARD %in% c("Yes", "No")),
            aes(x = GP,
                y = fit, ymin = lwr, ymax = upr,
                color = GUARD,
                fill = GUARD,
                group = GUARD))
q +  geom_point(alpha = 0.5) + 
       geom_line() + geom_ribbon(alpha = 0.2, color = FALSE) +
       labs(title="Actual vs Fitted with Upper and Lower CI",
              subtitle="GUARD POSITION",
              caption="NBA_viz") +
       xlab(NULL) + ylab("fitted") +
       theme(legend.position = "bottom")

s <- ggplot(data = subset(test_w_predict, CENTER %in% c("Yes", "No")),
            aes(x = GP,
                y = fit, ymin = lwr, ymax = upr,
                color = CENTER,
                fill = CENTER,
                group = CENTER))
s +  geom_point(alpha = 0.5) + 
       geom_line() + geom_ribbon(alpha = 0.2, color = FALSE) +
       labs(title="Actual vs Fitted with Upper and Lower CI",
              subtitle="CENTER POSITION",
              caption="NBA_viz") +
       xlab(NULL) + ylab("fitted") +
       theme(legend.position = "bottom")

I decided to do a graph for each of the positions of Forward, Center, and Guard. They are all the same graph except with a few differences, but now I know. To make sense of that, the spots are at the same place, but one graph connects them to “yes”, the other two to “no”. Altough, Center seems to have the less dots connected.

`ggplot2` explore (Intro Section)

This is just replication of the first section within Chapter 6 (as FYI)

pip <- lm(log(GP) ~ GP, data=train)
summary(pip)


Call:
lm(formula = log(GP) ~ GP, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.92668 -0.19969  0.04506  0.32651  0.46855 

Coefficients:
             Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 1.8919642  0.0512151   36.94 <0.0000000000000002 ***
GP          0.0347165  0.0009207   37.71 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4229 on 317 degrees of freedom
Multiple R-squared:  0.8177,    Adjusted R-squared:  0.8171 
F-statistic:  1422 on 1 and 317 DF,  p-value: < 0.00000000000000022

# knitr does not like autoplot
#autoplot(lm(log(Wage.n) ~ Potential, data=train), data=train, colour = 'Nationality_Continent')


library(ggplot2)
p <- ggplot(data=train, mapping=aes(y=AGE, x=GP))

p +  geom_point(alpha = 0.2) + 
       geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS"))

p + geom_point(alpha=0.1) +
    geom_smooth(color = "tomato", fill="tomato", method = MASS::rlm) +
    geom_smooth(color = "steelblue", fill="steelblue", method = "lm")

p + geom_point(alpha=0.1) +
    geom_smooth(color = "tomato", method = "lm", linewidth = 1.2, 
                formula = y ~ splines::bs(x, 3), se = FALSE)

p + geom_point(alpha=0.1) +
    geom_quantile(color = "tomato", size = 1.2, method = "rqss",
                  lambda = 1, quantiles = c(0.20, 0.5, 0.85))

I decided to use age(which goes from 20-40 ish). The charts mainly show very little relationship between Age and GP(mostly straight across), which is what the correlation matrix said also. But this is surprising because I would think Age would have a relationship with GP.

SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL

None of the categorials were good in the model, so I did not add them. I also saw that sqrt on W would make the R squared better from 60’s to 70’s, so I included that. I went through many others like FTM, but I decided to include the 5 that were the most significant. The most significant are MIN and W, which I saw on the correlation matrix. All the ones I chose for my equation have a p value less then 1, so most are significant. The rest all made the R squared go up more. It went to 0.782, which is 78.2% of the data.

The first few charts showed a nice linear relationship, and the VIF were all below the number 5. This all leads me to think my equation is good. However, the MAE, RMSE,and MAPE seem to all be very high with 50, 40, and 80%. This puts a pep in my step to think maybe there is a problem.

I also checked the 3 positions with Forward, Guard, or Center, and the charts showed if each was a yes or a no for the three. They looked very much the same, but the dots were all different because one would be yes, and the other two would be no. But the range for them all, was the same.

If I were to do anything different then I would try the sqrt on some more variables or try some other variables such as “AST” or “TOV”.