options(scipen=999)
library(socviz)
load("FIFA_viz.RData")
#load("Catholic_viz.RData")
load("NBA_viz.RData")
#load("Murders96_viz.RData")
# NBA_vizDAT-4313 - Data Viz in Model Development
NBA viz Group
FURTHER INSTRUCTIONS
Use the dataset assigned to you in the LAB Instructions.
Use the depvar indicated for your dataset.
Select at least 5 independent variables from your dataset as predictors, and estimate a linear regression model.
Note that not all variables in your dataset make sense as predictors. For example, categorical variables with many levels or all components of a derived variable are unreasonable.
Work through the EDA to identified good candidates. Your model should be linear in the parameters. Consider alternative functional forms for your depvar or continuous variables.
Include interpretations and comments after each chunk.
Conclude with a final assessment of the model and a reflection of the value of data visualization to aid model development.
Example Code for FIFA with depvar=Wage
DATA
Partition 60% Train / 40% Test
library(caret)
set.seed(1)
index <- createDataPartition(NBA_viz$GP, p=0.6, list=FALSE) #dependent var, 60%
train <- NBA_viz[index,]
test <- NBA_viz[-index,]Training set number of observations: 319
Test dataset number of observations: 211
EDA
Descriptive Statistics
library(psych)
library(tidyverse)
library(flextable)
describe(train, fast=FALSE) # descriptive statistics vars n mean sd median trimmed mad min max range skew
PLAYER* 1 319 274.47 155.16 276.0 275.49 207.56 1 529.0 528.0 -0.04
FORWARD* 2 319 1.48 0.50 1.0 1.47 0.00 1 2.0 1.0 0.08
CENTER* 3 319 1.18 0.39 1.0 1.11 0.00 1 2.0 1.0 1.62
GUARD* 4 319 1.50 0.50 2.0 1.50 0.00 1 2.0 1.0 -0.01
ROOKIE* 5 319 1.19 0.39 1.0 1.12 0.00 1 2.0 1.0 1.56
TEAM* 6 319 15.79 8.65 16.0 15.85 11.86 1 30.0 29.0 -0.02
AGE 7 319 26.16 4.27 25.0 25.87 4.45 19 39.0 20.0 0.55
GP 8 319 49.33 25.76 55.0 50.98 29.65 1 82.0 81.0 -0.46
W 9 319 25.46 16.08 24.0 25.18 20.76 0 60.0 60.0 0.14
MIN 10 319 1098.37 814.22 1063.0 1059.37 1082.30 1 2848.0 2847.0 0.26
PTS 11 319 19.39 6.42 19.1 19.36 5.78 0 42.8 42.8 -0.02
FGM 12 319 7.27 2.48 7.1 7.27 2.22 0 17.1 17.1 -0.06
FGA 13 319 16.54 4.93 16.0 16.30 4.00 0 52.7 52.7 1.49
FTM 14 319 2.82 1.74 2.7 2.71 1.63 0 9.1 9.1 0.74
FTA 15 319 3.85 2.36 3.6 3.69 2.08 0 18.2 18.2 1.20
REB 16 319 8.81 4.94 7.7 8.34 3.71 0 52.7 52.7 2.60
AST 17 319 4.24 2.84 3.5 3.91 2.08 0 17.1 17.1 1.24
TOV 18 319 2.43 1.19 2.2 2.37 1.04 0 7.2 7.2 0.68
kurtosis se
PLAYER* -1.29 8.69
FORWARD* -2.00 0.03
CENTER* 0.61 0.02
GUARD* -2.01 0.03
ROOKIE* 0.44 0.02
TEAM* -1.19 0.48
AGE -0.35 0.24
GP -1.15 1.44
W -1.16 0.90
MIN -1.19 45.59
PTS 1.02 0.36
FGM 1.10 0.14
FGA 9.73 0.28
FTM 0.75 0.10
FTA 4.06 0.13
REB 18.17 0.28
AST 1.70 0.16
TOV 1.49 0.07
Boxplots – All Numeric
data_long <- train %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(data_long, aes(x = Variable, y = Value)) +
geom_boxplot() +
coord_flip() +
facet_wrap(~ Variable, scales = "free", ncol=3) +
theme_minimal() +
theme(axis.text.x = element_blank()) +
labs(title = "Horizontal Boxplot for Each Numeric Variable")Boxplots by Categorical
library(DataExplorer) # select cat
plot_boxplot(train, by="FORWARD") Histograms
plot_histogram(train)Scatterplots (depvar ~ all x)
Look at the shape of the relationship between the dependent variable and all of the continuous potential independent variables.
plot_scatterplot(test, by="GP")Correlation Matrix
plot_correlation(train, type="continuous")MODEL
Linear Regression Model
From the above EDA, we chose the following variables to start. Note that, given the shape of the relationship between Wage and Age, we entered Age as a quadratic.
From experience and prior research, it is common to specify a dependent variable that is a currency variable (e.g., sales, revenue, wages) in log form.
I use the following model:
\(log(GP) = MIN + sqrt(W) +FTA + FGA + FGM\)
Estimate Coefficients and show coefficients table
model <- lm(log(GP) ~ MIN + sqrt(W) +FTA + FGA + FGM, train )
model$coefficients (Intercept) MIN sqrt(W) FTA FGA
1.7144146272 0.0002348041 0.3752481578 0.0393318232 -0.0237239660
FGM
0.0151895529
summary(model)
Call:
lm(formula = log(GP) ~ MIN + sqrt(W) + FTA + FGA + FGM, data = train)
Residuals:
Min 1Q Median 3Q Max
-1.99416 -0.23000 0.01698 0.30089 1.20114
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.71441463 0.12448105 13.772 < 0.0000000000000002 ***
MIN 0.00023480 0.00005277 4.449 0.000012 ***
sqrt(W) 0.37524816 0.02254041 16.648 < 0.0000000000000002 ***
FTA 0.03933182 0.01237430 3.179 0.001628 **
FGA -0.02372397 0.00650097 -3.649 0.000308 ***
FGM 0.01518955 0.01421066 1.069 0.285945
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4617 on 313 degrees of freedom
Multiple R-squared: 0.7854, Adjusted R-squared: 0.782
F-statistic: 229.1 on 5 and 313 DF, p-value: < 0.00000000000000022
model %>% as_flextableEstimate | Standard Error | t value | Pr(>|t|) | ||
|---|---|---|---|---|---|
(Intercept) | 1.714 | 0.124 | 13.772 | 0.0000 | *** |
MIN | 0.000 | 0.000 | 4.449 | 0.0000 | *** |
sqrt(W) | 0.375 | 0.023 | 16.648 | 0.0000 | *** |
FTA | 0.039 | 0.012 | 3.179 | 0.0016 | ** |
FGA | -0.024 | 0.007 | -3.649 | 0.0003 | *** |
FGM | 0.015 | 0.014 | 1.069 | 0.2859 |
|
Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05 | |||||
Residual standard error: 0.4617 on 313 degrees of freedom | |||||
Multiple R-squared: 0.7854, Adjusted R-squared: 0.782 | |||||
F-statistic: 229.1 on 313 and 5 DF, p-value: 0.0000 | |||||
#*** means significant, r squared % expalinsI got the R squared pretty high.
Coefficient Magnitude Plot
# note this simple code using coefplot package produces chart like Figure 6.6 in one short line.
library(coefplot)
coefplot(model, soft="magnitude", intercept=FALSE)Glad that I got them so close togeather.
Check for predictor independence
Using Variance Inflation Factors (VIF)
library(car)
vif(model) MIN sqrt(W) FTA FGA FGM
2.754062 2.584590 1.271833 1.530154 1.858678
They are all below 5!
Residual Analysis
Residual Range
quantile(round(residuals(model, type = "deviance"),2)) 0% 25% 50% 75% 100%
-1.99 -0.23 0.02 0.30 1.20
They are all in between -2 and 2.
Residual Plots
We are looking for: - Random distribution of residuals vs fitted values - Normally distributed residuals : Normal Q-Q plot with values along line - Homoskedasticity with a Scale-Location line that is horizontal and no residual pattern - Minimal influential obs - that is, those outside the borders of Cook’s distance
plot(model)Plot Fitted Value by Actual Value
library(broom)
train2 <- augment(model, data=train) # this appends predicted to original dataset to build plots on your own
p <- ggplot(train2, mapping = aes(y=.fitted, x=GP))
p + geom_point()Great, linear relationship, with the going up and to the top right corner.
Plot Residuals by Fitted Values
p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p + geom_point()Well this goes up, but then goes down, but it gets more clustered together near the end. But then again, it is residuals, so I don’t focus much on this one.
Performance Evaluation
Use Model to Score test dataset (Display First 10 values - depvar and fitted values only)
pred <- predict(model, newdata = test, interval="predict") #score test dataset wtih model
test_w_predict <- cbind(test, pred) # append score to original test dataset
test_w_predict %>%
head(10) %>%
select( GP, upr, fit, lwr) %>%
as_flextable(show_coltype = FALSE) GP | upr | fit | lwr |
|---|---|---|---|
5 | 3.0 | 2.0 | 1.1 |
64 | 4.5 | 3.6 | 2.7 |
21 | 3.7 | 2.8 | 1.8 |
14 | 3.8 | 2.9 | 1.9 |
43 | 4.3 | 3.4 | 2.5 |
64 | 4.3 | 3.4 | 2.5 |
12 | 3.8 | 2.9 | 2.0 |
51 | 4.8 | 3.9 | 3.0 |
17 | 3.5 | 2.6 | 1.7 |
73 | 5.1 | 4.2 | 3.3 |
n: 10 | |||
They are all very close together with upper, average, and lower.
Plot Actual vs Fitted (test)
plot(test_w_predict$GP, test_w_predict$fit, pch=16, cex=1)Glad to see another linear relationship, with the upwards trajectory.
Performance Metrics
library(Metrics)
metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$GP, test_w_predict$fit),4),
round(rmse(test_w_predict$GP, test_w_predict$fit),4),
round(mape(test_w_predict$GP, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)Metric | Value |
|---|---|
MAE | 45.6339 |
RMSE | 52.3077 |
MAPE | 0.8669 |
MAE- mean of the absolute error - 45.6
RMSE- Standard Deviation 52.3077
MAPE- Mean Absolute Percentage Error- 0.8669
Model Fit by Position
# more in depth plots of performance
p <- ggplot(data = subset(test_w_predict, FORWARD %in% c("Yes", "No")),
aes(x = GP,
y = fit, ymin = lwr, ymax = upr,
color = FORWARD,
fill = FORWARD,
group = FORWARD))
p + geom_point(alpha = 0.5) +
geom_line() + geom_ribbon(alpha = 0.2, color = FALSE) +
labs(title="Actual vs Fitted with Upper and Lower CI",
subtitle="FORWARD POSITION",
caption="NBA_viz") +
xlab(NULL) + ylab("fitted") +
theme(legend.position = "bottom")q <- ggplot(data = subset(test_w_predict, GUARD %in% c("Yes", "No")),
aes(x = GP,
y = fit, ymin = lwr, ymax = upr,
color = GUARD,
fill = GUARD,
group = GUARD))
q + geom_point(alpha = 0.5) +
geom_line() + geom_ribbon(alpha = 0.2, color = FALSE) +
labs(title="Actual vs Fitted with Upper and Lower CI",
subtitle="GUARD POSITION",
caption="NBA_viz") +
xlab(NULL) + ylab("fitted") +
theme(legend.position = "bottom")s <- ggplot(data = subset(test_w_predict, CENTER %in% c("Yes", "No")),
aes(x = GP,
y = fit, ymin = lwr, ymax = upr,
color = CENTER,
fill = CENTER,
group = CENTER))
s + geom_point(alpha = 0.5) +
geom_line() + geom_ribbon(alpha = 0.2, color = FALSE) +
labs(title="Actual vs Fitted with Upper and Lower CI",
subtitle="CENTER POSITION",
caption="NBA_viz") +
xlab(NULL) + ylab("fitted") +
theme(legend.position = "bottom")I decided to do a graph for each of the positions of Forward, Center, and Guard. They are all the same graph except with a few differences, but now I know. To make sense of that, the spots are at the same place, but one graph connects them to “yes”, the other two to “no”. Altough, Center seems to have the less dots connected.
ggplot2 explore (Intro Section)
This is just replication of the first section within Chapter 6 (as FYI)
pip <- lm(log(GP) ~ GP, data=train)
summary(pip)
Call:
lm(formula = log(GP) ~ GP, data = train)
Residuals:
Min 1Q Median 3Q Max
-1.92668 -0.19969 0.04506 0.32651 0.46855
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.8919642 0.0512151 36.94 <0.0000000000000002 ***
GP 0.0347165 0.0009207 37.71 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4229 on 317 degrees of freedom
Multiple R-squared: 0.8177, Adjusted R-squared: 0.8171
F-statistic: 1422 on 1 and 317 DF, p-value: < 0.00000000000000022
# knitr does not like autoplot
#autoplot(lm(log(Wage.n) ~ Potential, data=train), data=train, colour = 'Nationality_Continent')
library(ggplot2)
p <- ggplot(data=train, mapping=aes(y=AGE, x=GP))
p + geom_point(alpha = 0.2) +
geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS"))p + geom_point(alpha=0.1) +
geom_smooth(color = "tomato", fill="tomato", method = MASS::rlm) +
geom_smooth(color = "steelblue", fill="steelblue", method = "lm")p + geom_point(alpha=0.1) +
geom_smooth(color = "tomato", method = "lm", linewidth = 1.2,
formula = y ~ splines::bs(x, 3), se = FALSE)p + geom_point(alpha=0.1) +
geom_quantile(color = "tomato", size = 1.2, method = "rqss",
lambda = 1, quantiles = c(0.20, 0.5, 0.85))I decided to use age(which goes from 20-40 ish). The charts mainly show very little relationship between Age and GP(mostly straight across), which is what the correlation matrix said also. But this is surprising because I would think Age would have a relationship with GP.
SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL
None of the categorials were good in the model, so I did not add them. I also saw that sqrt on W would make the R squared better from 60’s to 70’s, so I included that. I went through many others like FTM, but I decided to include the 5 that were the most significant. The most significant are MIN and W, which I saw on the correlation matrix. All the ones I chose for my equation have a p value less then 1, so most are significant. The rest all made the R squared go up more. It went to 0.782, which is 78.2% of the data.
The first few charts showed a nice linear relationship, and the VIF were all below the number 5. This all leads me to think my equation is good. However, the MAE, RMSE,and MAPE seem to all be very high with 50, 40, and 80%. This puts a pep in my step to think maybe there is a problem.
I also checked the 3 positions with Forward, Guard, or Center, and the charts showed if each was a yes or a no for the three. They looked very much the same, but the dots were all different because one would be yes, and the other two would be no. But the range for them all, was the same.
If I were to do anything different then I would try the sqrt on some more variables or try some other variables such as “AST” or “TOV”.