DAT-4313 - Data Viz in Model Development

FIFA Example – This is code example – not full report

Author

Dr. V

FURTHER INSTRUCTIONS

Use the dataset assigned to you in the LAB Instructions.

Use the depvar indicated for your dataset.

Select 5-8 independent variables from your dataset as predictors, and estimate a linear regression model.

Note that not all variables in your dataset make sense as predictors. For example, categorical variables with many levels or all components of a derived variable are unreasonable.
Work through the EDA to identified good candidates. Your model should be linear in the parameters. Consider alternative functional forms for your depvar or continuous variables.

Include interpretations and comments after each chunk.

Conclude with a final assessment of the model and a reflection of the value of data visualization to aid model development.

Example Code for FIFA with depvar=Wage

DATA

options(scipen=999)

library(socviz) # use for the round_df() function

# load("FIFA_viz.RData")
#load("Catholic_viz.RData")
#load("NBA_viz.RData")
load("Murders96_viz.RData")

Partition 60% Train / 40% Test

library(caret)
set.seed(1)
index <- createDataPartition(Murders96_viz$arrests, p=0.6, list=FALSE)
train <- Murders96_viz[index,]
test  <- Murders96_viz[-index,]
# unique(train$region)

Training set number of observations: 1320
Test dataset number of observations: 877

EDA

Descriptive Statistics

library(psych)
library(tidyverse)
library(flextable)
psych::describe(train, fast=TRUE)  # (class note --> remove flextable)

            vars    n     mean        sd     min        max      range      se
arrests        1 1320     5.21     18.76    0.00     320.00     320.00    0.52
countyid       2 1320 32550.57  15488.80 1001.00   56043.00   55042.00  426.32
density        3 1320   265.34   1817.83    0.08   54058.77   54058.69   50.03
popul          4 1320 94031.70 223537.24  141.00 3126966.00 3126825.00 6152.66
perc1019       5 1320    15.11      1.88    7.60      24.54      16.94    0.05
perc2029       6 1320    12.05      3.19    5.62      34.75      29.14    0.09
percblack      7 1320     8.34     13.93    0.00      86.28      86.28    0.38
percmale       8 1320    49.33      1.63   44.83      61.64      16.82    0.04
rpcincmaint    9 1320   207.78    116.83   19.14    1281.10    1261.96    3.22
rpcpersinc    10 1320 12569.38   3031.90 5322.20   41094.22   35772.02   83.45
rpcunemins    11 1320    52.90     34.85    0.00     287.10     287.10    0.96
year          12 1320  1996.00      0.00 1996.00    1996.00       0.00    0.00
murders       13 1320     5.88     26.27    0.00     420.00     420.00    0.72
murdrate      14 1320     0.42      0.77    0.00      11.01      11.01    0.02
arrestrate    15 1320     0.48      0.94    0.00      13.99      13.99    0.03
statefips     16 1320    32.45     15.46    1.00      56.00      55.00    0.43
countyfips    17 1320    99.81    109.51    1.00     840.00     839.00    3.01
execs         18 1320     0.01      0.14    0.00       3.00       3.00    0.00
lpopul        19 1320    10.40      1.35    4.95      14.96      10.01    0.04
execrate      20 1320     0.00      0.04    0.00       1.18       1.18    0.00
statename     21 1320      NaN        NA     Inf       -Inf       -Inf      NA
statecode     22 1320      NaN        NA     Inf       -Inf       -Inf      NA
region        23 1320      NaN        NA     Inf       -Inf       -Inf      NA

Boxplots – All Numeric

data_long <- train %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(data_long, aes(x = Variable, y = Value)) +
  geom_boxplot() +
  coord_flip() + 
  facet_wrap(~ Variable, scales = "free", ncol=3) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  labs(title = "Horizontal Boxplot for Each Numeric Variable")

This dataset appears to contain many outliers in all the variables except countyid, statefips, and year. The large presence of these outliers indicates they should be further researched before determining how to appropriately deal with them.

library(DataExplorer)
plot_boxplot(train, by="region")

Many of these boxplots indicate they are dependent on the region. ### Histograms

plot_histogram(train)

Many of these appear to either have an unknown polynomial degree or a second degree polynomial. I am uncertain because of the shape of the histograms with values clumped towards zero since they are probably misrepresented due to outliers.

Scatterplots (depvar ~ all x)

Look at the shape of the relationship between the dependent variable and all of the continuous potential independent variables.

plot_scatterplot(test, by="arrests")

There are many outliers so it is difficult to notice a pattern. However, popul and lpopul show an exponential increase in their outliers. ### Correlation Matrix

plot_correlation(train, type="continuous")

The density, popul, murders, and lpopul appear to be highly correlated with arrests. Since the naming indicates popul and lpopul are different measures of the same thing I chose to use popul since its correlation is slightly stronger.

MODEL

Linear Regression Model

From the above EDA, we chose the following variables to start. Note that, given the shape of the relationship between Wage and Age, we entered Age as a quadratic.
From experience and prior research, it is common to specify a dependent variable that is a currency variable (e.g., sales, revenue, wages) in log form.

Estimate the following model:
\(arrests = popul + density + murders\)

Estimate Coefficients and show coefficients table

model <- lm(arrests ~ popul + density + murders + (popul^2), data=train)
model %>% as_flextable

	Estimate	Standard Error	t value	Pr(>\|t\|)
(Intercept)	0.333	0.270	1.233	0.2180
popul	0.000	0.000	13.231	0.0000	***
density	0.002	0.000	10.510	0.0000	***
murders	0.434	0.013	32.184	0.0000	***
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05

Residual standard error: 9.012 on 1316 degrees of freedom
Multiple R-squared: 0.7697, Adjusted R-squared: 0.7692
F-statistic: 1466 on 1316 and 3 DF, p-value: 0.0000

All the variables are statistically significant and 76.92% of the variance is accounted for according to the Adjusted R-squared.

Coefficient Magnitude Plot

# note this simple code using coefplot package produces chart like Figure 6.6 in one short line. 
library(coefplot)
coefplot(model, soft="magnitude", intercept=FALSE)

The murders variable appears to have the largest impact on the slope of the models line with the other variables further fine-tuning it. ### Check for predictor independence

Using Variance Inflation Factors (VIF)

library(car)
vif(model)

   popul  density  murders 
1.887517 1.296957 2.033553

There does not appear to be any multicollinearity since the values are all well beneath five. ### Residual Analysis

Residual Range

quantile(round(residuals(model, type = "deviance"),2))

       0%       25%       50%       75%      100% 
-197.8600   -1.1925   -0.5700    0.1625  147.9700

Residual Plots

We are looking for: - Random distribution of residuals vs fitted values - Normally distributed residuals : Normal Q-Q plot with values along line - Homoskedasticity with a Scale-Location line that is horizontal and no residual pattern - Minimal influential obs - that is, those outside the borders of Cook’s distance

plot(model)

The outliers appear to significantly impact the model’s performance. Plot Fitted Value by Actual Value

library(broom)
train2 <- augment(model, data=train)  # this appends predicted to original dataset to build plots on your own

p <- ggplot(train2, mapping = aes(y=.fitted, x=arrests))
p  + geom_point()

Plot Residuals by Fitted Values

p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p  + geom_point()

Performance Evaluation

Use Model to Score `test` dataset (Display First 10 values - depvar and fitted values only)

pred <- predict(model, newdata = test, interval="predict")  #score test dataset wtih model
test_w_predict <- cbind(test, pred)  # append score to original test dataset

test_w_predict %>%
  head(10) %>%
    select(region, arrests, fit, lwr,upr) %>% 
  as_flextable(show_coltype = FALSE)

region	arrests	fit	lwr	upr
South	6	5.5	-12.1	23.2
South	1	1.3	-16.3	19.0
South	0	2.2	-15.5	19.9
South	20	9.0	-8.7	26.7
South	2	0.8	-16.9	18.5
South	0	1.1	-16.6	18.8
South	2	0.6	-17.0	18.3
South	8	3.9	-13.8	21.6
South	1	0.6	-17.0	18.3
South	5	2.0	-15.7	19.7
n: 10

Plot Actual vs Fitted (`test`)

plot(test_w_predict$arrests, test_w_predict$fit, pch=16, cex=1)

Performance Metrics

library(Metrics)
metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$arrests, test_w_predict$fit),4),
                         round(rmse(test_w_predict$arrests, test_w_predict$fit),4),
                         round(mape(test_w_predict$arrests, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)

Metric	Value
MAE	3.7458
RMSE	30.1103
MAPE	Inf

Model Fit by Region

# more in depth plots of performance 
p <- ggplot(data = subset(test_w_predict, region %in% c("South", "West", "Northeast", "Midwest")),
            aes(x = arrests,
                y = fit, ymin = lwr, ymax = upr,
                color = region,
                fill = region,
                group = region))
p +  geom_point(alpha = 0.5) + 
       geom_line() + geom_ribbon(alpha = 0.2, color = FALSE) +
       labs(title="Actual vs Fitted with Upper and Lower CI",
              subtitle="Certain regions had larger outliers compared to the rest.",
              caption="FIFA{datasetsICR}") +
       xlab(NULL) + ylab("fitted") +
       theme(legend.position = "bottom")

SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL

The independent variables popul, density, and murders were selected based on their high correlation with the dependent variable, arrests. This was observed in the correlation matrix.
Each of these variables had outliers as shown in the scatter plots. The popul scatterplot indicated it was not linear with what appeared to be an exponential trend.
The histograms for these variables did not appear very helpful since most of the values were clumped at zero.
After training the model the residual plots did surprisingly well considering the outliers in the data were not addressed. The Residuals vs Leverage plot indicated the model was affected by nonlinear data. Based on the similarity between the variables popul and lpopul I decided to treat popul as a second degree polynomial in an attempt to address this since this pattern appeared in the lpopul boxplot.
The model performed decently with an Adjusted R-squared of 76.92%. The combination of the coefficients determines the slope of the line used for prediction.
Future improvement model, specifically by cleaning the data would be beneficial as the models RMSE was 30.11. Perhaps the model could be refined by creating a seperate model for each region since there appears to be a relationship between the outliers of different regions regarding arrests.