options(scipen=999)
library(socviz) # use for the round_df() function
# load("FIFA_viz.RData")
#load("Catholic_viz.RData")
#load("NBA_viz.RData")
load("Murders96_viz.RData")DAT-4313 - Data Viz in Model Development
FIFA Example – This is code example – not full report
FURTHER INSTRUCTIONS
Use the dataset assigned to you in the LAB Instructions.
Use the depvar indicated for your dataset.
Select 5-8 independent variables from your dataset as predictors, and estimate a linear regression model.
Note that not all variables in your dataset make sense as predictors. For example, categorical variables with many levels or all components of a derived variable are unreasonable.
Work through the EDA to identified good candidates. Your model should be linear in the parameters. Consider alternative functional forms for your depvar or continuous variables.
Include interpretations and comments after each chunk.
Conclude with a final assessment of the model and a reflection of the value of data visualization to aid model development.
Example Code for FIFA with depvar=Wage
DATA
Partition 60% Train / 40% Test
library(caret)
set.seed(1)
index <- createDataPartition(Murders96_viz$arrests, p=0.6, list=FALSE)
train <- Murders96_viz[index,]
test <- Murders96_viz[-index,]
# unique(train$region)Training set number of observations: 1320
Test dataset number of observations: 877
EDA
Descriptive Statistics
library(psych)
library(tidyverse)
library(flextable)
psych::describe(train, fast=TRUE) # (class note --> remove flextable) vars n mean sd min max range se
arrests 1 1320 5.21 18.76 0.00 320.00 320.00 0.52
countyid 2 1320 32550.57 15488.80 1001.00 56043.00 55042.00 426.32
density 3 1320 265.34 1817.83 0.08 54058.77 54058.69 50.03
popul 4 1320 94031.70 223537.24 141.00 3126966.00 3126825.00 6152.66
perc1019 5 1320 15.11 1.88 7.60 24.54 16.94 0.05
perc2029 6 1320 12.05 3.19 5.62 34.75 29.14 0.09
percblack 7 1320 8.34 13.93 0.00 86.28 86.28 0.38
percmale 8 1320 49.33 1.63 44.83 61.64 16.82 0.04
rpcincmaint 9 1320 207.78 116.83 19.14 1281.10 1261.96 3.22
rpcpersinc 10 1320 12569.38 3031.90 5322.20 41094.22 35772.02 83.45
rpcunemins 11 1320 52.90 34.85 0.00 287.10 287.10 0.96
year 12 1320 1996.00 0.00 1996.00 1996.00 0.00 0.00
murders 13 1320 5.88 26.27 0.00 420.00 420.00 0.72
murdrate 14 1320 0.42 0.77 0.00 11.01 11.01 0.02
arrestrate 15 1320 0.48 0.94 0.00 13.99 13.99 0.03
statefips 16 1320 32.45 15.46 1.00 56.00 55.00 0.43
countyfips 17 1320 99.81 109.51 1.00 840.00 839.00 3.01
execs 18 1320 0.01 0.14 0.00 3.00 3.00 0.00
lpopul 19 1320 10.40 1.35 4.95 14.96 10.01 0.04
execrate 20 1320 0.00 0.04 0.00 1.18 1.18 0.00
statename 21 1320 NaN NA Inf -Inf -Inf NA
statecode 22 1320 NaN NA Inf -Inf -Inf NA
region 23 1320 NaN NA Inf -Inf -Inf NA
Boxplots – All Numeric
data_long <- train %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(data_long, aes(x = Variable, y = Value)) +
geom_boxplot() +
coord_flip() +
facet_wrap(~ Variable, scales = "free", ncol=3) +
theme_minimal() +
theme(axis.text.x = element_blank()) +
labs(title = "Horizontal Boxplot for Each Numeric Variable")This dataset appears to contain many outliers in all the variables except countyid, statefips, and year. The large presence of these outliers indicates they should be further researched before determining how to appropriately deal with them.
library(DataExplorer)
plot_boxplot(train, by="region") Many of these boxplots indicate they are dependent on the region. ### Histograms
plot_histogram(train)Many of these appear to either have an unknown polynomial degree or a second degree polynomial. I am uncertain because of the shape of the histograms with values clumped towards zero since they are probably misrepresented due to outliers.
Scatterplots (depvar ~ all x)
Look at the shape of the relationship between the dependent variable and all of the continuous potential independent variables.
plot_scatterplot(test, by="arrests")There are many outliers so it is difficult to notice a pattern. However, popul and lpopul show an exponential increase in their outliers. ### Correlation Matrix
plot_correlation(train, type="continuous")The density, popul, murders, and lpopul appear to be highly correlated with arrests. Since the naming indicates popul and lpopul are different measures of the same thing I chose to use popul since its correlation is slightly stronger.
MODEL
Linear Regression Model
From the above EDA, we chose the following variables to start. Note that, given the shape of the relationship between Wage and Age, we entered Age as a quadratic.
From experience and prior research, it is common to specify a dependent variable that is a currency variable (e.g., sales, revenue, wages) in log form.
Estimate the following model:
\(arrests = popul + density + murders\)
Estimate Coefficients and show coefficients table
model <- lm(arrests ~ popul + density + murders + (popul^2), data=train)
model %>% as_flextableEstimate | Standard Error | t value | Pr(>|t|) | ||
|---|---|---|---|---|---|
(Intercept) | 0.333 | 0.270 | 1.233 | 0.2180 |
|
popul | 0.000 | 0.000 | 13.231 | 0.0000 | *** |
density | 0.002 | 0.000 | 10.510 | 0.0000 | *** |
murders | 0.434 | 0.013 | 32.184 | 0.0000 | *** |
Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05 | |||||
Residual standard error: 9.012 on 1316 degrees of freedom | |||||
Multiple R-squared: 0.7697, Adjusted R-squared: 0.7692 | |||||
F-statistic: 1466 on 1316 and 3 DF, p-value: 0.0000 | |||||
All the variables are statistically significant and 76.92% of the variance is accounted for according to the Adjusted R-squared.
Coefficient Magnitude Plot
# note this simple code using coefplot package produces chart like Figure 6.6 in one short line.
library(coefplot)
coefplot(model, soft="magnitude", intercept=FALSE)The murders variable appears to have the largest impact on the slope of the models line with the other variables further fine-tuning it. ### Check for predictor independence
Using Variance Inflation Factors (VIF)
library(car)
vif(model) popul density murders
1.887517 1.296957 2.033553
There does not appear to be any multicollinearity since the values are all well beneath five. ### Residual Analysis
Residual Range
quantile(round(residuals(model, type = "deviance"),2)) 0% 25% 50% 75% 100%
-197.8600 -1.1925 -0.5700 0.1625 147.9700
Residual Plots
We are looking for: - Random distribution of residuals vs fitted values - Normally distributed residuals : Normal Q-Q plot with values along line - Homoskedasticity with a Scale-Location line that is horizontal and no residual pattern - Minimal influential obs - that is, those outside the borders of Cook’s distance
plot(model)The outliers appear to significantly impact the model’s performance. Plot Fitted Value by Actual Value
library(broom)
train2 <- augment(model, data=train) # this appends predicted to original dataset to build plots on your own
p <- ggplot(train2, mapping = aes(y=.fitted, x=arrests))
p + geom_point()Plot Residuals by Fitted Values
p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p + geom_point()Performance Evaluation
Use Model to Score test dataset (Display First 10 values - depvar and fitted values only)
pred <- predict(model, newdata = test, interval="predict") #score test dataset wtih model
test_w_predict <- cbind(test, pred) # append score to original test dataset
test_w_predict %>%
head(10) %>%
select(region, arrests, fit, lwr,upr) %>%
as_flextable(show_coltype = FALSE) region | arrests | fit | lwr | upr |
|---|---|---|---|---|
South | 6 | 5.5 | -12.1 | 23.2 |
South | 1 | 1.3 | -16.3 | 19.0 |
South | 0 | 2.2 | -15.5 | 19.9 |
South | 20 | 9.0 | -8.7 | 26.7 |
South | 2 | 0.8 | -16.9 | 18.5 |
South | 0 | 1.1 | -16.6 | 18.8 |
South | 2 | 0.6 | -17.0 | 18.3 |
South | 8 | 3.9 | -13.8 | 21.6 |
South | 1 | 0.6 | -17.0 | 18.3 |
South | 5 | 2.0 | -15.7 | 19.7 |
n: 10 | ||||
Plot Actual vs Fitted (test)
plot(test_w_predict$arrests, test_w_predict$fit, pch=16, cex=1)Performance Metrics
library(Metrics)
metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$arrests, test_w_predict$fit),4),
round(rmse(test_w_predict$arrests, test_w_predict$fit),4),
round(mape(test_w_predict$arrests, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)Metric | Value |
|---|---|
MAE | 3.7458 |
RMSE | 30.1103 |
MAPE | Inf |
Model Fit by Region
# more in depth plots of performance
p <- ggplot(data = subset(test_w_predict, region %in% c("South", "West", "Northeast", "Midwest")),
aes(x = arrests,
y = fit, ymin = lwr, ymax = upr,
color = region,
fill = region,
group = region))
p + geom_point(alpha = 0.5) +
geom_line() + geom_ribbon(alpha = 0.2, color = FALSE) +
labs(title="Actual vs Fitted with Upper and Lower CI",
subtitle="Certain regions had larger outliers compared to the rest.",
caption="FIFA{datasetsICR}") +
xlab(NULL) + ylab("fitted") +
theme(legend.position = "bottom")SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL
- The independent variables popul, density, and murders were selected based on their high correlation with the dependent variable, arrests. This was observed in the correlation matrix.
- Each of these variables had outliers as shown in the scatter plots. The popul scatterplot indicated it was not linear with what appeared to be an exponential trend.
- The histograms for these variables did not appear very helpful since most of the values were clumped at zero.
- After training the model the residual plots did surprisingly well considering the outliers in the data were not addressed. The Residuals vs Leverage plot indicated the model was affected by nonlinear data. Based on the similarity between the variables popul and lpopul I decided to treat popul as a second degree polynomial in an attempt to address this since this pattern appeared in the lpopul boxplot.
- The model performed decently with an Adjusted R-squared of 76.92%. The combination of the coefficients determines the slope of the line used for prediction.
- Future improvement model, specifically by cleaning the data would be beneficial as the models RMSE was 30.11. Perhaps the model could be refined by creating a seperate model for each region since there appears to be a relationship between the outliers of different regions regarding arrests.