options(scipen=999)
library(socviz) # use for the round_df() function
load("Catholic_viz.RData")
Catholic_viz$female <- as.factor(Catholic_viz$female)Lab 6
Make a Model - Catholic
DATA
Partition 60% Train / 40% Test
library(caret)
set.seed(1)
index <- createDataPartition(Catholic_viz$read12, p=0.6, list=FALSE)
train <- Catholic_viz[index,]
test <- Catholic_viz[-index,]Training set number of observations: 3583
Test dataset number of observations: 2387
EDA
Descriptive Statistics
library(psych)
library(tidyverse)
library(flextable)
psych::describe(train, fast=TRUE) vars n mean sd min max range
id 1 3583 4642399.08 2719725.86 124915.00 7979086.00 7854171.00
read12 2 3583 51.56 9.39 29.15 68.09 38.94
math12 3 3583 52.00 9.59 30.14 71.37 41.23
female 4 3583 NaN NA Inf -Inf -Inf
asian 5 3583 0.06 0.25 0.00 1.00 1.00
hispan 6 3583 0.11 0.31 0.00 1.00 1.00
black 7 3583 0.08 0.26 0.00 1.00 1.00
motheduc 8 3583 13.24 2.03 8.00 18.00 10.00
fatheduc 9 3583 13.55 2.26 8.00 18.00 10.00
lfaminc 10 3583 10.31 0.80 6.21 12.35 6.13
hsgrad 11 3583 0.93 0.26 0.00 1.00 1.00
cathhs 12 3583 0.07 0.26 0.00 1.00 1.00
parcath 13 3583 0.36 0.48 0.00 1.00 1.00
se
id 45436.17
read12 0.16
math12 0.16
female NA
asian 0.00
hispan 0.01
black 0.00
motheduc 0.03
fatheduc 0.04
lfaminc 0.01
hsgrad 0.00
cathhs 0.00
parcath 0.01
Boxplots – All Numeric
data_long <- train %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(data_long, aes(x = Variable, y = Value)) +
geom_boxplot() +
coord_flip() +
facet_wrap(~ Variable, scales = "free", ncol=3) +
theme_minimal() +
theme(axis.text.x = element_blank()) +
labs(title = "Horizontal Boxplot for Each Numeric Variable")library(DataExplorer)
plot_boxplot(train, by="female") # 1 = femaleHistograms
plot_histogram(train)Scatterplots (depvar ~ all x)
Look at the shape of the relationship between the dependent variable and all of the continuous potential independent variables.
plot_scatterplot(test, by="read12")Correlation Matrix
plot_correlation(train, type="continuous")MODEL
Linear Regression Model
Estimate Coefficients and show coefficients table
model <- lm(log(read12) ~ math12 + female + black + hispan + asian + hsgrad + cathhs + parcath + motheduc + fatheduc + lfaminc, data=train)
model$coefficients (Intercept) math12 female1 black hispan
3.0162102040 0.0141195323 0.0571091486 -0.0293918461 -0.0013069061
asian hsgrad cathhs parcath motheduc
-0.0205451112 0.0064540163 0.0053699613 -0.0068586548 0.0005271622
fatheduc lfaminc
0.0036438408 0.0085077593
summary(model)
Call:
lm(formula = log(read12) ~ math12 + female + black + hispan +
asian + hsgrad + cathhs + parcath + motheduc + fatheduc +
lfaminc, data = train)
Residuals:
Min 1Q Median 3Q Max
-0.62360 -0.07779 0.00994 0.08293 0.44132
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.0162102 0.0304290 99.123 < 0.0000000000000002 ***
math12 0.0141195 0.0002665 52.979 < 0.0000000000000002 ***
female1 0.0571091 0.0044395 12.864 < 0.0000000000000002 ***
black -0.0293918 0.0087965 -3.341 0.000842 ***
hispan -0.0013069 0.0077623 -0.168 0.866306
asian -0.0205451 0.0091094 -2.255 0.024170 *
hsgrad 0.0064540 0.0088232 0.731 0.464532
cathhs 0.0053700 0.0090470 0.594 0.552841
parcath -0.0068587 0.0051693 -1.327 0.184662
motheduc 0.0005272 0.0013773 0.383 0.701927
fatheduc 0.0036438 0.0012791 2.849 0.004416 **
lfaminc 0.0085078 0.0032172 2.644 0.008218 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1315 on 3571 degrees of freedom
Multiple R-squared: 0.5421, Adjusted R-squared: 0.5407
F-statistic: 384.3 on 11 and 3571 DF, p-value: < 0.00000000000000022
model %>% as_flextableEstimate | Standard Error | t value | Pr(>|t|) | ||
|---|---|---|---|---|---|
(Intercept) | 3.016 | 0.030 | 99.123 | 0.0000 | *** |
math12 | 0.014 | 0.000 | 52.979 | 0.0000 | *** |
female1 | 0.057 | 0.004 | 12.864 | 0.0000 | *** |
black | -0.029 | 0.009 | -3.341 | 0.0008 | *** |
hispan | -0.001 | 0.008 | -0.168 | 0.8663 |
|
asian | -0.021 | 0.009 | -2.255 | 0.0242 | * |
hsgrad | 0.006 | 0.009 | 0.731 | 0.4645 |
|
cathhs | 0.005 | 0.009 | 0.594 | 0.5528 |
|
parcath | -0.007 | 0.005 | -1.327 | 0.1847 |
|
motheduc | 0.001 | 0.001 | 0.383 | 0.7019 |
|
fatheduc | 0.004 | 0.001 | 2.849 | 0.0044 | ** |
lfaminc | 0.009 | 0.003 | 2.644 | 0.0082 | ** |
Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05 | |||||
Residual standard error: 0.1315 on 3571 degrees of freedom | |||||
Multiple R-squared: 0.5421, Adjusted R-squared: 0.5407 | |||||
F-statistic: 384.3 on 3571 and 11 DF, p-value: 0.0000 | |||||
Coefficient Magnitude Plot
# note this simple code using coefplot package produces chart like Figure 6.6 in one short line.
library(coefplot)
coefplot(model, soft="magnitude", intercept=FALSE)Check for predictor independence
Using Variance Inflation Factors (VIF)
library(car)
vif(model) math12 female black hispan asian hsgrad cathhs parcath
1.353957 1.018173 1.117319 1.216667 1.037221 1.104961 1.157689 1.280675
motheduc fatheduc lfaminc
1.620753 1.736261 1.389016
Residual Analysis
Residual Range
quantile(round(residuals(model, type = "deviance"),2)) 0% 25% 50% 75% 100%
-0.62 -0.08 0.01 0.08 0.44
Residual Plots
We are looking for: - Random distribution of residuals vs fitted values - Normally distributed residuals : Normal Q-Q plot with values along line - Homoskedasticity with a Scale-Location line that is horizontal and no residual pattern - Minimal influential obs - that is, those outside the borders of Cook’s distance
plot(model)Plot Fitted Value by Actual Value
library(broom)
train2 <- augment(model, data=train) # this appends predicted to original dataset to build plots on your own
p <- ggplot(train2, mapping = aes(y=.fitted, x=read12))
p + geom_point()Plot Residuals by Fitted Values
p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p + geom_point()Performance Evaluation
Use Model to Score test dataset (Display First 10 values - depvar and fitted values only)
pred <- predict(model, newdata = test, interval="predict") #score test dataset with model
test_w_predict <- cbind(test, pred) # append score to original test dataset
test_w_predict %>%
head(10) %>%
select(read12, fit, lwr, upr) %>%
as_flextable(show_coltype = FALSE) read12 | fit | lwr | upr |
|---|---|---|---|
61.4 | 3.9 | 3.6 | 4.1 |
59.3 | 3.9 | 3.7 | 4.2 |
57.6 | 4.0 | 3.7 | 4.2 |
52.5 | 4.0 | 3.8 | 4.3 |
42.1 | 3.9 | 3.7 | 4.2 |
64.0 | 4.1 | 3.9 | 4.4 |
38.5 | 4.1 | 3.8 | 4.3 |
60.3 | 4.1 | 3.8 | 4.4 |
63.6 | 4.0 | 3.7 | 4.2 |
52.6 | 4.0 | 3.8 | 4.3 |
n: 10 | |||
Plot Actual vs Fitted (test)
plot(test_w_predict$read12, test_w_predict$fit, pch=16, cex=1)Performance Metrics
library(Metrics)
metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$read12, test_w_predict$fit),4),
round(rmse(test_w_predict$read12, test_w_predict$fit),4),
round(mape(test_w_predict$read12, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)Metric | Value |
|---|---|
MAE | 47.6060 |
RMSE | 48.5057 |
MAPE | 0.9214 |
Model Fit by Gender
# more in depth plots of performance
p <- ggplot(data = subset(test_w_predict, female %in% c("1", "0")),
aes(x = read12,
y = fit, ymin = lwr, ymax = upr,
color = female,
fill = female,
group = female))
p + geom_point(alpha = 0.5) +
geom_line() + geom_ribbon(alpha = 0.3, color = FALSE) +
labs(title="Actual vs Fitted with Upper and Lower CI",
subtitle="Males (0) and Females (1)",
caption="catholic {wooldridge}") +
xlab(NULL) + ylab("fitted") +
theme(legend.position = "bottom")ggplot2 explore (Intro Section)
This is just replication of the first section within Chapter 6 (as FYI)
pip <- lm(log(read12) ~ read12, data=train)
summary(pip)
Call:
lm(formula = log(read12) ~ read12, data = train)
Residuals:
Min 1Q Median 3Q Max
-0.091797 -0.010654 0.005758 0.015870 0.019701
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 2.86525755 0.00182481 1570.2 <0.0000000000000002 ***
read12 0.02054868 0.00003482 590.2 <0.0000000000000002 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.01957 on 3581 degrees of freedom
Multiple R-squared: 0.9898, Adjusted R-squared: 0.9898
F-statistic: 3.483e+05 on 1 and 3581 DF, p-value: < 0.00000000000000022
# knitr does not like autoplot
#autoplot(lm(log(read12) ~ Potential, data=train), data=train, colour = 'female')
library(ggplot2)
p <- ggplot(data=train, mapping=aes(y=log(read12), x=lfaminc))
p + geom_point(alpha = 0.2) +
geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS"))p + geom_point(alpha=0.1) +
geom_smooth(color = "tomato", fill="tomato", method = MASS::rlm) +
geom_smooth(color = "steelblue", fill="steelblue", method = "lm")p + geom_point(alpha=0.1) +
geom_smooth(color = "tomato", method = "lm", linewidth = 1.2,
formula = y ~ splines::bs(x, 3), se = FALSE)p + geom_point(alpha=0.1) +
geom_quantile(color = "tomato", size = 1.2, method = "rqss",
lambda = 1, quantiles = c(0.20, 0.5, 0.85))SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL
When looking at the exploratory data analysis, we can uncover a few things about our data.
The first being that our boxplots containing just the numeric data show little to no outliers on most variables. The only variable containing more than 2 outliers would be the log of family income. When looking at the boxplots associated by whether that observation was a female, we don’t see much difference between genders. The only big difference we can pinpoint would be the years of education that their father received.
The histograms provide us insight to the frequency of the respective variables and that the standardized score for reading showed to be higher than the math scores. The scatterplots, however, are much too difficult to interpret due to the different kinds of categorical variables throughout the data.
The strongest correlation shows to be between the reading and math standardized scores as I am sure they each play a role in how one may affect the other.
When estimating coefficients for our linear regression model, we could see that the coefficients of math scores, female, black, asian, father’s years of education, and log of family income show to be the most significant of the bunch. The coefficient plot just below provides that visual representation.
Our model shows a moderate correlation along with residuals that indicate that the model should fit the data well.
The mean absolute error equals 47.61, root mean squared error equals 48.51, and the mean absolute percentage error came out to be 0.92.
I then provided a model that is fit by gender to show the actual vs fitted with upper and lower CI.
I believe these visualizations have enhanced our understanding of the patterns and relationships within our data. Along with this, we are able to identify, evaluate, and modify any errors in order to improve the overall comprehensive analysis of our graph.