options(scipen=999)
library(socviz) # use for the round_df() function
library(wooldridge)
load("Catholic_viz.RData")Lab 6
Linear Regression
Select 5-8 independent variable from your dataset as predictors, and estimate a linear regression model.
Work through the EDA to identified good candidates. Your model should be linear in the parameters. Consider alternative functional forms for your depvar or continuous variables.
Include interpretations and comments after each chunk Conclude with a final assessment of the model and a reflection of the value of data visualization to aid model development.
Partition 60% Train / 40% Test
library(caret)
set.seed(1)
index <- createDataPartition(Catholic_viz$read12, p=0.6, list=FALSE)
train <- Catholic_viz[index,]
test <- Catholic_viz[-index,]Training set number of observations: 3583
Test dataset number of observations: 2387
EDA
Descriptive Statistics
library(psych)
library(tidyverse)
library(flextable)
psych::describe(train, fast=TRUE) vars n mean sd median min max
id 1 3583 4642399.08 2719725.86 4634400.00 124915.00 7979086.00
read12 2 3583 51.56 9.39 52.85 29.15 68.09
math12 3 3583 52.00 9.59 52.49 30.14 71.37
female 4 3583 0.53 0.50 1.00 0.00 1.00
asian 5 3583 0.06 0.25 0.00 0.00 1.00
hispan 6 3583 0.11 0.31 0.00 0.00 1.00
black 7 3583 0.08 0.26 0.00 0.00 1.00
motheduc 8 3583 13.24 2.03 14.00 8.00 18.00
fatheduc 9 3583 13.55 2.26 14.00 8.00 18.00
lfaminc 10 3583 10.31 0.80 10.31 6.21 12.35
hsgrad 11 3583 0.93 0.26 1.00 0.00 1.00
cathhs 12 3583 0.07 0.26 0.00 0.00 1.00
parcath 13 3583 0.36 0.48 0.00 0.00 1.00
range skew kurtosis se
id 7854171.00 -0.22 -1.57 45436.17
read12 38.94 -0.34 -0.85 0.16
math12 41.23 -0.15 -0.90 0.16
female 1.00 -0.11 -1.99 0.01
asian 1.00 3.55 10.57 0.00
hispan 1.00 2.50 4.26 0.01
black 1.00 3.22 8.35 0.00
motheduc 10.00 -0.34 0.39 0.03
fatheduc 10.00 -0.10 -0.15 0.04
lfaminc 6.13 -1.21 3.88 0.01
hsgrad 1.00 -3.25 8.59 0.00
cathhs 1.00 3.26 8.65 0.00
parcath 1.00 0.57 -1.68 0.01
Boxplots – All Numeric
data_long <- train %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(data_long, aes(x = Variable, y = Value)) +
geom_boxplot() +
coord_flip() +
facet_wrap(~ Variable, scales = "free", ncol=3) +
theme_minimal() +
theme(axis.text.x = element_blank()) +
labs(title = "Horizontal Boxplot for Each Numeric Variable")The majority of the variables are binaries indicating a 1 or 0 answer depending on the variable that it is. These include asian, black, cathhs, hispan, and hsgrad. Math12 and read12 seem to be even distributed with no outliers. Motheduc has what appears to eb one outlier for each side. lfaminc has a lot of lower outliers with a couple of high outliers.
library(DataExplorer)
train <- train %>%
mutate(female = factor(female))
plot_boxplot(train, by="female") These Boxplots are done by Female and value. We can see that motheduc doesn’t have that much of a difference in gender and neither does read12 the differences are very small. The same goes for lfaminc and math12. The only noticeable difference is in fatheduc wherein the females have outliers and a smaller IQR.
Histograms
plot_histogram(train)All of the histograms appear to be evenly distributed especially read12 and math12. They might possibly have a very triong relationship. lfaminc also appears to peak around where read12 and math12 are peaking this might be able to predict. Motheduc and fatheduc appear to have a lot of similarities this might help in making the model as well.
Scatterplots (depvar ~ all x)
plot_scatterplot(test, by= "lfaminc")Math 12 and read 12 very obviously have a linear relationship, I believe that matheduc and fatheduc have a linear relationship as well.
Correlation Matrix
plot_correlation(train, type="continuous")Again we can see that read12 and math12 are very highly positively correlated and lfaminc, fatheduc, motheduc are slightly positively correlated with read 12. While the binary variables of asian, hispan, black, parcath, and cathhs are not correlated.
Model
Linear Regression Model
Estimate the following model:
\(read12 = math12 + lfaminc + fatheduc + motheduc\)
Estimate Coefficients and show coefficients table
model <- lm(read12 ~ math12 + lfaminc + fatheduc + motheduc, data=train)
model %>% as_flextableEstimate | Standard Error | t value | Pr(>|t|) | ||
|---|---|---|---|---|---|
(Intercept) | 8.830 | 1.397 | 6.321 | 0.0000 | *** |
math12 | 0.679 | 0.012 | 54.788 | 0.0000 | *** |
lfaminc | 0.413 | 0.154 | 2.682 | 0.0074 | ** |
fatheduc | 0.208 | 0.062 | 3.354 | 0.0008 | *** |
motheduc | 0.025 | 0.066 | 0.371 | 0.7104 |
|
Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05 | |||||
Residual standard error: 6.432 on 3578 degrees of freedom | |||||
Multiple R-squared: 0.5317, Adjusted R-squared: 0.5312 | |||||
F-statistic: 1016 on 3578 and 4 DF, p-value: 0.0000 | |||||
All of the variables except for motheduc are statistically significant with a 53% ability to explain the variability of the model.
Coefficient Magnitude Plot
library(coefplot)
coefplot(model, soft="magnitude", intercept=FALSE)The largest coefficient relationship is with math 12 then followed by lfaminc and fatheduc. Motheduc is about to go under the 0.0 mark and might be un-useful in our model.
Check for predictor independence
library(car)
vif(model) math12 lfaminc fatheduc motheduc
1.224744 1.333364 1.705319 1.579102
There is very low multicollinearity.
Residual Analysis
Residual Range
quantile(round(residuals(model, type = "deviance"),2)) 0% 25% 50% 75% 100%
-28.630 -4.090 0.370 4.255 21.300
Residual Plots
plot(model)For the plots, there is a normal distribution of the residuals and the Q-Q plot values all follow along the line. There is a consistent shape across the two axis.In the cooks distance all points are within the boundries.
Plot Fitted Value by Actual Value
library(broom)
train2 <- augment(model, data=train) # this appends predicted to original dataset to build plots on your own
p <- ggplot(train2, mapping = aes(y=.fitted, x=read12))
p + geom_point()Plot Residuals by Fitted Values
p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p + geom_point()Both plots have a linear and evenly distributed relationship.
Performance Evaluation
Use Model to Score test dataset (Display First 10 values - depvar and fitted values only)
pred <- predict(model, newdata = test, interval="predict") #score test dataset wtih model
test_w_predict <- cbind(test, pred) # append score to original test dataset
test_w_predict %>%
head(10) %>%
select(id, read12, fit, lwr,upr) %>%
as_flextable(show_coltype = FALSE) id | read12 | fit | lwr | upr |
|---|---|---|---|---|
124,902 | 61.4 | 49.7 | 37.1 | 62.4 |
124,916 | 59.3 | 49.9 | 37.3 | 62.6 |
124,944 | 57.6 | 52.9 | 40.3 | 65.5 |
124,947 | 52.5 | 54.5 | 41.9 | 67.1 |
124,972 | 42.1 | 53.0 | 40.4 | 65.6 |
124,990 | 64.0 | 59.3 | 46.7 | 71.9 |
124,992 | 38.5 | 56.9 | 44.3 | 69.6 |
175,585 | 60.3 | 58.9 | 46.2 | 71.5 |
180,607 | 63.6 | 55.6 | 43.0 | 68.2 |
180,660 | 52.6 | 57.3 | 44.7 | 69.9 |
n: 10 | ||||
Plot Actual vs Fitted (test)
plot(test_w_predict$read12, test_w_predict$fit, pch=16, cex=1)Performance Metrics
library(Metrics)
metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$read12, test_w_predict$fit),4),
round(rmse(test_w_predict$read12, test_w_predict$fit),4),
round(mape(test_w_predict$read12, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)Metric | Value |
|---|---|
MAE | 5.4024 |
RMSE | 6.7832 |
MAPE | 0.1129 |
Model Fit by Gender
p <- ggplot(data=test_w_predict , aes(x= read12, y= fit, ymin= lwr, ymax= upr, color= female, fill= female, group= female))
p+ geom_point(alpha = 0.5) +
geom_line() + geom_ribbon(alpha = 0.2, color = FALSE) +
labs(title="Actual vs Fitted with Upper and Lower CI",
subtitle="Male and Female",
caption="Catholic {datasetsICR}") +
xlab(NULL) + ylab("fitted") +
theme(legend.position = "bottom")The model definitely fits but I don’t think that it fits perfectly, if there was a better categorical variable that could be gathered it might be more beneficial than just a yes or no for female.
Summary Assesment and Evaluation of the Model
Based on the fit above as well as the MAE, RMSE, and MAPE. The model didn’t perfectly fit the data it was applied to. I think that overall it does an ok job of predicting but there is a chance that the values are off by at lead 5 units or 11.29% off of the actual values. A lot of the plots themselves had a linear relationship and for example the Q-Q plot fit the line perfectly. In conclusion I think the model and the plot are good but could be made better through furthur testing and possibly different variables.