Lab 6

Make a Model - Catholic

Author

Julie Milligan

DATA

options(scipen=999)

library(socviz) # use for the round_df() function

load("Catholic_viz.RData")
Catholic_viz$female <- as.factor(Catholic_viz$female)

Partition 60% Train / 40% Test

library(caret)
set.seed(1)
index <- createDataPartition(Catholic_viz$read12, p=0.6, list=FALSE) 
train <- Catholic_viz[index,]
test  <- Catholic_viz[-index,]

Training set number of observations: 3583
Test dataset number of observations: 2387

EDA

Descriptive Statistics

library(psych)
library(tidyverse)
library(flextable)
psych::describe(train, fast=TRUE)

         vars    n       mean         sd       min        max      range
id          1 3583 4642399.08 2719725.86 124915.00 7979086.00 7854171.00
read12      2 3583      51.56       9.39     29.15      68.09      38.94
math12      3 3583      52.00       9.59     30.14      71.37      41.23
female      4 3583        NaN         NA       Inf       -Inf       -Inf
asian       5 3583       0.06       0.25      0.00       1.00       1.00
hispan      6 3583       0.11       0.31      0.00       1.00       1.00
black       7 3583       0.08       0.26      0.00       1.00       1.00
motheduc    8 3583      13.24       2.03      8.00      18.00      10.00
fatheduc    9 3583      13.55       2.26      8.00      18.00      10.00
lfaminc    10 3583      10.31       0.80      6.21      12.35       6.13
hsgrad     11 3583       0.93       0.26      0.00       1.00       1.00
cathhs     12 3583       0.07       0.26      0.00       1.00       1.00
parcath    13 3583       0.36       0.48      0.00       1.00       1.00
               se
id       45436.17
read12       0.16
math12       0.16
female         NA
asian        0.00
hispan       0.01
black        0.00
motheduc     0.03
fatheduc     0.04
lfaminc      0.01
hsgrad       0.00
cathhs       0.00
parcath      0.01

Boxplots – All Numeric

data_long <- train %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(data_long, aes(x = Variable, y = Value)) +
  geom_boxplot() +
  coord_flip() + 
  facet_wrap(~ Variable, scales = "free", ncol=3) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  labs(title = "Horizontal Boxplot for Each Numeric Variable")

library(DataExplorer)
plot_boxplot(train, by="female") # 1 = female

Histograms

plot_histogram(train)

Scatterplots (depvar ~ all x)

Look at the shape of the relationship between the dependent variable and all of the continuous potential independent variables.

plot_scatterplot(test, by="read12")

Correlation Matrix

plot_correlation(train, type="continuous")

MODEL

Linear Regression Model

Estimate Coefficients and show coefficients table

model <- lm(log(read12) ~ math12 + female + black + hispan + asian + hsgrad + cathhs + parcath + motheduc + fatheduc + lfaminc, data=train)

model$coefficients

  (Intercept)        math12       female1         black        hispan 
 3.0162102040  0.0141195323  0.0571091486 -0.0293918461 -0.0013069061 
        asian        hsgrad        cathhs       parcath      motheduc 
-0.0205451112  0.0064540163  0.0053699613 -0.0068586548  0.0005271622 
     fatheduc       lfaminc 
 0.0036438408  0.0085077593

summary(model)


Call:
lm(formula = log(read12) ~ math12 + female + black + hispan + 
    asian + hsgrad + cathhs + parcath + motheduc + fatheduc + 
    lfaminc, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.62360 -0.07779  0.00994  0.08293  0.44132 

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept)  3.0162102  0.0304290  99.123 < 0.0000000000000002 ***
math12       0.0141195  0.0002665  52.979 < 0.0000000000000002 ***
female1      0.0571091  0.0044395  12.864 < 0.0000000000000002 ***
black       -0.0293918  0.0087965  -3.341             0.000842 ***
hispan      -0.0013069  0.0077623  -0.168             0.866306    
asian       -0.0205451  0.0091094  -2.255             0.024170 *  
hsgrad       0.0064540  0.0088232   0.731             0.464532    
cathhs       0.0053700  0.0090470   0.594             0.552841    
parcath     -0.0068587  0.0051693  -1.327             0.184662    
motheduc     0.0005272  0.0013773   0.383             0.701927    
fatheduc     0.0036438  0.0012791   2.849             0.004416 ** 
lfaminc      0.0085078  0.0032172   2.644             0.008218 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1315 on 3571 degrees of freedom
Multiple R-squared:  0.5421,    Adjusted R-squared:  0.5407 
F-statistic: 384.3 on 11 and 3571 DF,  p-value: < 0.00000000000000022

model %>% as_flextable

	Estimate	Standard Error	t value	Pr(>\|t\|)
(Intercept)	3.016	0.030	99.123	0.0000	***
math12	0.014	0.000	52.979	0.0000	***
female1	0.057	0.004	12.864	0.0000	***
black	-0.029	0.009	-3.341	0.0008	***
hispan	-0.001	0.008	-0.168	0.8663
asian	-0.021	0.009	-2.255	0.0242	*
hsgrad	0.006	0.009	0.731	0.4645
cathhs	0.005	0.009	0.594	0.5528
parcath	-0.007	0.005	-1.327	0.1847
motheduc	0.001	0.001	0.383	0.7019
fatheduc	0.004	0.001	2.849	0.0044	**
lfaminc	0.009	0.003	2.644	0.0082	**
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05

Residual standard error: 0.1315 on 3571 degrees of freedom
Multiple R-squared: 0.5421, Adjusted R-squared: 0.5407
F-statistic: 384.3 on 3571 and 11 DF, p-value: 0.0000

Coefficient Magnitude Plot

# note this simple code using coefplot package produces chart like Figure 6.6 in one short line. 
library(coefplot)
coefplot(model, soft="magnitude", intercept=FALSE)

Check for predictor independence

Using Variance Inflation Factors (VIF)

library(car)
vif(model)

  math12   female    black   hispan    asian   hsgrad   cathhs  parcath 
1.353957 1.018173 1.117319 1.216667 1.037221 1.104961 1.157689 1.280675 
motheduc fatheduc  lfaminc 
1.620753 1.736261 1.389016

Residual Analysis

Residual Range

quantile(round(residuals(model, type = "deviance"),2))

   0%   25%   50%   75%  100% 
-0.62 -0.08  0.01  0.08  0.44

Residual Plots

We are looking for: - Random distribution of residuals vs fitted values - Normally distributed residuals : Normal Q-Q plot with values along line - Homoskedasticity with a Scale-Location line that is horizontal and no residual pattern - Minimal influential obs - that is, those outside the borders of Cook’s distance

plot(model)

Plot Fitted Value by Actual Value

library(broom)
train2 <- augment(model, data=train)  # this appends predicted to original dataset to build plots on your own

p <- ggplot(train2, mapping = aes(y=.fitted, x=read12))
p  + geom_point()

Plot Residuals by Fitted Values

p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p  + geom_point()

Performance Evaluation

Use Model to Score `test` dataset (Display First 10 values - depvar and fitted values only)

pred <- predict(model, newdata = test, interval="predict")  #score test dataset with model
test_w_predict <- cbind(test, pred)  # append score to original test dataset

test_w_predict %>%
  head(10) %>%
    select(read12, fit, lwr, upr) %>% 
  as_flextable(show_coltype = FALSE)

read12	fit	lwr	upr
61.4	3.9	3.6	4.1
59.3	3.9	3.7	4.2
57.6	4.0	3.7	4.2
52.5	4.0	3.8	4.3
42.1	3.9	3.7	4.2
64.0	4.1	3.9	4.4
38.5	4.1	3.8	4.3
60.3	4.1	3.8	4.4
63.6	4.0	3.7	4.2
52.6	4.0	3.8	4.3
n: 10

Plot Actual vs Fitted (`test`)

plot(test_w_predict$read12, test_w_predict$fit, pch=16, cex=1)

Performance Metrics

library(Metrics)
metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$read12, test_w_predict$fit),4),
                         round(rmse(test_w_predict$read12, test_w_predict$fit),4),
                         round(mape(test_w_predict$read12, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)

Metric	Value
MAE	47.6060
RMSE	48.5057
MAPE	0.9214

Model Fit by Gender

# more in depth plots of performance 
p <- ggplot(data = subset(test_w_predict, female %in% c("1", "0")),
            aes(x = read12,
                y = fit, ymin = lwr, ymax = upr,
                color = female,
                fill = female,
                group = female))
p +  geom_point(alpha = 0.5) + 
       geom_line() + geom_ribbon(alpha = 0.3, color = FALSE) +
       labs(title="Actual vs Fitted with Upper and Lower CI",
              subtitle="Males (0) and Females (1)",
              caption="catholic {wooldridge}") +
       xlab(NULL) + ylab("fitted") +
       theme(legend.position = "bottom")

`ggplot2` explore (Intro Section)

This is just replication of the first section within Chapter 6 (as FYI)

pip <- lm(log(read12) ~ read12, data=train)
summary(pip)


Call:
lm(formula = log(read12) ~ read12, data = train)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.091797 -0.010654  0.005758  0.015870  0.019701 

Coefficients:
              Estimate Std. Error t value            Pr(>|t|)    
(Intercept) 2.86525755 0.00182481  1570.2 <0.0000000000000002 ***
read12      0.02054868 0.00003482   590.2 <0.0000000000000002 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.01957 on 3581 degrees of freedom
Multiple R-squared:  0.9898,    Adjusted R-squared:  0.9898 
F-statistic: 3.483e+05 on 1 and 3581 DF,  p-value: < 0.00000000000000022

# knitr does not like autoplot
#autoplot(lm(log(read12) ~ Potential, data=train), data=train, colour = 'female')


library(ggplot2)
p <- ggplot(data=train, mapping=aes(y=log(read12), x=lfaminc))

p +  geom_point(alpha = 0.2) + 
       geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS"))

p + geom_point(alpha=0.1) +
    geom_smooth(color = "tomato", fill="tomato", method = MASS::rlm) +
    geom_smooth(color = "steelblue", fill="steelblue", method = "lm")

p + geom_point(alpha=0.1) +
    geom_smooth(color = "tomato", method = "lm", linewidth = 1.2, 
                formula = y ~ splines::bs(x, 3), se = FALSE)

p + geom_point(alpha=0.1) +
    geom_quantile(color = "tomato", size = 1.2, method = "rqss",
                  lambda = 1, quantiles = c(0.20, 0.5, 0.85))

SUMMARY ASSESSMENT AND EVALUATION OF THE MODEL

When looking at the exploratory data analysis, we can uncover a few things about our data.
The first being that our boxplots containing just the numeric data show little to no outliers on most variables. The only variable containing more than 2 outliers would be the log of family income. When looking at the boxplots associated by whether that observation was a female, we don’t see much difference between genders. The only big difference we can pinpoint would be the years of education that their father received.
The histograms provide us insight to the frequency of the respective variables and that the standardized score for reading showed to be higher than the math scores. The scatterplots, however, are much too difficult to interpret due to the different kinds of categorical variables throughout the data.
The strongest correlation shows to be between the reading and math standardized scores as I am sure they each play a role in how one may affect the other.

When estimating coefficients for our linear regression model, we could see that the coefficients of math scores, female, black, asian, father’s years of education, and log of family income show to be the most significant of the bunch. The coefficient plot just below provides that visual representation.
Our model shows a moderate correlation along with residuals that indicate that the model should fit the data well.
The mean absolute error equals 47.61, root mean squared error equals 48.51, and the mean absolute percentage error came out to be 0.92.
I then provided a model that is fit by gender to show the actual vs fitted with upper and lower CI.

I believe these visualizations have enhanced our understanding of the patterns and relationships within our data. Along with this, we are able to identify, evaluate, and modify any errors in order to improve the overall comprehensive analysis of our graph.