Lab 6

Linear Regression

Author

Abigail Russell

Select 5-8 independent variable from your dataset as predictors, and estimate a linear regression model.

Work through the EDA to identified good candidates. Your model should be linear in the parameters. Consider alternative functional forms for your depvar or continuous variables.

Include interpretations and comments after each chunk Conclude with a final assessment of the model and a reflection of the value of data visualization to aid model development.

options(scipen=999)

library(socviz) # use for the round_df() function
library(wooldridge)
load("Catholic_viz.RData")

Partition 60% Train / 40% Test

library(caret)
set.seed(1)
index <- createDataPartition(Catholic_viz$read12, p=0.6, list=FALSE)
train <- Catholic_viz[index,]
test  <- Catholic_viz[-index,]

Training set number of observations: 3583
Test dataset number of observations: 2387


EDA

Descriptive Statistics

 library(psych)
library(tidyverse)
library(flextable)
psych::describe(train, fast=TRUE) 
         vars    n       mean         sd     median       min        max
id          1 3583 4642399.08 2719725.86 4634400.00 124915.00 7979086.00
read12      2 3583      51.56       9.39      52.85     29.15      68.09
math12      3 3583      52.00       9.59      52.49     30.14      71.37
female      4 3583       0.53       0.50       1.00      0.00       1.00
asian       5 3583       0.06       0.25       0.00      0.00       1.00
hispan      6 3583       0.11       0.31       0.00      0.00       1.00
black       7 3583       0.08       0.26       0.00      0.00       1.00
motheduc    8 3583      13.24       2.03      14.00      8.00      18.00
fatheduc    9 3583      13.55       2.26      14.00      8.00      18.00
lfaminc    10 3583      10.31       0.80      10.31      6.21      12.35
hsgrad     11 3583       0.93       0.26       1.00      0.00       1.00
cathhs     12 3583       0.07       0.26       0.00      0.00       1.00
parcath    13 3583       0.36       0.48       0.00      0.00       1.00
              range  skew kurtosis       se
id       7854171.00 -0.22    -1.57 45436.17
read12        38.94 -0.34    -0.85     0.16
math12        41.23 -0.15    -0.90     0.16
female         1.00 -0.11    -1.99     0.01
asian          1.00  3.55    10.57     0.00
hispan         1.00  2.50     4.26     0.01
black          1.00  3.22     8.35     0.00
motheduc      10.00 -0.34     0.39     0.03
fatheduc      10.00 -0.10    -0.15     0.04
lfaminc        6.13 -1.21     3.88     0.01
hsgrad         1.00 -3.25     8.59     0.00
cathhs         1.00  3.26     8.65     0.00
parcath        1.00  0.57    -1.68     0.01

Boxplots – All Numeric

data_long <- train %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(data_long, aes(x = Variable, y = Value)) +
  geom_boxplot() +
  coord_flip() + 
  facet_wrap(~ Variable, scales = "free", ncol=3) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  labs(title = "Horizontal Boxplot for Each Numeric Variable")

The majority of the variables are binaries indicating a 1 or 0 answer depending on the variable that it is. These include asian, black, cathhs, hispan, and hsgrad. Math12 and read12 seem to be even distributed with no outliers. Motheduc has what appears to eb one outlier for each side. lfaminc has a lot of lower outliers with a couple of high outliers.

library(DataExplorer)
train <- train %>%
  mutate(female = factor(female))
plot_boxplot(train, by="female") 

These Boxplots are done by Female and value. We can see that motheduc doesn’t have that much of a difference in gender and neither does read12 the differences are very small. The same goes for lfaminc and math12. The only noticeable difference is in fatheduc wherein the females have outliers and a smaller IQR.

Histograms

plot_histogram(train)

All of the histograms appear to be evenly distributed especially read12 and math12. They might possibly have a very triong relationship. lfaminc also appears to peak around where read12 and math12 are peaking this might be able to predict. Motheduc and fatheduc appear to have a lot of similarities this might help in making the model as well.

Scatterplots (depvar ~ all x)

plot_scatterplot(test, by= "lfaminc")

Math 12 and read 12 very obviously have a linear relationship, I believe that matheduc and fatheduc have a linear relationship as well.

Correlation Matrix

plot_correlation(train, type="continuous")

Again we can see that read12 and math12 are very highly positively correlated and lfaminc, fatheduc, motheduc are slightly positively correlated with read 12. While the binary variables of asian, hispan, black, parcath, and cathhs are not correlated.


Model

Linear Regression Model

Estimate the following model:
\(read12 = math12 + lfaminc + fatheduc + motheduc\)

Estimate Coefficients and show coefficients table

model <- lm(read12 ~ math12 + lfaminc + fatheduc + motheduc, data=train)

model %>% as_flextable

Estimate

Standard Error

t value

Pr(>|t|)

(Intercept)

8.830

1.397

6.321

0.0000

***

math12

0.679

0.012

54.788

0.0000

***

lfaminc

0.413

0.154

2.682

0.0074

**

fatheduc

0.208

0.062

3.354

0.0008

***

motheduc

0.025

0.066

0.371

0.7104

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Residual standard error: 6.432 on 3578 degrees of freedom

Multiple R-squared: 0.5317, Adjusted R-squared: 0.5312

F-statistic: 1016 on 3578 and 4 DF, p-value: 0.0000

All of the variables except for motheduc are statistically significant with a 53% ability to explain the variability of the model.

Coefficient Magnitude Plot

library(coefplot)
coefplot(model, soft="magnitude", intercept=FALSE)

The largest coefficient relationship is with math 12 then followed by lfaminc and fatheduc. Motheduc is about to go under the 0.0 mark and might be un-useful in our model.

Check for predictor independence

library(car)
vif(model)
  math12  lfaminc fatheduc motheduc 
1.224744 1.333364 1.705319 1.579102 

There is very low multicollinearity.

Residual Analysis

Residual Range

quantile(round(residuals(model, type = "deviance"),2))
     0%     25%     50%     75%    100% 
-28.630  -4.090   0.370   4.255  21.300 

Residual Plots

plot(model)

For the plots, there is a normal distribution of the residuals and the Q-Q plot values all follow along the line. There is a consistent shape across the two axis.In the cooks distance all points are within the boundries.

Plot Fitted Value by Actual Value

 library(broom)
train2 <- augment(model, data=train)  # this appends predicted to original dataset to build plots on your own

p <- ggplot(train2, mapping = aes(y=.fitted, x=read12))
p  + geom_point()

Plot Residuals by Fitted Values

p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p  + geom_point()

Both plots have a linear and evenly distributed relationship.

Performance Evaluation

Use Model to Score test dataset (Display First 10 values - depvar and fitted values only)

pred <- predict(model, newdata = test, interval="predict")  #score test dataset wtih model
test_w_predict <- cbind(test, pred)  # append score to original test dataset

test_w_predict %>%
  head(10) %>%
    select(id, read12, fit, lwr,upr) %>% 
  as_flextable(show_coltype = FALSE) 

id

read12

fit

lwr

upr

124,902

61.4

49.7

37.1

62.4

124,916

59.3

49.9

37.3

62.6

124,944

57.6

52.9

40.3

65.5

124,947

52.5

54.5

41.9

67.1

124,972

42.1

53.0

40.4

65.6

124,990

64.0

59.3

46.7

71.9

124,992

38.5

56.9

44.3

69.6

175,585

60.3

58.9

46.2

71.5

180,607

63.6

55.6

43.0

68.2

180,660

52.6

57.3

44.7

69.9

n: 10

Plot Actual vs Fitted (test)

plot(test_w_predict$read12, test_w_predict$fit, pch=16, cex=1)

Performance Metrics

library(Metrics)
metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$read12, test_w_predict$fit),4),
                         round(rmse(test_w_predict$read12, test_w_predict$fit),4),
                         round(mape(test_w_predict$read12, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)

Metric

Value

MAE

5.4024

RMSE

6.7832

MAPE

0.1129

Model Fit by Gender

p <-  ggplot(data=test_w_predict , aes(x= read12, y= fit, ymin= lwr, ymax= upr, color= female, fill= female, group= female))

p+ geom_point(alpha = 0.5) + 
       geom_line() + geom_ribbon(alpha = 0.2, color = FALSE) +
       labs(title="Actual vs Fitted with Upper and Lower CI",
              subtitle="Male and Female",
              caption="Catholic {datasetsICR}") +
       xlab(NULL) + ylab("fitted") +
       theme(legend.position = "bottom")

The model definitely fits but I don’t think that it fits perfectly, if there was a better categorical variable that could be gathered it might be more beneficial than just a yes or no for female.

Summary Assesment and Evaluation of the Model

Based on the fit above as well as the MAE, RMSE, and MAPE. The model didn’t perfectly fit the data it was applied to. I think that overall it does an ok job of predicting but there is a chance that the values are off by at lead 5 units or 11.29% off of the actual values. A lot of the plots themselves had a linear relationship and for example the Q-Q plot fit the line perfectly. In conclusion I think the model and the plot are good but could be made better through furthur testing and possibly different variables.

END