Lab 6 (Ismael Hdz)

Author

Ismael Hernandez

Data

options(scipen=999)
library(socviz)
load("Catholic_viz.RData")

Partitionning (60-40)

library(caret)
Loading required package: ggplot2
Loading required package: lattice
set.seed(1)
index <- createDataPartition(Catholic_viz$read12, p=0.6, list=FALSE)
train <- Catholic_viz[index,]
test  <- Catholic_viz[-index,]

EDA

Descriptive statistics

library(psych)

Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':

    %+%, alpha
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ psych::%+%()    masks ggplot2::%+%()
✖ psych::alpha()  masks ggplot2::alpha()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ purrr::lift()   masks caret::lift()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
psych::describe(train, fast=TRUE)  # (class note --> remove flextable)
         vars    n       mean         sd       min        max      range
id          1 3583 4642399.08 2719725.86 124915.00 7979086.00 7854171.00
read12      2 3583      51.56       9.39     29.15      68.09      38.94
math12      3 3583      52.00       9.59     30.14      71.37      41.23
female      4 3583       0.53       0.50      0.00       1.00       1.00
asian       5 3583       0.06       0.25      0.00       1.00       1.00
hispan      6 3583       0.11       0.31      0.00       1.00       1.00
black       7 3583       0.08       0.26      0.00       1.00       1.00
motheduc    8 3583      13.24       2.03      8.00      18.00      10.00
fatheduc    9 3583      13.55       2.26      8.00      18.00      10.00
lfaminc    10 3583      10.31       0.80      6.21      12.35       6.13
hsgrad     11 3583       0.93       0.26      0.00       1.00       1.00
cathhs     12 3583       0.07       0.26      0.00       1.00       1.00
parcath    13 3583       0.36       0.48      0.00       1.00       1.00
               se
id       45436.17
read12       0.16
math12       0.16
female       0.01
asian        0.00
hispan       0.01
black        0.00
motheduc     0.03
fatheduc     0.04
lfaminc      0.01
hsgrad       0.00
cathhs       0.00
parcath      0.01

Boxplots - numeric

data_long <- train %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(data_long, aes(x = Variable, y = Value)) +
  geom_boxplot() +
  coord_flip() + 
  facet_wrap(~ Variable, scales = "free", ncol=3) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  labs(title = "Horizontal Boxplot for Each Numeric Variable")

library(DataExplorer)
plot_boxplot(train, by="female") 

Histogram

plot_histogram(train)

Scatterplots (depvar ~ all x)

plot_scatterplot(test, by="read12")

The data analysis showed a connection between students’ reading and math scores, suggesting students who perform well in reading tend to also do well in math. Specifically, the correlation matrix showed a 0.73 correlation coefficient between the reading (independent) and math (dependent) variables. This high positive coefficient points to a robust positive relationship, meaning if students score high in reading they are likely to also score high in math. Additionally, a moderate positive correlation (0.58) was found between father’s and mother’s education levels (two independent variables), indicating a slight tendency for students whose fathers have higher education levels to also have mothers with higher education levels.

Correlation matrix

plot_correlation(train, type="continuous")

Model

Linear regression model

Estimate coefficients and show coefficients table

install.packages("/path/to/flextable.tar.gz", repos = NULL, type = "source")
Installing package into 'C:/Users/zayam/AppData/Local/R/win-library/4.3'
(as 'lib' is unspecified)
Warning in install.packages("/path/to/flextable.tar.gz", repos = NULL, type =
"source"): installation of package '/path/to/flextable.tar.gz' had non-zero
exit status
library(flextable)

Attaching package: 'flextable'
The following object is masked from 'package:purrr':

    compose
library(magrittr)

Attaching package: 'magrittr'
The following object is masked from 'package:purrr':

    set_names
The following object is masked from 'package:tidyr':

    extract
model <- lm(log(read12) ~ female + asian + black + motheduc + fatheduc + hispan, data = train)

model %>% as_flextable

Estimate

Standard Error

t value

Pr(>|t|)

(Intercept)

3.494

0.023

153.141

0.0000

***

female

0.033

0.006

5.556

0.0000

***

asian

0.021

0.012

1.722

0.0851

.

black

-0.122

0.011

-10.611

0.0000

***

motheduc

0.012

0.002

6.583

0.0000

***

fatheduc

0.020

0.002

11.839

0.0000

***

hispan

-0.028

0.010

-2.825

0.0048

**

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Residual standard error: 0.1795 on 3576 degrees of freedom

Multiple R-squared: 0.1453, Adjusted R-squared: 0.1439

F-statistic: 101.4 on 3576 and 6 DF, p-value: 0.0000

Student sex, race, and parental education are statistically significant predictors (p<0.05) of the modeled outcome.

Coefficient magnitude plot

install.packages("/path/to/coefplot.tar.gz", repos = NULL, type = "source")
Installing package into 'C:/Users/zayam/AppData/Local/R/win-library/4.3'
(as 'lib' is unspecified)
Warning in install.packages("/path/to/coefplot.tar.gz", repos = NULL, type =
"source"): installation of package '/path/to/coefplot.tar.gz' had non-zero exit
status
library(coefplot)
coefplot(model, soft="magnitude", intercept=FALSE)

The coefficient plot analysis identified key demographic variables having possible positive or negative correlations with student reading achievement. Students from Asian or female gender backgrounds tend to show higher reading scores. Additionally, higher family income levels and parental education levels exhibit positive coefficients, indicating potential positive influence on reading achievement. Conversely, identifying as Black or Hispanic demonstrates significant negative coefficients in the model, suggesting those groups on average score lower in reading assessment outcomes. Overall, the coefficient examination confirms that student ethnicity, gender, and socioeconomic traits related to wealth and education are primary explanatory factors that may impact reading performance.

Check for predictor independence

library(car)
Loading required package: carData

Attaching package: 'car'
The following object is masked from 'package:dplyr':

    recode
The following object is masked from 'package:purrr':

    some
The following object is masked from 'package:psych':

    logit
vif(model)
  female    asian    black motheduc fatheduc   hispan 
1.004622 1.027676 1.024268 1.542893 1.549068 1.087508 

Residual analysis

Residual range

quantile(round(residuals(model, type = "deviance"),2))
   0%   25%   50%   75%  100% 
-0.59 -0.12  0.03  0.13  0.45 

Residual plots

plot(model)

The residual plots showed no collinearity among the predictor variables.

Plot Fitted Value by Actual Value

library(ggplot2)
library(broom)
train2 <- augment(model, data=train)  

p <- ggplot(train2, mapping = aes(y=.fitted, x=read12))
p  + geom_point()

Plot residuals by fitted values

p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p  + geom_point()

Performance evaluation

Use Model to Score test dataset (Display First 10 Values - depvar and fitted values only)

library(flextable)

set_flextable_defaults(col_keys = list(col_key(read12 = ""), col_key(fit = ""), col_key(lwr = ""), col_key(upr = "")))
pred <- predict(model, newdata = test, interval = "predict")
test_w_predict <- cbind(test, pred)
flextable(head(test_w_predict[, c("read12", "fit", "lwr", "upr")], 10))

read12

fit

lwr

upr

61.41

3.897027

3.544883

4.249170

59.33

3.910947

3.558734

4.263160

57.62

3.906321

3.554221

4.258420

52.53

3.886793

3.534655

4.238932

42.09

3.936082

3.584004

4.288160

63.95

3.930474

3.578337

4.282611

38.51

3.910947

3.558734

4.263160

60.28

3.993683

3.641497

4.345868

63.57

3.872873

3.520761

4.224985

52.56

3.975137

3.623006

4.327269

Plot Actual vs Fitted - test

plot(test_w_predict$read12, test_w_predict$fit, pch=16, cex=1)

Prediction aligned closely with observation.

Performance Metrics

library(Metrics)

Attaching package: 'Metrics'
The following objects are masked from 'package:caret':

    precision, recall
metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$read12, test_w_predict$fit),4),
                         round(rmse(test_w_predict$read12, test_w_predict$fit),4),
                         round(mape(test_w_predict$read12, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)

Metric

Value

MAE

47.6018

RMSE

48.5153

MAPE

0.9210

The model evaluation metrics showed decent but imperfect performance. The mean absolute error was 47.6, root mean squared error was 48.5, and mean absolute percentage error was 0.92. While satisfactory, these figures indicate room for additional improvements to model accuracy.

Replication of first section (chapter 6)

pip <- lm(log(read12) ~ female, data=train)
summary(pip)

Call:
lm(formula = log(read12) ~ female, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.54485 -0.13729  0.04129  0.15693  0.30883 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept) 3.912001   0.004704 831.656 < 0.0000000000000002 ***
female      0.024337   0.006480   3.756             0.000176 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1937 on 3581 degrees of freedom
Multiple R-squared:  0.003924,  Adjusted R-squared:  0.003645 
F-statistic: 14.11 on 1 and 3581 DF,  p-value: 0.0001756
library(ggplot2)
p <- ggplot(data=train, mapping=aes(y=log(read12), x=female))

p +  geom_point(alpha = 0.2) + 
       geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS"))
`geom_smooth()` using formula = 'y ~ x'

p + geom_point(alpha=0.1) +
    geom_smooth(color = "skyblue", fill="skyblue", method = MASS::rlm) +
    geom_smooth(color = "orange", fill="orange", method = "lm")
`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

p + geom_point(alpha=0.1) +
    geom_smooth(color = "skyblue", method = "lm", linewidth = 1.2, 
                formula = y ~ splines::bs(x, 3), se = FALSE)
Warning in predict.lm(model, newdata = data_frame0(x = xseq), se.fit = se, :
prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases

p + geom_point(alpha=0.1) +
    geom_quantile(color = "skyblue", size = 1.2, method = "rqss",
                  lambda = 1, quantiles = c(0.20, 0.5, 0.85))
Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning in .recacheSubclasses(def@className, def, env): undefined subclass
"ndiMatrix" of class "replValueSp"; definition not updated
Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)
Warning: Computation failed in `stat_quantile()`
Caused by error in `validObject()`:
! invalid class "matrix.csr" object: ra and ja don't have equal lengths

Summary assessment and evaluation of the model

While showing some predictive capability, the model has limitations. Additional refinement through expanded testing and variables is needed to strengthen accuracy and reliability. However, the analysis did reveal meaningful relationships between key student attributes and reading achievement. Reading scores appear impacted by student math proficiency, gender, race, father’s education, and family income levels. Though not yet robust, this initial modeling and factor identification provide a useful starting point for developing a well-fitted predictive solution. Further model optimization could yield more dependable performance estimation.

Based on the model output, several key variables have a statistically significant relationship with the dependent reading score variable (read12) at the p<0.05 or higher level:

  • Female gender has a positive coefficient (0.033), indicating that being female correlates with slightly higher reading scores on average compared to males.
  • Identifying as Asian correlates with a minor positive impact on reading scores (0.021).
  • Being Black has a large negative coefficient (-0.122), relating to substantially lower average reading scores.
  • Higher mother’s and father’s education levels show positive coefficients, suggesting parental schooling attainment influences higher achievement.
  • Hispanic ethnicity also demonstrates a smaller but still significant negative coefficient (-0.028) with reading outcomes.

With an Adjusted R-squared of 0.1439, 14.4% of the variation in reading scores can be attributed to differences in these demographic and socioeconomic variables. The model appears relatively meaningful but still leaves considerable variability unexplained. Overall, race, gender, and parental education emerge as salient factors shaping academic reading performance. Further refinement of predictive variables could improve the model.

End