Lab 6 (Ismael Hdz)

Author

Ismael Hernandez

Data

options(scipen=999)
library(socviz)
load("Catholic_viz.RData")

Partitionning (60-40)

library(caret)

Loading required package: ggplot2

Loading required package: lattice

set.seed(1)
index <- createDataPartition(Catholic_viz$read12, p=0.6, list=FALSE)
train <- Catholic_viz[index,]
test  <- Catholic_viz[-index,]

EDA

Descriptive statistics

library(psych)


Attaching package: 'psych'

The following objects are masked from 'package:ggplot2':

    %+%, alpha

library(tidyverse)

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ lubridate 1.9.3     ✔ tibble    3.2.1
✔ purrr     1.0.2     ✔ tidyr     1.3.0

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ psych::%+%()    masks ggplot2::%+%()
✖ psych::alpha()  masks ggplot2::alpha()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
✖ purrr::lift()   masks caret::lift()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

psych::describe(train, fast=TRUE)  # (class note --> remove flextable)

         vars    n       mean         sd       min        max      range
id          1 3583 4642399.08 2719725.86 124915.00 7979086.00 7854171.00
read12      2 3583      51.56       9.39     29.15      68.09      38.94
math12      3 3583      52.00       9.59     30.14      71.37      41.23
female      4 3583       0.53       0.50      0.00       1.00       1.00
asian       5 3583       0.06       0.25      0.00       1.00       1.00
hispan      6 3583       0.11       0.31      0.00       1.00       1.00
black       7 3583       0.08       0.26      0.00       1.00       1.00
motheduc    8 3583      13.24       2.03      8.00      18.00      10.00
fatheduc    9 3583      13.55       2.26      8.00      18.00      10.00
lfaminc    10 3583      10.31       0.80      6.21      12.35       6.13
hsgrad     11 3583       0.93       0.26      0.00       1.00       1.00
cathhs     12 3583       0.07       0.26      0.00       1.00       1.00
parcath    13 3583       0.36       0.48      0.00       1.00       1.00
               se
id       45436.17
read12       0.16
math12       0.16
female       0.01
asian        0.00
hispan       0.01
black        0.00
motheduc     0.03
fatheduc     0.04
lfaminc      0.01
hsgrad       0.00
cathhs       0.00
parcath      0.01

Boxplots - numeric

data_long <- train %>%
  select(where(is.numeric)) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")

ggplot(data_long, aes(x = Variable, y = Value)) +
  geom_boxplot() +
  coord_flip() + 
  facet_wrap(~ Variable, scales = "free", ncol=3) +
  theme_minimal() +
  theme(axis.text.x = element_blank()) + 
  labs(title = "Horizontal Boxplot for Each Numeric Variable")

library(DataExplorer)
plot_boxplot(train, by="female")

Histogram

plot_histogram(train)

Scatterplots (depvar ~ all x)

plot_scatterplot(test, by="read12")

The data analysis showed a connection between students’ reading and math scores, suggesting students who perform well in reading tend to also do well in math. Specifically, the correlation matrix showed a 0.73 correlation coefficient between the reading (independent) and math (dependent) variables. This high positive coefficient points to a robust positive relationship, meaning if students score high in reading they are likely to also score high in math. Additionally, a moderate positive correlation (0.58) was found between father’s and mother’s education levels (two independent variables), indicating a slight tendency for students whose fathers have higher education levels to also have mothers with higher education levels.

Correlation matrix

plot_correlation(train, type="continuous")

Model

Linear regression model

Estimate coefficients and show coefficients table

install.packages("/path/to/flextable.tar.gz", repos = NULL, type = "source")

Installing package into 'C:/Users/zayam/AppData/Local/R/win-library/4.3'
(as 'lib' is unspecified)

Warning in install.packages("/path/to/flextable.tar.gz", repos = NULL, type =
"source"): installation of package '/path/to/flextable.tar.gz' had non-zero
exit status

library(flextable)


Attaching package: 'flextable'

The following object is masked from 'package:purrr':

    compose

library(magrittr)


Attaching package: 'magrittr'

The following object is masked from 'package:purrr':

    set_names

The following object is masked from 'package:tidyr':

    extract

model <- lm(log(read12) ~ female + asian + black + motheduc + fatheduc + hispan, data = train)

model %>% as_flextable

	Estimate	Standard Error	t value	Pr(>\|t\|)
(Intercept)	3.494	0.023	153.141	0.0000	***
female	0.033	0.006	5.556	0.0000	***
asian	0.021	0.012	1.722	0.0851	.
black	-0.122	0.011	-10.611	0.0000	***
motheduc	0.012	0.002	6.583	0.0000	***
fatheduc	0.020	0.002	11.839	0.0000	***
hispan	-0.028	0.010	-2.825	0.0048	**
Signif. codes: 0 <= '*' < 0.001 < '' < 0.01 < '*' < 0.05

Residual standard error: 0.1795 on 3576 degrees of freedom
Multiple R-squared: 0.1453, Adjusted R-squared: 0.1439
F-statistic: 101.4 on 3576 and 6 DF, p-value: 0.0000

Student sex, race, and parental education are statistically significant predictors (p<0.05) of the modeled outcome.

Coefficient magnitude plot

install.packages("/path/to/coefplot.tar.gz", repos = NULL, type = "source")

Installing package into 'C:/Users/zayam/AppData/Local/R/win-library/4.3'
(as 'lib' is unspecified)

Warning in install.packages("/path/to/coefplot.tar.gz", repos = NULL, type =
"source"): installation of package '/path/to/coefplot.tar.gz' had non-zero exit
status

library(coefplot)
coefplot(model, soft="magnitude", intercept=FALSE)

The coefficient plot analysis identified key demographic variables having possible positive or negative correlations with student reading achievement. Students from Asian or female gender backgrounds tend to show higher reading scores. Additionally, higher family income levels and parental education levels exhibit positive coefficients, indicating potential positive influence on reading achievement. Conversely, identifying as Black or Hispanic demonstrates significant negative coefficients in the model, suggesting those groups on average score lower in reading assessment outcomes. Overall, the coefficient examination confirms that student ethnicity, gender, and socioeconomic traits related to wealth and education are primary explanatory factors that may impact reading performance.

Check for predictor independence

library(car)

Loading required package: carData


Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

The following object is masked from 'package:psych':

    logit

vif(model)

  female    asian    black motheduc fatheduc   hispan 
1.004622 1.027676 1.024268 1.542893 1.549068 1.087508

Residual analysis

Residual range

quantile(round(residuals(model, type = "deviance"),2))

   0%   25%   50%   75%  100% 
-0.59 -0.12  0.03  0.13  0.45

Residual plots

plot(model)

The residual plots showed no collinearity among the predictor variables.

Plot Fitted Value by Actual Value

library(ggplot2)
library(broom)
train2 <- augment(model, data=train)  

p <- ggplot(train2, mapping = aes(y=.fitted, x=read12))
p  + geom_point()

Plot residuals by fitted values

p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p  + geom_point()

Performance evaluation

Use Model to Score test dataset (Display First 10 Values - depvar and fitted values only)

library(flextable)

set_flextable_defaults(col_keys = list(col_key(read12 = ""), col_key(fit = ""), col_key(lwr = ""), col_key(upr = "")))
pred <- predict(model, newdata = test, interval = "predict")
test_w_predict <- cbind(test, pred)
flextable(head(test_w_predict[, c("read12", "fit", "lwr", "upr")], 10))

read12	fit	lwr	upr
61.41	3.897027	3.544883	4.249170
59.33	3.910947	3.558734	4.263160
57.62	3.906321	3.554221	4.258420
52.53	3.886793	3.534655	4.238932
42.09	3.936082	3.584004	4.288160
63.95	3.930474	3.578337	4.282611
38.51	3.910947	3.558734	4.263160
60.28	3.993683	3.641497	4.345868
63.57	3.872873	3.520761	4.224985
52.56	3.975137	3.623006	4.327269

Plot Actual vs Fitted - test

plot(test_w_predict$read12, test_w_predict$fit, pch=16, cex=1)

Prediction aligned closely with observation.

Performance Metrics

library(Metrics)


Attaching package: 'Metrics'

The following objects are masked from 'package:caret':

    precision, recall

metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$read12, test_w_predict$fit),4),
                         round(rmse(test_w_predict$read12, test_w_predict$fit),4),
                         round(mape(test_w_predict$read12, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)

Metric	Value
MAE	47.6018
RMSE	48.5153
MAPE	0.9210

The model evaluation metrics showed decent but imperfect performance. The mean absolute error was 47.6, root mean squared error was 48.5, and mean absolute percentage error was 0.92. While satisfactory, these figures indicate room for additional improvements to model accuracy.

Replication of first section (chapter 6)

pip <- lm(log(read12) ~ female, data=train)
summary(pip)


Call:
lm(formula = log(read12) ~ female, data = train)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.54485 -0.13729  0.04129  0.15693  0.30883 

Coefficients:
            Estimate Std. Error t value             Pr(>|t|)    
(Intercept) 3.912001   0.004704 831.656 < 0.0000000000000002 ***
female      0.024337   0.006480   3.756             0.000176 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1937 on 3581 degrees of freedom
Multiple R-squared:  0.003924,  Adjusted R-squared:  0.003645 
F-statistic: 14.11 on 1 and 3581 DF,  p-value: 0.0001756

library(ggplot2)
p <- ggplot(data=train, mapping=aes(y=log(read12), x=female))

p +  geom_point(alpha = 0.2) + 
       geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS"))

`geom_smooth()` using formula = 'y ~ x'

p + geom_point(alpha=0.1) +
    geom_smooth(color = "skyblue", fill="skyblue", method = MASS::rlm) +
    geom_smooth(color = "orange", fill="orange", method = "lm")

`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'

p + geom_point(alpha=0.1) +
    geom_smooth(color = "skyblue", method = "lm", linewidth = 1.2, 
                formula = y ~ splines::bs(x, 3), se = FALSE)

Warning in predict.lm(model, newdata = data_frame0(x = xseq), se.fit = se, :
prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases

p + geom_point(alpha=0.1) +
    geom_quantile(color = "skyblue", size = 1.2, method = "rqss",
                  lambda = 1, quantiles = c(0.20, 0.5, 0.85))

Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.

Warning in .recacheSubclasses(def@className, def, env): undefined subclass
"ndiMatrix" of class "replValueSp"; definition not updated

Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)

Warning: Computation failed in `stat_quantile()`
Caused by error in `validObject()`:
! invalid class "matrix.csr" object: ra and ja don't have equal lengths

Summary assessment and evaluation of the model

While showing some predictive capability, the model has limitations. Additional refinement through expanded testing and variables is needed to strengthen accuracy and reliability. However, the analysis did reveal meaningful relationships between key student attributes and reading achievement. Reading scores appear impacted by student math proficiency, gender, race, father’s education, and family income levels. Though not yet robust, this initial modeling and factor identification provide a useful starting point for developing a well-fitted predictive solution. Further model optimization could yield more dependable performance estimation.

Based on the model output, several key variables have a statistically significant relationship with the dependent reading score variable (read12) at the p<0.05 or higher level:

Female gender has a positive coefficient (0.033), indicating that being female correlates with slightly higher reading scores on average compared to males.
Identifying as Asian correlates with a minor positive impact on reading scores (0.021).
Being Black has a large negative coefficient (-0.122), relating to substantially lower average reading scores.
Higher mother’s and father’s education levels show positive coefficients, suggesting parental schooling attainment influences higher achievement.
Hispanic ethnicity also demonstrates a smaller but still significant negative coefficient (-0.028) with reading outcomes.

With an Adjusted R-squared of 0.1439, 14.4% of the variation in reading scores can be attributed to differences in these demographic and socioeconomic variables. The model appears relatively meaningful but still leaves considerable variability unexplained. Overall, race, gender, and parental education emerge as salient factors shaping academic reading performance. Further refinement of predictive variables could improve the model.