options(scipen=999)
library(socviz)
load("Catholic_viz.RData")Lab 6 (Ismael Hdz)
Data
Partitionning (60-40)
library(caret)Loading required package: ggplot2
Loading required package: lattice
set.seed(1)
index <- createDataPartition(Catholic_viz$read12, p=0.6, list=FALSE)
train <- Catholic_viz[index,]
test <- Catholic_viz[-index,]EDA
Descriptive statistics
library(psych)
Attaching package: 'psych'
The following objects are masked from 'package:ggplot2':
%+%, alpha
library(tidyverse)── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ lubridate 1.9.3 ✔ tibble 3.2.1
✔ purrr 1.0.2 ✔ tidyr 1.3.0
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ psych::%+%() masks ggplot2::%+%()
✖ psych::alpha() masks ggplot2::alpha()
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
✖ purrr::lift() masks caret::lift()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
psych::describe(train, fast=TRUE) # (class note --> remove flextable) vars n mean sd min max range
id 1 3583 4642399.08 2719725.86 124915.00 7979086.00 7854171.00
read12 2 3583 51.56 9.39 29.15 68.09 38.94
math12 3 3583 52.00 9.59 30.14 71.37 41.23
female 4 3583 0.53 0.50 0.00 1.00 1.00
asian 5 3583 0.06 0.25 0.00 1.00 1.00
hispan 6 3583 0.11 0.31 0.00 1.00 1.00
black 7 3583 0.08 0.26 0.00 1.00 1.00
motheduc 8 3583 13.24 2.03 8.00 18.00 10.00
fatheduc 9 3583 13.55 2.26 8.00 18.00 10.00
lfaminc 10 3583 10.31 0.80 6.21 12.35 6.13
hsgrad 11 3583 0.93 0.26 0.00 1.00 1.00
cathhs 12 3583 0.07 0.26 0.00 1.00 1.00
parcath 13 3583 0.36 0.48 0.00 1.00 1.00
se
id 45436.17
read12 0.16
math12 0.16
female 0.01
asian 0.00
hispan 0.01
black 0.00
motheduc 0.03
fatheduc 0.04
lfaminc 0.01
hsgrad 0.00
cathhs 0.00
parcath 0.01
Boxplots - numeric
data_long <- train %>%
select(where(is.numeric)) %>%
pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value")
ggplot(data_long, aes(x = Variable, y = Value)) +
geom_boxplot() +
coord_flip() +
facet_wrap(~ Variable, scales = "free", ncol=3) +
theme_minimal() +
theme(axis.text.x = element_blank()) +
labs(title = "Horizontal Boxplot for Each Numeric Variable")library(DataExplorer)
plot_boxplot(train, by="female") Histogram
plot_histogram(train)Scatterplots (depvar ~ all x)
plot_scatterplot(test, by="read12")The data analysis showed a connection between students’ reading and math scores, suggesting students who perform well in reading tend to also do well in math. Specifically, the correlation matrix showed a 0.73 correlation coefficient between the reading (independent) and math (dependent) variables. This high positive coefficient points to a robust positive relationship, meaning if students score high in reading they are likely to also score high in math. Additionally, a moderate positive correlation (0.58) was found between father’s and mother’s education levels (two independent variables), indicating a slight tendency for students whose fathers have higher education levels to also have mothers with higher education levels.
Correlation matrix
plot_correlation(train, type="continuous")Model
Linear regression model
Estimate coefficients and show coefficients table
install.packages("/path/to/flextable.tar.gz", repos = NULL, type = "source")Installing package into 'C:/Users/zayam/AppData/Local/R/win-library/4.3'
(as 'lib' is unspecified)
Warning in install.packages("/path/to/flextable.tar.gz", repos = NULL, type =
"source"): installation of package '/path/to/flextable.tar.gz' had non-zero
exit status
library(flextable)
Attaching package: 'flextable'
The following object is masked from 'package:purrr':
compose
library(magrittr)
Attaching package: 'magrittr'
The following object is masked from 'package:purrr':
set_names
The following object is masked from 'package:tidyr':
extract
model <- lm(log(read12) ~ female + asian + black + motheduc + fatheduc + hispan, data = train)
model %>% as_flextableEstimate | Standard Error | t value | Pr(>|t|) | ||
|---|---|---|---|---|---|
(Intercept) | 3.494 | 0.023 | 153.141 | 0.0000 | *** |
female | 0.033 | 0.006 | 5.556 | 0.0000 | *** |
asian | 0.021 | 0.012 | 1.722 | 0.0851 | . |
black | -0.122 | 0.011 | -10.611 | 0.0000 | *** |
motheduc | 0.012 | 0.002 | 6.583 | 0.0000 | *** |
fatheduc | 0.020 | 0.002 | 11.839 | 0.0000 | *** |
hispan | -0.028 | 0.010 | -2.825 | 0.0048 | ** |
Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05 | |||||
Residual standard error: 0.1795 on 3576 degrees of freedom | |||||
Multiple R-squared: 0.1453, Adjusted R-squared: 0.1439 | |||||
F-statistic: 101.4 on 3576 and 6 DF, p-value: 0.0000 | |||||
Student sex, race, and parental education are statistically significant predictors (p<0.05) of the modeled outcome.
Coefficient magnitude plot
install.packages("/path/to/coefplot.tar.gz", repos = NULL, type = "source")Installing package into 'C:/Users/zayam/AppData/Local/R/win-library/4.3'
(as 'lib' is unspecified)
Warning in install.packages("/path/to/coefplot.tar.gz", repos = NULL, type =
"source"): installation of package '/path/to/coefplot.tar.gz' had non-zero exit
status
library(coefplot)
coefplot(model, soft="magnitude", intercept=FALSE)The coefficient plot analysis identified key demographic variables having possible positive or negative correlations with student reading achievement. Students from Asian or female gender backgrounds tend to show higher reading scores. Additionally, higher family income levels and parental education levels exhibit positive coefficients, indicating potential positive influence on reading achievement. Conversely, identifying as Black or Hispanic demonstrates significant negative coefficients in the model, suggesting those groups on average score lower in reading assessment outcomes. Overall, the coefficient examination confirms that student ethnicity, gender, and socioeconomic traits related to wealth and education are primary explanatory factors that may impact reading performance.
Check for predictor independence
library(car)Loading required package: carData
Attaching package: 'car'
The following object is masked from 'package:dplyr':
recode
The following object is masked from 'package:purrr':
some
The following object is masked from 'package:psych':
logit
vif(model) female asian black motheduc fatheduc hispan
1.004622 1.027676 1.024268 1.542893 1.549068 1.087508
Residual analysis
Residual range
quantile(round(residuals(model, type = "deviance"),2)) 0% 25% 50% 75% 100%
-0.59 -0.12 0.03 0.13 0.45
Residual plots
plot(model)The residual plots showed no collinearity among the predictor variables.
Plot Fitted Value by Actual Value
library(ggplot2)
library(broom)
train2 <- augment(model, data=train)
p <- ggplot(train2, mapping = aes(y=.fitted, x=read12))
p + geom_point()Plot residuals by fitted values
p <- ggplot(train2, mapping = aes(y=.resid, x=.fitted))
p + geom_point()Performance evaluation
Use Model to Score test dataset (Display First 10 Values - depvar and fitted values only)
library(flextable)
set_flextable_defaults(col_keys = list(col_key(read12 = ""), col_key(fit = ""), col_key(lwr = ""), col_key(upr = "")))
pred <- predict(model, newdata = test, interval = "predict")
test_w_predict <- cbind(test, pred)
flextable(head(test_w_predict[, c("read12", "fit", "lwr", "upr")], 10))read12 | fit | lwr | upr |
|---|---|---|---|
61.41 | 3.897027 | 3.544883 | 4.249170 |
59.33 | 3.910947 | 3.558734 | 4.263160 |
57.62 | 3.906321 | 3.554221 | 4.258420 |
52.53 | 3.886793 | 3.534655 | 4.238932 |
42.09 | 3.936082 | 3.584004 | 4.288160 |
63.95 | 3.930474 | 3.578337 | 4.282611 |
38.51 | 3.910947 | 3.558734 | 4.263160 |
60.28 | 3.993683 | 3.641497 | 4.345868 |
63.57 | 3.872873 | 3.520761 | 4.224985 |
52.56 | 3.975137 | 3.623006 | 4.327269 |
Plot Actual vs Fitted - test
plot(test_w_predict$read12, test_w_predict$fit, pch=16, cex=1)Prediction aligned closely with observation.
Performance Metrics
library(Metrics)
Attaching package: 'Metrics'
The following objects are masked from 'package:caret':
precision, recall
metric_label <- c("MAE","RMSE", "MAPE")
metrics <- c(round(mae(test_w_predict$read12, test_w_predict$fit),4),
round(rmse(test_w_predict$read12, test_w_predict$fit),4),
round(mape(test_w_predict$read12, test_w_predict$fit),4))
pmtable <- data.frame(Metric=metric_label, Value = metrics)
flextable(pmtable)Metric | Value |
|---|---|
MAE | 47.6018 |
RMSE | 48.5153 |
MAPE | 0.9210 |
The model evaluation metrics showed decent but imperfect performance. The mean absolute error was 47.6, root mean squared error was 48.5, and mean absolute percentage error was 0.92. While satisfactory, these figures indicate room for additional improvements to model accuracy.
Replication of first section (chapter 6)
pip <- lm(log(read12) ~ female, data=train)
summary(pip)
Call:
lm(formula = log(read12) ~ female, data = train)
Residuals:
Min 1Q Median 3Q Max
-0.54485 -0.13729 0.04129 0.15693 0.30883
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 3.912001 0.004704 831.656 < 0.0000000000000002 ***
female 0.024337 0.006480 3.756 0.000176 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.1937 on 3581 degrees of freedom
Multiple R-squared: 0.003924, Adjusted R-squared: 0.003645
F-statistic: 14.11 on 1 and 3581 DF, p-value: 0.0001756
library(ggplot2)
p <- ggplot(data=train, mapping=aes(y=log(read12), x=female))
p + geom_point(alpha = 0.2) +
geom_smooth(method = "lm", aes(color = "OLS", fill = "OLS"))`geom_smooth()` using formula = 'y ~ x'
p + geom_point(alpha=0.1) +
geom_smooth(color = "skyblue", fill="skyblue", method = MASS::rlm) +
geom_smooth(color = "orange", fill="orange", method = "lm")`geom_smooth()` using formula = 'y ~ x'
`geom_smooth()` using formula = 'y ~ x'
p + geom_point(alpha=0.1) +
geom_smooth(color = "skyblue", method = "lm", linewidth = 1.2,
formula = y ~ splines::bs(x, 3), se = FALSE)Warning in predict.lm(model, newdata = data_frame0(x = xseq), se.fit = se, :
prediction from rank-deficient fit; attr(*, "non-estim") has doubtful cases
p + geom_point(alpha=0.1) +
geom_quantile(color = "skyblue", size = 1.2, method = "rqss",
lambda = 1, quantiles = c(0.20, 0.5, 0.85))Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
ℹ Please use `linewidth` instead.
Warning in .recacheSubclasses(def@className, def, env): undefined subclass
"ndiMatrix" of class "replValueSp"; definition not updated
Smoothing formula not specified. Using: y ~ qss(x, lambda = 1)
Warning: Computation failed in `stat_quantile()`
Caused by error in `validObject()`:
! invalid class "matrix.csr" object: ra and ja don't have equal lengths
Summary assessment and evaluation of the model
While showing some predictive capability, the model has limitations. Additional refinement through expanded testing and variables is needed to strengthen accuracy and reliability. However, the analysis did reveal meaningful relationships between key student attributes and reading achievement. Reading scores appear impacted by student math proficiency, gender, race, father’s education, and family income levels. Though not yet robust, this initial modeling and factor identification provide a useful starting point for developing a well-fitted predictive solution. Further model optimization could yield more dependable performance estimation.
Based on the model output, several key variables have a statistically significant relationship with the dependent reading score variable (read12) at the p<0.05 or higher level:
- Female gender has a positive coefficient (0.033), indicating that being female correlates with slightly higher reading scores on average compared to males.
- Identifying as Asian correlates with a minor positive impact on reading scores (0.021).
- Being Black has a large negative coefficient (-0.122), relating to substantially lower average reading scores.
- Higher mother’s and father’s education levels show positive coefficients, suggesting parental schooling attainment influences higher achievement.
- Hispanic ethnicity also demonstrates a smaller but still significant negative coefficient (-0.028) with reading outcomes.
With an Adjusted R-squared of 0.1439, 14.4% of the variation in reading scores can be attributed to differences in these demographic and socioeconomic variables. The model appears relatively meaningful but still leaves considerable variability unexplained. Overall, race, gender, and parental education emerge as salient factors shaping academic reading performance. Further refinement of predictive variables could improve the model.