This project investigates: What family engagement and school choice factors predict a child’s academic success, and how does parental involvement relate to school participation frequency?
We analyze data from the 2019 Parent and Family Involvement (PFI) in Education Survey, collected by the US Census Bureau for the Department of Education. The dataset contains 15,500 observations, each representing one K-12 student, with variables covering school type, family engagement behaviors, child academic outcomes, and household demographics.
library(MASS)
library(tidyverse)
library(tidymodels)
library(readxl)
library(GGally)
library(ggformula)
library(discrim)
library(nnet)
library(themis)
We load MASS first intentionally — it has a
select() function that conflicts with
dplyr::select(). Loading it before tidyverse
ensures that dplyr::select() takes precedence throughout
the document. We use dplyr::select() explicitly wherever
needed as an additional safeguard.
pfi <- read_excel("pfi-data-2019.xlsx")
cat("Rows:", nrow(pfi), "| Cols:", ncol(pfi), "\n")
## Rows: 15500 | Cols: 75
We use only the 2019 survey year. The dataset contains 15,500 observations and 75 variables. We read it directly from the Excel file as required by the project instructions.
pfi <- pfi %>%
dplyr::select(
BASMID,
SEGRADES, SEGRADEQ, SEENJOY, SEABSNT, SEFUTUREX,
FSFREQ,
FSSPORTX, FSVOL, FSMTNG, FSPTMTNG,
FSATCNFN, FSFUNDRS, FSCOMMTE, FSCOUNSLR,
FSNOTESX, FSMEMO,
FCSCHOOL, FCTEACHR, FCSTDS, FCORDER, FCSUPPRT,
FHHOME, FHWKHRS, FHCHECKX, FHHELP,
FOSTORY2X, FOCRAFTS, FOGAMES, FOBUILDX,
FOSPORT, FORESPON, FOHISTX,
FODINNERX, FOLIBRAYX, FOBOOKSTX,
EDCPUB, SPUBCHOIX, SCONSIDR, SCCHOICE,
CSEX, RACEETH, PARGRADEX, TTLHHINC,
HHPARN19_BRD, DSBLTY, INTACC, CENREG,
ALLGRADEX, P1MRSTA, HHTOTALXX, NUMSIBSX
)
# Recode all negative values as NA
# In PFI data, negative codes (-1, -9) represent valid skips or missing
# These are not real values and must be treated as NA
pfi <- pfi %>%
mutate(across(everything(), ~ ifelse(. < 0, NA, .)))
# Missingness summary
missing_tbl <- pfi %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(everything(),
names_to = "variable",
values_to = "n_missing") %>%
mutate(pct_missing = round(n_missing / nrow(pfi) * 100, 1)) %>%
arrange(desc(pct_missing))
print(missing_tbl, n = 25)
## # A tibble: 52 × 3
## variable n_missing pct_missing
## <chr> <int> <dbl>
## 1 FHWKHRS 1047 6.8
## 2 FHCHECKX 1047 6.8
## 3 FHHELP 1047 6.8
## 4 FSFREQ 126 0.8
## 5 FSSPORTX 126 0.8
## 6 FSVOL 126 0.8
## 7 FSMTNG 126 0.8
## 8 FSPTMTNG 126 0.8
## 9 FSATCNFN 126 0.8
## 10 FSFUNDRS 126 0.8
## 11 FSCOMMTE 126 0.8
## 12 FSCOUNSLR 126 0.8
## 13 FSNOTESX 53 0.3
## 14 FSMEMO 53 0.3
## 15 FCSCHOOL 53 0.3
## 16 FCTEACHR 53 0.3
## 17 FCSTDS 53 0.3
## 18 FCORDER 53 0.3
## 19 FCSUPPRT 53 0.3
## 20 FHHOME 53 0.3
## 21 BASMID 0 0
## 22 SEGRADES 2 0
## 23 SEGRADEQ 2 0
## 24 SEENJOY 2 0
## 25 SEABSNT 2 0
## # ℹ 27 more rows
Variable selection rationale: We selected variables covering five domains: (1) child academic outcomes as response variables, (2) school involvement activities as key predictors, (3) home engagement behaviors, (4) school type and choice, and (5) household demographics as controls.
Missing value treatment: In the PFI dataset,
negative values are special codes — -1 means “valid skip”
(question not applicable to this respondent) and -9 means
missing. Neither represents a real data value, so we recode all negative
entries to NA before any analysis.
Missingness assessment: The three homework-related
variables (FHWKHRS, FHCHECKX,
FHHELP) have 6.8% missing — these apply only to students
who do homework, so valid skips account for this. All other variables
have less than 1% missing. Overall missingness is low and will not
substantially affect our results.
pfi <- pfi %>% filter(!is.na(FCSCHOOL))
cat("After filtering to school students:", nrow(pfi), "\n")
## After filtering to school students: 15447
pfi <- pfi %>%
filter(!is.na(SEGRADES), !is.na(FSFREQ), !is.na(SEGRADEQ))
cat("After dropping missing outcomes:", nrow(pfi), "\n")
## After dropping missing outcomes: 15374
We filter to students who have school satisfaction data
(FCSCHOOL not missing), which effectively removes
homeschool-only students who were not enrolled in any school. These
students have valid skips on all school-related variables and would
introduce systematic bias if included. We then remove the small number
of rows (73) missing key response variables.
pfi <- pfi %>%
mutate(
SEGRADES = ifelse(SEGRADES == 5, NA, SEGRADES),
SEGRADES = factor(SEGRADES, levels = 1:4,
labels = c("Mostly_As", "Mostly_Bs",
"Mostly_Cs", "Mostly_Ds_lower")),
SEGRADEQ = factor(SEGRADEQ, levels = 1:5,
labels = c("Excellent", "Above_avg", "Average",
"Below_avg", "Failing")),
SEENJOY = factor(SEENJOY, levels = 1:4,
labels = c("Strongly_agree", "Agree",
"Disagree", "Strongly_disagree")),
RACEETH = factor(RACEETH, levels = 1:5,
labels = c("White", "Black", "Hispanic",
"Asian_PI", "Other")),
CSEX = factor(CSEX, levels = 1:2,
labels = c("Male", "Female")),
HHPARN19_BRD = factor(HHPARN19_BRD, levels = 1:2,
labels = c("Two_parents", "Single_parent")),
DSBLTY = factor(DSBLTY, levels = 1:2,
labels = c("Has_disability", "No_disability")),
PARGRADEX = factor(PARGRADEX, levels = 1:5,
labels = c("Less_than_HS", "HS_grad",
"Some_college", "College_grad",
"Grad_school")),
EDCPUB = factor(EDCPUB, levels = 1:2,
labels = c("Public", "Not_public")),
INTACC = factor(INTACC, levels = 1:4,
labels = c("Home_and_cell", "Home_only",
"Cell_only", "No_internet")),
CENREG = factor(CENREG, levels = 1:4,
labels = c("Northeast", "South", "Midwest", "West")),
across(c(FSSPORTX, FSVOL, FSMTNG, FSPTMTNG, FSATCNFN,
FSFUNDRS, FSCOMMTE, FSCOUNSLR,
FOSTORY2X, FOCRAFTS, FOGAMES, FOBUILDX,
FOSPORT, FORESPON, FOHISTX,
FOLIBRAYX, FOBOOKSTX),
~ ifelse(. == 1, 1L, ifelse(. == 2, 0L, NA_integer_)))
)
Key recoding decisions:
SEGRADES code 5 (“School does not give these grades”)
is excluded as NA — these students attend schools without
traditional grading systems and cannot be meaningfully compared on this
outcome. This removes 1,957 observations.PARGRADEX, the reference
level is “Less_than_HS” — all education coefficients are interpreted
relative to parents without a high school diploma.pfi <- pfi %>%
mutate(
# Sum of 8 binary school activity indicators (0-8 scale)
school_involve_score = rowSums(
dplyr::select(., FSSPORTX, FSVOL, FSMTNG, FSPTMTNG,
FSATCNFN, FSFUNDRS, FSCOMMTE, FSCOUNSLR),
na.rm = TRUE),
# Sum of 7 binary home engagement indicators (0-7 scale)
home_engage_score = rowSums(
dplyr::select(., FOSTORY2X, FOCRAFTS, FOGAMES, FOBUILDX,
FOSPORT, FORESPON, FOHISTX),
na.rm = TRUE),
# Binary high achiever outcome: Yes = Mostly A's, No = B's or below
# Levels set as c("Yes", "No") so "Yes" is the positive class
high_achiever = factor(
ifelse(SEGRADES == "Mostly_As", "Yes", "No"),
levels = c("Yes", "No")
)
)
cat("SEGRADES distribution:\n")
## SEGRADES distribution:
print(table(pfi$SEGRADES, useNA = "ifany"))
##
## Mostly_As Mostly_Bs Mostly_Cs Mostly_Ds_lower <NA>
## 7574 4320 1280 243 1957
cat("\nHigh achiever distribution:\n")
##
## High achiever distribution:
print(table(pfi$high_achiever, useNA = "ifany"))
##
## Yes No <NA>
## 7574 5843 1957
cat("\nFSFREQ summary:\n")
##
## FSFREQ summary:
print(summary(pfi$FSFREQ))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 4.000 7.018 8.000 99.000
cat("\nSchool involvement score:\n")
##
## School involvement score:
print(summary(pfi$school_involve_score))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 3.000 4.000 4.322 6.000 8.000
cat("\nFinal dataset:", nrow(pfi), "rows x", ncol(pfi), "cols\n")
##
## Final dataset: 15374 rows x 55 cols
Composite score rationale: We created two summary
scores rather than including all individual binary items as separate
predictors. This reduces dimensionality, avoids multicollinearity among
highly related items, and produces more interpretable coefficients.
school_involve_score (0-8) counts how many of the 8 school
activities a household participated in. home_engage_score
(0-7) counts how many of 7 home engagement activities occurred in the
past week.
Binary outcome: high_achiever was
created by dichotomizing SEGRADES — Mostly A’s vs
everything else. This provides a clean binary outcome for logistic
regression and LDA/QDA while preserving the full 4-level outcome for
multinomial regression. The “Yes” level is set first so it serves as the
positive class in all classification metrics.
Note on GLM direction: Since
high_achiever has levels c("Yes", "No"), the
GLM models the probability of the second level (“No”). This means odds
ratios less than 1 indicate higher probability of being a high
achiever, and odds ratios greater than 1 indicate lower
probability. We note this explicitly when interpreting coefficients.
pfi %>%
ggplot(aes(x = FSFREQ)) +
geom_histogram(bins = 40, fill = "steelblue", color = "white") +
labs(title = "Distribution of School Participation Count (FSFREQ)",
subtitle = "Right-skewed count variable — candidate for Poisson regression",
x = "Times participated in school activities", y = "Count")
cat("FSFREQ mean: ", mean(pfi$FSFREQ, na.rm = TRUE), "\n")
## FSFREQ mean: 7.018017
cat("FSFREQ variance: ", var(pfi$FSFREQ, na.rm = TRUE), "\n")
## FSFREQ variance: 85.42458
cat("Ratio (var/mean):",
var(pfi$FSFREQ, na.rm = TRUE) /
mean(pfi$FSFREQ, na.rm = TRUE), "\n")
## Ratio (var/mean): 12.17218
Overdispersion finding: The variance-to-mean ratio
for FSFREQ is approximately 12.2 — far greater than 1.
Standard Poisson regression assumes variance equals the mean (ratio =
1). A ratio this high indicates severe overdispersion, meaning standard
Poisson will underestimate standard errors and produce falsely
significant results. We address this by using Negative Binomial
regression instead, which adds a dispersion parameter to
account for the excess variance.
pfi %>%
filter(!is.na(SEGRADES)) %>%
ggplot(aes(x = SEGRADES)) +
geom_bar(fill = "steelblue") +
labs(title = "Distribution of Child Grades (2019 PFI)",
subtitle = "Imbalanced — Mostly A's dominates",
x = "Grade Category", y = "Count")
The grade distribution is heavily right-skewed: 7,574 students receive mostly A’s compared to only 243 receiving mostly D’s or lower. This class imbalance is a modeling concern for multinomial regression — a naive model will learn to ignore the minority classes. We address this with upsampling in Section 10.
pfi %>%
filter(!is.na(SEGRADES)) %>%
ggplot(aes(x = SEGRADES, y = school_involve_score)) +
geom_boxplot(fill = "steelblue", alpha = 0.6) +
labs(title = "School Involvement Score by Grade",
x = "Grade Category", y = "School Involvement Score (0-8)")
pfi %>%
filter(!is.na(SEGRADES)) %>%
ggplot(aes(x = SEGRADES, y = home_engage_score)) +
geom_boxplot(fill = "steelblue", alpha = 0.6) +
labs(title = "Home Engagement Score by Grade",
x = "Grade Category", y = "Home Engagement Score (0-7)")
pfi %>%
filter(!is.na(SEGRADES)) %>%
ggplot(aes(x = SEGRADES, y = FSFREQ)) +
geom_boxplot(fill = "steelblue", alpha = 0.6) +
labs(title = "School Participation Frequency by Grade",
x = "Grade Category", y = "Times Participated (FSFREQ)")
Higher-achieving students tend to have parents with higher school
involvement scores. The relationship with home engagement is weaker but
still present. School participation frequency (FSFREQ)
shows a similar pattern — parents of A students participate more often.
These patterns motivate including both scores as predictors in all
models.
pfi %>%
filter(!is.na(SEGRADES), !is.na(PARGRADEX)) %>%
ggplot(aes(x = PARGRADEX, fill = SEGRADES)) +
geom_bar(position = "fill") +
labs(title = "Grade Distribution by Parent Education",
x = "Parent Education", y = "Proportion") +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
pfi %>%
filter(!is.na(SEGRADES), !is.na(RACEETH)) %>%
ggplot(aes(x = RACEETH, fill = SEGRADES)) +
geom_bar(position = "fill") +
labs(title = "Grade Distribution by Race/Ethnicity",
x = "Race/Ethnicity", y = "Proportion")
pfi %>%
filter(!is.na(SEGRADES), !is.na(HHPARN19_BRD)) %>%
ggplot(aes(x = HHPARN19_BRD, fill = SEGRADES)) +
geom_bar(position = "fill") +
labs(title = "Grade Distribution by Household Structure",
x = "Household Type", y = "Proportion")
Parent education shows a strong gradient — the proportion of A students increases substantially with each educational level. Two-parent households have a higher proportion of A students than single-parent households. Racial and ethnic disparities are visible across all grade categories.
pfi %>%
filter(!is.na(PARGRADEX)) %>%
ggplot(aes(x = PARGRADEX, y = FSFREQ)) +
geom_boxplot(fill = "steelblue", alpha = 0.6) +
labs(title = "School Participation by Parent Education",
x = "Parent Education", y = "FSFREQ") +
theme(axis.text.x = element_text(angle = 30, hjust = 1))
pfi %>%
filter(!is.na(EDCPUB)) %>%
ggplot(aes(x = EDCPUB, y = FSFREQ)) +
geom_boxplot(fill = "steelblue", alpha = 0.6) +
labs(title = "School Participation by School Type",
x = "School Type", y = "FSFREQ")
pfi %>%
filter(!is.na(HHPARN19_BRD)) %>%
ggplot(aes(x = HHPARN19_BRD, y = FSFREQ)) +
geom_boxplot(fill = "steelblue", alpha = 0.6) +
labs(title = "School Participation by Household Structure",
x = "Household Type", y = "FSFREQ")
Parents in non-public schools and two-parent households show higher median participation counts. Parent education shows a positive trend — more educated parents participate more. These patterns are consistent with the grade distributions above and motivate our predictor selection for the Negative Binomial model.
set.seed(631)
# Classification dataset
pfi_model <- pfi %>%
filter(!is.na(SEGRADES), !is.na(high_achiever)) %>%
drop_na(school_involve_score, home_engage_score,
PARGRADEX, RACEETH, TTLHHINC,
HHPARN19_BRD, DSBLTY, EDCPUB, CSEX)
# Count dataset for NB regression
pfi_count <- pfi %>%
drop_na(FSFREQ, school_involve_score, home_engage_score,
PARGRADEX, RACEETH, TTLHHINC,
HHPARN19_BRD, DSBLTY, EDCPUB, CSEX)
split_model <- initial_split(pfi_model, prop = 0.80, strata = SEGRADES)
train_model <- training(split_model)
test_model <- testing(split_model)
split_count <- initial_split(pfi_count, prop = 0.80, strata = FSFREQ)
train_count <- training(split_count)
test_count <- testing(split_count)
cat("Classification train/test:", nrow(train_model), "/", nrow(test_model), "\n")
## Classification train/test: 10733 / 2684
cat("Count model train/test: ", nrow(train_count), "/", nrow(test_count), "\n")
## Count model train/test: 12297 / 3077
We use an 80/20 train/test split for all models. We use
strata to ensure the class distribution is preserved in
both the training and test sets — this is important given the grade
imbalance. We set set.seed(631) for reproducibility. The
test set is held out completely and only used for final evaluation — no
modeling decisions are made using test set results.
Two separate datasets are created: pfi_model for the
classification models (logistic, multinomial, LDA/QDA) and
pfi_count for the count model (NB), since they require
different complete-case filtering.
Research question: What factors predict how frequently parents participate in school activities?
Why Negative Binomial? FSFREQ is a
count variable (0-99, integer values) making it a candidate for Poisson
regression. However, the EDA showed a variance-to-mean ratio of 12.2,
indicating severe overdispersion. Standard Poisson regression assumes
variance = mean. When this assumption is violated, standard errors are
underestimated and hypothesis tests are invalid. Negative Binomial
regression adds a dispersion parameter \(\theta\) that allows variance to exceed the
mean, making it the appropriate choice here.
count_formula <- FSFREQ ~ school_involve_score + home_engage_score +
PARGRADEX + TTLHHINC + HHPARN19_BRD +
EDCPUB + RACEETH + CSEX + DSBLTY
# Candidate Model 1: Standard Poisson (baseline for comparison)
pois_mod <- glm(count_formula,
data = train_count,
family = poisson(link = "log"))
# Candidate Model 2: Negative Binomial full model
nb_mod <- glm.nb(count_formula, data = train_count)
# Candidate Model 3: NB reduced (removing EDCPUB, CSEX)
nb_mod_reduced <- glm.nb(
FSFREQ ~ school_involve_score + home_engage_score +
PARGRADEX + TTLHHINC + HHPARN19_BRD + RACEETH + DSBLTY,
data = train_count)
# Model comparison
cat("Poisson AIC: ", AIC(pois_mod), "\n")
## Poisson AIC: 104067.7
cat("NB full AIC: ", AIC(nb_mod), "\n")
## NB full AIC: 69124.06
cat("NB reduced AIC: ", AIC(nb_mod_reduced), "\n")
## NB reduced AIC: 69180.38
# Likelihood ratio test: full vs reduced NB
anova(nb_mod_reduced, nb_mod, test = "Chisq")
## Likelihood ratio tests of Negative Binomial Models
##
## Response: FSFREQ
## Model
## 1 school_involve_score + home_engage_score + PARGRADEX + TTLHHINC + HHPARN19_BRD + RACEETH + DSBLTY
## 2 school_involve_score + home_engage_score + PARGRADEX + TTLHHINC + HHPARN19_BRD + EDCPUB + RACEETH + CSEX + DSBLTY
## theta Resid. df 2 x log-lik. Test df LR stat. Pr(Chi)
## 1 2.055859 12283 -69150.38
## 2 2.068716 12281 -69090.06 1 vs 2 2 60.32132 7.971401e-14
# Final model summary
summary(nb_mod)
##
## Call:
## glm.nb(formula = count_formula, data = train_count, init.theta = 2.068716274,
## link = log)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.615088 0.048233 12.752 < 2e-16 ***
## school_involve_score 0.256287 0.004265 60.086 < 2e-16 ***
## home_engage_score -0.008634 0.004219 -2.046 0.04073 *
## PARGRADEXHS_grad 0.001707 0.042610 0.040 0.96805
## PARGRADEXSome_college 0.041614 0.039624 1.050 0.29362
## PARGRADEXCollege_grad 0.087466 0.041093 2.128 0.03330 *
## PARGRADEXGrad_school 0.079498 0.042427 1.874 0.06096 .
## TTLHHINC 0.012679 0.003226 3.930 8.48e-05 ***
## HHPARN19_BRDSingle_parent -0.031019 0.018665 -1.662 0.09654 .
## EDCPUBNot_public 0.173806 0.022888 7.594 3.10e-14 ***
## RACEETHBlack -0.233135 0.026947 -8.652 < 2e-16 ***
## RACEETHHispanic -0.179418 0.020268 -8.852 < 2e-16 ***
## RACEETHAsian_PI -0.400010 0.030404 -13.156 < 2e-16 ***
## RACEETHOther -0.039615 0.030154 -1.314 0.18893
## CSEXFemale 0.016807 0.014776 1.137 0.25535
## DSBLTYNo_disability 0.051454 0.017380 2.960 0.00307 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(2.0687) family taken to be 1)
##
## Null deviance: 17813 on 12296 degrees of freedom
## Residual deviance: 12239 on 12281 degrees of freedom
## AIC: 69124
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 2.0687
## Std. Err.: 0.0330
##
## 2 x log-likelihood: -69090.0640
# Test set evaluation
nb_pred <- predict(nb_mod, newdata = test_count, type = "response")
nb_rmse <- sqrt(mean((test_count$FSFREQ - nb_pred)^2))
cat("\nNB Test RMSE:", nb_rmse, "\n")
##
## NB Test RMSE: 8.117712
# Predicted vs actual
tibble(actual = test_count$FSFREQ, predicted = nb_pred) %>%
ggplot(aes(x = actual, y = predicted)) +
geom_point(alpha = 0.2, color = "steelblue") +
geom_abline(slope = 1, intercept = 0, color = "red") +
labs(title = "Negative Binomial: Predicted vs Actual FSFREQ",
subtitle = "Points near the red line indicate accurate predictions",
x = "Actual FSFREQ", y = "Predicted FSFREQ")
# Residuals vs fitted
tibble(fitted = fitted(nb_mod),
resid = residuals(nb_mod, type = "pearson")) %>%
ggplot(aes(x = fitted, y = resid)) +
geom_point(alpha = 0.2, color = "steelblue") +
geom_hline(yintercept = 0, color = "red") +
labs(title = "NB Residuals vs Fitted",
subtitle = "Checking model fit — residuals should be centered at 0",
x = "Fitted Values", y = "Pearson Residuals")
Model selection: Three models were compared. The
Poisson model (AIC = 104,068) is far worse than both NB models,
confirming overdispersion. The full NB model (AIC = 69,124) outperforms
the reduced NB model (AIC = 69,180). A likelihood ratio test confirms
the full model is significantly better (p < 0.001), justifying
inclusion of school type (EDCPUB) and child sex
(CSEX).
Interpretation of key coefficients (all on the log count scale, exponentiated = multiplicative effect on expected FSFREQ):
school_involve_score: exp(0.256) = 1.29 — each
additional school activity participated in multiplies expected
participation count by 1.29, holding all else constant. This is the
dominant predictor.EDCPUBNot_public: exp(0.174) = 1.19 — parents of
non-public school students participate 19% more often.HHPARN19_BRDSingle_parent: exp(-0.031) = 0.97 — single
parents participate slightly less, though this effect is only marginally
significant.RACEETHBlack: exp(-0.233) = 0.79 — Black families
participate 21% less than White families on average.RACEETHAsian_PI: exp(-0.400) = 0.67 — Asian/Pacific
Islander families participate 33% less than White families.DSBLTYNo_disability: exp(0.051) = 1.05 — families of
children without disabilities participate slightly more.Test performance: RMSE = 8.12 on the test set. Given
that FSFREQ ranges from 0-99 with mean 7.0, this reflects
the wide spread of the outcome and the presence of extreme values (some
parents participate 50-99 times per year). The model captures the
central tendency well but struggles with outliers, which is expected
given the heavy right tail.
Research question: What factors predict whether a child is a high academic achiever (mostly A’s)?
Why logistic regression? high_achiever
is a binary outcome (Yes/No). Logistic regression models the log-odds of
the outcome as a linear combination of predictors, making it the
standard approach for binary classification. It produces interpretable
odds ratios and allows formal hypothesis testing for each predictor.
Note on coefficient direction: Since
high_achiever is coded with “Yes” as the first level, the
GLM models \(P(\text{No})\). Odds
ratios less than 1 indicate the predictor is associated with a higher
probability of being a high achiever. We note this explicitly below.
logistic_full <- high_achiever ~ school_involve_score + home_engage_score +
PARGRADEX + TTLHHINC + HHPARN19_BRD +
EDCPUB + RACEETH + CSEX + DSBLTY
logistic_reduced <- high_achiever ~ school_involve_score + home_engage_score +
PARGRADEX + TTLHHINC +
HHPARN19_BRD + RACEETH + DSBLTY
log_spec <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
# Candidate Model 1: Full
log_mod_full <- log_spec %>%
fit(logistic_full, data = train_model)
# Candidate Model 2: Reduced (no EDCPUB, CSEX)
log_mod_reduced <- log_spec %>%
fit(logistic_reduced, data = train_model)
# AIC comparison
cat("Logistic full AIC: ", glance(log_mod_full)$AIC, "\n")
## Logistic full AIC: 12999.26
cat("Logistic reduced AIC:", glance(log_mod_reduced)$AIC, "\n")
## Logistic reduced AIC: 13172.71
# Likelihood ratio test
anova(extract_fit_engine(log_mod_reduced),
extract_fit_engine(log_mod_full),
test = "Chisq")
## Analysis of Deviance Table
##
## Model 1: high_achiever ~ school_involve_score + home_engage_score + PARGRADEX +
## TTLHHINC + HHPARN19_BRD + RACEETH + DSBLTY
## Model 2: high_achiever ~ school_involve_score + home_engage_score + PARGRADEX +
## TTLHHINC + HHPARN19_BRD + EDCPUB + RACEETH + CSEX + DSBLTY
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 10719 13145
## 2 10717 12967 2 177.45 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Odds ratios — exponentiated coefficients with confidence intervals
tidy(log_mod_full, exponentiate = TRUE, conf.int = TRUE)
## # A tibble: 16 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 7.07 0.132 14.9 6.75e-50 5.47 9.17
## 2 school_involve_score 0.955 0.0119 -3.91 9.32e- 5 0.933 0.977
## 3 home_engage_score 0.928 0.0120 -6.25 4.07e-10 0.906 0.950
## 4 PARGRADEXHS_grad 1.07 0.112 0.607 5.44e- 1 0.859 1.33
## 5 PARGRADEXSome_colle… 0.829 0.103 -1.81 7.02e- 2 0.677 1.01
## 6 PARGRADEXCollege_gr… 0.597 0.108 -4.78 1.79e- 6 0.483 0.737
## 7 PARGRADEXGrad_school 0.392 0.113 -8.27 1.32e-16 0.314 0.489
## 8 TTLHHINC 0.948 0.00911 -5.81 6.39e- 9 0.932 0.966
## 9 HHPARN19_BRDSingle_… 1.27 0.0515 4.64 3.42e- 6 1.15 1.41
## 10 EDCPUBNot_public 0.811 0.0717 -2.93 3.40e- 3 0.704 0.932
## 11 RACEETHBlack 1.82 0.0749 8.01 1.11e-15 1.57 2.11
## 12 RACEETHHispanic 1.39 0.0555 5.86 4.49e- 9 1.24 1.54
## 13 RACEETHAsian_PI 0.659 0.0928 -4.49 6.99e- 6 0.548 0.789
## 14 RACEETHOther 1.11 0.0876 1.20 2.30e- 1 0.935 1.32
## 15 CSEXFemale 0.576 0.0426 -12.9 2.39e-38 0.530 0.626
## 16 DSBLTYNo_disability 0.347 0.0508 -20.9 7.98e-97 0.314 0.383
# Test set evaluation
log_aug <- augment(log_mod_full, new_data = test_model)
conf_mat(log_aug, truth = high_achiever, estimate = .pred_class)
## Truth
## Prediction Yes No
## Yes 1194 512
## No 329 649
accuracy(log_aug, truth = high_achiever, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.687
precision(log_aug, truth = high_achiever, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 precision binary 0.700
recall(log_aug, truth = high_achiever, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 recall binary 0.784
f_meas(log_aug, truth = high_achiever, estimate = .pred_class)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 f_meas binary 0.740
# ROC curve
log_aug %>%
roc_curve(truth = high_achiever, .pred_Yes) %>%
autoplot() +
labs(title = "ROC Curve - Logistic Regression",
subtitle = "AUC = 0.743 — solid discriminative ability")
roc_auc(log_aug, truth = high_achiever, .pred_Yes)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.743
Model selection: The full model (AIC = 12,999) significantly outperforms the reduced model (AIC = 13,173) by a likelihood ratio test (p < 0.001). Both school type and child sex add meaningful predictive information beyond demographics and involvement scores.
Key findings from odds ratios (recall: OR < 1 = higher chance of being a high achiever):
DSBLTYNo_disability OR = 0.347 — the strongest
predictor. Children without a disability have 65% lower odds of being a
non-high-achiever (i.e., dramatically higher odds of getting mostly
A’s).PARGRADEXGrad_school OR = 0.392 — children of
graduate-educated parents have 61% lower odds of being
non-high-achievers compared to children of parents without a high school
diploma.CSEXFemale OR = 0.576 — girls have 42% lower odds of
being non-high-achievers (i.e., girls are more likely to earn mostly A’s
than boys).RACEETHBlack OR = 1.82 — Black children have 82% higher
odds of being non-high-achievers compared to White children, even after
controlling for income and parent education.TTLHHINC OR = 0.948 — each income category increase is
associated with slightly lower odds of being a non-high-achiever.Test performance: Accuracy = 68.7%, Precision = 70.0%, Recall = 78.4%, F1 = 74.0%, AUC = 0.743. The model performs well above random chance (50% accuracy for a balanced problem). Recall is higher than precision, meaning the model correctly identifies most true high achievers but also misclassifies some non-high-achievers as high achievers. AUC of 0.743 indicates solid discriminative ability.
Research question: What factors predict which of four grade categories (A’s, B’s, C’s, D’s or lower) a child falls into?
Why multinomial regression? Unlike logistic regression which requires a binary outcome, multinomial logistic regression handles responses with more than two unordered categories. It fits \(K-1\) equations simultaneously (here, \(K=4\) grade categories), each comparing one category to the baseline (Mostly A’s).
Addressing class imbalance with upsampling: The
original training data has 6,051 A students but only 194 D students — a
31:1 ratio. A model trained on this data will learn to almost never
predict D’s because it is rewarded for predicting the majority class. We
use step_upsample() from the {themis} package
to randomly duplicate rows from the minority classes until they reach
50% of the majority class size (over_ratio = 0.5). This is
applied only to training data — the test set remains in
its natural distribution to provide an honest evaluation.
multi_spec <- multinom_reg() %>%
set_engine("nnet") %>%
set_mode("classification")
multi_full <- SEGRADES ~ school_involve_score + home_engage_score +
PARGRADEX + TTLHHINC + HHPARN19_BRD +
EDCPUB + RACEETH + CSEX + DSBLTY
multi_reduced <- SEGRADES ~ school_involve_score + home_engage_score +
PARGRADEX + TTLHHINC +
HHPARN19_BRD + RACEETH
# Upsampling recipe — applied to training data only
multi_recipe <- recipe(SEGRADES ~ school_involve_score + home_engage_score +
PARGRADEX + TTLHHINC + HHPARN19_BRD +
EDCPUB + RACEETH + CSEX + DSBLTY,
data = train_model) %>%
step_upsample(SEGRADES, over_ratio = 0.5, seed = 631)
train_upsampled <- multi_recipe %>%
prep() %>%
bake(new_data = NULL)
cat("Class distribution after upsampling:\n")
## Class distribution after upsampling:
train_upsampled %>% count(SEGRADES)
## # A tibble: 4 × 2
## SEGRADES n
## <fct> <int>
## 1 Mostly_As 6051
## 2 Mostly_Bs 3461
## 3 Mostly_Cs 3025
## 4 Mostly_Ds_lower 3025
# Candidate Model 1: Full
multi_mod_full <- multi_spec %>%
fit(multi_full, data = train_upsampled)
multi_mod_full <- repair_call(multi_mod_full, data = train_upsampled)
# Candidate Model 2: Reduced
multi_mod_reduced <- multi_spec %>%
fit(multi_reduced, data = train_upsampled)
multi_mod_reduced <- repair_call(multi_mod_reduced, data = train_upsampled)
cat("Multinomial full AIC: ", AIC(extract_fit_engine(multi_mod_full)), "\n")
## Multinomial full AIC: 35346.67
cat("Multinomial reduced AIC:", AIC(extract_fit_engine(multi_mod_reduced)), "\n")
## Multinomial reduced AIC: 38724.67
# Coefficients
tidy(multi_mod_full) %>% print(n = Inf)
## # A tibble: 48 × 6
## y.level term estimate std.error statistic p.value
## <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 Mostly_Bs (Intercept) 1.25 0.137 9.12 7.24e- 20
## 2 Mostly_Bs school_involve_score -0.0289 0.0124 -2.34 1.95e- 2
## 3 Mostly_Bs home_engage_score -0.0468 0.0125 -3.74 1.81e- 4
## 4 Mostly_Bs PARGRADEXHS_grad -0.0644 0.115 -0.561 5.75e- 1
## 5 Mostly_Bs PARGRADEXSome_college -0.235 0.106 -2.21 2.68e- 2
## 6 Mostly_Bs PARGRADEXCollege_grad -0.527 0.110 -4.78 1.77e- 6
## 7 Mostly_Bs PARGRADEXGrad_school -0.923 0.116 -7.95 1.94e- 15
## 8 Mostly_Bs TTLHHINC -0.0391 0.00937 -4.17 3.01e- 5
## 9 Mostly_Bs HHPARN19_BRDSingle_pa… 0.178 0.0536 3.33 8.79e- 4
## 10 Mostly_Bs EDCPUBNot_public -0.0785 0.0719 -1.09 2.75e- 1
## 11 Mostly_Bs RACEETHBlack 0.457 0.0787 5.81 6.30e- 9
## 12 Mostly_Bs RACEETHHispanic 0.312 0.0580 5.37 7.82e- 8
## 13 Mostly_Bs RACEETHAsian_PI -0.276 0.0965 -2.86 4.18e- 3
## 14 Mostly_Bs RACEETHOther 0.241 0.0913 2.64 8.35e- 3
## 15 Mostly_Bs CSEXFemale -0.527 0.0446 -11.8 2.72e- 32
## 16 Mostly_Bs DSBLTYNo_disability -0.801 0.0551 -14.5 6.57e- 48
## 17 Mostly_Cs (Intercept) 2.34 0.149 15.7 2.52e- 55
## 18 Mostly_Cs school_involve_score -0.0689 0.0138 -4.99 6.05e- 7
## 19 Mostly_Cs home_engage_score -0.145 0.0139 -10.4 1.65e- 25
## 20 Mostly_Cs PARGRADEXHS_grad 0.385 0.125 3.07 2.13e- 3
## 21 Mostly_Cs PARGRADEXSome_college 0.175 0.118 1.49 1.36e- 1
## 22 Mostly_Cs PARGRADEXCollege_grad -0.369 0.124 -2.96 3.04e- 3
## 23 Mostly_Cs PARGRADEXGrad_school -0.956 0.134 -7.13 1.03e- 12
## 24 Mostly_Cs TTLHHINC -0.0742 0.0103 -7.18 7.19e- 13
## 25 Mostly_Cs HHPARN19_BRDSingle_pa… 0.381 0.0575 6.62 3.64e- 11
## 26 Mostly_Cs EDCPUBNot_public -0.595 0.0988 -6.02 1.74e- 9
## 27 Mostly_Cs RACEETHBlack 0.662 0.0820 8.07 7.05e- 16
## 28 Mostly_Cs RACEETHHispanic 0.312 0.0642 4.85 1.22e- 6
## 29 Mostly_Cs RACEETHAsian_PI -1.07 0.158 -6.77 1.28e- 11
## 30 Mostly_Cs RACEETHOther 0.331 0.101 3.28 1.02e- 3
## 31 Mostly_Cs CSEXFemale -0.880 0.0512 -17.2 2.80e- 66
## 32 Mostly_Cs DSBLTYNo_disability -1.82 0.0562 -32.3 1.23e-228
## 33 Mostly_Ds_lower (Intercept) 3.03 0.156 19.4 1.33e- 83
## 34 Mostly_Ds_lower school_involve_score -0.151 0.0149 -10.1 6.76e- 24
## 35 Mostly_Ds_lower home_engage_score -0.141 0.0149 -9.42 4.39e- 21
## 36 Mostly_Ds_lower PARGRADEXHS_grad 0.573 0.131 4.37 1.22e- 5
## 37 Mostly_Ds_lower PARGRADEXSome_college 0.390 0.123 3.17 1.55e- 3
## 38 Mostly_Ds_lower PARGRADEXCollege_grad -0.338 0.132 -2.55 1.07e- 2
## 39 Mostly_Ds_lower PARGRADEXGrad_school -1.25 0.149 -8.35 6.88e- 17
## 40 Mostly_Ds_lower TTLHHINC -0.106 0.0111 -9.49 2.29e- 21
## 41 Mostly_Ds_lower HHPARN19_BRDSingle_pa… 0.557 0.0610 9.12 7.28e- 20
## 42 Mostly_Ds_lower EDCPUBNot_public -2.55 0.229 -11.2 7.16e- 29
## 43 Mostly_Ds_lower RACEETHBlack 0.570 0.0870 6.55 5.77e- 11
## 44 Mostly_Ds_lower RACEETHHispanic 0.208 0.0696 2.99 2.78e- 3
## 45 Mostly_Ds_lower RACEETHAsian_PI -1.88 0.256 -7.35 2.00e- 13
## 46 Mostly_Ds_lower RACEETHOther 0.576 0.104 5.55 2.90e- 8
## 47 Mostly_Ds_lower CSEXFemale -0.897 0.0557 -16.1 2.85e- 58
## 48 Mostly_Ds_lower DSBLTYNo_disability -2.62 0.0604 -43.4 0
# Test set evaluation
multi_aug <- augment(multi_mod_full, new_data = test_model)
conf_mat(multi_aug, truth = SEGRADES, estimate = .pred_class)
## Truth
## Prediction Mostly_As Mostly_Bs Mostly_Cs Mostly_Ds_lower
## Mostly_As 1308 546 117 16
## Mostly_Bs 50 65 20 4
## Mostly_Cs 36 66 27 7
## Mostly_Ds_lower 129 182 79 32
# Macro-averaged metrics
multi_metrics <- metric_set(accuracy, precision, recall, f_meas)
multi_metrics(multi_aug,
truth = SEGRADES,
estimate = .pred_class,
estimator = "macro")
## # A tibble: 4 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.534
## 2 precision macro 0.350
## 3 recall macro 0.397
## 4 f_meas macro 0.288
# Predicted vs actual bar chart
multi_aug %>%
count(SEGRADES, .pred_class) %>%
ggplot(aes(x = SEGRADES, y = n, fill = .pred_class)) +
geom_bar(stat = "identity") +
labs(title = "Multinomial: Predicted vs Actual Grade Category",
subtitle = "Upsampling allows the model to predict all four classes",
x = "Actual Grade", y = "Count", fill = "Predicted")
Model selection: Full model (AIC = 35,347) substantially outperforms reduced (AIC = 38,725). The additional predictors — school type, sex, and disability — provide meaningful information about grade category.
repair_call() note: This tidymodels
utility fixes an internal issue where the model’s stored call references
data = data instead of the actual data object name. This is
required for augment() to work correctly with multinomial
models.
Upsampling effect: After upsampling, the training data has 6,051 A’s, 3,461 B’s, 3,025 C’s, and 3,025 D’s — much more balanced. The model now predicts all four classes rather than ignoring the minority classes entirely.
Key coefficient patterns (all relative to Mostly_As baseline):
DSBLTYNo_disability is the strongest predictor across
all grade levels. The coefficient grows in magnitude from B’s (-0.80) to
C’s (-1.82) to D’s (-2.62), meaning the disability gap widens
dramatically at lower grade levels.CSEXFemale is consistently negative across all levels —
girls are less likely to be in any lower grade category relative to
A’s.EDCPUBNot_public has a very large negative effect for
D’s (-2.55), meaning students in non-public schools are much less likely
to receive D’s or below, possibly reflecting selection effects.Test performance: Accuracy = 53.4% (vs 25% random chance for 4 classes), macro precision = 35.0%, macro recall = 39.7%, macro F1 = 28.8%. The macro averaging treats all classes equally, so the low-support D category pulls down the aggregate metrics. The confusion matrix shows the model now makes predictions across all four classes — a meaningful improvement over the pre-upsampling model that never predicted D’s.
Research question: Can we classify children as high achievers using only continuous numeric predictors?
Why LDA/QDA? LDA and QDA are alternative classification methods that take a different approach to logistic regression. Rather than modeling the probability of the outcome directly, they model the distribution of the predictors within each class and use Bayes’ theorem to derive posterior class probabilities.
Predictor selection: LDA and QDA require continuous
numeric predictors. We use school_involve_score,
home_engage_score, FODINNERX (days eating
dinner together), TTLHHINC (income), and
NUMSIBSX (number of siblings) — all continuous variables
available in the dataset.
lda_formula <- high_achiever ~ school_involve_score + home_engage_score +
FODINNERX + TTLHHINC + NUMSIBSX
train_lda <- train_model %>%
drop_na(school_involve_score, home_engage_score,
FODINNERX, TTLHHINC, NUMSIBSX)
test_lda <- test_model %>%
drop_na(school_involve_score, home_engage_score,
FODINNERX, TTLHHINC, NUMSIBSX)
# LDA
lda_mod <- discrim_linear() %>%
set_engine("MASS") %>%
set_mode("classification") %>%
fit(lda_formula, data = train_lda)
lda_aug <- augment(lda_mod, new_data = test_lda)
lda_acc <- accuracy(lda_aug, truth = high_achiever, estimate = .pred_class)
lda_prec <- precision(lda_aug, truth = high_achiever, estimate = .pred_class)
lda_rec <- recall(lda_aug, truth = high_achiever, estimate = .pred_class)
lda_f1 <- f_meas(lda_aug, truth = high_achiever, estimate = .pred_class)
lda_auc <- roc_auc(lda_aug, truth = high_achiever, .pred_Yes)
cat("LDA Accuracy: ", lda_acc$.estimate, "\n")
## LDA Accuracy: 0.6322653
cat("LDA Precision:", lda_prec$.estimate, "\n")
## LDA Precision: 0.6437768
cat("LDA Recall: ", lda_rec$.estimate, "\n")
## LDA Recall: 0.7879186
cat("LDA F1: ", lda_f1$.estimate, "\n")
## LDA F1: 0.7085917
cat("LDA AUC: ", lda_auc$.estimate, "\n")
## LDA AUC: 0.6511323
lda_aug %>%
roc_curve(truth = high_achiever, .pred_Yes) %>%
autoplot() +
labs(title = "ROC Curve - LDA",
subtitle = "AUC = 0.651")
# QDA
qda_mod <- discrim_quad() %>%
set_engine("MASS") %>%
set_mode("classification") %>%
fit(lda_formula, data = train_lda)
qda_aug <- augment(qda_mod, new_data = test_lda)
qda_acc <- accuracy(qda_aug, truth = high_achiever, estimate = .pred_class)
qda_prec <- precision(qda_aug, truth = high_achiever, estimate = .pred_class)
qda_rec <- recall(qda_aug, truth = high_achiever, estimate = .pred_class)
qda_f1 <- f_meas(qda_aug, truth = high_achiever, estimate = .pred_class)
qda_auc <- roc_auc(qda_aug, truth = high_achiever, .pred_Yes)
cat("QDA Accuracy: ", qda_acc$.estimate, "\n")
## QDA Accuracy: 0.6285395
cat("QDA Precision:", qda_prec$.estimate, "\n")
## QDA Precision: 0.641246
cat("QDA Recall: ", qda_rec$.estimate, "\n")
## QDA Recall: 0.783979
cat("QDA F1: ", qda_f1$.estimate, "\n")
## QDA F1: 0.7054653
cat("QDA AUC: ", qda_auc$.estimate, "\n")
## QDA AUC: 0.6498111
qda_aug %>%
roc_curve(truth = high_achiever, .pred_Yes) %>%
autoplot() +
labs(title = "ROC Curve - QDA",
subtitle = "AUC = 0.650")
# Final comparison table
tibble(
Method = c("Logistic", "LDA", "QDA"),
Accuracy = c(accuracy(log_aug, truth = high_achiever,
estimate = .pred_class)$.estimate,
lda_acc$.estimate, qda_acc$.estimate),
Precision = c(precision(log_aug, truth = high_achiever,
estimate = .pred_class)$.estimate,
lda_prec$.estimate, qda_prec$.estimate),
Recall = c(recall(log_aug, truth = high_achiever,
estimate = .pred_class)$.estimate,
lda_rec$.estimate, qda_rec$.estimate),
F1 = c(f_meas(log_aug, truth = high_achiever,
estimate = .pred_class)$.estimate,
lda_f1$.estimate, qda_f1$.estimate),
AUC = c(roc_auc(log_aug, truth = high_achiever,
.pred_Yes)$.estimate,
lda_auc$.estimate, qda_auc$.estimate)
) %>%
arrange(desc(AUC))
## # A tibble: 3 × 6
## Method Accuracy Precision Recall F1 AUC
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Logistic 0.687 0.700 0.784 0.740 0.743
## 2 LDA 0.632 0.644 0.788 0.709 0.651
## 3 QDA 0.629 0.641 0.784 0.705 0.650
LDA vs QDA comparison: LDA and QDA perform nearly identically — LDA AUC = 0.651, QDA AUC = 0.650. The negligible difference suggests that the equal-covariance assumption underlying LDA is adequate here. QDA’s additional flexibility in allowing different covariance structures per class does not improve predictions meaningfully, likely because the sample size relative to the number of parameters is sufficient for LDA but not so different for QDA.
Logistic vs LDA/QDA: Logistic regression (AUC = 0.743) outperforms both LDA (0.651) and QDA (0.650). This is expected — logistic regression uses both continuous and categorical predictors, while LDA/QDA are restricted to the five continuous variables available. The categorical variables (parent education, race, disability status) carry substantial predictive information that LDA/QDA cannot access in this formulation.
cat("=== NEGATIVE BINOMIAL (Response: FSFREQ) ===\n")
## === NEGATIVE BINOMIAL (Response: FSFREQ) ===
cat("Poisson AIC: ", AIC(pois_mod), "\n")
## Poisson AIC: 104067.7
cat("NB full AIC: ", AIC(nb_mod), "\n")
## NB full AIC: 69124.06
cat("AIC improvement over Poisson:", AIC(pois_mod) - AIC(nb_mod), "\n")
## AIC improvement over Poisson: 34943.64
cat("Test RMSE: ", nb_rmse, "\n")
## Test RMSE: 8.117712
cat("Theta (dispersion param): ", nb_mod$theta, "\n\n")
## Theta (dispersion param): 2.068716
cat("=== LOGISTIC REGRESSION (Response: high_achiever) ===\n")
## === LOGISTIC REGRESSION (Response: high_achiever) ===
print(accuracy(log_aug, truth = high_achiever, estimate = .pred_class))
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.687
print(precision(log_aug, truth = high_achiever, estimate = .pred_class))
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 precision binary 0.700
print(recall(log_aug, truth = high_achiever, estimate = .pred_class))
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 recall binary 0.784
print(f_meas(log_aug, truth = high_achiever, estimate = .pred_class))
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 f_meas binary 0.740
print(roc_auc(log_aug, truth = high_achiever, .pred_Yes))
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 roc_auc binary 0.743
cat("\n=== MULTINOMIAL REGRESSION (Response: SEGRADES) ===\n")
##
## === MULTINOMIAL REGRESSION (Response: SEGRADES) ===
multi_metrics(multi_aug,
truth = SEGRADES,
estimate = .pred_class,
estimator = "macro")
## # A tibble: 4 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy multiclass 0.534
## 2 precision macro 0.350
## 3 recall macro 0.397
## 4 f_meas macro 0.288
cat("\n=== LDA vs QDA (Response: high_achiever) ===\n")
##
## === LDA vs QDA (Response: high_achiever) ===
cat("LDA — Accuracy:", lda_acc$.estimate,
"| F1:", lda_f1$.estimate,
"| AUC:", lda_auc$.estimate, "\n")
## LDA — Accuracy: 0.6322653 | F1: 0.7085917 | AUC: 0.6511323
cat("QDA — Accuracy:", qda_acc$.estimate,
"| F1:", qda_f1$.estimate,
"| AUC:", qda_auc$.estimate, "\n")
## QDA — Accuracy: 0.6285395 | F1: 0.7054653 | AUC: 0.6498111
Overall conclusions:
The Negative Binomial model is the clear winner for the count outcome, with an AIC improvement of ~34,944 points over standard Poisson. School involvement score dominates as the strongest predictor of participation frequency.
The Logistic regression model is the best classifier for the binary high-achiever outcome (AUC = 0.743, F1 = 0.740), outperforming LDA and QDA which are limited to continuous predictors.
The Multinomial regression model benefits substantially from upsampling, enabling predictions across all four grade categories. While macro F1 (0.288) appears modest, this reflects the difficulty of the 4-class problem and the genuine rarity of D-level students. The model correctly identifies patterns and coefficient directions are all substantively meaningful.
LDA and QDA perform comparably to each other, confirming that the linear decision boundary assumption is adequate for this problem. Their lower performance relative to logistic regression is attributable to their restriction to continuous predictors rather than a failure of the method itself.