Modeling Process Summary

Libraries

library(MASS)
library(tidyverse)
library(tidymodels)
library(readxl)
library(GGally)
library(ggformula)
library(discrim)
library(nnet)
library(themis)

We load MASS first intentionally — it has a select() function that conflicts with dplyr::select(). Loading it before tidyverse ensures that dplyr::select() takes precedence throughout the document. We use dplyr::select() explicitly wherever needed as an additional safeguard.

1. Data Loading

pfi <- read_excel("pfi-data-2019.xlsx")
cat("Rows:", nrow(pfi), "| Cols:", ncol(pfi), "\n")

## Rows: 15500 | Cols: 75

We use only the 2019 survey year. The dataset contains 15,500 observations and 75 variables. We read it directly from the Excel file as required by the project instructions.

2. Variable Selection and Missing Value Treatment

pfi <- pfi %>%
  dplyr::select(
    BASMID,
    SEGRADES, SEGRADEQ, SEENJOY, SEABSNT, SEFUTUREX,
    FSFREQ,
    FSSPORTX, FSVOL, FSMTNG, FSPTMTNG,
    FSATCNFN, FSFUNDRS, FSCOMMTE, FSCOUNSLR,
    FSNOTESX, FSMEMO,
    FCSCHOOL, FCTEACHR, FCSTDS, FCORDER, FCSUPPRT,
    FHHOME, FHWKHRS, FHCHECKX, FHHELP,
    FOSTORY2X, FOCRAFTS, FOGAMES, FOBUILDX,
    FOSPORT, FORESPON, FOHISTX,
    FODINNERX, FOLIBRAYX, FOBOOKSTX,
    EDCPUB, SPUBCHOIX, SCONSIDR, SCCHOICE,
    CSEX, RACEETH, PARGRADEX, TTLHHINC,
    HHPARN19_BRD, DSBLTY, INTACC, CENREG,
    ALLGRADEX, P1MRSTA, HHTOTALXX, NUMSIBSX
  )

# Recode all negative values as NA
# In PFI data, negative codes (-1, -9) represent valid skips or missing
# These are not real values and must be treated as NA
pfi <- pfi %>%
  mutate(across(everything(), ~ ifelse(. < 0, NA, .)))

# Missingness summary
missing_tbl <- pfi %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(),
               names_to  = "variable",
               values_to = "n_missing") %>%
  mutate(pct_missing = round(n_missing / nrow(pfi) * 100, 1)) %>%
  arrange(desc(pct_missing))

print(missing_tbl, n = 25)

## # A tibble: 52 × 3
##    variable  n_missing pct_missing
##    <chr>         <int>       <dbl>
##  1 FHWKHRS        1047         6.8
##  2 FHCHECKX       1047         6.8
##  3 FHHELP         1047         6.8
##  4 FSFREQ          126         0.8
##  5 FSSPORTX        126         0.8
##  6 FSVOL           126         0.8
##  7 FSMTNG          126         0.8
##  8 FSPTMTNG        126         0.8
##  9 FSATCNFN        126         0.8
## 10 FSFUNDRS        126         0.8
## 11 FSCOMMTE        126         0.8
## 12 FSCOUNSLR       126         0.8
## 13 FSNOTESX         53         0.3
## 14 FSMEMO           53         0.3
## 15 FCSCHOOL         53         0.3
## 16 FCTEACHR         53         0.3
## 17 FCSTDS           53         0.3
## 18 FCORDER          53         0.3
## 19 FCSUPPRT         53         0.3
## 20 FHHOME           53         0.3
## 21 BASMID            0         0  
## 22 SEGRADES          2         0  
## 23 SEGRADEQ          2         0  
## 24 SEENJOY           2         0  
## 25 SEABSNT           2         0  
## # ℹ 27 more rows

Variable selection rationale: We selected variables covering five domains: (1) child academic outcomes as response variables, (2) school involvement activities as key predictors, (3) home engagement behaviors, (4) school type and choice, and (5) household demographics as controls.

Missing value treatment: In the PFI dataset, negative values are special codes — -1 means “valid skip” (question not applicable to this respondent) and -9 means missing. Neither represents a real data value, so we recode all negative entries to NA before any analysis.

Missingness assessment: The three homework-related variables (FHWKHRS, FHCHECKX, FHHELP) have 6.8% missing — these apply only to students who do homework, so valid skips account for this. All other variables have less than 1% missing. Overall missingness is low and will not substantially affect our results.

3. Filtering to School-Enrolled Students

pfi <- pfi %>% filter(!is.na(FCSCHOOL))
cat("After filtering to school students:", nrow(pfi), "\n")

## After filtering to school students: 15447

pfi <- pfi %>%
  filter(!is.na(SEGRADES), !is.na(FSFREQ), !is.na(SEGRADEQ))
cat("After dropping missing outcomes:", nrow(pfi), "\n")

## After dropping missing outcomes: 15374

We filter to students who have school satisfaction data (FCSCHOOL not missing), which effectively removes homeschool-only students who were not enrolled in any school. These students have valid skips on all school-related variables and would introduce systematic bias if included. We then remove the small number of rows (73) missing key response variables.

4. Variable Recoding

pfi <- pfi %>%
  mutate(
    SEGRADES = ifelse(SEGRADES == 5, NA, SEGRADES),
    SEGRADES = factor(SEGRADES, levels = 1:4,
                      labels = c("Mostly_As", "Mostly_Bs",
                                 "Mostly_Cs", "Mostly_Ds_lower")),
    SEGRADEQ = factor(SEGRADEQ, levels = 1:5,
                      labels = c("Excellent", "Above_avg", "Average",
                                 "Below_avg", "Failing")),
    SEENJOY  = factor(SEENJOY, levels = 1:4,
                      labels = c("Strongly_agree", "Agree",
                                 "Disagree", "Strongly_disagree")),
    RACEETH  = factor(RACEETH, levels = 1:5,
                      labels = c("White", "Black", "Hispanic",
                                 "Asian_PI", "Other")),
    CSEX     = factor(CSEX, levels = 1:2,
                      labels = c("Male", "Female")),
    HHPARN19_BRD = factor(HHPARN19_BRD, levels = 1:2,
                          labels = c("Two_parents", "Single_parent")),
    DSBLTY   = factor(DSBLTY, levels = 1:2,
                      labels = c("Has_disability", "No_disability")),
    PARGRADEX = factor(PARGRADEX, levels = 1:5,
                       labels = c("Less_than_HS", "HS_grad",
                                  "Some_college", "College_grad",
                                  "Grad_school")),
    EDCPUB   = factor(EDCPUB, levels = 1:2,
                      labels = c("Public", "Not_public")),
    INTACC   = factor(INTACC, levels = 1:4,
                      labels = c("Home_and_cell", "Home_only",
                                 "Cell_only", "No_internet")),
    CENREG   = factor(CENREG, levels = 1:4,
                      labels = c("Northeast", "South", "Midwest", "West")),
    across(c(FSSPORTX, FSVOL, FSMTNG, FSPTMTNG, FSATCNFN,
             FSFUNDRS, FSCOMMTE, FSCOUNSLR,
             FOSTORY2X, FOCRAFTS, FOGAMES, FOBUILDX,
             FOSPORT, FORESPON, FOHISTX,
             FOLIBRAYX, FOBOOKSTX),
           ~ ifelse(. == 1, 1L, ifelse(. == 2, 0L, NA_integer_)))
  )

Key recoding decisions:

SEGRADES code 5 (“School does not give these grades”) is excluded as NA — these students attend schools without traditional grading systems and cannot be meaningfully compared on this outcome. This removes 1,957 observations.
All binary school/home activity variables originally coded 1=Yes, 2=No are recoded to 1=Yes, 0=No for interpretability as numeric scores.
All categorical variables are converted to labeled factors with meaningful reference levels. For PARGRADEX, the reference level is “Less_than_HS” — all education coefficients are interpreted relative to parents without a high school diploma.

5. Composite Scores and Derived Variables

pfi <- pfi %>%
  mutate(
    # Sum of 8 binary school activity indicators (0-8 scale)
    school_involve_score = rowSums(
      dplyr::select(., FSSPORTX, FSVOL, FSMTNG, FSPTMTNG,
                       FSATCNFN, FSFUNDRS, FSCOMMTE, FSCOUNSLR),
      na.rm = TRUE),

    # Sum of 7 binary home engagement indicators (0-7 scale)
    home_engage_score = rowSums(
      dplyr::select(., FOSTORY2X, FOCRAFTS, FOGAMES, FOBUILDX,
                       FOSPORT, FORESPON, FOHISTX),
      na.rm = TRUE),

    # Binary high achiever outcome: Yes = Mostly A's, No = B's or below
    # Levels set as c("Yes", "No") so "Yes" is the positive class
    high_achiever = factor(
      ifelse(SEGRADES == "Mostly_As", "Yes", "No"),
      levels = c("Yes", "No")
    )
  )

cat("SEGRADES distribution:\n")

## SEGRADES distribution:

print(table(pfi$SEGRADES, useNA = "ifany"))

## 
##       Mostly_As       Mostly_Bs       Mostly_Cs Mostly_Ds_lower            <NA> 
##            7574            4320            1280             243            1957

cat("\nHigh achiever distribution:\n")

## 
## High achiever distribution:

print(table(pfi$high_achiever, useNA = "ifany"))

## 
##  Yes   No <NA> 
## 7574 5843 1957

cat("\nFSFREQ summary:\n")

## 
## FSFREQ summary:

print(summary(pfi$FSFREQ))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   4.000   7.018   8.000  99.000

cat("\nSchool involvement score:\n")

## 
## School involvement score:

print(summary(pfi$school_involve_score))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   3.000   4.000   4.322   6.000   8.000

cat("\nFinal dataset:", nrow(pfi), "rows x", ncol(pfi), "cols\n")

## 
## Final dataset: 15374 rows x 55 cols

Composite score rationale: We created two summary scores rather than including all individual binary items as separate predictors. This reduces dimensionality, avoids multicollinearity among highly related items, and produces more interpretable coefficients. school_involve_score (0-8) counts how many of the 8 school activities a household participated in. home_engage_score (0-7) counts how many of 7 home engagement activities occurred in the past week.

Binary outcome: high_achiever was created by dichotomizing SEGRADES — Mostly A’s vs everything else. This provides a clean binary outcome for logistic regression and LDA/QDA while preserving the full 4-level outcome for multinomial regression. The “Yes” level is set first so it serves as the positive class in all classification metrics.

Note on GLM direction: Since high_achiever has levels c("Yes", "No"), the GLM models the probability of the second level (“No”). This means odds ratios less than 1 indicate higher probability of being a high achiever, and odds ratios greater than 1 indicate lower probability. We note this explicitly when interpreting coefficients.

6. Exploratory Data Analysis

6.1 FSFREQ Distribution and Overdispersion Check

pfi %>%
  ggplot(aes(x = FSFREQ)) +
  geom_histogram(bins = 40, fill = "steelblue", color = "white") +
  labs(title = "Distribution of School Participation Count (FSFREQ)",
       subtitle = "Right-skewed count variable — candidate for Poisson regression",
       x = "Times participated in school activities", y = "Count")

cat("FSFREQ mean:     ", mean(pfi$FSFREQ, na.rm = TRUE), "\n")

## FSFREQ mean:      7.018017

cat("FSFREQ variance: ", var(pfi$FSFREQ,  na.rm = TRUE), "\n")

## FSFREQ variance:  85.42458

cat("Ratio (var/mean):",
    var(pfi$FSFREQ, na.rm = TRUE) /
    mean(pfi$FSFREQ, na.rm = TRUE), "\n")

## Ratio (var/mean): 12.17218

Overdispersion finding: The variance-to-mean ratio for FSFREQ is approximately 12.2 — far greater than 1. Standard Poisson regression assumes variance equals the mean (ratio = 1). A ratio this high indicates severe overdispersion, meaning standard Poisson will underestimate standard errors and produce falsely significant results. We address this by using Negative Binomial regression instead, which adds a dispersion parameter to account for the excess variance.

6.2 Grade Distribution

pfi %>%
  filter(!is.na(SEGRADES)) %>%
  ggplot(aes(x = SEGRADES)) +
  geom_bar(fill = "steelblue") +
  labs(title = "Distribution of Child Grades (2019 PFI)",
       subtitle = "Imbalanced — Mostly A's dominates",
       x = "Grade Category", y = "Count")

The grade distribution is heavily right-skewed: 7,574 students receive mostly A’s compared to only 243 receiving mostly D’s or lower. This class imbalance is a modeling concern for multinomial regression — a naive model will learn to ignore the minority classes. We address this with upsampling in Section 10.

6.3 Family Involvement vs Grades

pfi %>%
  filter(!is.na(SEGRADES)) %>%
  ggplot(aes(x = SEGRADES, y = school_involve_score)) +
  geom_boxplot(fill = "steelblue", alpha = 0.6) +
  labs(title = "School Involvement Score by Grade",
       x = "Grade Category", y = "School Involvement Score (0-8)")

pfi %>%
  filter(!is.na(SEGRADES)) %>%
  ggplot(aes(x = SEGRADES, y = home_engage_score)) +
  geom_boxplot(fill = "steelblue", alpha = 0.6) +
  labs(title = "Home Engagement Score by Grade",
       x = "Grade Category", y = "Home Engagement Score (0-7)")

pfi %>%
  filter(!is.na(SEGRADES)) %>%
  ggplot(aes(x = SEGRADES, y = FSFREQ)) +
  geom_boxplot(fill = "steelblue", alpha = 0.6) +
  labs(title = "School Participation Frequency by Grade",
       x = "Grade Category", y = "Times Participated (FSFREQ)")

Higher-achieving students tend to have parents with higher school involvement scores. The relationship with home engagement is weaker but still present. School participation frequency (FSFREQ) shows a similar pattern — parents of A students participate more often. These patterns motivate including both scores as predictors in all models.

6.4 Demographics vs Grades

pfi %>%
  filter(!is.na(SEGRADES), !is.na(PARGRADEX)) %>%
  ggplot(aes(x = PARGRADEX, fill = SEGRADES)) +
  geom_bar(position = "fill") +
  labs(title = "Grade Distribution by Parent Education",
       x = "Parent Education", y = "Proportion") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

pfi %>%
  filter(!is.na(SEGRADES), !is.na(RACEETH)) %>%
  ggplot(aes(x = RACEETH, fill = SEGRADES)) +
  geom_bar(position = "fill") +
  labs(title = "Grade Distribution by Race/Ethnicity",
       x = "Race/Ethnicity", y = "Proportion")

pfi %>%
  filter(!is.na(SEGRADES), !is.na(HHPARN19_BRD)) %>%
  ggplot(aes(x = HHPARN19_BRD, fill = SEGRADES)) +
  geom_bar(position = "fill") +
  labs(title = "Grade Distribution by Household Structure",
       x = "Household Type", y = "Proportion")

Parent education shows a strong gradient — the proportion of A students increases substantially with each educational level. Two-parent households have a higher proportion of A students than single-parent households. Racial and ethnic disparities are visible across all grade categories.

6.5 FSFREQ vs Key Predictors

pfi %>%
  filter(!is.na(PARGRADEX)) %>%
  ggplot(aes(x = PARGRADEX, y = FSFREQ)) +
  geom_boxplot(fill = "steelblue", alpha = 0.6) +
  labs(title = "School Participation by Parent Education",
       x = "Parent Education", y = "FSFREQ") +
  theme(axis.text.x = element_text(angle = 30, hjust = 1))

pfi %>%
  filter(!is.na(EDCPUB)) %>%
  ggplot(aes(x = EDCPUB, y = FSFREQ)) +
  geom_boxplot(fill = "steelblue", alpha = 0.6) +
  labs(title = "School Participation by School Type",
       x = "School Type", y = "FSFREQ")

pfi %>%
  filter(!is.na(HHPARN19_BRD)) %>%
  ggplot(aes(x = HHPARN19_BRD, y = FSFREQ)) +
  geom_boxplot(fill = "steelblue", alpha = 0.6) +
  labs(title = "School Participation by Household Structure",
       x = "Household Type", y = "FSFREQ")

Parents in non-public schools and two-parent households show higher median participation counts. Parent education shows a positive trend — more educated parents participate more. These patterns are consistent with the grade distributions above and motivate our predictor selection for the Negative Binomial model.

7. Data Splitting

set.seed(631)

# Classification dataset
pfi_model <- pfi %>%
  filter(!is.na(SEGRADES), !is.na(high_achiever)) %>%
  drop_na(school_involve_score, home_engage_score,
          PARGRADEX, RACEETH, TTLHHINC,
          HHPARN19_BRD, DSBLTY, EDCPUB, CSEX)

# Count dataset for NB regression
pfi_count <- pfi %>%
  drop_na(FSFREQ, school_involve_score, home_engage_score,
          PARGRADEX, RACEETH, TTLHHINC,
          HHPARN19_BRD, DSBLTY, EDCPUB, CSEX)

split_model <- initial_split(pfi_model, prop = 0.80, strata = SEGRADES)
train_model <- training(split_model)
test_model  <- testing(split_model)

split_count <- initial_split(pfi_count, prop = 0.80, strata = FSFREQ)
train_count <- training(split_count)
test_count  <- testing(split_count)

cat("Classification train/test:", nrow(train_model), "/", nrow(test_model), "\n")

## Classification train/test: 10733 / 2684

cat("Count model train/test:   ", nrow(train_count), "/", nrow(test_count), "\n")

## Count model train/test:    12297 / 3077

We use an 80/20 train/test split for all models. We use strata to ensure the class distribution is preserved in both the training and test sets — this is important given the grade imbalance. We set set.seed(631) for reproducibility. The test set is held out completely and only used for final evaluation — no modeling decisions are made using test set results.

Two separate datasets are created: pfi_model for the classification models (logistic, multinomial, LDA/QDA) and pfi_count for the count model (NB), since they require different complete-case filtering.

8. Negative Binomial Regression

Research question: What factors predict how frequently parents participate in school activities?

Why Negative Binomial? FSFREQ is a count variable (0-99, integer values) making it a candidate for Poisson regression. However, the EDA showed a variance-to-mean ratio of 12.2, indicating severe overdispersion. Standard Poisson regression assumes variance = mean. When this assumption is violated, standard errors are underestimated and hypothesis tests are invalid. Negative Binomial regression adds a dispersion parameter \(\theta\) that allows variance to exceed the mean, making it the appropriate choice here.

count_formula <- FSFREQ ~ school_involve_score + home_engage_score +
                           PARGRADEX + TTLHHINC + HHPARN19_BRD +
                           EDCPUB + RACEETH + CSEX + DSBLTY

# Candidate Model 1: Standard Poisson (baseline for comparison)
pois_mod <- glm(count_formula,
                data   = train_count,
                family = poisson(link = "log"))

# Candidate Model 2: Negative Binomial full model
nb_mod <- glm.nb(count_formula, data = train_count)

# Candidate Model 3: NB reduced (removing EDCPUB, CSEX)
nb_mod_reduced <- glm.nb(
  FSFREQ ~ school_involve_score + home_engage_score +
            PARGRADEX + TTLHHINC + HHPARN19_BRD + RACEETH + DSBLTY,
  data = train_count)

# Model comparison
cat("Poisson AIC:    ", AIC(pois_mod),       "\n")

## Poisson AIC:     104067.7

cat("NB full AIC:    ", AIC(nb_mod),         "\n")

## NB full AIC:     69124.06

cat("NB reduced AIC: ", AIC(nb_mod_reduced), "\n")

## NB reduced AIC:  69180.38

# Likelihood ratio test: full vs reduced NB
anova(nb_mod_reduced, nb_mod, test = "Chisq")

## Likelihood ratio tests of Negative Binomial Models
## 
## Response: FSFREQ
##                                                                                                               Model
## 1                 school_involve_score + home_engage_score + PARGRADEX + TTLHHINC + HHPARN19_BRD + RACEETH + DSBLTY
## 2 school_involve_score + home_engage_score + PARGRADEX + TTLHHINC + HHPARN19_BRD + EDCPUB + RACEETH + CSEX + DSBLTY
##      theta Resid. df    2 x log-lik.   Test    df LR stat.      Pr(Chi)
## 1 2.055859     12283       -69150.38                                   
## 2 2.068716     12281       -69090.06 1 vs 2     2 60.32132 7.971401e-14

# Final model summary
summary(nb_mod)

## 
## Call:
## glm.nb(formula = count_formula, data = train_count, init.theta = 2.068716274, 
##     link = log)
## 
## Coefficients:
##                            Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                0.615088   0.048233  12.752  < 2e-16 ***
## school_involve_score       0.256287   0.004265  60.086  < 2e-16 ***
## home_engage_score         -0.008634   0.004219  -2.046  0.04073 *  
## PARGRADEXHS_grad           0.001707   0.042610   0.040  0.96805    
## PARGRADEXSome_college      0.041614   0.039624   1.050  0.29362    
## PARGRADEXCollege_grad      0.087466   0.041093   2.128  0.03330 *  
## PARGRADEXGrad_school       0.079498   0.042427   1.874  0.06096 .  
## TTLHHINC                   0.012679   0.003226   3.930 8.48e-05 ***
## HHPARN19_BRDSingle_parent -0.031019   0.018665  -1.662  0.09654 .  
## EDCPUBNot_public           0.173806   0.022888   7.594 3.10e-14 ***
## RACEETHBlack              -0.233135   0.026947  -8.652  < 2e-16 ***
## RACEETHHispanic           -0.179418   0.020268  -8.852  < 2e-16 ***
## RACEETHAsian_PI           -0.400010   0.030404 -13.156  < 2e-16 ***
## RACEETHOther              -0.039615   0.030154  -1.314  0.18893    
## CSEXFemale                 0.016807   0.014776   1.137  0.25535    
## DSBLTYNo_disability        0.051454   0.017380   2.960  0.00307 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for Negative Binomial(2.0687) family taken to be 1)
## 
##     Null deviance: 17813  on 12296  degrees of freedom
## Residual deviance: 12239  on 12281  degrees of freedom
## AIC: 69124
## 
## Number of Fisher Scoring iterations: 1
## 
## 
##               Theta:  2.0687 
##           Std. Err.:  0.0330 
## 
##  2 x log-likelihood:  -69090.0640

# Test set evaluation
nb_pred <- predict(nb_mod, newdata = test_count, type = "response")
nb_rmse <- sqrt(mean((test_count$FSFREQ - nb_pred)^2))
cat("\nNB Test RMSE:", nb_rmse, "\n")

## 
## NB Test RMSE: 8.117712

# Predicted vs actual
tibble(actual = test_count$FSFREQ, predicted = nb_pred) %>%
  ggplot(aes(x = actual, y = predicted)) +
  geom_point(alpha = 0.2, color = "steelblue") +
  geom_abline(slope = 1, intercept = 0, color = "red") +
  labs(title = "Negative Binomial: Predicted vs Actual FSFREQ",
       subtitle = "Points near the red line indicate accurate predictions",
       x = "Actual FSFREQ", y = "Predicted FSFREQ")

# Residuals vs fitted
tibble(fitted = fitted(nb_mod),
       resid  = residuals(nb_mod, type = "pearson")) %>%
  ggplot(aes(x = fitted, y = resid)) +
  geom_point(alpha = 0.2, color = "steelblue") +
  geom_hline(yintercept = 0, color = "red") +
  labs(title = "NB Residuals vs Fitted",
       subtitle = "Checking model fit — residuals should be centered at 0",
       x = "Fitted Values", y = "Pearson Residuals")

Model selection: Three models were compared. The Poisson model (AIC = 104,068) is far worse than both NB models, confirming overdispersion. The full NB model (AIC = 69,124) outperforms the reduced NB model (AIC = 69,180). A likelihood ratio test confirms the full model is significantly better (p < 0.001), justifying inclusion of school type (EDCPUB) and child sex (CSEX).

Interpretation of key coefficients (all on the log count scale, exponentiated = multiplicative effect on expected FSFREQ):

school_involve_score: exp(0.256) = 1.29 — each additional school activity participated in multiplies expected participation count by 1.29, holding all else constant. This is the dominant predictor.
EDCPUBNot_public: exp(0.174) = 1.19 — parents of non-public school students participate 19% more often.
HHPARN19_BRDSingle_parent: exp(-0.031) = 0.97 — single parents participate slightly less, though this effect is only marginally significant.
RACEETHBlack: exp(-0.233) = 0.79 — Black families participate 21% less than White families on average.
RACEETHAsian_PI: exp(-0.400) = 0.67 — Asian/Pacific Islander families participate 33% less than White families.
DSBLTYNo_disability: exp(0.051) = 1.05 — families of children without disabilities participate slightly more.

Test performance: RMSE = 8.12 on the test set. Given that FSFREQ ranges from 0-99 with mean 7.0, this reflects the wide spread of the outcome and the presence of extreme values (some parents participate 50-99 times per year). The model captures the central tendency well but struggles with outliers, which is expected given the heavy right tail.

9. Logistic Regression

Research question: What factors predict whether a child is a high academic achiever (mostly A’s)?

Why logistic regression? high_achiever is a binary outcome (Yes/No). Logistic regression models the log-odds of the outcome as a linear combination of predictors, making it the standard approach for binary classification. It produces interpretable odds ratios and allows formal hypothesis testing for each predictor.

Note on coefficient direction: Since high_achiever is coded with “Yes” as the first level, the GLM models \(P(\text{No})\). Odds ratios less than 1 indicate the predictor is associated with a higher probability of being a high achiever. We note this explicitly below.

logistic_full <- high_achiever ~ school_involve_score + home_engage_score +
                                  PARGRADEX + TTLHHINC + HHPARN19_BRD +
                                  EDCPUB + RACEETH + CSEX + DSBLTY

logistic_reduced <- high_achiever ~ school_involve_score + home_engage_score +
                                     PARGRADEX + TTLHHINC +
                                     HHPARN19_BRD + RACEETH + DSBLTY

log_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

# Candidate Model 1: Full
log_mod_full <- log_spec %>%
  fit(logistic_full, data = train_model)

# Candidate Model 2: Reduced (no EDCPUB, CSEX)
log_mod_reduced <- log_spec %>%
  fit(logistic_reduced, data = train_model)

# AIC comparison
cat("Logistic full AIC:   ", glance(log_mod_full)$AIC,    "\n")

## Logistic full AIC:    12999.26

cat("Logistic reduced AIC:", glance(log_mod_reduced)$AIC, "\n")

## Logistic reduced AIC: 13172.71

# Likelihood ratio test
anova(extract_fit_engine(log_mod_reduced),
      extract_fit_engine(log_mod_full),
      test = "Chisq")

## Analysis of Deviance Table
## 
## Model 1: high_achiever ~ school_involve_score + home_engage_score + PARGRADEX + 
##     TTLHHINC + HHPARN19_BRD + RACEETH + DSBLTY
## Model 2: high_achiever ~ school_involve_score + home_engage_score + PARGRADEX + 
##     TTLHHINC + HHPARN19_BRD + EDCPUB + RACEETH + CSEX + DSBLTY
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1     10719      13145                          
## 2     10717      12967  2   177.45 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# Odds ratios — exponentiated coefficients with confidence intervals
tidy(log_mod_full, exponentiate = TRUE, conf.int = TRUE)

## # A tibble: 16 × 7
##    term                 estimate std.error statistic  p.value conf.low conf.high
##    <chr>                   <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
##  1 (Intercept)             7.07    0.132      14.9   6.75e-50    5.47      9.17 
##  2 school_involve_score    0.955   0.0119     -3.91  9.32e- 5    0.933     0.977
##  3 home_engage_score       0.928   0.0120     -6.25  4.07e-10    0.906     0.950
##  4 PARGRADEXHS_grad        1.07    0.112       0.607 5.44e- 1    0.859     1.33 
##  5 PARGRADEXSome_colle…    0.829   0.103      -1.81  7.02e- 2    0.677     1.01 
##  6 PARGRADEXCollege_gr…    0.597   0.108      -4.78  1.79e- 6    0.483     0.737
##  7 PARGRADEXGrad_school    0.392   0.113      -8.27  1.32e-16    0.314     0.489
##  8 TTLHHINC                0.948   0.00911    -5.81  6.39e- 9    0.932     0.966
##  9 HHPARN19_BRDSingle_…    1.27    0.0515      4.64  3.42e- 6    1.15      1.41 
## 10 EDCPUBNot_public        0.811   0.0717     -2.93  3.40e- 3    0.704     0.932
## 11 RACEETHBlack            1.82    0.0749      8.01  1.11e-15    1.57      2.11 
## 12 RACEETHHispanic         1.39    0.0555      5.86  4.49e- 9    1.24      1.54 
## 13 RACEETHAsian_PI         0.659   0.0928     -4.49  6.99e- 6    0.548     0.789
## 14 RACEETHOther            1.11    0.0876      1.20  2.30e- 1    0.935     1.32 
## 15 CSEXFemale              0.576   0.0426    -12.9   2.39e-38    0.530     0.626
## 16 DSBLTYNo_disability     0.347   0.0508    -20.9   7.98e-97    0.314     0.383

# Test set evaluation
log_aug <- augment(log_mod_full, new_data = test_model)

conf_mat(log_aug,  truth = high_achiever, estimate = .pred_class)

##           Truth
## Prediction  Yes   No
##        Yes 1194  512
##        No   329  649

accuracy(log_aug,  truth = high_achiever, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.687

precision(log_aug, truth = high_achiever, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 precision binary         0.700

recall(log_aug,    truth = high_achiever, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 recall  binary         0.784

f_meas(log_aug,    truth = high_achiever, estimate = .pred_class)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 f_meas  binary         0.740

# ROC curve
log_aug %>%
  roc_curve(truth = high_achiever, .pred_Yes) %>%
  autoplot() +
  labs(title = "ROC Curve - Logistic Regression",
       subtitle = "AUC = 0.743 — solid discriminative ability")

roc_auc(log_aug, truth = high_achiever, .pred_Yes)

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.743

Model selection: The full model (AIC = 12,999) significantly outperforms the reduced model (AIC = 13,173) by a likelihood ratio test (p < 0.001). Both school type and child sex add meaningful predictive information beyond demographics and involvement scores.

Key findings from odds ratios (recall: OR < 1 = higher chance of being a high achiever):

DSBLTYNo_disability OR = 0.347 — the strongest predictor. Children without a disability have 65% lower odds of being a non-high-achiever (i.e., dramatically higher odds of getting mostly A’s).
PARGRADEXGrad_school OR = 0.392 — children of graduate-educated parents have 61% lower odds of being non-high-achievers compared to children of parents without a high school diploma.
CSEXFemale OR = 0.576 — girls have 42% lower odds of being non-high-achievers (i.e., girls are more likely to earn mostly A’s than boys).
RACEETHBlack OR = 1.82 — Black children have 82% higher odds of being non-high-achievers compared to White children, even after controlling for income and parent education.
TTLHHINC OR = 0.948 — each income category increase is associated with slightly lower odds of being a non-high-achiever.

Test performance: Accuracy = 68.7%, Precision = 70.0%, Recall = 78.4%, F1 = 74.0%, AUC = 0.743. The model performs well above random chance (50% accuracy for a balanced problem). Recall is higher than precision, meaning the model correctly identifies most true high achievers but also misclassifies some non-high-achievers as high achievers. AUC of 0.743 indicates solid discriminative ability.

10. Multinomial Regression

Research question: What factors predict which of four grade categories (A’s, B’s, C’s, D’s or lower) a child falls into?

Why multinomial regression? Unlike logistic regression which requires a binary outcome, multinomial logistic regression handles responses with more than two unordered categories. It fits \(K-1\) equations simultaneously (here, \(K=4\) grade categories), each comparing one category to the baseline (Mostly A’s).

Addressing class imbalance with upsampling: The original training data has 6,051 A students but only 194 D students — a 31:1 ratio. A model trained on this data will learn to almost never predict D’s because it is rewarded for predicting the majority class. We use step_upsample() from the {themis} package to randomly duplicate rows from the minority classes until they reach 50% of the majority class size (over_ratio = 0.5). This is applied only to training data — the test set remains in its natural distribution to provide an honest evaluation.

multi_spec <- multinom_reg() %>%
  set_engine("nnet") %>%
  set_mode("classification")

multi_full <- SEGRADES ~ school_involve_score + home_engage_score +
                          PARGRADEX + TTLHHINC + HHPARN19_BRD +
                          EDCPUB + RACEETH + CSEX + DSBLTY

multi_reduced <- SEGRADES ~ school_involve_score + home_engage_score +
                              PARGRADEX + TTLHHINC +
                              HHPARN19_BRD + RACEETH

# Upsampling recipe — applied to training data only
multi_recipe <- recipe(SEGRADES ~ school_involve_score + home_engage_score +
                         PARGRADEX + TTLHHINC + HHPARN19_BRD +
                         EDCPUB + RACEETH + CSEX + DSBLTY,
                       data = train_model) %>%
  step_upsample(SEGRADES, over_ratio = 0.5, seed = 631)

train_upsampled <- multi_recipe %>%
  prep() %>%
  bake(new_data = NULL)

cat("Class distribution after upsampling:\n")

## Class distribution after upsampling:

train_upsampled %>% count(SEGRADES)

## # A tibble: 4 × 2
##   SEGRADES            n
##   <fct>           <int>
## 1 Mostly_As        6051
## 2 Mostly_Bs        3461
## 3 Mostly_Cs        3025
## 4 Mostly_Ds_lower  3025

# Candidate Model 1: Full
multi_mod_full <- multi_spec %>%
  fit(multi_full, data = train_upsampled)
multi_mod_full <- repair_call(multi_mod_full, data = train_upsampled)

# Candidate Model 2: Reduced
multi_mod_reduced <- multi_spec %>%
  fit(multi_reduced, data = train_upsampled)
multi_mod_reduced <- repair_call(multi_mod_reduced, data = train_upsampled)

cat("Multinomial full AIC:   ", AIC(extract_fit_engine(multi_mod_full)),    "\n")

## Multinomial full AIC:    35346.67

cat("Multinomial reduced AIC:", AIC(extract_fit_engine(multi_mod_reduced)), "\n")

## Multinomial reduced AIC: 38724.67

# Coefficients
tidy(multi_mod_full) %>% print(n = Inf)

## # A tibble: 48 × 6
##    y.level         term                   estimate std.error statistic   p.value
##    <chr>           <chr>                     <dbl>     <dbl>     <dbl>     <dbl>
##  1 Mostly_Bs       (Intercept)              1.25     0.137       9.12  7.24e- 20
##  2 Mostly_Bs       school_involve_score    -0.0289   0.0124     -2.34  1.95e-  2
##  3 Mostly_Bs       home_engage_score       -0.0468   0.0125     -3.74  1.81e-  4
##  4 Mostly_Bs       PARGRADEXHS_grad        -0.0644   0.115      -0.561 5.75e-  1
##  5 Mostly_Bs       PARGRADEXSome_college   -0.235    0.106      -2.21  2.68e-  2
##  6 Mostly_Bs       PARGRADEXCollege_grad   -0.527    0.110      -4.78  1.77e-  6
##  7 Mostly_Bs       PARGRADEXGrad_school    -0.923    0.116      -7.95  1.94e- 15
##  8 Mostly_Bs       TTLHHINC                -0.0391   0.00937    -4.17  3.01e-  5
##  9 Mostly_Bs       HHPARN19_BRDSingle_pa…   0.178    0.0536      3.33  8.79e-  4
## 10 Mostly_Bs       EDCPUBNot_public        -0.0785   0.0719     -1.09  2.75e-  1
## 11 Mostly_Bs       RACEETHBlack             0.457    0.0787      5.81  6.30e-  9
## 12 Mostly_Bs       RACEETHHispanic          0.312    0.0580      5.37  7.82e-  8
## 13 Mostly_Bs       RACEETHAsian_PI         -0.276    0.0965     -2.86  4.18e-  3
## 14 Mostly_Bs       RACEETHOther             0.241    0.0913      2.64  8.35e-  3
## 15 Mostly_Bs       CSEXFemale              -0.527    0.0446    -11.8   2.72e- 32
## 16 Mostly_Bs       DSBLTYNo_disability     -0.801    0.0551    -14.5   6.57e- 48
## 17 Mostly_Cs       (Intercept)              2.34     0.149      15.7   2.52e- 55
## 18 Mostly_Cs       school_involve_score    -0.0689   0.0138     -4.99  6.05e-  7
## 19 Mostly_Cs       home_engage_score       -0.145    0.0139    -10.4   1.65e- 25
## 20 Mostly_Cs       PARGRADEXHS_grad         0.385    0.125       3.07  2.13e-  3
## 21 Mostly_Cs       PARGRADEXSome_college    0.175    0.118       1.49  1.36e-  1
## 22 Mostly_Cs       PARGRADEXCollege_grad   -0.369    0.124      -2.96  3.04e-  3
## 23 Mostly_Cs       PARGRADEXGrad_school    -0.956    0.134      -7.13  1.03e- 12
## 24 Mostly_Cs       TTLHHINC                -0.0742   0.0103     -7.18  7.19e- 13
## 25 Mostly_Cs       HHPARN19_BRDSingle_pa…   0.381    0.0575      6.62  3.64e- 11
## 26 Mostly_Cs       EDCPUBNot_public        -0.595    0.0988     -6.02  1.74e-  9
## 27 Mostly_Cs       RACEETHBlack             0.662    0.0820      8.07  7.05e- 16
## 28 Mostly_Cs       RACEETHHispanic          0.312    0.0642      4.85  1.22e-  6
## 29 Mostly_Cs       RACEETHAsian_PI         -1.07     0.158      -6.77  1.28e- 11
## 30 Mostly_Cs       RACEETHOther             0.331    0.101       3.28  1.02e-  3
## 31 Mostly_Cs       CSEXFemale              -0.880    0.0512    -17.2   2.80e- 66
## 32 Mostly_Cs       DSBLTYNo_disability     -1.82     0.0562    -32.3   1.23e-228
## 33 Mostly_Ds_lower (Intercept)              3.03     0.156      19.4   1.33e- 83
## 34 Mostly_Ds_lower school_involve_score    -0.151    0.0149    -10.1   6.76e- 24
## 35 Mostly_Ds_lower home_engage_score       -0.141    0.0149     -9.42  4.39e- 21
## 36 Mostly_Ds_lower PARGRADEXHS_grad         0.573    0.131       4.37  1.22e-  5
## 37 Mostly_Ds_lower PARGRADEXSome_college    0.390    0.123       3.17  1.55e-  3
## 38 Mostly_Ds_lower PARGRADEXCollege_grad   -0.338    0.132      -2.55  1.07e-  2
## 39 Mostly_Ds_lower PARGRADEXGrad_school    -1.25     0.149      -8.35  6.88e- 17
## 40 Mostly_Ds_lower TTLHHINC                -0.106    0.0111     -9.49  2.29e- 21
## 41 Mostly_Ds_lower HHPARN19_BRDSingle_pa…   0.557    0.0610      9.12  7.28e- 20
## 42 Mostly_Ds_lower EDCPUBNot_public        -2.55     0.229     -11.2   7.16e- 29
## 43 Mostly_Ds_lower RACEETHBlack             0.570    0.0870      6.55  5.77e- 11
## 44 Mostly_Ds_lower RACEETHHispanic          0.208    0.0696      2.99  2.78e-  3
## 45 Mostly_Ds_lower RACEETHAsian_PI         -1.88     0.256      -7.35  2.00e- 13
## 46 Mostly_Ds_lower RACEETHOther             0.576    0.104       5.55  2.90e-  8
## 47 Mostly_Ds_lower CSEXFemale              -0.897    0.0557    -16.1   2.85e- 58
## 48 Mostly_Ds_lower DSBLTYNo_disability     -2.62     0.0604    -43.4   0

# Test set evaluation
multi_aug <- augment(multi_mod_full, new_data = test_model)

conf_mat(multi_aug, truth = SEGRADES, estimate = .pred_class)

##                  Truth
## Prediction        Mostly_As Mostly_Bs Mostly_Cs Mostly_Ds_lower
##   Mostly_As            1308       546       117              16
##   Mostly_Bs              50        65        20               4
##   Mostly_Cs              36        66        27               7
##   Mostly_Ds_lower       129       182        79              32

# Macro-averaged metrics
multi_metrics <- metric_set(accuracy, precision, recall, f_meas)
multi_metrics(multi_aug,
              truth     = SEGRADES,
              estimate  = .pred_class,
              estimator = "macro")

## # A tibble: 4 × 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 accuracy  multiclass     0.534
## 2 precision macro          0.350
## 3 recall    macro          0.397
## 4 f_meas    macro          0.288

# Predicted vs actual bar chart
multi_aug %>%
  count(SEGRADES, .pred_class) %>%
  ggplot(aes(x = SEGRADES, y = n, fill = .pred_class)) +
  geom_bar(stat = "identity") +
  labs(title = "Multinomial: Predicted vs Actual Grade Category",
       subtitle = "Upsampling allows the model to predict all four classes",
       x = "Actual Grade", y = "Count", fill = "Predicted")

Model selection: Full model (AIC = 35,347) substantially outperforms reduced (AIC = 38,725). The additional predictors — school type, sex, and disability — provide meaningful information about grade category.

repair_call() note: This tidymodels utility fixes an internal issue where the model’s stored call references data = data instead of the actual data object name. This is required for augment() to work correctly with multinomial models.

Upsampling effect: After upsampling, the training data has 6,051 A’s, 3,461 B’s, 3,025 C’s, and 3,025 D’s — much more balanced. The model now predicts all four classes rather than ignoring the minority classes entirely.

Key coefficient patterns (all relative to Mostly_As baseline):

DSBLTYNo_disability is the strongest predictor across all grade levels. The coefficient grows in magnitude from B’s (-0.80) to C’s (-1.82) to D’s (-2.62), meaning the disability gap widens dramatically at lower grade levels.
CSEXFemale is consistently negative across all levels — girls are less likely to be in any lower grade category relative to A’s.
EDCPUBNot_public has a very large negative effect for D’s (-2.55), meaning students in non-public schools are much less likely to receive D’s or below, possibly reflecting selection effects.
Parent education and income show consistent gradients — higher education and income predict better grades.

Test performance: Accuracy = 53.4% (vs 25% random chance for 4 classes), macro precision = 35.0%, macro recall = 39.7%, macro F1 = 28.8%. The macro averaging treats all classes equally, so the low-support D category pulls down the aggregate metrics. The confusion matrix shows the model now makes predictions across all four classes — a meaningful improvement over the pre-upsampling model that never predicted D’s.

11. Linear Discriminant Analysis and Quadratic Discriminant Analysis

Research question: Can we classify children as high achievers using only continuous numeric predictors?

Why LDA/QDA? LDA and QDA are alternative classification methods that take a different approach to logistic regression. Rather than modeling the probability of the outcome directly, they model the distribution of the predictors within each class and use Bayes’ theorem to derive posterior class probabilities.

LDA assumes that the predictors follow a multivariate normal distribution within each class and that the covariance matrices are equal across classes. This produces linear decision boundaries.
QDA relaxes the equal-covariance assumption, allowing each class to have its own covariance matrix. This produces quadratic decision boundaries and is more flexible, but requires more data to estimate.

Predictor selection: LDA and QDA require continuous numeric predictors. We use school_involve_score, home_engage_score, FODINNERX (days eating dinner together), TTLHHINC (income), and NUMSIBSX (number of siblings) — all continuous variables available in the dataset.

lda_formula <- high_achiever ~ school_involve_score + home_engage_score +
                                FODINNERX + TTLHHINC + NUMSIBSX

train_lda <- train_model %>%
  drop_na(school_involve_score, home_engage_score,
          FODINNERX, TTLHHINC, NUMSIBSX)

test_lda <- test_model %>%
  drop_na(school_involve_score, home_engage_score,
          FODINNERX, TTLHHINC, NUMSIBSX)

# LDA
lda_mod <- discrim_linear() %>%
  set_engine("MASS") %>%
  set_mode("classification") %>%
  fit(lda_formula, data = train_lda)

lda_aug  <- augment(lda_mod, new_data = test_lda)
lda_acc  <- accuracy(lda_aug,  truth = high_achiever, estimate = .pred_class)
lda_prec <- precision(lda_aug, truth = high_achiever, estimate = .pred_class)
lda_rec  <- recall(lda_aug,    truth = high_achiever, estimate = .pred_class)
lda_f1   <- f_meas(lda_aug,    truth = high_achiever, estimate = .pred_class)
lda_auc  <- roc_auc(lda_aug,   truth = high_achiever, .pred_Yes)

cat("LDA Accuracy: ", lda_acc$.estimate,  "\n")

## LDA Accuracy:  0.6322653

cat("LDA Precision:", lda_prec$.estimate, "\n")

## LDA Precision: 0.6437768

cat("LDA Recall:   ", lda_rec$.estimate,  "\n")

## LDA Recall:    0.7879186

cat("LDA F1:       ", lda_f1$.estimate,   "\n")

## LDA F1:        0.7085917

cat("LDA AUC:      ", lda_auc$.estimate,  "\n")

## LDA AUC:       0.6511323

lda_aug %>%
  roc_curve(truth = high_achiever, .pred_Yes) %>%
  autoplot() +
  labs(title = "ROC Curve - LDA",
       subtitle = "AUC = 0.651")

# QDA
qda_mod <- discrim_quad() %>%
  set_engine("MASS") %>%
  set_mode("classification") %>%
  fit(lda_formula, data = train_lda)

qda_aug  <- augment(qda_mod, new_data = test_lda)
qda_acc  <- accuracy(qda_aug,  truth = high_achiever, estimate = .pred_class)
qda_prec <- precision(qda_aug, truth = high_achiever, estimate = .pred_class)
qda_rec  <- recall(qda_aug,    truth = high_achiever, estimate = .pred_class)
qda_f1   <- f_meas(qda_aug,    truth = high_achiever, estimate = .pred_class)
qda_auc  <- roc_auc(qda_aug,   truth = high_achiever, .pred_Yes)

cat("QDA Accuracy: ", qda_acc$.estimate,  "\n")

## QDA Accuracy:  0.6285395

cat("QDA Precision:", qda_prec$.estimate, "\n")

## QDA Precision: 0.641246

cat("QDA Recall:   ", qda_rec$.estimate,  "\n")

## QDA Recall:    0.783979

cat("QDA F1:       ", qda_f1$.estimate,   "\n")

## QDA F1:        0.7054653

cat("QDA AUC:      ", qda_auc$.estimate,  "\n")

## QDA AUC:       0.6498111

qda_aug %>%
  roc_curve(truth = high_achiever, .pred_Yes) %>%
  autoplot() +
  labs(title = "ROC Curve - QDA",
       subtitle = "AUC = 0.650")

# Final comparison table
tibble(
  Method    = c("Logistic", "LDA", "QDA"),
  Accuracy  = c(accuracy(log_aug,  truth = high_achiever,
                         estimate = .pred_class)$.estimate,
                lda_acc$.estimate, qda_acc$.estimate),
  Precision = c(precision(log_aug, truth = high_achiever,
                          estimate = .pred_class)$.estimate,
                lda_prec$.estimate, qda_prec$.estimate),
  Recall    = c(recall(log_aug,    truth = high_achiever,
                       estimate = .pred_class)$.estimate,
                lda_rec$.estimate, qda_rec$.estimate),
  F1        = c(f_meas(log_aug,    truth = high_achiever,
                       estimate = .pred_class)$.estimate,
                lda_f1$.estimate, qda_f1$.estimate),
  AUC       = c(roc_auc(log_aug,   truth = high_achiever,
                        .pred_Yes)$.estimate,
                lda_auc$.estimate, qda_auc$.estimate)
) %>%
  arrange(desc(AUC))

## # A tibble: 3 × 6
##   Method   Accuracy Precision Recall    F1   AUC
##   <chr>       <dbl>     <dbl>  <dbl> <dbl> <dbl>
## 1 Logistic    0.687     0.700  0.784 0.740 0.743
## 2 LDA         0.632     0.644  0.788 0.709 0.651
## 3 QDA         0.629     0.641  0.784 0.705 0.650

LDA vs QDA comparison: LDA and QDA perform nearly identically — LDA AUC = 0.651, QDA AUC = 0.650. The negligible difference suggests that the equal-covariance assumption underlying LDA is adequate here. QDA’s additional flexibility in allowing different covariance structures per class does not improve predictions meaningfully, likely because the sample size relative to the number of parameters is sufficient for LDA but not so different for QDA.

Logistic vs LDA/QDA: Logistic regression (AUC = 0.743) outperforms both LDA (0.651) and QDA (0.650). This is expected — logistic regression uses both continuous and categorical predictors, while LDA/QDA are restricted to the five continuous variables available. The categorical variables (parent education, race, disability status) carry substantial predictive information that LDA/QDA cannot access in this formulation.

12. Model Selection Summary

cat("=== NEGATIVE BINOMIAL (Response: FSFREQ) ===\n")

## === NEGATIVE BINOMIAL (Response: FSFREQ) ===

cat("Poisson AIC:              ", AIC(pois_mod), "\n")

## Poisson AIC:               104067.7

cat("NB full AIC:              ", AIC(nb_mod),   "\n")

## NB full AIC:               69124.06

cat("AIC improvement over Poisson:", AIC(pois_mod) - AIC(nb_mod), "\n")

## AIC improvement over Poisson: 34943.64

cat("Test RMSE:                ", nb_rmse,       "\n")

## Test RMSE:                 8.117712

cat("Theta (dispersion param): ", nb_mod$theta,  "\n\n")

## Theta (dispersion param):  2.068716

cat("=== LOGISTIC REGRESSION (Response: high_achiever) ===\n")

## === LOGISTIC REGRESSION (Response: high_achiever) ===

print(accuracy(log_aug,  truth = high_achiever, estimate = .pred_class))

## # A tibble: 1 × 3
##   .metric  .estimator .estimate
##   <chr>    <chr>          <dbl>
## 1 accuracy binary         0.687

print(precision(log_aug, truth = high_achiever, estimate = .pred_class))

## # A tibble: 1 × 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 precision binary         0.700

print(recall(log_aug,    truth = high_achiever, estimate = .pred_class))

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 recall  binary         0.784

print(f_meas(log_aug,    truth = high_achiever, estimate = .pred_class))

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 f_meas  binary         0.740

print(roc_auc(log_aug,   truth = high_achiever, .pred_Yes))

## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 roc_auc binary         0.743

cat("\n=== MULTINOMIAL REGRESSION (Response: SEGRADES) ===\n")

## 
## === MULTINOMIAL REGRESSION (Response: SEGRADES) ===

multi_metrics(multi_aug,
              truth     = SEGRADES,
              estimate  = .pred_class,
              estimator = "macro")

## # A tibble: 4 × 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 accuracy  multiclass     0.534
## 2 precision macro          0.350
## 3 recall    macro          0.397
## 4 f_meas    macro          0.288

cat("\n=== LDA vs QDA (Response: high_achiever) ===\n")

## 
## === LDA vs QDA (Response: high_achiever) ===

cat("LDA — Accuracy:", lda_acc$.estimate,
    "| F1:", lda_f1$.estimate,
    "| AUC:", lda_auc$.estimate, "\n")

## LDA — Accuracy: 0.6322653 | F1: 0.7085917 | AUC: 0.6511323

cat("QDA — Accuracy:", qda_acc$.estimate,
    "| F1:", qda_f1$.estimate,
    "| AUC:", qda_auc$.estimate, "\n")

## QDA — Accuracy: 0.6285395 | F1: 0.7054653 | AUC: 0.6498111

Overall conclusions:

The Negative Binomial model is the clear winner for the count outcome, with an AIC improvement of ~34,944 points over standard Poisson. School involvement score dominates as the strongest predictor of participation frequency.

The Logistic regression model is the best classifier for the binary high-achiever outcome (AUC = 0.743, F1 = 0.740), outperforming LDA and QDA which are limited to continuous predictors.

The Multinomial regression model benefits substantially from upsampling, enabling predictions across all four grade categories. While macro F1 (0.288) appears modest, this reflects the difficulty of the 4-class problem and the genuine rarity of D-level students. The model correctly identifies patterns and coefficient directions are all substantively meaningful.

LDA and QDA perform comparably to each other, confirming that the linear decision boundary assumption is adequate for this problem. Their lower performance relative to logistic regression is attributable to their restriction to continuous predictors rather than a failure of the method itself.

Predicting Educational Success

STA 631 - GLM Project

Udita Bista, Kabin Devkota, Nishan Khanal, Shreya KC

2026-04-24

Summary of Insights

Research Question