This portfolio represents my cumulative learning in STA 631: Statistical Modeling I, where I apply statistical learning methods to analyze real-world educational data using R.
The primary goal of this project is to analyze data from the Parent and Family Involvement in Education (PFI) surveys to understand key factors influencing K–12 student outcomes. More specifically, the objective is to identify the statistical model best able to provide GVSU’s K–12 Connect with evidence-based recommendations on how school choice and family engagement in school activities and homework relate to student performance.
Throughout the portfolio, I demonstrate how each analysis and model contributes to achieving the STA 631 learning objectives, including model selection, interpretation, diagnostics, and effective communication of results.
The data used in this portfolio come from the 2019
Parent and Family Involvement in Education (PFI) survey,
which is part of the National Household Education Surveys (NHES)
program. The PFI survey collects information from parents and guardians
about the type of school their child attends, school choice options,
participation in school activities, homework support at home, and basic
household characteristics. The curated 2019 dataset used here contains
15,500 rows and 75 variables, where each row
represents one K–12 student.
# Core data wrangling + cleaning
library(tidyverse)
library(janitor)
library(readxl)
library(skimr)
# Modeling packages
library(tidymodels)
library(discrim)
library(glmnet)
library(rpart)
library(rpart.plot)
library(randomForest)
# Load the 2019 curated PFI data (raw)
pfi2019 <- read_excel("pfi-data.xlsx", sheet = "curated 2019")
# A compact table of all variables: names, types, missingness
skim(pfi2019)
| Name | pfi2019 |
| Number of rows | 15500 |
| Number of columns | 75 |
| Column type frequency: | |
| numeric | 75 |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| BASMID | 0 | 1 | 2.019111e+10 | 65000.56 | 20191000012 | 2.019106e+10 | 20191113174 | 20191169241 | 20191225500 | ▇▇▇▇▇ |
| ALLGRADEX | 0 | 1 | 9.610000e+00 | 3.86 | 2 | 6.750000e+00 | 10 | 13 | 15 | ▃▅▃▆▇ |
| EDCPUB | 0 | 1 | 1.110000e+00 | 0.31 | 1 | 1.000000e+00 | 1 | 1 | 2 | ▇▁▁▁▁ |
| SCCHOICE | 0 | 1 | 1.370000e+00 | 0.48 | -1 | 1.000000e+00 | 1 | 2 | 2 | ▁▁▁▇▅ |
| SPUBCHOIX | 0 | 1 | 1.900000e+00 | 0.78 | -1 | 1.000000e+00 | 2 | 3 | 3 | ▁▁▇▇▆ |
| SCONSIDR | 0 | 1 | 1.620000e+00 | 0.49 | -1 | 1.000000e+00 | 2 | 2 | 2 | ▁▁▁▅▇ |
| SCHLHRSWK | 0 | 1 | 3.770000e+00 | 0.60 | -1 | 4.000000e+00 | 4 | 4 | 4 | ▁▁▁▁▇ |
| EINTNET | 0 | 1 | 3.920000e+00 | 0.35 | -1 | 4.000000e+00 | 4 | 4 | 4 | ▁▁▁▁▇ |
| MOSTIMPT | 0 | 1 | -5.200000e-01 | 2.22 | -1 | -1.000000e+00 | -1 | -1 | 14 | ▇▁▁▁▁ |
| INTNUM | 0 | 1 | -8.200000e-01 | 0.83 | -1 | -1.000000e+00 | -1 | -1 | 16 | ▇▁▁▁▁ |
| SEENJOY | 0 | 1 | 1.770000e+00 | 0.70 | -1 | 1.000000e+00 | 2 | 2 | 4 | ▁▆▇▂▁ |
| SEGRADES | 0 | 1 | 2.000000e+00 | 1.33 | -1 | 1.000000e+00 | 2 | 2 | 5 | ▁▇▅▂▂ |
| SEABSNT | 0 | 1 | 1.230000e+00 | 0.55 | -1 | 1.000000e+00 | 1 | 1 | 4 | ▁▇▂▁▁ |
| SEGRADEQ | 0 | 1 | 2.040000e+00 | 0.91 | -1 | 1.000000e+00 | 2 | 3 | 5 | ▁▇▇▆▁ |
| FSSPORTX | 0 | 1 | 1.180000e+00 | 0.45 | -1 | 1.000000e+00 | 1 | 1 | 2 | ▁▁▁▇▂ |
| FSVOL | 0 | 1 | 1.560000e+00 | 0.54 | -1 | 1.000000e+00 | 2 | 2 | 2 | ▁▁▁▆▇ |
| FSMTNG | 0 | 1 | 1.130000e+00 | 0.40 | -1 | 1.000000e+00 | 1 | 1 | 2 | ▁▁▁▇▂ |
| FSPTMTNG | 0 | 1 | 1.520000e+00 | 0.55 | -1 | 1.000000e+00 | 2 | 2 | 2 | ▁▁▁▇▇ |
| FSATCNFN | 0 | 1 | 1.250000e+00 | 0.49 | -1 | 1.000000e+00 | 1 | 2 | 2 | ▁▁▁▇▃ |
| FSFUNDRS | 0 | 1 | 1.400000e+00 | 0.54 | -1 | 1.000000e+00 | 1 | 2 | 2 | ▁▁▁▇▆ |
| FSCOMMTE | 0 | 1 | 1.850000e+00 | 0.42 | -1 | 2.000000e+00 | 2 | 2 | 2 | ▁▁▁▁▇ |
| FSCOUNSLR | 0 | 1 | 1.620000e+00 | 0.53 | -1 | 1.000000e+00 | 2 | 2 | 2 | ▁▁▁▅▇ |
| FSFREQ | 0 | 1 | 6.950000e+00 | 9.23 | -1 | 2.000000e+00 | 4 | 8 | 99 | ▇▁▁▁▁ |
| FSNOTESX | 0 | 1 | 1.330000e+00 | 0.49 | -1 | 1.000000e+00 | 1 | 2 | 2 | ▁▁▁▇▅ |
| FSMEMO | 0 | 1 | 1.090000e+00 | 0.32 | -1 | 1.000000e+00 | 1 | 1 | 2 | ▁▁▁▇▁ |
| FCSCHOOL | 0 | 1 | 1.440000e+00 | 0.67 | -1 | 1.000000e+00 | 1 | 2 | 4 | ▁▇▃▁▁ |
| FCTEACHR | 0 | 1 | 1.460000e+00 | 0.67 | -1 | 1.000000e+00 | 1 | 2 | 4 | ▁▇▅▁▁ |
| FCSTDS | 0 | 1 | 1.460000e+00 | 0.68 | -1 | 1.000000e+00 | 1 | 2 | 4 | ▁▇▅▁▁ |
| FCORDER | 0 | 1 | 1.520000e+00 | 0.75 | -1 | 1.000000e+00 | 1 | 2 | 4 | ▁▇▅▁▁ |
| FCSUPPRT | 0 | 1 | 1.590000e+00 | 0.77 | -1 | 1.000000e+00 | 1 | 2 | 4 | ▁▇▅▁▁ |
| FHHOME | 0 | 1 | 3.160000e+00 | 1.11 | -1 | 3.000000e+00 | 3 | 4 | 6 | ▁▁▇▅▁ |
| FHWKHRS | 0 | 1 | 5.360000e+00 | 5.39 | -1 | 2.000000e+00 | 4 | 7 | 75 | ▇▁▁▁▁ |
| FHAMOUNT | 0 | 1 | 1.200000e+00 | 0.88 | -1 | 1.000000e+00 | 1 | 1 | 3 | ▁▁▇▁▁ |
| FHCAMT | 0 | 1 | 1.230000e+00 | 0.80 | -1 | 1.000000e+00 | 1 | 2 | 3 | ▁▁▇▅▁ |
| FHPLACE | 0 | 1 | 1.010000e+00 | 0.66 | -1 | 1.000000e+00 | 1 | 1 | 3 | ▁▁▇▁▁ |
| FHCHECKX | 0 | 1 | 3.080000e+00 | 1.38 | -1 | 3.000000e+00 | 4 | 4 | 4 | ▁▁▂▃▇ |
| FHHELP | 0 | 1 | 2.190000e+00 | 1.55 | -1 | 1.000000e+00 | 2 | 3 | 5 | ▂▇▇▅▅ |
| FOSTORY2X | 0 | 1 | 1.410000e+00 | 0.49 | 1 | 1.000000e+00 | 1 | 2 | 2 | ▇▁▁▁▆ |
| FOCRAFTS | 0 | 1 | 1.560000e+00 | 0.50 | 1 | 1.000000e+00 | 2 | 2 | 2 | ▆▁▁▁▇ |
| FOGAMES | 0 | 1 | 1.460000e+00 | 0.50 | 1 | 1.000000e+00 | 1 | 2 | 2 | ▇▁▁▁▇ |
| FOBUILDX | 0 | 1 | 1.450000e+00 | 0.50 | 1 | 1.000000e+00 | 1 | 2 | 2 | ▇▁▁▁▆ |
| FOSPORT | 0 | 1 | 1.320000e+00 | 0.47 | 1 | 1.000000e+00 | 1 | 2 | 2 | ▇▁▁▁▃ |
| FORESPON | 0 | 1 | 1.300000e+00 | 0.46 | 1 | 1.000000e+00 | 1 | 2 | 2 | ▇▁▁▁▃ |
| FOHISTX | 0 | 1 | 1.460000e+00 | 0.50 | 1 | 1.000000e+00 | 1 | 2 | 2 | ▇▁▁▁▇ |
| FODINNERX | 0 | 1 | 4.780000e+00 | 2.05 | 0 | 3.000000e+00 | 5 | 7 | 7 | ▂▂▅▅▇ |
| FOLIBRAYX | 0 | 1 | 1.680000e+00 | 0.47 | 1 | 1.000000e+00 | 2 | 2 | 2 | ▃▁▁▁▇ |
| FOBOOKSTX | 0 | 1 | 1.670000e+00 | 0.47 | 1 | 1.000000e+00 | 2 | 2 | 2 | ▃▁▁▁▇ |
| HDHEALTH | 0 | 1 | 1.580000e+00 | 0.77 | 1 | 1.000000e+00 | 1 | 2 | 5 | ▇▅▂▁▁ |
| CDOBMM | 0 | 1 | 6.560000e+00 | 3.42 | 1 | 4.000000e+00 | 7 | 9 | 12 | ▇▆▆▆▇ |
| CDOBYY | 0 | 1 | 2.006010e+03 | 3.82 | 1998 | 2.003000e+03 | 2006 | 2009 | 2015 | ▃▇▇▅▃ |
| CSEX | 0 | 1 | 1.480000e+00 | 0.50 | 1 | 1.000000e+00 | 1 | 2 | 2 | ▇▁▁▁▇ |
| CSPEAKX | 0 | 1 | 2.290000e+00 | 0.89 | 1 | 2.000000e+00 | 2 | 2 | 6 | ▇▁▁▁▁ |
| HHTOTALXX | 0 | 1 | 4.050000e+00 | 1.23 | 2 | 3.000000e+00 | 4 | 5 | 10 | ▅▇▁▁▁ |
| RELATION | 0 | 1 | 1.790000e+00 | 1.49 | 1 | 1.000000e+00 | 1 | 2 | 11 | ▇▁▁▁▁ |
| P1REL | 0 | 1 | 1.290000e+00 | 0.97 | 1 | 1.000000e+00 | 1 | 1 | 6 | ▇▁▁▁▁ |
| P1SEX | 0 | 1 | 1.660000e+00 | 0.47 | 1 | 1.000000e+00 | 2 | 2 | 2 | ▅▁▁▁▇ |
| P1MRSTA | 0 | 1 | 1.820000e+00 | 1.37 | 1 | 1.000000e+00 | 1 | 3 | 5 | ▇▁▂▁▁ |
| P1EMPL | 0 | 1 | 1.890000e+00 | 1.70 | 1 | 1.000000e+00 | 1 | 2 | 7 | ▇▁▁▁▁ |
| P1HRSWK | 0 | 1 | 3.251000e+01 | 19.04 | -1 | 2.300000e+01 | 40 | 43 | 80 | ▃▂▇▂▁ |
| P1MTHSWRK | 0 | 1 | 9.680000e+00 | 4.38 | 0 | 1.000000e+01 | 12 | 12 | 12 | ▂▁▁▁▇ |
| P1AGE | 0 | 1 | 4.480000e+01 | 8.70 | 15 | 3.900000e+01 | 44 | 50 | 90 | ▁▇▆▁▁ |
| P2GUARD | 0 | 1 | 1.260000e+00 | 0.44 | 1 | 1.000000e+00 | 1 | 2 | 2 | ▇▁▁▁▃ |
| TTLHHINC | 0 | 1 | 7.230000e+00 | 3.05 | 1 | 5.000000e+00 | 8 | 9 | 12 | ▃▃▃▇▆ |
| OWNRNTHB | 0 | 1 | 1.260000e+00 | 0.47 | 1 | 1.000000e+00 | 1 | 1 | 3 | ▇▁▂▁▁ |
| CHLDNT | 0 | 1 | 1.740000e+00 | 1.01 | 1 | 1.000000e+00 | 1 | 2 | 5 | ▇▅▁▁▁ |
| SEFUTUREX | 0 | 1 | 4.920000e+00 | 1.20 | 1 | 4.000000e+00 | 5 | 6 | 6 | ▂▁▂▇▇ |
| DSBLTY | 0 | 1 | 1.760000e+00 | 0.43 | 1 | 2.000000e+00 | 2 | 2 | 2 | ▂▁▁▁▇ |
| HHPARN19X | 0 | 1 | 1.450000e+00 | 0.79 | 1 | 1.000000e+00 | 1 | 2 | 4 | ▇▂▁▁▁ |
| HHPARN19_BRD | 0 | 1 | 1.260000e+00 | 0.44 | 1 | 1.000000e+00 | 1 | 2 | 2 | ▇▁▁▁▃ |
| NUMSIBSX | 0 | 1 | 1.030000e+00 | 0.96 | 0 | 0.000000e+00 | 1 | 1 | 7 | ▇▂▁▁▁ |
| PARGRADEX | 0 | 1 | 3.620000e+00 | 1.15 | 1 | 3.000000e+00 | 4 | 5 | 5 | ▂▃▇▇▇ |
| RACEETH | 0 | 1 | 2.000000e+00 | 1.29 | 1 | 1.000000e+00 | 1 | 3 | 5 | ▇▂▃▁▁ |
| INTACC | 0 | 1 | 1.140000e+00 | 0.53 | 1 | 1.000000e+00 | 1 | 1 | 4 | ▇▁▁▁▁ |
| CENREG | 0 | 1 | 2.530000e+00 | 1.03 | 1 | 2.000000e+00 | 2 | 3 | 4 | ▃▇▁▅▅ |
| ZIPLOCL | 0 | 1 | 2.281000e+01 | 10.31 | 11 | 1.300000e+01 | 21 | 31 | 43 | ▅▇▁▂▃ |
variable_summary <- tibble(
variable = names(pfi2019),
type = sapply(pfi2019, function(x) class(x)[1]),
unique_values = sapply(pfi2019, function(x) length(unique(x))),
example_values = sapply(pfi2019, function(x) paste0(unique(x)[1:5], collapse = ", "))
)
variable_summary
# Vector of variables we want to keep
vars <- c(
"SEGRADES", # outcome (grades)
# School choice
"EDCPUB",
"SCCHOICE",
"SCONSIDR",
# Family engagement at school
"FSVOL",
"FSMTNG",
"FSPTMTNG",
"FSFUNDRS",
# Homework engagement
"FHHOME",
"FHCHECKX",
"FHHELP",
"FHWKHRS",
# Socioeconomic Status
"TTLHHINC",
"PARGRADEX",
"RACEETH"
)
# Select only these variables
pfi2019_selected <- pfi2019 %>%
select(all_of(vars))
# Check selected variables
glimpse(pfi2019_selected)
## Rows: 15,500
## Columns: 15
## $ SEGRADES <dbl> 1, 5, 3, 1, 1, 5, 1, 5, 1, 5, 2, 1, 2, 2, 2, 1, 5, 1, 3, 1, …
## $ EDCPUB <dbl> 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, …
## $ SCCHOICE <dbl> 2, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 1, 2, 1, 1, …
## $ SCONSIDR <dbl> 2, 2, 1, 1, 2, 2, 2, 1, 2, 2, 1, 1, 2, 2, 2, 2, 1, 2, 1, 2, …
## $ FSVOL <dbl> 2, 1, 1, 1, 2, 2, 2, 2, 1, 1, 2, 1, 2, 2, 2, 2, 1, 2, 1, 2, …
## $ FSMTNG <dbl> 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 1, 2, 1, 1, 1, 1, 1, 1, …
## $ FSPTMTNG <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 1, 1, …
## $ FSFUNDRS <dbl> 1, 2, 2, 1, 1, 1, 2, 1, 1, 1, 2, 1, 2, 2, 1, 2, 1, 1, 1, 1, …
## $ FHHOME <dbl> 4, 4, 2, 3, 3, 3, 3, 3, 4, 4, 2, 3, 4, 4, 3, 2, 6, 1, 3, 4, …
## $ FHCHECKX <dbl> 3, 4, 3, 4, 4, 4, 4, 4, 3, 4, 4, 2, 3, 4, 3, 3, -1, 4, 3, 4,…
## $ FHHELP <dbl> 2, 4, 1, 3, 3, 3, 2, 3, 1, 4, 2, 2, 3, 2, 2, 2, -1, 1, 2, 2,…
## $ FHWKHRS <dbl> 2, 7, 6, 5, 5, 3, 10, 2, 10, 4, 5, 3, 3, 24, 5, 8, -1, 1, 8,…
## $ TTLHHINC <dbl> 8, 5, 3, 2, 10, 10, 3, 12, 11, 7, 3, 10, 8, 5, 8, 8, 12, 7, …
## $ PARGRADEX <dbl> 3, 3, 5, 2, 5, 5, 3, 4, 5, 3, 4, 5, 4, 2, 3, 3, 5, 4, 4, 5, …
## $ RACEETH <dbl> 4, 3, 2, 3, 5, 1, 3, 1, 1, 1, 3, 1, 1, 3, 1, 1, 5, 1, 3, 4, …
Although the full PFI dataset includes many variables, this portfolio focuses on a set of 15 variables chosen for their direct relevance to the research question. These variables were selected by manually reviewing the PFI 2019 codebook and identifying items specifically related to school choice, family engagement in school activities, homework support, and socioeconomic context.
library(flextable)
library(tibble)
pfi_dict <- tibble(
Variable = c(
"SEGRADES",
"FHWKHRS",
"EDCPUB",
"SCCHOICE",
"SCONSIDR",
"FSVOL",
"FSMTNG",
"FSPTMTNG",
"FSFUNDRS",
"FHHOME",
"FHCHECKX",
"FHHELP",
"TTLHHINC",
"PARGRADEX",
"RACEETH"
),
Type = c(
"Categorical",
"Continuous",
"Categorical",
"Categorical",
"Categorical",
"Categorical",
"Categorical",
"Categorical",
"Categorical",
"Categorical",
"Categorical",
"Categorical",
"Categorical",
"Categorical",
"Categorical"
),
Description = c(
"Child's overall grade across all subjects during the school year (Mostly A's, B's, C's, or D's or lower).",
"Weekly hours the child spends on homework outside of school (0-75).",
"Indicator for whether the child attends a public school (1 = Yes, 2 = No).",
"Parent felt they had a choice in selecting their child's school (1 = Yes, 2 = No, -1 = Valid Skip).",
"Parent considered other schools before enrolling the child (1 = Yes, 2 = No, -1 = Valid Skip).",
"Parent volunteered at the school during the year (1 = Yes, 2 = No, -1 = Valid Skip).",
"Parent attended general school meetings (open house / back-to-school night) (1 = Yes, 2 = No, -1 = Valid Skip).",
"Parent attended PTO/PTA meetings (1 = Yes, 2 = No, -1 = Valid Skip).",
"Parent participated in school fundraising activities (1 = Yes, 2 = No, -1 = Valid Skip).",
"Days per week the child does homework at home (-1, 1, 2, 3, 4, 5, 6).",
"How often an adult checks whether homework is completed (-1, 1, 2, 3, 4).",
"Days per week someone in the household helps with homework (-1, 1, 2, 3, 4, 5).",
"Household income category, ordered from low to high (values 1-12).",
"Highest educational attainment of the parent/guardian (values 1-5).",
"Child's race/ethnicity classification (values 1-5)."
)
)
flextable(pfi_dict) |>
set_caption("Table 1. PFI 2019 Data Dictionary") |>
autofit()
Variable | Type | Description |
|---|---|---|
SEGRADES | Categorical | Child's overall grade across all subjects during the school year (Mostly A's, B's, C's, or D's or lower). |
FHWKHRS | Continuous | Weekly hours the child spends on homework outside of school (0-75). |
EDCPUB | Categorical | Indicator for whether the child attends a public school (1 = Yes, 2 = No). |
SCCHOICE | Categorical | Parent felt they had a choice in selecting their child's school (1 = Yes, 2 = No, -1 = Valid Skip). |
SCONSIDR | Categorical | Parent considered other schools before enrolling the child (1 = Yes, 2 = No, -1 = Valid Skip). |
FSVOL | Categorical | Parent volunteered at the school during the year (1 = Yes, 2 = No, -1 = Valid Skip). |
FSMTNG | Categorical | Parent attended general school meetings (open house / back-to-school night) (1 = Yes, 2 = No, -1 = Valid Skip). |
FSPTMTNG | Categorical | Parent attended PTO/PTA meetings (1 = Yes, 2 = No, -1 = Valid Skip). |
FSFUNDRS | Categorical | Parent participated in school fundraising activities (1 = Yes, 2 = No, -1 = Valid Skip). |
FHHOME | Categorical | Days per week the child does homework at home (-1, 1, 2, 3, 4, 5, 6). |
FHCHECKX | Categorical | How often an adult checks whether homework is completed (-1, 1, 2, 3, 4). |
FHHELP | Categorical | Days per week someone in the household helps with homework (-1, 1, 2, 3, 4, 5). |
TTLHHINC | Categorical | Household income category, ordered from low to high (values 1-12). |
PARGRADEX | Categorical | Highest educational attainment of the parent/guardian (values 1-5). |
RACEETH | Categorical | Child's race/ethnicity classification (values 1-5). |
In this section, I recoded the raw PFI variables into meaningful
categories based on the official 2019 PFI codebook. Numerical codes were
converted into readable factors (e.g., Yes/No responses, homework
frequency categories, income groups, parent education levels), and new
variables such as grade_cat, grade_level, and
income_group were created to support different modeling
approaches.
pfi2019_clean <- pfi2019_selected %>%
mutate(
# ------------------ Grades ------------------
grade_cat = case_when(
SEGRADES == 1 ~ "A",
SEGRADES == 2 ~ "B",
SEGRADES == 3 ~ "C",
SEGRADES == 4 ~ "D_or_F",
TRUE ~ NA_character_
),
grade_cat = factor(grade_cat, levels = c("A", "B", "C", "D_or_F")),
grade_level = case_when(
grade_cat %in% c("A", "B") ~ "High",
grade_cat %in% c("C", "D_or_F") ~ "Low",
TRUE ~ NA_character_
),
grade_level = factor(grade_level, levels = c("Low", "High")),
# ------------------ School Choice ------------------
school_public = case_when(
EDCPUB == 1 ~ "Public",
EDCPUB == 2 ~ "NotPublic",
TRUE ~ NA_character_
),
school_public = factor(school_public),
school_choice_feel = case_when(
SCCHOICE == 1 ~ "Yes",
SCCHOICE == 2 ~ "No",
TRUE ~ NA_character_
),
school_choice_feel = factor(school_choice_feel),
considered_other_school = case_when(
SCONSIDR == 1 ~ "Yes",
SCONSIDR == 2 ~ "No",
TRUE ~ NA_character_
),
considered_other_school = factor(considered_other_school),
# ------------------ Family Engagement (School Activities) ------------------
volunteer_school = case_when(
FSVOL == 1 ~ "Yes",
FSVOL == 2 ~ "No",
TRUE ~ NA_character_
),
volunteer_school = factor(volunteer_school),
general_meeting = case_when(
FSMTNG == 1 ~ "Yes",
FSMTNG == 2 ~ "No",
TRUE ~ NA_character_
),
general_meeting = factor(general_meeting),
pto_meeting = case_when(
FSPTMTNG == 1 ~ "Yes",
FSPTMTNG == 2 ~ "No",
TRUE ~ NA_character_
),
pto_meeting = factor(pto_meeting),
fundraising = case_when(
FSFUNDRS == 1 ~ "Yes",
FSFUNDRS == 2 ~ "No",
TRUE ~ NA_character_
),
fundraising = factor(fundraising),
# ------------------ Homework Engagement ------------------
homework_days = case_when(
FHHOME == 1 ~ "Less than once a week",
FHHOME == 2 ~ "1–2 days/week",
FHHOME == 3 ~ "3–4 days/week",
FHHOME == 4 ~ "5+ days/week",
FHHOME == 5 ~ "Never",
FHHOME == 6 ~ "No homework",
TRUE ~ NA_character_
),
homework_days = factor(
homework_days,
levels = c(
"Less than once a week",
"1–2 days/week",
"3–4 days/week",
"5+ days/week",
"Never",
"No homework"
)
),
check_homework = case_when(
FHCHECKX == 1 ~ "Never",
FHCHECKX == 2 ~ "Rarely",
FHCHECKX == 3 ~ "Sometimes",
FHCHECKX == 4 ~ "Always",
TRUE ~ NA_character_
),
check_homework = factor(
check_homework,
levels = c("Never", "Rarely", "Sometimes", "Always"),
ordered = TRUE
),
help_homework = case_when(
FHHELP == 1 ~ "Less than once a week",
FHHELP == 2 ~ "1–2 days/week",
FHHELP == 3 ~ "3–4 days/week",
FHHELP == 4 ~ "5+ days/week",
FHHELP == 5 ~ "Never",
TRUE ~ NA_character_
),
help_homework = factor(
help_homework,
levels = c(
"Less than once a week",
"1–2 days/week",
"3–4 days/week",
"5+ days/week",
"Never"
),
ordered = TRUE
),
homework_hours = if_else(
FHWKHRS >= 0 & FHWKHRS <= 75,
FHWKHRS,
NA_real_
),
# ------------------ Socioeconomic Status ------------------
income_group = case_when(
TTLHHINC %in% 1:4 ~ "Low",
TTLHHINC %in% 5:8 ~ "Medium",
TTLHHINC %in% 9:12 ~ "High",
TRUE ~ NA_character_
),
income_group = factor(income_group, levels = c("Low", "Medium", "High")),
parent_edu = factor(
PARGRADEX,
levels = 1:5,
labels = c(
"Less than HS",
"HS graduate",
"Some college/technical",
"College graduate",
"Graduate school"
),
ordered = TRUE
),
race = factor(
RACEETH,
levels = 1:5,
labels = c("White", "Black", "Hispanic", "Asian", "Other")
)
)
pfi2019_final <- pfi2019_clean %>%
select(
# Outcomes
grade_cat,
grade_level,
homework_hours,
# School choice
school_public,
school_choice_feel,
considered_other_school,
# Family engagement
volunteer_school,
general_meeting,
pto_meeting,
fundraising,
# Homework engagement
homework_days,
check_homework,
help_homework,
# SES variables
income_group,
parent_edu,
race
)
glimpse(pfi2019_final)
## Rows: 15,500
## Columns: 16
## $ grade_cat <fct> A, NA, C, A, A, NA, A, NA, A, NA, B, A, B, B, …
## $ grade_level <fct> High, NA, Low, High, High, NA, High, NA, High,…
## $ homework_hours <dbl> 2, 7, 6, 5, 5, 3, 10, 2, 10, 4, 5, 3, 3, 24, 5…
## $ school_public <fct> Public, Public, NotPublic, Public, Public, Pub…
## $ school_choice_feel <fct> No, No, No, Yes, No, Yes, Yes, Yes, Yes, Yes, …
## $ considered_other_school <fct> No, No, Yes, Yes, No, No, No, Yes, No, No, Yes…
## $ volunteer_school <fct> No, Yes, Yes, Yes, No, No, No, No, Yes, Yes, N…
## $ general_meeting <fct> Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye…
## $ pto_meeting <fct> No, No, Yes, Yes, No, No, No, No, No, No, Yes,…
## $ fundraising <fct> Yes, No, No, Yes, Yes, Yes, No, Yes, Yes, Yes,…
## $ homework_days <fct> 5+ days/week, 5+ days/week, 1–2 days/week, 3–4…
## $ check_homework <ord> Sometimes, Always, Sometimes, Always, Always, …
## $ help_homework <ord> 1–2 days/week, 5+ days/week, Less than once a …
## $ income_group <fct> Medium, Medium, Low, Low, High, High, Low, Hig…
## $ parent_edu <ord> Some college/technical, Some college/technical…
## $ race <fct> Asian, Hispanic, Black, Hispanic, Other, White…
library(skimr)
skim(pfi2019_final)
| Name | pfi2019_final |
| Number of rows | 15500 |
| Number of columns | 16 |
| Column type frequency: | |
| factor | 15 |
| numeric | 1 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| grade_cat | 1967 | 0.87 | FALSE | 4 | A: 7619, B: 4372, C: 1294, D_o: 248 |
| grade_level | 1967 | 0.87 | FALSE | 2 | Hig: 11991, Low: 1542 |
| school_public | 0 | 1.00 | FALSE | 2 | Pub: 13782, Not: 1718 |
| school_choice_feel | 2 | 1.00 | FALSE | 2 | Yes: 9823, No: 5675 |
| considered_other_school | 2 | 1.00 | FALSE | 2 | No: 9634, Yes: 5864 |
| volunteer_school | 126 | 0.99 | FALSE | 2 | No: 8856, Yes: 6518 |
| general_meeting | 126 | 0.99 | FALSE | 2 | Yes: 13090, No: 2284 |
| pto_meeting | 126 | 0.99 | FALSE | 2 | No: 8381, Yes: 6993 |
| fundraising | 126 | 0.99 | FALSE | 2 | Yes: 8981, No: 6393 |
| homework_days | 53 | 1.00 | FALSE | 6 | 3–4: 6023, 5+ : 4710, 1–2: 2738, Les: 982 |
| check_homework | 1047 | 0.93 | TRUE | 4 | Alw: 8420, Som: 3797, Rar: 1533, Nev: 703 |
| help_homework | 1047 | 0.93 | TRUE | 5 | Les: 4582, 1–2: 4016, 3–4: 2815, Nev: 1749 |
| income_group | 0 | 1.00 | FALSE | 3 | Hig: 6388, Med: 5527, Low: 3585 |
| parent_edu | 0 | 1.00 | TRUE | 5 | Col: 4360, Som: 4314, Gra: 4261, HS : 1810 |
| race | 0 | 1.00 | FALSE | 5 | Whi: 8634, His: 3216, Bla: 1497, Asi: 1110 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| homework_hours | 1047 | 0.93 | 5.82 | 5.3 | 0 | 2 | 5 | 8 | 75 | ▇▁▁▁▁ |
The summary shows that most variables in the dataset have very few missing values, so the data are largely complete and ready for analysis. Missing values are concentrated in the grade- and homework-related variables, which is expected because not all students receive letter grades or report homework behavior.
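As a quick cross-check of the skim() output above, the same missingness counts can be tabulated directly (a small sketch using only the packages already loaded):
# Sketch: count missing values per column (mirrors skim's n_missing column)
pfi2019_final |>
  summarise(across(everything(), ~ sum(is.na(.x)))) |>
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") |>
  arrange(desc(n_missing))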
pfi2019_final %>%
drop_na(grade_cat) %>%
ggplot(aes(x = grade_cat, fill = grade_cat)) +
geom_bar() +
labs(
title = "Distribution of Student Grade Categories",
x = "Grade Category",
y = "Number of Students"
) +
theme_minimal()
Most students report earning A’s or B’s, while far fewer report C or D/F grades. This shows the dataset is skewed toward higher-performing students. This imbalance will matter later when building classification models.
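To put numbers on that imbalance, the class proportions can be computed directly (a quick sketch, consistent with the counts in the skim output above):
# Sketch: proportion of students in each grade category
pfi2019_final |>
  drop_na(grade_cat) |>
  count(grade_cat) |>
  mutate(prop = n / sum(n))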
pfi2019_final %>%
drop_na(grade_level) %>%
ggplot(aes(x = grade_level, fill = grade_level)) +
geom_bar() +
labs(
title = "Distribution of Grade Performance Levels",
x = "Performance Level",
y = "Number of Students"
) +
theme_minimal()
Most students are classified as High performance (A/B), with far fewer
in the Low category (C/D/F).
pfi2019_final %>%
drop_na(homework_hours) %>%
ggplot(aes(x = homework_hours)) +
geom_histogram(binwidth = 1, fill = "steelblue") +
labs(
title = "Distribution of Weekly Homework Hours",
x = "Homework Hours Per Week",
y = "Number of Students"
) +
theme_minimal()
Homework hours are mostly concentrated between 0 and 10 hours per week,
with only a small number of students reporting very high hours. This
creates a right-skewed distribution.
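One optional way to look past the skew (a sketch only; the models below use the raw hours) is to re-plot the distribution on a log1p scale, which compresses the long right tail:
# Sketch: log(1 + hours) re-expression of the homework-hours distribution
pfi2019_final |>
  drop_na(homework_hours) |>
  ggplot(aes(x = log1p(homework_hours))) +
  geom_histogram(bins = 30, fill = "steelblue") +
  labs(
    title = "Weekly Homework Hours (log1p Scale)",
    x = "log(1 + homework hours per week)",
    y = "Number of Students"
  ) +
  theme_minimal()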
pfi2019_final %>%
drop_na(grade_cat, school_public) %>%
count(school_public, grade_cat) %>%
group_by(school_public) %>%
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = school_public, y = prop, fill = grade_cat)) +
geom_col(width = 0.7) +
geom_text(aes(label = scales::percent(prop, accuracy = 0.1)),
position = position_stack(vjust = 0.5),
size = 3, color = "black") +
scale_y_continuous(labels = scales::percent_format()) +
coord_flip() +
labs(
title = "Student Grade Outcome by School Type",
x = "School Type (Public / NotPublic)",
y = "Percent of Students",
fill = "Grade category"
) +
theme_minimal(base_size = 13)
Students attending public schools show a more balanced grade
distribution, with relatively higher proportions of B’s and C’s compared
to non-public school students. In contrast, non-public school students
display a much stronger concentration of A’s (over 65%).
library(dplyr)
library(tidyr)
library(ggplot2)
library(scales)
# data frame
df_choice_engage <- pfi2019_final %>%
filter(!is.na(grade_cat)) %>%
select(
grade_cat,
school_choice_feel,
considered_other_school,
volunteer_school,
general_meeting,
pto_meeting,
fundraising
) %>%
pivot_longer(
cols = -grade_cat,
names_to = "Activity",
values_to = "Participation"
) %>%
filter(!is.na(Participation)) %>%
count(Activity, Participation, grade_cat) %>%
group_by(Activity, Participation) %>%
mutate(prop = n / sum(n)) %>%
ungroup()
# facet labels
df_choice_engage <- df_choice_engage %>%
mutate(
Activity = factor(
Activity,
levels = c(
"school_choice_feel",
"considered_other_school",
"volunteer_school",
"general_meeting",
"pto_meeting",
"fundraising"
),
labels = c(
"Felt had school choice",
"Considered other schools",
"Volunteered at school",
"General school meeting",
"PTO/PTA meeting",
"Fundraising participation"
)
)
)
ggplot(df_choice_engage,
aes(x = Participation, y = prop, fill = grade_cat)) +
geom_col(width = 0.9) +
coord_flip() +
facet_wrap(~ Activity, ncol = 1) +
scale_y_continuous(labels = percent_format()) +
labs(
title = "Student Grade Outcome by School Choice & Family Engagement",
x = "Participation (No / Yes)",
y = "Percent of Students",
fill = "Grade category"
) +
theme_minimal(base_size = 13)
Across all six indicators of school choice and family engagement, students whose parents report higher involvement consistently show a greater proportion of Mostly A and B grades. In contrast, lower levels of parental engagement are associated with a modest but noticeable increase in C and D/F outcomes. Although the magnitude of the differences varies by activity type, the overall pattern suggests a positive association between family engagement and academic performance.
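A quick way to check that one of these descriptive associations is statistically distinguishable from noise is a chi-square test of independence (a sketch using base R; volunteering is chosen arbitrarily as the example):
# Sketch: chi-square test for grade category vs. school volunteering
engage_tab <- pfi2019_final |>
  drop_na(grade_cat, volunteer_school) |>
  with(table(volunteer_school, grade_cat))
engage_tab
chisq.test(engage_tab)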
pfi2019_final %>%
drop_na(grade_cat, income_group) %>%
count(income_group, grade_cat) %>%
group_by(income_group) %>%
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = income_group, y = prop, fill = grade_cat)) +
geom_col(width = 0.7) +
geom_text(aes(label = scales::percent(prop, accuracy = 0.1)),
position = position_stack(vjust = 0.5),
size = 3, color = "black") +
scale_y_continuous(labels = scales::percent_format()) +
coord_flip() +
labs(
title = "Student Grade Outcome by Household Income Group",
x = "Household Income Group (Low / Medium / High)",
y = "Percent of Students",
fill = "Grade category"
) +
theme_minimal(base_size = 13)
Higher-income households exhibit a substantial increase in the
proportion of A grades, while lower-income students show a more even
split between A and B grades and a higher share of C’s. The gradient
across income categories suggests a positive association between
household socioeconomic status and reported academic performance.
pfi2019_final %>%
drop_na(grade_cat, race) %>%
count(race, grade_cat) %>%
group_by(race) %>%
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = race, y = prop, fill = grade_cat)) +
geom_col(width = 0.7) +
geom_text(aes(label = scales::percent(prop, accuracy = 0.1)),
position = position_stack(vjust = 0.5),
size = 3, color = "black") +
scale_y_continuous(labels = scales::percent_format()) +
coord_flip() +
labs(
title = "Student Grade Outcome by Race/Ethnicity",
x = "Race/Ethnicity",
y = "Percent of Students",
fill = "Grade category"
) +
theme_minimal(base_size = 13)
Grade distributions vary across racial/ethnic groups, with Asian and White students showing the highest proportion of A’s, while Hispanic, Black, and Other groups show a more mixed distribution with higher shares of B and C grades. These differences highlight potential disparities in academic achievement across demographic groups.
pfi2019_final %>%
drop_na(grade_cat, parent_edu) %>%
count(parent_edu, grade_cat) %>%
group_by(parent_edu) %>%
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = parent_edu, y = prop, fill = grade_cat)) +
geom_col(width = 0.7) +
geom_text(aes(label = scales::percent(prop, accuracy = 0.1)),
position = position_stack(vjust = 0.5),
size = 3, color = "black") +
scale_y_continuous(labels = scales::percent_format()) +
coord_flip() +
labs(
title = "Student Grade Outcome by Parent Education",
x = "Parent Education Level",
y = "Percent of Students",
fill = "Grade category"
) +
theme_minimal(base_size = 13)
A clear upward gradient appears across parent education levels: students with college-educated or graduate-educated parents show the highest share of A grades. Students whose parents have less than a high school education display more evenly distributed grades across A, B, and C. This strong monotonic pattern suggests parental education is an important predictor of academic outcomes.
pfi2019_final %>%
drop_na(grade_cat, homework_days) %>%
count(homework_days, grade_cat) %>%
group_by(homework_days) %>%
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = homework_days, y = prop, fill = grade_cat)) +
geom_col(width = 0.5) +
geom_text(aes(label = scales::percent(prop, accuracy = 0.1)),
position = position_stack(vjust = 0.5),
size = 3) +
scale_y_continuous(labels = scales::percent_format()) +
coord_flip() +
labs(
title = "Grade Outcome by Days per Week Child Does Homework",
x = "Homework Days (Categorical)",
y = "Percent of Students",
fill = "Grade category"
) +
theme_minimal(base_size = 13)
Students who complete homework more frequently tend to report higher
academic performance. Those doing homework 3–4 days or 5+ days per week
show the highest proportions of A grades, while students who never do
homework show noticeably fewer A’s and more C/D_or_F outcomes.
pfi2019_final %>%
drop_na(grade_cat, help_homework) %>%
count(help_homework, grade_cat) %>%
group_by(help_homework) %>%
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = help_homework, y = prop, fill = grade_cat)) +
geom_col(width = 0.7) +
geom_text(aes(label = scales::percent(prop, accuracy = 0.1)),
position = position_stack(vjust = 0.5),
size = 3) +
scale_y_continuous(labels = scales::percent_format()) +
coord_flip() +
labs(
title = "Grade Outcome by Homework Help Frequency",
x = "Homework Help Frequency",
y = "Percent of Students",
fill = "Grade category"
) +
theme_minimal(base_size = 13)
Students who receive homework help less frequently (“Never” or “Less than once a week”) show the highest proportion of A grades, while students receiving more frequent help (1–2 days, 3–4 days, 5+ days per week) display lower A percentages and higher shares of B/C outcomes. This pattern suggests that homework help may be driven by student need: parents are more likely to help frequently when children are struggling academically. The descriptive association therefore reflects likely reverse causality rather than a negative effect of homework help.
pfi2019_final %>%
drop_na(grade_cat, check_homework) %>%
count(check_homework, grade_cat) %>%
group_by(check_homework) %>%
mutate(prop = n / sum(n)) %>%
ggplot(aes(x = check_homework, y = prop, fill = grade_cat)) +
geom_col(width = 0.7) +
geom_text(aes(label = scales::percent(prop, accuracy = 0.1)),
position = position_stack(vjust = 0.5),
size = 3) +
scale_y_continuous(labels = scales::percent_format()) +
coord_flip() +
labs(
title = "Grade Outcome by Frequency of Homework Checking",
x = "Homework Checking Frequency",
y = "Percent of Students",
fill = "Grade category"
) +
theme_minimal(base_size = 13)
Interestingly, students whose homework is never checked show the highest proportion of A grades. This likely reflects reverse causality: parents tend to monitor homework more closely when a child struggles academically, rather than monitoring causing lower performance.
library(dplyr)
library(rstatix)
library(purrr)
cat_vars <- pfi2019_final %>%
select(
grade_cat,
school_public,
school_choice_feel,
considered_other_school,
volunteer_school,
general_meeting,
pto_meeting,
fundraising,
homework_days,
check_homework,
help_homework,
income_group,
parent_edu,
race
) %>%
drop_na()
vars <- names(cat_vars)
cramer_results <- map_dfr(
combn(vars, 2, simplify = FALSE),
~{
v1 <- .x[1]
v2 <- .x[2]
tibble(
var1 = v1,
var2 = v2,
cramer_v = cramer_v(cat_vars[[v1]], cat_vars[[v2]])
)
}
)
cramer_results %>%
arrange(desc(cramer_v))
Because most variables in this dataset are categorical, Cramér's V provides a more appropriate measure of association than Pearson correlation. The largest associations appear between income and parent education, as well as among related engagement activities (volunteering, homework help, PTO involvement, and fundraising).
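For reference, Cramér's V is derived from the chi-square statistic as V = sqrt(chi-square / (n * (min(r, c) - 1))), where r and c are the table dimensions. A minimal hand-rolled version (a sketch) should match rstatix up to continuity-correction settings:
# Sketch: Cramér's V computed directly from a contingency table
cramer_v_manual <- function(x, y) {
  tab <- table(x, y)
  chi <- suppressWarnings(stats::chisq.test(tab, correct = FALSE))$statistic
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi / (n * k)))
}
cramer_v_manual(cat_vars$income_group, cat_vars$parent_edu)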
In this section, I apply several statistical learning models to understand how homework behavior, family engagement, and socioeconomic factors relate to student outcomes. Each model uses the cleaned dataset created earlier and focuses on variables selected based on the research questions explored in the EDA.
The following steps are performed:
Fitting and interpreting a multiple linear regression model
Using categorical predictors with dummy encoding
Applying cross-validation and test-set evaluation
Checking regression assumptions with residual and Q–Q plots
Identifying key predictors of homework time
set.seed(631)
# predict homework_hours (numeric)
# Drop other outcome variables
mlr_data <- pfi2019_final |>
dplyr::select(-grade_cat, -grade_level) |>
dplyr::filter(!is.na(homework_hours))
# Split data: 80% train, 20% test
mlr_split <- initial_split(mlr_data, prop = 0.8)
mlr_train <- training(mlr_split)
mlr_test <- testing(mlr_split)
# Create 10-fold cross-validation folds
mlr_folds <- vfold_cv(mlr_train, v = 10)
# check dataset shapes
dim(mlr_train)
## [1] 11562 14
dim(mlr_test)
## [1] 2891 14
# All predictors are categorical; outcome is numeric.
mlr_recipe <- recipe(homework_hours ~ ., data = mlr_train) |>
step_zv(all_predictors()) |>
step_impute_mode(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors(), one_hot = TRUE)
# sanity check: preview processed columns
mlr_recipe_prepped <- prep(mlr_recipe)
mlr_recipe_prepped
summary(mlr_recipe_prepped)
mlr_spec <- linear_reg() |>
set_engine("lm") |>
set_mode("regression")
# Combine model + recipe into a workflow
mlr_wf <- workflow() |>
add_model(mlr_spec) |>
add_recipe(mlr_recipe)
# 10-fold cross-validation on the training data
mlr_res <- fit_resamples(
mlr_wf,
resamples = mlr_folds,
metrics = yardstick::metric_set(yardstick::rmse,
yardstick::mae,
yardstick::rsq),
control = control_resamples(save_pred = TRUE)
)
# show average performance across folds
collect_metrics(mlr_res)
Interpretation: The multiple linear regression model
was used to predict the number of weekly homework hours
(homework_hours) from a set of categorical predictors
describing school characteristics, family engagement activities,
homework routines, socioeconomic status, and demographics. The
cross-validated results (MAE = 2.9, RMSE = 4.5, R² = 0.27) indicate
moderate prediction accuracy: on average, the model’s predictions differ
by about 3–4 hours from the actual homework time, and approximately
one-quarter of the variance in homework hours is explained by the
included predictors. This suggests that while family engagement and
school context are related to homework time, they do not fully account
for students’ differences in study effort.
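Because save_pred = TRUE was set above, these summary metrics can be sanity-checked directly from the saved fold predictions (a sketch; this pools predictions across folds, so the values differ slightly from the fold-averaged metrics):
# Sketch: recompute RMSE and MAE from the pooled CV predictions
collect_predictions(mlr_res) |>
  summarise(
    rmse = sqrt(mean((homework_hours - .pred)^2)),
    mae  = mean(abs(homework_hours - .pred))
  )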
# Fit the final model on training and evaluate on test data
mlr_final <- last_fit(mlr_wf, mlr_split)
# Collect and print test metrics
collect_metrics(mlr_final)
# Predictions + residuals
mlr_preds <- collect_predictions(mlr_final) |>
mutate(resid = homework_hours - .pred)
# Residuals vs Fitted plot
mlr_preds |>
ggplot(aes(x = .pred, y = resid)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, linetype = 2, color = "red") +
labs(title = "Residuals vs Fitted (Test Set)",
x = "Predicted homework hours", y = "Residuals") +
theme_minimal()
# Normal Q-Q plot
mlr_preds |>
ggplot(aes(sample = resid)) +
stat_qq() + stat_qq_line(color = "red") +
labs(title = "Normal Q-Q Plot of Residuals") +
theme_minimal()
# Top 15 coefficients (largest t-values)
mlr_fit <- mlr_final |> extract_workflow() |> extract_fit_engine()
broom::tidy(mlr_fit) |>
arrange(desc(abs(statistic))) |>
slice_head(n = 15)
Interpretation of Test Results and Diagnostics
The final test evaluation produced an RMSE of 4.66 and an R² of 0.26, indicating that the model explains roughly 26% of the variance in weekly homework hours among students. These results are consistent with the cross-validation metrics, suggesting that the model generalizes reasonably well and is not overfitted.
The Residuals vs Fitted plot shows that errors are centered around zero but display a slight funnel shape, suggesting heteroscedasticity — variance in residuals increases as predicted homework hours rise. This pattern, along with the mild downward slope, indicates possible nonlinearity that the model does not capture.
The Normal Q–Q plot reveals that residuals deviate from the normal line at the upper tail, indicating a few high-homework outliers (students spending unusually large amounts of time on homework).
From the coefficient table, the most influential predictors include the number of days students spend doing homework each week and how often their parents check homework. Specifically, students completing homework fewer days per week (e.g., “1–2 days/week” or “Less than once a week”) spend 4–7 fewer hours weekly on homework compared to those who complete homework five or more days. Parental checking frequency has a positive association, while very frequent homework help and lower parental education levels show negative effects.
Overall, this model serves as a strong baseline, but the residual patterns and modest R² suggest that relationships between the predictors and homework hours are not strictly linear. In the next section, a Polynomial Regression model will be explored to test whether adding nonlinear (curved) effects improves prediction accuracy.
The following steps are performed:
Building a polynomial regression model with step_poly()
Adding squared terms for numeric socioeconomic predictors
Using cross-validation to assess model performance
Comparing results to the baseline linear model
Checking residuals to evaluate nonlinear improvement
set.seed(631)
# Prepare data: remove grade outcomes and convert ordered factors to numeric
poly_data <- pfi2019_final |>
dplyr::select(-grade_cat, -grade_level) |>
dplyr::filter(!is.na(homework_hours)) |>
mutate(
parent_edu_num = as.numeric(parent_edu),
income_group_num = as.numeric(income_group)
)
# check structure
glimpse(poly_data)
## Rows: 14,453
## Columns: 16
## $ homework_hours <dbl> 2, 7, 6, 5, 5, 3, 10, 2, 10, 4, 5, 3, 3, 24, 5…
## $ school_public <fct> Public, Public, NotPublic, Public, Public, Pub…
## $ school_choice_feel <fct> No, No, No, Yes, No, Yes, Yes, Yes, Yes, Yes, …
## $ considered_other_school <fct> No, No, Yes, Yes, No, No, No, Yes, No, No, Yes…
## $ volunteer_school <fct> No, Yes, Yes, Yes, No, No, No, No, Yes, Yes, N…
## $ general_meeting <fct> Yes, No, Yes, Yes, Yes, Yes, Yes, Yes, Yes, Ye…
## $ pto_meeting <fct> No, No, Yes, Yes, No, No, No, No, No, No, Yes,…
## $ fundraising <fct> Yes, No, No, Yes, Yes, Yes, No, Yes, Yes, Yes,…
## $ homework_days <fct> 5+ days/week, 5+ days/week, 1–2 days/week, 3–4…
## $ check_homework <ord> Sometimes, Always, Sometimes, Always, Always, …
## $ help_homework <ord> 1–2 days/week, 5+ days/week, Less than once a …
## $ income_group <fct> Medium, Medium, Low, Low, High, High, Low, Hig…
## $ parent_edu <ord> Some college/technical, Some college/technical…
## $ race <fct> Asian, Hispanic, Black, Hispanic, Other, White…
## $ parent_edu_num <dbl> 3, 3, 5, 2, 5, 5, 3, 4, 5, 3, 4, 5, 4, 2, 3, 3…
## $ income_group_num <dbl> 2, 2, 1, 1, 3, 3, 1, 3, 3, 2, 1, 3, 2, 2, 2, 2…
# Polynomial regression recipe:
# add polynomial terms for parent_edu_num and income_group_num
poly_recipe <- recipe(homework_hours ~ ., data = poly_data) |>
step_rm(parent_edu, income_group) |> # remove the original categorical versions
step_zv(all_predictors()) |>
step_impute_mode(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
step_poly(parent_edu_num, income_group_num, degree = 2) # add squared terms
# check which columns will be used
poly_recipe_prep <- prep(poly_recipe)
summary(poly_recipe_prep)
# Model specification (same as linear regression)
poly_spec <- linear_reg() |>
set_engine("lm") |>
set_mode("regression")
# Workflow: combine model + recipe
poly_wf <- workflow() |>
add_model(poly_spec) |>
add_recipe(poly_recipe)
# 10-fold cross-validation (note: these folds are built on the full poly_data;
# the separate train/test split for the final fit is created further below)
set.seed(631)
poly_folds <- vfold_cv(poly_data, v = 10)
poly_res <- fit_resamples(
poly_wf,
resamples = poly_folds,
metrics = yardstick::metric_set(yardstick::rmse, yardstick::mae, yardstick::rsq),
control = control_resamples(save_pred = TRUE)
)
# View performance metrics
collect_metrics(poly_res)
Interpretation
To test for potential nonlinear relationships between socioeconomic variables and homework time, a polynomial regression model was fitted by adding second-degree terms for parent_edu and income_group. The model's performance (MAE = 2.91, RMSE = 4.53, R² = 0.27) was nearly identical to the baseline linear model, indicating that adding curvature did not improve predictive accuracy. This suggests that the effects of parental education and household income on students' weekly homework hours are primarily linear. Consequently, the linear model remains a simpler and equally effective representation of these relationships.
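For context, step_poly() expands a numeric column using stats::poly(), which produces orthogonal polynomial contrasts rather than raw powers; a tiny standalone sketch of the difference:
# Sketch: orthogonal vs. raw polynomial columns for a numeric input
x <- 1:5
poly(x, degree = 2)              # orthogonal linear + quadratic columns
poly(x, degree = 2, raw = TRUE)  # raw x and x^2, which are highly correlated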
# Fit the final polynomial regression model (train/test split)
poly_split <- initial_split(poly_data, prop = 0.8)
poly_train <- training(poly_split)
poly_test <- testing(poly_split)
poly_final_fit <- last_fit(poly_wf, poly_split)
# ---- Test performance ----
collect_metrics(poly_final_fit)
# ---- Predictions & residuals ----
poly_preds <- collect_predictions(poly_final_fit) |>
mutate(resid = homework_hours - .pred)
# Residuals vs Fitted
poly_preds |>
ggplot(aes(x = .pred, y = resid)) +
geom_point(alpha = 0.5) +
geom_hline(yintercept = 0, color = "red", linetype = 2) +
labs(title = "Polynomial Regression: Residuals vs Fitted (Test Set)",
x = "Predicted Homework Hours", y = "Residuals") +
theme_minimal()
# Normal Q-Q Plot
poly_preds |>
ggplot(aes(sample = resid)) +
stat_qq() + stat_qq_line(color = "red") +
labs(title = "Polynomial Regression: Normal Q-Q Plot of Residuals") +
theme_minimal()
# ---- Coefficients ----
poly_fit <- poly_final_fit |> extract_workflow() |> extract_fit_engine()
broom::tidy(poly_fit) |>
arrange(desc(abs(statistic))) |>
slice_head(n = 15)
Polynomial Regression Diagnostics and Interpretation
The polynomial regression model included second-degree terms for
parent_edu and income_group to capture
potential nonlinear socioeconomic effects on weekly homework hours.
Model performance on the test set (RMSE = 4.89, R² = 0.25) was slightly
worse than the baseline multiple linear regression, suggesting no
substantial improvement. Residual diagnostics revealed similar issues to
the linear model — residuals remained heteroskedastic and right-skewed,
with a few outliers at high predicted values. The Q–Q plot confirmed
deviations from normality, and squared terms did not reduce error
patterns. Coefficient estimates indicated that homework frequency and
parental monitoring remain the strongest predictors of homework time,
while the nonlinear (polynomial) effects of socioeconomic variables were
small and statistically weak. Overall, the polynomial regression
confirmed that the linear relationships captured most of the predictive
structure, and added complexity did not yield better generalization.
The following steps are performed:
Building a penalized regression model using the Lasso method (glmnet)
Applying regularization to reduce overfitting and select key predictors
Tuning the penalty parameter (λ) through cross-validation
Comparing Lasso performance to Multiple Linear Regression
Interpreting the most influential variables retained by the model
set.seed(631)
# Prepare data (same as before)
lasso_data <- pfi2019_final |>
dplyr::select(-grade_cat, -grade_level) |>
dplyr::filter(!is.na(homework_hours))
# Split into train/test
lasso_split <- initial_split(lasso_data, prop = 0.8)
lasso_train <- training(lasso_split)
lasso_test <- testing(lasso_split)
# 10-fold cross-validation
lasso_folds <- vfold_cv(lasso_train, v = 10)
# Recipe: handle categorical predictors (dummy encoding) and missing data
lasso_recipe <- recipe(homework_hours ~ ., data = lasso_train) |>
step_zv(all_predictors()) |>
step_impute_mode(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
step_normalize(all_numeric_predictors()) # normalization is important for glmnet
# Model specification: Lasso regression (mixture = 1 means pure Lasso)
lasso_spec <- linear_reg(
penalty = tune(), # lambda: the regularization parameter to tune
mixture = 1 # 1 = Lasso, 0 = Ridge, between = Elastic Net
) |>
set_engine("glmnet") |>
set_mode("regression")
# Workflow: combine recipe and model
lasso_wf <- workflow() |>
add_model(lasso_spec) |>
add_recipe(lasso_recipe)
# Define penalty tuning grid (lambda values)
lasso_grid <- grid_regular(penalty(), levels = 30)
# Tune the model using cross-validation
set.seed(631)
lasso_res <- tune_grid(
lasso_wf,
resamples = lasso_folds,
grid = lasso_grid,
metrics = yardstick::metric_set(yardstick::rmse, yardstick::mae, yardstick::rsq),
control = control_grid(save_pred = TRUE)
)
# View performance for each lambda
lasso_metrics <- collect_metrics(lasso_res)
lasso_metrics
# Select best lambda (penalty) by lowest RMSE
best_lasso <- select_best(lasso_res, metric = "rmse")
best_lasso
# Finalize workflow with best lambda
final_lasso_wf <- finalize_workflow(lasso_wf, best_lasso)
# Fit final model on training data and test it
lasso_final_fit <- last_fit(final_lasso_wf, lasso_split)
# Test-set metrics
collect_metrics(lasso_final_fit)
# View coefficients after regularization
lasso_fit_obj <- lasso_final_fit |> extract_workflow() |> extract_fit_parsnip()
coef(lasso_fit_obj$fit)[1:5, ] # preview the first 5 coefficient rows across the lambda path (many shrink toward 0)
## 5 x 70 sparse Matrix of class "dgCMatrix"
##
## (Intercept) 5.800899 5.800899 5.800899 5.800899 5.800899 5.800899
## school_public_NotPublic . . . . . .
## school_public_Public . . . . . .
## school_choice_feel_No . . . . . .
## school_choice_feel_Yes . . . . . .
##
## (Intercept) 5.800899 5.800899 5.800899 5.800899 5.800899 5.800899
## school_public_NotPublic . . . . . .
## school_public_Public . . . . . .
## school_choice_feel_No . . . . . .
## school_choice_feel_Yes . . . . . .
##
## (Intercept) 5.800899 5.800899 5.800899 5.800899 5.800899 5.800899
## school_public_NotPublic . . . . . .
## school_public_Public . . . . . .
## school_choice_feel_No . . . . . .
## school_choice_feel_Yes . . . . . .
##
## (Intercept) 5.800899 5.800899 5.800899 5.800899 5.800899 5.800899
## school_public_NotPublic . . . . . .
## school_public_Public . . . . . .
## school_choice_feel_No . . . . . .
## school_choice_feel_Yes . . . . . .
##
## (Intercept) 5.800899 5.800899 5.800899 5.800899 5.800899 5.800899
## school_public_NotPublic . . . . . .
## school_public_Public . . . . . .
## school_choice_feel_No . . . . . .
## school_choice_feel_Yes . . . . . .
##
## (Intercept) 5.800899 5.800899 5.800899 5.800899 5.800899498
## school_public_NotPublic . . . . 0.004013881
## school_public_Public . . . . .
## school_choice_feel_No . . . . .
## school_choice_feel_Yes . . . . .
##
## (Intercept) 5.800899498 5.8008995 5.80089950 5.80089950
## school_public_NotPublic 0.009494601 0.0138805 0.01787718 0.02148123
## school_public_Public . . . .
## school_choice_feel_No . . . .
## school_choice_feel_Yes . . . .
##
## (Intercept) 5.8008994984 5.800899e+00 5.800899e+00 5.800899e+00
## school_public_NotPublic 0.0246437294 2.694729e-02 2.914197e-02 3.114826e-02
## school_public_Public . . . .
## school_choice_feel_No -0.0006726755 -3.792649e-03 -6.601081e-03 -9.157224e-03
## school_choice_feel_Yes . 2.739492e-17 1.369746e-16 2.465543e-16
##
## (Intercept) 5.800899e+00 5.800899e+00 5.800899e+00 5.800899e+00
## school_public_NotPublic 3.294863e-02 3.457804e-02 3.634724e-02 3.833627e-02
## school_public_Public . . . .
## school_choice_feel_No -1.143985e-02 -1.347977e-02 -1.539757e-02 -1.725617e-02
## school_choice_feel_Yes 4.109238e-16 5.433326e-16 6.620439e-16 7.259653e-16
##
## (Intercept) 5.800899e+00 5.800899e+00 5.800899e+00 5.800899e+00
## school_public_NotPublic 4.011879e-02 4.177005e-02 4.319752e-02 4.449608e-02
## school_public_Public . . . .
## school_choice_feel_No -1.893334e-02 -2.047424e-02 -2.186154e-02 -2.311801e-02
## school_choice_feel_Yes 7.853210e-16 8.401108e-16 8.903349e-16 9.496905e-16
##
## (Intercept) 5.800899e+00 5.800899e+00 5.800899e+00 5.800899e+00
## school_public_NotPublic 4.568049e-02 4.675980e-02 4.774323e-02 4.863930e-02
## school_public_Public . . . .
## school_choice_feel_No -2.426355e-02 -2.530738e-02 -2.625849e-02 -2.712511e-02
## school_choice_feel_Yes 1.004480e-15 1.070685e-15 1.143738e-15 1.193962e-15
##
## (Intercept) 5.800899e+00 5.800899e+00 5.800899e+00 5.800899e+00
## school_public_NotPublic 4.945576e-02 5.018646e-02 5.086492e-02 5.141942e-02
## school_public_Public . . . .
## school_choice_feel_No -2.791473e-02 -2.862642e-02 -2.928239e-02 -2.982624e-02
## school_choice_feel_Yes 1.267015e-15 1.312673e-15 1.358331e-15 1.385726e-15
##
## (Intercept) 5.800899e+00 5.800899e+00 5.800899e+00 5.800899e+00
## school_public_NotPublic 5.198362e-02 5.250315e-02 5.297687e-02 5.340832e-02
## school_public_Public . . . .
## school_choice_feel_No -3.037232e-02 -3.087322e-02 -3.132999e-02 -3.174618e-02
## school_choice_feel_Yes 1.415404e-15 1.440516e-15 1.472477e-15 1.494165e-15
##
## (Intercept) 5.800899e+00 5.800899e+00 5.800899e+00 5.800899e+00
## school_public_NotPublic 5.380128e-02 5.415924e-02 5.448536e-02 5.478249e-02
## school_public_Public . . . .
## school_choice_feel_No -3.212534e-02 -3.247079e-02 -3.278554e-02 -3.307231e-02
## school_choice_feel_Yes 1.519277e-15 1.544389e-15 1.568359e-15 1.591188e-15
##
## (Intercept) 5.800899e+00 5.800899e+00 5.800899e+00
## school_public_NotPublic 5.505321e-02 5.529988e-02 5.552464e-02
## school_public_Public . . .
## school_choice_feel_No -3.333361e-02 -3.357169e-02 -3.378862e-02
## school_choice_feel_Yes 1.618012e-15 1.633993e-15 1.659105e-15
# Visualize CV RMSE vs penalty
autoplot(lasso_res, metric = "rmse") +
labs(title = "Lasso: CV RMSE across Penalty (λ)")
# Extract coefficients safely
lasso_fit_obj <- lasso_final_fit |> extract_workflow() |> extract_fit_parsnip()
# Convert coefficients matrix to dataframe
coef_df <- as.matrix(coef(lasso_fit_obj$fit, s = best_lasso$penalty)) |>
as.data.frame() |>
tibble::rownames_to_column("term")
# The coefficient column might have a generic name like "1" or "s1"
colnames(coef_df)[2] <- "estimate"
coef_df <- coef_df |>
filter(term != "(Intercept)") |>
mutate(abs_est = abs(estimate)) |>
arrange(desc(abs_est))
# How many predictors were kept?
coef_kept <- coef_df |> filter(estimate != 0)
nrow(coef_kept)
## [1] 28
# Top 15 non-zero coefficients
coef_kept |> slice_head(n = 15)
# Optional: quick plot
coef_kept |>
slice_head(n = 15) |>
ggplot(aes(x = reorder(term, abs_est), y = estimate)) +
geom_col() +
coord_flip() +
labs(title = "Lasso: Top Non-zero Coefficients at Best λ",
x = "Predictor", y = "Coefficient") +
theme_minimal()
Interpretation
The Lasso regression model applied an L1 regularization penalty to identify the most influential predictors of weekly homework hours while shrinking weak coefficients toward zero. Cross-validation selected an optimal penalty of λ = 0.0085, which achieved RMSE = 4.66 and R² = 0.26 on the test data, virtually identical to the multiple linear regression results. Although predictive accuracy did not improve, the model became more parsimonious, retaining only 28 non-zero coefficients out of the original set of dummy-coded predictors.
The strongest predictors retained by Lasso included categories of homework_days (especially “5+ days/week”, “1–2 days/week”, and “Less than once a week”) and several engagement-related variables such as check_homework, help_homework, and select demographic indicators (e.g., race_Asian, race_Black, race_White). These results confirm that students who complete homework more frequently spend substantially more time on homework overall, while parental monitoring and demographic characteristics have moderate effects.
The tuning curve showed that RMSE remained stable across small penalties and increased sharply once λ became too large, indicating that only mild regularization was needed. Overall, the Lasso regression provided a simplified yet equally accurate model, highlighting the key predictors while setting less informative variables to zero.
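The shrinkage behavior can also be viewed directly with glmnet's built-in coefficient path plot (a sketch using the engine object already extracted above):
# Sketch: coefficient paths vs. log(lambda); each line is one dummy predictor
# being shrunk toward zero as the penalty grows
plot(lasso_fit_obj$fit, xvar = "lambda", label = TRUE)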
The following steps are performed:
Using PCA to reduce dimensionality of dummy predictors
Creating principal components for regression
Tuning the number of components through cross-validation
Fitting regression on principal components
Comparing PCA performance to MLR and Lasso
set.seed(631)
pca_data <- pfi2019_final |>
dplyr::select(-grade_cat, -grade_level) |>
dplyr::filter(!is.na(homework_hours))
# Train/test split (80/20)
pca_split <- initial_split(pca_data, prop = 0.8)
pca_train <- training(pca_split)
pca_test <- testing(pca_split)
# 10-fold cross-validation on training data
pca_folds <- vfold_cv(pca_train, v = 10)
# Just to check sizes
dim(pca_train)
## [1] 11562 14
dim(pca_test)
## [1] 2891 14
# PCA recipe: impute → dummy encode → normalize → PCA
pca_recipe <- recipe(homework_hours ~ ., data = pca_train) |>
step_impute_mode(all_nominal_predictors()) |> # fill missing categorical
step_dummy(all_nominal_predictors(), one_hot = TRUE) |> # convert to 0/1 dummies
step_normalize(all_numeric_predictors()) |> # scale numerics for PCA
step_pca(all_numeric_predictors(), num_comp = tune()) # tune # components
# Model: linear regression applied to PCs
pca_spec <- linear_reg() |>
set_engine("lm") |>
set_mode("regression")
# Workflow: combine recipe + model
pca_wf <- workflow() |>
add_recipe(pca_recipe) |>
add_model(pca_spec)
# Grid of PCA components to try
pca_grid <- tibble(num_comp = seq(2, 20, by = 2))
# Tune via 10-fold CV
set.seed(631)
pca_res <- tune_grid(
pca_wf,
resamples = pca_folds,
grid = pca_grid,
metrics = yardstick::metric_set(yardstick::rmse, yardstick::rsq),
control = control_grid(save_pred = TRUE)
)
# View results of tuning
collect_metrics(pca_res)
# Select best number of principal components based on lowest RMSE
best_pca <- select_best(pca_res, metric = "rmse")
best_pca
# Finalize workflow using best num_comp
final_pca_wf <- finalize_workflow(pca_wf, best_pca)
# Fit on training and evaluate on test set
pca_final <- last_fit(final_pca_wf, pca_split)
# Test-set metrics (RMSE, R²)
collect_metrics(pca_final)
Interpretation
PCA Regression was used to reduce the dimensionality of the dummy-coded predictors before fitting a linear model for weekly homework hours. The number of principal components was tuned using 10-fold cross-validation, and the best model selected 20 components (the value minimizing RMSE).
The test-set performance (RMSE = 4.86, R² = 0.20) was noticeably worse than the Multiple Linear Regression, Polynomial Regression, and Lasso models. This indicates that PCA did not improve predictive accuracy for this dataset. The likely reason is that most predictors are categorical and already low-dimensional, so converting them into principal components does not uncover meaningful underlying structure.
Although PCA did not enhance prediction, this example demonstrates how dimensionality reduction can be incorporated into the tidymodels framework and evaluated using cross-validation. It also highlights an important modeling insight: PCA is most effective when there are many correlated numeric predictors, a condition not strongly met in the PFI dataset.
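If desired, the variance captured by the selected components can be inspected from the prepped recipe inside the final fit (a sketch; step_pca is the fourth step in pca_recipe):
# Sketch: percent variance explained per principal component
pca_final |>
  extract_workflow() |>
  extract_recipe() |>  # returns the prepped (trained) recipe
  tidy(number = 4, type = "variance") |>
  filter(terms == "percent variance") |>
  slice_head(n = 10)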
library(knitr)
library(tibble)
library(flextable)
reg_compare <- tribble(
~Model, ~CV_RMSE, ~CV_R2, ~Test_RMSE, ~Test_R2, ~Notes,
"Multiple Linear Regression", "4.50", "0.266", "4.66", "0.264", "Strong baseline performer",
"Polynomial Regression", "4.53", "0.267", "4.89", "0.249", "No improvement from curvature",
"Lasso Regression", "4.50", "0.266", "4.66", "0.264", "Simpler model; same accuracy",
"PCA Regression", "4.65–4.85", "0.02–0.21", "4.86", "0.200", "Performance reduced"
)
flextable(reg_compare) |>
autofit() |>
set_caption("Comparison of Regression Models")
Model | CV_RMSE | CV_R2 | Test_RMSE | Test_R2 | Notes |
|---|---|---|---|---|---|
Multiple Linear Regression | 4.50 | 0.266 | 4.66 | 0.264 | Strong baseline performer |
Polynomial Regression | 4.53 | 0.267 | 4.89 | 0.249 | No improvement from curvature |
Lasso Regression | 4.50 | 0.266 | 4.66 | 0.264 | Simpler model; same accuracy |
PCA Regression | 4.65–4.85 | 0.02–0.21 | 4.86 | 0.200 | Performance reduced |
The regression analyses demonstrated that predicting weekly homework hours using the PFI dataset is challenging due to the limited number of numeric predictors and the categorical nature of most explanatory variables. The baseline Multiple Linear Regression model achieved moderate performance, explaining about 26% of the variance in homework time. Polynomial Regression did not improve accuracy, indicating that the relationships between socio-behavioral predictors and homework hours are largely linear. Lasso Regression produced a more parsimonious model by shrinking weaker coefficients to zero, but its predictive accuracy remained nearly identical to the baseline model. PCA Regression, which reduces dimensionality by projecting dummy-coded predictors into principal components, resulted in worse performance, confirming that PCA is less effective when predictors are primarily categorical and not highly correlated.
Across all regression methods, model performance remained relatively stable, with RMSE values between 4.6 and 4.9. These results suggest that student homework hours cannot be strongly predicted from the available survey-based variables, which capture school engagement, homework routines, and socioeconomic factors but may omit other influences such as motivation, school policy, or teacher expectations. Overall, the regression section highlights important modeling lessons: added model complexity does not always yield better predictions, and method selection must align with the structure and richness of the data.
The following steps are performed:

- Preparing a multiclass grade outcome (grade_cat)
- Fitting a multinomial logistic regression model
- Dummy-encoding categorical predictors
- Using cross-validation and test-set evaluation
- Interpreting predicted class probabilities and key predictors
set.seed(631)
# Use grade_cat as outcome, keep homework_hours as a predictor
mnl_data <- pfi2019_final |>
dplyr::select(-grade_level) |> # drop only the binary outcome
drop_na(grade_cat) |> # need grade_cat present
mutate(grade_cat = droplevels(grade_cat))
# Train/test split with stratification on grade_cat
mnl_split <- initial_split(mnl_data, prop = 0.8, strata = grade_cat)
mnl_train <- training(mnl_split)
mnl_test <- testing(mnl_split)
# 10-fold CV, also stratified
mnl_folds <- vfold_cv(mnl_train, v = 10, strata = grade_cat)
# Check class balance in training data
mnl_train |> count(grade_cat)
# Preprocessing for multinomial logistic regression
mnl_recipe <- recipe(grade_cat ~ ., data = mnl_train) |>
step_zv(all_predictors()) |>
step_impute_median(all_numeric_predictors()) |> # for homework_hours
step_impute_mode(all_nominal_predictors()) |> # for any missing factors
step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
step_normalize(all_numeric_predictors()) # scale homework_hours
# Multinomial logistic regression model
mnl_spec <- multinom_reg() |>
set_engine("nnet") |>
set_mode("classification")
# Workflow: combine recipe + model
mnl_wf <- workflow() |>
add_recipe(mnl_recipe) |>
add_model(mnl_spec)
set.seed(631)
mnl_res <- fit_resamples(
mnl_wf,
resamples = mnl_folds,
metrics = metric_set(accuracy, mn_log_loss),
control = control_resamples(save_pred = TRUE)
)
collect_metrics(mnl_res)
mnl_final <- last_fit(mnl_wf, mnl_split)
# Test accuracy & log loss
collect_metrics(mnl_final)
# Predictions with actual & predicted classes
mnl_test_preds <- collect_predictions(mnl_final)
# Confusion matrix
mnl_test_preds |>
conf_mat(truth = grade_cat, estimate = .pred_class) |>
autoplot(type = "heatmap")
Interpretation:
The multinomial logistic model reached 59% accuracy with moderate discrimination (AUC = 0.66). It predicts A and B well but rarely identifies C and never predicts D_or_F because of strong class imbalance in the dataset. Most errors occur between adjacent grade categories. Homework behavior and socioeconomic status measures help explain grade categories, though the imbalance limits prediction of low-performing students.
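Because overall accuracy hides how each grade category fares, a per-class recall table makes the imbalance problem concrete. A minimal sketch using the saved test-set predictions:

# Sketch: per-class recall (share of each true grade correctly predicted)
mnl_test_preds |>
  group_by(grade_cat) |>
  summarise(
    n = n(),
    recall = mean(.pred_class == grade_cat)
  )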
The following steps are performed:

- Converting grade outcomes into a binary label (High vs Low)
- Fitting a logistic regression classifier
- Dummy-encoding categorical predictors
- Using 10-fold cross-validation and test-set evaluation
- Interpreting accuracy and key misclassification patterns
set.seed(631)
# grade_level = High (A/B) vs Low (C/D_or_F)
logit_data <- pfi2019_final |>
dplyr::select(-grade_cat) |> # drop multinomial outcome
drop_na(grade_level) |>
mutate(grade_level = droplevels(grade_level))
# Stratified split
logit_split <- initial_split(logit_data, prop = 0.8, strata = grade_level)
logit_train <- training(logit_split)
logit_test <- testing(logit_split)
# CV folds (stratified)
logit_folds <- vfold_cv(logit_train, v = 10, strata = grade_level)
# Class counts
logit_train |> count(grade_level)
# Preprocessing steps for binary logistic regression
logit_recipe <- recipe(grade_level ~ ., data = logit_train) |>
step_zv(all_predictors()) |>
step_impute_median(all_numeric_predictors()) |> # homework_hours
step_impute_mode(all_nominal_predictors()) |> # fill missing factors
step_dummy(all_nominal_predictors(), one_hot = TRUE) |> # one-hot encoding
step_normalize(all_numeric_predictors()) # normalization for numeric predictors
# Logistic regression model
logit_spec <- logistic_reg() |>
set_engine("glm") |>
set_mode("classification")
# Workflow
logit_wf <- workflow() |>
add_recipe(logit_recipe) |>
add_model(logit_spec)
set.seed(631)
logit_res <- fit_resamples(
logit_wf,
resamples = logit_folds,
metrics = metric_set(accuracy, roc_auc, sensitivity, specificity),
control = control_resamples(save_pred = TRUE)
)
collect_metrics(logit_res)
The logistic regression model achieved high accuracy (≈ 88%) but extremely low sensitivity (≈ 2%), indicating that it labels nearly all students as “High” performers. The ROC AUC of 0.76 suggests moderate ability to separate the two classes overall, but class imbalance heavily biases predictions toward the majority category. As a result, the model fails to identify Low-performing students despite performing well on High-performing ones.
logit_final <- last_fit(logit_wf, logit_split)
# Test metrics
collect_metrics(logit_final)
# Predictions
logit_test_preds <- collect_predictions(logit_final)
# Confusion matrix
logit_test_preds |>
conf_mat(truth = grade_level, estimate = .pred_class) |>
autoplot(type = "heatmap")
logit_test_preds |>
roc_curve(truth = grade_level, .pred_Low) |>
autoplot() +
ggtitle("ROC Curve – Binary Logistic Regression")
Interpretation: The binary logistic regression model achieved 88.5% accuracy on the test set, but this result is misleading due to strong class imbalance (High = 88%, Low = 12%). The model predicted nearly all students as “High,” resulting in very high specificity (≈ 1.00) but extremely low sensitivity: only 7 Low students were correctly identified, while 302 Low students were misclassified as High.
Although the ROC AUC of 0.77 indicates moderate overall discrimination ability, the confusion matrix shows that the model is ineffective at detecting Low-performing students. This behavior is expected given the imbalance in the dataset and the fact that logistic regression without class balancing tends to favor the majority class. Overall, while the model performs well numerically, it is not practically useful for identifying students at risk of low performance.
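One low-cost remedy that trades specificity for sensitivity without refitting is lowering the probability threshold for the minority class. A minimal sketch on the test-set predictions; the 0.2 cutoff is illustrative, not tuned:

# Sketch: reclassify with a lower threshold for the "Low" class
logit_test_preds |>
  mutate(
    .pred_custom = factor(
      if_else(.pred_Low >= 0.2, "Low", "High"),
      levels = levels(grade_level)
    )
  ) |>
  conf_mat(truth = grade_level, estimate = .pred_custom)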
The following steps are performed:

- Preparing binary grade categories (grade_level) for LDA
- Applying LDA using the discrim package
- Dummy-encoding categorical predictors
- Evaluating model performance using cross-validation
- Interpreting accuracy, ROC AUC, and misclassification patterns
library(discrim)
set.seed(631)
lda_data <- pfi2019_final |>
dplyr::select(-grade_cat) |>
drop_na(grade_level) |>
mutate(grade_level = droplevels(grade_level))
lda_split <- initial_split(lda_data, prop = 0.8, strata = grade_level)
lda_train <- training(lda_split)
lda_test <- testing(lda_split)
lda_folds <- vfold_cv(lda_train, v = 10, strata = grade_level)
lda_train |> count(grade_level)
lda_recipe <- recipe(grade_level ~ ., data = lda_train) |>
step_zv(all_predictors()) |>
step_impute_median(all_numeric_predictors()) |>
step_impute_mode(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
step_normalize(all_numeric_predictors())
lda_spec <- discrim_linear() |>
set_engine("MASS") |>
set_mode("classification")
lda_wf <- workflow() |>
add_recipe(lda_recipe) |>
add_model(lda_spec)
set.seed(631)
lda_res <- fit_resamples(
lda_wf,
resamples = lda_folds,
metrics = metric_set(accuracy, roc_auc, sensitivity, specificity),
control = control_resamples(save_pred = TRUE)
)
collect_metrics(lda_res)
lda_final <- last_fit(lda_wf, lda_split)
collect_metrics(lda_final)
lda_preds <- collect_predictions(lda_final)
lda_preds |>
conf_mat(truth = grade_level, estimate = .pred_class) |>
autoplot(type = "heatmap")
lda_preds |>
roc_curve(truth = grade_level, .pred_Low) |>
autoplot() +
ggtitle("ROC Curve – LDA Model")
Interpretation: The LDA model achieved 88% accuracy and a ROC AUC of 0.75 in cross-validation, indicating moderate ability to separate High- and Low-performing students. However, sensitivity remained very low (≈ 6%), meaning the model rarely identified Low-performing students correctly, while specificity was very high (≈ 99%), reflecting the strong class imbalance (High ≈ 88%, Low ≈ 12%).
Test-set performance was similar, with 88.6% accuracy and AUC = 0.77, confirming stable but imbalanced predictive behavior. The confusion matrix shows that LDA correctly classified most High-performing students but correctly identified only 23 out of 331 Low-performing cases, misclassifying the majority as High.
Overall, LDA performs comparably to binary logistic regression, with moderate discrimination but poor sensitivity due to the imbalance in grade_level classes. This limits its usefulness for detecting Low-performing students, though it remains effective at modeling the majority class.
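Because the two classifiers report nearly identical headline metrics, overlaying their test-set ROC curves makes the comparison visual. A minimal sketch reusing the predictions collected above:

# Sketch: overlay test-set ROC curves for logistic regression and LDA
bind_rows(
  logit_test_preds |>
    roc_curve(truth = grade_level, .pred_Low) |>
    mutate(model = "Logistic regression"),
  lda_preds |>
    roc_curve(truth = grade_level, .pred_Low) |>
    mutate(model = "LDA")
) |>
  ggplot(aes(x = 1 - specificity, y = sensitivity, colour = model)) +
  geom_path() +
  geom_abline(linetype = 2) + # chance line
  labs(title = "ROC Curves – Logistic Regression vs LDA")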
library(flextable)
class_compare <- tribble(
~Model, ~Outcome, ~Accuracy, ~ROC_AUC, ~Notes,
"Multinomial Logistic Regression", "grade_cat", "0.592", "0.661", "Moderate accuracy; poor for small classes",
"Binary Logistic Regression", "grade_level", "0.885", "0.771", "High accuracy; very low sensitivity",
"Linear Discriminant Analysis", "grade_level", "0.886", "0.766", "Moderate AUC; imbalance limits detection"
)
flextable(class_compare) |>
autofit() |>
set_caption("Comparison of Classification Models")
| Model | Outcome | Accuracy | ROC_AUC | Notes |
|---|---|---|---|---|
| Multinomial Logistic Regression | grade_cat | 0.592 | 0.661 | Moderate accuracy; poor for small classes |
| Binary Logistic Regression | grade_level | 0.885 | 0.771 | High accuracy; very low sensitivity |
| Linear Discriminant Analysis | grade_level | 0.886 | 0.766 | Moderate AUC; imbalance limits detection |
The classification models showed moderate success in predicting students’ grade outcomes. Multinomial logistic regression performed reasonably well for common grade categories but struggled with rare D/F grades. Binary logistic regression and LDA achieved high overall accuracy but had very low sensitivity, meaning they rarely detected Low-performing students due to class imbalance. Overall, the models identify general performance patterns but are limited in identifying struggling students.
The following steps are performed:

- Fitting a decision tree to classify grade_level
- Using dummy-encoded predictors
- Performing cross-validation to assess accuracy
- Evaluating test accuracy and the confusion matrix
- Interpreting how the tree splits predictors
set.seed(631)
tree_data <- pfi2019_final |>
dplyr::select(-grade_cat) |>
drop_na(grade_level)
tree_split <- initial_split(tree_data, prop = 0.8, strata = grade_level)
tree_train <- training(tree_split)
tree_test <- testing(tree_split)
tree_folds <- vfold_cv(tree_train, v = 10, strata = grade_level)
tree_train |> count(grade_level)
tree_recipe <- recipe(grade_level ~ ., data = tree_train) |>
step_zv(all_predictors()) |>
step_impute_median(all_numeric_predictors()) |>
step_impute_mode(all_nominal_predictors()) |>
step_dummy(all_nominal_predictors(), one_hot = TRUE)
# Decision tree spec
tree_spec <- decision_tree(
  cost_complexity = tune(), # pruning penalty (cp)
  tree_depth = tune(), # maximum tree depth
  min_n = tune() # minimum observations required to split a node
) |>
set_engine("rpart") |>
set_mode("classification")
# Workflow
tree_wf <- workflow() |>
add_recipe(tree_recipe) |>
add_model(tree_spec)
tree_grid <- grid_regular(
cost_complexity(),
tree_depth(),
min_n(),
levels = 3
)
set.seed(631)
tree_res <- tune_grid(
tree_wf,
resamples = tree_folds,
grid = tree_grid,
metrics = metric_set(accuracy, roc_auc),
control = control_grid(save_pred = TRUE)
)
collect_metrics(tree_res)
best_tree <- select_best(tree_res, metric = "accuracy")
final_tree_wf <- finalize_workflow(tree_wf, best_tree)
tree_final <- last_fit(final_tree_wf, tree_split)
# test metrics
collect_metrics(tree_final)
# predictions
tree_preds <- collect_predictions(tree_final)
# confusion matrix
tree_preds |>
conf_mat(truth = grade_level, estimate = .pred_class) |>
autoplot(type = "heatmap")
Interpretation:
The decision tree model achieved a test accuracy of about 88.6%, but this performance is entirely driven by the severe class imbalance in grade_level. The tuned tree degenerated into a trivial classifier that predicts every student as High, yielding a ROC AUC of 0.50 and a confusion matrix with no students predicted as Low. Although accuracy appears high, the model has zero sensitivity for detecting Low-performing students and offers no improvement over a naive majority-class rule. Compared to logistic regression and LDA, the decision tree adds no predictive value and confirms that the current data and class distribution are not well suited to tree-based classification for identifying at-risk students.
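A quick way to confirm this degenerate behavior is to print the fitted rpart object; a root-only tree shows a single terminal node. A minimal sketch:

# Sketch: inspect the final tree; a lone "1) root ... *" line means no splits
tree_final |>
  extract_workflow() |>
  extract_fit_engine()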
The following steps are performed:

- Fitting a random forest classifier to predict grade_level (High vs Low).
- Preprocessing predictors using dummy encoding and imputation.
- Training the model using 10-fold stratified cross-validation.
- Evaluating performance on a held-out test set using accuracy, ROC AUC, and a confusion matrix.
- Examining variable importance to identify the strongest predictors of student performance.
set.seed(631)
rf_data <- pfi2019_final |>
dplyr::select(-grade_cat) |> # drop multinomial outcome
drop_na(grade_level) |>
mutate(grade_level = droplevels(grade_level))
# 80/20 split stratified by grade_level
rf_split <- initial_split(rf_data, prop = 0.8, strata = grade_level)
rf_train <- training(rf_split)
rf_test <- testing(rf_split)
# 10-fold CV (also stratified)
rf_folds <- vfold_cv(rf_train, v = 10, strata = grade_level)
# check balance
rf_train |> count(grade_level)
rf_recipe <- recipe(grade_level ~ ., data = rf_train) |>
step_zv(all_predictors()) |>
step_impute_median(all_numeric_predictors()) |> # homework_hours
step_impute_mode(all_nominal_predictors()) |> # categorical NAs
step_dummy(all_nominal_predictors(), one_hot = TRUE) # one-hot factors
# Random forest specification (fixed hyperparameters)
rf_spec <- rand_forest(
mtry = 15, # number of predictors to sample at each split
trees = 500, # number of trees
min_n = 10 # minimum node size
) |>
set_engine("ranger", importance = "impurity") |>
set_mode("classification")
# Workflow
rf_wf <- workflow() |>
add_recipe(rf_recipe) |>
add_model(rf_spec)
# Cross-validation performance
set.seed(631)
rf_res <- fit_resamples(
rf_wf,
resamples = rf_folds,
metrics = metric_set(accuracy, roc_auc),
control = control_resamples(save_pred = TRUE)
)
collect_metrics(rf_res)
rf_final <- last_fit(rf_wf, rf_split)
# Test metrics
collect_metrics(rf_final)
# Predictions and confusion matrix
rf_preds <- collect_predictions(rf_final)
rf_preds |>
conf_mat(truth = grade_level, estimate = .pred_class) |>
autoplot(type = "heatmap")
rf_engine <- rf_final |>
extract_workflow() |>
extract_fit_engine()
vip_tbl <- tibble(
term = names(rf_engine$variable.importance),
importance = as.numeric(rf_engine$variable.importance)
) |>
arrange(desc(importance))
# top 15 important predictors
vip_tbl |> slice_head(n = 15)
# simple importance plot
vip_tbl |>
slice_head(n = 15) |>
ggplot(aes(x = reorder(term, importance), y = importance)) +
geom_col() +
coord_flip() +
labs(
title = "Random Forest: Top 15 Variable Importances",
x = "Predictor", y = "Importance"
)
Interpretation
The random forest model achieved a test accuracy of 88.4% and a ROC AUC of 0.73, indicating moderate ability to distinguish between High and Low performers. Compared to logistic regression and LDA, the random forest provides slightly better discrimination and identifies more Low-performing students, though sensitivity remains low due to severe class imbalance.
Variable importance results highlight homework_hours as the dominant predictor, followed by parent education, race, and several homework engagement variables. These findings reinforce the conclusion that homework behavior and SES characteristics are the strongest contributors to academic performance in the PFI dataset. While the model handles nonlinearities and interactions automatically, imbalance still limits its effectiveness as an early-warning tool.
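The forest above uses fixed hyperparameters; tuning them might squeeze out a little more performance. A minimal sketch with the same recipe and folds (the grid ranges are illustrative, not taken from the analysis):

# Sketch: tune mtry and min_n over a small regular grid
rf_tune_spec <- rand_forest(
  mtry = tune(),
  trees = 500,
  min_n = tune()
) |>
  set_engine("ranger", importance = "impurity") |>
  set_mode("classification")

rf_tune_wf <- workflow() |>
  add_recipe(rf_recipe) |>
  add_model(rf_tune_spec)

set.seed(631)
rf_tune_res <- tune_grid(
  rf_tune_wf,
  resamples = rf_folds,
  grid = grid_regular(
    mtry(range = c(5L, 30L)), # illustrative range
    min_n(range = c(5L, 25L)), # illustrative range
    levels = 3
  ),
  metrics = metric_set(roc_auc)
)
select_best(rf_tune_res, metric = "roc_auc")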
library(flextable)
tree_rf_compare <- tribble(
~Model, ~Outcome, ~Accuracy, ~ROC_AUC, ~Notes,
"Decision Tree", "grade_level", "0.886", "0.500", "Predicts all High; no useful splits",
"Random Forest", "grade_level", "0.884", "0.727", "Moderate AUC; finds more Low; nonlinear patterns captured"
)
flextable(tree_rf_compare) |>
autofit() |>
set_caption("Comparison of Decision Tree and Random Forest Models")
| Model | Outcome | Accuracy | ROC_AUC | Notes |
|---|---|---|---|---|
| Decision Tree | grade_level | 0.886 | 0.500 | Predicts all High; no useful splits |
| Random Forest | grade_level | 0.884 | 0.727 | Moderate AUC; finds more Low; nonlinear patterns captured |
While the single decision tree collapses into a majority-class classifier with no useful splits (AUC = 0.50), the random forest provides much better discrimination (AUC = 0.73), identifies more Low-performing students, and captures nonlinear patterns, highlighting the importance of ensemble methods for imbalanced educational data.
Working with the PFI dataset presented several challenges. The data were highly categorical, requiring substantial preprocessing and producing a large number of dummy-coded predictors. Class imbalance between High- and Low-performing students caused many classification models to predict only the dominant class, making it difficult to identify struggling students. Regression models explained only a small portion of the variability in homework hours, suggesting that the available variables did not fully capture the behaviors influencing student effort. More complex approaches, such as PCA regression and decision trees, did not improve performance, underscoring the difficulty of modeling educational outcomes from survey responses alone.
The analysis of the PFI dataset shows that homework behavior and socioeconomic status are the strongest and most consistent predictors of academic performance. EDA revealed clear grade gaps across homework frequency, parent education, income, and student race, patterns that were confirmed across regression and classification models. Regression models demonstrated that while homework and SES variables significantly influence homework hours, they explain only a modest portion of the variation, suggesting additional unmeasured factors contribute to student effort. Classification models were better at separating high-performing students than low-performing ones, reflecting strong class imbalance, but Random Forest achieved the best overall discrimination with an AUC of 0.73 and meaningful variable importance results. Overall, the findings show that homework consistency, parental involvement, and socioeconomic advantage play a central role in shaping academic outcomes, while more detailed academic data would be needed to reliably identify struggling students.
Insights From EDA

- Students who complete homework more frequently and spend more hours on homework are far more likely to earn A/B grades.
- Strong socioeconomic status patterns emerged: students from higher-income households and higher parent-education backgrounds show higher achievement.
- Family engagement (volunteering, attending meetings, fundraising) is positively associated with stronger grades.
- Racial disparities appear, with Asian and White students reporting higher A/B rates than other groups.
Insights From Regression Models

- Multiple Linear Regression and Lasso identified homework_days, homework_hours, check_homework, and parent_edu as the strongest predictors of homework effort.
- All regression models explained only 20–27% of the variance in homework hours. Polynomial and PCA regression offered no improvement, confirming mostly linear relationships.
- Lasso improved interpretability by shrinking weaker predictors, but its predictive accuracy remained similar to MLR.
Insights From Classification Models

- Multinomial Logistic Regression achieved ~59% accuracy and could distinguish A/B students reasonably well, but struggled with C and D/F due to imbalance.
- Binary Logistic Regression and LDA achieved high accuracy (~88%) but had very low sensitivity, predicting almost all students as High performers.
- The Decision Tree collapsed into a majority-class model because of imbalance.
- Random Forest performed best, reaching ROC AUC ≈ 0.73 and identifying more Low-performing students than the other models.
Most Important Predictors (Across All Models)

- homework_days (days per week homework is done)
- homework_hours (weekly hours spent)
- check_homework (parent checking frequency)
- help_homework
- income_group
- parent_edu
- race
- School choice and engagement factors (weaker but still relevant)

These predictors repeatedly appeared in the Lasso coefficients, Random Forest importance scores, and multinomial coefficients.
Best Model

- Random Forest is the best-performing classification model.
- It achieved the highest ROC AUC (≈ 0.73), handled nonlinear effects, and identified more Low-performing students than logistic regression, LDA, or decision trees.
- It provided meaningful variable-importance rankings that align with the EDA and regression results.
- However, even Random Forest struggled with sensitivity due to the extreme class imbalance, so it is best viewed as a descriptive model rather than an early-warning system.
1. Describe probability as a foundation of statistical modeling, including inference and maximum likelihood estimation.
In this project, I used probability ideas when interpreting regression and classification models. Logistic regression and LDA rely on maximum likelihood estimation, which helped me understand how model parameters are estimated. Checking residuals, uncertainty, and variability also showed how probability underlies all statistical inference.
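For a binary outcome, for example, glm() estimates the logistic regression coefficients by maximizing the Bernoulli log-likelihood (a standard result, stated here for reference):

$$
\ell(\beta) = \sum_{i=1}^{n} \Big[\, y_i \log p_i + (1 - y_i)\log(1 - p_i) \,\Big],
\qquad p_i = \frac{1}{1 + e^{-x_i^{\top}\beta}}
$$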
2. Apply the appropriate generalized linear model for a specific data context
I used linear models for homework_hours, multinomial logistic regression for grade categories, and logistic regression/LDA for High vs Low grades. Each model matched the type of outcome I was analyzing.
3. Demonstrate model selection given a set of candidate models
I compared models using cross-validation, test-set metrics, and tuning. LASSO showed how penalization selects important predictors. Random Forest and decision trees were compared to GLMs using accuracy, ROC AUC, and confusion matrices. This helped me choose which model worked best for each task.
4. Express the results of statistical models to a general audience
Throughout the portfolio, I explained results in plain language and focused on what the findings mean in practical terms. EDA graphics and visuals such as confusion matrices, ROC curves, and variable importance plots helped communicate the results to a general audience.
5. Use programming software to fit and assess statistical models
I used tidymodels to preprocess data, build models, tune parameters, run cross-validation, and evaluate model performance. Recipes, workflows, and resampling made the modeling process organized, consistent, and reproducible.
- Explore class-balancing methods such as SMOTE, oversampling, undersampling, or class weights to improve detection of Low-performing students (a minimal sketch follows this list).
- Incorporate richer academic data (prior grades, attendance, teacher evaluations) to strengthen predictive accuracy.
- Test more advanced machine learning models, such as gradient boosting or XGBoost, to better capture nonlinear relationships.
- Investigate interaction effects between homework behavior, socioeconomic status, and family engagement for a deeper understanding of performance patterns.
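As a starting point for the first item above, the themis package plugs class-balancing steps directly into a tidymodels recipe. A minimal sketch, assuming themis is installed; step_smote() requires numeric predictors, so it is placed after dummy encoding:

# Sketch: add SMOTE oversampling to the binary grade_level recipe
library(themis)

balanced_recipe <- recipe(grade_level ~ ., data = logit_train) |>
  step_zv(all_predictors()) |>
  step_impute_median(all_numeric_predictors()) |>
  step_impute_mode(all_nominal_predictors()) |>
  step_dummy(all_nominal_predictors(), one_hot = TRUE) |>
  step_smote(grade_level) # synthesizes minority-class rows; applied to training data only

# Reuse the existing logistic spec with the balanced recipe
balanced_wf <- workflow() |>
  add_recipe(balanced_recipe) |>
  add_model(logit_spec)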