Loading Data and Necessary Packages.

library(readxl)
library(discrim)
library(nnet)
library(glmnet)
library(vip)
library(broom)
library(kableExtra)
library(patchwork)
library(forcats)
library(themis)
library(poissonreg)
library(gt)
library(ggcorrplot)
library(car)
library(tidyverse)
library(tidymodels)
library(dplyr)

I begin by importing the curated 2019 PFI dataset and inspecting its structure.

Classification Problem

pfi_data <- read_excel("~/STA 631/Projects/pfi-data.xlsx", sheet = "curated 2019")
dim(pfi_data)
## [1] 15500    75
names(pfi_data)
##  [1] "BASMID"       "ALLGRADEX"    "EDCPUB"       "SCCHOICE"     "SPUBCHOIX"   
##  [6] "SCONSIDR"     "SCHLHRSWK"    "EINTNET"      "MOSTIMPT"     "INTNUM"      
## [11] "SEENJOY"      "SEGRADES"     "SEABSNT"      "SEGRADEQ"     "FSSPORTX"    
## [16] "FSVOL"        "FSMTNG"       "FSPTMTNG"     "FSATCNFN"     "FSFUNDRS"    
## [21] "FSCOMMTE"     "FSCOUNSLR"    "FSFREQ"       "FSNOTESX"     "FSMEMO"      
## [26] "FCSCHOOL"     "FCTEACHR"     "FCSTDS"       "FCORDER"      "FCSUPPRT"    
## [31] "FHHOME"       "FHWKHRS"      "FHAMOUNT"     "FHCAMT"       "FHPLACE"     
## [36] "FHCHECKX"     "FHHELP"       "FOSTORY2X"    "FOCRAFTS"     "FOGAMES"     
## [41] "FOBUILDX"     "FOSPORT"      "FORESPON"     "FOHISTX"      "FODINNERX"   
## [46] "FOLIBRAYX"    "FOBOOKSTX"    "HDHEALTH"     "CDOBMM"       "CDOBYY"      
## [51] "CSEX"         "CSPEAKX"      "HHTOTALXX"    "RELATION"     "P1REL"       
## [56] "P1SEX"        "P1MRSTA"      "P1EMPL"       "P1HRSWK"      "P1MTHSWRK"   
## [61] "P1AGE"        "P2GUARD"      "TTLHHINC"     "OWNRNTHB"     "CHLDNT"      
## [66] "SEFUTUREX"    "DSBLTY"       "HHPARN19X"    "HHPARN19_BRD" "NUMSIBSX"    
## [71] "PARGRADEX"    "RACEETH"      "INTACC"       "CENREG"       "ZIPLOCL"
# glimpse(pfi_data)

Research Question

This project examines how family involvement at school and at home is associated with student academic performance in the 2019 Parent and Family Involvement (PFI) survey. Academic performance is measured using SEGRADES, the parent-reported overall grades item. Family involvement is measured using school-based involvement items, contact with the school, and homework/home-engagement variables.

Select Variables + Response (SEGRADES)

pfi_clean <- pfi_data %>%
  mutate(across(where(is.numeric), ~ ifelse(. < 0, NA, .))) %>%
  dplyr::select(
    BASMID,
    SEGRADES, SEENJOY, SEABSNT, SEGRADEQ,
    FSSPORTX, FSVOL, FSMTNG, FSPTMTNG, FSATCNFN, FSFUNDRS, FSCOMMTE, FSCOUNSLR, FSFREQ,
    FSNOTESX, FSMEMO,
    FHHOME, FHWKHRS, FHAMOUNT, FHCAMT, FODINNERX,
    EDCPUB, SCCHOICE, SCONSIDR, SPUBCHOIX,
    ALLGRADEX, PARGRADEX, TTLHHINC, RACEETH,
    CSEX, DSBLTY, HHPARN19_BRD, OWNRNTHB, CENREG,
    HHTOTALXX, NUMSIBSX, P1AGE, P1EMPL, INTACC, ZIPLOCL
  ) %>%
  mutate(
    SEGRADES = ifelse(SEGRADES == 5, NA, SEGRADES),
    grade_multi = factor(
      SEGRADES,
      levels = c(1, 2, 3, 4),
      labels = c("Mostly_A", "Mostly_B", "Mostly_C", "Mostly_D_or_lower")
    ),
    high_achiever = factor(
      ifelse(SEGRADES == 1, "Yes", "No"),
      levels = c("Yes", "No")
    )
  )

Lets check for missingness

missingness_table <- pfi_clean %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
  mutate(percent_missing = round(100 * n_missing / nrow(pfi_clean), 1)) %>%
  arrange(desc(percent_missing))
missingness_table %>%
  slice_head(n = 20) %>%
  kbl(caption = "Top 20 Variables by Missingness") %>%
  kable_classic(full_width = FALSE)
Top 20 Variables by Missingness
variable n_missing percent_missing
SEGRADES 1967 12.7
grade_multi 1967 12.7
high_achiever 1967 12.7
FHWKHRS 1047 6.8
FHAMOUNT 1047 6.8
FHCAMT 1047 6.8
FSSPORTX 126 0.8
FSVOL 126 0.8
FSMTNG 126 0.8
FSPTMTNG 126 0.8
FSATCNFN 126 0.8
FSFUNDRS 126 0.8
FSCOMMTE 126 0.8
FSCOUNSLR 126 0.8
FSFREQ 126 0.8
FSNOTESX 53 0.3
FSMEMO 53 0.3
FHHOME 53 0.3
BASMID 0 0.0
SEENJOY 2 0.0
pfi_model <- pfi_clean %>%
  filter(
    !is.na(grade_multi),
    !is.na(high_achiever),
    !is.na(PARGRADEX),
    !is.na(TTLHHINC),
    !is.na(RACEETH),
    !is.na(CSEX)
  )
dim(pfi_model)
## [1] 13533    42

Lets add composite family-involvement variables

The questionnaire confirms those FS* variables are all school-based family involvement items, and FSFREQ is the count of how many times an adult in the household went to meetings or activities at the child’s school.

pfi_model <- pfi_model %>%
  mutate(
    across(
      c(FSSPORTX, FSVOL, FSMTNG, FSPTMTNG, FSATCNFN, FSFUNDRS, FSCOMMTE, FSCOUNSLR,
        FSNOTESX, FSMEMO),
      ~ case_when(
        . == 1 ~ 1,
        . == 2 ~ 0,
        TRUE ~ NA_real_
      )
    ),
    school_involve_score = rowSums(
      dplyr::select(., FSSPORTX, FSVOL, FSMTNG, FSPTMTNG, FSATCNFN, FSFUNDRS, FSCOMMTE, FSCOUNSLR),
      na.rm = TRUE
    ),
    school_contact_score = rowSums(
      dplyr::select(., FSNOTESX, FSMEMO),
      na.rm = TRUE
    )
  )
pfi_model <- pfi_model %>%
  mutate(
    CSEX = factor(CSEX, levels = c(1, 2), labels = c("Male", "Female")),
    RACEETH = factor(RACEETH, levels = c(1, 2, 3, 4, 5),
                     labels = c("White", "Black", "Hispanic", "Asian_PI", "Other")),
    PARGRADEX = factor(PARGRADEX,
                       labels = c("Less_than_HS", "HS", "Associate", "Bachelors", "Graduate")),
    TTLHHINC = factor(TTLHHINC, ordered = TRUE),
    EDCPUB = factor(EDCPUB),
    SCCHOICE = factor(SCCHOICE),
    SCONSIDR = factor(SCONSIDR),
    SPUBCHOIX = factor(SPUBCHOIX),
    DSBLTY = factor(DSBLTY),
    HHPARN19_BRD = factor(HHPARN19_BRD),
    OWNRNTHB = factor(OWNRNTHB),
    CENREG = factor(CENREG),
    P1EMPL = factor(P1EMPL),
    INTACC = factor(INTACC),
    ZIPLOCL = factor(ZIPLOCL),
    SEENJOY = factor(SEENJOY, ordered = TRUE),
    SEABSNT = factor(SEABSNT, ordered = TRUE)
  )

Exploratory Data Analysis

ggplot(pfi_model, aes(x = grade_multi)) +
  geom_bar(fill = "#2C7FB8") +
  labs(
    title = "Distribution of Student Grades",
    x = "SEGRADES",
    y = "Count"
  ) +
  theme_minimal()

The distribution of SEGRADES shows a strong imbalance, with most students reporting mostly A’s or B’s and relatively few reporting lower grades. This imbalance is important for model selection and evaluation.

ggplot(pfi_model, aes(x = grade_multi, y = school_involve_score)) +
  geom_boxplot(fill = "#A6CEE3") +
  labs(
    title = "School Involvement Score by Student Grades",
    x = "SEGRADES",
    y = "School Involvement Score"
  ) +
  theme_minimal()

School involvement scores do not show a strong positive relationship with higher academic performance; in fact, students with lower grades often have similar or slightly higher involvement levels, suggesting that involvement may be reactive rather than preventative.

ggplot(pfi_model, aes(x = grade_multi, y = FSFREQ)) +
  geom_boxplot(fill = "#B2DF8A") +
  labs(
    title = "Frequency of School Meetings/Activities by Student Grades",
    x = "SEGRADES",
    y = "FSFREQ"
  ) +
  theme_minimal()

The frequency of school meetings and activities shows a weak positive association with academic performance, but the large variability and presence of extreme values suggest that this relationship is not strong and that the distribution is highly skewed.

ggplot(pfi_model, aes(x = FHHOME, fill = grade_multi)) +
  geom_bar(position = "fill") +
  labs(
    title = "Grade Distribution by Homework Frequency",
    x = "How often homework is done outside school",
    y = "Proportion"
  ) +
  theme_minimal()

Homework frequency shows a clear positive relationship with academic performance, with higher frequencies associated with a larger proportion of students receiving mostly A’s. This suggests that home-based engagement may be a stronger predictor of academic success than school-based involvement.

Data splitting for modeling

set.seed(631)

class_df <- pfi_model %>%
  dplyr::select(
    high_achiever, grade_multi, SEGRADES,
    school_involve_score, school_contact_score, FSFREQ,
    FHHOME, FHWKHRS, FHAMOUNT, FHCAMT, FODINNERX,
    SEENJOY, SEABSNT,
    EDCPUB, SCCHOICE, SCONSIDR, SPUBCHOIX,
    PARGRADEX, TTLHHINC, RACEETH, CSEX, DSBLTY,
    HHPARN19_BRD, OWNRNTHB, CENREG, HHTOTALXX, NUMSIBSX, P1AGE, P1EMPL, INTACC, ZIPLOCL
  ) %>%
  drop_na()

class_split <- initial_split(class_df, prop = 0.75, strata = high_achiever)
class_train <- training(class_split)
class_test  <- testing(class_split)
set.seed(631)
class_folds <- vfold_cv(class_train, v = 10, strata = high_achiever)

count_df <- pfi_model %>%
  dplyr::select(
    FSFREQ,
    school_involve_score, school_contact_score,
    FHHOME, FHWKHRS, FODINNERX,
    PARGRADEX, TTLHHINC, RACEETH, CSEX, DSBLTY,
    SCCHOICE, EDCPUB
  ) %>%
  drop_na()

count_split <- initial_split(count_df, prop = 0.75)
count_train <- training(count_split)
count_test  <- testing(count_split)
multi_df <- class_df %>%
  dplyr::select(
    grade_multi,
    school_involve_score, school_contact_score, FSFREQ,
    FHHOME, FHWKHRS, FODINNERX, SEENJOY, SEABSNT,
    SCCHOICE, EDCPUB, PARGRADEX, TTLHHINC, RACEETH, CSEX, DSBLTY
  ) %>%
  drop_na()

set.seed(631)
multi_split <- initial_split(multi_df, prop = 0.75, strata = grade_multi)

multi_train <- training(multi_split)
multi_test  <- testing(multi_split)

Lasso Regression

lasso_recipe <- recipe(high_achiever ~ ., data = class_train) %>%
  step_rm(grade_multi, SEGRADES) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors()) %>%
  step_normalize(all_numeric_predictors())
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

lasso_wf <- workflow() %>%
  add_recipe(lasso_recipe) %>%
  add_model(lasso_spec)

lambda_grid <- grid_regular(penalty(), levels = 40)

set.seed(631)
lasso_res <- tune_grid(
  lasso_wf,
  resamples = class_folds,
  grid = lambda_grid,
  metrics = metric_set(roc_auc, accuracy),
  control = control_grid(save_pred = TRUE)
)
collect_metrics(lasso_res) %>%
  ggplot(aes(x = penalty, y = mean, color = .metric)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ .metric, scales = "free_y", ncol = 1) +
  scale_x_log10() +
  theme_minimal() +
  labs(title = "Lasso Tuning Results", x = "Penalty", y = "Cross-validated performance")

best_penalty <- select_best(lasso_res, metric = "roc_auc")

final_lasso_wf <- finalize_workflow(lasso_wf, best_penalty)

lasso_last_fit <- last_fit(final_lasso_wf, class_split)

lasso_preds <- collect_predictions(lasso_last_fit)
collect_metrics(lasso_last_fit)
## # A tibble: 3 × 4
##   .metric     .estimator .estimate .config             
##   <chr>       <chr>          <dbl> <chr>               
## 1 accuracy    binary         0.715 Preprocessor1_Model1
## 2 roc_auc     binary         0.773 Preprocessor1_Model1
## 3 brier_class binary         0.191 Preprocessor1_Model1

The lasso logistic model achieved an accuracy of 71.5% and an ROC AUC of 0.773 on the test set, indicating moderate predictive ability in distinguishing high-achieving students from others. Performance remained stable across small penalty values but declined sharply at larger penalties, suggesting that overly aggressive regularization removed too much predictive information.

final_lasso_fit <- fit(final_lasso_wf, data = class_train)

lasso_coefs <- final_lasso_fit %>%
  extract_fit_parsnip() %>%
  tidy()

lasso_coefs_nonzero <- lasso_coefs %>%
  filter(term != "(Intercept)", estimate != 0) %>%
  arrange(desc(abs(estimate)))

lasso_coefs_nonzero %>%
  slice_head(n = 20) %>%
  knitr::kable(caption = "Top Lasso-Selected Predictors")
Top Lasso-Selected Predictors
term estimate penalty
SEENJOY_1 0.4124879 0.0049239
DSBLTY_X2 -0.3599257 0.0049239
PARGRADEX_Graduate -0.3021157 0.0049239
CSEX_Female -0.2390815 0.0049239
TTLHHINC_01 -0.2087388 0.0049239
RACEETH_Black 0.1939437 0.0049239
RACEETH_Hispanic 0.1761597 0.0049239
SEENJOY_2 -0.1480184 0.0049239
FHHOME -0.1448516 0.0049239
PARGRADEX_Bachelors -0.1344068 0.0049239
FHCAMT 0.1291649 0.0049239
SEABSNT_1 0.0760875 0.0049239
HHPARN19_BRD_X2 0.0718702 0.0049239
P1AGE 0.0649976 0.0049239
RACEETH_Asian_PI -0.0640073 0.0049239
CENREG_X4 0.0613216 0.0049239
PARGRADEX_HS 0.0566932 0.0049239
CENREG_X2 -0.0492400 0.0049239
school_contact_score -0.0451594 0.0049239
ZIPLOCL_X32 -0.0356906 0.0049239

The lasso model selected a relatively small set of predictors, indicating that only a subset of the available family, school, and demographic variables contributed meaningful predictive information. Among the retained predictors, student school experience variables such as school enjoyment and absenteeism, along with household background characteristics such as parental education, income, disability status, and demographic factors, showed the strongest effects. Importantly for the research question, home-based involvement variables such as homework-related engagement (FHHOME, FHCAMT) remained in the model, while school contact was retained with a smaller coefficient. This suggests that family involvement at home may be more strongly associated with academic performance than school-based involvement, although broader student and household characteristics also play a major role.

race/ethnicity remained in the selected model, indicating differences in predicted academic performance across groups after accounting for other retained covariates.

Confusion Matrix for lasso:

lasso_preds %>%
  conf_mat(truth = high_achiever, estimate = .pred_class) %>%
  autoplot(type = "heatmap") +
  labs(title = "Lasso Confusion Matrix")

Standard Logistic Regression Model

log_formula <- high_achiever ~ school_involve_score + school_contact_score + FSFREQ +
  FHHOME + FHWKHRS + FODINNERX + SEENJOY + SEABSNT +
  SCCHOICE + EDCPUB + PARGRADEX + TTLHHINC + RACEETH + CSEX + DSBLTY

log_recipe <- recipe(log_formula, data = class_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

log_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")

log_wf <- workflow() %>%
  add_recipe(log_recipe) %>%
  add_model(log_spec)

log_last_fit <- last_fit(log_wf, class_split)
log_preds <- collect_predictions(log_last_fit)

collect_metrics(log_last_fit)
## # A tibble: 3 × 4
##   .metric     .estimator .estimate .config             
##   <chr>       <chr>          <dbl> <chr>               
## 1 accuracy    binary         0.708 Preprocessor1_Model1
## 2 roc_auc     binary         0.769 Preprocessor1_Model1
## 3 brier_class binary         0.193 Preprocessor1_Model1

The standard logistic regression model achieved an accuracy of 70.8%, an ROC AUC of 0.769, and a Brier score of 0.193 on the test set. These results were very similar to the lasso logistic model, though the lasso model performed slightly better on all three metrics. This indicates that reducing the predictor set through regularization did not harm predictive performance and may have improved out-of-sample generalization. Because the lasso model is also more parsimonious, it provides a strong candidate for the preferred binary classification model

metric_set(accuracy, precision, recall, f_meas, roc_auc)(
  log_preds,
  truth = high_achiever,
  estimate = .pred_class,
  .pred_Yes
)
## # A tibble: 5 × 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 accuracy  binary         0.708
## 2 precision binary         0.723
## 3 recall    binary         0.790
## 4 f_meas    binary         0.755
## 5 roc_auc   binary         0.769

The higher recall relative to precision suggests that the model is better at identifying true positive cases than at avoiding false positive classifications.

log_fit_object <- fit(log_wf, data = class_train)

tidy(extract_fit_parsnip(log_fit_object), conf.int = TRUE, exponentiate = TRUE) %>%
  kbl(caption = "Logistic Regression Odds Ratios") %>%
  kable_classic(full_width = FALSE)
Logistic Regression Odds Ratios
term estimate std.error statistic p.value conf.low conf.high
(Intercept) 13.1729145 0.2618645 9.8454082 0.0000000 7.8978791 22.0481314
school_involve_score 1.0156403 0.0136713 1.1351735 0.2563026 0.9887850 1.0432257
school_contact_score 0.8664530 0.0398616 -3.5961306 0.0003230 0.8012676 0.9367968
FSFREQ 0.9979975 0.0027483 -0.7293740 0.4657729 0.9925799 1.0033392
FHHOME 0.8546945 0.0306294 -5.1261533 0.0000003 0.8048561 0.9075457
FHWKHRS 0.9995538 0.0050493 -0.0883976 0.9295606 0.9896971 1.0094923
FODINNERX 0.9694966 0.0117818 -2.6293311 0.0085553 0.9473583 0.9921416
SEENJOY_1 3.8365627 0.1278157 10.5196527 0.0000000 2.9982910 4.9524917
SEENJOY_2 0.6062330 0.1025449 -4.8807028 0.0000011 0.4970689 0.7434014
SEENJOY_3 0.8658592 0.0700573 -2.0559320 0.0397891 0.7547893 0.9934586
SEABSNT_1 1.5679110 0.1913098 2.3508677 0.0187297 1.0863142 2.3050736
SEABSNT_2 1.0533276 0.1571315 0.3306424 0.7409146 0.7780781 1.4426190
SEABSNT_3 1.1609103 0.1153429 1.2935728 0.1958130 0.9262671 1.4562678
SCCHOICE_X2 1.0077543 0.0498453 0.1549681 0.8768465 0.9138807 1.1110985
EDCPUB_X2 0.9523333 0.0814732 -0.5994636 0.5488637 0.8112800 1.1166176
PARGRADEX_HS 0.9298361 0.1261089 -0.5768581 0.5640353 0.7255995 1.1897746
PARGRADEX_Associate 0.7165266 0.1183000 -2.8177500 0.0048361 0.5676212 0.9026967
PARGRADEX_Bachelors 0.5016405 0.1231958 -5.5997968 0.0000000 0.3936177 0.6380956
PARGRADEX_Graduate 0.3470300 0.1281318 -8.2598099 0.0000000 0.2696862 0.4457235
TTLHHINC_01 0.4411260 0.1195522 -6.8457514 0.0000000 0.3488658 0.5574590
TTLHHINC_02 1.0302578 0.0989166 0.3013557 0.7631433 0.8487545 1.2508639
TTLHHINC_03 1.0206128 0.0902301 0.2261243 0.8211047 0.8548084 1.2175973
TTLHHINC_04 0.8815719 0.0944977 -1.3338814 0.1822427 0.7325351 1.0610139
TTLHHINC_05 1.1176032 0.0949471 1.1710357 0.2415844 0.9282033 1.3468123
TTLHHINC_06 0.8368187 0.0935647 -1.9040075 0.0569092 0.6966818 1.0054220
TTLHHINC_07 1.1341144 0.0896348 1.4040528 0.1603031 0.9516037 1.3522888
TTLHHINC_08 0.9868376 0.0842962 -0.1571808 0.8751023 0.8365292 1.1641288
TTLHHINC_09 0.9354899 0.0786952 -0.8473820 0.3967822 0.8017291 1.0914664
TTLHHINC_10 1.0105759 0.0755992 0.1391597 0.8893240 0.8713862 1.1719888
TTLHHINC_11 0.9343384 0.0812248 -0.8361565 0.4030669 0.7968118 1.0955985
RACEETH_Black 2.1653251 0.0845965 9.1324155 0.0000000 1.8351176 2.5568759
RACEETH_Hispanic 1.6536256 0.0625065 8.0466858 0.0000000 1.4630556 1.8693184
RACEETH_Asian_PI 0.7581050 0.1056804 -2.6204810 0.0087806 0.6149963 0.9308290
RACEETH_Other 1.0780326 0.0972970 0.7722503 0.4399662 0.8902733 1.3038168
CSEX_Female 0.5877427 0.0475344 -11.1806698 0.0000000 0.5354051 0.6450757
DSBLTY_X2 0.3943715 0.0569492 -16.3384592 0.0000000 0.3526254 0.4408335

Confusion Matrix for logistic:

log_preds %>%
  conf_mat(truth = high_achiever, estimate = .pred_class) %>%
  autoplot(type = "heatmap") +
  labs(title = "Logistic Regression Confusion Matrix")

log_preds %>%
  roc_curve(truth = high_achiever, .pred_Yes) %>%
  autoplot() +
  labs(title = "Logistic Regression ROC Curve")

Multinomial Model

multi_recipe <- recipe(grade_multi ~ ., data = multi_train) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

multi_spec <- multinom_reg() %>%
  set_engine("nnet") %>%
  set_mode("classification")

multi_wf <- workflow() %>%
  add_recipe(multi_recipe) %>%
  add_model(multi_spec)

multi_last_fit <- last_fit(multi_wf, multi_split)

collect_metrics(multi_last_fit)
## # A tibble: 3 × 4
##   .metric     .estimator .estimate .config             
##   <chr>       <chr>          <dbl> <chr>               
## 1 accuracy    multiclass     0.624 Preprocessor1_Model1
## 2 roc_auc     hand_till      0.712 Preprocessor1_Model1
## 3 brier_class multiclass     0.239 Preprocessor1_Model1

The multinomial model is harder to fit well than the binary models, but it is the more appropriate model for the full SEGRADES response because it preserves all four academic-performance categories instead of collapsing them into a yes/no outcome.

Multinomial Confusion Matrix

multi_preds <- collect_predictions(multi_last_fit)

multi_cm <- multi_preds %>%
  conf_mat(truth = grade_multi, estimate = .pred_class)

multi_cm
##                    Truth
## Prediction          Mostly_A Mostly_B Mostly_C Mostly_D_or_lower
##   Mostly_A              1553      612       87                13
##   Mostly_B               249      393      138                27
##   Mostly_C                17       36       43                18
##   Mostly_D_or_lower        1        1        2                 2
autoplot(multi_cm, type = "heatmap") +
  labs(title = "Multinomial Regression Confusion Matrix")

The multinomial confusion matrix shows that the model performs best on the more common A and B categories and struggles more with the less common C and D-or-lower categories, which is consistent with the class imbalance observed in the outcome distribution.

Linear and Quadratic Discriminant Analysis

disc_recipe <- recipe(
  high_achiever ~ school_involve_score + school_contact_score + FSFREQ +
    FHWKHRS + FODINNERX + P1AGE + HHTOTALXX + NUMSIBSX,
  data = class_train
) %>%
  step_normalize(all_numeric_predictors()) %>%
  step_zv(all_predictors())

lda_spec <- discrim_linear() %>%
  set_engine("MASS") %>%
  set_mode("classification")

qda_spec <- discrim_quad() %>%
  set_engine("MASS") %>%
  set_mode("classification")

lda_wf <- workflow() %>% add_recipe(disc_recipe) %>% add_model(lda_spec)
qda_wf <- workflow() %>% add_recipe(disc_recipe) %>% add_model(qda_spec)

lda_fit <- last_fit(lda_wf, class_split)
qda_fit <- last_fit(qda_wf, class_split)

collect_metrics(lda_fit)
## # A tibble: 3 × 4
##   .metric     .estimator .estimate .config             
##   <chr>       <chr>          <dbl> <chr>               
## 1 accuracy    binary         0.607 Preprocessor1_Model1
## 2 roc_auc     binary         0.617 Preprocessor1_Model1
## 3 brier_class binary         0.236 Preprocessor1_Model1
collect_metrics(qda_fit)
## # A tibble: 3 × 4
##   .metric     .estimator .estimate .config             
##   <chr>       <chr>          <dbl> <chr>               
## 1 accuracy    binary         0.624 Preprocessor1_Model1
## 2 roc_auc     binary         0.648 Preprocessor1_Model1
## 3 brier_class binary         0.234 Preprocessor1_Model1

Confusion matrix for lda

lda_preds <- collect_predictions(lda_fit)

# Confusion matrix
lda_preds %>%
  conf_mat(truth = high_achiever, estimate = .pred_class)
##           Truth
## Prediction  Yes   No
##        Yes 1612 1044
##        No   209  327
# Heatmap
lda_preds %>%
  conf_mat(truth = high_achiever, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

Confusion matrix for qda

qda_preds <- collect_predictions(qda_fit)

qda_preds %>%
  conf_mat(truth = high_achiever, estimate = .pred_class)
##           Truth
## Prediction  Yes   No
##        Yes 1407  786
##        No   414  585
qda_preds %>%
  conf_mat(truth = high_achiever, estimate = .pred_class) %>%
  autoplot(type = "heatmap")

metric_set(accuracy, precision, recall, f_meas, roc_auc)(
  lda_preds,
  truth = high_achiever,
  estimate = .pred_class,
  .pred_Yes
)
## # A tibble: 5 × 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 accuracy  binary         0.607
## 2 precision binary         0.607
## 3 recall    binary         0.885
## 4 f_meas    binary         0.720
## 5 roc_auc   binary         0.617
metric_set(accuracy, precision, recall, f_meas, roc_auc)(
  qda_preds,
  truth = high_achiever,
  estimate = .pred_class,
  .pred_Yes
)
## # A tibble: 5 × 3
##   .metric   .estimator .estimate
##   <chr>     <chr>          <dbl>
## 1 accuracy  binary         0.624
## 2 precision binary         0.642
## 3 recall    binary         0.773
## 4 f_meas    binary         0.701
## 5 roc_auc   binary         0.648

Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) were fit as alternative binary classifiers for high academic achievement. LDA achieved 60.7% accuracy, 88.5% recall, and an ROC AUC of 0.617, while QDA achieved 62.4% accuracy, 77.3% recall, and an ROC AUC of 0.648. LDA showed the highest recall, indicating that it was more aggressive in identifying positive cases, but this came at the cost of lower precision and weaker overall discrimination.

QDA performed slightly better than LDA in accuracy, precision, and AUC, suggesting that allowing quadratic decision boundaries improved classification somewhat. However, both discriminant-analysis methods underperformed compared with the logistic and lasso logistic models, indicating that logistic-based approaches were better suited to this dataset.

Poisson Regression Model

Since FSFREQ is literally a count of how many times adults in the household went to meetings or activities at the child’s school, it is an excellent Poisson outcome.

count_train %>%
  summarise(
    mean_fsfreq = mean(FSFREQ),
    var_fsfreq = var(FSFREQ),
    ratio = var(FSFREQ) / mean(FSFREQ)
  ) %>%
  kbl(caption = "Poisson Dispersion Check for FSFREQ") %>%
  kable_classic(full_width = FALSE)
Poisson Dispersion Check for FSFREQ
mean_fsfreq var_fsfreq ratio
7.190098 92.28142 12.83451

If variance is much larger than the mean, I will note overdispersion. This will mean Poisson was demonstrated but may not fit perfectly.

pois_recipe <- recipe(
  FSFREQ ~ school_involve_score + school_contact_score + FHHOME + FHWKHRS +
    FODINNERX + PARGRADEX + TTLHHINC + RACEETH + CSEX + DSBLTY + SCCHOICE + EDCPUB,
  data = count_train
) %>%
  step_dummy(all_nominal_predictors()) %>%
  step_zv(all_predictors())

pois_spec <- poisson_reg() %>%
  set_engine("glm") %>%
  set_mode("regression")

pois_wf <- workflow() %>%
  add_recipe(pois_recipe) %>%
  add_model(pois_spec)

pois_fit <- fit(pois_wf, data = count_train)

pois_preds <- predict(pois_fit, new_data = count_test) %>%
  bind_cols(count_test)

rmse(pois_preds, truth = FSFREQ, estimate = .pred)
## # A tibble: 1 × 3
##   .metric .estimator .estimate
##   <chr>   <chr>          <dbl>
## 1 rmse    standard        8.53

The Poisson regression model was used to analyze FSFREQ, the number of times adults in the household attended meetings or activities at the child’s school. However, the dispersion check indicated substantial overdispersion: the mean of FSFREQ was 7.19, while the variance was 92.28, yielding a variance-to-mean ratio of 12.83. Because the Poisson model assumes equality of the mean and variance, this result indicates a serious violation of the Poisson assumption. The test-set RMSE of 8.53 was also large relative to the mean count, suggesting weak predictive accuracy. Therefore, while Poisson regression was demonstrated as required, it was not considered the preferred model for this count outcome.

Binary Model Comparison Table

lasso_metrics <- metric_set(accuracy, precision, recall, f_meas, roc_auc)(
  lasso_preds, truth = high_achiever, estimate = .pred_class, .pred_Yes
) %>%
  dplyr::select(.metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  mutate(model = "Lasso Logistic")

log_metrics <- metric_set(accuracy, precision, recall, f_meas, roc_auc)(
  log_preds, truth = high_achiever, estimate = .pred_class, .pred_Yes
) %>%
  dplyr::select(.metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  mutate(model = "Standard Logistic")

lda_preds <- collect_predictions(lda_fit)
lda_metrics <- metric_set(accuracy, roc_auc)(
  lda_preds, truth = high_achiever, estimate = .pred_class, .pred_Yes
) %>%
  dplyr::select(.metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  mutate(model = "LDA")

qda_preds <- collect_predictions(qda_fit)
qda_metrics <- metric_set(accuracy, roc_auc)(
  qda_preds, truth = high_achiever, estimate = .pred_class, .pred_Yes
) %>%
  dplyr::select(.metric, .estimate) %>%
  pivot_wider(names_from = .metric, values_from = .estimate) %>%
  mutate(model = "QDA")

bind_rows(lasso_metrics, log_metrics, lda_metrics, qda_metrics) %>%
  dplyr::select(model, everything()) %>%
  kbl(caption = "Binary Classification Model Comparison") %>%
  kable_classic(full_width = FALSE)
Binary Classification Model Comparison
model accuracy precision recall f_meas roc_auc
Lasso Logistic 0.7149123 0.7265042 0.8023064 0.7625261 0.7734983
Standard Logistic 0.7077068 0.7231156 0.7902252 0.7551824 0.7688812
LDA 0.6074561 NA NA NA 0.6173456
QDA 0.6240602 NA NA NA 0.6480891

All-model comparison table

lasso_summary <- tibble(
  model = "Lasso Logistic",
  outcome = "high_achiever",
  accuracy = 0.7149123,
  roc_auc = 0.7734983,
  comment = "Best binary model"
)

log_summary <- tibble(
  model = "Standard Logistic",
  outcome = "high_achiever",
  accuracy = 0.7077068,
  roc_auc = 0.7688812,
  comment = "Close second"
)

lda_summary <- tibble(
  model = "LDA",
  outcome = "high_achiever",
  accuracy = 0.6074561,
  roc_auc = 0.6173456,
  comment = "High recall, weak AUC"
)

qda_summary <- tibble(
  model = "QDA",
  outcome = "high_achiever",
  accuracy = 0.6240602,
  roc_auc = 0.6480891,
  comment = "Better than LDA, below logistic"
)

multi_summary <- tibble(
  model = "Multinomial Regression",
  outcome = "grade_multi",
  accuracy = 0.6237469,
  roc_auc = 0.7119616,
  comment = "Best full-outcome model"
)

pois_summary <- tibble(
  model = "Poisson Regression",
  outcome = "FSFREQ",
  accuracy = NA_real_,
  roc_auc = NA_real_,
  comment = "Rejected due to overdispersion; RMSE = 8.53"
)

bind_rows(
  lasso_summary, log_summary, multi_summary,
  lda_summary, qda_summary, pois_summary
) %>%
  kbl(
    caption = "Model Comparison Summary",
    digits = 3
  ) %>%
  kable_classic(full_width = FALSE)
Model Comparison Summary
model outcome accuracy roc_auc comment
Lasso Logistic high_achiever 0.715 0.773 Best binary model
Standard Logistic high_achiever 0.708 0.769 Close second
Multinomial Regression grade_multi 0.624 0.712 Best full-outcome model
LDA high_achiever 0.607 0.617 High recall, weak AUC
QDA high_achiever 0.624 0.648 Better than LDA, below logistic
Poisson Regression FSFREQ NA NA Rejected due to overdispersion; RMSE = 8.53
  • Because the grade-based outcomes are imbalanced, ROC AUC is emphasized more heavily than confusion-matrix accuracy when identifying the best classification models. Accuracy is still reported for completeness, but it can be misleading when most observations belong to the majority classes.

  • The lasso model was used as a variable-selection tool to identify the most important family involvement, school contact, and demographic predictors of high academic achievement. Because lasso shrinks weak coefficients to zero, it provides a parsimonious starting point for final model interpretation.

  • The standard logistic model provides interpretable odds ratios. Predictors with odds ratios greater than 1 are associated with higher odds of being a high achiever, holding other variables constant.

  • The multinomial model is the most direct model for the original SEGRADES response because it preserves all four academic performance categories rather than collapsing them into a binary outcome.

  • Poisson regression was used to model the count outcome FSFREQ, the number of times adults in the household attended meetings or activities at the child’s school. This demonstrates the required count-modeling component of the project while remaining substantively tied to family involvement.

Preferred Models

Because the binary outcome high_achiever is imbalanced, model comparison emphasized ROC AUC in addition to confusion-matrix-based accuracy. Among the binary classifiers, the lasso logistic model is preferred because it achieved the strongest overall discrimination while also producing a more parsimonious set of predictors. The standard logistic model performed similarly and remains useful for interpretation, but lasso had a slight edge in predictive performance. For the original four-category SEGRADES outcome, the multinomial regression model is preferred because it preserves the full academic-performance structure rather than collapsing the response into a binary indicator. LDA and QDA were useful benchmark methods, but both underperformed relative to the logistic-based models. Poisson regression was demonstrated for the count outcome FSFREQ, but it was not retained as a preferred model because strong overdispersion indicated poor fit.

Limitations

This analysis used a curated PFI analytic file and unweighted tidymodels workflows. The NHES:2019 Parent and Family Involvement survey is a complex national survey with weighting, nonresponse adjustment, and imputation procedures, so these results should be interpreted as associations within the analytic sample rather than fully survey-weighted national estimates. In addition, because the data are observational and cross-sectional, the models identify predictive associations rather than causal effects. For example, higher school involvement may reflect reactive engagement when a student is already struggling, rather than a direct causal improvement in grades.

Recommendations for GVSU’s K-12 Connect

Across the exploratory analysis and the fitted models, home-based family engagement showed a clearer and more consistent relationship with academic performance than school-based involvement measures. In particular, homework-related engagement variables remained important in both the lasso-selected and regression-based models, whereas school event participation showed weaker and less consistent effects. Based on these findings, GVSU’s K-12 Connect should prioritize programs that help families build effective homework routines, maintain regular academic support at home, and strengthen day-to-day engagement with children’s schoolwork. School-based family involvement should still be encouraged, but the results suggest it may be more reactive than predictive when considered on its own.

Linear Regression Problem

Research Question

Can the carat weight of a diamond be determined from visually observable characteristics, without actually weighing the diamond?

Price is excluded because it is not visually observable.

In this project, the response variable is carat. The goal is to see whether visually observable variables can explain and predict carat well.

Load and Inspect the Data

I begin by loading the diamonds dataset and checking its structure.

# For the computer that set the seed
diamond_dataset <- ggplot2::diamonds

glimpse() shows the variable names and data types.

# Define the variable descriptions based on ?diamonds
diamond_metadata <- tibble(
  Variable = c("carat", "cut", "color", "clarity", "depth", "table", "x", "y", "z"),
  Description = c(
    "Weight of the diamond",
    "Quality of the cut (Fair, Good, Very Good, Premium, Ideal)",
    "Diamond colour, from D (best) to J (worst)",
    "A measurement of how clear the diamond is (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF)",
    "Total depth percentage = z / mean(x, y) = 2 * z / (x + y)",
    "Width of top of diamond relative to widest point",
    "Length in mm",
    "Width in mm",
    "Depth in mm"
  ),
  Type = c("Numeric (Target)", "Ordinal Factor", "Ordinal Factor", "Ordinal Factor", 
           "Numeric", "Numeric", "Numeric", "Numeric", "Numeric")
)
diamond_metadata %>%
  gt() %>%
  tab_header(
    title = "Data Dictionary: Diamonds Dataset",
    subtitle = "Visually Observable Variables for Predicting Carat Weight"
  ) %>%
  cols_label(
    Variable = md("**Variable Name**"),
    Description = md("**Definition**"),
    Type = md("**Data Type**")
  ) %>%
  tab_source_note(
    source_note = "Note: Price is excluded as it is not a visually observable characteristic."
  ) %>%
  opt_stylize(color = "blue", style = 6)
Data Dictionary: Diamonds Dataset
Visually Observable Variables for Predicting Carat Weight
Variable Name Definition Data Type
carat Weight of the diamond Numeric (Target)
cut Quality of the cut (Fair, Good, Very Good, Premium, Ideal) Ordinal Factor
color Diamond colour, from D (best) to J (worst) Ordinal Factor
clarity A measurement of how clear the diamond is (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF) Ordinal Factor
depth Total depth percentage = z / mean(x, y) = 2 * z / (x + y) Numeric
table Width of top of diamond relative to widest point Numeric
x Length in mm Numeric
y Width in mm Numeric
z Depth in mm Numeric
Note: Price is excluded as it is not a visually observable characteristic.
glimpse(diamond_dataset)
## Rows: 53,940
## Columns: 10
## $ carat   <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut     <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color   <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth   <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table   <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price   <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x       <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y       <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z       <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…

summary() gives quick numerical summaries for each variable.

summary(diamond_dataset)
##      carat               cut        color        clarity          depth      
##  Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00  
##  1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00  
##  Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80  
##  Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75  
##  3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50  
##  Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00  
##                                     J: 2808   (Other): 2531                  
##      table           price             x                y         
##  Min.   :43.00   Min.   :  326   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:56.00   1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720  
##  Median :57.00   Median : 2401   Median : 5.700   Median : 5.710  
##  Mean   :57.46   Mean   : 3933   Mean   : 5.731   Mean   : 5.735  
##  3rd Qu.:59.00   3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540  
##  Max.   :95.00   Max.   :18823   Max.   :10.740   Max.   :58.900  
##                                                                   
##        z         
##  Min.   : 0.000  
##  1st Qu.: 2.910  
##  Median : 3.530  
##  Mean   : 3.539  
##  3rd Qu.: 4.040  
##  Max.   :31.800  
## 

I also check the dimensions of our data

dim(diamond_dataset)
## [1] 53940    10
names(diamond_dataset)
##  [1] "carat"   "cut"     "color"   "clarity" "depth"   "table"   "price"  
##  [8] "x"       "y"       "z"

The dataset contains numerical variables (carat, depth, table, price, x, y, z) and categorical variables (cut, color, clarity). Since carat is numerical, this is a regression problem.

Exploratory Data Analysis

Completeness

sum(is.na(diamond_dataset))
## [1] 0

I do not have missing values in our data.

Univariate Plots

I first look at the distribution of the response variable.

ggplot(diamond_dataset, aes(x = carat)) +
  geom_histogram(bins = 30)

It looks skewd to the right

Next, I inspect the numerical predictors.

p1 <- ggplot(diamond_dataset, aes(x = x)) + geom_histogram(bins = 30)
p2 <- ggplot(diamond_dataset, aes(x = y)) + geom_histogram(bins = 30)
p3 <- ggplot(diamond_dataset, aes(x = z)) + geom_histogram(bins = 30)
p4 <- ggplot(diamond_dataset, aes(x = depth)) + geom_histogram(bins = 30)
p5 <- ggplot(diamond_dataset, aes(x = table)) + geom_histogram(bins = 30)

(p1 | p2) / (p3 | p4) / p5

I also inspect the categorical predictors.

p6 <- ggplot(diamond_dataset, aes(x = cut)) + geom_bar()
p7 <- ggplot(diamond_dataset, aes(x = color)) + geom_bar()
p8 <- ggplot(diamond_dataset, aes(x = clarity)) + geom_bar()

(p6 | p7) / p8

Bivariate Plots

I compare carat across the categorical predictors using boxplots.

p14 <- ggplot(diamond_dataset, aes(x = cut, y = carat)) + geom_boxplot()
p15 <- ggplot(diamond_dataset, aes(x = color, y = carat)) + geom_boxplot()
p16 <- ggplot(diamond_dataset, aes(x = clarity, y = carat)) + geom_boxplot()

(p14 | p15) / p16

Correlation Analysis

# 1. Prepare the data: Select only numerical variables and exclude price
numerical_diamonds <- diamond_dataset |>
  dplyr::select(carat, depth, table, x, y, z)

# 2. Calculate the correlation matrix
corr_matrix <- cor(numerical_diamonds)

# 3. Create the heatmap
ggcorrplot(corr_matrix, 
           hc.order = TRUE,
           type = "lower",
           lab = TRUE,
           lab_size = 4,
           method = "circle",
           colors = c("#E46726", "white", "#6D9EC1"),
           title = "Correlation Heatmap of Diamond Physical Characteristics",
           ggtheme = theme_minimal())

There is high correlation Carat and variables x, y, and z.

In contrast, the relationships between carat and depth and table are much weaker, suggesting that they are weak predictors and may not contribute much to explaining variation in carat.

The relationships among the explanatory variables themselves, there is strong evidence of collinearity, particularly among the dimension variables. The x, y, and z predictors show very tight linear relationships, and the corresponding correlations are extremely high. This indicates that these variables contain largely overlapping information, and including all of them in a regression model may lead to multicollinearity issues, such as unstable coefficient estimates and difficulty in interpreting the individual effects of each variable. On the other hand, depth and table show relatively weak correlations with the dimension variables and with carat.

Model Specification

The code chunk below is comment because it was run once, and files saved for reproducibility.

# 1. Set seed for your own reproducibility
set.seed(123)

# 2. Create the split (e.g., 75% training, 25% testing)
diamond_split <- initial_split(diamond_dataset, prop = 0.75)
train_data <- training(diamond_split)
test_data  <- testing(diamond_split)

I used an automated AIC-based search to ensure parsimony (the simplest model that does the best job).

# 1. Ensure models are defined
# Define the Null Model (Start with nothing but the intercept)
null_model <- lm(carat ~ 1, data = train_data)
full_model <- lm(carat ~ cut + color + clarity + depth + table + x + y + z, data = train_data)

# 2. Use the explicit 'stats::' prefix and pass formulas to scope
# Perform bidirectional selection
# direction = "both" is usually best for finding the global minimum AIC
step_model <- stats::step(
  null_model, 
  scope = list(lower = formula(null_model), upper = formula(full_model)), 
  direction = "both", 
  trace = 0
)

# 3. Check if it worked
summary(step_model)
## 
## Call:
## lm(formula = carat ~ x + depth + color + clarity + table + cut + 
##     y + z, data = train_data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -0.4762 -0.0664 -0.0284  0.0553  3.7629 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2.785e+00  4.472e-02 -62.276  < 2e-16 ***
## x            3.957e-01  4.239e-03  93.363  < 2e-16 ***
## depth        1.819e-02  5.873e-04  30.978  < 2e-16 ***
## color.L      4.543e-02  1.781e-03  25.505  < 2e-16 ***
## color.Q      2.375e-02  1.627e-03  14.597  < 2e-16 ***
## color.C     -1.918e-03  1.526e-03  -1.257  0.20887    
## color^4     -5.721e-03  1.404e-03  -4.075 4.61e-05 ***
## color^5      3.970e-03  1.327e-03   2.992  0.00277 ** 
## color^6     -8.187e-05  1.202e-03  -0.068  0.94569    
## clarity.L    6.630e-03  3.164e-03   2.095  0.03617 *  
## clarity.Q    4.456e-02  2.948e-03  15.114  < 2e-16 ***
## clarity.C   -2.457e-02  2.523e-03  -9.736  < 2e-16 ***
## clarity^4    1.759e-03  2.011e-03   0.875  0.38159    
## clarity^5   -3.004e-03  1.636e-03  -1.836  0.06639 .  
## clarity^6   -3.637e-03  1.422e-03  -2.558  0.01053 *  
## clarity^7    8.119e-03  1.253e-03   6.479 9.34e-11 ***
## table        2.180e-03  3.010e-04   7.241 4.55e-13 ***
## cut.L        1.021e-02  2.322e-03   4.395 1.11e-05 ***
## cut.Q       -9.047e-03  1.859e-03  -4.867 1.14e-06 ***
## cut.C        3.802e-04  1.603e-03   0.237  0.81256    
## cut^4        1.493e-03  1.283e-03   1.163  0.24466    
## y            5.515e-03  1.815e-03   3.038  0.00238 ** 
## z            1.390e-02  7.082e-03   1.963  0.04963 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1013 on 40432 degrees of freedom
## Multiple R-squared:  0.9544, Adjusted R-squared:  0.9544 
## F-statistic: 3.846e+04 on 22 and 40432 DF,  p-value: < 2.2e-16

Interpretation of carat_model Coefficients Coefficients will be interpreted on our final model

vif(step_model)
##              GVIF Df GVIF^(1/(2*Df))
## x       89.197308  1        9.444433
## depth    2.795640  1        1.672017
## color    1.156131  6        1.012163
## clarity  1.327080  7        1.020419
## table    1.790157  1        1.337968
## cut      1.945911  4        1.086777
## y       17.249856  1        4.153295
## z       95.464080  1        9.770572

The VIF values for x and z are high hence severe collinearity due to the correlation I saw between x,y, and z. I will drop them to avoid multicollinearity and to simplify the model to make the coefficients more stable and interpretable.

# Refined Model: Dropping y and z to fix collinearity
model_refined <- lm(carat ~ x + depth + table + cut + color + clarity, data = train_data)

# Check VIF again
vif(model_refined)
##             GVIF Df GVIF^(1/(2*Df))
## x       1.323890  1        1.150604
## depth   1.379036  1        1.174324
## table   1.789551  1        1.337741
## cut     1.924214  4        1.085255
## color   1.155730  6        1.012134
## clarity 1.325567  7        1.020335

The VIF values look fine now

Diagnostics for the Model

Residual Analysis

plot(model_refined, which = 1) # Residuals vs Fitted

The Residuals vs. Fitted plot revealed three major Potential Problems:

Non-linearity: The distinct U-shape in the residuals indicates that the relationship between the physical traits and carat weight is not a straight line.

Non-constant Variance (Heteroscedasticity): The fan-shaped distribution where residuals spread out as predicted values increase shows that the model’s error is not consistent across all diamond sizes

The plots identified specific observations (e.g., points 14102, 24575, and 34165) with high studentized residuals. These outliers can disproportionately pull the regression line, affecting the goodness of fit. Identifying and potentially removing these points is a critical step in refining the model quality.

Log transformation and outliers

Why it works:

Carat weight is a measure of volume. In geometry, volume scales cubically (\(length^3\)) rather than linearly. By applying a natural log (\(log\)) to both carat and the dimension x, I convert this power relationship into a linear one that the regression model can accurately map. Also, log-trsnaforming carat will take care of the right skewness observed earlier.

p1 <- ggplot(train_data, aes(x = carat)) + 
  geom_histogram(fill = "steelblue", color = "white") +
  labs(title = "Original Carat (Right-Skewed)")

p2 <- ggplot(train_data, aes(x = log(carat))) + 
  geom_histogram(fill = "darkred", color = "white") +
  labs(title = "Log-Transformed Carat (More Normal)")

p1 + p2 # Displays them side-by-side

I use a “Log-Log” model because it handles both the skewness of the target and the non-linear (cubic) relationship between length (\(x\)) and weight (\(carat\)).

Note

I have minimum values for x y and z as 0.00mm. Therefore log(x) will give -infinity, values that models cannot work with and for that reason I will omit the records with 0.00mm. Also for the soundness of our model, although the length, width and depth of a diamond might be small, it cannot be completely zero.

# Check for zeros in the physical dimensions
train_data %>% 
  filter(x == 0 | y == 0 | z == 0) %>%
  dplyr::select(carat, x, y, z)
## # A tibble: 14 × 4
##    carat     x     y     z
##    <dbl> <dbl> <dbl> <dbl>
##  1  1.01  6.66  6.6      0
##  2  0.71  0     0        0
##  3  1     0     0        0
##  4  2.2   8.42  8.37     0
##  5  2.25  0     0        0
##  6  2.02  8.02  7.95     0
##  7  1.12  6.71  6.67     0
##  8  0.71  0     0        0
##  9  1.2   0     0        0
## 10  1.56  0     0        0
## 11  1.15  6.88  6.83     0
## 12  2.8   8.9   8.85     0
## 13  1.14  0     0        0
## 14  2.18  8.49  8.45     0

I have 14 records where either x,y,z is zero. I will omit these records from our model for the reasons mentioned above.

# 1. Remove rows with 0 in x, y, or z
train_cleaned <- train_data %>%
  filter(x > 0, y > 0, z > 0)

# 2. Apply transformations to the cleaned data
train_transformed <- train_cleaned %>%
  mutate(
    log_carat = log(carat),
    log_x = log(x)
  )

# 3. Fit the model
final_log_model <- lm(log_carat ~ log_x + depth + table + cut + color + clarity, 
                      data = train_transformed)

summary(final_log_model)
## 
## Call:
## lm(formula = log_carat ~ log_x + depth + table + cut + color + 
##     clarity, data = train_transformed)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.69903 -0.01092  0.00037  0.01108  0.86828 
## 
## Coefficients:
##               Estimate Std. Error  t value Pr(>|t|)    
## (Intercept) -7.193e+00  7.267e-03 -989.815  < 2e-16 ***
## log_x        3.011e+00  5.912e-04 5093.274  < 2e-16 ***
## depth        2.264e-02  8.160e-05  277.476  < 2e-16 ***
## table        3.445e-03  5.956e-05   57.850  < 2e-16 ***
## cut.L        1.366e-03  4.595e-04    2.973 0.002951 ** 
## cut.Q       -8.263e-03  3.677e-04  -22.469  < 2e-16 ***
## cut.C        1.148e-02  3.167e-04   36.232  < 2e-16 ***
## cut^4        4.765e-03  2.534e-04   18.803  < 2e-16 ***
## color.L      8.081e-05  3.508e-04    0.230 0.817820    
## color.Q      1.566e-04  3.217e-04    0.487 0.626401    
## color.C     -1.011e-05  3.018e-04   -0.033 0.973276    
## color^4      4.114e-04  2.777e-04    1.482 0.138435    
## color^5     -6.198e-04  2.624e-04   -2.362 0.018196 *  
## color^6     -4.697e-04  2.378e-04   -1.975 0.048230 *  
## clarity.L    1.636e-03  6.267e-04    2.610 0.009058 ** 
## clarity.Q   -2.141e-03  5.836e-04   -3.669 0.000244 ***
## clarity.C    1.012e-03  4.995e-04    2.026 0.042734 *  
## clarity^4   -9.014e-04  3.980e-04   -2.265 0.023518 *  
## clarity^5    2.488e-04  3.237e-04    0.769 0.442053    
## clarity^6    5.397e-06  2.812e-04    0.019 0.984686    
## clarity^7    3.613e-04  2.479e-04    1.458 0.144955    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02004 on 40420 degrees of freedom
## Multiple R-squared:  0.9988, Adjusted R-squared:  0.9988 
## F-statistic: 1.719e+06 on 20 and 40420 DF,  p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(final_log_model)

The residuals now are centered at zero and the red line is now flat. This confirms that the log-log transformation successfully addressed the non-linearity.

The scale location shows that the points are now spread evenly and henece no hereskedasticity. The Q-Q plot shows residuals follow a normal distribution. Most points should now fall directly on the diagonal line.

Confirming Low Colinearlity (VIF)

vif(final_log_model)
##             GVIF Df GVIF^(1/(2*Df))
## log_x   1.320741  1        1.149235
## depth   1.378970  1        1.174296
## table   1.789984  1        1.337903
## cut     1.924221  4        1.085255
## color   1.146046  6        1.011425
## clarity 1.332610  7        1.020722

Outliers

# Generate studentized residuals and leverage (hat) values
model_diagnostics <- augment(final_log_model)

# Check for outliers
outliers <- model_diagnostics %>% 
  filter(abs(.std.resid) > 3)

# Check for high leverage
high_leverage <- model_diagnostics %>% 
  filter(.hat > (3 * mean(.hat)))
# --- STEP 1: CREATE THE CLEAN DATA ---
# This takes the results from augment and filters out rows with high residuals
train_final_no_outliers <- model_diagnostics %>%
  filter(abs(.std.resid) <= 3)

# --- STEP 2: FIT THE FINAL VERSION ---
final_model_v2 <- lm(log_carat ~ log_x + depth + table + cut + color + clarity, 
                     data = train_final_no_outliers)
summary(final_model_v2)
## 
## Call:
## lm(formula = log_carat ~ log_x + depth + table + cut + color + 
##     clarity, data = train_final_no_outliers)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.063762 -0.010897  0.000344  0.010944  0.061580 
## 
## Coefficients:
##               Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) -7.237e+00  6.253e-03 -1157.246  < 2e-16 ***
## log_x        3.011e+00  4.991e-04  6032.550  < 2e-16 ***
## depth        2.317e-02  7.017e-05   330.174  < 2e-16 ***
## table        3.649e-03  5.108e-05    71.446  < 2e-16 ***
## cut.L        2.410e-03  3.953e-04     6.095 1.11e-09 ***
## cut.Q       -8.459e-03  3.167e-04   -26.710  < 2e-16 ***
## cut.C        1.152e-02  2.701e-04    42.659  < 2e-16 ***
## cut^4        4.926e-03  2.146e-04    22.954  < 2e-16 ***
## color.L      1.788e-04  2.965e-04     0.603 0.546631    
## color.Q      2.588e-04  2.720e-04     0.951 0.341485    
## color.C      7.388e-05  2.550e-04     0.290 0.772035    
## color^4      4.201e-04  2.345e-04     1.792 0.073212 .  
## color^5     -5.866e-04  2.217e-04    -2.646 0.008142 ** 
## color^6     -4.460e-04  2.009e-04    -2.221 0.026390 *  
## clarity.L    1.919e-03  5.309e-04     3.615 0.000301 ***
## clarity.Q   -2.102e-03  4.945e-04    -4.250 2.14e-05 ***
## clarity.C    8.606e-04  4.231e-04     2.034 0.041958 *  
## clarity^4   -6.683e-04  3.367e-04    -1.985 0.047149 *  
## clarity^5    3.222e-04  2.735e-04     1.178 0.238746    
## clarity^6   -1.611e-04  2.374e-04    -0.678 0.497479    
## clarity^7    2.827e-04  2.093e-04     1.351 0.176654    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.01689 on 40245 degrees of freedom
## Multiple R-squared:  0.9992, Adjusted R-squared:  0.9992 
## F-statistic: 2.412e+06 on 20 and 40245 DF,  p-value: < 2.2e-16
# --- STEP 3: FINAL VISUAL CHECK ---
# These plots should look much "cleaner" than the previous ones
par(mfrow = c(2, 2))
plot(final_model_v2)

# --- STEP 4: FINAL TEST PREDICTION ---
# Create test_cleaned before running predictions
test_cleaned <- test_data %>%
  filter(x > 0, y > 0, z > 0) %>%
  mutate(
    log_carat = log(carat),
    log_x = log(x)
  )
# Use the NEW model (v2) to predict the test set
test_predictions <- predict(final_model_v2, newdata = test_cleaned)
actual_carat <- test_cleaned$carat
predicted_carat <- exp(test_predictions)

# Model Quality metric
mean(abs(actual_carat - predicted_carat))
## [1] 0.01124639

This means that on average, our model’s predictions are off by only about 0.01 carats, which is an exceptionally high level of precision for estimating weight without a scale.

Our final model

\[\log(\text{carat}) = -7.237 + 3.011 \cdot \log(x) + 0.02317 \cdot \text{depth} + 0.003649 \cdot \text{table} + \sum (\text{Qualitative Adjustments}) + \epsilon\]

Interpretation

The Intercept (-7.237):

This is the predicted value of \(\log(\text{carat})\) when all other numerical predictors are zero and categorical variables are at their reference levels.

The Physical Engine (\(3.011 \cdot \log(x)\)):

This is the most critical part of your model. It represents an elasticity—a 1% increase in the length (\(x\)) of a diamond leads to an approximate 3.011% increase in carat weight. This coefficient aligns with the geometric principle that volume (and thus weight) scales to the third power of length.

Shape Adjustments:

Depth (0.02317):

For every one-unit increase in depth percentage, the \(\log(\text{carat})\) increases by 0.02317.

Table (0.003649):

For every one-unit increase in table width, the \(\log(\text{carat})\) increases by 0.003649.

Qualitative Adjustments

(Cut, Color, Clarity):

These variables use polynomial contrasts (linear .L, quadratic .Q, cubic .C, etc.). For example:Cut: The significant linear (0.00241) and quadratic (-0.008459) terms indicate that the quality of the cut significantly shifts the weight-to-size ratio.Color and Clarity: While many of these higher-order terms are statistically significant (indicated by ***), their coefficients are very small (e.g., \(1.788 \times 10^{-4}\) for color.L), showing they provide fine-tuning to the prediction rather than driving the bulk of the weight.

Final Prediction Formula

If you want to predict the actual carat weight (not just the log), you must exponentiate the entire result:

\[\text{Predicted Carat} = e^{(-7.237 + 3.011 \cdot \log(x) + 0.02317 \cdot \text{depth} + 0.003649 \cdot \text{table} + \dots)}\]

Summary of Insights and Modeling Process1.

Summary of Insights In this analysis

I addressed the research question:

Can I accurately predict the carat weight of a diamond using only its physical dimensions and qualitative characteristics?

Our findings indicate a resounding yes. I found that carat weight is not just correlated with these features but is almost entirely determined by them, provided the relationship is modeled according to the laws of physical geometry.

The Power of Dimensionality:

I found that the length (\(x\)) of a diamond is the single most critical predictor. Because diamonds are three-dimensional objects, I observed that weight scales to the third power of length. Our model’s calculation of a \(3.01\) elasticity for the length variable confirms that our statistical approach aligns perfectly with physical reality.

Precision of Prediction:

By refining our model, I achieved a Mean Absolute Error (MAE) of 0.011 carats. This means that on average, our predictions are off by only about one-hundredth of a carat. For a jeweler or researcher without a scale, this model provides a highly reliable alternative for weight estimation.

The Role of Quality:

I found that while size (\(x\)) explains the bulk of the weight, qualitative traits like Cut, Color, and Clarity are statistically significant. This suggests that diamonds of higher clarity or specific cuts are shaped with different proportions that slightly alter the weight-to-size ratio compared to lower-grade stones.

2. Modeling Process Summary

To arrive at our final candidate model, I followed a rigorous iterative process to ensure the soundness of our approach and the quality of our results.Initial Model and Data Cleaning. I began by identifying “dirty” data specifically 14 records where physical dimensions were recorded as \(0.00\) mm. I omitted these records as they are physical impossibilities. Our initial linear model including all variables (\(x, y, z\), etc.) revealed two major issues:

Multicollinearity:

I found that \(x, y,\) and \(z\) were so highly correlated (VIFs \(> 80\)) that they made the model unstable. I resolved this by retaining \(x\) as our primary dimensional anchor.

Non-linearity:

Our initial residuals showed a distinct “U-shape,” proving that a straight line could not capture the relationship between size and weight.

Transformations and Assumptions

To satisfy the requirements for a high-quality linear regression, I implemented the following:

Log-Log Transformation:

I applied a natural log transformation to both the target (carat) and our main predictor (x). This “linearized” the relationship and successfully addressed the non-linearity and right-skewness of the data.

Fixing Heteroscedasticity:

The log transformation also stabilized the variance of our errors, moving us from a “fan-shaped” residual plot to a consistent, random distribution of points.

Final Refinement and Outliers

I used broom::augment to calculate studentized residuals and identified observations that were more than 3 standard deviations from the regression line. I found that these outliers were skewing our results; by removing them and refitting the model, I achieved a near-perfect Adjusted R-squared of 0.9992.

Course Learning Objectives:

1. Describe probability as a foundation of statistical modeling

In the Classification Project, the Poisson regression model was used to analyze the frequency of school meetings (FSFREQ)

This objective is demonstrated through the application and evaluation of the Poisson model, which is fundamentally rooted in probability theory. The Poisson distribution assumes that the mean (\(\lambda\)) and variance are equal. By performing a dispersion check (finding a mean of 7.19 vs. a variance of 92.28), I demonstrated an understanding of the underlying probability structure required for Maximum Likelihood Estimation (MLE).

The realization that the data violated these probabilistic assumptions (overdispersion) led to the conclusion that a standard Poisson model was not the optimal fit. This shows a deep understanding that statistical models are not just “plug-and-play” tools, but are built upon specific probabilistic foundations that must be verified to ensure valid inference.

2. Apply the appropriate generalized linear model for a specific data context

In the Classification Project, I utilized Multinomial Logistic Regression to model student grades (grade_multi) across four categories

This work demonstrates the ability to match a model to the nature of the response variable. While binary logistic regression is common, the student grade data was multi-categorical (Mostly A, B, C, or D). Using a multinomial GLM allowed for a more nuanced analysis that preserved the structure of the data rather than oversimplifying it into a “High/Low” binary.

Furthermore, in the Linear Regression Project, I applied a Log-Log transformation to handle the cubic relationship between a diamond’s length and its volume (carat). This shows the ability to adapt the General Linear Model framework to physical contexts recognizing that when the “data context” involves volume, a linear-linear model is mathematically inappropriate, whereas a log-transformation satisfies the model’s underlying assumptions.

3. Demonstrate model selection given a set of candidate models

In the Classification Project, I compared Lasso Regression, Standard Logistic Regression, LDA, and QDA

Model selection is about more than just picking the highest accuracy; it involves balancing parsimony, interpretability, and predictive power. By using Lasso Regression, I performed automated variable selection to identify which of the many family involvement variables actually contributed to student success. I then created a comprehensive comparison table using metrics like ROC AUC and Brier Score to evaluate the candidates.

I ultimately selected the Lasso Logistic model as the preferred binary classifier because it achieved the highest ROC AUC (0.773) while maintaining a simpler, regularized set of predictors. This systematic approach tuning hyperparameters, comparing cross-validated metrics, and selecting a “final” model—is the hallmark of professional statistical practice.

4. Express the results of statistical models to a general audience

The Recommendations for GVSU’s K-12 Connect section in the Classification Project.

GVSU’s K-12 Connect should prioritize programs that help families build effective homework routines. School event participation showed weaker and less consistent effects.

This objective requires translating complex statistical outputs (like odds ratios and p-values) into actionable insights. In my reports, I moved beyond technical jargon to explain what the data actually means for stakeholders. For example, in the diamond project, I explained that a 1% increase in length leads to a 3% increase in weight, which is far more intuitive for a jeweler than discussing coefficients in a log-log space.

By providing clear summaries, visual heatmaps for confusion matrices, and specific recommendations for community partners like K-12 Connect, I demonstrated the ability to bridge the gap between data science and real-world application. I focused on telling the story of the data specifically that home-based involvement is a stronger predictor of grades than school-based meetings.

5. Use programming software to fit and assess statistical models

The use of the tidymodels ecosystem in both projects to create reproducible workflows

Throughout both projects, I demonstrated proficiency in R and the tidymodels framework. This included complex data cleaning (handling missingness and filtering impossible \(0.00\) mm diamond dimensions), feature engineering (creating composite school involvement scores), and rigorous model validation (splitting data into training/testing sets and using 10-fold cross-validation).

In the diamond project, I utilized Polynomial Regression (via polynomial contrasts in the cut, color, and clarity factors) to capture non-linear trends in qualitative data. I also used diagnostic plots (Residuals vs. Fitted, Q-Q plots) and VIF scores to assess model health. This technical fluency ensured that the models were not only “fitted” but also “assessed” for violations of multicollinearity, heteroscedasticity, and influence.

Self Evaluation Letter

Dear Eric,

It is a privilege to reflect on the commitment I made to myself at the start of this semester. Looking back at my initial letter, my goals were centered on building a strong statistical foundation, gaining confidence in model selection, and creating a launchpad for my future in AI and research. Having completed significant projects in both linear and classification modeling, I feel that I have not only approached these goals but, in many technical aspects, surpassed them.

Reflecting on My Initial Goals and Progress

My first goal was to develop a strong foundation in statistical modeling. I successfully demonstrated this through my work on the Diamond Carat project. I moved beyond simple linear regression to implement a log-log transformation, recognizing that the relationship between a diamond’s length and its weight (carat) is cubic rather than linear. This showed exactly the meaningful insight I hoped to achieve. My second goal was to become confident in model evaluation and selection. This was most evident in my Classification Project. I did not just fit a single model, I compared Lasso Regression, Standard Logistic Regression, LDA, QDA, and Poisson regression. I demonstrated professional-level reasoning by choosing the Lasso model for its parsimony and the Multinomial model for its ability to preserve the multi-categorical structure of student grades. I also recognized that while Poisson regression is a powerful tool, it was not appropriate for my specific data due to significant overdispersion. This ability to justify why one approach is better than another was a core competency I aimed to build.

Challenges and Overcoming Obstacles

I anticipated that balancing work and academics would require resilience. One unexpected challenge was the complexity of real-world data. In the diamond dataset, I discovered physical impossibilities like 0.00 mm dimensions, and in the PFI dataset, I navigated significant class imbalances where most students were high achievers. I overcame these by applying rigorous diagnostic checks. I used VIF scores to detect and solve severe multicollinearity and employed studentized residuals to identify and remove outliers that were skewing my results. This systematic approach proved that I moved from last-minute preparation to a methodical and disciplined workflow.

Growth and What Lies Ahead

My third goal was to build an intellectual launchpad for advanced topics like Machine Learning and AI. By mastering the tidymodels ecosystem in R including recipes, workflows, and cross-validation, I have built a toolkit that is directly transferable to the industry. I have moved from someone who takes ownership of their learning to someone who can contribute meaningfully to the tech industry.

While my technical foundations are now solid, the journey of life-long learning continues. For instance, I noted in my project that my current models identify predictive associations rather than causal effects. Exploring causal inference and survey-weighting methods would be a natural next step to deepen my research capabilities. I am leaving this course as a more confident, disciplined, and capable data scientist, ready to build the innovative solutions I once only envisioned.

Sincerely,

Eric Mulaa