library(readxl)
library(discrim)
library(nnet)
library(glmnet)
library(vip)
library(broom)
library(kableExtra)
library(patchwork)
library(forcats)
library(themis)
library(poissonreg)
library(gt)
library(ggcorrplot)
library(car)
library(tidyverse)
library(tidymodels)
library(dplyr)
I begin by importing the curated 2019 PFI dataset and inspecting its structure.
pfi_data <- read_excel("~/STA 631/Projects/pfi-data.xlsx", sheet = "curated 2019")
dim(pfi_data)
## [1] 15500 75
names(pfi_data)
## [1] "BASMID" "ALLGRADEX" "EDCPUB" "SCCHOICE" "SPUBCHOIX"
## [6] "SCONSIDR" "SCHLHRSWK" "EINTNET" "MOSTIMPT" "INTNUM"
## [11] "SEENJOY" "SEGRADES" "SEABSNT" "SEGRADEQ" "FSSPORTX"
## [16] "FSVOL" "FSMTNG" "FSPTMTNG" "FSATCNFN" "FSFUNDRS"
## [21] "FSCOMMTE" "FSCOUNSLR" "FSFREQ" "FSNOTESX" "FSMEMO"
## [26] "FCSCHOOL" "FCTEACHR" "FCSTDS" "FCORDER" "FCSUPPRT"
## [31] "FHHOME" "FHWKHRS" "FHAMOUNT" "FHCAMT" "FHPLACE"
## [36] "FHCHECKX" "FHHELP" "FOSTORY2X" "FOCRAFTS" "FOGAMES"
## [41] "FOBUILDX" "FOSPORT" "FORESPON" "FOHISTX" "FODINNERX"
## [46] "FOLIBRAYX" "FOBOOKSTX" "HDHEALTH" "CDOBMM" "CDOBYY"
## [51] "CSEX" "CSPEAKX" "HHTOTALXX" "RELATION" "P1REL"
## [56] "P1SEX" "P1MRSTA" "P1EMPL" "P1HRSWK" "P1MTHSWRK"
## [61] "P1AGE" "P2GUARD" "TTLHHINC" "OWNRNTHB" "CHLDNT"
## [66] "SEFUTUREX" "DSBLTY" "HHPARN19X" "HHPARN19_BRD" "NUMSIBSX"
## [71] "PARGRADEX" "RACEETH" "INTACC" "CENREG" "ZIPLOCL"
# glimpse(pfi_data)
This project examines how family involvement at school and at home is
associated with student academic performance in the 2019 Parent and
Family Involvement (PFI) survey. Academic performance is measured using
SEGRADES, the parent-reported overall grades item. Family
involvement is measured using school-based involvement items, contact
with the school, and homework/home-engagement variables.
pfi_clean <- pfi_data %>%
mutate(across(where(is.numeric), ~ ifelse(. < 0, NA, .))) %>%
dplyr::select(
BASMID,
SEGRADES, SEENJOY, SEABSNT, SEGRADEQ,
FSSPORTX, FSVOL, FSMTNG, FSPTMTNG, FSATCNFN, FSFUNDRS, FSCOMMTE, FSCOUNSLR, FSFREQ,
FSNOTESX, FSMEMO,
FHHOME, FHWKHRS, FHAMOUNT, FHCAMT, FODINNERX,
EDCPUB, SCCHOICE, SCONSIDR, SPUBCHOIX,
ALLGRADEX, PARGRADEX, TTLHHINC, RACEETH,
CSEX, DSBLTY, HHPARN19_BRD, OWNRNTHB, CENREG,
HHTOTALXX, NUMSIBSX, P1AGE, P1EMPL, INTACC, ZIPLOCL
) %>%
mutate(
SEGRADES = ifelse(SEGRADES == 5, NA, SEGRADES),
grade_multi = factor(
SEGRADES,
levels = c(1, 2, 3, 4),
labels = c("Mostly_A", "Mostly_B", "Mostly_C", "Mostly_D_or_lower")
),
high_achiever = factor(
ifelse(SEGRADES == 1, "Yes", "No"),
levels = c("Yes", "No")
)
)
missingness_table <- pfi_clean %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "variable", values_to = "n_missing") %>%
mutate(percent_missing = round(100 * n_missing / nrow(pfi_clean), 1)) %>%
arrange(desc(percent_missing))
missingness_table %>%
slice_head(n = 20) %>%
kbl(caption = "Top 20 Variables by Missingness") %>%
kable_classic(full_width = FALSE)
| variable | n_missing | percent_missing |
|---|---|---|
| SEGRADES | 1967 | 12.7 |
| grade_multi | 1967 | 12.7 |
| high_achiever | 1967 | 12.7 |
| FHWKHRS | 1047 | 6.8 |
| FHAMOUNT | 1047 | 6.8 |
| FHCAMT | 1047 | 6.8 |
| FSSPORTX | 126 | 0.8 |
| FSVOL | 126 | 0.8 |
| FSMTNG | 126 | 0.8 |
| FSPTMTNG | 126 | 0.8 |
| FSATCNFN | 126 | 0.8 |
| FSFUNDRS | 126 | 0.8 |
| FSCOMMTE | 126 | 0.8 |
| FSCOUNSLR | 126 | 0.8 |
| FSFREQ | 126 | 0.8 |
| FSNOTESX | 53 | 0.3 |
| FSMEMO | 53 | 0.3 |
| FHHOME | 53 | 0.3 |
| BASMID | 0 | 0.0 |
| SEENJOY | 2 | 0.0 |
pfi_model <- pfi_clean %>%
filter(
!is.na(grade_multi),
!is.na(high_achiever),
!is.na(PARGRADEX),
!is.na(TTLHHINC),
!is.na(RACEETH),
!is.na(CSEX)
)
dim(pfi_model)
## [1] 13533 42
The questionnaire confirms those FS* variables are all
school-based family involvement items, and FSFREQ is the
count of how many times an adult in the household went to meetings or
activities at the child’s school.
pfi_model <- pfi_model %>%
mutate(
across(
c(FSSPORTX, FSVOL, FSMTNG, FSPTMTNG, FSATCNFN, FSFUNDRS, FSCOMMTE, FSCOUNSLR,
FSNOTESX, FSMEMO),
~ case_when(
. == 1 ~ 1,
. == 2 ~ 0,
TRUE ~ NA_real_
)
),
school_involve_score = rowSums(
dplyr::select(., FSSPORTX, FSVOL, FSMTNG, FSPTMTNG, FSATCNFN, FSFUNDRS, FSCOMMTE, FSCOUNSLR),
na.rm = TRUE
),
school_contact_score = rowSums(
dplyr::select(., FSNOTESX, FSMEMO),
na.rm = TRUE
)
)
pfi_model <- pfi_model %>%
mutate(
CSEX = factor(CSEX, levels = c(1, 2), labels = c("Male", "Female")),
RACEETH = factor(RACEETH, levels = c(1, 2, 3, 4, 5),
labels = c("White", "Black", "Hispanic", "Asian_PI", "Other")),
PARGRADEX = factor(PARGRADEX,
labels = c("Less_than_HS", "HS", "Associate", "Bachelors", "Graduate")),
TTLHHINC = factor(TTLHHINC, ordered = TRUE),
EDCPUB = factor(EDCPUB),
SCCHOICE = factor(SCCHOICE),
SCONSIDR = factor(SCONSIDR),
SPUBCHOIX = factor(SPUBCHOIX),
DSBLTY = factor(DSBLTY),
HHPARN19_BRD = factor(HHPARN19_BRD),
OWNRNTHB = factor(OWNRNTHB),
CENREG = factor(CENREG),
P1EMPL = factor(P1EMPL),
INTACC = factor(INTACC),
ZIPLOCL = factor(ZIPLOCL),
SEENJOY = factor(SEENJOY, ordered = TRUE),
SEABSNT = factor(SEABSNT, ordered = TRUE)
)
ggplot(pfi_model, aes(x = grade_multi)) +
geom_bar(fill = "#2C7FB8") +
labs(
title = "Distribution of Student Grades",
x = "SEGRADES",
y = "Count"
) +
theme_minimal()
The distribution of SEGRADES shows a strong imbalance, with most students reporting mostly A’s or B’s and relatively few reporting lower grades. This imbalance is important for model selection and evaluation.
ggplot(pfi_model, aes(x = grade_multi, y = school_involve_score)) +
geom_boxplot(fill = "#A6CEE3") +
labs(
title = "School Involvement Score by Student Grades",
x = "SEGRADES",
y = "School Involvement Score"
) +
theme_minimal()
School involvement scores do not show a strong positive relationship with higher academic performance; in fact, students with lower grades often have similar or slightly higher involvement levels, suggesting that involvement may be reactive rather than preventative.
ggplot(pfi_model, aes(x = grade_multi, y = FSFREQ)) +
geom_boxplot(fill = "#B2DF8A") +
labs(
title = "Frequency of School Meetings/Activities by Student Grades",
x = "SEGRADES",
y = "FSFREQ"
) +
theme_minimal()
The frequency of school meetings and activities shows a weak positive association with academic performance, but the large variability and presence of extreme values suggest that this relationship is not strong and that the distribution is highly skewed.
ggplot(pfi_model, aes(x = FHHOME, fill = grade_multi)) +
geom_bar(position = "fill") +
labs(
title = "Grade Distribution by Homework Frequency",
x = "How often homework is done outside school",
y = "Proportion"
) +
theme_minimal()
Homework frequency shows a clear positive relationship with academic performance, with higher frequencies associated with a larger proportion of students receiving mostly A’s. This suggests that home-based engagement may be a stronger predictor of academic success than school-based involvement.
set.seed(631)
class_df <- pfi_model %>%
dplyr::select(
high_achiever, grade_multi, SEGRADES,
school_involve_score, school_contact_score, FSFREQ,
FHHOME, FHWKHRS, FHAMOUNT, FHCAMT, FODINNERX,
SEENJOY, SEABSNT,
EDCPUB, SCCHOICE, SCONSIDR, SPUBCHOIX,
PARGRADEX, TTLHHINC, RACEETH, CSEX, DSBLTY,
HHPARN19_BRD, OWNRNTHB, CENREG, HHTOTALXX, NUMSIBSX, P1AGE, P1EMPL, INTACC, ZIPLOCL
) %>%
drop_na()
class_split <- initial_split(class_df, prop = 0.75, strata = high_achiever)
class_train <- training(class_split)
class_test <- testing(class_split)
set.seed(631)
class_folds <- vfold_cv(class_train, v = 10, strata = high_achiever)
count_df <- pfi_model %>%
dplyr::select(
FSFREQ,
school_involve_score, school_contact_score,
FHHOME, FHWKHRS, FODINNERX,
PARGRADEX, TTLHHINC, RACEETH, CSEX, DSBLTY,
SCCHOICE, EDCPUB
) %>%
drop_na()
count_split <- initial_split(count_df, prop = 0.75)
count_train <- training(count_split)
count_test <- testing(count_split)
multi_df <- class_df %>%
dplyr::select(
grade_multi,
school_involve_score, school_contact_score, FSFREQ,
FHHOME, FHWKHRS, FODINNERX, SEENJOY, SEABSNT,
SCCHOICE, EDCPUB, PARGRADEX, TTLHHINC, RACEETH, CSEX, DSBLTY
) %>%
drop_na()
set.seed(631)
multi_split <- initial_split(multi_df, prop = 0.75, strata = grade_multi)
multi_train <- training(multi_split)
multi_test <- testing(multi_split)
lasso_recipe <- recipe(high_achiever ~ ., data = class_train) %>%
step_rm(grade_multi, SEGRADES) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors()) %>%
step_normalize(all_numeric_predictors())
lasso_spec <- logistic_reg(penalty = tune(), mixture = 1) %>%
set_engine("glmnet") %>%
set_mode("classification")
lasso_wf <- workflow() %>%
add_recipe(lasso_recipe) %>%
add_model(lasso_spec)
lambda_grid <- grid_regular(penalty(), levels = 40)
set.seed(631)
lasso_res <- tune_grid(
lasso_wf,
resamples = class_folds,
grid = lambda_grid,
metrics = metric_set(roc_auc, accuracy),
control = control_grid(save_pred = TRUE)
)
collect_metrics(lasso_res) %>%
ggplot(aes(x = penalty, y = mean, color = .metric)) +
geom_line() +
geom_point() +
facet_wrap(~ .metric, scales = "free_y", ncol = 1) +
scale_x_log10() +
theme_minimal() +
labs(title = "Lasso Tuning Results", x = "Penalty", y = "Cross-validated performance")
best_penalty <- select_best(lasso_res, metric = "roc_auc")
final_lasso_wf <- finalize_workflow(lasso_wf, best_penalty)
lasso_last_fit <- last_fit(final_lasso_wf, class_split)
lasso_preds <- collect_predictions(lasso_last_fit)
collect_metrics(lasso_last_fit)
## # A tibble: 3 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.715 Preprocessor1_Model1
## 2 roc_auc binary 0.773 Preprocessor1_Model1
## 3 brier_class binary 0.191 Preprocessor1_Model1
The lasso logistic model achieved an accuracy of 71.5% and an ROC AUC of 0.773 on the test set, indicating moderate predictive ability in distinguishing high-achieving students from others. Performance remained stable across small penalty values but declined sharply at larger penalties, suggesting that overly aggressive regularization removed too much predictive information.
final_lasso_fit <- fit(final_lasso_wf, data = class_train)
lasso_coefs <- final_lasso_fit %>%
extract_fit_parsnip() %>%
tidy()
lasso_coefs_nonzero <- lasso_coefs %>%
filter(term != "(Intercept)", estimate != 0) %>%
arrange(desc(abs(estimate)))
lasso_coefs_nonzero %>%
slice_head(n = 20) %>%
knitr::kable(caption = "Top Lasso-Selected Predictors")
| term | estimate | penalty |
|---|---|---|
| SEENJOY_1 | 0.4124879 | 0.0049239 |
| DSBLTY_X2 | -0.3599257 | 0.0049239 |
| PARGRADEX_Graduate | -0.3021157 | 0.0049239 |
| CSEX_Female | -0.2390815 | 0.0049239 |
| TTLHHINC_01 | -0.2087388 | 0.0049239 |
| RACEETH_Black | 0.1939437 | 0.0049239 |
| RACEETH_Hispanic | 0.1761597 | 0.0049239 |
| SEENJOY_2 | -0.1480184 | 0.0049239 |
| FHHOME | -0.1448516 | 0.0049239 |
| PARGRADEX_Bachelors | -0.1344068 | 0.0049239 |
| FHCAMT | 0.1291649 | 0.0049239 |
| SEABSNT_1 | 0.0760875 | 0.0049239 |
| HHPARN19_BRD_X2 | 0.0718702 | 0.0049239 |
| P1AGE | 0.0649976 | 0.0049239 |
| RACEETH_Asian_PI | -0.0640073 | 0.0049239 |
| CENREG_X4 | 0.0613216 | 0.0049239 |
| PARGRADEX_HS | 0.0566932 | 0.0049239 |
| CENREG_X2 | -0.0492400 | 0.0049239 |
| school_contact_score | -0.0451594 | 0.0049239 |
| ZIPLOCL_X32 | -0.0356906 | 0.0049239 |
The lasso model selected a relatively small set of predictors, indicating that only a subset of the available family, school, and demographic variables contributed meaningful predictive information. Among the retained predictors, student school experience variables such as school enjoyment and absenteeism, along with household background characteristics such as parental education, income, disability status, and demographic factors, showed the strongest effects. Importantly for the research question, home-based involvement variables such as homework-related engagement (FHHOME, FHCAMT) remained in the model, while school contact was retained with a smaller coefficient. This suggests that family involvement at home may be more strongly associated with academic performance than school-based involvement, although broader student and household characteristics also play a major role.
race/ethnicity remained in the selected model, indicating differences in predicted academic performance across groups after accounting for other retained covariates.
lasso_preds %>%
conf_mat(truth = high_achiever, estimate = .pred_class) %>%
autoplot(type = "heatmap") +
labs(title = "Lasso Confusion Matrix")
log_formula <- high_achiever ~ school_involve_score + school_contact_score + FSFREQ +
FHHOME + FHWKHRS + FODINNERX + SEENJOY + SEABSNT +
SCCHOICE + EDCPUB + PARGRADEX + TTLHHINC + RACEETH + CSEX + DSBLTY
log_recipe <- recipe(log_formula, data = class_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors())
log_spec <- logistic_reg() %>%
set_engine("glm") %>%
set_mode("classification")
log_wf <- workflow() %>%
add_recipe(log_recipe) %>%
add_model(log_spec)
log_last_fit <- last_fit(log_wf, class_split)
log_preds <- collect_predictions(log_last_fit)
collect_metrics(log_last_fit)
## # A tibble: 3 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.708 Preprocessor1_Model1
## 2 roc_auc binary 0.769 Preprocessor1_Model1
## 3 brier_class binary 0.193 Preprocessor1_Model1
The standard logistic regression model achieved an accuracy of 70.8%, an ROC AUC of 0.769, and a Brier score of 0.193 on the test set. These results were very similar to the lasso logistic model, though the lasso model performed slightly better on all three metrics. This indicates that reducing the predictor set through regularization did not harm predictive performance and may have improved out-of-sample generalization. Because the lasso model is also more parsimonious, it provides a strong candidate for the preferred binary classification model
metric_set(accuracy, precision, recall, f_meas, roc_auc)(
log_preds,
truth = high_achiever,
estimate = .pred_class,
.pred_Yes
)
## # A tibble: 5 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.708
## 2 precision binary 0.723
## 3 recall binary 0.790
## 4 f_meas binary 0.755
## 5 roc_auc binary 0.769
The higher recall relative to precision suggests that the model is better at identifying true positive cases than at avoiding false positive classifications.
log_fit_object <- fit(log_wf, data = class_train)
tidy(extract_fit_parsnip(log_fit_object), conf.int = TRUE, exponentiate = TRUE) %>%
kbl(caption = "Logistic Regression Odds Ratios") %>%
kable_classic(full_width = FALSE)
| term | estimate | std.error | statistic | p.value | conf.low | conf.high |
|---|---|---|---|---|---|---|
| (Intercept) | 13.1729145 | 0.2618645 | 9.8454082 | 0.0000000 | 7.8978791 | 22.0481314 |
| school_involve_score | 1.0156403 | 0.0136713 | 1.1351735 | 0.2563026 | 0.9887850 | 1.0432257 |
| school_contact_score | 0.8664530 | 0.0398616 | -3.5961306 | 0.0003230 | 0.8012676 | 0.9367968 |
| FSFREQ | 0.9979975 | 0.0027483 | -0.7293740 | 0.4657729 | 0.9925799 | 1.0033392 |
| FHHOME | 0.8546945 | 0.0306294 | -5.1261533 | 0.0000003 | 0.8048561 | 0.9075457 |
| FHWKHRS | 0.9995538 | 0.0050493 | -0.0883976 | 0.9295606 | 0.9896971 | 1.0094923 |
| FODINNERX | 0.9694966 | 0.0117818 | -2.6293311 | 0.0085553 | 0.9473583 | 0.9921416 |
| SEENJOY_1 | 3.8365627 | 0.1278157 | 10.5196527 | 0.0000000 | 2.9982910 | 4.9524917 |
| SEENJOY_2 | 0.6062330 | 0.1025449 | -4.8807028 | 0.0000011 | 0.4970689 | 0.7434014 |
| SEENJOY_3 | 0.8658592 | 0.0700573 | -2.0559320 | 0.0397891 | 0.7547893 | 0.9934586 |
| SEABSNT_1 | 1.5679110 | 0.1913098 | 2.3508677 | 0.0187297 | 1.0863142 | 2.3050736 |
| SEABSNT_2 | 1.0533276 | 0.1571315 | 0.3306424 | 0.7409146 | 0.7780781 | 1.4426190 |
| SEABSNT_3 | 1.1609103 | 0.1153429 | 1.2935728 | 0.1958130 | 0.9262671 | 1.4562678 |
| SCCHOICE_X2 | 1.0077543 | 0.0498453 | 0.1549681 | 0.8768465 | 0.9138807 | 1.1110985 |
| EDCPUB_X2 | 0.9523333 | 0.0814732 | -0.5994636 | 0.5488637 | 0.8112800 | 1.1166176 |
| PARGRADEX_HS | 0.9298361 | 0.1261089 | -0.5768581 | 0.5640353 | 0.7255995 | 1.1897746 |
| PARGRADEX_Associate | 0.7165266 | 0.1183000 | -2.8177500 | 0.0048361 | 0.5676212 | 0.9026967 |
| PARGRADEX_Bachelors | 0.5016405 | 0.1231958 | -5.5997968 | 0.0000000 | 0.3936177 | 0.6380956 |
| PARGRADEX_Graduate | 0.3470300 | 0.1281318 | -8.2598099 | 0.0000000 | 0.2696862 | 0.4457235 |
| TTLHHINC_01 | 0.4411260 | 0.1195522 | -6.8457514 | 0.0000000 | 0.3488658 | 0.5574590 |
| TTLHHINC_02 | 1.0302578 | 0.0989166 | 0.3013557 | 0.7631433 | 0.8487545 | 1.2508639 |
| TTLHHINC_03 | 1.0206128 | 0.0902301 | 0.2261243 | 0.8211047 | 0.8548084 | 1.2175973 |
| TTLHHINC_04 | 0.8815719 | 0.0944977 | -1.3338814 | 0.1822427 | 0.7325351 | 1.0610139 |
| TTLHHINC_05 | 1.1176032 | 0.0949471 | 1.1710357 | 0.2415844 | 0.9282033 | 1.3468123 |
| TTLHHINC_06 | 0.8368187 | 0.0935647 | -1.9040075 | 0.0569092 | 0.6966818 | 1.0054220 |
| TTLHHINC_07 | 1.1341144 | 0.0896348 | 1.4040528 | 0.1603031 | 0.9516037 | 1.3522888 |
| TTLHHINC_08 | 0.9868376 | 0.0842962 | -0.1571808 | 0.8751023 | 0.8365292 | 1.1641288 |
| TTLHHINC_09 | 0.9354899 | 0.0786952 | -0.8473820 | 0.3967822 | 0.8017291 | 1.0914664 |
| TTLHHINC_10 | 1.0105759 | 0.0755992 | 0.1391597 | 0.8893240 | 0.8713862 | 1.1719888 |
| TTLHHINC_11 | 0.9343384 | 0.0812248 | -0.8361565 | 0.4030669 | 0.7968118 | 1.0955985 |
| RACEETH_Black | 2.1653251 | 0.0845965 | 9.1324155 | 0.0000000 | 1.8351176 | 2.5568759 |
| RACEETH_Hispanic | 1.6536256 | 0.0625065 | 8.0466858 | 0.0000000 | 1.4630556 | 1.8693184 |
| RACEETH_Asian_PI | 0.7581050 | 0.1056804 | -2.6204810 | 0.0087806 | 0.6149963 | 0.9308290 |
| RACEETH_Other | 1.0780326 | 0.0972970 | 0.7722503 | 0.4399662 | 0.8902733 | 1.3038168 |
| CSEX_Female | 0.5877427 | 0.0475344 | -11.1806698 | 0.0000000 | 0.5354051 | 0.6450757 |
| DSBLTY_X2 | 0.3943715 | 0.0569492 | -16.3384592 | 0.0000000 | 0.3526254 | 0.4408335 |
log_preds %>%
conf_mat(truth = high_achiever, estimate = .pred_class) %>%
autoplot(type = "heatmap") +
labs(title = "Logistic Regression Confusion Matrix")
log_preds %>%
roc_curve(truth = high_achiever, .pred_Yes) %>%
autoplot() +
labs(title = "Logistic Regression ROC Curve")
multi_recipe <- recipe(grade_multi ~ ., data = multi_train) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors())
multi_spec <- multinom_reg() %>%
set_engine("nnet") %>%
set_mode("classification")
multi_wf <- workflow() %>%
add_recipe(multi_recipe) %>%
add_model(multi_spec)
multi_last_fit <- last_fit(multi_wf, multi_split)
collect_metrics(multi_last_fit)
## # A tibble: 3 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy multiclass 0.624 Preprocessor1_Model1
## 2 roc_auc hand_till 0.712 Preprocessor1_Model1
## 3 brier_class multiclass 0.239 Preprocessor1_Model1
The multinomial model is harder to fit well than the binary models, but it is the more appropriate model for the full SEGRADES response because it preserves all four academic-performance categories instead of collapsing them into a yes/no outcome.
multi_preds <- collect_predictions(multi_last_fit)
multi_cm <- multi_preds %>%
conf_mat(truth = grade_multi, estimate = .pred_class)
multi_cm
## Truth
## Prediction Mostly_A Mostly_B Mostly_C Mostly_D_or_lower
## Mostly_A 1553 612 87 13
## Mostly_B 249 393 138 27
## Mostly_C 17 36 43 18
## Mostly_D_or_lower 1 1 2 2
autoplot(multi_cm, type = "heatmap") +
labs(title = "Multinomial Regression Confusion Matrix")
The multinomial confusion matrix shows that the model performs best on
the more common A and B categories and struggles more with the less
common C and D-or-lower categories, which is consistent with the class
imbalance observed in the outcome distribution.
disc_recipe <- recipe(
high_achiever ~ school_involve_score + school_contact_score + FSFREQ +
FHWKHRS + FODINNERX + P1AGE + HHTOTALXX + NUMSIBSX,
data = class_train
) %>%
step_normalize(all_numeric_predictors()) %>%
step_zv(all_predictors())
lda_spec <- discrim_linear() %>%
set_engine("MASS") %>%
set_mode("classification")
qda_spec <- discrim_quad() %>%
set_engine("MASS") %>%
set_mode("classification")
lda_wf <- workflow() %>% add_recipe(disc_recipe) %>% add_model(lda_spec)
qda_wf <- workflow() %>% add_recipe(disc_recipe) %>% add_model(qda_spec)
lda_fit <- last_fit(lda_wf, class_split)
qda_fit <- last_fit(qda_wf, class_split)
collect_metrics(lda_fit)
## # A tibble: 3 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.607 Preprocessor1_Model1
## 2 roc_auc binary 0.617 Preprocessor1_Model1
## 3 brier_class binary 0.236 Preprocessor1_Model1
collect_metrics(qda_fit)
## # A tibble: 3 × 4
## .metric .estimator .estimate .config
## <chr> <chr> <dbl> <chr>
## 1 accuracy binary 0.624 Preprocessor1_Model1
## 2 roc_auc binary 0.648 Preprocessor1_Model1
## 3 brier_class binary 0.234 Preprocessor1_Model1
lda_preds <- collect_predictions(lda_fit)
# Confusion matrix
lda_preds %>%
conf_mat(truth = high_achiever, estimate = .pred_class)
## Truth
## Prediction Yes No
## Yes 1612 1044
## No 209 327
# Heatmap
lda_preds %>%
conf_mat(truth = high_achiever, estimate = .pred_class) %>%
autoplot(type = "heatmap")
qda_preds <- collect_predictions(qda_fit)
qda_preds %>%
conf_mat(truth = high_achiever, estimate = .pred_class)
## Truth
## Prediction Yes No
## Yes 1407 786
## No 414 585
qda_preds %>%
conf_mat(truth = high_achiever, estimate = .pred_class) %>%
autoplot(type = "heatmap")
metric_set(accuracy, precision, recall, f_meas, roc_auc)(
lda_preds,
truth = high_achiever,
estimate = .pred_class,
.pred_Yes
)
## # A tibble: 5 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.607
## 2 precision binary 0.607
## 3 recall binary 0.885
## 4 f_meas binary 0.720
## 5 roc_auc binary 0.617
metric_set(accuracy, precision, recall, f_meas, roc_auc)(
qda_preds,
truth = high_achiever,
estimate = .pred_class,
.pred_Yes
)
## # A tibble: 5 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 accuracy binary 0.624
## 2 precision binary 0.642
## 3 recall binary 0.773
## 4 f_meas binary 0.701
## 5 roc_auc binary 0.648
Linear Discriminant Analysis (LDA) and Quadratic Discriminant Analysis (QDA) were fit as alternative binary classifiers for high academic achievement. LDA achieved 60.7% accuracy, 88.5% recall, and an ROC AUC of 0.617, while QDA achieved 62.4% accuracy, 77.3% recall, and an ROC AUC of 0.648. LDA showed the highest recall, indicating that it was more aggressive in identifying positive cases, but this came at the cost of lower precision and weaker overall discrimination.
QDA performed slightly better than LDA in accuracy, precision, and AUC, suggesting that allowing quadratic decision boundaries improved classification somewhat. However, both discriminant-analysis methods underperformed compared with the logistic and lasso logistic models, indicating that logistic-based approaches were better suited to this dataset.
Since FSFREQ is literally a count of how many times
adults in the household went to meetings or activities at the child’s
school, it is an excellent Poisson outcome.
count_train %>%
summarise(
mean_fsfreq = mean(FSFREQ),
var_fsfreq = var(FSFREQ),
ratio = var(FSFREQ) / mean(FSFREQ)
) %>%
kbl(caption = "Poisson Dispersion Check for FSFREQ") %>%
kable_classic(full_width = FALSE)
| mean_fsfreq | var_fsfreq | ratio |
|---|---|---|
| 7.190098 | 92.28142 | 12.83451 |
If variance is much larger than the mean, I will note overdispersion. This will mean Poisson was demonstrated but may not fit perfectly.
pois_recipe <- recipe(
FSFREQ ~ school_involve_score + school_contact_score + FHHOME + FHWKHRS +
FODINNERX + PARGRADEX + TTLHHINC + RACEETH + CSEX + DSBLTY + SCCHOICE + EDCPUB,
data = count_train
) %>%
step_dummy(all_nominal_predictors()) %>%
step_zv(all_predictors())
pois_spec <- poisson_reg() %>%
set_engine("glm") %>%
set_mode("regression")
pois_wf <- workflow() %>%
add_recipe(pois_recipe) %>%
add_model(pois_spec)
pois_fit <- fit(pois_wf, data = count_train)
pois_preds <- predict(pois_fit, new_data = count_test) %>%
bind_cols(count_test)
rmse(pois_preds, truth = FSFREQ, estimate = .pred)
## # A tibble: 1 × 3
## .metric .estimator .estimate
## <chr> <chr> <dbl>
## 1 rmse standard 8.53
The Poisson regression model was used to analyze FSFREQ, the number of times adults in the household attended meetings or activities at the child’s school. However, the dispersion check indicated substantial overdispersion: the mean of FSFREQ was 7.19, while the variance was 92.28, yielding a variance-to-mean ratio of 12.83. Because the Poisson model assumes equality of the mean and variance, this result indicates a serious violation of the Poisson assumption. The test-set RMSE of 8.53 was also large relative to the mean count, suggesting weak predictive accuracy. Therefore, while Poisson regression was demonstrated as required, it was not considered the preferred model for this count outcome.
lasso_metrics <- metric_set(accuracy, precision, recall, f_meas, roc_auc)(
lasso_preds, truth = high_achiever, estimate = .pred_class, .pred_Yes
) %>%
dplyr::select(.metric, .estimate) %>%
pivot_wider(names_from = .metric, values_from = .estimate) %>%
mutate(model = "Lasso Logistic")
log_metrics <- metric_set(accuracy, precision, recall, f_meas, roc_auc)(
log_preds, truth = high_achiever, estimate = .pred_class, .pred_Yes
) %>%
dplyr::select(.metric, .estimate) %>%
pivot_wider(names_from = .metric, values_from = .estimate) %>%
mutate(model = "Standard Logistic")
lda_preds <- collect_predictions(lda_fit)
lda_metrics <- metric_set(accuracy, roc_auc)(
lda_preds, truth = high_achiever, estimate = .pred_class, .pred_Yes
) %>%
dplyr::select(.metric, .estimate) %>%
pivot_wider(names_from = .metric, values_from = .estimate) %>%
mutate(model = "LDA")
qda_preds <- collect_predictions(qda_fit)
qda_metrics <- metric_set(accuracy, roc_auc)(
qda_preds, truth = high_achiever, estimate = .pred_class, .pred_Yes
) %>%
dplyr::select(.metric, .estimate) %>%
pivot_wider(names_from = .metric, values_from = .estimate) %>%
mutate(model = "QDA")
bind_rows(lasso_metrics, log_metrics, lda_metrics, qda_metrics) %>%
dplyr::select(model, everything()) %>%
kbl(caption = "Binary Classification Model Comparison") %>%
kable_classic(full_width = FALSE)
| model | accuracy | precision | recall | f_meas | roc_auc |
|---|---|---|---|---|---|
| Lasso Logistic | 0.7149123 | 0.7265042 | 0.8023064 | 0.7625261 | 0.7734983 |
| Standard Logistic | 0.7077068 | 0.7231156 | 0.7902252 | 0.7551824 | 0.7688812 |
| LDA | 0.6074561 | NA | NA | NA | 0.6173456 |
| QDA | 0.6240602 | NA | NA | NA | 0.6480891 |
lasso_summary <- tibble(
model = "Lasso Logistic",
outcome = "high_achiever",
accuracy = 0.7149123,
roc_auc = 0.7734983,
comment = "Best binary model"
)
log_summary <- tibble(
model = "Standard Logistic",
outcome = "high_achiever",
accuracy = 0.7077068,
roc_auc = 0.7688812,
comment = "Close second"
)
lda_summary <- tibble(
model = "LDA",
outcome = "high_achiever",
accuracy = 0.6074561,
roc_auc = 0.6173456,
comment = "High recall, weak AUC"
)
qda_summary <- tibble(
model = "QDA",
outcome = "high_achiever",
accuracy = 0.6240602,
roc_auc = 0.6480891,
comment = "Better than LDA, below logistic"
)
multi_summary <- tibble(
model = "Multinomial Regression",
outcome = "grade_multi",
accuracy = 0.6237469,
roc_auc = 0.7119616,
comment = "Best full-outcome model"
)
pois_summary <- tibble(
model = "Poisson Regression",
outcome = "FSFREQ",
accuracy = NA_real_,
roc_auc = NA_real_,
comment = "Rejected due to overdispersion; RMSE = 8.53"
)
bind_rows(
lasso_summary, log_summary, multi_summary,
lda_summary, qda_summary, pois_summary
) %>%
kbl(
caption = "Model Comparison Summary",
digits = 3
) %>%
kable_classic(full_width = FALSE)
| model | outcome | accuracy | roc_auc | comment |
|---|---|---|---|---|
| Lasso Logistic | high_achiever | 0.715 | 0.773 | Best binary model |
| Standard Logistic | high_achiever | 0.708 | 0.769 | Close second |
| Multinomial Regression | grade_multi | 0.624 | 0.712 | Best full-outcome model |
| LDA | high_achiever | 0.607 | 0.617 | High recall, weak AUC |
| QDA | high_achiever | 0.624 | 0.648 | Better than LDA, below logistic |
| Poisson Regression | FSFREQ | NA | NA | Rejected due to overdispersion; RMSE = 8.53 |
Because the grade-based outcomes are imbalanced, ROC AUC is emphasized more heavily than confusion-matrix accuracy when identifying the best classification models. Accuracy is still reported for completeness, but it can be misleading when most observations belong to the majority classes.
The lasso model was used as a variable-selection tool to identify the most important family involvement, school contact, and demographic predictors of high academic achievement. Because lasso shrinks weak coefficients to zero, it provides a parsimonious starting point for final model interpretation.
The standard logistic model provides interpretable odds ratios. Predictors with odds ratios greater than 1 are associated with higher odds of being a high achiever, holding other variables constant.
The multinomial model is the most direct model for the original
SEGRADES response because it preserves all four academic
performance categories rather than collapsing them into a binary
outcome.
Poisson regression was used to model the count outcome
FSFREQ, the number of times adults in the household
attended meetings or activities at the child’s school. This demonstrates
the required count-modeling component of the project while remaining
substantively tied to family involvement.
Because the binary outcome high_achiever is imbalanced,
model comparison emphasized ROC AUC in addition to
confusion-matrix-based accuracy. Among the binary classifiers, the lasso
logistic model is preferred because it achieved the strongest overall
discrimination while also producing a more parsimonious set of
predictors. The standard logistic model performed similarly and remains
useful for interpretation, but lasso had a slight edge in predictive
performance. For the original four-category SEGRADES
outcome, the multinomial regression model is preferred because it
preserves the full academic-performance structure rather than collapsing
the response into a binary indicator. LDA and QDA were useful benchmark
methods, but both underperformed relative to the logistic-based models.
Poisson regression was demonstrated for the count outcome
FSFREQ, but it was not retained as a preferred model
because strong overdispersion indicated poor fit.
This analysis used a curated PFI analytic file and unweighted tidymodels workflows. The NHES:2019 Parent and Family Involvement survey is a complex national survey with weighting, nonresponse adjustment, and imputation procedures, so these results should be interpreted as associations within the analytic sample rather than fully survey-weighted national estimates. In addition, because the data are observational and cross-sectional, the models identify predictive associations rather than causal effects. For example, higher school involvement may reflect reactive engagement when a student is already struggling, rather than a direct causal improvement in grades.
Across the exploratory analysis and the fitted models, home-based family engagement showed a clearer and more consistent relationship with academic performance than school-based involvement measures. In particular, homework-related engagement variables remained important in both the lasso-selected and regression-based models, whereas school event participation showed weaker and less consistent effects. Based on these findings, GVSU’s K-12 Connect should prioritize programs that help families build effective homework routines, maintain regular academic support at home, and strengthen day-to-day engagement with children’s schoolwork. School-based family involvement should still be encouraged, but the results suggest it may be more reactive than predictive when considered on its own.
Can the carat weight of a diamond be determined from visually observable characteristics, without actually weighing the diamond?
Price is excluded because it is not visually observable.
In this project, the response variable is carat. The goal is to see whether visually observable variables can explain and predict carat well.
I begin by loading the diamonds dataset and checking its structure.
# For the computer that set the seed
diamond_dataset <- ggplot2::diamonds
glimpse() shows the variable names and data types.
# Define the variable descriptions based on ?diamonds
diamond_metadata <- tibble(
Variable = c("carat", "cut", "color", "clarity", "depth", "table", "x", "y", "z"),
Description = c(
"Weight of the diamond",
"Quality of the cut (Fair, Good, Very Good, Premium, Ideal)",
"Diamond colour, from D (best) to J (worst)",
"A measurement of how clear the diamond is (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF)",
"Total depth percentage = z / mean(x, y) = 2 * z / (x + y)",
"Width of top of diamond relative to widest point",
"Length in mm",
"Width in mm",
"Depth in mm"
),
Type = c("Numeric (Target)", "Ordinal Factor", "Ordinal Factor", "Ordinal Factor",
"Numeric", "Numeric", "Numeric", "Numeric", "Numeric")
)
diamond_metadata %>%
gt() %>%
tab_header(
title = "Data Dictionary: Diamonds Dataset",
subtitle = "Visually Observable Variables for Predicting Carat Weight"
) %>%
cols_label(
Variable = md("**Variable Name**"),
Description = md("**Definition**"),
Type = md("**Data Type**")
) %>%
tab_source_note(
source_note = "Note: Price is excluded as it is not a visually observable characteristic."
) %>%
opt_stylize(color = "blue", style = 6)
| Data Dictionary: Diamonds Dataset | ||
| Visually Observable Variables for Predicting Carat Weight | ||
| Variable Name | Definition | Data Type |
|---|---|---|
| carat | Weight of the diamond | Numeric (Target) |
| cut | Quality of the cut (Fair, Good, Very Good, Premium, Ideal) | Ordinal Factor |
| color | Diamond colour, from D (best) to J (worst) | Ordinal Factor |
| clarity | A measurement of how clear the diamond is (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF) | Ordinal Factor |
| depth | Total depth percentage = z / mean(x, y) = 2 * z / (x + y) | Numeric |
| table | Width of top of diamond relative to widest point | Numeric |
| x | Length in mm | Numeric |
| y | Width in mm | Numeric |
| z | Depth in mm | Numeric |
| Note: Price is excluded as it is not a visually observable characteristic. | ||
glimpse(diamond_dataset)
## Rows: 53,940
## Columns: 10
## $ carat <dbl> 0.23, 0.21, 0.23, 0.29, 0.31, 0.24, 0.24, 0.26, 0.22, 0.23, 0.…
## $ cut <ord> Ideal, Premium, Good, Premium, Good, Very Good, Very Good, Ver…
## $ color <ord> E, E, E, I, J, J, I, H, E, H, J, J, F, J, E, E, I, J, J, J, I,…
## $ clarity <ord> SI2, SI1, VS1, VS2, SI2, VVS2, VVS1, SI1, VS2, VS1, SI1, VS1, …
## $ depth <dbl> 61.5, 59.8, 56.9, 62.4, 63.3, 62.8, 62.3, 61.9, 65.1, 59.4, 64…
## $ table <dbl> 55, 61, 65, 58, 58, 57, 57, 55, 61, 61, 55, 56, 61, 54, 62, 58…
## $ price <int> 326, 326, 327, 334, 335, 336, 336, 337, 337, 338, 339, 340, 34…
## $ x <dbl> 3.95, 3.89, 4.05, 4.20, 4.34, 3.94, 3.95, 4.07, 3.87, 4.00, 4.…
## $ y <dbl> 3.98, 3.84, 4.07, 4.23, 4.35, 3.96, 3.98, 4.11, 3.78, 4.05, 4.…
## $ z <dbl> 2.43, 2.31, 2.31, 2.63, 2.75, 2.48, 2.47, 2.53, 2.49, 2.39, 2.…
summary() gives quick numerical summaries for each
variable.
summary(diamond_dataset)
## carat cut color clarity depth
## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
I also check the dimensions of our data
dim(diamond_dataset)
## [1] 53940 10
names(diamond_dataset)
## [1] "carat" "cut" "color" "clarity" "depth" "table" "price"
## [8] "x" "y" "z"
The dataset contains numerical variables (carat,
depth, table, price,
x, y, z) and categorical
variables (cut, color, clarity).
Since carat is numerical, this is a regression
problem.
sum(is.na(diamond_dataset))
## [1] 0
I do not have missing values in our data.
I first look at the distribution of the response variable.
ggplot(diamond_dataset, aes(x = carat)) +
geom_histogram(bins = 30)
It looks skewd to the right
Next, I inspect the numerical predictors.
p1 <- ggplot(diamond_dataset, aes(x = x)) + geom_histogram(bins = 30)
p2 <- ggplot(diamond_dataset, aes(x = y)) + geom_histogram(bins = 30)
p3 <- ggplot(diamond_dataset, aes(x = z)) + geom_histogram(bins = 30)
p4 <- ggplot(diamond_dataset, aes(x = depth)) + geom_histogram(bins = 30)
p5 <- ggplot(diamond_dataset, aes(x = table)) + geom_histogram(bins = 30)
(p1 | p2) / (p3 | p4) / p5
I also inspect the categorical predictors.
p6 <- ggplot(diamond_dataset, aes(x = cut)) + geom_bar()
p7 <- ggplot(diamond_dataset, aes(x = color)) + geom_bar()
p8 <- ggplot(diamond_dataset, aes(x = clarity)) + geom_bar()
(p6 | p7) / p8
I compare carat across the categorical predictors using
boxplots.
p14 <- ggplot(diamond_dataset, aes(x = cut, y = carat)) + geom_boxplot()
p15 <- ggplot(diamond_dataset, aes(x = color, y = carat)) + geom_boxplot()
p16 <- ggplot(diamond_dataset, aes(x = clarity, y = carat)) + geom_boxplot()
(p14 | p15) / p16
# 1. Prepare the data: Select only numerical variables and exclude price
numerical_diamonds <- diamond_dataset |>
dplyr::select(carat, depth, table, x, y, z)
# 2. Calculate the correlation matrix
corr_matrix <- cor(numerical_diamonds)
# 3. Create the heatmap
ggcorrplot(corr_matrix,
hc.order = TRUE,
type = "lower",
lab = TRUE,
lab_size = 4,
method = "circle",
colors = c("#E46726", "white", "#6D9EC1"),
title = "Correlation Heatmap of Diamond Physical Characteristics",
ggtheme = theme_minimal())
There is high correlation Carat and variables x, y, and z.
In contrast, the relationships between carat and depth and table are much weaker, suggesting that they are weak predictors and may not contribute much to explaining variation in carat.
The relationships among the explanatory variables themselves, there is strong evidence of collinearity, particularly among the dimension variables. The x, y, and z predictors show very tight linear relationships, and the corresponding correlations are extremely high. This indicates that these variables contain largely overlapping information, and including all of them in a regression model may lead to multicollinearity issues, such as unstable coefficient estimates and difficulty in interpreting the individual effects of each variable. On the other hand, depth and table show relatively weak correlations with the dimension variables and with carat.
The code chunk below is comment because it was run once, and files saved for reproducibility.
# 1. Set seed for your own reproducibility
set.seed(123)
# 2. Create the split (e.g., 75% training, 25% testing)
diamond_split <- initial_split(diamond_dataset, prop = 0.75)
train_data <- training(diamond_split)
test_data <- testing(diamond_split)
I used an automated AIC-based search to ensure parsimony (the simplest model that does the best job).
# 1. Ensure models are defined
# Define the Null Model (Start with nothing but the intercept)
null_model <- lm(carat ~ 1, data = train_data)
full_model <- lm(carat ~ cut + color + clarity + depth + table + x + y + z, data = train_data)
# 2. Use the explicit 'stats::' prefix and pass formulas to scope
# Perform bidirectional selection
# direction = "both" is usually best for finding the global minimum AIC
step_model <- stats::step(
null_model,
scope = list(lower = formula(null_model), upper = formula(full_model)),
direction = "both",
trace = 0
)
# 3. Check if it worked
summary(step_model)
##
## Call:
## lm(formula = carat ~ x + depth + color + clarity + table + cut +
## y + z, data = train_data)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.4762 -0.0664 -0.0284 0.0553 3.7629
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2.785e+00 4.472e-02 -62.276 < 2e-16 ***
## x 3.957e-01 4.239e-03 93.363 < 2e-16 ***
## depth 1.819e-02 5.873e-04 30.978 < 2e-16 ***
## color.L 4.543e-02 1.781e-03 25.505 < 2e-16 ***
## color.Q 2.375e-02 1.627e-03 14.597 < 2e-16 ***
## color.C -1.918e-03 1.526e-03 -1.257 0.20887
## color^4 -5.721e-03 1.404e-03 -4.075 4.61e-05 ***
## color^5 3.970e-03 1.327e-03 2.992 0.00277 **
## color^6 -8.187e-05 1.202e-03 -0.068 0.94569
## clarity.L 6.630e-03 3.164e-03 2.095 0.03617 *
## clarity.Q 4.456e-02 2.948e-03 15.114 < 2e-16 ***
## clarity.C -2.457e-02 2.523e-03 -9.736 < 2e-16 ***
## clarity^4 1.759e-03 2.011e-03 0.875 0.38159
## clarity^5 -3.004e-03 1.636e-03 -1.836 0.06639 .
## clarity^6 -3.637e-03 1.422e-03 -2.558 0.01053 *
## clarity^7 8.119e-03 1.253e-03 6.479 9.34e-11 ***
## table 2.180e-03 3.010e-04 7.241 4.55e-13 ***
## cut.L 1.021e-02 2.322e-03 4.395 1.11e-05 ***
## cut.Q -9.047e-03 1.859e-03 -4.867 1.14e-06 ***
## cut.C 3.802e-04 1.603e-03 0.237 0.81256
## cut^4 1.493e-03 1.283e-03 1.163 0.24466
## y 5.515e-03 1.815e-03 3.038 0.00238 **
## z 1.390e-02 7.082e-03 1.963 0.04963 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1013 on 40432 degrees of freedom
## Multiple R-squared: 0.9544, Adjusted R-squared: 0.9544
## F-statistic: 3.846e+04 on 22 and 40432 DF, p-value: < 2.2e-16
Interpretation of carat_model
Coefficients Coefficients will be interpreted on our final
model
vif(step_model)
## GVIF Df GVIF^(1/(2*Df))
## x 89.197308 1 9.444433
## depth 2.795640 1 1.672017
## color 1.156131 6 1.012163
## clarity 1.327080 7 1.020419
## table 1.790157 1 1.337968
## cut 1.945911 4 1.086777
## y 17.249856 1 4.153295
## z 95.464080 1 9.770572
The VIF values for x and z are high hence severe collinearity due to the correlation I saw between x,y, and z. I will drop them to avoid multicollinearity and to simplify the model to make the coefficients more stable and interpretable.
# Refined Model: Dropping y and z to fix collinearity
model_refined <- lm(carat ~ x + depth + table + cut + color + clarity, data = train_data)
# Check VIF again
vif(model_refined)
## GVIF Df GVIF^(1/(2*Df))
## x 1.323890 1 1.150604
## depth 1.379036 1 1.174324
## table 1.789551 1 1.337741
## cut 1.924214 4 1.085255
## color 1.155730 6 1.012134
## clarity 1.325567 7 1.020335
The VIF values look fine now
plot(model_refined, which = 1) # Residuals vs Fitted
The Residuals vs. Fitted plot revealed three major Potential Problems:
Non-linearity: The distinct U-shape in the residuals indicates that the relationship between the physical traits and carat weight is not a straight line.
Non-constant Variance (Heteroscedasticity): The fan-shaped distribution where residuals spread out as predicted values increase shows that the model’s error is not consistent across all diamond sizes
The plots identified specific observations (e.g., points 14102, 24575, and 34165) with high studentized residuals. These outliers can disproportionately pull the regression line, affecting the goodness of fit. Identifying and potentially removing these points is a critical step in refining the model quality.
Why it works:
Carat weight is a measure of volume. In geometry, volume scales cubically (\(length^3\)) rather than linearly. By applying a natural log (\(log\)) to both carat and the dimension x, I convert this power relationship into a linear one that the regression model can accurately map. Also, log-trsnaforming carat will take care of the right skewness observed earlier.
p1 <- ggplot(train_data, aes(x = carat)) +
geom_histogram(fill = "steelblue", color = "white") +
labs(title = "Original Carat (Right-Skewed)")
p2 <- ggplot(train_data, aes(x = log(carat))) +
geom_histogram(fill = "darkred", color = "white") +
labs(title = "Log-Transformed Carat (More Normal)")
p1 + p2 # Displays them side-by-side
I use a “Log-Log” model because it handles both the skewness of the target and the non-linear (cubic) relationship between length (\(x\)) and weight (\(carat\)).
Note
I have minimum values for x y and z as 0.00mm. Therefore log(x) will give -infinity, values that models cannot work with and for that reason I will omit the records with 0.00mm. Also for the soundness of our model, although the length, width and depth of a diamond might be small, it cannot be completely zero.
# Check for zeros in the physical dimensions
train_data %>%
filter(x == 0 | y == 0 | z == 0) %>%
dplyr::select(carat, x, y, z)
## # A tibble: 14 × 4
## carat x y z
## <dbl> <dbl> <dbl> <dbl>
## 1 1.01 6.66 6.6 0
## 2 0.71 0 0 0
## 3 1 0 0 0
## 4 2.2 8.42 8.37 0
## 5 2.25 0 0 0
## 6 2.02 8.02 7.95 0
## 7 1.12 6.71 6.67 0
## 8 0.71 0 0 0
## 9 1.2 0 0 0
## 10 1.56 0 0 0
## 11 1.15 6.88 6.83 0
## 12 2.8 8.9 8.85 0
## 13 1.14 0 0 0
## 14 2.18 8.49 8.45 0
I have 14 records where either x,y,z is zero. I will omit these records from our model for the reasons mentioned above.
# 1. Remove rows with 0 in x, y, or z
train_cleaned <- train_data %>%
filter(x > 0, y > 0, z > 0)
# 2. Apply transformations to the cleaned data
train_transformed <- train_cleaned %>%
mutate(
log_carat = log(carat),
log_x = log(x)
)
# 3. Fit the model
final_log_model <- lm(log_carat ~ log_x + depth + table + cut + color + clarity,
data = train_transformed)
summary(final_log_model)
##
## Call:
## lm(formula = log_carat ~ log_x + depth + table + cut + color +
## clarity, data = train_transformed)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.69903 -0.01092 0.00037 0.01108 0.86828
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.193e+00 7.267e-03 -989.815 < 2e-16 ***
## log_x 3.011e+00 5.912e-04 5093.274 < 2e-16 ***
## depth 2.264e-02 8.160e-05 277.476 < 2e-16 ***
## table 3.445e-03 5.956e-05 57.850 < 2e-16 ***
## cut.L 1.366e-03 4.595e-04 2.973 0.002951 **
## cut.Q -8.263e-03 3.677e-04 -22.469 < 2e-16 ***
## cut.C 1.148e-02 3.167e-04 36.232 < 2e-16 ***
## cut^4 4.765e-03 2.534e-04 18.803 < 2e-16 ***
## color.L 8.081e-05 3.508e-04 0.230 0.817820
## color.Q 1.566e-04 3.217e-04 0.487 0.626401
## color.C -1.011e-05 3.018e-04 -0.033 0.973276
## color^4 4.114e-04 2.777e-04 1.482 0.138435
## color^5 -6.198e-04 2.624e-04 -2.362 0.018196 *
## color^6 -4.697e-04 2.378e-04 -1.975 0.048230 *
## clarity.L 1.636e-03 6.267e-04 2.610 0.009058 **
## clarity.Q -2.141e-03 5.836e-04 -3.669 0.000244 ***
## clarity.C 1.012e-03 4.995e-04 2.026 0.042734 *
## clarity^4 -9.014e-04 3.980e-04 -2.265 0.023518 *
## clarity^5 2.488e-04 3.237e-04 0.769 0.442053
## clarity^6 5.397e-06 2.812e-04 0.019 0.984686
## clarity^7 3.613e-04 2.479e-04 1.458 0.144955
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.02004 on 40420 degrees of freedom
## Multiple R-squared: 0.9988, Adjusted R-squared: 0.9988
## F-statistic: 1.719e+06 on 20 and 40420 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(final_log_model)
The residuals now are centered at zero and the red line is now flat.
This confirms that the log-log transformation successfully addressed the
non-linearity.
The scale location shows that the points are now spread evenly and henece no hereskedasticity. The Q-Q plot shows residuals follow a normal distribution. Most points should now fall directly on the diagonal line.
Confirming Low Colinearlity (VIF)
vif(final_log_model)
## GVIF Df GVIF^(1/(2*Df))
## log_x 1.320741 1 1.149235
## depth 1.378970 1 1.174296
## table 1.789984 1 1.337903
## cut 1.924221 4 1.085255
## color 1.146046 6 1.011425
## clarity 1.332610 7 1.020722
Outliers
# Generate studentized residuals and leverage (hat) values
model_diagnostics <- augment(final_log_model)
# Check for outliers
outliers <- model_diagnostics %>%
filter(abs(.std.resid) > 3)
# Check for high leverage
high_leverage <- model_diagnostics %>%
filter(.hat > (3 * mean(.hat)))
# --- STEP 1: CREATE THE CLEAN DATA ---
# This takes the results from augment and filters out rows with high residuals
train_final_no_outliers <- model_diagnostics %>%
filter(abs(.std.resid) <= 3)
# --- STEP 2: FIT THE FINAL VERSION ---
final_model_v2 <- lm(log_carat ~ log_x + depth + table + cut + color + clarity,
data = train_final_no_outliers)
summary(final_model_v2)
##
## Call:
## lm(formula = log_carat ~ log_x + depth + table + cut + color +
## clarity, data = train_final_no_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.063762 -0.010897 0.000344 0.010944 0.061580
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -7.237e+00 6.253e-03 -1157.246 < 2e-16 ***
## log_x 3.011e+00 4.991e-04 6032.550 < 2e-16 ***
## depth 2.317e-02 7.017e-05 330.174 < 2e-16 ***
## table 3.649e-03 5.108e-05 71.446 < 2e-16 ***
## cut.L 2.410e-03 3.953e-04 6.095 1.11e-09 ***
## cut.Q -8.459e-03 3.167e-04 -26.710 < 2e-16 ***
## cut.C 1.152e-02 2.701e-04 42.659 < 2e-16 ***
## cut^4 4.926e-03 2.146e-04 22.954 < 2e-16 ***
## color.L 1.788e-04 2.965e-04 0.603 0.546631
## color.Q 2.588e-04 2.720e-04 0.951 0.341485
## color.C 7.388e-05 2.550e-04 0.290 0.772035
## color^4 4.201e-04 2.345e-04 1.792 0.073212 .
## color^5 -5.866e-04 2.217e-04 -2.646 0.008142 **
## color^6 -4.460e-04 2.009e-04 -2.221 0.026390 *
## clarity.L 1.919e-03 5.309e-04 3.615 0.000301 ***
## clarity.Q -2.102e-03 4.945e-04 -4.250 2.14e-05 ***
## clarity.C 8.606e-04 4.231e-04 2.034 0.041958 *
## clarity^4 -6.683e-04 3.367e-04 -1.985 0.047149 *
## clarity^5 3.222e-04 2.735e-04 1.178 0.238746
## clarity^6 -1.611e-04 2.374e-04 -0.678 0.497479
## clarity^7 2.827e-04 2.093e-04 1.351 0.176654
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.01689 on 40245 degrees of freedom
## Multiple R-squared: 0.9992, Adjusted R-squared: 0.9992
## F-statistic: 2.412e+06 on 20 and 40245 DF, p-value: < 2.2e-16
# --- STEP 3: FINAL VISUAL CHECK ---
# These plots should look much "cleaner" than the previous ones
par(mfrow = c(2, 2))
plot(final_model_v2)
# --- STEP 4: FINAL TEST PREDICTION ---
# Create test_cleaned before running predictions
test_cleaned <- test_data %>%
filter(x > 0, y > 0, z > 0) %>%
mutate(
log_carat = log(carat),
log_x = log(x)
)
# Use the NEW model (v2) to predict the test set
test_predictions <- predict(final_model_v2, newdata = test_cleaned)
actual_carat <- test_cleaned$carat
predicted_carat <- exp(test_predictions)
# Model Quality metric
mean(abs(actual_carat - predicted_carat))
## [1] 0.01124639
This means that on average, our model’s predictions are off by only about 0.01 carats, which is an exceptionally high level of precision for estimating weight without a scale.
\[\log(\text{carat}) = -7.237 + 3.011 \cdot \log(x) + 0.02317 \cdot \text{depth} + 0.003649 \cdot \text{table} + \sum (\text{Qualitative Adjustments}) + \epsilon\]
Interpretation
The Intercept (-7.237):
This is the predicted value of \(\log(\text{carat})\) when all other numerical predictors are zero and categorical variables are at their reference levels.
The Physical Engine (\(3.011 \cdot \log(x)\)):
This is the most critical part of your model. It represents an elasticity—a 1% increase in the length (\(x\)) of a diamond leads to an approximate 3.011% increase in carat weight. This coefficient aligns with the geometric principle that volume (and thus weight) scales to the third power of length.
Shape Adjustments:
Depth (0.02317):
For every one-unit increase in depth percentage, the \(\log(\text{carat})\) increases by 0.02317.
Table (0.003649):
For every one-unit increase in table width, the \(\log(\text{carat})\) increases by 0.003649.
Qualitative Adjustments
(Cut, Color, Clarity):
These variables use polynomial contrasts (linear .L, quadratic .Q, cubic .C, etc.). For example:Cut: The significant linear (0.00241) and quadratic (-0.008459) terms indicate that the quality of the cut significantly shifts the weight-to-size ratio.Color and Clarity: While many of these higher-order terms are statistically significant (indicated by ***), their coefficients are very small (e.g., \(1.788 \times 10^{-4}\) for color.L), showing they provide fine-tuning to the prediction rather than driving the bulk of the weight.
Final Prediction Formula
If you want to predict the actual carat weight (not just the log), you must exponentiate the entire result:
\[\text{Predicted Carat} = e^{(-7.237 + 3.011 \cdot \log(x) + 0.02317 \cdot \text{depth} + 0.003649 \cdot \text{table} + \dots)}\]
Summary of Insights In this analysis
I addressed the research question:
Can I accurately predict the carat weight of a diamond using only its physical dimensions and qualitative characteristics?
Our findings indicate a resounding yes. I found that carat weight is not just correlated with these features but is almost entirely determined by them, provided the relationship is modeled according to the laws of physical geometry.
The Power of Dimensionality:
I found that the length (\(x\)) of a diamond is the single most critical predictor. Because diamonds are three-dimensional objects, I observed that weight scales to the third power of length. Our model’s calculation of a \(3.01\) elasticity for the length variable confirms that our statistical approach aligns perfectly with physical reality.
Precision of Prediction:
By refining our model, I achieved a Mean Absolute Error (MAE) of 0.011 carats. This means that on average, our predictions are off by only about one-hundredth of a carat. For a jeweler or researcher without a scale, this model provides a highly reliable alternative for weight estimation.
The Role of Quality:
I found that while size (\(x\)) explains the bulk of the weight, qualitative traits like Cut, Color, and Clarity are statistically significant. This suggests that diamonds of higher clarity or specific cuts are shaped with different proportions that slightly alter the weight-to-size ratio compared to lower-grade stones.
To arrive at our final candidate model, I followed a rigorous iterative process to ensure the soundness of our approach and the quality of our results.Initial Model and Data Cleaning. I began by identifying “dirty” data specifically 14 records where physical dimensions were recorded as \(0.00\) mm. I omitted these records as they are physical impossibilities. Our initial linear model including all variables (\(x, y, z\), etc.) revealed two major issues:
Multicollinearity:
I found that \(x, y,\) and \(z\) were so highly correlated (VIFs \(> 80\)) that they made the model unstable. I resolved this by retaining \(x\) as our primary dimensional anchor.
Non-linearity:
Our initial residuals showed a distinct “U-shape,” proving that a straight line could not capture the relationship between size and weight.
To satisfy the requirements for a high-quality linear regression, I implemented the following:
Log-Log Transformation:
I applied a natural log transformation to both the target (carat) and our main predictor (x). This “linearized” the relationship and successfully addressed the non-linearity and right-skewness of the data.
Fixing Heteroscedasticity:
The log transformation also stabilized the variance of our errors, moving us from a “fan-shaped” residual plot to a consistent, random distribution of points.
I used broom::augment to calculate studentized residuals and identified observations that were more than 3 standard deviations from the regression line. I found that these outliers were skewing our results; by removing them and refitting the model, I achieved a near-perfect Adjusted R-squared of 0.9992.
In the Classification Project, the Poisson regression model was used to analyze the frequency of school meetings (FSFREQ)
This objective is demonstrated through the application and evaluation of the Poisson model, which is fundamentally rooted in probability theory. The Poisson distribution assumes that the mean (\(\lambda\)) and variance are equal. By performing a dispersion check (finding a mean of 7.19 vs. a variance of 92.28), I demonstrated an understanding of the underlying probability structure required for Maximum Likelihood Estimation (MLE).
The realization that the data violated these probabilistic assumptions (overdispersion) led to the conclusion that a standard Poisson model was not the optimal fit. This shows a deep understanding that statistical models are not just “plug-and-play” tools, but are built upon specific probabilistic foundations that must be verified to ensure valid inference.
In the Classification Project, I utilized Multinomial Logistic Regression to model student grades (grade_multi) across four categories
This work demonstrates the ability to match a model to the nature of the response variable. While binary logistic regression is common, the student grade data was multi-categorical (Mostly A, B, C, or D). Using a multinomial GLM allowed for a more nuanced analysis that preserved the structure of the data rather than oversimplifying it into a “High/Low” binary.
Furthermore, in the Linear Regression Project, I applied a Log-Log transformation to handle the cubic relationship between a diamond’s length and its volume (carat). This shows the ability to adapt the General Linear Model framework to physical contexts recognizing that when the “data context” involves volume, a linear-linear model is mathematically inappropriate, whereas a log-transformation satisfies the model’s underlying assumptions.
In the Classification Project, I compared Lasso Regression, Standard Logistic Regression, LDA, and QDA
Model selection is about more than just picking the highest accuracy; it involves balancing parsimony, interpretability, and predictive power. By using Lasso Regression, I performed automated variable selection to identify which of the many family involvement variables actually contributed to student success. I then created a comprehensive comparison table using metrics like ROC AUC and Brier Score to evaluate the candidates.
I ultimately selected the Lasso Logistic model as the preferred binary classifier because it achieved the highest ROC AUC (0.773) while maintaining a simpler, regularized set of predictors. This systematic approach tuning hyperparameters, comparing cross-validated metrics, and selecting a “final” model—is the hallmark of professional statistical practice.
The Recommendations for GVSU’s K-12 Connect section in the Classification Project.
GVSU’s K-12 Connect should prioritize programs that help families build effective homework routines. School event participation showed weaker and less consistent effects.
This objective requires translating complex statistical outputs (like odds ratios and p-values) into actionable insights. In my reports, I moved beyond technical jargon to explain what the data actually means for stakeholders. For example, in the diamond project, I explained that a 1% increase in length leads to a 3% increase in weight, which is far more intuitive for a jeweler than discussing coefficients in a log-log space.
By providing clear summaries, visual heatmaps for confusion matrices, and specific recommendations for community partners like K-12 Connect, I demonstrated the ability to bridge the gap between data science and real-world application. I focused on telling the story of the data specifically that home-based involvement is a stronger predictor of grades than school-based meetings.
The use of the tidymodels ecosystem in both projects to create reproducible workflows
Throughout both projects, I demonstrated proficiency in R and the tidymodels framework. This included complex data cleaning (handling missingness and filtering impossible \(0.00\) mm diamond dimensions), feature engineering (creating composite school involvement scores), and rigorous model validation (splitting data into training/testing sets and using 10-fold cross-validation).
In the diamond project, I utilized Polynomial Regression (via polynomial contrasts in the cut, color, and clarity factors) to capture non-linear trends in qualitative data. I also used diagnostic plots (Residuals vs. Fitted, Q-Q plots) and VIF scores to assess model health. This technical fluency ensured that the models were not only “fitted” but also “assessed” for violations of multicollinearity, heteroscedasticity, and influence.
Dear Eric,
It is a privilege to reflect on the commitment I made to myself at the start of this semester. Looking back at my initial letter, my goals were centered on building a strong statistical foundation, gaining confidence in model selection, and creating a launchpad for my future in AI and research. Having completed significant projects in both linear and classification modeling, I feel that I have not only approached these goals but, in many technical aspects, surpassed them.
Reflecting on My Initial Goals and Progress
My first goal was to develop a strong foundation in statistical modeling. I successfully demonstrated this through my work on the Diamond Carat project. I moved beyond simple linear regression to implement a log-log transformation, recognizing that the relationship between a diamond’s length and its weight (carat) is cubic rather than linear. This showed exactly the meaningful insight I hoped to achieve. My second goal was to become confident in model evaluation and selection. This was most evident in my Classification Project. I did not just fit a single model, I compared Lasso Regression, Standard Logistic Regression, LDA, QDA, and Poisson regression. I demonstrated professional-level reasoning by choosing the Lasso model for its parsimony and the Multinomial model for its ability to preserve the multi-categorical structure of student grades. I also recognized that while Poisson regression is a powerful tool, it was not appropriate for my specific data due to significant overdispersion. This ability to justify why one approach is better than another was a core competency I aimed to build.
Challenges and Overcoming Obstacles
I anticipated that balancing work and academics would require resilience. One unexpected challenge was the complexity of real-world data. In the diamond dataset, I discovered physical impossibilities like 0.00 mm dimensions, and in the PFI dataset, I navigated significant class imbalances where most students were high achievers. I overcame these by applying rigorous diagnostic checks. I used VIF scores to detect and solve severe multicollinearity and employed studentized residuals to identify and remove outliers that were skewing my results. This systematic approach proved that I moved from last-minute preparation to a methodical and disciplined workflow.
Growth and What Lies Ahead
My third goal was to build an intellectual launchpad for advanced topics like Machine Learning and AI. By mastering the tidymodels ecosystem in R including recipes, workflows, and cross-validation, I have built a toolkit that is directly transferable to the industry. I have moved from someone who takes ownership of their learning to someone who can contribute meaningfully to the tech industry.
While my technical foundations are now solid, the journey of life-long learning continues. For instance, I noted in my project that my current models identify predictive associations rather than causal effects. Exploring causal inference and survey-weighting methods would be a natural next step to deepen my research capabilities. I am leaving this course as a more confident, disciplined, and capable data scientist, ready to build the innovative solutions I once only envisioned.
Sincerely,
Eric Mulaa