For the first step in the analysis, I loaded the appropriate packages.
library(tidyverse)
library(plotly)
library(censReg)
library(mclust)
library(factoextra)
I used the Cohort4_final_gains data from the original Summer 2025 Data Set.
First, we fit a Tobit regression (using the censReg package) to correct the original regression for the scale ceiling. Tobit is specifically designed for “censored” data, where you know a student might have improved even more but the 6/6 scale “censored” their true ability.
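For reference, the model fitted below can be written as a latent-variable regression. This is the standard Tobit formulation, not something specific to this dataset; the symbols \(y_i^*\) (the latent, uncensored score), \(c\) (the scale ceiling: 6 for self-efficacy, 5 for resilience), and \(\varepsilon_i\) are notation introduced here for illustration.

$$y_i^* = \beta_0 + \beta_1\,\text{Pre}_i + \beta_2\,\text{Months}_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$

$$y_i = \min(y_i^*,\; c)$$

The coefficients reported by censReg describe the latent score \(y_i^*\), and logSigma in the output is \(\log \sigma\).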
# Tobit for Self-Efficacy (Ceiling = 6)
tobit_se <- censReg(Last_Post_SE ~ First_Pre_SE + Months_Participated,
                    right = 6,
                    data = cohort4_cluster)
summary(tobit_se)
Call:
censReg(formula = Last_Post_SE ~ First_Pre_SE + Months_Participated,
right = 6, data = cohort4_cluster)
Observations:
Total Left-censored Uncensored Right-censored
40 0 38 2
Coefficients:
Estimate Std. error t value Pr(> t)
(Intercept) 2.096104 0.668747 3.134 0.00172 **
First_Pre_SE 0.613233 0.118992 5.154 2.56e-07 ***
Months_Participated 0.001666 0.077346 0.022 0.98281
logSigma -1.026440 0.116090 -8.842 < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Newton-Raphson maximisation, 6 iterations
Return code 1: gradient close to zero (gradtol)
Log-likelihood: -17.22924 on 4 Df
# Tobit for Resilience (Ceiling = 5)
tobit_res <- censReg(Last_Post_Res ~ First_Pre_Res + Months_Participated,
                     right = 5,
                     data = cohort4_cluster)
summary(tobit_res)
Call:
censReg(formula = Last_Post_Res ~ First_Pre_Res + Months_Participated,
right = 5, data = cohort4_cluster)
Observations:
Total Left-censored Uncensored Right-censored
40 0 23 17
Coefficients:
Estimate Std. error t value Pr(> t)
(Intercept) -0.77628 1.13609 -0.683 0.49442
First_Pre_Res 1.24013 0.24451 5.072 3.94e-07 ***
Months_Participated -0.01194 0.13711 -0.087 0.93061
logSigma -0.52610 0.16133 -3.261 0.00111 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Newton-Raphson maximisation, 7 iterations
Return code 1: gradient close to zero (gradtol)
Log-likelihood: -31.04967 on 4 Df
Observations: Total 40, Right-censored 2
Of your 40 participants, 2 hit the ceiling of 6.0. In an ordinary least-squares regression, these two people would “pull” your results down because their recorded scores had nowhere higher to go. The Tobit model treats their “true” self-efficacy as 6.0 or higher and accounts for that “missing” potential growth.
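The “pulling down” effect of a ceiling can be illustrated with a small simulation. This is a sketch on made-up data, not the cohort data: the sample size, the true slope of 1, the noise level, and the ceiling of 5 are all assumptions chosen for illustration. It uses only base R and compares an ordinary regression on the uncapped scores with the same regression after the scores are capped.

```r
set.seed(42)

# Hypothetical simulated data (not the cohort data)
n        <- 2000
x        <- runif(n, min = 3, max = 5)          # made-up baseline scores
y_latent <- 1 + 1.0 * x + rnorm(n, sd = 0.6)    # "true" scores, true slope = 1
y_capped <- pmin(y_latent, 5)                   # the scale stops at 5

# Ordinary least-squares slope with and without the ceiling
slope_latent <- coef(lm(y_latent ~ x))[["x"]]
slope_capped <- coef(lm(y_capped ~ x))[["x"]]

round(c(latent = slope_latent, capped = slope_capped), 2)
```

The capped slope comes out clearly smaller than the latent one, which is the bias a Tobit model is meant to correct.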
First_Pre_SE (Estimate: 0.61, \(p < .001\)):
This is highly significant. It tells us that for every 1-point increase in a student’s starting score, their final score is predicted to be 0.61 points higher. Essentially, baseline performance is the strongest predictor of where they end up. This is very common in educational interventions.
Months_Participated (Estimate: 0.0017, \(p = .98\)):
This is your “dosage” effect. The estimate is very close to zero, and the p-value is nearly 1. This means that the length of participation showed no detectable association with the final score, even when we account for the people stuck at the ceiling.
logSigma (-1.026):
This is the natural logarithm of the residual standard deviation (the “noise”) in your model, so \(\sigma = e^{-1.026} \approx 0.36\) points on the 6-point scale. Its small standard error means the model estimates a consistent amount of error around your predictions.
First_Pre_Res (Estimate: 1.24, \(p < .001\)):
This is fascinating. A coefficient over 1.0 suggests that participants who started higher tended to stay high or “pull away” even more on the latent (uncensored) scale. Usually, we expect “regression to the mean” (where high starters grow less), but here baseline resilience is a powerful, positive predictor of the final outcome. With a standard error of 0.24, though, the slope is not statistically distinguishable from 1.0, so the “pulling away” reading is tentative.
Months_Participated (Estimate: -0.01, \(p = .93\)):
Once again, the length of time in the program isn’t driving the change. Even with 17 people hitting the ceiling, adding more months didn’t statistically shift the underlying resilience score.
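Whether the resilience slope of 1.24 really exceeds 1.0 can be checked with a simple Wald test. The sketch below uses only the estimate and standard error printed in the resilience summary above and a normal approximation; the variable names are illustrative.

```r
# Values copied from the First_Pre_Res row of summary(tobit_res)
est <- 1.24013
se  <- 0.24451

# Wald test of H0: slope = 1 (normal approximation)
z <- (est - 1) / se
p <- 2 * pnorm(abs(z), lower.tail = FALSE)

round(c(z = z, p = p), 3)
```

The resulting p-value is well above .05, so the data are compatible with a slope of exactly 1.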
In a Tobit regression, logSigma isn’t just a nuisance variable; it’s the key to understanding the “spread” or uncertainty of your model’s predictions.
The logSigma value is the natural logarithm of the standard deviation (\(\sigma\)) of the error term (the residuals). Because standard deviation must always be a positive number, the model estimates it in “log space” to ensure it never accidentally drops below zero during the math.
To make sense of it for your poster, you need to “un-log” it using the exponential function (\(e^x\)):
$$\sigma = e^{-0.52610} \approx 0.591$$
This tells us that the estimated standard deviation of the errors is 0.591. On a 5-point resilience scale, a “typical” deviation from the predicted line is about 0.6 points.
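The same conversion can be done in R for both models. This is a minimal sketch that types in the logSigma values from the two summaries above; with the fitted objects available, the same numbers could presumably be pulled from the model coefficients instead of being copied by hand.

```r
# logSigma estimates copied from the two model summaries
log_sigma <- c(self_efficacy = -1.026440, resilience = -0.52610)

# Back-transform from log space to the original score scale
sigma <- exp(log_sigma)

round(sigma, 3)
```

The self-efficacy model has the smaller residual spread of the two, even though its scale is one point longer.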
Your logSigma is estimated precisely. (The p-value printed beside it only tests whether logSigma differs from 0, that is, whether \(\sigma\) differs from 1, so it is not itself a measure of fit.) In this context, the precision is a good thing. It means:
The model found structure: There is a consistent, measurable amount of variation in your data.
The Tobit math is working: The model could separate the “real” variation in the uncensored data from the “estimated” variation for the 17 people at the ceiling.
Remember that you have 17 right-censored observations.
The Tobit model uses \(\sigma\) (derived from your logSigma) to guess how far above 5.0 those 17 people would have scored if the scale didn’t stop.
A \(\sigma\) of 0.591 suggests that the model thinks those censored participants aren’t just exactly 5.0; it’s estimating that their “latent” (true) resilience scores are likely distributed in a range, with some potentially being as high as 5.6 or 6.0 if the scale allowed it.
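To make this concrete, the expected latent score of a censored participant can be computed from the mean of a truncated normal distribution. The sketch below uses the coefficients from the resilience summary; the participant (baseline score of 5, 3 months in the program) is hypothetical and chosen only for illustration.

```r
# Coefficients copied from summary(tobit_res)
b0       <- -0.77628
b_pre    <-  1.24013
b_months <- -0.01194
sigma    <- exp(-0.52610)
ceiling_score <- 5

# Hypothetical participant (illustrative values, not a real record)
pre    <- 5
months <- 3

# Predicted latent score before taking censoring into account
mu <- b0 + b_pre * pre + b_months * months

# Expected latent score given that it is at or above the ceiling:
# mean of a normal distribution truncated from below at the ceiling
alpha       <- (ceiling_score - mu) / sigma
latent_mean <- mu + sigma * dnorm(alpha) / pnorm(alpha, lower.tail = FALSE)

round(c(mu = mu, latent_mean = latent_mean), 2)
```

For this hypothetical participant the expected latent score lands a little above 5.6, in line with the range described above.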
While our initial analysis suggested no growth in resilience, the Tobit regression revealed a substantial ceiling effect, with 42.5% of participants reaching the maximum score of 5.0. Even after accounting for this censoring, the length of participation remained a non-significant predictor. This suggests that participants may have entered the program with high ‘pre-existing’ resilience, or the intervention’s impact on resilience occurs independently of time spent in the program.
| Metric | Total N | Ceiling Count | Baseline Predictor (β) | Dosage Predictor (β) | Model Significance |
|---|---|---|---|---|---|
| Self-Efficacy | 40 | 2 (5%) | \(0.61\)*** | \(0.002\) (ns) | Strong |
| Resilience | 40 | 17 (42.5%) | \(1.24\)*** | \(-0.01\) (ns) | Very Strong |

Note: *** \(p < .001\); ns = non-significant.
Visualizing the ceiling effect for resilience
# Example for Resilience
ggplot(cohort4_cluster, aes(x = First_Pre_Res, y = Last_Post_Res)) +
  geom_jitter(aes(color = (Last_Post_Res >= 5)), width = 0.1, height = 0.1) +
  # Fitted Tobit line; linewidth replaces the deprecated size argument (ggplot2 >= 3.4.0)
  geom_abline(slope = 1.24, intercept = -0.78, color = "gray", linewidth = 0.5) +
  # CEILING COLOR
  scale_color_manual(values = c("black", "red"),
                     labels = c("Uncensored", "At Ceiling (5.0)")) +
  theme_minimal(base_size = 14) +
  # AXIS LABELS
  labs(title = "Tobit Regression: Resilience Ceiling Effect",
       subtitle = "Red points are 'censored' at the 5.0 maximum",
       x = "First Resilience Score",
       y = "Last Resilience Score",
       color = "Status")

# Output the file
ggsave(
  filename = "Tobit_Resilience.svg",
  device = "svg",
  width = 8,   # Width in inches
  height = 6,  # Height in inches
  units = "in"
)

glimpse(cohort4_cluster)
Rows: 40
Columns: 11
$ StudyNumber <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ Months_Participated <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 3, 2, 2, 2, 2,…
$ First_Pre_SE <dbl> 4.250, 4.500, 5.000, 5.250, 5.375, 5.250, 5.500, 4…
$ Last_Post_SE <dbl> 4.500, 4.750, 5.125, 6.000, 5.125, 5.000, 5.000, 5…
$ First_Pre_Res <dbl> 4.0, 3.5, 4.0, 4.5, 5.0, 4.0, 5.0, 3.5, 4.0, 5.0, …
$ Last_Post_Res <dbl> 3.5, 4.0, 4.0, 5.0, 5.0, 3.5, 4.0, 3.5, 4.0, 5.0, …
$ Mentoring_Category <chr> "mentee", "mentee", "mentee", "mentee", "mentee", …
$ Research_Category <chr> "Undergraduates", "Undergraduates", "Undergraduate…
$ Gain_SelfEfficacy <dbl> 0.1428571, 0.1666667, 0.1250000, 1.0000000, -0.400…
$ Gain_Resilience <dbl> -0.5000000, 0.3333333, 0.0000000, 1.0000000, 0.000…
$ Months_Factor <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 3, 2, 2, 2, 2,…
# Selecting your decimal gain data
gmm_data <- cohort4_cluster %>%
  select(Gain_SelfEfficacy, Gain_Resilience) %>%
  drop_na()

# Run the model, restricting the number of clusters to between 2 and 5
gmm_model <- Mclust(gmm_data, G = 2:5)

# Summary of the best model and how many clusters it found
summary(gmm_model)
----------------------------------------------------
Gaussian finite mixture model fitted by EM algorithm
----------------------------------------------------
Mclust VEV (ellipsoidal, equal shape) model with 3 components:
log-likelihood n df BIC ICL
-25.84853 40 15 -107.0303 -111.6208
Clustering table:
1 2 3
15 15 10
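The BIC in this summary can be reproduced by hand from the log-likelihood, the number of parameters (df), and the sample size. Note that mclust defines BIC as \(2 \log L - \text{df} \cdot \log n\), so larger (less negative) values indicate a better model, the opposite of the convention used by many other packages. The sketch below uses only the numbers printed above.

```r
# Values copied from summary(gmm_model)
log_lik <- -25.84853
n       <- 40
df      <- 15

# mclust's BIC convention: larger is better
bic <- 2 * log_lik - df * log(n)

round(bic, 4)
```

With 15 parameters estimated from 40 observations, the penalty term accounts for roughly half of the BIC, so the 3-cluster solution is worth treating as exploratory.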
# Visualizing these simplified groups
fviz_mclust(gmm_model, "classification", geom = "point",
            pointsize = 3.5, palette = "jco")

# Same plot with smaller points, a different palette, and titles
fviz_mclust(gmm_model, "classification", geom = "point",
            pointsize = 2, palette = "Dark2") +
  theme_minimal() +
  labs(title = "Latent Profiles of Student Growth",
       subtitle = "3-Cluster Solution using Gaussian Mixture Modeling")

Next, we color the points by mentoring category to see how those categories map onto the clusters.
# Create a subset of your original data that matches the rows used in the GMM
plot_metadata <- cohort4_cluster %>%
  drop_na(Gain_SelfEfficacy, Gain_Resilience) %>%
  mutate(GMM_Cluster = as.factor(gmm_model$classification))

ggplot(plot_metadata, aes(x = Gain_SelfEfficacy, y = Gain_Resilience)) +
  # Draw one ellipse per GMM cluster
  # All ellipses share a gray dashed style so that color is reserved for mentoring category
  stat_ellipse(aes(group = GMM_Cluster), color = "gray60", linetype = "dashed") +
  # Color the points by Mentoring Category
  geom_point(aes(color = Mentoring_Category), size = 3, alpha = 0.8) +
  # Clean up the theme
  theme_minimal() +
  scale_color_brewer(palette = "Set1") + # High-contrast colors for your poster
  labs(
    title = "Mentoring Categories mapped onto Growth Profiles",
    subtitle = "Dashed ellipses represent GMM latent clusters",
    x = "Gain in Self-Efficacy",
    y = "Gain in Resilience",
    color = "Mentoring Category"
  )