Cluster Analysis

Author

Cindy Trussell

Published

February 23, 2026

Tobit Regression and Cluster Analysis

For the first step in the analysis, I loaded the appropriate packages.

library(tidyverse)
library(plotly)
library(censReg)
library(mclust)
library(factoextra)

Bring in the Data for Analysis

I used the Cohort4_final_gains data from the original Summer 2025 Data Set.

Tobit Regression for Ceiling Problems

First, we fitted a Tobit regression (using the censReg package) in place of ordinary least squares, which is biased when scores pile up at the top of a scale. Tobit is specifically designed for “censored” data: a student’s true level might lie above the top of the scale, but the instrument cannot record anything higher than the maximum (6 for self-efficacy, 5 for resilience).
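In symbols, the right-censored Tobit model described above can be written as follows. This is the standard textbook form; the notation (\(y_i^*\) for the latent score, \(c\) for the ceiling) is introduced here for illustration and does not appear in the original analysis.

```latex
y_i^* = \beta_0 + \beta_1\,\mathrm{Pre}_i + \beta_2\,\mathrm{Months}_i + \varepsilon_i,
\qquad \varepsilon_i \sim N(0, \sigma^2),
\qquad y_i = \min(y_i^*,\, c)
```

Here \(y_i\) is the recorded final score, and \(c = 6\) for self-efficacy or \(c = 5\) for resilience. The coefficients reported below describe the latent score \(y_i^*\), not the capped score \(y_i\).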

# Tobit for Self-Efficacy (Ceiling = 6)
tobit_se <- censReg(Last_Post_SE ~ First_Pre_SE + Months_Participated, 
                    right = 6, 
                    data = cohort4_cluster)

summary(tobit_se)

Call:
censReg(formula = Last_Post_SE ~ First_Pre_SE + Months_Participated, 
    right = 6, data = cohort4_cluster)

Observations:
         Total  Left-censored     Uncensored Right-censored 
            40              0             38              2 

Coefficients:
                     Estimate Std. error t value  Pr(> t)    
(Intercept)          2.096104   0.668747   3.134  0.00172 ** 
First_Pre_SE         0.613233   0.118992   5.154 2.56e-07 ***
Months_Participated  0.001666   0.077346   0.022  0.98281    
logSigma            -1.026440   0.116090  -8.842  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Newton-Raphson maximisation, 6 iterations
Return code 1: gradient close to zero (gradtol)
Log-likelihood: -17.22924 on 4 Df

The same model was then fitted for resilience, where the scale maximum is 5.

# Tobit for Resilience (Ceiling = 5)
tobit_res <- censReg(Last_Post_Res ~ First_Pre_Res + Months_Participated, 
                     right = 5, 
                     data = cohort4_cluster)

summary(tobit_res)

Call:
censReg(formula = Last_Post_Res ~ First_Pre_Res + Months_Participated, 
    right = 5, data = cohort4_cluster)

Observations:
         Total  Left-censored     Uncensored Right-censored 
            40              0             23             17 

Coefficients:
                    Estimate Std. error t value  Pr(> t)    
(Intercept)         -0.77628    1.13609  -0.683  0.49442    
First_Pre_Res        1.24013    0.24451   5.072 3.94e-07 ***
Months_Participated -0.01194    0.13711  -0.087  0.93061    
logSigma            -0.52610    0.16133  -3.261  0.00111 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Newton-Raphson maximisation, 7 iterations
Return code 1: gradient close to zero (gradtol)
Log-likelihood: -31.04967 on 4 Df

Tobit Regression for Self-Efficacy

1. The Censoring Breakdown

Observations: Total 40, Right-censored 2

This confirms that 2 of your 40 participants (5%) hit the ceiling of 6.0. An ordinary least-squares regression would treat those two scores as exactly 6.0, which tends to bias the estimates (typically flattening the slope) because the recorded score can understate the participant’s true level. The Tobit model instead treats 6.0 as a lower bound on their “true” self-efficacy. With only two censored cases, the correction is likely to be small for this outcome.
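The effect of a ceiling on an ordinary regression can be illustrated with simulated data. The sketch below is not the cohort data: the sample size, true slope of 0.9, and noise level are invented for illustration, and the Tobit likelihood is written out by hand in base R rather than fitted with censReg.

```r
# Simulated illustration only: none of these numbers come from cohort4_cluster.
set.seed(42)
n <- 2000
pre <- runif(n, 3, 6)                          # hypothetical baseline scores
latent <- 1 + 0.9 * pre + rnorm(n, sd = 0.5)   # "true" final score, uncapped
ceiling_val <- 6
post <- pmin(latent, ceiling_val)              # what a capped scale records
censored <- post >= ceiling_val

# Ordinary least squares treats a capped 6.0 as the true score
ols_fit <- lm(post ~ pre)
ols_slope <- unname(coef(ols_fit)["pre"])

# Negative log-likelihood of a right-censored Tobit model
neg_loglik <- function(par) {
  mu <- par[1] + par[2] * pre
  sigma <- exp(par[3])                         # sigma is estimated on the log scale
  ll <- ifelse(
    censored,
    pnorm((ceiling_val - mu) / sigma, lower.tail = FALSE, log.p = TRUE),
    dnorm(post, mean = mu, sd = sigma, log = TRUE)
  )
  -sum(ll)
}

# Start from the OLS estimates and maximise the likelihood
start <- c(coef(ols_fit), log(sd(post)))
fit <- optim(start, neg_loglik, method = "BFGS")
tobit_slope <- unname(fit$par[2])

round(c(true = 0.9, ols = ols_slope, tobit = tobit_slope), 3)
```

In this simulation a large share of scores is censored, so the gap between the two slopes is much wider than would be expected with 2 censored cases out of 40.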

2. The Predictors (Coefficients)

  • First_Pre_SE (Estimate: 0.61, \(p < .001\)):

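    This is highly significant. It tells us that for every 1-point increase in a student’s starting score, their predicted final score (on the latent, uncapped scale) is 0.61 points higher. Baseline self-efficacy is by far the stronger of the two predictors of where students end up, which is very common in educational interventions. A slope below 1 also means that initial differences between students shrank, on average, by the final measurement.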

  • Months_Participated (Estimate: 0.0017, \(p = .98\)):

    This is your “dosage” effect. The estimate is very close to zero relative to its standard error (0.077), and the p-value is nearly 1. There is no evidence that length of participation was associated with the final score, even after accounting for the participants at the ceiling. This is an absence of evidence in a sample of 40, not proof that duration has no effect.

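  • logSigma (-1.026):

    This is the natural logarithm of the residual standard deviation (the “noise” around the predictions), not the variance. Back-transforming gives \(\sigma = e^{-1.026} \approx 0.36\), so a typical deviation from the predicted score is about 0.36 points. The significance stars on this row only test whether logSigma differs from 0 (that is, whether \(\sigma\) differs from 1) and say nothing about how well the model fits.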
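The back-transformation of logSigma can be checked with a few lines of base R. This is a minimal sketch that copies the estimate and standard error from the summary(tobit_se) output above; the interval is an approximate Wald interval built on the log scale, which is one common choice rather than the only one.

```r
# Values copied from the summary(tobit_se) output above
log_sigma    <- -1.026440
se_log_sigma <- 0.116090

# Residual standard deviation on the self-efficacy scale
sigma_hat <- exp(log_sigma)

# Approximate 95% interval: build it on the log scale, then exponentiate
sigma_ci <- exp(log_sigma + c(-1, 1) * qnorm(0.975) * se_log_sigma)

round(c(sigma = sigma_hat, lower = sigma_ci[1], upper = sigma_ci[2]), 3)
```

Reporting \(\sigma\) with an interval is usually more informative than reporting the p-value attached to logSigma.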

Tobit Regression for Resilience

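1. The Censoring Breakdown

Observations: Total 40, Right-censored 17

Seventeen of the 40 participants (42.5%) scored at the ceiling of 5.0, so censoring matters far more for resilience than it did for self-efficacy.

2. The Predictors (Coefficients)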

  • First_Pre_Res (Estimate: 1.24, \(p < .001\)):

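    This is fascinating. A coefficient above 1.0 would mean that participants who started higher “pulled away” even more, the opposite of the usual “regression to the mean” (where high starters grow less). Two cautions apply. First, the standard error is 0.24, so the estimate is not clearly distinguishable from 1.0. Second, the coefficient describes the latent (uncapped) resilience score, so it tends to be steeper than the pattern visible in the capped scores. What the model does support is that baseline resilience is a strong, positive predictor of the final outcome.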

  • Months_Participated (Estimate: -0.01, \(p = .93\)):

    Once again, the length of time in the program isn’t driving the change. Even with 17 people hitting the ceiling, adding more months didn’t statistically shift the underlying resilience score.

  • logSigma (-0.526):

    In a Tobit regression, logSigma is more than a nuisance parameter; it describes the “spread” or uncertainty of your model’s predictions.

    1. What is logSigma?

    The logSigma value is the natural logarithm of the standard deviation (\(\sigma\)) of the error term (the residuals). Because a standard deviation must always be positive, the model estimates it in “log space” so that the optimizer can never push it below zero.

    2. Converting it back to Reality

    To make sense of it for your poster, you need to “un-log” it using the exponential function (\(e^x\)):

    $$\sigma = e^{-0.52610} \approx 0.591$$

    This tells us that the estimated standard deviation of the errors is 0.591. On a 5-point resilience scale, a “typical” deviation from the predicted line is about 0.6 points.

    3. Evaluating the Significance (\(p = 0.00111\))

    The p-value reported for logSigma is not a measure of model quality. It tests the null hypothesis that logSigma = 0, that is, that \(\sigma = 1\), which has no substantive meaning here. What is worth reporting instead is:

    • The size of \(\sigma\): about 0.59 points of residual spread on the latent resilience scale.

    • Its precision: the standard error of 0.16 on the log scale shows that the spread is estimated with some uncertainty, which may partly reflect that 17 of the 40 observations are censored.

    4. Why this matters for your Resilience data

    Remember that you have 17 right-censored observations.

    The Tobit model uses \(\sigma\) (derived from your logSigma) to guess how far above 5.0 those 17 people would have scored if the scale didn’t stop.

    A \(\sigma\) of 0.591 suggests that the model thinks those censored participants aren’t just exactly 5.0; it’s estimating that their “latent” (true) resilience scores are likely distributed in a range, with some potentially being as high as 5.6 or 6.0 if the scale allowed it.
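How the model “fills in” a censored score can be sketched with the mean of a truncated normal distribution. The coefficients below are copied from the summary(tobit_res) output; the participant (baseline 4.5, 3 months) is hypothetical and chosen only for illustration.

```r
# Coefficients copied from summary(tobit_res) above
b0       <- -0.77628
b_pre    <- 1.24013
b_months <- -0.01194
sigma    <- exp(-0.52610)
ceiling_val <- 5

# Hypothetical participant (illustrative values, not a row from the data)
pre    <- 4.5
months <- 3

# Predicted latent (uncapped) resilience score
mu <- b0 + b_pre * pre + b_months * months

# Standardised distance from the prediction to the ceiling
alpha <- (ceiling_val - mu) / sigma

# Probability that the latent score lies above the ceiling
p_above_ceiling <- pnorm(alpha, lower.tail = FALSE)

# Expected latent score given that the participant was recorded at the ceiling
# (mean of a normal distribution truncated below at ceiling_val)
latent_if_censored <- mu + sigma * dnorm(alpha) / p_above_ceiling

round(c(mu = mu, p_above_ceiling = p_above_ceiling,
        latent_if_censored = latent_if_censored), 3)
```

The expected latent score for a censored participant is always above 5.0, and how far above depends on both the prediction and \(\sigma\).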

While our initial analysis suggested no growth in resilience, the data show a substantial ceiling effect: 42.5% of participants (17 of 40) reached the maximum score of 5.0 at the final measurement. Even after the Tobit model accounted for this censoring, the length of participation remained a non-significant predictor. This suggests that participants may have entered the program with high “pre-existing” resilience, or that any impact of the intervention on resilience does not depend on time spent in the program.

| Metric | Total N | Ceiling Count | Baseline Predictor (β) | Dosage Predictor (β) | Model Significance |
|---|---|---|---|---|---|
| Self-Efficacy | 40 | 2 (5%) | \(0.61\)*** | \(0.002\) (ns) | Strong |
| Resilience | 40 | 17 (42.5%) | \(1.24\)*** | \(-0.01\) (ns) | Very Strong |

Note: *** \(p < .001\); ns = non-significant.
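Approximate 95% confidence intervals make the table easier to interpret than stars alone. The sketch below computes Wald intervals (estimate ± 1.96 standard errors) from the estimates and standard errors printed in the two summary() outputs; the data frame `coefs` is a name introduced here for illustration.

```r
# Estimates and standard errors copied from the two summary() outputs above
coefs <- data.frame(
  model    = c("Self-Efficacy", "Self-Efficacy", "Resilience", "Resilience"),
  term     = c("First_Pre_SE", "Months_Participated",
               "First_Pre_Res", "Months_Participated"),
  estimate = c(0.613233, 0.001666, 1.24013, -0.01194),
  se       = c(0.118992, 0.077346, 0.24451, 0.13711)
)

# Wald 95% intervals on the latent-score scale
z <- qnorm(0.975)
coefs$lower <- coefs$estimate - z * coefs$se
coefs$upper <- coefs$estimate + z * coefs$se

print(coefs, digits = 3)
```

Both dosage intervals span zero, and the interval for the resilience baseline slope (roughly 0.76 to 1.72) includes 1.0, so the data do not distinguish “pulling away” from simply holding steady.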

Visualizing the ceiling effect for resilience

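# Resilience: observed scores with the fitted Tobit line
# Take the intercept and slope from the fitted model instead of typing rounded
# values by hand; the line omits the near-zero Months_Participated term
tobit_coef <- coef(tobit_res)

ggplot(cohort4_cluster, aes(x = First_Pre_Res, y = Last_Post_Res)) +
  geom_jitter(aes(color = (Last_Post_Res >= 5)), width = 0.1, height = 0.1) +
  # linewidth replaces size for lines in ggplot2 3.4.0 and later
  geom_abline(slope = tobit_coef[["First_Pre_Res"]],
              intercept = tobit_coef[["(Intercept)"]],
              color = "gray", linewidth = 0.5) +

  # CEILING COLOR: name the values so FALSE/TRUE cannot be swapped
  scale_color_manual(values = c(`FALSE` = "black", `TRUE` = "red"),
                     labels = c("Uncensored", "At Ceiling (5.0)")) +
  theme_minimal(base_size = 14) +
  # AXIS LABELS
  labs(title = "Tobit Regression: Resilience Ceiling Effect",
       subtitle = "Red points are 'censored' at the 5.0 maximum",
       x = "First Resilience Score",
       y = "Last Resilience Score",
       color = "Status")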

# Save the most recent plot as SVG (ggsave's "svg" device requires the svglite package)
ggsave(
  filename = "Tobit_Resilience.svg", 
  device = "svg",
  width = 8,      # Width in inches
  height = 6,     # Height in inches
  units = "in"
)

Cluster Analysis

glimpse(cohort4_cluster)
Rows: 40
Columns: 11
$ StudyNumber         <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,…
$ Months_Participated <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 3, 2, 2, 2, 2,…
$ First_Pre_SE        <dbl> 4.250, 4.500, 5.000, 5.250, 5.375, 5.250, 5.500, 4…
$ Last_Post_SE        <dbl> 4.500, 4.750, 5.125, 6.000, 5.125, 5.000, 5.000, 5…
$ First_Pre_Res       <dbl> 4.0, 3.5, 4.0, 4.5, 5.0, 4.0, 5.0, 3.5, 4.0, 5.0, …
$ Last_Post_Res       <dbl> 3.5, 4.0, 4.0, 5.0, 5.0, 3.5, 4.0, 3.5, 4.0, 5.0, …
$ Mentoring_Category  <chr> "mentee", "mentee", "mentee", "mentee", "mentee", …
$ Research_Category   <chr> "Undergraduates", "Undergraduates", "Undergraduate…
$ Gain_SelfEfficacy   <dbl> 0.1428571, 0.1666667, 0.1250000, 1.0000000, -0.400…
$ Gain_Resilience     <dbl> -0.5000000, 0.3333333, 0.0000000, 1.0000000, 0.000…
$ Months_Factor       <dbl> 3, 3, 3, 3, 3, 3, 3, 3, 3, 2, 2, 2, 3, 2, 2, 2, 2,…
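The two gain columns appear to be normalized gains, that is, the share of the available headroom that each participant gained. This is an inference from the first rows shown in the glimpse() output, not something the source states; the function name `normalized_gain` is introduced here for illustration.

```r
# First four rows as shown in the glimpse() output above
pre_se   <- c(4.250, 4.500, 5.000, 5.250)
post_se  <- c(4.500, 4.750, 5.125, 6.000)
pre_res  <- c(4.0, 3.5, 4.0, 4.5)
post_res <- c(3.5, 4.0, 4.0, 5.0)

# Normalized gain: (post - pre) / (scale maximum - pre)
# Assumed formula; it is undefined (0/0) when pre already equals the maximum
normalized_gain <- function(pre, post, scale_max) {
  (post - pre) / (scale_max - pre)
}

gain_se  <- normalized_gain(pre_se,  post_se,  scale_max = 6)
gain_res <- normalized_gain(pre_res, post_res, scale_max = 5)

# Compare with the first four values of Gain_SelfEfficacy and Gain_Resilience
round(cbind(gain_se, gain_res), 7)
```

Two consequences are worth keeping in mind if this reading is right. Anyone who reaches the ceiling gets a gain of exactly 1, so the ceiling effect carries over into the clustering. And the fifth participant starts at 5.0 on resilience, where the formula gives 0/0, yet the data show a gain of 0, so that case must have been handled separately in a way the output does not reveal.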
The clustering step below uses only the two gain columns.

# Select the two gain columns (stored as decimals) and drop incomplete rows
gmm_data <- cohort4_cluster %>%
  select(Gain_SelfEfficacy, Gain_Resilience) %>%
  drop_na()

# Run the model, allowing between 2 and 5 clusters (a single-cluster solution is not considered)
gmm_model <- Mclust(gmm_data, G = 2:5)

# Summary of the best model and how many clusters it found
summary(gmm_model)
---------------------------------------------------- 
Gaussian finite mixture model fitted by EM algorithm 
---------------------------------------------------- 

Mclust VEV (ellipsoidal, equal shape) model with 3 components: 

 log-likelihood  n df       BIC       ICL
      -25.84853 40 15 -107.0303 -111.6208

Clustering table:
 1  2  3 
15 15 10 
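The BIC and degrees of freedom in this summary can be reproduced by hand, which helps when explaining the model on a poster. The sketch below copies the log-likelihood and sample size from the output above and counts the free parameters of a two-dimensional, three-component VEV model; the variable names are introduced here for illustration.

```r
# Values copied from the summary(gmm_model) output above
loglik <- -25.84853
n      <- 40
d      <- 2   # dimensions: Gain_SelfEfficacy and Gain_Resilience
G      <- 3   # number of mixture components

# Free parameters in a VEV model (variable volume, equal shape, variable orientation)
n_means  <- G * d                               # component means
n_mixing <- G - 1                               # mixing proportions
n_cov    <- G + (d - 1) + G * d * (d - 1) / 2   # volumes + shared shape + orientations
df       <- n_means + n_mixing + n_cov

# mclust reports BIC as 2 * logLik - df * log(n), so larger values are better
bic <- 2 * loglik - df * log(n)

c(df = df, BIC = round(bic, 4))
```

With 15 free parameters estimated from 40 observations, the three-cluster solution is probably best treated as exploratory. Note also that mclust's sign convention is the reverse of the more common “smaller BIC is better” rule.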
The fitted clusters can then be plotted.

# Visualize the cluster assignments
fviz_mclust(gmm_model, "classification", geom = "point", 
             pointsize = 3.5, palette = "jco")

fviz_mclust(gmm_model, "classification", geom = "point", 
            pointsize = 2, palette = "Dark2") +
  theme_minimal() +
  labs(title = "Latent Profiles of Student Growth",
       subtitle = "3-Cluster Solution using Gaussian Mixture Modeling")

Next, we color the points by mentoring category to see how those categories map onto the clusters.

# Create a subset of the original data that matches the rows used in the GMM
plot_metadata <- cohort4_cluster %>%
  drop_na(Gain_SelfEfficacy, Gain_Resilience) %>%
  mutate(GMM_Cluster = as.factor(gmm_model$classification))

ggplot(plot_metadata, aes(x = Gain_SelfEfficacy, y = Gain_Resilience)) +
  # Draw one data ellipse per GMM cluster
  # All ellipses share the same gray dashed style so that color is reserved for Mentoring_Category
  stat_ellipse(aes(group = GMM_Cluster), color = "gray60", linetype = "dashed") +
  
  # Color the points by Mentoring Category
  geom_point(aes(color = Mentoring_Category), size = 3, alpha = 0.8) +
  
  # Add labels or clean up the theme
  theme_minimal() +
  scale_color_brewer(palette = "Set1") + # High-contrast colors for your poster
  labs(
    title = "Mentoring Categories mapped onto Growth Profiles",
    subtitle = "Dashed ellipses represent GMM latent clusters",
    x = "Gain in Self-Efficacy",
    y = "Gain in Resilience",
    color = "Mentoring Category"
  )
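A cross-tabulation can back up what the plot shows visually. The sketch below uses a small invented data frame as a stand-in for plot_metadata; the cluster assignments, the "mentor" label, and all counts are hypothetical.

```r
# Hypothetical stand-in for plot_metadata; labels and counts are invented
toy <- data.frame(
  GMM_Cluster        = factor(c(1, 1, 2, 2, 3, 3, 1, 2)),
  Mentoring_Category = c("mentee", "mentor", "mentee", "mentee",
                         "mentor", "mentee", "mentee", "mentor")
)

# Counts of each mentoring category within each cluster
cluster_by_category <- table(Cluster  = toy$GMM_Cluster,
                             Category = toy$Mentoring_Category)
cluster_by_category

# Row proportions: the composition of each cluster
cluster_shares <- prop.table(cluster_by_category, margin = 1)
round(cluster_shares, 2)
```

With the real data, the same two calls would take plot_metadata$GMM_Cluster and plot_metadata$Mentoring_Category. With only 40 participants spread over three clusters, the cell counts will be small, so the proportions should be read descriptively.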