Cluster Analysis

Author

Cindy Trussell

Published

February 23, 2026

Tobit Regression and Cluster Analysis

For the first step in the analysis, I loaded the appropriate packages.

library(tidyverse)
library(plotly)
library(censReg)
library(mclust)
library(factoextra)

Bring in the Data for Analysis

I used the Cohort4_final_gains data from the original Summer 2025 data set. The code below refers to the analysis data frame as cohort4_cluster.

Tobit Regression for Ceiling Problems

First, we fit a Tobit regression (using the censReg package) so that the model accounts for scores stuck at the top of the scale. Tobit is specifically designed for “censored” data: cases where a student's true level may lie beyond the scale maximum, but the 6-point scale records it as 6.
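
As a sketch of what the two models below assume (the symbols are introduced here for illustration and do not appear in the original analysis), a right-censored Tobit model treats the observed score $y_i$ as a capped version of a latent score $y_i^{*}$:

$$y_i^{*} = \beta_0 + \beta_1\,\text{Pre}_i + \beta_2\,\text{Months}_i + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2)$$

$$y_i = \min(y_i^{*},\, c)$$

Here $c = 6$ for self-efficacy and $c = 5$ for resilience. Note that censReg also keeps its default lower limit of 0 unless told otherwise; no observation reaches it in either model (0 left-censored), so it should not affect these fits.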

# Tobit for Self-Efficacy (Ceiling = 6)
tobit_se <- censReg(Last_Post_SE ~ First_Pre_SE + Months_Participated, 
                    right = 6, 
                    data = cohort4_cluster)

summary(tobit_se)


Call:
censReg(formula = Last_Post_SE ~ First_Pre_SE + Months_Participated, 
    right = 6, data = cohort4_cluster)

Observations:
         Total  Left-censored     Uncensored Right-censored 
            40              0             38              2 

Coefficients:
                     Estimate Std. error t value  Pr(> t)    
(Intercept)          2.096104   0.668747   3.134  0.00172 ** 
First_Pre_SE         0.613233   0.118992   5.154 2.56e-07 ***
Months_Participated  0.001666   0.077346   0.022  0.98281    
logSigma            -1.026440   0.116090  -8.842  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Newton-Raphson maximisation, 6 iterations
Return code 1: gradient close to zero (gradtol)
Log-likelihood: -17.22924 on 4 Df

# Tobit for Resilience (Ceiling = 5)
tobit_res <- censReg(Last_Post_Res ~ First_Pre_Res + Months_Participated, 
                     right = 5, 
                     data = cohort4_cluster)

summary(tobit_res)


Call:
censReg(formula = Last_Post_Res ~ First_Pre_Res + Months_Participated, 
    right = 5, data = cohort4_cluster)

Observations:
         Total  Left-censored     Uncensored Right-censored 
            40              0             23             17 

Coefficients:
                    Estimate Std. error t value  Pr(> t)    
(Intercept)         -0.77628    1.13609  -0.683  0.49442    
First_Pre_Res        1.24013    0.24451   5.072 3.94e-07 ***
Months_Participated -0.01194    0.13711  -0.087  0.93061    
logSigma            -0.52610    0.16133  -3.261  0.00111 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Newton-Raphson maximisation, 7 iterations
Return code 1: gradient close to zero (gradtol)
Log-likelihood: -31.04967 on 4 Df

1. The Censoring Breakdown

Observations: Total 40, Right-censored 2

This confirms that, in the self-efficacy model, 2 of your 40 participants hit the ceiling of 6.0. An ordinary least-squares regression would treat those two scores as exactly 6.0, which tends to pull the estimated slopes toward zero because the scale gave those participants nowhere to go. The Tobit model recognizes that their “true” self-efficacy might actually be higher than 6.0 and accounts for that unobserved potential growth.
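
To see why a ceiling flattens an ordinary regression, here is a minimal simulation in base R. Everything in it is made up for illustration (the sample size, the coefficients, and the ceiling of 5 are assumptions, not values from the cohort data); it only shows the general direction of the bias.

```r
# Simulated illustration only: none of these numbers come from the study data.
set.seed(42)
n        <- 2000
pre      <- runif(n, min = 2, max = 5)              # hypothetical baseline scores
latent   <- -0.8 + 1.2 * pre + rnorm(n, sd = 0.6)   # hypothetical "true" post scores
observed <- pmin(latent, 5)                         # the scale records nothing above 5

# Ordinary least squares on the uncapped and on the capped outcome
slope_latent   <- coef(lm(latent ~ pre))[["pre"]]
slope_observed <- coef(lm(observed ~ pre))[["pre"]]

mean(observed == 5)                                 # share of simulated scores at the ceiling
c(latent = slope_latent, observed = slope_observed)
```

The slope fitted to the capped scores comes out smaller than the slope fitted to the uncapped ones, which is the distortion the Tobit model is meant to correct.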

2. The Predictors (Coefficients)

First_Pre_SE (Estimate: 0.61, $p < .001$):

This is highly significant. It tells us that for every 1-point increase in a student’s starting score, their final score is predicted to be 0.61 points higher, holding months of participation constant. Essentially, baseline performance is the strongest predictor of where they end up. This is very common in educational interventions.

Months_Participated (Estimate: 0.0017, $p = .98$):

This is your “dosage” effect. The estimate is very close to zero, and the p-value is nearly 1. This means we found no evidence that the length of participation was related to the final score, even when we account for the people stuck at the ceiling.

logSigma (-1.026):

This is the natural log of the residual standard deviation (the “noise”) in your model, so $\sigma = e^{-1.026} \approx 0.358$ points. Its p-value only tests whether logSigma differs from 0 (that is, whether $\sigma$ differs from 1), so the stars next to it say nothing about how well the model fits.

Tobit Regression for Resilience

2. The Predictors (Coefficients)

First_Pre_Res (Estimate: 1.24, $p < .001$):

This is fascinating. A coefficient over 1.0 suggests that participants who started higher tended to stay high or “pull away” even more. Usually, we expect a “regression to the mean” (where high starters grow less), but here, baseline resilience is a powerful (and positive) predictor of the final outcome. Two cautions apply: the coefficient describes the latent (uncensored) score rather than the observed score capped at 5.0, and with a standard error of 0.24 its approximate 95% interval (roughly 0.76 to 1.72) includes 1.0.

Months_Participated (Estimate: -0.01, $p = .93$):

Once again, we found no evidence that the length of time in the program is driving the change. Even after accounting for the 17 people at the ceiling, additional months were not associated with a statistically detectable shift in the underlying resilience score.

In a Tobit regression, logSigma isn’t just a nuisance variable; it’s the key to understanding the “spread” or uncertainty of your model’s predictions.

1. What is logSigma?

The logSigma value is the natural logarithm of the standard deviation ($\sigma$) of the error term (the residuals). Because a standard deviation must always be positive, the model estimates it in “log space” so that it can never drop below zero during estimation.

2. Converting it back to Reality

To make sense of it for your poster, you need to “un-log” it using the exponential function ($e^x$):

$$\sigma = e^{-0.52610} \approx 0.591$$

This tells us that the estimated standard deviation of the errors is 0.591. On a 5-point resilience scale, a “typical” deviation from the predicted line is about 0.6 points.
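
The same back-transformation can be done in R for both models. This is a small sketch that copies the logSigma estimates and standard errors from the summary() outputs rather than pulling them from the fitted objects; the interval uses a normal approximation on the log scale, which is an assumption and only rough with 40 participants.

```r
# logSigma estimates and standard errors copied from the two summary() outputs
log_sigma    <- c(self_efficacy = -1.026440, resilience = -0.52610)
se_log_sigma <- c(self_efficacy =  0.116090, resilience =  0.16133)

# Back-transform the point estimates to the standard-deviation scale
sigma <- exp(log_sigma)
round(sigma, 3)   # self_efficacy 0.358, resilience 0.591

# Approximate 95% intervals: build them on the log scale, then exponentiate
z <- qnorm(0.975)
round(rbind(lower = exp(log_sigma - z * se_log_sigma),
            upper = exp(log_sigma + z * se_log_sigma)), 3)
```

Exponentiating the interval endpoints, rather than adding and subtracting on the original scale, keeps the lower bound positive.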

3. Evaluating the Significance ($p = 0.00111$)

The p-value next to logSigma tests whether logSigma equals 0, which is the same as asking whether $\sigma$ equals 1. A significant result only tells you that $\sigma$ differs from 1; it is not evidence that the model fits well. What the estimate does give you:
- A measure of spread: $\sigma \approx 0.591$ summarizes how far latent scores typically fall from the model's predictions.
- An ingredient of the Tobit math: the model combines $\sigma$ with the observed scores of the 23 uncensored participants and with the probability of exceeding 5.0 for the 17 people at the ceiling.

4. Why this matters for your Resilience data

Remember that you have 17 right-censored observations.

The Tobit model uses $\sigma$ (derived from your logSigma) to guess how far above 5.0 those 17 people would have scored if the scale didn’t stop.

A $\sigma$ of 0.591 suggests that the model thinks those censored participants aren’t just exactly 5.0; it’s estimating that their “latent” (true) resilience scores are likely distributed in a range, with some potentially being as high as 5.6 or 6.0 if the scale allowed it.
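
How far above 5.0 the model places a censored participant follows from the mean of a truncated normal distribution: $E[y^{*} \mid y^{*} > c] = \mu + \sigma\,\phi(\alpha) / (1 - \Phi(\alpha))$ with $\alpha = (c - \mu)/\sigma$. The sketch below applies that formula in base R. The value of `mu` is a hypothetical linear predictor chosen for illustration, not a fitted value from the data.

```r
sigma <- exp(-0.52610)   # residual SD from logSigma in the resilience model
upper <- 5               # scale maximum used as the censoring point
mu    <- 4.8             # HYPOTHETICAL predicted latent score, for illustration only

# Standardized distance from the prediction to the ceiling
alpha <- (upper - mu) / sigma

# Expected latent score given that the true score lies above the ceiling
expected_latent <- mu + sigma * dnorm(alpha) / pnorm(alpha, lower.tail = FALSE)
expected_latent
```

Because the expectation is conditional on being above the ceiling, the result is always greater than 5.0; how much greater depends on both the prediction and $\sigma$.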

While our initial analysis suggested no growth in resilience, the Tobit output shows a substantial ceiling effect, with 42.5% of participants (17 of 40) reaching the maximum score of 5.0. Even after accounting for this censoring, the length of participation remained a non-significant predictor. This suggests that participants may have entered the program with high “pre-existing” resilience, or that any impact of the intervention on resilience does not depend on time spent in the program.

| Metric | Total N | Ceiling Count | Baseline Predictor (β) | Dosage Predictor (β) | Model Significance |
|---|---|---|---|---|---|
| Self-Efficacy | 40 | 2 (5%) | $0.61$*** | $0.002$ (ns) | Strong |
| Resilience | 40 | 17 (42.5%) | $1.24$*** | $-0.01$ (ns) | Very Strong |

Note: *** $p < .001$; ns = non-significant.
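
The baseline coefficients can be paired with approximate 95% Wald intervals. The sketch below copies the estimates and standard errors from the summary() outputs; the normal approximation it relies on is an assumption and is only rough with 40 participants.

```r
# Estimates and standard errors copied from the summary() outputs
est <- c(First_Pre_SE = 0.613233, First_Pre_Res = 1.24013)
se  <- c(First_Pre_SE = 0.118992, First_Pre_Res = 0.24451)

# Wald interval: estimate plus or minus z times the standard error
z  <- qnorm(0.975)   # about 1.96
ci <- cbind(estimate = est, lower = est - z * se, upper = est + z * se)
round(ci, 2)
```

Comparing each interval with 1.0 is a quick way to check whether a slope is distinguishable from "final score moves one-for-one with starting score".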

Visualizing the ceiling effect for resilience

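# Resilience: first vs. last score with the fitted Tobit line.
# The line is drawn at Months_Participated = 0 (that coefficient is about -0.01).
ggplot(cohort4_cluster, aes(x = First_Pre_Res, y = Last_Post_Res)) +
  geom_jitter(aes(color = (Last_Post_Res >= 5)), width = 0.1, height = 0.1) +
  # Take slope and intercept from the model instead of hard-coding rounded values;
  # linewidth replaces the deprecated size argument (ggplot2 >= 3.4.0)
  geom_abline(slope = coef(tobit_res)[["First_Pre_Res"]],
              intercept = coef(tobit_res)[["(Intercept)"]],
              color = "gray", linewidth = 0.5) +
  # CEILING COLOR: name the values so FALSE/TRUE cannot be swapped
  scale_color_manual(values = c(`FALSE` = "black", `TRUE` = "red"),
                     labels = c("Uncensored", "At Ceiling (5.0)")) +
  theme_minimal(base_size = 14) +
  # AXIS LABELS
  labs(title = "Tobit Regression: Resilience Ceiling Effect",
       subtitle = "Red points are 'censored' at the 5.0 maximum",
       x = "First Resilience Score",
       y = "Last Resilience Score",
       color = "Status")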

# Save the most recent plot as an SVG (device = "svg" needs the svglite package installed)
ggsave(
  filename = "Tobit_Resilience.svg", 
  device = "svg",
  width = 8,      # Width in inches
  height = 6,     # Height in inches
  units = "in"
)